Advanced remote sensing techniques for earth observation now generate ever more high-resolution image data from satellites, airplanes, and unmanned aerial vehicles. As image resolution improves and finer ground details become clearly visible, automatic and accurate interpretation of these abundant data sources is urgently needed. In this context, the classification of high-resolution remote sensing images has shifted from pixel- and object-level classification to scene-level semantic classification [1, 2].
Scene classification of remote sensing images aims to categorize given scene images into a set of semantically meaningful classes predefined according to human interpretation. Here, a scene image refers to a local patch extracted from a large remote sensing image that contains specific semantic information, e.g., an airport, park, or residential area. Scene classification plays a significant role in a wide range of applications, e.g., urban planning and geographic mapping, and has thus become a hot research topic in the remote sensing community.
Since scene images usually cover multiple land-cover types or ground objects that vary in scale, shape, orientation, and spatial distribution, scene images from different categories may share very similar content, while images from the same category often exhibit high diversity in appearance. Such inter-class similarity and intra-class diversity make scene classification a challenging task. Therefore, building robust and discriminative feature representations that describe the semantic content of scenes is the core component of scene classification.
Over the past years, there has been growing interest in developing various methods for scene classification in remote sensing imagery [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. From the pioneering work introducing the bag-of-visual-words (BoVW) model [1, 3] to current state-of-the-art methods that employ high-capacity deep convolutional neural networks (CNNs) [13, 14, 15], the performance of scene classification has improved dramatically. However, despite the impressive results achieved with modern CNNs, accuracies on existing public datasets have almost reached saturation because of the limitations of these datasets. Recent studies on scene classification tend to blindly pursue accuracy improvements rather than provide enlightening ways to address the problems that remain in the scene classification task. Hence, in this paper, we discuss some heuristic open directions with the aim of expanding the research scope of remote sensing image scene classification. We hope these potential directions will attract extensive attention not only from the remote sensing community but also from the computer vision community.
2 A Brief Review of Scene Classification Methods
The BoVW model represents an image as a histogram of visual words constructed by vector-quantizing hand-crafted low-level features (e.g., structural, textural, or color features) with a clustering method. Owing to its simplicity and good performance, the BoVW model has also received considerable attention in scene classification of remote sensing images. Motivated by the special spatial relationships among objects in scene images, several researchers have proposed improved variants of BoVW that incorporate spatial information. However, the performance of BoVW and its variants is severely limited by the poor descriptive ability of hand-crafted features. To remedy this problem, several researchers resort to unsupervised feature learning (UFL) methods [2, 7, 16], which generate encoded features based on a codebook (also known as a dictionary) learned from a large amount of unlabeled data. UFL methods can automatically learn discriminative features that adapt better to remote sensing scene images, instead of relying on the extraction of hand-crafted features.
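As a concrete illustration of this pipeline, the following numpy-only sketch learns a codebook with plain k-means and quantizes an image's local descriptors into a BoVW histogram. The random arrays are stand-ins for real hand-crafted descriptors (e.g., SIFT-like features), and all names and sizes here are illustrative, not taken from any cited method.

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Learn k visual words by plain k-means over local descriptors."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center, then recompute means
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bovw_histogram(descriptors, codebook):
    """Quantize one image's descriptors and build a normalized word histogram."""
    d = np.linalg.norm(descriptors[:, None] - codebook[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# toy data standing in for hand-crafted local features (8-D descriptors)
rng = np.random.default_rng(1)
train_desc = rng.normal(size=(500, 8))   # descriptors pooled over training images
codebook = build_codebook(train_desc, k=16)
img_desc = rng.normal(size=(60, 8))      # descriptors from one scene image
h = bovw_histogram(img_desc, codebook)   # the image's BoVW representation
```

The resulting histogram `h` is the fixed-length vector that a classifier (e.g., an SVM) would consume; the spatial BoVW variants mentioned above additionally pool such histograms over image subregions.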
More recently, deep CNNs have enjoyed great success on various challenging visual recognition benchmarks, substantially improving state-of-the-art performance, owing to large-scale image datasets and the efficient use of high-performance GPUs [13, 14, 15, 17]. However, for a new visual recognition task with a limited amount of training data, training a deep CNN that usually contains millions of parameters is infeasible. Many studies have discovered that intermediate features extracted from a deep CNN pretrained on a sufficiently large dataset, such as ImageNet, can serve as a generic image representation for many applications, e.g., scene classification, object detection, and image retrieval. All these works focus on obtaining strong image representations by transferring a pretrained CNN to their tasks. For scene classification of remote sensing images, a few works [8, 18] fine-tune pretrained CNNs on the target scene datasets and achieve promising performance; some works directly take the CNN activations from the fully connected layers as image representations; others [7, 10, 11, 12] build discriminative representations by encoding CNN activations from the convolutional layers in feature coding schemes, where the convolutional feature maps are viewed as a 2-D array of local features, obtaining state-of-the-art performance on existing scene datasets. In general, the remarkable performance of methods exploiting CNNs undoubtedly demonstrates the strong generalization capability of pretrained CNN models.
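The "conv feature maps as local features" view can be made concrete with a small sketch. VLAD is one common encoder for this purpose (our choice for illustration; the text does not fix a particular coding scheme): a (C, H, W) feature map is reshaped into H·W local C-dimensional descriptors whose residuals to a codebook are accumulated. The arrays below are random stand-ins for real conv activations and a learned codebook.

```python
import numpy as np

def conv_map_to_descriptors(fmap):
    """View a (C, H, W) conv feature map as H*W local C-dim descriptors."""
    c, h, w = fmap.shape
    return fmap.reshape(c, h * w).T            # shape (H*W, C)

def vlad_encode(descriptors, codebook):
    """VLAD-style encoding: accumulate residuals to each nearest codeword."""
    d = np.linalg.norm(descriptors[:, None] - codebook[None], axis=2)
    nearest = d.argmin(axis=1)
    k, dim = codebook.shape
    enc = np.zeros((k, dim))
    for j in range(k):
        assigned = descriptors[nearest == j]
        if len(assigned):
            enc[j] = (assigned - codebook[j]).sum(axis=0)
    enc = enc.ravel()
    enc = np.sign(enc) * np.sqrt(np.abs(enc))  # power normalization
    norm = np.linalg.norm(enc)
    return enc / norm if norm > 0 else enc

rng = np.random.default_rng(0)
fmap = rng.normal(size=(64, 7, 7))     # stand-in for last-conv-layer activations
codebook = rng.normal(size=(8, 64))    # stand-in for a codebook learned offline
desc = conv_map_to_descriptors(fmap)   # 49 local descriptors of dimension 64
rep = vlad_encode(desc, codebook)      # 8 * 64 = 512-D holistic representation
```

The final L2-normalized vector `rep` plays the same role as a fully-connected-layer activation: a fixed-length holistic representation fed to a downstream classifier.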
To date, the performance of CNN-based methods has gradually reached saturation because of serious limitations of existing datasets, such as the small number of labeled training samples and their low coverage and diversity. Hence, it is necessary to discuss further directions for the scene classification task so as to promote the development of scene understanding in remote sensing imagery.
3 Discussion on the Open Directions of Scene Classification
In this section, we discuss several potential directions for scene classification in remote sensing imagery, inspired by the limitations of current scene classification methods and existing datasets. Here, we present three examples.
3.1 Building larger-scale datasets for scene classification
Deep CNNs, the most prevalent deep learning models, have demonstrated breakthrough accuracies and are now the dominant approach for almost all classification tasks. The availability of large-scale, well-annotated datasets is one of the critical factors in the success of deep CNNs, because a considerable number of training samples with wide coverage and great diversity helps avoid overfitting and strengthens the generalization ability of CNN models. However, existing datasets exhibit notable limitations, such as small numbers of image samples and scene categories and low diversity within categories, making it impracticable to fully train a deep CNN model from scratch. Because of these limitations, a majority of state-of-the-art methods resort to transferring deep CNNs that have been successfully pretrained on a large-scale natural image dataset (e.g., ImageNet) to scene classification of remote sensing images, where the pretrained CNNs are either used as fixed feature extractors or fine-tuned (adjusting the parameters of a certain number of layers) on the target dataset. These transfer strategies perform fairly well on existing scene datasets with limited samples, but they are not the optimal choice compared with training a deep CNN from scratch, since a fully trained deep network can learn more specific features that adapt perfectly to the target dataset when the dataset is large enough. In other words, building large-scale datasets for scene classification is highly desirable: it can unlock the full potential of deep CNNs on the one hand, and accelerate the widespread use of new-generation deep learning models on the other.
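The mechanics of the "fixed feature extractor" strategy can be reduced to a toy numpy experiment, shown below purely as a schematic illustration (not any specific method from the cited works): a frozen "pretrained" layer `W1` supplies features, while only a new task-specific head `W2` is trained by gradient descent. Fine-tuning would additionally update `W1`, typically with a small learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))        # stand-in for transferred (frozen) CNN weights
W2 = rng.normal(size=(8, 3)) * 0.1   # new classifier head for the target task

X = rng.normal(size=(100, 16))       # toy target-dataset samples
y = rng.integers(0, 3, size=100)
Y = np.eye(3)[y]                     # one-hot labels for 3 scene classes

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

W1_before = W1.copy()
losses = []
for _ in range(300):
    H = np.maximum(X @ W1, 0.0)            # frozen "pretrained" features (ReLU)
    P = softmax(H @ W2)
    losses.append(-np.mean(np.sum(Y * np.log(P + 1e-12), axis=1)))
    grad_W2 = H.T @ (P - Y) / len(X)       # gradient stops here: W1 stays frozen
    W2 -= 0.01 * grad_W2
```

Only `W2` changes during training; the cross-entropy loss decreases while `W1` is bit-for-bit untouched, which is exactly why this strategy is cheap on small datasets but cannot adapt the transferred features themselves.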
Very recently, tremendous efforts have been made to build large-scale benchmark datasets for scene classification, such as AID, NWPU-RESISC45, and RSI-CB. Compared with previously prevalent datasets, these new datasets are clearly superior in both the total number and the diversity of image samples. But so far, even the largest of them is far too small to effectively train a deep CNN: a common deep CNN architecture with millions of parameters will dramatically overfit tens of thousands of training samples. Therefore, a potential direction is to further enlarge the scale of current datasets to hundreds of thousands or even multiple millions of samples.
Although the rapid development of advanced remote sensing techniques gives us convenient access to large amounts of remote sensing imagery, collecting image samples of good quality (here, "good quality" means the true category of each sample is pure and easily distinguished) is a very expensive and time-consuming task. It requires costly manual annotation by researchers with expert knowledge of remote sensing image interpretation. Without any field trips or surveys, expertise and research experience are essential to ensure accurate labeling.
Considering that well annotating a high-coverage, high-diversity dataset is challenging (it is time-consuming and calls for expert domain knowledge), labeling scene images using geographic crowdsourced data, e.g., OpenStreetMap, is a feasible alternative, since crowdsourced data labeled by the public are massive in quantity, highly up to date, and acquired at low cost. We expect that more researchers will focus on establishing larger-scale, high-quality datasets for scene classification on the basis of crowdsourced data. Such future large-scale datasets will in turn encourage the community to develop new deep models specially suited to remote sensing scenes.
3.2 Better describing the content of scenes
The well-defined scene classification task labels scene images with specific semantic categories. In other words, the terms of the semantic categories (e.g., commercial area and residential area) summarize the semantic content of scenes at a high level of abstraction. Such a static classification task cannot fully describe the attributes of ground objects and their relationships within scenes. Thus, in order to reveal more detailed content in scene images, a promising step is to move from scene classification to scene captioning, which aims to describe the content of scenes at a finer semantic level using meaningful sentences. A scene image is thereby depicted by dynamic sentences summarizing its main content instead of a single static category term.
The methods developed for natural image captioning may not adapt well to scene captioning in remote sensing images, owing to ambiguous semantic information (objects vary in scale, orientation, and geometric structure) and the complicated spatial distribution of objects. To pursue the direction of scene captioning, a well-annotated scene caption dataset is also necessary. Researchers have presented a few exemplary works on remote sensing image captioning [23, 24] and have constructed a large-scale dataset under specific annotation instructions that take the characteristics of remote sensing images into account, e.g., avoiding words that convey the concepts of "direction" and "vagueness". We believe that scene captioning offers a new opportunity to generate better descriptions of scenes in remote sensing images and will attract increasing attention from the remote sensing community.
Besides, online geo-tagged pictures collected from social websites and location-based geographic resources (e.g., the Global Positioning System) hold great potential for recognizing remote sensing scenes, since they can supply various complementary information, e.g., fine details of all kinds of ground objects at extremely high resolution, ground-level views rather than overhead views, and knowledge of where ground objects are located and what happens there over time. This kind of crowdsourced information can be used to train new models together with remote sensing datasets, and will thus be valuable for better understanding the content of remote sensing scenes.
3.3 Dealing with images from different domains
As described above, current studies have proven that CNNs are valuable tools for constructing "black box" architectures that accurately interpret remote sensing image scenes. In particular, successfully pretrained CNNs generalize well to remote sensing scene datasets and have been reported to deliver impressive classification performance when used to build discriminative holistic representations from intermediate layers or when fine-tuned on target datasets. However, these strategies for exploiting pretrained CNNs may not be effective enough for images acquired under different conditions (e.g., when the labeled training images and target images are obtained from different sensors that vary greatly in spatial and spectral resolution). In fact, remote sensing images are inevitably affected by various natural and human factors, such as the sensor, camera perspective, season and weather conditions, geographic location, and so on. Thus, simple transfer strategies for pretrained CNNs are likely to achieve unsatisfactory results when the source dataset differs greatly from the target dataset (a situation also called data shift). This problem can be better addressed by more sophisticated approaches based on domain adaptation.
To strengthen the generalization power of these CNN models when dealing with data shift, a potential opportunity lies in designing improved approaches based on domain adaptation, wherein the feature representations from the target and source domains are properly mapped into a common space while preserving the original structure of the data. It is feasible to develop additional adaptation layers (or even well-designed network modules) on top of the output of a pretrained CNN, and to optimize the loss function with auxiliary regularization terms imposed to reduce the mismatch between the target and source data distributions.
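One widely used instance of such a regularization term (our choice for illustration; the text does not commit to a particular one) is the maximum mean discrepancy (MMD), which measures the distance between the source and target feature distributions in a kernel space. A minimal numpy sketch, with random feature batches standing in for CNN activations from the two domains:

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def mmd2(source, target, gamma):
    """Squared maximum mean discrepancy between two feature batches."""
    return (rbf_kernel(source, source, gamma).mean()
            + rbf_kernel(target, target, gamma).mean()
            - 2.0 * rbf_kernel(source, target, gamma).mean())

rng = np.random.default_rng(0)
src = rng.normal(size=(200, 16))              # source-domain scene features
tgt_same = rng.normal(size=(200, 16))         # target features, no data shift
tgt_shift = rng.normal(size=(200, 16)) + 2.0  # target features under data shift

g = 1.0 / src.shape[1]                        # simple bandwidth heuristic
# the shifted target batch yields a much larger discrepancy, which an
# adaptation layer could be trained to minimize alongside the task loss
assert mmd2(src, tgt_shift, g) > mmd2(src, tgt_same, g)
```

In a domain-adaptation network, a term like `mmd2` between source and target activations of the adaptation layer would simply be added to the classification loss, so that minimizing the total loss pulls the two feature distributions together.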
4 Conclusion

The development of scene classification methods for remote sensing images has hit a bottleneck amid the revolutionary changes brought by deep models. In this paper, we have examined the existing issues in scene classification and put forward several open directions to broaden the scope of the scene classification task. We believe these potential opportunities will receive increasing interest from both the remote sensing and computer vision communities, and will thereby encourage them to develop new models.
References
-  Yi Yang and Shawn Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2010, pp. 270–279.
-  Anil M Cheriyadat, “Unsupervised feature learning for aerial scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 1, pp. 439–451, 2014.
-  Li-Jun Zhao, Ping Tang, and Lian-Zhi Huo, “Land-use scene classification using a concentric circle-structured multiscale bag-of-visual-words model,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 12, pp. 4620–4631, 2014.
-  Shizhi Chen and YingLi Tian, “Pyramid of spatial relatons for scene-level land use classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 1947–1957, April 2015.
-  Fan Hu, Gui-Song Xia, Zifeng Wang, Xin Huang, Liangpei Zhang, and Hong Sun, “Unsupervised feature learning via spectral clustering of multidimensional patches for remotely sensed scene classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 5, pp. 2015–2030, May 2015.
-  Fan Zhang, Bo Du, and Liangpei Zhang, “Scene classification via a gradient boosting random convolutional network framework,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 3, pp. 1793–1802, 2016.
-  Fan Hu, Gui-Song Xia, Jingwen Hu, and Liangpei Zhang, “Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery,” Remote Sensing, vol. 7, no. 11, pp. 14680–14707, 2015.
-  Keiller Nogueira, Otávio AB Penatti, and Jefersson A dos Santos, “Towards better exploiting convolutional neural networks for remote sensing scene classification,” Pattern Recognition, vol. 61, pp. 539–556, 2017.
-  O.A.B. Penatti, K. Nogueira, and J. A. dos Santos, “Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 44–51.
-  Souleyman Chaib, Huan Liu, Yanfeng Gu, and Hongxun Yao, “Deep feature fusion for vhr remote sensing scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 8, pp. 4775–4784, 2017.
-  Guoli Wang, Bin Fan, Shiming Xiang, and Chunhong Pan, “Aggregating rich hierarchical features for scene classification in remote sensing imagery,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 9, pp. 4104–4115, 2017.
-  Erzhu Li, Junshi Xia, Peijun Du, Cong Lin, and Alim Samat, “Integrating multilayer features of convolutional neural networks for remote sensing scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 10, pp. 5653–5665, 2017.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  Fan Zhang, Bo Du, and Liangpei Zhang, “Saliency-guided unsupervised feature learning for scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2175–2184, 2015.
-  Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  Marco Castelluccio, Giovanni Poggi, Carlo Sansone, and Luisa Verdoliva, “Land use classification in remote sensing images by convolutional neural networks,” arXiv preprint arXiv:1508.00092, 2015.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu, “Aid: A benchmark data set for performance evaluation of aerial scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3965–3981, 2017.
-  Gong Cheng, Junwei Han, and Xiaoqiang Lu, “Remote sensing image scene classification: Benchmark and state of the art,” Proceedings of the IEEE, vol. 105, no. 10, pp. 1865–1883, 2017.
-  Haifeng Li, Chao Tao, Zhixiang Wu, Jie Chen, Jianya Gong, and Min Deng, “Rsi-cb: A large scale remote sensing image classification benchmark via crowdsource data,” arXiv preprint arXiv:1705.10450, 2017.
-  Xiaoqiang Lu, Binqiang Wang, Xiangtao Zheng, and Xuelong Li, “Exploring models and data for remote sensing image caption generation,” arXiv preprint arXiv:1712.07835, 2017.
-  Zhenwei Shi and Zhengxia Zou, “Can a machine generate humanlike language descriptions for a remote sensing image?,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 6, pp. 3623–3634, 2017.
-  Esam Othman, Yakoub Bazi, Farid Melgani, Haikel Alhichri, Naif Alajlan, and Mansour Zuair, “Domain adaptation network for cross-scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 8, pp. 4441–4456, 2017.