Land cover classification plays a crucial role in applications such as land resource management, urban planning and environmental protection. With the development of remote sensing technology, imaging satellites can supply remote sensing data that cover most of the Earth's surface, which offers new opportunities for land cover classification while also bringing great challenges. In images captured over different areas and at different times, the photographic distortion, viewing angle, scale, illumination and observed content vary considerably, which makes it difficult to find a land cover classification method that works efficiently across different remote sensing images.
The land cover classification task has been widely investigated. Early studies interpret the image information according to the spectrum of each individual pixel. However, such methods are easily affected by intra-class spectral variability and spectral noise. More recent methods implement spectral-spatial classification, which integrates regional spatial information to boost classification performance. Spectral-spatial features, such as texture, shape and structure, can represent more characteristics of the images than pure spectral or spectral derivative features. However, neither spectral nor spectral-spatial features are sufficiently invariant to the complex changes that emerge across various remote sensing (RS) data. When processing new images, new labeled data and algorithm adjustment are generally necessary, which reduces efficiency in practical applications.
In recent years, as deep learning has achieved breakthroughs in the field of visual recognition, a new path has opened for image categorization in the remote sensing community. Convolutional Neural Networks (CNN) are the most representative deep learning models; they are built as deep hierarchical architectures and are capable of extracting the intrinsic features of data [5, 6]. Researchers have utilized deep features to improve the performance of land cover classification [7, 8]. Nevertheless, the recognition capacity of CNN models depends greatly on the size of the annotated training data. The existing high-resolution land cover datasets either cover limited geographic areas with insufficient samples or cover homogeneous areas with low intra-class variability. These limitations tend to restrain the generalization ability of deep models.
In this paper, in order to efficiently classify remote sensing images captured under different conditions, we propose a novel method based on the combination of deep learning, hierarchical segmentation and multi-scale information fusion. We establish a large-scale land cover dataset of 150 Gaofen-2 (GF-2) images to support our approach. This dataset has high intra-class diversity and low inter-class differences, and hence it can also serve as a data resource to advance the state of the art in the land cover classification task. In our experiments, the proposed method achieves higher classification performance than traditional methods.
2 Dataset Description
2.1 Gaofen-2 Satellite Imagery
The GF-2 satellite is equipped with two Panchromatic and Multispectral CCD Camera Sensors (PMS), which have a resolution of 1 m (panchromatic) and 4 m (multispectral). It can provide a combined swath of 45 km, which is embodied as a size of pixels in multispectral imagery. The revisit period of GF-2 is 5 days, hence it is able to capture detailed information over a wide area at short intervals. Considering its combination of high resolution, wide imaging coverage, frequent revisit and high image quality, GF-2 imagery is an ideal data source for land cover classification.
2.2 Study Area
We annotate 150 GF-2 satellite images to construct a large-scale land cover dataset, which is named the Gaofen Image Dataset (GID). GID is widely distributed over geographic areas covering more than 70,000 km². Benefiting from the various acquisition locations and times, GID presents rich diversity in spectral response and morphological structure. Five representative land cover categories of practical value are selected for annotation: built-up, farmland, forest, meadow, and waters. Areas that do not belong to the above five categories or cannot be identified by a human annotator are labeled as unknown, which is represented in black. Fig. 1 shows some samples of GID and their corresponding label masks.
In this section, we describe the proposed land cover classification algorithm in detail. Our method combines deep learning, hierarchical segmentation and multi-scale information fusion. Specifically, we first use convolutional neural networks (CNN) to classify GF-2 images in the form of image patches, and simultaneously use a segmentation method to obtain a series of homogeneous objects. Then we fuse the classification and segmentation maps with a voting strategy, which is illustrated in Fig. 2. Finally, multi-scale spatial information is collected to augment the context information and further improve the classification performance.
3.1 Patch-based CNN Classification
Residual networks (ResNet) have many advantages over previous CNN models. They alleviate the vanishing gradient problem and are easier to train when the network architecture is very deep. We re-train the ResNet-50 model with GID to conduct patch-based classification. It has a total of 49 convolutional layers and comprises 16 bottleneck structures. Each of these bottleneck structures has three convolutional layers, which first decrease and then increase the dimension of the feature maps to control the number of parameters.
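The layer count quoted above can be checked with simple arithmetic. The sketch below assumes the standard ResNet-50 stage layout of 3, 4, 6 and 3 bottleneck blocks, which comes from the original ResNet design and is not stated in this text. It also assumes that the shortcut projection convolutions are not counted.

```python
# Bottleneck blocks per stage in the standard ResNet-50 layout.
# Assumption: taken from the original ResNet design, not from this paper.
BLOCKS_PER_STAGE = [3, 4, 6, 3]
CONVS_PER_BOTTLENECK = 3  # 1x1 reduce, 3x3, 1x1 expand
STEM_CONVS = 1            # the initial convolution before the first stage


def count_resnet50_layers():
    """Return (number of bottlenecks, number of convolutional layers)."""
    bottlenecks = sum(BLOCKS_PER_STAGE)
    conv_layers = STEM_CONVS + bottlenecks * CONVS_PER_BOTTLENECK
    return bottlenecks, conv_layers


bottlenecks, conv_layers = count_resnet50_layers()
print(bottlenecks, conv_layers)  # 16 49
```

Adding the final fully connected layer to the 49 convolutional layers gives the 50 layers in the model name.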
When fine-tuning, we cut the original GF-2 images into compact square patches, and then randomly select patches as training samples. We remove the 1000-dimensional softmax layer of ResNet-50 and replace it with a 5-dimensional softmax layer initialized from a Gaussian distribution, where 5 is the number of land cover categories in GID. We fine-tune only the last three bottleneck structures of ResNet-50 and the last softmax layer. The hyper-parameters are set as follows: the batch size is 32, the number of epochs is 15, the momentum is 0.9, and the initial learning rate is 0.1. During training, when the error rate stops decreasing, we divide the learning rate by 10 and use this new value to update the parameters. In the experiment, the learning rate was reduced three times before the model converged.
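The learning-rate rule described above can be sketched as follows. This is a minimal illustration, not the training code used in the paper: the function name `plateau_schedule`, the `patience` parameter and the criterion "no improvement over the best error so far" are assumptions made for the example, and the error values are made up.

```python
def plateau_schedule(error_rates, initial_lr=0.1, factor=10.0, patience=1):
    """Return the learning rate in effect at each epoch.

    The rate is divided by `factor` once the error rate has failed to
    improve on the best value so far for `patience` consecutive epochs.
    (Assumed stopping criterion; the paper does not specify one.)
    """
    lr = initial_lr
    best = float("inf")
    stale = 0
    history = []
    for err in error_rates:
        history.append(lr)
        if err < best:
            best = err
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr /= factor
                stale = 0
    return history


# Made-up validation errors for seven epochs.
errors = [0.40, 0.30, 0.30, 0.25, 0.25, 0.24, 0.24]
rates = plateau_schedule(errors)
print(rates)
```

With these illustrative errors the rate stays at the initial value for three epochs and then drops by a factor of 10 each time the error stalls.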
During classification, we cut the testing images into square patches of the same size as the training samples, and then obtain their category probability distribution from the softmax layer of the re-trained ResNet-50. The entire GF-2 image is thus classified in the form of compact patches.
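A minimal sketch of this patch-wise classification is given below. It assumes non-overlapping tiles and simply skips border pixels that do not fill a whole patch; the paper does not say how borders are handled. `classify_patch` is a hypothetical stand-in for the re-trained network, and the tiny image and patch size are illustrative only.

```python
def tile_origins(height, width, patch):
    """Top-left corners of non-overlapping patch x patch tiles."""
    return [(r, c)
            for r in range(0, height - patch + 1, patch)
            for c in range(0, width - patch + 1, patch)]


def classify_by_patches(image, patch, classify_patch):
    """Assign every pixel the label predicted for the tile containing it.

    `image` is a list of rows; `classify_patch` maps a tile to a label.
    Pixels outside any complete tile keep the label None.
    """
    height, width = len(image), len(image[0])
    label_map = [[None] * width for _ in range(height)]
    for r, c in tile_origins(height, width, patch):
        tile = [row[c:c + patch] for row in image[r:r + patch]]
        label = classify_patch(tile)
        for rr in range(r, r + patch):
            for cc in range(c, c + patch):
                label_map[rr][cc] = label
    return label_map


def bright_or_dark(tile):
    """Toy classifier: label 1 if the tile mean exceeds 0.5, else 0."""
    values = [v for row in tile for v in row]
    return 1 if sum(values) / len(values) > 0.5 else 0


image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
labels = classify_by_patches(image, 2, bright_or_dark)
print(labels)
```

Because every pixel in a tile receives the same label, the resulting map has blocky boundaries.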
3.2 Segmentation and Voting
Due to the input limitation of the CNN's structure, we classify GF-2 images on the basis of image patches, which completely loses the boundary information of ground objects. To address this issue, we utilize selective search segmentation as post-processing. Selective search segmentation employs a graph-based approach to obtain a variety of initial regions in different color spaces, and then iteratively merges the small regions into bigger ones with a greedy algorithm. Different consolidation strategies control the level of merging, so this method can accurately extract the object information in remote sensing images.
After obtaining the results of classification and segmentation, we use a voting strategy to combine category and boundary information. For every segmented region, the number of pixels belonging to each category is counted. Each pixel in the same segmented region votes for its corresponding class, and the entire region is labeled with the category that receives the most votes. After voting has been completed for every segmented region in the whole GF-2 image, we obtain the final classification map.
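The voting step can be sketched as follows, assuming the classification result and the segmentation result are given as two maps of equal size, one holding a class label per pixel and the other a region id per pixel. The function name `vote_by_region` and the toy maps are illustrative. Tie-breaking is not specified in the text; here a tie goes to the class encountered first within the region.

```python
from collections import Counter, defaultdict


def vote_by_region(class_map, segment_map):
    """Relabel each segmented region with its majority class."""
    votes = defaultdict(Counter)
    # Each pixel casts one vote for its class within its region.
    for class_row, seg_row in zip(class_map, segment_map):
        for label, seg_id in zip(class_row, seg_row):
            votes[seg_id][label] += 1
    # The most frequent class wins the whole region.
    winner = {seg_id: counts.most_common(1)[0][0]
              for seg_id, counts in votes.items()}
    return [[winner[seg_id] for seg_id in seg_row]
            for seg_row in segment_map]


class_map = [[0, 0, 1],
             [0, 1, 1],
             [2, 1, 1]]
segment_map = [[0, 0, 1],
               [0, 0, 1],
               [0, 1, 1]]
fused = vote_by_region(class_map, segment_map)
print(fused)  # [[0, 0, 1], [0, 0, 1], [0, 1, 1]]
```

In the example, region 0 collects three votes for class 0 and one each for classes 1 and 2, so its stray pixels are relabeled as class 0 and the output follows the region boundary.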
3.3 Multi-scale Classification
Although CNN models have a certain invariance to rotation, translation, and illumination change, their fixed input size and receptive fields limit their observable space. When identifying the category of an area in a remote sensing image, not only the local information but also the spatial context information is crucial. Therefore, we fuse multi-scale spatial information extracted from the same locations to acquire the context information.
In consideration of the computational efficiency and the parameter size of the deep model, we train a single model with multi-scale patches. We randomly select a certain number of sampling points from the GF-2 images for each category and treat these points as square centers to cut image patches at different scales. We warp the patches to a uniform size and use them to fine-tune the ResNet-50 model. When testing, we sum the classification probability vectors of the different scales and then use the summed probability vector to identify the land cover category.
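The two multi-scale steps, cutting concentric patches around a sampling point and summing the per-scale probability vectors, can be sketched as below. The patch sizes and probability values are hypothetical and chosen only for illustration. Handling of boxes that extend past the image border is not covered in the text and is left out here.

```python
def crop_boxes(center_row, center_col, sizes):
    """Square boxes (top, left, bottom, right) centered on one point.

    `sizes` holds hypothetical patch side lengths in pixels.
    """
    boxes = []
    for size in sizes:
        half = size // 2
        top, left = center_row - half, center_col - half
        boxes.append((top, left, top + size, left + size))
    return boxes


def fuse_scales(prob_vectors):
    """Sum per-scale probability vectors and pick the top class."""
    summed = [sum(values) for values in zip(*prob_vectors)]
    best = max(range(len(summed)), key=summed.__getitem__)
    return best, summed


boxes = crop_boxes(100, 100, [32, 64])
print(boxes)  # [(84, 84, 116, 116), (68, 68, 132, 132)]

# Made-up softmax outputs for the same location at three scales.
probs = [[0.5, 0.3, 0.2],
         [0.2, 0.6, 0.2],
         [0.3, 0.4, 0.3]]
best, summed = fuse_scales(probs)
print(best)  # 1
```

In this example the smallest scale alone would choose class 0, while the summed vector favors class 1, which shows how wider context can change the decision.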
4.1 Experiment Settings
In GID, we randomly select 120 GF-2 images as training data and treat the remaining 30 images as testing data. For every category, 30,000 sampling points are randomly selected, constituting a training set of 150,000 samples in total. We separately set the patch size as , , pixels. With the exception of -pixel patches, patches of the other sizes are resized to pixels before being fed into ResNet. In pre-processing, we remove the near-infrared band from the GF-2 images, and then re-quantize the response of the visible light bands to 8 bits.
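The pre-processing step can be illustrated for a single pixel as follows. The text does not state how the re-quantization is done, so this sketch assumes a plain linear rescaling. The band order (near-infrared as the fourth band) and the source range (`source_max=1023`, that is, 10-bit data) are also assumptions made for the example.

```python
def to_8bit_rgb(pixel, nir_index=3, source_max=1023):
    """Drop the near-infrared band and linearly rescale to 0..255.

    Assumptions: `nir_index` marks the near-infrared band and
    `source_max` is the largest value of the source quantization.
    """
    visible = [v for i, v in enumerate(pixel) if i != nir_index]
    return [min(255, max(0, round(v * 255 / source_max)))
            for v in visible]


rgb = to_8bit_rgb([0, 512, 1023, 700])
print(rgb)  # [0, 128, 255]
```

Other choices, such as a percentile-based stretch, would fit the same description equally well.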
For performance comparison, we test several traditional land cover classification methods. The examined features include the color histogram (CH), gray-level co-occurrence matrix (GLCM), local binary patterns (LBP) and their fused features. The classifiers include the support vector machine (SVM) and random forest (RF). Features are fused by normalized vector concatenation. In addition, we compare our performance with the eCognition software. For the traditional methods, the training and testing sets are sampled from a single image. We use overall accuracy (OA), average accuracy (AA) and the kappa coefficient to quantitatively evaluate the experimental results.
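For reference, the three evaluation measures can be computed from a confusion matrix as sketched below. The sketch assumes the usual definitions: OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies, and kappa corrects OA for chance agreement. The matrix values are made up, and every class is assumed to have at least one sample.

```python
def classification_scores(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    overall = correct / total

    # Per-class accuracy: correct samples over all samples of that class.
    row_sums = [sum(row) for row in confusion]
    per_class = [confusion[i][i] / row_sums[i] for i in range(n)]
    average = sum(per_class) / n

    # Chance agreement from the row and column marginals.
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (overall - expected) / (1 - expected)
    return overall, average, kappa


# Made-up two-class confusion matrix.
oa, aa, kappa = classification_scores([[40, 10],
                                       [5, 45]])
print(oa, aa, kappa)
```

For this matrix OA and AA are both 0.85, the chance agreement is 0.5, and kappa is therefore about 0.7.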
4.2 Classification Results
Table 1 reports the kappa, OA, AA and per-category accuracy for the different sampling scales. Among the single-scale settings, the patches with the smallest scale yield the highest accuracy. However, the single-scale accuracies are generally lower than those of the multi-scale setting. This is because multi-scale patches incorporate spatial context information from wider observable areas. Table 2 compares the quantitative results of our method with those of the baseline methods. The accuracy of our method is significantly higher than that of the traditional methods.
To address the limited adaptability of land cover classification methods to RS images captured under different conditions, we propose a classification method based on deep learning, hierarchical segmentation and multi-scale information fusion. We also introduce a large-scale land cover classification dataset consisting of 150 Gaofen-2 images labeled in 5 categories for model training and performance evaluation. It covers a large area and has high intra-class diversity, hence it can also provide the research community with a high-quality data resource for evaluating and advancing state-of-the-art methods in land cover classification. The experiments show that the proposed method achieves higher accuracy than the traditional methods, and that combining spatial context information from different scales contributes to classification accuracy.
-  Mathieu Fauvel, Yuliya Tarabalka, Jon Atli Benediktsson, Jocelyn Chanussot, and James C Tilton, “Advances in spectral-spatial classification of hyperspectral images,” Proceedings of the IEEE, vol. 101, no. 3, pp. 652–675, 2013.
-  Ursula C Benz, Peter Hofmann, Gregor Willhauck, Iris Lingenfelder, and Markus Heynen, “Multi-resolution, object-oriented fuzzy analysis of remote sensing data for GIS-ready information,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 58, no. 3-4, pp. 239–258, 2004.
-  S Giada, T De Groeve, D Ehrlich, and P Soille, “Information extraction from very high resolution satellite imagery over Lukole refugee camp, Tanzania,” International Journal of Remote Sensing, vol. 24, no. 22, pp. 4251–4266, 2003.
-  Xiao Xiang Zhu, Devis Tuia, Lichao Mou, Gui-Song Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer, “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8–36, 2017.
-  Fan Hu, Gui-Song Xia, Jingwen Hu, and Liangpei Zhang, “Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery,” Remote Sensing, vol. 7, no. 11, pp. 14680–14707, 2015.
-  Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu, “AID: A benchmark data set for performance evaluation of aerial scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3965–3981, 2017.
-  Wenzhi Zhao and Shihong Du, “Learning multiscale and deep representations for classifying remotely sensed imagery,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 113, pp. 155–165, 2016.
-  Sakrapee Paisitkriangkrai, Jamie Sherrah, Pranam Janney, and Anton van den Hengel, “Semantic labeling of aerial and satellite imagery,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 7, pp. 2868–2881, 2016.
-  Lei Ma, Manchun Li, Xiaoxue Ma, Liang Cheng, Peijun Du, and Yongxue Liu, “A review of supervised object-based land-cover image classification,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 130, pp. 277–293, 2017.
-  Volodymyr Mnih, Machine learning for aerial image labeling, Ph.D. thesis, University of Toronto (Canada), 2013.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in , 2016, pp. 770–778.
-  Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.