With the advancement of earth observation techniques, the resolution of remote sensing images (e.g., hyperspectral images [Plaza2011Parallel], SAR images [Cantalloube2013Airborne], etc.) has been greatly improved. This allows a more intelligent understanding of remote sensing images and drives a large number of applications, such as classification of land use and land cover (LULC), remote sensing image retrieval, geographic object detection, traffic planning, urban growth measurement, and hazard monitoring [cheng2017remote, cheng2014multi, wang2016three, zhu2016bag, estoque2015pixel]. Most of these applications demand intelligent scene classification of remote sensing images, namely assigning each scene image a specific label (e.g., freeway, river, grassland) [cheng2017remote]. Traditional pixel-level methods [blaschke2001s] can hardly satisfy the needs of remote sensing scene classification. Object-based image analysis (OBIA) [blaschke2010object] and geographic-object-based image analysis (GEOBIA) [blaschke2014geographic] partition a scene image into several segments, but these segments carry little of the semantic information that is essential to describe a scene. Because remote sensing scenes are complex, extracting good semantic information for scene classification remains a challenging task.
It is well known that good feature descriptions or representations are critical for obtaining semantic information from remote sensing images. In other words, feature engineering is one of the most important procedures in remote sensing scene classification. Handcrafted features (such as GIST [Oliva2001Modeling], SIFT [Lowe2004Distinctive], and HOG [Dalal2005Histograms]) were the earliest features used for remote sensing scene classification. Designing well-performing handcrafted features, however, is time-consuming and requires intensive domain knowledge. Moreover, remote sensing images contain various scenes with different spatial resolutions and imaging conditions. Handcrafted features restrict the capacity of feature representations because they are generic and may not fit particular remote sensing images well. Recently, deep learning-based methods [lecun2015deep], especially CNN-based methods, have led a paradigm shift from designing features manually to learning features from data automatically. CNN-based methods have achieved better performance than traditional methods in most remote sensing image understanding tasks [scott2017training, chaib2017deep, li2017integrating, paoletti2018new]. A CNN learns task-specific features from remote sensing images. These features fit the data better and provide the high-level image information that is necessary for a robust feature representation.
There are two main parts in the feature-learning procedure of CNN-based methods: designing the architecture of the CNN and training the CNN with images. Nogueira et al. [nogueira2017towards] found that AlexNet [krizhevsky2012imagenet] performs better than VGG16 [simonyan2014very] on the UCMerced Land-Use dataset [yang2010bag], but the result is the opposite on the Brazilian Coffee Scenes dataset [penatti2015deep]. These CNNs with different architectures were trained under the same conditions, which indicates that performance depends largely on the architecture of the CNN. Previous work has often designed new CNN architectures to meet the needs of remote sensing image understanding. However, manually designed architectures are also constrained by the domain knowledge of their designers. To design better architectures for a specific task or data set, a paradigm shift from designing architectures to learning architectures from data is imperative. Architecture learning has mostly been studied in the domain of natural image understanding; an attempt in remote sensing is that of Chen et al. [chen2019automatic], who proposed an automatic CNN design approach for remote sensing hyperspectral image (HSI) classification.
In this paper, we explore a paradigm shift from designing architectures manually to learning architectures automatically. We also introduce and investigate CNN architectures learned from data for remote sensing scene classification tasks. The major contributions of this article are as follows:
1) We propose an architecture-learning procedure to design CNNs for remote sensing scene classification tasks.
2) We add the atrous convolution structure to the architecture space to capture a larger context for high-resolution aerial or satellite remote sensing images. Experiments show that all the best architectures learned by our framework contain atrous convolution.
3) Our experiments show that the best learned architectures outperform several classical hand-designed architectures on three remote sensing scene classification data sets. Because architecture determines the function of a CNN, this architecture-learning method may help to understand which representations are important for remote sensing scene classification tasks.
II Related Works
In this section, the related works of remote sensing scene classification and neural architecture search methods are briefly reviewed.
II-A Scene Classification
Previous works on remote sensing scene classification take varied forms but can be roughly divided into three categories according to the features they use: handcrafted features, unsupervised learning features, and deep learning features [cheng2017remote]. Methods based on handcrafted features are the earliest in remote sensing scene classification. Color histograms and texture descriptors [bhagavathy2006modeling], [dos2010evaluating] are global features that can be fed to the classifier directly. Scale-invariant feature transform (SIFT) [yang2012geographic], [risojevic2012fusion] and histogram of oriented gradients (HOG) [cheng2014multi], [cheng2015auto] are local features that usually need a mid-level descriptor to generate the representation of the entire image [yang2012geographic], [zhu2016bag]. Recently, combining multiple different features has been considered a promising approach to further improvement [risojevic2012fusion, zhu2016bag, zou2016scene]. For example, Zhu et al. [zhu2016bag] proposed a local–global feature for a bag-of-visual-words scene classifier, which combines several features by a fusion operation at the histogram level. Nevertheless, designing an effective model to combine these features is very difficult, and the representation ability of handcrafted features becomes weaker as the task grows more challenging.
II-B Architecture Learning Mechanism
To design better architectures for a specific task or data set, a paradigm shift from designing architectures to learning architectures from data is imperative. In the domain of natural image understanding, many different learning strategies, including random search, Bayesian optimization, reinforcement learning (RL), evolutionary methods, and gradient-based methods, have been used to learn the best neural architecture. Among these learning strategies, RL and evolutionary algorithms have attracted the most attention and achieved remarkable progress. Zoph and Le [zoph2016neural] proposed an RL-based learning strategy to learn the whole CNN architecture. In the RL-based method, an RNN is built as a controller to generate the architectural hyperparameters of neural networks. A CNN with the generated architecture is built and evaluated, and its performance on the validation set is used as the reward to optimize the RNN controller [zoph2018learning]. Evolutionary algorithms have also been used to optimize the neural architecture and have achieved results comparable to those of RL-based methods [real2017large, real2019regularized]. In evolutionary learning methods, each neural network architecture (i.e., model or individual) is encoded as a sequence of numbers. Each model is trained, and its performance on a validation set is used as its fitness. According to the fitness, reproduction and mutation are performed, and new high-performance “children” (i.e., models) are generated. Recently, Liu et al. [liu2018darts] proposed a method that transforms the discrete architecture space into a continuous space, which enables gradient-based optimization to search for a suitable neural architecture. Owing to their simplicity and low time cost, gradient-based learning strategies have become popular. An attempt in remote sensing is that of Chen et al. [chen2019automatic], who proposed an automatic CNN design approach for remote sensing HSI classification.
III-A Architecture Learning Procedure
The available CNN architectures for remote sensing scene classification have been manually designed by experts through trial and error. In recent years, much progress has been achieved, and CNN-based remote sensing scene classifiers have demonstrated good classification performance. However, the manual design of a CNN classifier usually takes a long time. Moreover, deep architectures are data set dependent and need to be changed and adapted from one data set to another. Therefore, automatically designing a proper CNN architecture is an important direction in remote sensing scene classification.
Designing a suitable architecture can be seen as a learning procedure. Fig. 1 shows the procedure of learning an architecture automatically. There are three parts in the architecture-learning system: architecture space, learning strategy, and classification performance estimation. The architecture space is the collection of CNN architectures that can be represented in principle. An appropriate definition of the architecture space can reduce its size and can also lead to novel CNN architectures. The architecture space of our framework is described in Section III-B. The learning strategy is the core part of devising a well-performing CNN architecture. In general, it involves a trade-off between speed and performance. For the task of remote sensing scene classification, we use a single graphics processing unit (GPU) card to train the classifier, so a fast algorithm is important. Evaluation of architectures is the last part of the learning procedure in Fig. 1. In this paper, the overall classification accuracy on validation samples is used to evaluate the performance of a specific CNN architecture.
III-B Architecture Space
We define a cell to be a small fully convolutional module, typically repeated multiple times to form the entire neural network. More specifically, a cell is a directed acyclic graph consisting of $B$ blocks. Each block is a two-branch structure, mapping from 2 input tensors to 1 output tensor. Block $i$ in cell $l$ may be specified using a 5-tuple $(I_1, I_2, O_1, O_2, C)$, where $I_1, I_2 \in \mathcal{I}_i^l$ are selections of input tensors, $O_1, O_2 \in \mathcal{O}$ are selections of layer types applied to the corresponding input tensor, and $C \in \mathcal{C}$ is the method used to combine the individual outputs of the two branches to form this block’s output tensor, $H_i^l$. The cell’s output tensor $H^l$ is simply the concatenation of the blocks’ output tensors $H_1^l, \ldots, H_B^l$. The set of possible input tensors, $\mathcal{I}_i^l$, consists of the output of the previous cell $H^{l-1}$, the output of the previous-previous cell $H^{l-2}$, and the previous blocks’ outputs in the current cell $\{H_1^l, \ldots, H_{i-1}^l\}$. Therefore, as we add more blocks in the cell, the next block has more choices as potential sources of input.
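To make the block encoding concrete, the following minimal Python sketch lists the candidate inputs of each block and counts the distinct 5-tuples available to it. The names `Block`, `candidate_inputs`, and `count_block_choices` are hypothetical, the string labels for tensors are illustrative only, and the sizes passed for the operator and combiner sets are assumptions chosen for the example.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """One block of a cell, written as the 5-tuple (I1, I2, O1, O2, C)."""
    input1: str
    input2: str
    op1: str
    op2: str
    combine: str


def candidate_inputs(block_index: int) -> List[str]:
    """Inputs available to block `block_index` (1-based) of cell l.

    They are the outputs of the two previous cells plus the outputs of
    all earlier blocks in the same cell.
    """
    previous_cells = ["H[l-1]", "H[l-2]"]
    earlier_blocks = [f"H[l]_{j}" for j in range(1, block_index)]
    return previous_cells + earlier_blocks


def count_block_choices(block_index: int, num_ops: int, num_combiners: int) -> int:
    """Number of distinct 5-tuples for one block (ordered input pairs)."""
    n_inputs = len(candidate_inputs(block_index))
    return (n_inputs ** 2) * (num_ops ** 2) * num_combiners


# Example block (operator names are illustrative labels, not an API).
example = Block("H[l-1]", "H[l-2]", "sep_conv_3x3", "atrous_conv_3x3", "add")
```

The count grows with the block index because every earlier block adds one more candidate input, which is the effect described in the last sentence of the paragraph above.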
III-C Learning Strategy
In order to accelerate the learning procedure, gradient descent-based methods, which are widely used in optimization problems (e.g., training a neural network), are adopted. Because gradient descent requires a continuous architecture space, the softmax operator is applied over all candidate operators to make the categorical choice of an operator continuous. The output of an operator set is then the weighted sum of the outputs of each operator in the set:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \beta_o^{(i,j)}\, o(x), \qquad \beta_o^{(i,j)} = \frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)}\right)}$$

where $\beta_o^{(i,j)}$ is the coefficient of operation $o$ in the operation set $\mathcal{O}$ between node $i$ and node $j$. The coefficients in the operation set sum to 1 and are obtained by the softmax operation over the architecture parameters $\alpha^{(i,j)}$. The underlying parameters $\alpha_o^{(i,j)}$ are randomly initialized and optimized by the gradient descent method. In this way, the architecture space changes from discrete to continuous, and the search can be solved using gradient descent. At the end of the learning procedure, the most likely operator, according to $\beta_o^{(i,j)}$, is selected as the final operator, and the network architecture is thus determined.
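The softmax relaxation described above can be illustrated with a small, self-contained Python sketch. It is only a toy under stated assumptions: scalars stand in for feature tensors, the three operators in `TOY_OPS` are placeholders rather than the operators of this paper, and the function names are hypothetical.

```python
import math
from typing import Callable, Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; the returned coefficients sum to 1."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_output(x: float, ops: Dict[str, Callable[[float], float]],
                 alpha: List[float]) -> float:
    """Weighted sum of all operator outputs on one connection."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, ops.values()))


def most_likely_operator(ops: Dict[str, Callable[[float], float]],
                         alpha: List[float]) -> str:
    """Operator kept after the search: the one with the largest coefficient."""
    names = list(ops.keys())
    best = max(range(len(names)), key=lambda k: alpha[k])
    return names[best]


# Placeholder operators acting on a scalar (assumption for illustration).
TOY_OPS: Dict[str, Callable[[float], float]] = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero_out": lambda x: 0.0,
}
```

Because softmax is monotonic, taking the largest raw parameter selects the same operator as taking the largest coefficient, so the final discrete choice does not require recomputing the softmax.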
Analogous to manual neural architecture design, where the performance on the validation data set guides the design, Auto-CNN uses the gradient descent method to update the architecture parameters based on the validation data set. The difference between the two design methods is that the former requires considerable prior experience on the part of the analyst to adjust the network structure, while the latter is fully automatic and data-dependent. Now, let $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ be the training and validation losses, respectively. The architecture optimization and the training of a CNN form a bi-level optimization problem with $\alpha$ as the architecture variables and $w$ as the weights in the CNN:

$$\min_{\alpha}\ \mathcal{L}_{val}\left(w^{*}(\alpha), \alpha\right) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha)$$

After the optimization, the obtained $\alpha^{*}$ and $w^{*}$ minimize the validation and training losses (i.e., $\mathcal{L}_{val}$ and $\mathcal{L}_{train}$), and are used for the architecture design and the CNN, respectively. The procedure can be summarized in the following two steps:
1. Update network weights $w$ by descending $\nabla_{w} \mathcal{L}_{train}(w, \alpha)$
2. Update network architecture $\alpha$ by descending $\nabla_{\alpha} \mathcal{L}_{val}(w, \alpha)$
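The two alternating steps can be sketched on a toy problem. The example below is not the authors' implementation: it assumes a first-order alternation (the architecture gradient is taken at the current weights), a one-parameter model `pred = gate * w * x` in which a softmax gate chooses between a linear operator and a zero operator, and a target `y = 2x`. All function names, data values, and learning rates are hypothetical.

```python
import math
from typing import List, Tuple


def gate(alpha: List[float]) -> float:
    """Softmax weight of the first of two candidate operators."""
    return 1.0 / (1.0 + math.exp(alpha[1] - alpha[0]))


def loss(w: float, alpha: List[float], xs: List[float]) -> float:
    """Mean squared error of pred = gate * w * x against y = 2x."""
    c = gate(alpha) * w
    return sum((c * x - 2.0 * x) ** 2 for x in xs) / len(xs)


def grad_w(w: float, alpha: List[float], xs: List[float]) -> float:
    """Derivative of the loss with respect to the weight w."""
    g = gate(alpha)
    return sum(2.0 * (g * w * x - 2.0 * x) * g * x for x in xs) / len(xs)


def grad_alpha(w: float, alpha: List[float], xs: List[float]) -> List[float]:
    """Derivative of the loss with respect to the two architecture parameters."""
    g = gate(alpha)
    d_gate = sum(2.0 * (g * w * x - 2.0 * x) * w * x for x in xs) / len(xs)
    d_alpha0 = d_gate * g * (1.0 - g)  # softmax derivative for two logits
    return [d_alpha0, -d_alpha0]


def alternate(train_xs: List[float], val_xs: List[float], steps: int = 200,
              lr_w: float = 0.05, lr_alpha: float = 0.05) -> Tuple[float, List[float]]:
    """Alternate weight updates (training data) and architecture updates (validation data)."""
    w, alpha = 0.5, [0.0, 0.0]
    for _ in range(steps):
        # Step 1: update the weight on the training loss.
        w -= lr_w * grad_w(w, alpha, train_xs)
        # Step 2: update the architecture parameters on the validation loss.
        g = grad_alpha(w, alpha, val_xs)
        alpha = [a - lr_alpha * d for a, d in zip(alpha, g)]
    return w, alpha
```

In this toy setting the architecture parameter of the linear operator grows relative to that of the zero operator, mirroring how the search favours operators that reduce the validation loss.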
When the architecture search process is finished, only the most likely operator on the connection between two nodes is preserved. Moreover, the output of each node is computed from only its two strongest predecessor nodes, where the strength of the connection between two nodes is defined by the coefficient $\beta_o^{(i,j)}$, i.e., the softmax weight of the retained operator on that connection. The architecture of the Architecture-Learning CNN is then determined, and we train it from scratch on the training data set.
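The discretization step can be sketched as follows. This is a minimal illustration that assumes the strength of a connection is the largest operator coefficient on it; the function name `discretize_node`, the predecessor labels, and the coefficient values are hypothetical.

```python
from typing import Dict, List, Tuple

EdgeCoeffs = Dict[str, float]  # operator name -> coefficient on one connection


def discretize_node(incoming: Dict[str, EdgeCoeffs],
                    keep: int = 2) -> List[Tuple[str, str]]:
    """Return (predecessor, operator) pairs for the `keep` strongest connections.

    On every incoming connection only the operator with the largest
    coefficient survives; connections are then ranked by that coefficient.
    """
    scored = []
    for predecessor, coeffs in incoming.items():
        best_op = max(coeffs, key=coeffs.get)
        scored.append((coeffs[best_op], predecessor, best_op))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(predecessor, op) for _, predecessor, op in scored[:keep]]


# Made-up coefficients for one node with three possible predecessors.
example_incoming = {
    "cell_l-1": {"sep3": 0.6, "atr3": 0.3, "skip": 0.1},
    "cell_l-2": {"sep3": 0.2, "atr3": 0.5, "skip": 0.3},
    "node_1": {"sep3": 0.1, "atr3": 0.8, "skip": 0.1},
}
selected = discretize_node(example_incoming)
```

With these made-up coefficients the node keeps the connections from `node_1` and `cell_l-1` and drops the one from `cell_l-2`.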
III-D Architecture Generator
The architecture generator maps the set of parameters obtained by the learning strategy to a computing cell, as shown in Fig. 2. The set of possible layer types, $\mathcal{O}$, consists of 7 operators (3 × 3 depthwise-separable conv, 5 × 5 depthwise-separable conv, 3 × 3 atrous conv with rate 2, 5 × 5 atrous conv with rate 2, 3 × 3 average pooling, 3 × 3 max pooling, and skip connection), all of which are prevalent in modern CNNs. In addition, every convolution operator is replaced by a triplet, as illustrated in Fig. 3.
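To show why atrous convolution captures a larger context, the sketch below computes the effective kernel size of a dilated filter and applies a one-dimensional atrous convolution. It is an illustration only: it uses plain Python lists, "valid" boundaries, and the cross-correlation convention common in deep learning, and the function names are hypothetical.

```python
from typing import List


def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Span covered by a kernel whose taps are spaced `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal: List[float], kernel: List[float], rate: int) -> List[float]:
    """1-D atrous (dilated) convolution without padding.

    The kernel is not flipped, i.e. this is cross-correlation.
    """
    span = (len(kernel) - 1) * rate
    return [
        sum(kernel[k] * signal[i + k * rate] for k in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


# The two atrous operators in the operator set use rate 2.
size_3x3_rate2 = effective_kernel_size(3, 2)  # 5
size_5x5_rate2 = effective_kernel_size(5, 2)  # 9
```

A 3 × 3 atrous filter with rate 2 therefore covers the same 5 × 5 window as an ordinary 5 × 5 filter while keeping only 9 weights per channel.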
To confirm the effectiveness of the proposed architecture learning procedure, we search for the optimal architecture on each of three remote sensing scene classification data sets and validate its performance. We train the architectures learned by our method and several classical CNN backbones with the same parameter settings, and compare their performance using our evaluation metrics. In addition, an ablation study is conducted to evaluate the effectiveness of atrous convolution in the architecture.
IV-A Data Description
We choose three widely used remote sensing scene classification data sets (UC Merced Land-Use [yang2010bag], AID [xia2017aid], and NWPU-RESISC45 [cheng2017remote]) to test the robustness and effectiveness of our proposed method.
IV-A1 UC Merced Land-Use Data Set
The UC Merced Land-Use dataset is composed of 2100 aerial scene images divided into 21 land-use scene classes. Each class contains 100 images of 256 × 256 pixels with a spatial resolution of 0.3 m per pixel in the red-green-blue (RGB) color space. These images were selected from aerial orthoimagery downloaded from the United States Geological Survey (USGS) National Map of the following US regions: Birmingham, Boston, Buffalo, Columbus, Dallas, Harrisburg, Houston, Jacksonville, Las Vegas, Los Angeles, Miami, Napa, New York, Reno, San Diego, Santa Barbara, Seattle, Tampa, Tucson, and Ventura. The diversity of land-use categories is not the only thing that makes the dataset challenging. It also includes highly overlapping classes such as dense residential, medium residential, and sparse residential, which differ mainly in the density of structures and make the dataset more difficult to classify. This dataset has been widely used for remote sensing image scene classification.
IV-A2 Aerial Image Data Set
AID is a large-scale aerial image dataset collected from Google Earth imagery. It is more challenging than the UC Merced Land-Use dataset for the following reasons. First, the AID dataset contains more scene types and images: it has 10,000 images with a fixed size of 600 × 600 pixels in 30 classes. Some similar classes make the interclass dissimilarity smaller, and the number of images per scene type varies from 220 to 420. Moreover, AID images were captured at different times and seasons, under different imaging conditions, and in different countries and regions around the world. Finally, AID images are multiresolution, with the spatial resolution ranging from approximately 8 m to about half a meter.
IV-A3 NWPU-RESISC45 Data Set
The NWPU-RESISC45 dataset is more complex than the UC Merced Land-Use and AID datasets and consists of a total of 31,500 remote sensing images divided into 45 scene classes. Each class includes 700 images of 256 × 256 pixels in the RGB color space. This dataset was extracted from Google Earth by experts in the field of remote sensing image interpretation. The spatial resolution varies from approximately 30 m to 0.2 m per pixel. This dataset covers more than 100 countries and regions all over the world, with developing, transitional, and highly developed economies.
IV-B Implementation Details
In this section, we explain the parameter settings of our experiments. The experiments can be divided into two primary parts: architecture learning and architecture evaluation. First, we search for the optimal architecture through our architecture learning procedure. We then evaluate the learned task-specific architecture on the corresponding dataset and compare its performance with classical CNN models under the same training settings to demonstrate the effectiveness of our method. In the architecture learning procedure, we divide a data set into two equal-sized parts, i.e., a training set and a validation set. The training set is used for training in our learning strategy, and the validation set is used to evaluate the performance of the architectures. The architecture with the best performance on the validation set is chosen as the optimal architecture. In the architecture evaluation, the division of the data sets depends on the experimental setting.
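The equal division into training and validation parts could be implemented along the following lines. This is a sketch under assumptions that the text does not state: the split is taken per class (stratified) and is driven by a fixed random seed. The function name `split_in_half` is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def split_in_half(labels: Sequence[int], seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split sample indices into two halves, class by class.

    Returns (train_indices, val_indices). Stratification and the seed
    are assumptions made for this illustration.
    """
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train: List[int] = []
    val: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        half = len(indices) // 2
        train.extend(indices[:half])
        val.extend(indices[half:])
    return train, val


# Label layout of a data set with 21 classes of 100 images each.
ucm_like_labels = [c for c in range(21) for _ in range(100)]
train_idx, val_idx = split_in_half(ucm_like_labels)
```

For a data set with 21 classes of 100 images, this yields 1050 training and 1050 validation indices with 50 images per class in each part.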
1) Architecture Space: The operator set used in our architecture space contains the 3 × 3 and 5 × 5 separable convolution residual triplets, the 3 × 3 and 5 × 5 atrous convolution residual triplets, 3 × 3 max pooling, 3 × 3 average pooling, and identity. It is worth noting that atrous convolution [chen2017deeplab] (the same as dilated convolution [yu2015multi]), which is rarely used in remote sensing scene classification, appears in our operator set for its potential to obtain a larger context. Every set of parameters in the architecture space can be mapped to a computing cell. As illustrated in Fig. 4(b), the input of the $l$-th cell is the output of the $(l-1)$-th cell and the output of the $(l-2)$-th cell. To keep the computing time of the architecture learning procedure acceptable, we set the number of weighted-sum nodes to 4, and every weighted-sum node aggregates only 2 outputs from previous nodes. The outputs of these four weighted-sum nodes are concatenated [szegedy2015going] to form the output of the cell. We search for two types of computing cells [zoph2018learning] for better performance in our experiments: a normal cell, in which the stride of all operators is set to 1 so that the output resolution of the feature map stays the same as the input resolution, and a reduce cell, in which the stride of all operators is set to 2. The number of reduce cells used in our architecture is set to 2 in our experiments.
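The effect of normal and reduce cells on the feature-map resolution can be traced with a short sketch. It assumes that a reduce cell halves the spatial size (rounding up) and that a normal cell leaves it unchanged. The order of cells in the example is hypothetical, since the positions of the two reduce cells are not specified here.

```python
import math
from typing import List


def trace_resolution(input_size: int, cells: List[str]) -> List[int]:
    """Spatial size of the feature map after each cell in `cells`.

    "normal" keeps the size (stride 1); "reduce" halves it (stride 2).
    """
    sizes: List[int] = []
    size = input_size
    for kind in cells:
        if kind == "reduce":
            size = math.ceil(size / 2)
        elif kind != "normal":
            raise ValueError(f"unknown cell type: {kind}")
        sizes.append(size)
    return sizes


# Hypothetical six-cell stack containing exactly two reduce cells.
example_stack = ["normal", "normal", "reduce", "normal", "normal", "reduce"]
example_sizes = trace_resolution(16, example_stack)
```

With two reduce cells, the spatial size at the end of the stack is one quarter of the size at its input, regardless of where the reduce cells are placed.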
2) Architecture Learning Procedure: As described in Section III, when the learning strategy finds a set of parameters, the architecture generator maps it to a cell and then stacks the cells one by one. As illustrated in Fig. 4, the input data pass through two convolution layers in sequence (Conv1 is 3 × 3 with stride 2, and Conv2 is 3 × 3 with stride 1), and normal cells and reduce cells are stacked several times to complete the CNN. We tried multiple settings to find the best number of cells for every dataset: the number of cells is set to 6 for the UC Merced Land-Use data set, 7 for AID, and 10 for NWPU-RESISC45. We then train the CNN weights and the architecture parameters alternately from scratch, using PyTorch on an NVIDIA 1080Ti. We resize the remote sensing images to 32 × 32 because the architecture learning process consumes a large amount of memory. The training parameters are as follows. We train our CNN models using stochastic gradient descent, and set the batch size to 64 with the initial learning rate 0.025, momentum of 0.9 and a weight decay of for 50 epochs. We train our architecture parameters with the learning rate of and weight decay .
3) Evaluation Procedure: In the architecture evaluation procedure, we train the architectures learned by our method and the classical CNN architectures with the same settings. The optimal cells are stacked the same number of times as in the architecture learning procedure. All images are resized to 224 × 224 and augmented with random horizontal flips before being fed to the CNNs. Our architectures are fully trained [nogueira2017towards] with a batch size of 64 using stochastic gradient descent, with an initial learning rate of 0.1 that decays by a factor of 0.97 every epoch and a momentum of 0.9; every model is trained for 150 epochs.
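The learning-rate schedule of the evaluation procedure can be written out explicitly. The sketch below assumes that the decay is applied multiplicatively once per epoch, starting after the first epoch; the function name `learning_rate` is hypothetical.

```python
from typing import List


def learning_rate(epoch: int, initial_lr: float = 0.1, decay: float = 0.97) -> float:
    """Learning rate used during `epoch` (0-based) under per-epoch exponential decay."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return initial_lr * decay ** epoch


# Learning rates for the 150 training epochs of the evaluation procedure.
schedule: List[float] = [learning_rate(e) for e in range(150)]
```

Under this assumption the rate falls from 0.1 to roughly 0.001 over the 150 epochs, i.e., by about two orders of magnitude.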
IV-C Experiment Results
In this section, we present the architectures learned by our method and compare their performance with some classical CNN architectures that are widely used in remote sensing scene classification. Because our datasets have similar numbers of images in each class, the average accuracy (AA) is close to the overall accuracy (OA), so we exclude AA from the evaluation metrics. OA and the confusion matrix (CM) are used to demonstrate the effectiveness of our method.
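The relation between OA, AA, and the CM can be checked with a few lines of Python. This is a generic sketch of the standard definitions, not the evaluation code used in the experiments; the function names and the toy labels are hypothetical.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm: List[List[int]]) -> float:
    """Fraction of all samples that are classified correctly."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def average_accuracy(cm: List[List[int]]) -> float:
    """Mean of the per-class accuracies (classes without samples are skipped)."""
    per_class = [cm[i][i] / sum(cm[i]) for i in range(len(cm)) if sum(cm[i]) > 0]
    return sum(per_class) / len(per_class)


# Balanced toy example: two samples per class.
balanced_cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3)
# Imbalanced toy example: four samples of class 0, one of class 1.
imbalanced_cm = confusion_matrix([0, 0, 0, 0, 1], [0, 0, 0, 0, 0], 2)
```

When every class has the same number of test samples, OA and AA coincide, as in the balanced toy example; in the imbalanced example OA is 0.8 while AA is 0.5, which is why AA only adds information for unbalanced test sets.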
1) Architecture Learning Procedure: We apply our architecture learning procedure to the three data sets. As shown in Fig. 5, we learn the optimal architectures for 50 epochs. During the architecture learning process, the training accuracy and validation accuracy continue to increase and approach convergence by the 50th epoch. The resulting optimal cells are shown in Figs. 6-8. The sum in the figures represents the weighted-sum operation. Sep 3 × 3 and Sep 5 × 5 represent the 3 × 3 and 5 × 5 separable convolution residual triplets, and Atr 3 × 3 and Atr 5 × 5 represent the 3 × 3 and 5 × 5 atrous convolution residual triplets. Max Pool means 3 × 3 max pooling, and a black line without any text represents identity. All of the cells contain both separable convolution and atrous convolution. Atrous convolution is known for the larger field of view of its filters, which captures a larger context. Architectures with atrous convolution may perform better because remote sensing scene classification demands a larger context.
2) Architecture Evaluation Procedure: We compare the performance of the optimal architectures we learned with that of eight classical CNN architectures, such as AlexNet [krizhevsky2012imagenet], VGG16 [simonyan2014very], and ResNet-50 [he2016deep]. We also compare the numbers of parameters of these architectures, as shown in Table I. Because the optimal cells differ between datasets and are stacked a different number of times, the numbers of parameters of our optimal architectures differ. The architectures we found have fewer parameters than the classical ones. We train all the architectures with a full-training strategy. Compared with the fine-tuning strategy, this strategy is prone to overfitting and results in a performance decrease [nogueira2017towards].
For the UC Merced Land-Use data set, we stack the learned optimal cells 6 times and train the architecture from scratch for 150 epochs. We train on the dataset with two training ratios. The results are shown in Table II. Our method achieves the best performance at both the 50% and the 80% training ratio. We also compute a CM to further analyze the architecture learned by our architecture learning procedure, as shown in Fig. 9. It can be observed that the performance using 80% of the images for training is lower than that using 50% of the images. The reason is that there are only 2100 images in the UCM data set. More training data may harm the robustness of the model.
| Optimal Architecture on UCM | 2.626 |
| Optimal Architecture on AID | 3.745 |
| Optimal Architecture on NWPU | 5.180 |
For AID, we stack the learned optimal cells 7 times and train the architecture from scratch for 150 epochs. We train on the dataset with two training ratios. The results are shown in Table II. Our method achieves the best performance at both the 50% and the 80% training ratio. We also compute a CM to further analyze the architecture learned by our architecture learning procedure, as shown in Fig. 10.
For NWPU, we stack the learned optimal cells 10 times and train the architecture from scratch for 150 epochs. We train on the dataset with two training ratios. The results are shown in Table II. Our method achieves the best performance at both the 50% and the 80% training ratio. We also compute a CM to further analyze the architecture learned by our architecture learning procedure, as shown in Fig. 11.
| AlexNet | 77.676 ± 1.183 | 88.619 ± 0.200 | 79.724 ± 0.514 | 84.900 ± 0.382 | 82.697 ± 0.069 | 86.321 ± 0.281 |
| VGG16 | 81.467 ± 0.306 | 88.953 ± 0.272 | 82.416 ± 0.506 | 87.760 ± 0.213 | 85.003 ± 0.224 | 89.502 ± 0.150 |
| Googlenet | 82.705 ± 0.987 | 87.666 ± 1.070 | 82.772 ± 1.537 | 88.150 ± 0.777 | 86.592 ± 0.824 | 91.613 ± 0.657 |
| InceptionV3 | 81.676 ± 1.228 | 87.762 ± 1.364 | 82.444 ± 0.745 | 89.030 ± 1.272 | 88.305 ± 0.479 | 92.584 ± 0.267 |
| ResNet-50 | 83.371 ± 1.104 | 88.714 ± 1.755 | 82.476 ± 0.683 | 88.100 ± 0.651 | 87.774 ± 0.871 | 92.013 ± 0.291 |
| DenseNet | 83.790 ± 0.611 | 89.667 ± 0.916 | 82.760 ± 0.703 | 87.500 ± 0.242 | 88.692 ± 0.517 | 92.708 ± 0.173 |
| Wide ResNet | 82.476 ± 0.741 | 88.857 ± 0.488 | 82.160 ± 0.467 | 88.250 ± 0.941 | 87.478 ± 0.967 | 92.308 ± 0.171 |
| MobileNetV2 | 84.000 ± 0.415 | 90.000 ± 0.996 | 85.680 ± 0.331 | 89.330 ± 0.372 | 89.543 ± 0.224 | 92.289 ± 0.190 |
| Our Methods | 88.628 ± 1.134 | 93.428 ± 0.706 | 86.996 ± 0.750 | 91.120 ± 0.498 | 89.725 ± 0.386 | 93.667 ± 0.241 |
IV-D Ablation Study
Since all the optimal architectures found by our architecture learning procedure contain the atrous convolution operator, the larger context captured by atrous convolution may be crucial for remote sensing scene classification. To further verify this conclusion, we design an ablation study.
We perform the ablation experiments in two ways. The first is to replace all the atrous convolutions in our optimal cells with separable convolutions of the same size, as illustrated for the optimal cell of UCM in Fig. 12. The second is to exclude atrous convolutions from the architecture learning procedure and then repeat the procedure to observe the performance of the resulting optimal architectures. We train these architectures with the same training settings described in Section IV-B. The performance comparison between the architectures with and without atrous convolution is shown in Table III. All the architectures without atrous convolution show a performance decrease, ranging from 3.16% to 0.62%. This indicates that architectures with atrous convolution can learn a better representation than those without it.
| Our Methods | 88.628 ± 1.134 | 93.428 ± 0.706 | 86.996 ± 0.750 | 91.120 ± 0.498 | 89.725 ± 0.386 | 93.667 ± 0.241 |
| Ablation Set 1 | 87.001 ± 0.465 | 91.571 ± 0.574 | 86.200 ± 0.821 | 91.040 ± 0.947 | 89.351 ± 0.328 | 93.271 ± 0.562 |
| Ablation Set 2 | 87.526 ± 0.343 | 92.754 ± 0.483 | 86.327 ± 0.643 | 90.872 ± 0.569 | 89.486 ± 0.223 | 93.197 ± 0.376 |
In this paper, we proposed an architecture learning procedure that can learn a task-specific CNN architecture for remote sensing scene classification. The experimental results indicate that the task-specific CNN architectures outperform several classical human-designed architectures on three remote sensing scene classification benchmarks. Moreover, because architecture determines the function of a CNN, this architecture-learning paradigm may help to understand which representations are important for remote sensing scene classification tasks and lead to better backbones for remote sensing scene classification.