Recent advances in remote sensing techniques have greatly increased the volume of available high-resolution aerial images, and many applications, such as urban cartography [1, 2, 3, 4], traffic monitoring [5, 6, 7], terrain surface analysis [8, 9, 10, 11], and ecological scrutiny [12, 13], have benefited from these developments. For this reason, aerial image classification has become one of the fundamental visual tasks in the remote sensing community and has drawn considerable research interest [14, 15, 16, 17, 18, 19, 20, 21]. The classification of aerial images refers to assigning these images specific labels according to their semantic contents, and a common hypothesis shared by many relevant studies is that an image should be labeled with only one semantic category, such as a scene category (see Fig. 1). Although such image-level labels [22, 23] can describe images from a macroscopic perspective, they cannot provide a comprehensive view of the objects in aerial images. To tackle this, many algorithms have been proposed to identify each pixel in an image [24, 25, 26] or localize objects with bounding boxes [27, 28, 29]. However, acquiring the requisite ground truths (i.e., pixel-wise annotations and bounding boxes) demands considerable expertise and human labor, which makes relevant datasets expensive and difficult to access. Consequently, multi-label image classification now attracts increasing attention in the remote sensing community [30, 31, 32, 33, 34] because 1) it draws a comprehensive picture of aerial image contents, and 2) the datasets it requires are not expensive (only image-level labels are needed).
Fig. 1 illustrates the difference between image-level scene labels and object labels. As shown in this figure, although the four images are assigned the same scene label, their object labels vary considerably. It is worth noting that the identification of some objects can offer important cues for understanding a scene more deeply. For example, the existence of building and pavement indicates a high probability that the rivers in some images of Fig. 1 are very close to areas with frequent human activities, while the rivers in the other images are more likely in the wild due to the absence of human activity cues. In contrast, simply recognizing scene labels can hardly provide such information. Therefore, in this paper, we dedicate our efforts to exploring an effective model for the multi-label classification of aerial images.
I-A Challenges of Identifying Multiple Labels
In identifying multiple labels of an aerial image, two main challenges need to be addressed. One is how to extract semantic feature representations from raw images. This is crucial but difficult, especially for high-resolution aerial images, as they often contain complicated spatial contextual information. Conventional approaches mainly resort to manually crafted features and semantic models [22, 35, 36, 37, 38], but these methods cannot effectively extract high-level semantics, which limits their classification performance. Hence an efficient high-level feature extractor is desirable.
The other challenge is how to take full advantage of label correlations to infer multiple object labels of an aerial image. In contrast to single-label classification, which mainly focuses on modeling image-label relevance, exploring and modeling label-label correlations plays a supplementary yet essential role in identifying multiple objects in aerial images. For instance, the presence of ships strongly implies the co-occurrence of water or sea, while the existence of a car suggests a high probability of the appearance of pavements. Unfortunately, such label correlations are scarcely addressed in the literature. One solution is to use a recurrent neural network (RNN) to learn label dependencies. However, this is done in a chain propagation fashion, and its performance heavily depends on how effectively the network learns long-term memorization. Moreover, label relations are modeled implicitly in this way, which leads to a lack of interpretability.
Overall, an efficient multi-label classification model should be capable of not only learning high-level feature representations but also modeling label correlations effectively.
I-B Related Work
Zegeye and Demir [32] introduce a spatial and structured SVM for multi-label classification by considering spatial relations between a given patch and its neighbors. Similarly, Zeggada et al. employ a conditional random field (CRF) framework to model spatial contextual information among adjacent patches for improving the performance of classifying multiple object labels.
With the development of computational resources and deep learning, very recent approaches mainly resort to deep networks for multi-label classification. In one study, the authors make use of a standard CNN architecture to extract feature representations and then feed them into a multi-label classification layer, which is composed of customized thresholding operations, for predicting multiple labels. In another, the authors demonstrate that training a CNN for multi-label classification with a limited amount of labeled data usually leads to an underperforming model and propose a dynamic data augmentation method for enlarging training sets. More recently, Sumbul and Demir propose a CNN-RNN method for identifying labels in multi-spectral images, where a bidirectional LSTM is employed to model spatial relationships among image patches. In order to explore inherent correlations among object labels, a CNN-LSTM hybrid network architecture has also been proposed to learn label dependencies for classifying object labels of aerial images.
I-C The Motivation of Our Work
In order to explicitly model label relations, we propose a label relational inference network for multi-label aerial image classification. This work is inspired by recent successes of relation networks in visual question answering, object detection, video classification, activity recognition in videos, and semantic segmentation. A relation network is characterized by its inherent capability of inferring relations between an individual entity (e.g., a region in an image or a frame in a video) and all other entities (e.g., all regions in the image or all frames in the video). Besides, to increase the effectiveness of relational reasoning, we make use of a spatial transformer, which is often used to enhance the transformation invariance of deep neural networks, to reduce the impact of irrelevant semantic features.
More specifically, in this work, an innovative end-to-end multi-label aerial image classification network, termed as attention-aware label relational reasoning network, is proposed and characterized by its capabilities of localizing label-specific discriminative regions and explicitly modeling semantic label dependencies for the task. This paper’s contributions are threefold.
We propose a novel multi-label aerial image classification network, the attention-aware label relational reasoning network, which consists of three essential components: a label-wise feature parcel learning module, an attentional region extraction module, and a label relational inference module. To the best of our knowledge, this is the first time that the idea of relation networks is employed to predict multiple object labels of aerial images, and experimental results demonstrate its effectiveness.
We extract attentional regions from the label-wise feature parcels in a proposal-free fashion. Particularly, a learnable spatial transformer is employed to localize attentional regions, which are assumed to contain discriminative information, and then re-coordinate them into a given size. By doing so, attentional feature parcels can be yielded.
To facilitate progress in multi-label aerial image classification, we produce a new dataset, the AID multi-label dataset, by relabeling images in the AID dataset. In comparison with the UCM multi-label dataset, the proposed dataset is more challenging due to diverse spatial resolutions of images, more scenes, and more samples.
The remaining sections of this paper are organized as follows. Section II delineates the three elemental modules of our proposed network, and Section III introduces experiments, where experimental setups are given and results are analyzed and discussed. Finally, Section IV concludes the paper.
II-A Network Architecture
As illustrated in Fig. 2, the proposed network comprises three components: a label-wise feature parcel learning module, an attentional region extraction module, and a label relational inference module. Let $L$ be the number of object labels and $l$ be the $l$-th label. The label-wise feature parcel learning module is designed to extract high-level feature maps with $K$ channels, termed a feature parcel $X_l$ (for more details refer to Section II-B), for each object $l$. The attentional region extraction module is used to localize discriminative regions in each $X_l$ and generate an attentional feature parcel $A_l$, which is supposed to contain the most relevant semantics with respect to the label $l$. Finally, relations among $A_l$ and all other label-wise attentional feature parcels are reasoned about by the label relational inference module for predicting the presence of the object $l$.
Details of the proposed network are introduced in the remaining sections.
II-B Label-wise Feature Parcel Learning
The extraction of high-level features is crucial for visual recognition tasks, and many recent studies adopt CNNs owing to their remarkable performance in learning such features [15, 49, 50, 51, 52]. Hence, we take a standard CNN as the backbone of the label-wise feature parcel learning module in our model. As shown in Fig. 2, an aerial image is first fed into a CNN (e.g., VGG-16), which consists of only convolutional and max-pooling layers, for generating high-level feature maps. Afterwards, these features are encoded into $L$ feature parcels via a label-wise multi-modality feature learning layer built from convolutional filters. The channel dimensionality of the output features is $KL$, while the spatial dimensionality is unchanged. With this design, $K$ feature maps are learned for each object $l$; together they form the so-called feature parcel, denoted as $X_l$. After iterative training, these feature parcels are expected to contain discriminative label-related semantic information.
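The grouping of output channels into per-label parcels can be illustrated with a minimal NumPy sketch. It assumes that the label-wise layer is a 1x1 convolution (the filter size is an assumption made here for simplicity) and that consecutive groups of K channels belong to one label; the function name `label_wise_parcels`, the 14x14x512 input, and K = 4 are hypothetical choices for illustration only.

```python
import numpy as np

def label_wise_parcels(features, weights, bias, num_labels):
    """Split the output of a 1x1 convolution into one feature parcel per label.

    features: (H, W, C) backbone feature maps.
    weights:  (C, K * num_labels) kernel of a 1x1 convolution (assumed size).
    bias:     (K * num_labels,) bias vector.
    Returns an array of shape (num_labels, H, W, K).
    """
    h, w, _ = features.shape
    # A 1x1 convolution is a per-pixel linear map over channels.
    out = features @ weights + bias            # (H, W, K * num_labels)
    k = out.shape[-1] // num_labels
    # Consecutive groups of K channels form the parcel of one label.
    parcels = out.reshape(h, w, num_labels, k)
    return np.transpose(parcels, (2, 0, 1, 3))

rng = np.random.default_rng(0)
feats = rng.normal(size=(14, 14, 512))         # hypothetical backbone output
num_labels, k = 17, 4                          # K = 4 is an arbitrary choice
w = rng.normal(size=(512, k * num_labels)) * 0.01
b = np.zeros(k * num_labels)
parcels = label_wise_parcels(feats, w, b, num_labels)
print(parcels.shape)  # (17, 14, 14, 4)
```

The spatial size of each parcel equals that of the backbone output, which matches the statement that only the channel dimensionality changes.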
In our experiments, we notice that $X_l$ with a higher resolution is beneficial for the subsequent module to localize discriminative regions, as more spatial contextual cues are included. Accordingly, we discard the last max-pooling layer in VGG-16, which increases the spatial size of the outputs. Moreover, we extend our study to GoogLeNet (Inception-v3) and ResNet (ResNet-50 in our case) for a comprehensive evaluation. Specifically, we adapt GoogLeNet by removing the global average pooling and fully-connected layers as well as reducing the stride of convolutional and pooling layers in “mixed8” to 1 to improve the spatial resolution. Besides, in order to preserve the receptive fields of subsequent convolutional layers, filters in “mixed9” are replaced with atrous convolutional filters with a dilation rate of 2. Regarding ResNet, we set the convolution stride and dilation rate of filters to 1 and 2, respectively, in the last residual block. The global average pooling and fully-connected layers are removed as well.
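The trade-off behind these adaptations can be made concrete with two small helper functions. This is only an illustrative sketch: the function names are hypothetical, "same" padding is assumed, and the input length of 17 is an arbitrary example rather than a value taken from the networks above.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated (atrous) filter along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def same_padded_output_size(input_size, stride):
    """Output length of a 'same'-padded convolution or pooling layer."""
    return -(-input_size // stride)   # ceiling division

# Reducing the stride from 2 to 1 keeps the spatial resolution ...
print(same_padded_output_size(17, stride=2))   # 9
print(same_padded_output_size(17, stride=1))   # 17
# ... and a dilation rate of 2 lets a 3x3 filter span 5 pixels per axis,
# compensating for the downsampling step that was removed.
print(effective_kernel_size(3, dilation=2))    # 5
```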
II-C Attentional Region Extraction Module
Although label-wise feature parcels can be directly applied to exploring label dependencies, less informative regions (see blue areas in Fig. 3) may bring noise and further reduce the effectiveness of these feature parcels. As shown in the left image of Fig. 3, weakly activated regions indicate a loose relevance to the corresponding object label, while highlighted regions suggest a strong region-label relevance. To diminish the influence of irrelevant regions, we employ an attentional region extraction module to automatically extract discriminative regions from label-wise feature parcels.
We localize and re-coordinate attentional regions from $X_l$ with a learnable spatial transformer. Particularly, we sample a feature parcel $X_l$ into a regular spatial grid $G^r$ (cf. green dots in the left image of Fig. 3) according to the spatial resolution of $X_l$ and regard pixels in $X_l$ as points on the grid with coordinates $(x^r, y^r)$. Similarly, we can define coordinates of a new grid, the attentional region grid $G^a$ (see white dots in the middle image of Fig. 3), as $(x^a, y^a)$, and the number of grid points along the height and width is equal to that of $G^r$. Since $G^a$ can be learned by performing a spatial transformation on $G^r$, as demonstrated for spatial transformers, $(x^a, y^a)$ can be calculated with the following equation:

$$\begin{pmatrix} x^a \\ y^a \end{pmatrix} = M_l \begin{pmatrix} x^r \\ y^r \\ 1 \end{pmatrix}, \tag{1}$$

where $M_l$ is a learnable transformation matrix, and the grid coordinates, $(x^r, y^r)$ and $(x^a, y^a)$, are normalized to $[-1, 1]$. Considering that this module is designed for localization, we only adopt scaling and translation in our case. Hence Eq. 1 can be rewritten as

$$\begin{pmatrix} x^a \\ y^a \end{pmatrix} = \begin{pmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{pmatrix} \begin{pmatrix} x^r \\ y^r \\ 1 \end{pmatrix}, \tag{2}$$

where $s_x$ and $s_y$ indicate scaling factors along the x- and y-axis, respectively, and $t_x$ and $t_y$ represent how feature maps should be translated along both axes. Notably, since different objects are distributed differently in aerial images, $M_l$ is learned for each object $l$ individually. In other words, extracted attentional regions are label-specific and capable of improving the effectiveness of label-wise features.
As to the implementation of this module, we first vectorize $X_l$ with a flatten function and then employ a localization layer (e.g., a fully connected layer) to estimate the elements of $M_l$ from the vectorized $X_l$. Afterwards, the attentional region grid coordinates $(x^a, y^a)$ can be computed from $(x^r, y^r)$ with Eq. 2, and the values of pixels at $(x^a, y^a)$ can be obtained from neighboring pixels by bilinear interpolation. Finally, the attentional region grid $G^a$ is re-coordinated to a regular spatial grid, which shares an identical structure with $G^r$, for yielding the final attentional feature parcel $A_l$.
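The sampling steps described above can be sketched in NumPy as follows. This is a simplified illustration rather than the implementation used in the experiments: coordinates are assumed to be normalized to [-1, 1], samples falling outside the parcel are clipped to the border (a choice made here), and the function names `attentional_grid` and `bilinear_sample` are hypothetical. The localization layer that predicts the scaling and translation parameters is omitted; they are passed in directly.

```python
import numpy as np

def attentional_grid(height, width, sx, sy, tx, ty):
    """Map a regular grid in [-1, 1] x [-1, 1] through scaling and translation."""
    xs = np.linspace(-1.0, 1.0, width)
    ys = np.linspace(-1.0, 1.0, height)
    x_r, y_r = np.meshgrid(xs, ys)             # regular grid, shape (H, W)
    return sx * x_r + tx, sy * y_r + ty        # attentional region grid

def bilinear_sample(parcel, x_a, y_a):
    """Sample parcel (H, W, K) at normalized coordinates by bilinear interpolation."""
    h, w, _ = parcel.shape
    # Convert normalized coordinates to pixel indices and clip to the parcel.
    px = np.clip((x_a + 1.0) * 0.5 * (w - 1), 0, w - 1)
    py = np.clip((y_a + 1.0) * 0.5 * (h - 1), 0, h - 1)
    x0 = np.floor(px).astype(int)
    y0 = np.floor(py).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (px - x0)[..., None]
    wy = (py - y0)[..., None]
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - wx) * parcel[y0, x0] + wx * parcel[y0, x1]
    bottom = (1 - wx) * parcel[y1, x0] + wx * parcel[y1, x1]
    return (1 - wy) * top + wy * bottom        # (H, W, K)

parcel = np.arange(4 * 4, dtype=float).reshape(4, 4, 1)
x_a, y_a = attentional_grid(4, 4, sx=1.0, sy=1.0, tx=0.0, ty=0.0)
identity = bilinear_sample(parcel, x_a, y_a)   # reproduces the input parcel
x_a, y_a = attentional_grid(4, 4, sx=0.5, sy=0.5, tx=0.0, ty=0.0)
zoomed = bilinear_sample(parcel, x_a, y_a)     # central region, same output size
```

With scaling factors below 1 the grid shrinks toward a sub-region, and the output keeps the size of the input parcel, which is the re-coordination step described in the text.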
II-D Label Relational Inference Module
Being the core of our model, the label relational inference module is designed to fully exploit label interrelations for inferring the existence of all labels. Before diving into this module, we define the pairwise label relation as a composite function with the following equation:

$$\mathrm{LR}(A_l, A_m) = f_{\phi}\big(g_{\theta_{lm}}(A_l, A_m)\big), \tag{3}$$

where the input is a pair of attentional feature parcels, $A_l$ and $A_m$, and $l$ and $m$ range from $1$ to $L$. The functions $g_{\theta_{lm}}$ and $f_{\phi}$ are used to reason about the pairwise relation between labels $l$ and $m$. More specifically, the role of $g_{\theta_{lm}}$ is to reason about whether there exist relations between the two objects and how they are related. In previous works [42, 45], a multilayer perceptron (MLP) is commonly employed as $g_{\theta_{lm}}$ for its simplicity. However, spatial contextual semantics are not taken into account in this way. To address this issue, we make use of convolution instead of an MLP to explore spatial information. Furthermore, $f_{\phi}$ is applied to encode the output of $g_{\theta_{lm}}$ into the final pairwise label relation $\mathrm{LR}(A_l, A_m)$. In our case, $f_{\phi}$ consists of a global average pooling layer and an MLP, which finally yields the relation between labels $l$ and $m$.
Following the motivation of our work, we infer each label by accumulating all related pairwise label relations, and the accumulated label relation for object $l$ is defined as:

$$\mathrm{LR}(A_l, A_*) = f_{\phi}\Big(\sum_{m \neq l} g_{\theta_{lm}}(A_l, A_m)\Big), \tag{4}$$

where $A_*$ represents all attentional feature parcels except $A_l$. Based on this formula, we implement the label relational inference module with the following steps (taking the prediction of label $l$ as an example): 1) $A_l$ and every other attentional feature parcel $A_m$ are concatenated and fed into a convolutional layer, respectively. 2) Afterwards, a global average pooling layer is employed to transform the convolutional outputs into vectors, which are then added element-wise. 3) Finally, the output is fed into an MLP layer with trainable parameters $\phi$ to produce the accumulated label relation $\mathrm{LR}(A_l, A_*)$. Since we expect the model to predict probabilities, an activation function $\sigma$ is utilized to restrict each output digit to $[0, 1]$. For label $l$, a digit approaching $1$ implies a high probability of its presence, while one close to $0$ suggests its absence. Fig. 4 presents a visual illustration of the label relational inference module.
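The three steps above can be sketched in NumPy. The sketch makes several simplifying assumptions that are not taken from the paper: a 1x1 convolution with ReLU stands in for the pairwise convolution, a single linear layer stands in for the MLP, a sigmoid is used as the output activation, and all sizes and the function name `predict_label` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(l, parcels, g_weights, f_weights, f_bias):
    """Accumulate pairwise relations for label l and map them to a probability.

    parcels:   (L, H, W, K) attentional feature parcels.
    g_weights: dict mapping (l, m) to a (2K, D) kernel; one kernel per label
               pair, applied as a 1x1 convolution (simplifying assumption).
    f_weights: (D,) weights of a one-layer stand-in for the MLP.
    f_bias:    scalar bias of that layer.
    """
    accumulated = 0.0
    for m in range(parcels.shape[0]):
        if m == l:
            continue                                   # skip the self-pair
        # Step 1: concatenate the pair along channels and convolve.
        pair = np.concatenate([parcels[l], parcels[m]], axis=-1)   # (H, W, 2K)
        relation_map = np.maximum(pair @ g_weights[(l, m)], 0.0)   # (H, W, D)
        # Step 2: global average pooling, then element-wise accumulation.
        accumulated = accumulated + relation_map.mean(axis=(0, 1))
    # Step 3: map the accumulated relation to a probability.
    return sigmoid(accumulated @ f_weights + f_bias)

rng = np.random.default_rng(0)
L, H, W, K, D = 4, 6, 6, 3, 8
parcels = rng.normal(size=(L, H, W, K))
g_weights = {(l, m): rng.normal(size=(2 * K, D))
             for l in range(L) for m in range(L) if l != m}
f_weights, f_bias = rng.normal(size=D), 0.0
probs = np.array([predict_label(l, parcels, g_weights, f_weights, f_bias)
                  for l in range(L)])
```

Because global average pooling is linear, summing the pooled vectors and pooling the summed maps give the same result, so the step order here agrees with the accumulated relation formula.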
Compared to other multi-label classification methods, our model has the following benefits:
The module can inherently reason about label relations, as indicated by Eq. 3, and requires no particular prior knowledge about relations among objects. That is to say, our network does not need to be told how to compute label relations or which object relations should be considered. All relations are learned automatically in a data-driven way and are shown to agree with reality in our experiments.
The function $g_{\theta_{lm}}$ is learned for each object pair $l$ and $m$ separately, which means that each pairwise label relation is encoded in its own specific way. Besides, our convolutional implementation of $g_{\theta_{lm}}$ extends the applicability of relational reasoning compared to using an MLP.
III Experiments and Discussion
In this section, we conduct experiments on the UCM and the proposed AID multi-label datasets for evaluating our model. Specifically, Section III-A presents a description of these two datasets. Afterwards, we introduce training strategies and thoroughly discuss experimental results in the subsequent subsections.
III-A Dataset Introduction
III-A1 UCM multi-label dataset
The UCM multi-label dataset is produced by assigning newly defined object labels to all aerial images collected in the UCM dataset. The number of candidate object labels is 17: airplane, building, sand, dock, court, tree, sea, bare soil, mobile home, ship, field, tank, water, grass, pavement, chaparral, and car. It is worth noting that some labels, such as tank, airplane, and building, exist in both the UCM dataset and the UCM multi-label dataset, but at different levels. In the former, such terms are considered scene-level labels because related images can be characterized and depicted by them, while in the latter, they denote objects that may be present in aerial images.
As to the properties of images in this dataset, the spatial resolution of each sample is one foot, and the size is 256 × 256 pixels. All images are manually cropped from aerial imagery contributed by the National Map of the U.S. Geological Survey (USGS), and there are 2100 images in total. For each object category, the number of images is listed in Table I. Besides, 80% of the image samples per scene class are selected to train our model, and the other 20% are used as test samples. The numbers of images assigned to the training and test sets with respect to all object labels are available in Table I as well. Some visual examples are shown in Fig. 5.
Table I: Number of images per object category in the UCM multi-label dataset. Columns: Category No., Category Name, Training, Test, Total.
III-A2 AID multi-label dataset
In order to further evaluate our network and promote progress in the area of multi-label classification of high-resolution aerial images, we produce a new dataset, named the AID multi-label dataset, based on the widely used AID scene classification dataset. The AID dataset consists of 10000 high-resolution aerial images collected from worldwide Google Earth imagery, including scenes from China, the United States, England, France, Italy, Japan, and Germany. In contrast to the UCM dataset, spatial resolutions of images in the AID dataset vary from 0.5 m/pixel to 8 m/pixel, and the size of each aerial image is 600 × 600 pixels. Besides, the number of images in each scene category ranges from 220 to 420. Overall, the AID dataset is more challenging than the UCM dataset.
Here, we manually relabel some images in the AID dataset. With extensive human visual inspection, 3000 aerial images from 30 scenes in the AID dataset are selected and assigned multiple object labels, and the distribution of samples in each category is shown in Table II. Besides, 80% of all images are taken as training samples, while the rest are used for testing our model. Several example images are shown in Fig. 6.
III-B Training Details
Table II: Number of images per object category in the AID multi-label dataset. Columns: Category No., Category Name, Training, Test, Total.
As to the initialization of our network, different modules are initialized in different ways. For the label-wise feature parcel learning module, we initialize the backbone with a pre-trained ImageNet model and the weights in other convolutional layers with a Glorot uniform initializer. Regarding the attentional region extraction module, we initialize the transformation matrix in Eq. 1 as an identity transformation,

$$M_l = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}.$$

In the label relational inference module, all weights are initialized with the same strategy as that in the first module. Notably, weights in the backbone are trainable during the training phase.
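One common way to obtain an identity transformation at the start of training, used here purely as an illustrative assumption since the text does not state how the localization layer is set up, is to zero the weights of the localization layer and place the identity parameters in its bias. The function names and the input dimension of 32 are hypothetical.

```python
import numpy as np

def init_localization_layer(input_dim):
    """Zero weights plus a bias of (s_x, s_y, t_x, t_y) = (1, 1, 0, 0)."""
    weights = np.zeros((input_dim, 4))
    bias = np.array([1.0, 1.0, 0.0, 0.0])
    return weights, bias

def to_matrix(params):
    """Arrange scaling and translation parameters as a 2x3 transform."""
    sx, sy, tx, ty = params
    return np.array([[sx, 0.0, tx],
                     [0.0, sy, ty]])

weights, bias = init_localization_layer(input_dim=32)
flat_parcel = np.random.default_rng(0).normal(size=32)   # any vectorized parcel
m = to_matrix(flat_parcel @ weights + bias)
print(m)  # identity scaling, zero translation, whatever the input is
```

With this initialization every attentional region initially covers the whole feature parcel, and the regions only tighten as the localization weights are trained.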
In our case, multiple labels are encoded into multi-hot binary sequences instead of the one-hot vectors widely used in single-label classification tasks. The length of such a multi-hot binary sequence is identical to the total number of object categories, i.e., 17 in our case, and for each digit, 0 suggests an absent object, while 1 indicates the presence of its corresponding object label. Accordingly, we define the network loss as the binary cross entropy. Besides, Adam with Nesterov momentum, which shows faster convergence than stochastic gradient descent (SGD) for our task, is selected, and its parameters ($\beta_1$, $\beta_2$, and $\epsilon$) are set to the recommended values. The learning rate is decayed from its initial value by a factor of 0.1 if the validation loss fails to decrease.
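The label encoding and the loss can be illustrated with a short NumPy sketch. The label indices and probabilities below are made-up examples, the clipping constant `eps` is a numerical safeguard added here, and the loss is averaged over labels, which is one common convention.

```python
import numpy as np

def multi_hot(label_indices, num_labels):
    """Encode the set of present labels as a 0/1 vector of length num_labels."""
    target = np.zeros(num_labels)
    target[list(label_indices)] = 1.0
    return target

def binary_cross_entropy(target, prob, eps=1e-7):
    """Mean binary cross entropy over all labels of one image."""
    prob = np.clip(prob, eps, 1.0 - eps)       # avoid log(0)
    losses = target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)
    return float(-np.mean(losses))

target = multi_hot([0, 4, 13], num_labels=17)  # three hypothetical present labels
uninformed = binary_cross_entropy(target, np.full(17, 0.5))
confident = binary_cross_entropy(target, np.where(target == 1, 0.9, 0.1))
print(uninformed, confident)  # ln 2 (about 0.693) and -ln 0.9 (about 0.105)
```

Unlike a softmax with categorical cross entropy, each digit is scored independently, so any number of labels can be present at once.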
Our model is implemented in TensorFlow 1.12.0 and trained for 100 epochs. The computational resource is an NVIDIA Tesla P100 GPU with 16 GB of memory. As a compromise between training speed and GPU memory capacity, we set the training batch size to 32. To avoid overfitting, the training process is terminated once the validation loss increases for five consecutive epochs.
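The interplay of learning rate decay and early termination can be sketched in plain Python. This is one possible reading of the rules stated above, not the training code itself: "fails to decrease" is interpreted as not improving on the best validation loss so far, the decay is applied at every such epoch, and the initial learning rate of 1.0 as well as the loss values are made-up numbers.

```python
def run_schedule(val_losses, initial_lr, decay=0.1, patience=5):
    """Replay validation losses; return (stop_epoch, final_lr).

    stop_epoch is None if training is never terminated early.
    """
    lr = initial_lr
    best = float("inf")
    rising = 0            # number of consecutive epochs with increasing loss
    previous = None
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
        else:
            lr *= decay   # validation loss failed to decrease
        if previous is not None and loss > previous:
            rising += 1
        else:
            rising = 0
        if rising >= patience:
            return epoch, lr
        previous = loss
    return None, lr

losses = [1.0, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 0.5]
stop_epoch, final_lr = run_schedule(losses, initial_lr=1.0)
print(stop_epoch, final_lr)  # stops at epoch index 6, before the last loss is seen
```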
III-C Results on the UCM Multi-label Dataset
To validate the effectiveness of the proposed attention-aware label relational reasoning network (AL-RN-CNN), we compare it with the following competitors: a standard CNN, CNN-RBFNN, and CA-CNN-BiLSTM. Taking into account that the CNN is designed to perform single-label classification, we replace its last softmax layer with a sigmoid layer to produce multi-hot sequences. For all models, output sequences are binarized with a threshold of 0.5 to generate final predictions.
III-C1 Quantitative analysis
We use $F_1$ and $F_2$ scores as evaluation metrics to quantitatively assess the performance of different models. Specifically, these two $F$ scores are calculated with the following equation:

$$F_{\beta} = (1 + \beta^2)\,\frac{p_e\, r_e}{\beta^2 p_e + r_e}, \quad \beta = 1, 2,$$

where $p_e$ and $r_e$ indicate the example-based precision and recall of predictions. The formulas for calculating $p_e$ and $r_e$ are:

$$p_e = \frac{TP_e}{TP_e + FP_e}, \quad r_e = \frac{TP_e}{TP_e + FN_e},$$

where $TP_e$ (example-based true positive) indicates the number of correctly predicted positive labels in an example, while $FP_e$ (example-based false positive) denotes the number of labels that are predicted as positive but are actually absent. Besides, $FN_e$ (example-based false negative) represents the number of present labels that are incorrectly predicted as negative. Here, an example stands for an aerial image and its associated multiple labels.
To evaluate our network comprehensively, we take the mean $F_1$ and $F_2$ scores as principal indexes. Moreover, we also report the mean $p_e$ and mean $r_e$. In addition to the example-based perspective, label-based precision and recall are also considered and calculated with:

$$p_l = \frac{TP_l}{TP_l + FP_l}, \quad r_l = \frac{TP_l}{TP_l + FN_l},$$

where $TP_l$, $FP_l$, and $FN_l$ are counted over all examples for each object label, to demonstrate the performance of networks from the perspective of each object label.
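The difference between the example-based and label-based views is only the axis along which true positives, false positives, and false negatives are counted. The following NumPy sketch illustrates this on a toy prediction matrix. It assumes that F scores are computed per example and then averaged, that a ratio with a zero denominator is set to 0, and that a probability strictly above 0.5 counts as positive; these conventions and the function names are assumptions made for the illustration.

```python
import numpy as np

def safe_div(num, den):
    """Element-wise num / den, returning 0 where den is 0 (assumed convention)."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def f_beta(precision, recall, beta):
    return safe_div((1 + beta ** 2) * precision * recall,
                    beta ** 2 * precision + recall)

def multilabel_scores(y_true, y_prob, threshold=0.5):
    """y_true, y_prob: arrays of shape (num_examples, num_labels)."""
    y_pred = (y_prob > threshold).astype(int)
    tp = y_true * y_pred
    fp = (1 - y_true) * y_pred
    fn = y_true * (1 - y_pred)
    # Example-based: count over labels (axis 1), one value per image.
    p_e = safe_div(tp.sum(1), tp.sum(1) + fp.sum(1))
    r_e = safe_div(tp.sum(1), tp.sum(1) + fn.sum(1))
    # Label-based: count over images (axis 0), one value per label.
    p_l = safe_div(tp.sum(0), tp.sum(0) + fp.sum(0))
    r_l = safe_div(tp.sum(0), tp.sum(0) + fn.sum(0))
    return {
        "mean_f1": float(f_beta(p_e, r_e, 1).mean()),
        "mean_f2": float(f_beta(p_e, r_e, 2).mean()),
        "mean_p_e": float(p_e.mean()), "mean_r_e": float(r_e.mean()),
        "mean_p_l": float(p_l.mean()), "mean_r_l": float(r_l.mean()),
    }

y_true = np.array([[1, 0, 1], [0, 1, 1]])
y_prob = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.7]])
scores = multilabel_scores(y_true, y_prob)
```

In this toy case the first image misses one of its two labels, so its recall is 0.5; since $F_2$ weights recall more heavily than $F_1$, the mean $F_2$ score (7/9) is lower than the mean $F_1$ score (5/6).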
Table III exhibits experimental results on the UCM multi-label dataset. We can observe that our model surpasses all competitors on the UCM multi-label dataset with various backbones. Specifically, AL-RN-VGGNet increases the mean $F_1$ and $F_2$ scores by 7.16% and 5.64%, respectively, in comparison with VGGNet. Compared to CA-VGG-BiLSTM, which employs a bidirectional LSTM structure for exploring label dependencies, our network obtains an improvement of 5.92% in the mean score. Besides, although CA-VGG-BiLSTM is superior to VGGNet in both mean $F_1$ and $F_2$ scores, it achieves decreased mean precisions and recalls. In contrast, AL-RN-VGGNet outperforms VGGNet not only in mean $F_1$ and $F_2$ scores but also in mean example- and label-based precisions and recalls. For another backbone, GoogLeNet, our network gains the best mean $F_1$ and $F_2$ scores. As shown in Table III, AL-RN-GoogLeNet increases the mean score by 4.56% and 3.42% with respect to GoogLeNet and CA-GoogLeNet-BiLSTM, respectively. For the mean score and precisions, our model also surpasses other competitors, which supports the effectiveness and robustness of our method. AL-RN-ResNet achieves the best mean $F_1$ score, 0.8676, and $F_2$ score, 0.8667, in comparison with all other models. Furthermore, it obtains the best mean example-based precision, 0.8881, label-based precision, 0.9233, and recall, 0.8595. To summarize, comparisons between AL-RN-CNN and other models demonstrate the effectiveness of our network. Furthermore, comparisons between AL-RN-CNN and CA-CNN-BiLSTM suggest that explicitly modeling label relations works better than the implicit modeling of LSTM-based structures. Table IV presents several example predictions from the UCM multi-label dataset.
III-C2 Qualitative analysis
In order to understand what our network learns, we further visualize features learned from each module and validate the effectiveness of the proposed network in a qualitative manner. In Fig. 7, several feature parcels regarding bare soil, building, car, pavement, court, and tank are displayed for several example images. Note that for the feature maps in each feature parcel, we select the most strongly activated one as the representative. We can observe that discriminative regions related to positive labels are highlighted in these feature maps, while less informative regions are weakly activated. As an exception, the feature map at the bottom left of Fig. 7 shows that the baseball field is misidentified as tanks, which may lead to incorrect predictions.
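To evaluate the localization ability of the proposed network, we visualize attentional regions learned from the second module. Coordinates of the bottom left (BL) and top right (TR) corners of attentional region grids are calculated with the following equation:

$$(x_{BL}, y_{BL}) = (-s_x + t_x,\; -s_y + t_y), \quad (x_{TR}, y_{TR}) = (s_x + t_x,\; s_y + t_y),$$

where $s_x$, $s_y$ and $t_x$, $t_y$ are the learned scaling and translation parameters of Eq. 2, and coordinates are normalized to $[-1, 1]$.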
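The corner computation amounts to pushing the two extreme points of the normalized regular grid through the scaling-and-translation transform. A minimal sketch follows, assuming coordinates normalized to [-1, 1] with the bottom left corner of the regular grid at (-1, -1); the axis orientation, the function name, and the parameter values are assumptions for illustration.

```python
def attentional_region_corners(sx, sy, tx, ty):
    """Corners of the attentional region in normalized coordinates.

    The regular grid spans [-1, 1] on both axes, so its bottom left corner
    (-1, -1) and top right corner (1, 1) are mapped by scaling and translation.
    """
    bottom_left = (-sx + tx, -sy + ty)
    top_right = (sx + tx, sy + ty)
    return bottom_left, top_right

# A region half as wide and as high as the parcel, shifted to the right.
bl, tr = attentional_region_corners(sx=0.5, sy=0.5, tx=0.25, ty=0.0)
print(bl, tr)  # (-0.25, -0.5) (0.75, 0.5)
```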
Fig. 8 shows some examples of learned attentional regions. As we can see, most attentional regions concentrate on areas covering objects of interest. Besides, it is noteworthy that even when objects are distributed dispersedly, the learned attentional regions can still cover most of them, e.g., the buildings and cars in the examples of Fig. 8.
Furthermore, learned pairwise label relations are visualized in the form of a matrix, where the element at $(l, m)$ indicates $\mathrm{LR}(A_l, A_m)$. Fig. 9 exhibits some examples for the four scenes in Fig. 8. In these examples, we take only positive object labels into consideration and perform normalization along each row to yield a distinct visualization of “label relations”. Since $l$ differs from $m$, we assign null values to diagonal elements and mark them in white in Fig. 9. It can be seen in Fig. 9 that, in some examples, relations between car and pavement contribute significantly to predicting the presence of both car and pavement. Besides, one example shows that the existence of tree strongly suggests the presence of bare soil, but not vice versa. These observations illustrate that even without prior knowledge, the proposed network can reason about relations that are in line with reality.
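The preparation of such a relation matrix for display can be sketched as follows. The type of row normalization is not specified in the text, so min-max scaling is used here as an assumption; the function name and the example values are hypothetical.

```python
import numpy as np

def normalize_relations(relations):
    """Row-normalize a pairwise relation matrix, leaving the diagonal empty."""
    r = np.asarray(relations, dtype=float).copy()
    np.fill_diagonal(r, np.nan)                        # no self-relation
    row_min = np.nanmin(r, axis=1, keepdims=True)
    row_max = np.nanmax(r, axis=1, keepdims=True)
    # Guard against rows whose off-diagonal entries are all equal.
    span = np.where(row_max > row_min, row_max - row_min, 1.0)
    return (r - row_min) / span

raw = [[0.0, 2.0, 4.0],
       [1.0, 0.0, 3.0],
       [5.0, 1.0, 0.0]]
normalized = normalize_relations(raw)
# Each row now ranges from 0 to 1 over its off-diagonal entries, and the
# matrix need not be symmetric: entry (l, m) generally differs from (m, l).
```

Normalizing per row makes the relations that matter most for predicting one label comparable within that row, which is why asymmetric patterns such as "tree suggests bare soil, but not vice versa" remain visible.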
III-D Results on the AID Multi-label Dataset
III-D1 Quantitative analysis
To further evaluate the proposed network, we report experimental results on the AID multi-label dataset. The evaluation metrics here are the same as those in the previous experiments, and results are presented in Table V. As we can observe, the proposed AL-RN-CNN outperforms all competitors in most of the metrics. To be more specific, AL-RN-VGGNet improves the mean $F_1$ and $F_2$ scores by 2.57% and 2.71%, respectively, compared to the baseline model. In comparison with CA-VGG-BiLSTM, our network gains an improvement of 1.41% in the mean $F_1$ score and 1.43% in the mean $F_2$ score. Regarding the other two backbones, similar phenomena can be observed as well. AL-RN-GoogLeNet achieves the highest mean $F_1$ and $F_2$ scores, 0.8817 and 0.8825, compared to GoogLeNet and CA-GoogLeNet-BiLSTM, while AL-RN-ResNet surpasses the second best model by 1.09% and 0.51% in the mean $F_1$ and $F_2$ scores, respectively. Besides, it is noteworthy that although CA-GoogLeNet-BiLSTM shows a decreased performance compared to the baseline model, our network still achieves higher scores in all metrics. Moreover, we notice that the proposed AL-RN-CNNs outperform baseline CNNs by a large margin in the mean label-based recall, with a maximum improvement of 18.30%. In conclusion, these comparisons suggest that explicitly modeling label relations can improve the robustness and retrieval ability of a network. Several example predictions on the AID multi-label dataset are presented in Table IV.
III-D2 Qualitative analysis
To dive deep into the model, we visualize label-specific features and attentional regions in Fig. 10 and 11, respectively. In Fig. 10, representative feature maps in various feature parcels for bare soil, building, car, pavement, tree, and water are displayed. As shown here, regions with label-related semantics are highlighted, while less informative regions present weak activations. For instance, regions of ponds are considered as discriminative regions for identifying water. Residential and industrial areas are strongly activated in feature maps for recognizing building. In Fig. 11, it can be observed that attentional regions learned from our network are able to capture areas of semantic objects, such as cars and trees. We also note that some attentional regions in Fig. 11 are coarser than those in Fig. 8, which is because the AID multi-label dataset has a lower spatial resolution.
Furthermore, pairwise relations among positive labels are visualized in Fig. 12. As shown in several examples of Fig. 12, the existence of both tree and pavement contributes significantly to the identification of car, while the occurrence of car only suggests a high probability that pavement is present. Strong pairwise relations between building and other labels, e.g., car, pavement, and tree, indicate that the presence of building can heavily assist in predicting those labels.
III-E Discussion on the Relational Inference Module
Regarding the relational inference module, the function $g_{\theta_{lm}}$ is an important component, which reasons about relations between two objects. Hence, in this subsection, we discuss different implementations of $g_{\theta_{lm}}$. Specifically, we compare our AL-RN-CNN with LR-CNN, which employs a global average pooling layer and an MLP as $g_{\theta_{lm}}$, on both the UCM and AID multi-label datasets. Experimental results are reported in Table VI. As shown in this table, our network gains the best mean $F_1$ and $F_2$ scores on both datasets with various backbones. AL-RN-VGGNet achieves the highest improvements of 3.59% and 3.82% for the mean $F_1$ and $F_2$ scores, respectively, compared to LR-VGGNet on the UCM multi-label dataset. AL-RN-GoogLeNet increases the mean $F_1$ and $F_2$ scores by 3.25% and 1.28%, respectively, in comparison with LR-ResNet on the AID multi-label dataset. Moreover, AL-RN-CNN can encode label relations through various fields of view by simply changing the size of the convolutional filters in $g_{\theta_{lm}}$.
[Figure: mean scores achieved by VGGNet-, GoogLeNet-, and ResNet-based networks.]
In this work, we propose a novel aerial image multi-label classification network, namely the attention-aware label relational reasoning network. This network comprises three components: a label-wise feature parcel learning module, an attentional region extraction module, and a label relational inference module. To be more specific, the label-wise feature parcel learning module is designed to learn high-level feature parcels, which are shown to encompass label-relevant semantics, and the attentional region extraction module further generates finer attentional feature parcels by preserving only features located in discriminative regions. Afterwards, the label relational inference module reasons about pairwise relations among all labels and exploits these relations for the final prediction. In order to assess the performance of our network, experiments are conducted on the UCM multi-label dataset and a newly proposed AID multi-label dataset. In comparison with other deep learning methods, our network offers better classification results. In addition, we visualize extracted feature parcels, attentional regions, and relation matrices to demonstrate the effectiveness of each module in a qualitative way. Looking into the future, such a network architecture has several potential applications, e.g., weakly supervised object detection and semantic segmentation.
-  D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu, and U. Stilla, “Classification with an edge: Improving semantic image segmentation with boundary detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 135, no. January, pp. 158–172, 2018.
-  N. Audebert, B. L. Saux, and S. Lefèvre, “Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 140, no. June, pp. 20–32, 2018.
-  D. Marcos, M. Volpi, B. Kellenberger, and D. Tuia, “Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models,” ISPRS Journal of Photogrammetry and Remote Sensing, DOI:10.1016/j.isprsjprs.2018.01.021.
-  L. Mou and X. X. Zhu, “RiFCN: Recurrent network in fully convolutional network for semantic segmentation of high resolution remote sensing images,” arXiv:1805.02091, 2018.
-  ——, “Vehicle instance segmentation from aerial image and video using a multi-task learning residual fully convolutional network,” IEEE Transactions on Geoscience and Remote Sensing, in press.
-  ——, “Spatiotemporal scene interpretation of space videos via deep neural network and tracklet analysis,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2016.
-  Q. Li, L. Mou, Q. Liu, Y. Wang, and X. X. Zhu, “HSF-Net: Multi-scale deep feature embedding for ship detection in optical remote sensing imagery,” IEEE Transactions on Geoscience and Remote Sensing, 2017.
-  S. Lucchesi, M. Giardino, and L. Perotti, “Applications of high-resolution images and DTMs for detailed geomorphological analysis of mountain and plain areas of NW Italy,” European Journal of Remote Sensing, vol. 46, no. 1, pp. 216–233, 2013.
-  L. Mou and X. X. Zhu, “IM2HEIGHT: Height estimation from single monocular imagery via fully residual convolutional-deconvolutional network,” arXiv:1802.10249, 2018.
-  Q. Weng, Z. Mao, J. Lin, and X. Liao, “Land-use scene classification based on a CNN using a constrained extreme learning machine,” International Journal of Remote Sensing, vol. 0, no. 0, pp. 1–19, 2018.
-  G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classification: Benchmark and state of the art,” Proceedings of the IEEE, vol. 105, no. 10, pp. 1865–1883, 2017.
-  P. Zarco-Tejada, R. Diaz-Varela, V. Angileri, and P. Loudjani, “Tree height quantification using very high resolution imagery acquired from an unmanned aerial vehicle (UAV) and automatic 3D photo-reconstruction methods,” European Journal of Agronomy, vol. 55, pp. 89–99, 2014.
-  D. Wen, X. Huang, H. Liu, W. Liao, and L. Zhang, “Semantic classification of urban trees using very high resolution satellite imagery,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 4, pp. 1413–1424, 2017.
-  K. Nogueira, O. Penatti, and J. dos Santos, “Towards better exploiting convolutional neural networks for remote sensing scene classification,” Pattern Recognition, vol. 61, pp. 539–556, 2017.
-  X. X. Zhu, D. Tuia, L. Mou, G. Xia, L. Zhang, F. Xu, and F. Fraundorfer, “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8–36, 2017.
-  B. Demir and L. Bruzzone, “Histogram-based attribute profiles for classification of very high resolution remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 4, pp. 2096–2107, 2016.
-  F. Hu, G. Xia, J. Hu, and L. Zhang, “Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery,” Remote Sensing, vol. 7, no. 11, pp. 14680–14707, 2015.
-  F. Hu, G. Xia, W. Yang, and L. Zhang, “Recent advances and opportunities in scene classification of aerial images with deep models,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2018.
-  F. Zhang, B. Du, and L. Zhang, “Saliency-guided unsupervised feature learning for scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2175–2184, 2015.
-  X. Huang, H. Chen, and J. Gong, “Angular difference feature extraction for urban scene classification using ZY-3 multi-angle high-resolution satellite imagery,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 135, pp. 127–141, 2018.
-  L. Mou, X. Zhu, M. Vakalopoulou, K. Karantzalos, N. Paragios, B. L. Saux, G. Moser, and D. Tuia, “Multitemporal very high resolution from space: Outcome of the 2016 IEEE GRSS data fusion contest,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 8, pp. 3435–3447, 2017.
-  Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in International Conference on Advances in Geographic Information Systems (SIGSPATIAL), 2010.
-  G. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, “AID: A benchmark data set for performance evaluation of aerial scene classification,” IEEE Transactions on Geoscience and Remote Sensing, 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv:1511.00561, 2015.
-  P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001.
-  T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Object detection networks on convolutional feature maps,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1476–1481, 2017.
-  K. Karalas, G. Tsagkatakis, M. Zervakis, and P. Tsakalides, “Land classification using remotely sensed data: Going multilabel,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 6, pp. 3548–3563, 2016.
-  A. Zeggada, F. Melgani, and Y. Bazi, “A deep learning approach to UAV image multilabeling,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5, pp. 694–698, 2017.
-  S. Koda, A. Zeggada, F. Melgani, and R. Nishii, “Spatial and structured SVM for multilabel image classification,” IEEE Transactions on Geoscience and Remote Sensing, pp. 1–13, 2018.
-  A. Zeggada, S. Benbraika, F. Melgani, and Z. Mokhtari, “Multilabel conditional random field classification for UAV images,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 3, pp. 399–403, 2018.
-  Y. Hua, L. Mou, and X. X. Zhu, “Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 149, pp. 188–199, 2019.
-  W. Shao, W. Yang, G. Xia, and G. Liu, “A hierarchical scheme of multiple feature fusion for high-resolution satellite scene categorization,” in International Conference on Computer Vision Systems, 2013.
-  V. Risojevic and Z. Babic, “Fusion of global and local descriptors for remote sensing image classification,” IEEE Geoscience and Remote Sensing Letters, vol. 10, no. 4, pp. 836–840, 2013.
-  D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
-  Q. Zhu, Y. Zhong, B. Zhao, G. S. Xia, and L. Zhang, “Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 6, pp. 747–751, 2016.
-  B. Teshome Zegeye and B. Demir, “A novel active learning technique for multi-label remote sensing image scene classification,” in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, 2018.
-  R. Stivaktakis, G. Tsagkatakis, and P. Tsakalides, “Deep learning for multilabel land cover scene categorization using data augmentation,” IEEE Geoscience and Remote Sensing Letters, 2019.
-  G. Sumbul and B. Demir, “A CNN-RNN framework with a novel patch-based multi-attention mechanism for multi-label image classification in remote sensing,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2019.
-  A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, “A simple neural network module for relational reasoning,” in Advances in Neural Information Processing Systems (NIPS), 2017.
-  H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation networks for object detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  L. Wang, W. Li, W. Li, and L. Van Gool, “Appearance-and-relation networks for video classification,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  B. Zhou, A. Andonian, and A. Torralba, “Temporal relational reasoning in videos,” in European Conference on Computer Vision (ECCV), 2018.
-  L. Mou, Y. Hua, and X. X. Zhu, “Relation matters: Relational context-aware fully convolutional network for semantic segmentation of high resolution aerial images,” arXiv:1409.1556, 2019.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems (NIPS), 2015.
-  B. Chaudhuri, B. Demir, S. Chaudhuri, and L. Bruzzone, “Multilabel remote sensing image retrieval using a semisupervised graph-theoretic method,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 2, pp. 1144–1158, 2018.
-  F. Hu, G. Xia, W. Yang, and L. Zhang, “Mining deep semantic representations for scene classification of high-resolution remote sensing imagery,” IEEE Transactions on Big Data, pp. 1–1, 2019.
-  L. Zhang, L. Zhang, and B. Du, “Deep learning for remote sensing data: A technical tutorial on the state of the art,” IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016.
-  S. Srivastava, J. E. Vargas-Muñoz, and D. Tuia, “Understanding urban landuse from the above and ground perspectives: A deep learning, multimodal solution,” Remote Sensing of Environment, vol. 228, pp. 129–143, 2019.
-  W. Huang, Q. Wang, and X. Li, “Feature sparsity in convolutional neural networks for scene classification of remote sensing image,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2019.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
-  T. Dozat, “Incorporating Nesterov momentum into Adam,” http://cs229.stanford.edu/proj2015/054_report.pdf, online.
-  X. Wu and Z. Zhou, “A unified view of multi-label performance measures,” arXiv:1609.00288, 2016.
-  “Planet: Understanding the Amazon from space,” https://www.kaggle.com/c/planet-understanding-the-amazon-from-space#evaluation, online.
-  G. Tsoumakas and I. Vlahavas, “Random k-labelsets: An ensemble method for multilabel classification,” in European Conference on Machine Learning, 2007.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
-  Y. Hua, L. Mou, and X. X. Zhu, “Label relation inference for multi-label aerial image classification,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2019.