With the rapid development of remote sensing techniques in recent years, a huge volume of high resolution aerial imagery is now accessible and benefits a wide range of real-world applications, such as urban mapping Marmanis17 ; beyongrgb ; Marcos18 ; mou2018rifcn , ecological monitoring zarco2014tree ; wen2017semantic , geomorphological analysis Mou18 ; lucchesi2013applications ; weng2018land ; cheng2017remote , and traffic management mou2018vehicle ; 7729468 ; hsf . As a fundamental bridge between aerial images and these applications, image classification, which aims at categorizing images into semantic classes, has attracted wide attention, and many studies have been conducted recently nogueira2017towards ; yang2010bag ; xia2017aid ; zhu2017deep ; 7358126 ; rs71114680 ; hu2018recent ; zhang2015saliency ; HUANG2018127 ; mou2017multitemporal . However, most existing studies assume that each image carries only one label (e.g., scene-level labels in Fig. 1), while in reality, an image is usually associated with multiple labels tan2017multi . Furthermore, knowing which objects are present offers a holistic understanding of an aerial image. With this intention, numerous studies on semantic segmentation mou2018rifcn ; ren2015faster ; long2015fully ; badrinarayanan2015segnet and object detection ren2015faster ; viola2001rapid ; lin2017feature ; ren2017object have emerged recently, but unfortunately, the acquisition of ground truths for these tasks (i.e., pixel-wise segmentation masks and bounding-box-level annotations) is extremely labor- and time-consuming. Compared to these expensive labels, image-level labels (cf. multiple object-level labels in Fig. 1) come at a fairly low cost and are readily accessible. To this end, multi-label classification, which aims at assigning multiple object labels to an aerial image, is emerging, and in this paper, we focus on exploring an efficient multi-label classification model.
1.1 The Challenges of Multi-label Classification
Benefiting from fast-growing remote sensing technology, large quantities of high resolution aerial images are available and widely used in many visual tasks. Along with these opportunities, challenges have inevitably arisen.
On one hand, it is difficult to extract high-level features from high resolution images. Owing to the complex spatial structure of such images, conventional hand-crafted features and mid-level semantic models yang2010bag ; shao2013hierarchical ; risojevic2013fusion ; lowe2004distinctive ; 7466064 capture holistic semantic features poorly, which leads to unsatisfactory classification performance.
On the other hand, underlying correlations between dependent labels need to be uncovered for an efficient prediction of multiple object labels. For example, the existence of ships implies a highly probable co-occurrence of the sea, while the presence of buildings is almost always accompanied by the coexistence of pavement. However, the recently proposed multi-label classification methods karalas2016land ; zeggada2017deep ; koda2018spatial ; zeggada2018multilabel assume that classes are independent and employ a set of binary classifiers karalas2016land or a regression model zeggada2017deep ; koda2018spatial ; zeggada2018multilabel to infer the existence of each class separately.
To summarize, a well-performing multi-label classification system must be able to learn holistic feature representations and to harness the implicit class dependency.
1.2 The Motivation of Our Work
As our survey of related work above shows, recent approaches make little effort to exploit the high-order class dependency, which constrains their performance in multi-label classification. Besides, direct utilization of CNNs pre-trained on natural image datasets zeggada2017deep ; koda2018spatial ; zeggada2018multilabel leads to a partial interpretation of aerial images due to their diverse visual patterns. Moreover, most state-of-the-art methods decompose multi-label classification into separate stages, which cuts off their inter-correlations and makes end-to-end training infeasible.
To tackle these problems, in this paper, we propose a novel end-to-end network architecture, class attention-based convolutional and bidirectional LSTM network (CA-Conv-BiLSTM), which integrates feature extraction and high-order class dependency exploitation together for multi-label classification. Contributions of our work to the literature are detailed as follows:
We regard the multi-label classification of aerial images as a structured output problem instead of a simple regression problem. In this manner, labels are predicted in an ordered procedure, and the prediction of each label is dependent on others. As a consequence, the implicit class relevance is taken into consideration, and structured outputs are more reasonable and closer to the real-world case as compared to regression outputs.
We propose an end-to-end trainable network architecture for multi-label classification, which consists of a feature extraction module (e.g., a modified network based on VGG-16), a class attention learning layer, and a bidirectional LSTM-based sub-network. These components are designed for extracting features from input images, learning discriminative class-specific features, and exploiting class dependencies, respectively. Besides, such design makes it feasible to train the network in an end-to-end fashion, which enhances the compactness of our model.
Considering that class dependencies differ between the two directions, a bidirectional analysis is required for modeling such correlations. Therefore, we employ a bidirectional LSTM-based network, instead of a one-way recurrent neural network, to uncover class relationships.
We build a new challenging dataset, DFC15 multi-label dataset, derived from a semantic segmentation dataset, GRSS_DFC_2015 (DFC15) dfc15 . The proposed dataset consists of aerial images at a spatial resolution of 5 cm and can be used to evaluate the performance of networks for multi-label classification.
The following sections further introduce and discuss our network. Specifically, Section 2 provides an intuitive illustration of the class dependency, and then details the structure of the proposed network in terms of its three fundamental components. Section 3 describes the setup of our experiments, and experimental results are discussed from quantitative and qualitative perspectives. Finally, the conclusion of this paper is drawn in Section 4.
2.1 An Observation
Existing studies zeggada2017deep ; koda2018spatial ; zeggada2018multilabel consider multi-label classification as a regression problem, where models are trained to fit a binary sequence, and each digit indicates the existence of its corresponding class. Unlike one-hot vectors, a binary sequence is allowed to contain more than one ’hot’ value to indicate the joint existence of multiple candidate classes in one image. Besides, some studies karalas2016land decompose multi-label classification into several single-label classification tasks and thus train a set of binary classifiers, one for each class. Notably, one common assumption of these studies is that classes are independent of each other, and classifiers predict the existence of each category independently.
However, this assumption is too strong and does not accord with reality. As illustrated in Fig. 1, although images obtained in diverse scenes are assigned multiple different labels, there are still common classes, e.g., car and pavement, coexisting in each image. This is because, in the real world, some classes are strongly correlated; for example, cars are often driven or parked on pavements. To further demonstrate the class dependency, we calculate conditional probabilities for each pair of categories. Let $C_r$ denote the referenced class and $C_p$ denote the potential co-occurrence class. The conditional probability $P(C_p|C_r)$, which depicts the possibility that $C_p$ appears in an image where the existence of $C_r$ is known a priori, can be computed with Eq. 1,
$$P(C_p|C_r) = \frac{P(C_p, C_r)}{P(C_r)}, \tag{1}$$
where $P(C_p, C_r)$ indicates the joint occurrence probability of $C_p$ and $C_r$, and $P(C_r)$ refers to the prior probability of $C_r$. Conditional probabilities of all class pairs in UCM multi-label dataset are listed in Fig. 2, and it is evident that some classes have strong dependencies. For instance, it is highly likely that there are pavements in images that contain airplanes, buildings, cars, or tanks. Moreover, it is notable that class dependencies are not symmetric due to their particular properties. For example, $P(\text{sea}|\text{ship})$ is twice $P(\text{ship}|\text{sea})$, because the occurrence of ships always implies the co-occurrence of sea, but not vice versa. Therefore, to thoroughly uncover the correlation among various classes, it is crucial to model class probabilistic dependencies bidirectionally in a classification method.
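The conditional co-occurrence probabilities described above can be estimated directly from image-level annotations. The following is a minimal sketch, assuming each image is represented by a set of label names; the function name and the toy annotations are hypothetical and are not taken from UCM multi-label dataset.

```python
def conditional_probabilities(label_sets, classes):
    """Estimate P(c_p | c_r) for every ordered class pair (c_p, c_r)."""
    count = {c: 0 for c in classes}
    joint = {(p, r): 0 for p in classes for r in classes}
    for labels in label_sets:
        for r in labels:
            count[r] += 1
            for p in labels:
                joint[(p, r)] += 1
    # P(c_p | c_r) = P(c_p, c_r) / P(c_r); the 1/N factors cancel out.
    return {
        (p, r): joint[(p, r)] / count[r]
        for (p, r) in joint
        if count[r] > 0
    }


# Hypothetical toy annotations, for illustration only.
toy_labels = [
    {"sea", "ship"},
    {"sea", "ship", "dock"},
    {"sea", "sand"},
    {"sea"},
]
probs = conditional_probabilities(toy_labels, ["sea", "ship", "dock", "sand"])
print(probs[("sea", "ship")])  # P(sea | ship) -> 1.0
print(probs[("ship", "sea")])  # P(ship | sea) -> 0.5
```

In this toy example every ship image also contains sea, while only half of the sea images contain a ship, which reproduces the kind of asymmetry discussed above.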
To this end, we formulate multi-label classification as a structured output problem, instead of a simple regression problem, and employ a unified framework of a CNN and a bidirectional RNN to 1) extract semantic features from raw images and 2) model image-label relations as well as bidirectional class dependencies, respectively.
2.2 Network Architecture
The proposed CA-Conv-BiLSTM, as illustrated in Fig. 3, is composed of three components: a feature extraction module, a class attention learning layer, and a bidirectional LSTM-based recurrent sub-network. More specifically, the feature extraction module employs a stack of interleaved convolutional and pooling layers to extract high-level features, which are then fed into a class attention learning layer to produce discriminative class-specific features. Afterwards, a bidirectional LSTM-based recurrent sub-network is attached to model both probabilistic class dependencies and underlying relationships between image features and labels.
Section 2.2.1 details the architecture of the feature extraction module, and Section 2.2.2 describes the explicit design of the class attention learning layer. Finally, Section 2.2.3 introduces how to produce structured multi-label outputs from class-specific features via a bidirectional LSTM-based recurrent sub-network.
2.2.1 Dense High-level Feature Extraction
Learning efficient feature representations of input images is crucial for image classification tasks. To this end, a popular trend is to employ a CNN architecture to automatically extract discriminative features, and many recent studies hua2018lahnet ; mou2018vehicle ; xia2017aid ; kang2018u ; zhu2017deep ; mou2017multitemporal have achieved great progress in a wide range of classification tasks. Inspired by this, our model adapts VGG-16 simonyan2014very , one of the most popular CNN architectures owing to its effectiveness and elegance, to extract high-level features for our task.
Specifically, the feature extraction module consists of 5 convolutional blocks, each of which contains 2 or 3 convolutional layers (as illustrated on the left of Fig. 3). Notably, the number of filters is the same within a convolutional block and doubles after the spatial dimension of feature maps is scaled down by pooling layers. The purpose of such design is to enable the feature extraction module to learn diverse features at less computational expense. The receptive field of all convolutional filters is $3 \times 3$, which increases nonlinearities inside the feature extraction module. Besides, the convolution stride is 1 pixel, and the spatial padding of each convolutional layer is set to 1 pixel as well. Between these convolutional blocks, max-pooling layers are interleaved to reduce the size of feature maps while retaining only local representatives, such as the maximum in a $2 \times 2$-pixel region. The size of pooling windows is $2 \times 2$ pixels, and the pooling stride is 2 pixels, which halves feature maps in width and height.
Features directly learned from a conventional CNN (e.g., VGG-16) have proved to be high-level and semantic, but their spatial resolution is significantly reduced, which is unfavorable for generating high-dimensional class-specific features in the subsequent class attention learning layer. To address this, the max-pooling layers following the last two convolutional blocks are discarded in our model, and atrous convolutional filters with dilation rate 2 are employed in the last convolutional block to preserve the original receptive fields. Consequently, our feature extraction module is capable of learning high-level features with finer spatial resolution than VGG-16, hence “dense”, and it is feasible to initialize our model with pre-trained VGG-16, considering that all filters have equivalent receptive fields.
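The effect of discarding the last two pooling layers can be checked with simple arithmetic. The sketch below assumes a 224-pixel input, which is a hypothetical size chosen only for illustration, and the helper names are not part of the original model.

```python
def output_size(input_size, pooled_blocks):
    """Spatial size of the feature maps after five VGG-style blocks.

    Stride-1 convolutions with 1-pixel padding keep the size unchanged, so
    only the stride-2 max-pooling layers (one per True entry) halve it.
    """
    size = input_size
    for has_pool in pooled_blocks:
        if has_pool:
            size //= 2
    return size


def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated (atrous) filter."""
    return dilation * (kernel - 1) + 1


# Standard VGG-16: a pooling layer follows every block.
vgg16 = output_size(224, [True, True, True, True, True])
# Dense variant: pooling after the last two blocks is discarded.
dense = output_size(224, [True, True, True, False, False])
print(vgg16, dense)            # 7 28
print(effective_kernel(3, 2))  # 5
```

With dilation rate 2, a three-pixel-wide filter spans five positions on the un-pooled feature map, which corresponds to the extent the same filter would have covered had the preceding pooling layer been kept.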
Moreover, it is worth noting that other popular CNN architectures can be taken as prototypes of the feature extraction module, and thus, we extend our study to GoogLeNet szegedy2015going and ResNet he2016deep for a comprehensive evaluation of CA-Conv-BiLSTM. Regarding GoogLeNet, i.e., Inception-v3 szegedy2016rethinking , the stride of convolutional and pooling layers after “mixed7” is reduced to 1 pixel, and the dilation rate of convolutional filters in “mixed9” is 2. For ResNet (we use ResNet-50), the convolution stride in the last two residual blocks is set to 1 pixel, and the dilation rate of filters in the last residual block is 2. Besides, the global average pooling layer and all layers after it are removed to ensure dense high-level feature maps.
2.2.2 Class Attention Learning Layer
Although features extracted from pre-trained CNNs are high-level and can be directly fed into a fully connected layer for generating multi-label predictions, it is infeasible to learn high-order probabilistic dependencies by recurrently feeding an RNN with identical features. Therefore, extracting discriminative class-wise features plays a key role in discovering class dependencies and effectively bridging CNN and RNN for multi-label classification tasks.
Here, we propose a class attention learning layer to explore features with respect to each category. The proposed layer, illustrated in the middle of Fig. 3, consists of the following two stages: 1) generating class attention maps via a convolutional layer with stride 1, and 2) vectorizing each class attention map to obtain class-specific features. Formally, let $X$ denote the feature maps extracted from the feature extraction module, with a size of $W \times W \times K$, and let $w_l$ represent the $l$-th convolutional filter in the class attention learning layer. The attention map $M_l$ for class $l$ can be obtained with the following formula:
$$M_l = X * w_l, \tag{2}$$
where $l$ ranges from 1 to the number of classes, and $*$ represents the convolution operation. Given that the size of convolutional filters is $1 \times 1$ and the stride is 1, Eq. 2 can be further written as:
$$m_l(i, j) = \sum_{k=1}^{K} w_{l,k}\, x_k(i, j), \tag{3}$$
where $i, j = 1, 2, \dots, W$, and $m_l(i, j)$ and $x_k(i, j)$ indicate activations of the class attention map $M_l$ and the $k$-th channel of $X$ at a spatial location $(i, j)$, respectively. $w_{l,k}$ is the $k$-th channel of $w_l$. The modified formula highlights that a class attention map is intrinsically a linear combination of all channels in $X$, and $w_{l,k}$ depicts the importance of the $k$-th channel of $X$ for class $l$. Therefore, $m_l(i, j)$ with a strong activation suggests that the region is highly relevant to class $l$, and vice versa. With this design, the proposed class attention learning layer is capable of tracking the distinctive attention of the network when predicting different classes, and the extracted class attention maps are rich in discriminative class-specific semantic information. Some examples are shown in Fig. 4. It is observed that class attention maps highlight discriminative areas for different categories and exhibit almost no activations with respect to absent classes (see Fig. 4).
Subsequently, class attention maps are transformed into class-wise feature vectors of $W^2$ dimensions by vectorization. Instead of fully connecting class attention maps to each hidden unit in the following layer, we construct class-wise connections between class attention maps and their corresponding hidden units, i.e., corresponding time steps in an LSTM layer in our network. In this way, features fed into different units remain class-specific and discriminative, and they contribute significantly to the exploitation of the dynamic class dependency in the subsequent bidirectional LSTM layer.
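The two stages of the class attention learning layer can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the original implementation: it assumes one single-pixel filter per class without a bias term, so that each attention map is a per-location linear combination of feature channels as described above, and the tensor sizes are hypothetical.

```python
import numpy as np


def class_attention(features, filters):
    """Compute class attention maps and class-wise feature vectors.

    features: array of shape (W, W, K), the dense high-level feature maps.
    filters:  array of shape (L, K), one single-pixel filter per class.
    Returns maps of shape (L, W, W) and vectors of shape (L, W * W).
    """
    # Stage 1: weighted sum over the K channels at every spatial location.
    maps = np.einsum("ijk,lk->lij", features, filters)
    # Stage 2: vectorize each class attention map.
    vectors = maps.reshape(maps.shape[0], -1)
    return maps, vectors


rng = np.random.default_rng(0)
W, K, L = 28, 512, 17  # hypothetical sizes
features = rng.standard_normal((W, W, K))
filters = rng.standard_normal((L, K))
maps, vectors = class_attention(features, filters)
print(maps.shape, vectors.shape)  # (17, 28, 28) (17, 784)
```

Each row of `vectors` would then serve as the input at one time step of the recurrent sub-network, which keeps the features fed to different time steps class-specific.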
2.2.3 Class Dependency Learning via a BiLSTM-based Sub-network
As an important branch of neural networks, RNNs are widely used in dealing with sequential data, e.g., textual data and time series, due to their strong capability of exploiting implicit dependencies among inputs. Unlike a CNN, an RNN is characterized by its recurrent neurons, whose activations depend on both current inputs and previous hidden states. However, conventional RNNs suffer from the vanishing gradient problem and have difficulty learning long-term dependencies. Therefore, in this work, we seek to model class dependencies with an LSTM-based RNN, which was first proposed in hochreiter1997long and has shown great performance in processing long sequences graves2013generating ; gers1999learning ; xu2015show ; Mournn ; moulearn .
Instead of directly summing up inputs as in a conventional recurrent layer, an LSTM layer relies on specifically designed hidden units, LSTM units, where information, such as the class dependency between categories $l$ and $l-1$, is “memorized”, updated, and transmitted with a memory cell and several gates. Specifically, let a class-specific feature $v_l$ obtained from the class attention learning layer be the input of the LSTM memory cell $c_l$ at time step $l$, and let $h_l$ represent the activation of $c_l$. New memory information $\tilde{c}_l$, learned from the previous activation $h_{l-1}$ and the present input feature $v_l$, is obtained as follows:
$$\tilde{c}_l = \tanh(W_{vc} v_l + W_{hc} h_{l-1} + b_c), \tag{4}$$
where $W_{vc}$ and $W_{hc}$ denote the weight matrix from input vectors to the memory cell and the hidden-memory coefficient matrix, respectively, and $b_c$ is a bias term. Besides, $\tanh$ is the hyperbolic tangent function. In contrast to conventional recurrent units, where $\tilde{c}_l$ is directly used to update the current state $h_l$, an LSTM unit employs an input gate $i_l$ to control the extent to which $\tilde{c}_l$ is added, and meanwhile, partially omits uncorrelated prior information from $c_{l-1}$ with a forget gate $f_l$. The two gates are computed by the following equations:
$$i_l = \sigma(W_{vi} v_l + W_{hi} h_{l-1} + b_i), \quad f_l = \sigma(W_{vf} v_l + W_{hf} h_{l-1} + b_f). \tag{5}$$
Consequently, the memory cell is updated by
$$c_l = i_l \odot \tilde{c}_l + f_l \odot c_{l-1}, \tag{6}$$
where $\odot$ represents element-wise multiplication. Afterwards, an output gate $o_l$, formulated by
$$o_l = \sigma(W_{vo} v_l + W_{ho} h_{l-1} + b_o), \tag{7}$$
is designed to determine the proportion of memory content to be exposed, and eventually, the memory cell at time step $l$ is activated by
$$h_l = o_l \odot \tanh(c_l). \tag{8}$$
It is not difficult to see that the activation of the memory cell at each time step depends on both the input class-specific feature vectors and the previous cell states. However, taking into account that the class dependency is bidirectional, as demonstrated in Section 2.1, a single-directional LSTM-based RNN is insufficient to draw a comprehensive picture of inter-class relevance. Therefore, a bidirectional LSTM-based RNN, composed of two identical recurrent streams with reversed directions, is introduced in our model, and the hidden units are updated based on signals from not only their preceding states but also subsequent ones.
In order to practically adapt a bidirectional LSTM-based RNN to modeling the class dependency, we set the number of time steps in our bidirectional LSTM-based sub-network equal to the number of classes, under the assumption that distinct classes are predicted at respective time steps. As validated in Sections 3.3 and 3.4, such design enjoys two outstanding characteristics: on one hand, the LSTM memory cell at time step $l$, $c_l$, focuses on learning the dependency between class $l$ and the others in both directions (cf. Fig. 5), and on the other hand, the occurrence probability of class $l$, $P_l$, can be predicted from the outputs $[h_l, h'_l]$ with a single-unit fully connected layer:
$$P_l = \sigma(w_l [h_l, h'_l] + b_l), \tag{9}$$
where $h'_l$ denotes the activation of $c_l$ in the other direction, and the sigmoid function $\sigma$ is used as the activation function.
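To make the recurrent part more concrete, the following NumPy sketch runs two LSTM streams over the class-specific feature vectors, one time step per class, and predicts one probability per class from the concatenated activations. It is a sketch under stated assumptions, not the original implementation: it uses a standard LSTM cell without peephole connections, a sigmoid output with separate output weights per class, small hypothetical dimensions, and an arbitrary initialization range.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(input_dim, hidden_dim, rng):
    """Weights and biases for the candidate, input, forget and output transforms."""
    shape = (4, hidden_dim, input_dim + hidden_dim)
    return rng.uniform(-0.1, 0.1, shape), np.zeros((4, hidden_dim))


def lstm_pass(inputs, weights, biases):
    """Run one LSTM stream over a sequence of class-specific vectors."""
    hidden_dim = biases.shape[1]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    outputs = []
    for v in inputs:
        z = weights @ np.concatenate([v, h]) + biases  # shape (4, hidden_dim)
        candidate = np.tanh(z[0])                      # new memory information
        i_gate, f_gate, o_gate = sigmoid(z[1]), sigmoid(z[2]), sigmoid(z[3])
        c = i_gate * candidate + f_gate * c            # memory cell update
        h = o_gate * np.tanh(c)                        # exposed activation
        outputs.append(h)
    return np.stack(outputs)


def bilstm_predict(class_vectors, fwd, bwd, w_out, b_out):
    """Predict one occurrence probability per class (one class per time step)."""
    h_fwd = lstm_pass(class_vectors, *fwd)
    # The backward stream reads the classes in reverse order; its outputs are
    # flipped back so that row l of both streams refers to class l.
    h_bwd = lstm_pass(class_vectors[::-1], *bwd)[::-1]
    h_cat = np.concatenate([h_fwd, h_bwd], axis=1)
    # Single-unit fully connected layer per time step.
    return sigmoid(np.sum(h_cat * w_out, axis=1) + b_out)


rng = np.random.default_rng(0)
num_classes, feat_dim, hidden = 8, 16, 32  # hypothetical sizes
class_vectors = rng.standard_normal((num_classes, feat_dim))
fwd = init_lstm(feat_dim, hidden, rng)
bwd = init_lstm(feat_dim, hidden, rng)
w_out = rng.uniform(-0.1, 0.1, (num_classes, 2 * hidden))
b_out = np.zeros(num_classes)
probs = bilstm_predict(class_vectors, fwd, bwd, w_out, b_out)
print(probs.shape)  # (8,)
```

Because the activation at each time step depends on the states carried from both neighbouring directions, the prediction for one class can be influenced by the features of every other class, which is the motivation for the bidirectional design.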
3 Experiments and Discussion
In this section, two high resolution aerial datasets of different resolution used for evaluating our network are first described in Section 3.1, and then, the training strategies are introduced in Section 3.2. Afterwards, the performance of the proposed network on the two datasets is quantitatively and qualitatively evaluated in the following sections.
3.1 Data description
3.1.1 UCM Multi-label Dataset
UCM multi-label dataset chaudhuri2018multilabel is derived from UCM dataset yang2010bag by relabeling its images with multiple object labels. Specifically, UCM dataset consists of 2100 aerial images of $256 \times 256$ pixels, and each of them is categorized into one of 21 scene labels: airplane, beach, agricultural, baseball diamond, building, tennis courts, dense residential, forest, freeway, golf course, mobile home park, harbor, intersection, storage tank, medium residential, overpass, sparse residential, parking lot, river, runway, and chaparral. For each scene label, there are 100 images with a spatial resolution of one foot, manually cropped from aerial ortho imagery provided by the United States Geological Survey (USGS) National Map.
In contrast, images in UCM multi-label dataset are relabeled by assigning each image sample one or more labels based on its primitive objects. The total number of newly defined object classes is 17: airplane, sand, pavement, building, car, chaparral, court, tree, dock, tank, water, grass, mobile home, ship, bare soil, sea, and field. It is notable that several labels, namely, airplane, building, and tank, are defined in both datasets but at different levels. In UCM dataset, they are scene-level labels, since they are predominant objects in an image and used to depict the whole image, while in UCM multi-label dataset, they are object-level labels, regarded as candidate objects in a scene. The numbers of images related to each object category are listed in Table 1, and examples from each scene category are shown in Fig. 6, together with their corresponding object labels. To train and test our network on UCM multi-label dataset, we select 80% of sample images evenly from each scene category for training and use the rest as the test set.
| Class No. | Class Name | Total | Training | Test |
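The per-category 80/20 split described above can be sketched as follows. The function name, the fixed random seed, and the toy image identifiers are hypothetical; the original selection procedure is not specified beyond drawing 80% of the images evenly from each scene category.

```python
import random


def split_per_category(samples, train_fraction=0.8, seed=0):
    """Split (image_id, scene_category) pairs evenly within each category."""
    by_category = {}
    for image_id, category in samples:
        by_category.setdefault(category, []).append(image_id)
    rng = random.Random(seed)
    train, test = [], []
    for category in sorted(by_category):
        ids = by_category[category]
        rng.shuffle(ids)
        cut = round(train_fraction * len(ids))
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return train, test


# Toy example: three scene categories with 100 images each.
samples = [
    (f"{category}{index:02d}", category)
    for category in ("airplane", "beach", "forest")
    for index in range(100)
]
train, test = split_per_category(samples)
print(len(train), len(test))  # 240 60
```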
3.1.2 DFC15 Multi-label Dataset
Considering that images collected from the same scene may share similar patterns, which makes the task easier, we build a new multi-label dataset, DFC15 multi-label dataset, based on a semantic segmentation dataset, DFC15 dfc15 , which was published and first used in the 2015 IEEE GRSS Data Fusion Contest. DFC15 dataset was acquired over Zeebrugge with an airborne sensor flying 300 m above the ground. In total, 7 tiles are collected in DFC15 dataset, and each of them is $10000 \times 10000$ pixels with a spatial resolution of 5 cm. Unlike UCM dataset, where images are assigned image-level labels, all tiles in DFC15 dataset are labeled at the pixel level, and each pixel is categorized into one of 8 distinct object classes: impervious, water, clutter, vegetation, building, tree, boat, and car.
For our task, the following processing steps are conducted. First, we crop large tiles into images of $600 \times 600$ pixels with a 200-pixel-stride sliding window. Afterwards, images containing unclassified pixels are discarded, and labels of all pixels in each image are aggregated into image-level multi-labels. An important characteristic of images in DFC15 multi-label dataset is lower inter-image similarity, because they are cropped consecutively from vast regions without specific preferences, e.g., seeking images belonging to a specific scene. Moreover, the extremely high resolution makes this dataset more challenging than UCM multi-label dataset. The numbers of images containing each object label are listed in Table 2, and example images with their image-level object labels are shown in Fig. 7. To conduct the evaluation, 80% of images are randomly selected as the training set, while the others are used to test our network.
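The conversion from pixel-level annotations to image-level multi-labels can be sketched as below. This is an illustrative sketch only: the default window size of 600 pixels, the class id used for unclassified pixels, and the tiny toy tile are assumptions, and the label map is represented as a plain 2-D list of class ids.

```python
UNCLASSIFIED = 0  # assumed id for unlabeled pixels


def crop_multilabels(label_map, size=600, stride=200):
    """Yield (top, left, labels) for each window without unclassified pixels.

    label_map is a 2-D list of per-pixel class ids; labels is the sorted list
    of class ids present in the window, i.e. its image-level multi-label.
    """
    height, width = len(label_map), len(label_map[0])
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            labels = set()
            for row in label_map[top:top + size]:
                labels.update(row[left:left + size])
            if UNCLASSIFIED in labels:
                continue  # discard windows containing unclassified pixels
            yield top, left, sorted(labels)


# Toy 6 x 6 tile: class 1 everywhere, one unclassified pixel, one pixel of class 2.
tile = [[1] * 6 for _ in range(6)]
tile[0][0] = UNCLASSIFIED
tile[5][5] = 2
windows = list(crop_multilabels(tile, size=4, stride=2))
print(windows)  # [(0, 2, [1]), (2, 0, [1]), (2, 2, [1, 2])]
```

Because the stride is smaller than the window size, neighbouring crops overlap, which increases the number of samples obtained from each tile.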
| Class No. | Class Name | Total | Training | Test |
3.2 Training details
The proposed CA-Conv-BiLSTM is initialized with separate strategies for its three main components: 1) the feature extraction module is initialized with CNNs pre-trained on ImageNet dataset imagenet_cvpr09 , 2) convolutional filters in the class attention learning layer are initialized with the Glorot uniform initializer, and 3) all weights in the bidirectional 2048-d LSTM layer are randomly initialized from a uniform distribution. Notably, weights in the feature extraction module are trainable and fine-tuned during the training phase of our network.
Regarding the optimizer, we chose Nesterov Adam (Nadam) nadam2 , claimed to converge faster than stochastic gradient descent (SGD), and set the parameters of the optimizer to their recommended values. The learning rate is decayed by a factor of 0.1 when the validation accuracy saturates. The loss of the network is simply defined as the mean squared error. We implement the network in TensorFlow and train it on one NVIDIA Tesla P100 16GB GPU for 100 epochs. The training batch size is 32 as a trade-off between GPU memory capacity and training speed. To avoid overfitting, we stop the training procedure when the loss fails to decrease for five epochs.
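The learning-rate decay and early-stopping rules described above can be sketched independently of any deep learning framework. In the sketch below, the class name, the initial learning rate, and the number of stalled epochs tolerated before a decay are hypothetical; only the decay factor of 0.1 and the five-epoch stopping criterion come from the text, and the loss passed in stands for whichever loss is being monitored.

```python
class PlateauController:
    """Decay the learning rate on accuracy plateaus and stop on a stalled loss."""

    def __init__(self, learning_rate, decay=0.1, lr_patience=3, stop_patience=5):
        self.learning_rate = learning_rate
        self.decay = decay
        self.lr_patience = lr_patience      # hypothetical tolerance before decaying
        self.stop_patience = stop_patience  # five epochs, as stated in the text
        self.best_acc = float("-inf")
        self.best_loss = float("inf")
        self.acc_stall = 0
        self.loss_stall = 0

    def update(self, val_accuracy, loss):
        """Record one epoch and return True when training should stop."""
        if val_accuracy > self.best_acc:
            self.best_acc, self.acc_stall = val_accuracy, 0
        else:
            self.acc_stall += 1
            if self.acc_stall >= self.lr_patience:
                self.learning_rate *= self.decay
                self.acc_stall = 0
        if loss < self.best_loss:
            self.best_loss, self.loss_stall = loss, 0
        else:
            self.loss_stall += 1
        return self.loss_stall >= self.stop_patience


# Hypothetical per-epoch (validation accuracy, loss) history.
history = [(0.70, 0.30), (0.75, 0.25), (0.75, 0.26), (0.74, 0.27),
           (0.75, 0.26), (0.75, 0.26), (0.75, 0.25), (0.75, 0.27)]
controller = PlateauController(learning_rate=1e-3)
stopped_epoch = None
for epoch, (accuracy, loss) in enumerate(history, start=1):
    if controller.update(accuracy, loss):
        stopped_epoch = epoch
        break
print(stopped_epoch)  # 7
```

In this toy history the accuracy stops improving after the second epoch, so the learning rate is decayed once, and training halts at the seventh epoch after five epochs without a lower loss.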
3.3 Results on UCM Multi-label Dataset
3.3.1 Quantitative Results
To evaluate the performance of CA-Conv-BiLSTM for multi-label classification of high resolution aerial imagery, we calculate the $F_2$ score as follows:
$$F_2 = (1 + \beta^2)\frac{p_e r_e}{\beta^2 p_e + r_e}, \quad \beta = 2, \tag{10}$$
where $p_e$ is the example-based precision tsoumakas2007random of predicted multiple labels, and $r_e$ indicates the example-based recall. They are computed by:
$$p_e = \frac{TP_e}{TP_e + FP_e}, \quad r_e = \frac{TP_e}{TP_e + FN_e}, \tag{11}$$
where $TP_e$, $FP_e$, and $FN_e$ indicate the numbers of positive labels that are predicted correctly (true positives) and incorrectly (false positives), and of negative labels that are incorrectly predicted (false negatives) in an example (i.e., an image with multiple object labels in our case), respectively. Then, the $F_2$ scores of all examples are averaged to assess the overall accuracy of multi-label classification. Besides, example-based mean precision and mean recall are calculated to assess the performance from the perspective of examples, while label-based mean precision and mean recall help us understand the performance of the network from the perspective of object labels:
$$p_l = \frac{TP_l}{TP_l + FP_l}, \quad r_l = \frac{TP_l}{TP_l + FN_l}, \tag{12}$$
where $TP_l$, $FP_l$, and $FN_l$ represent the numbers of correctly predicted positive images, incorrectly predicted positive images, and incorrectly predicted negative images with respect to each label.
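The example-based and label-based metrics can be computed from sets of true and predicted labels. The sketch below is a plain-Python illustration: it assumes a recall-weighted F score with beta equal to 2, which is consistent with recall contributing more to the score, and it treats undefined ratios (zero denominators) as 0, a convention the text does not specify. The toy labels are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall, with undefined ratios treated as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=2.0):
    """F score that weights recall beta times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def example_based(y_true, y_pred, beta=2.0):
    """Mean F score, mean precision and mean recall over examples (images)."""
    scores = []
    for truth, pred in zip(y_true, y_pred):
        p, r = precision_recall(len(truth & pred), len(pred - truth), len(truth - pred))
        scores.append((f_beta(p, r, beta), p, r))
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


def label_based(y_true, y_pred, classes):
    """Mean precision and mean recall over labels, counting images per label."""
    stats = []
    for c in classes:
        tp = sum(c in t and c in p for t, p in zip(y_true, y_pred))
        fp = sum(c not in t and c in p for t, p in zip(y_true, y_pred))
        fn = sum(c in t and c not in p for t, p in zip(y_true, y_pred))
        stats.append(precision_recall(tp, fp, fn))
    n = len(stats)
    return sum(p for p, _ in stats) / n, sum(r for _, r in stats) / n


# Hypothetical ground truths and predictions for two images.
y_true = [{"car", "pavement"}, {"sea", "ship"}]
y_pred = [{"car", "pavement", "tree"}, {"sea"}]
mean_f2, mean_pe, mean_re = example_based(y_true, y_pred)
mean_pl, mean_rl = label_based(y_true, y_pred, ["car", "pavement", "tree", "sea", "ship"])
print(round(mean_f2, 4), round(mean_pe, 4), round(mean_re, 4))  # 0.7323 0.8333 0.75
print(round(mean_pl, 4), round(mean_rl, 4))  # 0.6 0.6
```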
For a fair validation of CA-Conv-BiLSTM, we decompose the evaluation into two components: we compare 1) CA-Conv-LSTM with standard CNNs to validate the effectiveness of employing an LSTM-based recurrent sub-network, and 2) CA-Conv-BiLSTM with CA-Conv-LSTM to further assess the significance of the bidirectional structure. The detailed configurations of these competitors are listed in Table 3.
| Model | CNN model | Class Attention Map | Bi. |
N indicates the number of classes in the dataset.
Bi. indicates whether the model is bidirectional or not.
Table 4 presents results on UCM multi-label dataset. Compared to directly applying standard CNNs to multi-label classification, the CA-Conv-LSTM framework performs better, as expected, because it takes class dependencies into consideration. CA-VGG-LSTM increases the mean $F_2$ score by 0.0304 with respect to VGGNet, while CA-ResNet-LSTM obtains an increment of 0.0107 compared to ResNet. Benefiting most from this framework, CA-GoogLeNet-LSTM achieves the best mean $F_2$ score of 0.8423 among the CA-Conv-LSTM models, an increment of 0.0341 over GoogLeNet. Another important observation is that the proposed CA-Conv-LSTM models achieve higher recall but lower precision. This is because recall contributes more to the $F_2$ score according to Eq. 10. To summarize, all comparisons demonstrate that, instead of directly using a standard CNN as a regression model, exploiting class dependencies plays a key role in multi-label classification. Table 5 shows several example predictions on UCM multi-label dataset.
M. $F_2$ indicates the mean $F_2$ score.
M. $p_e$ and M. $r_e$ indicate example-based mean precision and recall.
M. $p_l$ and M. $r_l$ indicate class-based mean precision and recall.
Concerning the significance of employing a bidirectional structure, CA-ResNet-BiLSTM performs better than CA-ResNet-LSTM, and CA-GoogLeNet-BiLSTM achieves a higher mean $F_2$ score, increased by 0.0446, compared to CA-GoogLeNet-LSTM. CA-VGG-BiLSTM shows an increment of 0.0259 compared to VGGNet and achieves a mean $F_2$ score comparable to that of CA-VGG-LSTM. In general, this result is consistent with our observation (cf. Fig. 2) that class dependencies are asymmetric and need to be modeled in both directions.
3.3.2 Qualitative Results
In addition to validating the classification capabilities of the network by computing the mean $F_2$ score, we further explore the effectiveness of class-specific features learned from the proposed class attention learning layer and try to “open” the black box of our network by feature visualization. Example class attention maps produced by the proposed network on UCM multi-label dataset are shown in Fig. 8, where column (a) shows original images, and columns (b)-(i) show class attention maps for different objects: (b) bare soil, (c) building, (d) car, (e) court, (f) grass, (g) pavement, (h) tree, and (i) water. As we can see, these maps highlight discriminative regions for positive classes, while presenting almost no activations when the corresponding objects are absent from the original images. For example, the object labels of the image in the first row of Fig. 8 are building, grass, pavement, and tree, and its class attention maps for these categories are strongly activated. From the images in the fourth row of Fig. 8, it can be seen that regions of grassland, forest, and river are highlighted in their corresponding class attention maps, leading to positive predictions, while no discriminative areas are strongly activated in the other maps.
3.4 Results on DFC15 Multi-label Dataset
3.4.1 Quantitative Results
Following the evaluation on UCM multi-label dataset, we assess our network on DFC15 multi-label dataset by calculating the mean $F_2$ score, mean precision, and mean recall. Table 6 shows experimental results on this dataset, from which it can be concluded that modeling class dependencies with a bidirectional structure contributes significantly to multi-label classification. Specifically, the mean $F_2$ score achieved by CA-GoogLeNet-BiLSTM is 0.0151 and 0.0285 higher than those of CA-GoogLeNet-LSTM and GoogLeNet, respectively. CA-VGG-BiLSTM obtains the best mean $F_2$ score of 0.7616 in comparison with VGGNet and CA-VGG-LSTM, and the mean $F_2$ score of CA-ResNet-BiLSTM is 0.7934, higher than those of its competitors. To conclude, all these increments demonstrate the effectiveness and robustness of our bidirectional structure for high resolution aerial image multi-label classification. Several example predictions on DFC15 multi-label dataset are shown in Table 5.
3.4.2 Qualitative Results
To study the effectiveness of class-specific features, we visualize class attention maps learned from the proposed class attention learning layer, as shown in Fig. 9. Columns (b)-(i) are example class attention maps with respect to (b) impervious, (c) water, (d) clutter, (e) vegetation, (f) building, (g) tree, (h) boat, and (i) car. As we can see, the maps in column (b) of Fig. 9 show that the network pays close attention to impervious regions, such as parking lots, while the maps in column (i) highlight regions of cars. However, some class attention maps for negative object labels exhibit unexpectedly strong activations. For instance, the class attention map for car in the third row of Fig. 9 is not supposed to highlight any region, since the image contains no cars. This can be explained by the highlighted regions sharing similar patterns with cars, which also illustrates why the network makes wrong predictions (cf. the wrongly predicted car label in Fig. 9). Overall, the visualization of class attention maps demonstrates that the features captured by the proposed class attention learning layer are discriminative and class-specific.
4 Conclusion
In this paper, we propose a novel network, CA-Conv-BiLSTM, for the multi-label classification of high resolution aerial imagery. The proposed network is composed of three indispensable elements: 1) a feature extraction module, 2) a class attention learning layer, and 3) a bidirectional LSTM-based sub-network. Specifically, the feature extraction module is responsible for capturing fine-grained high-level feature maps from raw images, while the class attention learning layer is designed for extracting discriminative class-specific features. Afterwards, the bidirectional LSTM-based sub-network is used to model the underlying class dependency in both directions and predict multiple object labels in a structured manner. With such design, the prediction of multiple object-level labels is performed in an ordered procedure, and outputs are structured sequences instead of discrete values. We evaluate our network on two datasets, UCM multi-label dataset and DFC15 multi-label dataset, and experimental results validate the effectiveness of our model from both quantitative and qualitative perspectives. On one hand, the mean $F_2$ score is increased by at most 0.0446 compared to other competitors. On the other hand, visualized class attention maps, where discriminative regions for existing objects are strongly activated, demonstrate that features learned from this layer are class-specific and discriminative. Looking into the future, the application of our network can be extended to fields such as weakly supervised semantic segmentation and object localization.
This work is jointly supported by the China Scholarship Council, the Helmholtz Association under the framework of the Young Investigators Group SiPEO (VH-NG-1018, www.sipeo.bgu.tum.de), and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. ERC-2016-StG-714087, Acronym: So2Sat). In addition, the authors would like to thank the National Center for Airborne Laser Mapping and the Hyperspectral Image Analysis Laboratory at the University of Houston for acquiring and providing the data used in this study, and the IEEE GRSS Image Analysis and Data Fusion Technical Committee.
- (1) D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu, U. Stilla, Classification with an edge: Improving semantic image segmentation with boundary detection, ISPRS Journal of Photogrammetry and Remote Sensing 135 (January) (2018) 158–172.
- (2) N. Audebert, B. L. Saux, S. Lefèvre, Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks, ISPRS Journal of Photogrammetry and Remote Sensing 140 (June) (2018) 20–32.
- (3) D. Marcos, M. Volpi, B. Kellenberger, D. Tuia, Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models, ISPRS Journal of Photogrammetry and Remote Sensing.
- (4) L. Mou, X. X. Zhu, RiFCN: Recurrent network in fully convolutional network for semantic segmentation of high resolution remote sensing images, arXiv:1805.02091.
- (5) P. Zarco-Tejada, R. Diaz-Varela, V. Angileri, P. Loudjani, Tree height quantification using very high resolution imagery acquired from an unmanned aerial vehicle (UAV) and automatic 3D photo-reconstruction methods, European Journal of Agronomy 55 (2014) 89–99.
- (6) D. Wen, X. Huang, H. Liu, W. Liao, L. Zhang, Semantic classification of urban trees using very high resolution satellite imagery, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10 (4) (2017) 1413–1424.
- (7) L. Mou, X. X. Zhu, IM2HEIGHT: Height estimation from single monocular imagery via fully residual convolutional-deconvolutional network, arXiv:1802.10249.
- (8) S. Lucchesi, M. Giardino, L. Perotti, Applications of high-resolution images and DTMs for detailed geomorphological analysis of mountain and plain areas of NW Italy, European Journal of Remote Sensing 46 (1) (2013) 216–233.
- (9) Q. Weng, Z. Mao, J. Lin, X. Liao, Land-use scene classification based on a CNN using a constrained extreme learning machine, International Journal of Remote Sensing 0 (0) (2018) 1–19.
- (10) G. Cheng, J. Han, X. Lu, Remote sensing image scene classification: Benchmark and state of the art, Proceedings of the IEEE 105 (10) (2017) 1865–1883.
- (11) L. Mou, X. X. Zhu, Vehicle instance segmentation from aerial image and video using a multi-task learning residual fully convolutional network, IEEE Transactions on Geoscience and Remote Sensing.
- (12) L. Mou, X. X. Zhu, Spatiotemporal scene interpretation of space videos via deep neural network and tracklet analysis, in: IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2016.
- (13) Q. Li, L. Mou, Q. Liu, Y. Wang, X. X. Zhu, Hsf-net: Multi-scale deep feature embedding for ship detection in optical remote sensing imagery, IEEE Transactions on Geoscience and Remote Sensing.
- (14) K. Nogueira, O. Penatti, J. dos Santos, Towards better exploiting convolutional neural networks for remote sensing scene classification, Pattern Recognition 61 (2017) 539–556.
- (15) Y. Yang, S. Newsam, Bag-of-visual-words and spatial extensions for land-use classification, in: International Conference on Advances in Geographic Information Systems (SIGSPATIAL), 2010.
- (16) G. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, X. Lu, AID: A benchmark data set for performance evaluation of aerial scene classification, IEEE Transactions on Geoscience and Remote Sensing 55 (7) (2017) 3965–3981.
- (17) X. X. Zhu, D. Tuia, L. Mou, S. Xia, L. Zhang, F. Xu, F. Fraundorfer, Deep learning in remote sensing: A comprehensive review and list of resources, IEEE Geoscience and Remote Sensing Magazine 5 (4) (2017) 8–36.
- (18) B. Demir, L. Bruzzone, Histogram-based attribute profiles for classification of very high resolution remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 54 (4) (2016) 2096–2107.
- (19) F. Hu, G. Xia, J. Hu, L. Zhang, Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery, Remote Sensing 7 (11) (2015) 14680–14707.
- (20) F. Hu, G. Xia, W. Yang, L. Zhang, Recent advances and opportunities in scene classification of aerial images with deep models, in: IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2018.
- (21) F. Zhang, B. Du, L. Zhang, Saliency-guided unsupervised feature learning for scene classification, IEEE Transactions on Geoscience and Remote Sensing 53 (4) (2015) 2175–2184.
- (22) X. Huang, H. Chen, J. Gong, Angular difference feature extraction for urban scene classification using ZY-3 multi-angle high-resolution satellite imagery, ISPRS Journal of Photogrammetry and Remote Sensing 135 (2018) 127–141.
- (23) L. Mou, X. X. Zhu, M. Vakalopoulou, K. Karantzalos, N. Paragios, B. Le Saux, G. Moser, D. Tuia, Multitemporal very high resolution from space: Outcome of the 2016 IEEE GRSS data fusion contest, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10 (8) (2017) 3435–3447.
- (24) Q. Tan, Y. Liu, X. Chen, G. Yu, Multi-label classification based on low rank representation for image annotation, Remote Sensing 9 (2) (2017) 109.
- (25) S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, 2015.
- (26) J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: IEEE conference on computer vision and pattern recognition (CVPR), 2015.
- (27) V. Badrinarayanan, A. Kendall, R. Cipolla, Segnet: A deep convolutional encoder-decoder architecture for image segmentation, arXiv:1511.00561.
- (28) P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: IEEE conference on computer vision and pattern recognition (CVPR), 2001.
- (29) T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: IEEE conference on computer vision and pattern recognition (CVPR), 2017.
- (30) S. Ren, K. He, R. Girshick, X. Zhang, J. Sun, Object detection networks on convolutional feature maps, IEEE transactions on Pattern Analysis and Machine Intelligence 39 (7) (2017) 1476–1481.
- (31) W. Shao, W. Yang, G. Xia, G. Liu, A hierarchical scheme of multiple feature fusion for high-resolution satellite scene categorization, in: International Conference on Computer Vision Systems, 2013.
- (32) V. Risojevic, Z. Babic, Fusion of global and local descriptors for remote sensing image classification, IEEE Geoscience and Remote Sensing Letters 10 (4) (2013) 836–840.
- (33) D. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
- (34) Q. Zhu, Y. Zhong, B. Zhao, G. S. Xia, L. Zhang, Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery, IEEE Geoscience and Remote Sensing Letters 13 (6) (2016) 747–751.
- (35) K. Karalas, G. Tsagkatakis, M. Zervakis, P. Tsakalides, Land classification using remotely sensed data: Going multilabel, IEEE Transactions on Geoscience and Remote Sensing 54 (6) (2016) 3548–3563.
- (36) A. Zeggada, F. Melgani, Y. Bazi, A deep learning approach to UAV image multilabeling, IEEE Geoscience and Remote Sensing Letters 14 (5) (2017) 694–698.
- (37) S. Koda, A. Zeggada, F. Melgani, R. Nishii, Spatial and structured SVM for multilabel image classification, IEEE Transactions on Geoscience and Remote Sensing (2018) 1–13.
- (38) A. Zeggada, S. Benbraika, F. Melgani, Z. Mokhtari, Multilabel conditional random field classification for UAV images, IEEE Geoscience and Remote Sensing Letters 15 (3) (2018) 399–403.
- (39) 2015 IEEE GRSS data fusion contest, http://www.grss-ieee.org/community/technical-committees/data-fusion, online.
- (40) Y. Hua, L. Mou, X. X. Zhu, LAHNet: A convolutional neural network fusing low- and high-level features for aerial scene classification, in: IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2018.
- (41) J. Kang, Y. Wang, M. Körner, H. Taubenböck, X. X. Zhu, Building instance classification using street view images.
- (42) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556.
- (43) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
- (44) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- (45) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- (46) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
- (47) A. Graves, Generating sequences with recurrent neural networks, arXiv:1308.0850.
- (48) F. Gers, J. Schmidhuber, F. Cummins, Learning to forget: Continual prediction with LSTM, in: International Conference on Artificial Neural Networks.
- (49) K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: International Conference on Machine Learning, 2015.
- (50) L. Mou, P. Ghamisi, X. Zhu, Deep recurrent neural networks for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 55 (7) (2017) 3639–3655.
- (51) L. Mou, L. Bruzzone, X. X. Zhu, Learning spectral-spatial-temporal features via a recurrent convolutional neural network for change detection in multispectral imagery, arXiv:1803.02642.
- (52) B. Chaudhuri, B. Demir, S. Chaudhuri, L. Bruzzone, Multilabel remote sensing image retrieval using a semisupervised graph-theoretic method, IEEE Transactions on Geoscience and Remote Sensing 56 (2) (2018) 1144–1158.
- (53) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- (54) T. Dozat, Incorporating Nesterov momentum into Adam, http://cs229.stanford.edu/proj2015/054_report.pdf, online.
- (55) G. Tsoumakas, I. Vlahavas, Random k-labelsets: An ensemble method for multilabel classification, in: European Conference on Machine Learning, 2007.