Binary Patterns Encoded Convolutional Neural Networks for Texture Recognition and Remote Sensing Scene Classification

06/05/2017, by Rao Muhammad Anwer et al.

Designing powerful discriminative texture features robust to realistic imaging conditions is a challenging computer vision problem with many applications, including material recognition and analysis of satellite or aerial imagery. In the past, most texture description approaches were based on dense orderless statistical distributions of local features. However, most recent approaches to texture recognition and remote sensing scene classification are based on Convolutional Neural Networks (CNNs). The de facto practice when learning these CNN models is to use RGB patches as input, with training performed on large amounts of labeled data (ImageNet). In this paper, we show that Binary Patterns encoded CNN models, codenamed TEX-Nets, trained using mapped coded images with explicit texture information provide complementary information to the standard RGB deep models. Additionally, two deep architectures, namely early and late fusion, are investigated to combine the texture and color information. To the best of our knowledge, we are the first to investigate Binary Patterns encoded CNNs and different deep network fusion architectures for texture recognition and remote sensing scene classification. We perform comprehensive experiments on four texture recognition datasets and four remote sensing scene classification benchmarks: UC-Merced with 21 scene categories, WHU-RS19 with 19 scene classes, RSSCN7 with 7 categories and the recently introduced large scale aerial image dataset (AID) with 30 aerial scene types. We demonstrate that TEX-Nets provide complementary information to the standard RGB deep model of the same network architecture. Our late fusion TEX-Net architecture always improves the overall performance compared to the standard RGB network on both recognition problems. Our final combination outperforms the state-of-the-art without employing fine-tuning or an ensemble of RGB network architectures.


1 Introduction

Texture analysis in real-world images, robust to variations in scale, orientation, illumination or other visual appearance, is a challenging computer vision problem with many applications, including object classification and remote sensing. Over the years, a variety of texture analysis approaches have been proposed in the literature Ojala02b ; Zhang07 ; Varma10j ; Liu16k ; Liu17j to capture different properties of texture, such as spatial structure, roughness, contrast, regularity, and orientation in images. Most successful texture description methods are based on orderless distributions of local features, leading to the development of several classification frameworks, including histograms of vector quantized filter responses Leung96k , the textons theory Leung01j , bag-of-visual-words Csurka04 and later the Fisher Vector Perronnin07k . In this paper, we tackle the issue of learning a robust texture description for texture recognition and remote sensing scene classification.

The first problem investigated in this paper is that of texture recognition, where the task is to associate each texture image with its respective texture category. Texture recognition plays a crucial role in many applications related to biomedical imaging, material recognition, document image analysis, and biometrics. The problem of texture recognition can be divided into two stages: texture description and classification. Generally, much attention has been focused on the texture description stage, since it is challenging to design powerful texture features robust to imaging conditions. One of the most successful approaches to texture description is that of Local Binary Patterns (LBP) Ojala02b and its variants. The standard LBP descriptor Ojala96bb is invariant to monotonic gray scale changes and is based on the signs of differences of neighboring pixels in an image. The LBP descriptor was later extended Ojala02b to obtain multi-scale, rotation invariant and uniform representations and has been successfully employed in other tasks, including object detection Zhang11h , face recognition Ahonen04h , and remote sensing scene classification Chen16jk .

The second problem investigated in this paper is that of remote sensing scene classification. Remote sensing scene classification is a challenging and open research problem, crucial for understanding high-resolution remote sensing imagery, with numerous applications including vegetation mapping, urban planning, land resource management and environmental monitoring. In this problem, the task is to automatically associate a semantic class label to each high-resolution remote sensing image containing multiple land cover types and ground objects. The problem is challenging due to several factors, such as large intra-class variations, changes in illumination due to images extracted at different times and seasons, small inter-class dissimilarity and scale variations. Several existing approaches rely either on low-level visual features Santos10j ; Yang13jk ; Chen16jk , such as color and shape, or on combinations of visual features Bin13jj ; Chen15jk . Contrary to approaches based on low-level visual features, mid-level remote sensing scene classification methods tackle the problem by encoding low-level features into a holistic high-order statistical image representation. Popular mid-level approaches include bag-of-words (BOW) variants Chen11jj ; Yang10j , spatial extensions to BOW Yang11j ; Shizhijk , semantic BOW using topic models Kusumaningrum14k ; Yanfei15k , and unsupervised feature learning Zhang15jj ; Fan15jj .

Recently, Convolutional Neural Networks (CNNs) have revolutionised computer vision, being the catalyst to significant performance gains in many vision applications, including texture recognition Cimpoi16j and remote sensing scene classification Fan15jk ; Penatti15j . CNNs and other "deep networks" are generally trained on large amounts of labeled training data (e.g. ImageNet Deng09a ) with raw image pixels of a fixed size as input. Deep networks consist of several convolution and pooling operations followed by one or more fully connected (FC) layers. Several works Azizpour14h ; Oquab14k have shown that intermediate activations of the FC layers in a deep network, pre-trained on the ImageNet dataset, are general-purpose features applicable to visual recognition tasks. Deep features based approaches have been shown to provide the best results in recent evaluations for texture recognition Liu16k and remote sensing scene classification Xia177 .

As mentioned above, the de facto practice is to train deep models on the ImageNet dataset using RGB values of the image patch as an input to the network. These pre-trained RGB deep networks are typically employed in state-of-the-art methods for texture recognition and remote sensing scene classification. Interestingly, in a recent performance evaluation for texture recognition Liu16k , the hand-crafted LBP texture descriptor and its variants were shown to provide competitive performance compared to deep features based methods especially in the presence of rotations and several types of noise. In addition to texture recognition, LBP and its variants have been successfully employed for remote sensing scene classification Santos10j ; Chen16jk . Moreover, the work of Levi15c proposes to train CNNs on pre-processed texture coded images in addition to RGB for emotion recognition. Motivated by these observations, we investigate the impact of integrating LBP within deep learning architectures for texture recognition and remote sensing scene classification.

The combination of multiple feature streams into a single architecture has recently been a subject of intense study. It is being investigated in the context of action recognition Simonyan14k ; Cheron15k ; Feichtenhofer16j , RGB-D Hoffman16k , and multi-modal networks Reed16k ; Akira16k . In the aforementioned multiple feature stream action recognition approaches, the spatial stream captures the appearance information by using RGB images as input to the network and the temporal stream captures the motion information by using dense optical flow images as input to the network. The spatial and motion streams are then fused since they contain complementary information. Inspired by the success of these two-stream deep networks, we propose a two-stream deep architecture where texture coded mapped images are used as the second stream and fused with the normal RGB image stream. The two network streams can be fused at different stages in the deep architecture. In the first strategy, termed late fusion, the RGB and texture streams are trained separately and combined at a later stage by fusing them at the FC layers. In the second strategy, termed early fusion, the two streams are joined at an early stage by aggregating the RGB and texture coded image channels as an input, in order to train a joint two-stream deep model. To the best of our knowledge, we are the first to investigate these two fusion strategies, to combine RGB and texture streams, in the context of texture recognition and remote sensing scene classification.

Contributions: In this work we investigate the problem of learning a robust texture description by integrating one of the most popular hand-crafted texture descriptors, Local Binary Patterns (LBP), within deep learning architectures for texture recognition and remote sensing scene classification. To this end, we propose deep models, which we call TEX-Nets, by designing a two-stream deep architecture where texture coded mapped images are used as the second stream and fused with the normal RGB image stream. To obtain the texture coded mapped images, we first extract LBP based codes from an image. Afterwards, as in Levi15c , the unordered LBP code values are mapped to points in a 3D metric space. The mapping is performed by employing Multi-Dimensional Scaling (MDS) using code-to-code dissimilarity scores based on an approximated Earth Mover's Distance (EMD). We further evaluate two fusion strategies, early and late fusion, to combine RGB and texture streams for texture recognition and remote sensing scene classification.

The proposed approach is first evaluated on a selection of texture benchmark datasets to demonstrate the overall effectiveness of the approach, and then applied to several remote sensing benchmark datasets to demonstrate its potential and applicability to remote sensing scene classification. The results of our experiments suggest that our late fusion TEX-Net architecture provides superior results compared to the early fusion TEX-Net architecture. Further, the proposed late fusion TEX-Net architecture always improves the overall performance compared to the standard RGB stream deep network architecture. Lastly, our final combination leads to performance superior to the state-of-the-art without employing fine-tuning or ensemble of RGB network architectures, for remote sensing scene classification.

2 Related Work

Here, we briefly review Local Binary Patterns (LBP) and its variants, deep learning, and the state of the art in texture recognition and remote sensing scene classification.

Local Binary Patterns: In the field of texture recognition, local binary patterns (LBP) Ojala02b is one of the most commonly used texture description approaches. Besides texture recognition, LBP based texture description has been applied to other vision tasks, including face recognition Tan07h , gender recognition khan14gg and person detection Wang09k . The LBP descriptor works by thresholding the intensity values in the neighborhood of a pixel, with the threshold given by the intensity of the neighborhood's center pixel. A circularly symmetric neighborhood is employed by interpolating the locations that do not fall exactly at the center of a pixel. A variety of LBP variants have been proposed in the literature, including Local Ternary Patterns Tan10j , Local Binary Pattern Variance Zhenhua10j , Noise Tolerant Local Binary Patterns Fathi12j , Completed Local Binary Patterns Zhenhua10jj , Extended Local Binary Patterns Liu12jj and Rotation Invariant Local Phase Quantization Ojansivu09k . In addition to the introduction of different LBP variants, the fusion of the LBP descriptor with color features has also been investigated in previous studies Maenpaa04p ; khan15j .

Deep Learning: In recent years, Convolutional Neural Networks (CNNs) LeCun89k have been shown to provide excellent performance for many computer vision tasks. CNNs are generally trained using large amounts of labeled training samples and take fixed-size RGB images as input to a series of convolution, normalization and pooling operations (termed layers). The network typically ends with several fully-connected (FC) layers, used to extract features for recognition. Several attempts have been made to improve deep network architectures, including increasing the depth of the network by introducing additional convolutional layers Simonyan15k ; Kaiming16k . In addition to RGB based appearance networks, other modalities such as motion and depth have also been used to construct multi-cue deep networks for action recognition Simonyan14k and RGB-D object recognition Eitel15k .

Deep Learning for Remote Sensing Image Analysis: In recent years, deep learning methods have made a breakthrough for satellite image analysis, with several works published in the major remote sensing journals. The most notable applications of deep neural networks (DNNs) in remote sensing include land cover classification with optical images Xueyun14jjk ; Adriana16k ; Molinier07k , hyperspectral image analysis Yushi14jj ; Yushi15jj ; Tuia15jj or Synthetic Aperture Radar (SAR) image analysis Geng15jjk .

A large majority of published works use DNNs trained on patches extracted from satellite images. DNNs are usually not trained on databases of full sized satellite images (1 to several GB per image) due to memory limitations, even on powerful GPU servers. CNNs are the most commonly used deep learning architectures for the classification of optical Xueyun14jjk and SAR Geng15jjk satellite images. Because large datasets of satellite images with high quality labels are not easily available, most of the earlier works utilized pre-trained DNNs that were trained on computer vision benchmark datasets (ImageNet), not on satellite images Marmanis16jjk .

Texture Recognition: A variety of texture recognition approaches have been proposed in the literature Liu16k ; Pietikainen16g . The work of Varma10j proposes a statistical approach to model textures based on the joint probability distribution of filter responses. The work of Chen10j proposes an approach based on Weber's law which consists of two components: differential excitation and orientation. An image is represented by the concatenation of these two components in a single representation. The work of Hussain12h introduces an approach that uses lookup-table based vector quantization for texture description. A set of low and mid-level perceptually inspired visual features is introduced by Sharan13j for texture recognition. A multi-resolution framework based on LBP is proposed by Ojala02b for rotation invariant texture recognition. As discussed earlier, LBP is one of the most successful approaches for texture recognition, with several variants existing in the literature Guo10b ; Ylioinas13h ; Ylioinas12h .

Other than LBP and its variants, bag-of-words based representations employing SIFT features and the Fisher Vector encoding scheme have shown promising results for texture recognition Cimpoi14k . Recently, deep features have also been investigated for texture recognition. Bruna and Mallat Bruna13jj introduce the wavelet convolutional scattering network (ScatNet), where no learning is required and convolutional filters are defined as wavelets. The work of Chan14jj proposes a deep network based on multistage principal component analysis (PCANet). The work of Cimpoi16j proposes to use the convolutional layers of the deep networks as dense local descriptors encoded with Fisher Vector to obtain the final image representation.

Our Approach: As discussed above, most existing hand-crafted approaches employ LBP and its variants for texture description. On the other hand, deep learning based approaches have shown promising results for texture recognition and remote sensing scene classification. Despite the success of deep features, the hand-crafted LBP texture descriptor and its variants have been shown to provide competitive performance compared to deep feature based methods, especially in the presence of rotations and several types of noise, in a recent performance evaluation for texture recognition Liu16k . Moreover, the deep features based texture recognition and remote sensing scene classification methods employ deep networks pre-trained on the ImageNet dataset using RGB images as input. This motivates us to investigate the impact of integrating texture features, in particular the popular hand-crafted LBP texture descriptor, within deep learning architectures. We investigate fusion strategies by constructing a two-stream deep architecture where texture coded mapped images are used as the second stream and fused with the normal RGB image stream. To the best of our knowledge, we are the first to investigate the two fusion strategies in a two-stream deep architecture, to combine RGB and texture streams, in the context of texture recognition and remote sensing scene classification. This paper is an extended version of our earlier work RaoICMR17 . We have extended our experiments by evaluating the proposed approach for the remote sensing scene classification application with results on four challenging benchmarks. In addition, we also provide an analysis of our two-stream deep architecture on the ImageNet dataset.

3 Binary Patterns Encoded Convolutional Neural Networks

Here, we first describe the construction of deep models based on texture coded mapped images. Afterwards, we investigate different strategies to fuse the texture coded mapped stream with the normal RGB image stream.

3.1 Mapped LBP Codes

As discussed earlier, Local Binary Patterns (LBP) has shown competitive performance for texture recognition and is one of the most commonly employed approaches for texture description. LBP features describe the neighborhood of a pixel by its binary derivatives. These binary derivatives are then used to form a short code to describe the neighborhood of the pixel. The short LBP codes are binary numbers: each bit is 0 if the neighboring value is lower than the threshold and 1 if it is higher than (or equal to) the threshold. Each LBP code can be considered as a micro-texton, since each pixel is assigned the code of the texture primitive that best matches its local neighborhood. Several local primitives are detected by the LBP operator, including flat areas, edges, corners, curves, and edge ends. The original version of the LBP operator considered only the eight neighbors of a pixel, while using the center pixel value as a threshold. Later variants extended the original LBP operator to consider circular neighborhoods with any number of pixels. Consider a gray-scale image and let g_c denote the gray value of the center pixel at coordinates (x_c, y_c) of a circular local neighborhood (P, R), where P denotes the number of sampling points and R is the circle radius of the local neighborhood, and let g_p (p = 0, ..., P-1) denote the gray values of the P sampling points. The LBP code (a P-bit word) describing the local image texture around the center pixel is computed as,

LBP_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c) \, 2^p,   (1)

where the thresholding function s(z) is defined as:

s(z) = \begin{cases} 1, & z \geq 0 \\ 0, & z < 0 \end{cases}   (2)

The standard LBP computation results in 2^P distinct values for the LBP code. In the case of an 8-pixel neighborhood, the LBP code computation results in a binary string of eight-bit numbers between 0 and 255. The final image representation is obtained by computing the histogram as a distribution of LBP codes over an entire image region. The resulting feature vector normalizes for translation and is invariant to monotonic changes in the gray scale.
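
To make the computation above concrete, the following is a minimal NumPy sketch of the basic P = 8, R = 1 operator and the resulting histogram representation (border handling is simplified and the rotation-invariant and uniform extensions are omitted; library implementations such as scikit-image's local_binary_pattern cover the general circular (P, R) case).

```python
import numpy as np

def lbp_8_1(gray):
    """Basic LBP with P = 8 sampling points at radius R = 1 (a 3x3 square
    neighbourhood for simplicity). Returns LBP codes (0..255) for all
    interior pixels; border pixels are skipped."""
    g = np.asarray(gray, dtype=np.float32)
    center = g[1:-1, 1:-1]
    # Neighbour offsets enumerated around the center pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for p, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        # Thresholding function s(z) of Eq. (2), weighted by 2^p as in Eq. (1).
        codes |= (neighbour >= center).astype(np.uint8) << np.uint8(p)
    return codes

def lbp_histogram(codes, bins=256):
    """Orderless image representation: normalised histogram of LBP codes."""
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
    return hist / max(hist.sum(), 1)
```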

As discussed above, the LBP codes are generally pooled into histogram representations and employed as input to a discriminative classifier, such as Support Vector Machines (SVMs). Instead, due to the overwhelming recent success of deep learning, it is worth investigating how to integrate the strengths of the LBP descriptor within CNN architectures. A straightforward integration strategy is to train deep models by directly using LBP codes as CNN inputs. However, such a strategy is not applicable since the convolution operations, equivalent to a weighted average of the input values, performed within CNN models are unsuitable for the unordered nature of the LBP code values.

Recently, the work of Levi15c provides a solution to this problem within the context of texture description for emotion recognition. They propose to map the LBP codes to points in a 3D metric space in which the Euclidean distance approximates the distance between the LBP codes. After this transformation, the LBP codes can be averaged together using convolution operations within CNN models.

The method is based on defining a distance between two LBP codes c_i and c_j. The authors of Levi15c choose the Earth Mover's Distance (EMD) Rubner0jj because it accounts for both the differing bit values and their locations. Having defined the distance between the LBP codes, it is possible to look for a mapping of the LBP codes into a d-dimensional space which approximately preserves this distance. This mapping can be found by applying Multi-Dimensional Scaling (MDS) BorgGroenen2005 , such that:

\| \sigma(c_i) - \sigma(c_j) \|_2 \approx EMD(c_i, c_j), \quad \forall i, j,   (3)

where \sigma(c_i) is the mapping of code c_i into the d-dimensional space. Applying this mapping allows us to transfer the LBP codes into a representation which can be used as input to a CNN. In Levi15c they experimented with the optimal dimensionality and found that good results were obtained with d = 3. In this work, we use the same setting and in addition investigate an early fusion scheme based on a single mapped channel. We refer to Levi15c for more details. Figure 1 shows an example image converted to LBP codes (middle). The LBP codes are mapped to a 3D metric space (right) and normalized before being used as input to CNNs.
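
In practice the mapping reduces to a 256 x 3 lookup table that is applied per pixel. The sketch below illustrates the procedure under simplifying assumptions: a crude 1-D EMD proxy stands in for the approximated EMD of Levi15c, and scikit-learn's MDS is used as the MDS solver, so the resulting embedding is only indicative of the pipeline, not a reproduction of it.

```python
import numpy as np
from sklearn.manifold import MDS

def code_distribution(code, bits=8):
    """View an LBP code as a distribution of its set bits over bit positions."""
    b = np.array([(code >> p) & 1 for p in range(bits)], dtype=np.float64)
    return b / b.sum() if b.sum() > 0 else np.full(bits, 1.0 / bits)

def emd_1d(p, q):
    """1-D Earth Mover's Distance via cumulative sums (a linear, non-circular
    stand-in for the approximated EMD used in Levi15c)."""
    return float(np.abs(np.cumsum(p - q)).sum())

# 256 x 256 code-to-code dissimilarity matrix.
dists = np.stack([code_distribution(c) for c in range(256)])
D = np.array([[emd_1d(dists[i], dists[j]) for j in range(256)]
              for i in range(256)])

# Metric MDS embeds the 256 codes into a 3-D space whose Euclidean
# distances approximate the dissimilarities (Eq. 3 with d = 3).
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
code_embedding = mds.fit_transform(D)              # shape (256, 3)

def map_lbp_image(lbp_codes):
    """Convert an H x W array of LBP codes into a 3-channel mapped image."""
    return code_embedding[lbp_codes]               # shape (H, W, 3)
```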

Figure 1: Example of the texture coded mapped image (visualized here in color). The mapped LBP image is obtained by converting LBP codes (shown as grayscale values) into a 3D metric space.

3.2 Texture Coded Two-Stream Deep Architecture

As described earlier, the de facto standard when training deep models is to use the raw RGB pixel values of an image as input. These RGB based deep networks have recently achieved state-of-the-art results for texture recognition Cimpoi16j and remote sensing scene classification Fan15jk ; Penatti15j . In this work, we investigate to what extent texture coded deep networks complement the standard RGB based CNN models in two classification problems: texture recognition and remote sensing scene classification. To this end, we design a two-stream deep architecture, referred to as TEX-Nets, using both texture coded mapped images (section 3.1) and raw RGB pixel values. Our TEX-Nets models are trained on the ImageNet ILSVRC-2012 dataset Deng09a . We employ two different architectures to validate our approach: the VGG-M architecture Chatfield14h , which is similar to the Zeiler and Fergus network Zeiler14c , and the ResNet architecture Kaimingcvpr15 . The VGG-M network comprises five convolutional and three fully-connected (FC) layers and takes as input an image of 224 × 224 pixels. The first convolutional layer employs a smaller stride (1) and receptive field (filter size). The second convolutional layer uses a relatively larger stride (2 compared to 1). The number of convolution filters is 96 in the first convolutional layer, 256 in the second convolutional layer and 512 in the third and last convolutional layers. During training, the learning rate is set to , the weight decay, which acts as a regularizer and helps reduce the training error of the model, is set to , and the momentum rate, associated with the gradient descent algorithm used to minimize the objective function, is set to . We also employ the ResNet-50 architecture Kaimingcvpr15 , which is a 50-layer Residual Network. This architecture is based on a residual learning framework that facilitates efficient training of deeper networks by reformulating the layers as learning residual functions with reference to the layer inputs. The ResNet-50 architecture takes as input an image of 224 × 224 pixels. For the first 30 training iterations, the learning rate is set to . For the second and the last 30 training iterations, the learning rate is set to and respectively. The momentum and the weight decay are set to and respectively.

Next, we investigate strategies to fuse the two network streams at different stages in the deep architectures.

Late Fusion: In this strategy, both standard (RGB) and texture coded network streams are trained separately on the ImageNet dataset. The standard RGB network stream takes RGB values as input, whereas the second network stream takes texture coded mapped images as input. These texture coded mapped images are obtained by first employing the LBP encoding that converts intensity values in an image to one of the 256 LBP code values. The LBP code values are then mapped into a 3D metric space (section 3.1). The resulting 3-channel texture coded mapped images are then used as input to CNN models. Despite being efficient to compute, the texture coded mapped images still introduce a bottleneck if done on-the-fly. We therefore pre-compute these texture coded mapped images before training the deep network. Once separately trained, the RGB and texture coded network streams are combined at a later stage by fusing them in the FC layers in the VGG-M architecture. In case of ResNet architecture, late fusion is performed before the softmax loss. The two-stream late fusion strategy has been previously used in action recognition to combine spatial (RGB) and temporal (flow) information Simonyan14k ; Cheron15k .
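
A minimal PyTorch sketch of the late fusion wiring is shown below, with torchvision ResNet-50 backbones as stand-ins for the networks used in this work; note that in our pipeline the two streams are first trained separately, whereas here the fused model is written end-to-end for brevity, and all class and variable names are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class LateFusionTexNet(nn.Module):
    """Two-stream late fusion: one stream sees the RGB image, the other the
    3-channel mapped LBP image; their features are concatenated before a
    joint classifier (i.e. fusion before the softmax loss)."""

    def __init__(self, num_classes=1000):
        super().__init__()
        self.rgb_stream = resnet50()
        self.tex_stream = resnet50()
        feat_dim = self.rgb_stream.fc.in_features     # 2048 for ResNet-50
        self.rgb_stream.fc = nn.Identity()            # keep streams as feature extractors
        self.tex_stream.fc = nn.Identity()
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb, tex):
        fused = torch.cat([self.rgb_stream(rgb), self.tex_stream(tex)], dim=1)
        return self.classifier(fused)

model = LateFusionTexNet(num_classes=1000)
rgb = torch.randn(2, 3, 224, 224)      # RGB batch
tex = torch.randn(2, 3, 224, 224)      # mapped LBP batch
logits = model(rgb, tex)               # shape (2, 1000)
```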

Figure 2: Two-stream deep fusion VGG-M architectures. The left example shows late fusion architecture where the deep models trained using RGB and texture coded mapped images are kept separately. The point of fusion, to combine the two network towers, is in the FC layer. The right example shows early fusion architecture where the point of fusion is the input to the network. As a result, a joint deep model is trained by aggregating the RGB and texture image channels as an input to the network.

Early Fusion: Other than late fusion, we also investigate an alternative strategy, termed early fusion, where the point of fusion is the input to the network. In the early fusion based two-stream network architecture, a joint deep model is trained by aggregating the RGB and texture coded mapped image channels as an input to the deep network. As a result, the input to the CNN is an image of 224 × 224 × 6 dimensions. We employ the same early fusion strategy for both the VGG-M and ResNet architectures. We also investigated converting the 3-channel mapped coded images into a single channel and combining it with the three RGB channels, resulting in a 4-channel input. In both networks, the filters are learned jointly on the RGB and texture coded images. Figure 2 shows both the early and late fusion based two-stream deep fusion VGG-M architectures designed to combine the color and texture image streams.
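
For comparison, a corresponding sketch of the early fusion variant (again with a ResNet-50 stand-in) only requires widening the first convolutional layer to accept the stacked 6-channel input.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Early fusion: a single network whose input stacks the RGB image and the
# 3-channel mapped LBP image into one 6-channel tensor. Only the first
# convolution changes to accept 6 input channels.
model = resnet50(num_classes=1000)
model.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)

rgb = torch.randn(2, 3, 224, 224)
tex = torch.randn(2, 3, 224, 224)
x = torch.cat([rgb, tex], dim=1)       # shape (2, 6, 224, 224)
logits = model(x)                      # shape (2, 1000)
```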

Method Architecture Channels Top-1 Error (%) Top-5 Error (%)
Standard RGB (Baseline) VGG-M 3 37.6 15.9
TEX-Net Standard VGG-M 3 45.8 21.9
TEX-Net-EF-6ch VGG-M 6 39.3 17.7
TEX-Net-EF-4ch VGG-M 4 37.1 15.5
TEX-Net-LF VGG-M 6 34.4 13.8
Standard RGB (Baseline) ResNet 3 25.4 8.0
TEX-Net-LF ResNet 6 23.7 7.0
Table 1: Classification performance comparison of our two-stream deep TEX-Net architectures with the standard RGB network on the ImageNet ILSVRC 2012 validation data. In case of VGG-M architecture, we show comparison with both early and late fusion TEX-Net models: early fusion architecture aggregating the RGB and 3 mapped coded channels (TEX-Net-EF-6ch), early fusion architecture aggregating the RGB and a single mapped coded channel (TEX-Net-EF-4ch) and the late fusion architecture (TEX-Net-LF) combining separate streams of RGB and texture networks. We also show results based on only mapped coded images (TEX-Net Standard), without color information. In case of ResNet architecture, we show the comparison between our late fusion approach and the standard RGB network.

Figure 3: Object categories in the ImageNet dataset where our late fusion two-stream deep architecture provides significant reduction in the top-5 error compared to the baseline standard RGB deep network. On the left, we show the comparison (in top-5 error) and on the right, we show example images from these object categories (left to right). Both approaches are based on VGG-M architecture.

Training TEX-Nets on ImageNet: As described earlier, we train our TEX-Nets from scratch on the ImageNet 2012 dataset employed in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). The dataset consists of 1000 object classes and 1.2 million training images, 50,000 validation images, and 100,000 test images. On this dataset, the results are measured by top-1 and top-5 error rates. The error rates are computed from the predicted class probability distributions output by the deep network. The top-5 error is the fraction of test images for which the true label is not among the five labels (the 5 predictions with the highest probabilities) considered most probable by the deep model. The top-1 error is computed by evaluating if the top class (the one having the highest confidence) is the same as the correct (target) label. Table 1 shows the classification performance comparison, based on the VGG-M architecture, of our early and late fusion based two-stream deep TEX-Net architectures with the standard RGB deep network on the ILSVRC 2012 dataset. The standard baseline RGB network achieves top-1 and top-5 errors of 37.6% and 15.9% respectively. Our late fusion deep architecture significantly reduces the error, with an absolute reduction of 2.1% in the top-5 error compared to the standard RGB network. The late fusion architecture increases the number of network parameters by a factor of 1.4 compared to the standard RGB network. We therefore also train a six-channel early fusion network whose depth is increased by a factor of 1.4, resulting in the same number of parameters as late fusion. This improves the results for the six-channel early fusion architecture. However, it still provides inferior results (35.3 top-1 error) compared to the late fusion architecture (34.4 top-1 error).
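
For reference, the top-1 and top-5 error rates defined above can be computed from the network outputs as in the following PyTorch sketch (illustrative only; the tensor shapes, batch size and labels are placeholders).

```python
import torch

def topk_error(logits, targets, k=5):
    """Fraction of samples whose true label is not among the k classes with
    the highest predicted probability (top-1 error for k = 1)."""
    topk_preds = logits.topk(k, dim=1).indices             # (N, k) predicted labels
    correct = (topk_preds == targets.unsqueeze(1)).any(dim=1)
    return 1.0 - correct.float().mean().item()

logits = torch.randn(8, 1000)                # dummy network outputs
targets = torch.randint(0, 1000, (8,))       # dummy ground-truth labels
top1 = topk_error(logits, targets, k=1)
top5 = topk_error(logits, targets, k=5)
```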

We further validate the effectiveness of our late fusion two-stream approach by employing the ResNet-50 architecture. Table 1 shows the classification performance comparison of our approach and the standard RGB network. Both networks are trained from scratch on the ImageNet dataset. The standard baseline RGB network achieves top-1 and top-5 errors of 25.4% and 8.0% respectively. Our late fusion based TEX-Net ResNet architecture (TEX-Net-LF) reduces the error, with top-1 and top-5 errors of 23.7% and 7.0% respectively.

Figure 3 shows 20 object categories from the ImageNet dataset where our late fusion two-stream deep architecture provides the largest reduction in the top-5 error rate, compared to the standard RGB deep network. For the majority of the depicted classes it is likely that a good texture representation is crucial for correct classification. Consequently, the aforementioned results suggest that our late fusion two-stream deep architecture provides superior results compared to both standard RGB and early fusion.

4 Experimental Results

Here, we start by evaluating our TEX-Net deep models for the texture recognition problem. We then provide a comparison of our approach with the standard RGB based deep network in the remote sensing scene classification task. Finally, we compare the performance of our approach with state-of-the-art remote sensing scene classification results reported in literature.

Figure 4: Example images from the four texture datasets: DTD, KTH-TIPS-2a, KTH-TIPS-2b and Texture-10.

4.1 Texture Recognition

We evaluate our approach by performing experiments on four challenging texture datasets: DTD, KTH-TIPS-2a, KTH-TIPS-2b and Texture-10. Figure 4 shows example images from the four texture datasets.

DTD: The DTD dataset consists of 5640 images from 47 texture classes, collected from the web. Each texture class consists of 120 images with the dataset equally divided into training, validation and test. The training and test splits are provided by the authors.

KTH-TIPS-2a: The KTH-TIPS-2a dataset consists of 11 texture classes. The 4752 images are captured at 9 different scales, 3 poses and 4 different illumination conditions. Similar to previous works Chen10j ; Caputo05h ; Sharma12h , average classification performance is reported over the 4 test runs. In each run, images from 1 sample are used for testing while the images from the remaining 3 samples are used as a training set.

KTH-TIPS-2b: The KTH-TIPS-2b dataset consists of 11 texture categories. Here, images from 1 sample are used for training while all the images from remaining 3 samples are used for testing in each test run.

Texture-10: The Texture-10 dataset consists of 400 images of 10 different texture categories. For each texture category, 25 images are used for training and 15 images are used for testing.

Experimental Setup: As discussed earlier, both the TEX-Net networks and the standard RGB deep network are trained from scratch on the ImageNet 2012 training set. The deep models are trained by employing the MatConvNet library Vedaldi15k . We evaluate our VGG-M architecture based deep models, pre-trained on ImageNet, as feature extractors on the texture datasets. We therefore remove the last fully-connected layer (FC8) of the VGG-M networks, which performs 1000-way ImageNet (ILSVRC) classification, and instead use the 4096-dimensional activations from the FC7 (second last) layer as image features. The resulting image features are L2-normalised and input to a linear SVM classifier. Throughout our experiments, we fixed the weights (no fine-tuning) of all the pre-trained deep VGG-M networks for fair comparison. In all cases (datasets), the results are reported as average recognition accuracy over all texture categories in a texture dataset. The classification is performed by employing one-versus-all SVMs with a linear kernel. The category label from the classifier providing the highest confidence is assigned to the test instance. The overall classification results are then obtained by calculating the average of the classification scores of all texture classes in a dataset.
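
A small scikit-learn sketch of this classification stage is given below; the FC7 features and labels are randomly generated placeholders, and the SVM regularization parameter C is an assumed value, not the setting used in our experiments.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# Hypothetical pre-extracted 4096-D FC7 activations and texture labels.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 4096))
y_train = rng.integers(0, 10, 200)
X_test = rng.standard_normal((50, 4096))

# L2-normalise the deep features, then train one-vs-rest linear SVMs;
# the class whose SVM gives the highest score is assigned to each test image.
clf = LinearSVC(C=1.0)                       # C = 1.0 is an assumed value
clf.fit(normalize(X_train, norm="l2"), y_train)
predicted_labels = clf.predict(normalize(X_test, norm="l2"))
```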

In the case of the ResNet architecture, we fine-tuned both the standard RGB and our late fusion two-stream model to perform classification in an end-to-end fashion. For fine-tuning on each dataset, we use the training samples with a batch size of 80 and a momentum value of . The learning rate is set to .

4.2 Baseline Comparison

We compare our TEX-Net deep models with the standard RGB based CNN approach to validate whether RGB and texture coded mapped images contain complementary information. We further evaluate both the early and late fusion two-stream deep architectures (section 3.2) for combining texture and color information. For fair comparison, we use the same network architecture (VGG-M) together with the same set of parameters for all the deep models. Table 2 shows the baseline comparison on the four texture datasets. In the case of the VGG-M architecture, the standard RGB deep network provides a mean accuracy of 63.4% on the DTD dataset. The two early fusion based two-stream deep architectures (TEX-Net-EF-6ch and TEX-Net-EF-4ch) slightly improve the accuracy over the standard RGB network, with mean classification scores of 64.0% and 64.6% respectively. The image representation based on the TEX-Net standard model provides a classification score of 55.9%. On this dataset, the best results are obtained with our late fusion based two-stream deep ResNet architecture. On the KTH-TIPS-2a dataset, the standard RGB deep network provides a mean classification rate of 81.8%. Our TEX-Net standard model based on texture coded mapped images provides a classification score of 68.6%. The two early fusion based two-stream deep architectures (TEX-Net-EF-6ch and TEX-Net-EF-4ch) provide a slight improvement in performance over the standard RGB network, with mean recognition scores of 82.6% and 83.4% respectively. Our late fusion based two-stream deep architecture achieves a mean classification rate of 85.3%. When using the ResNet architecture, our late fusion approach provides superior results compared to the standard RGB network.

Method Architecture DTD KTH-TIPS-2a KTH-TIPS-2b Texture-10
Standard RGB VGG-M 63.4 81.8 72.9 87.3
TEX-Net Standard VGG-M 55.9 68.6 60.2 81.7
TEX-Net-EF-6ch VGG-M 64.0 82.6 73.6 89.1
TEX-Net-EF-4ch VGG-M 64.6 83.4 73.8 89.3
TEX-Net-LF VGG-M 68.2 85.3 75.5 91.3
Standard RGB ResNet 69.6 83.3 75.2 90.1
TEX-Net-LF ResNet 73.6 88.3 78.0 92.3
Table 2: Comparison (in %) of our approaches with the standard RGB deep network on four texture datasets. In the case of the VGG-M architecture, we show comparisons with our different TEX-Net models: based on only mapped coded images (TEX-Net Standard), early fusion two-stream architectures combining either the RGB and 3 mapped coded channels (TEX-Net-EF-6ch) or the RGB and a single mapped coded channel (TEX-Net-EF-4ch), and the late fusion architecture (TEX-Net-LF) combining the standard RGB and TEX-Net standard networks. In the case of the ResNet architecture, we show the comparison between our late fusion approach and the standard RGB network. For both VGG-M and ResNet architectures, our late fusion approach always outperforms the corresponding baseline standard RGB network.

Figure 5: On the left, visualization of filter weights from the RGB and TEX-Net VGG-M model with mapped coded texture information respectively. On the right, visualization of activations with highest energy from the conv3 layer of RGB (top row) and TEX-Net (bottom row) networks on an example texture image. The TEX-Net model is trained on the texture coded mapped images (visualized here in color), obtained by converting LBP codes into a 3D metric space. In both cases, the models are based on VGG-M architecture.

In the case of the VGG-M architecture, the baseline RGB deep network provides a mean accuracy of 72.9% on the KTH-TIPS-2b dataset. The two early fusion based two-stream deep architectures (TEX-Net-EF-6ch and TEX-Net-EF-4ch) achieve mean classification scores of 73.6% and 73.8% respectively. When using the ResNet architecture, our late fusion based deep network provides a gain of 2.8% over the standard RGB network. Finally, on the Texture-10 dataset, the standard RGB deep network achieves a mean classification score of 87.3% with the VGG-M architecture. Our late fusion based two-stream deep VGG-M architecture obtains a mean accuracy of 91.3%, leading to a gain of 4.0% compared to the standard RGB VGG-M network. The best results are obtained using our late fusion approach with the ResNet architecture. Figure 5 shows a VGG-M architecture based visualization of filter weights (on the left) from the RGB and TEX-Net models respectively and a visualization of activations (on the right) with the highest energy from the conv3 layer of the RGB (top row) and TEX-Net (bottom row) networks on an example texture image. In conclusion, the results suggest a robust description of texture features with the proposed approach, which we then apply to remote sensing benchmark datasets.

4.3 Remote Sensing Scene Classification

We evaluate our approach by performing experiments on four challenging remote sensing scene classification datasets: UC-Merced, WHU-RS19, RSSCN7 and the recently introduced AID.

Figure 6: Example images from the four remote sensing scene classification datasets from top to bottom: UC-Merced, WHU-RS19, RSSCN7 and the recently introduced AID.

UC-Merced is a publicly available dataset Yang10j consisting of 2100 aerial scene images with pixel resolution of one foot, downloaded from the United States Geological Survey (USGS) National Map. The images were downloaded from 20 regions across the USA: Buffalo, Boston, Birmingham, Columbus, Dallas, Houston, Harrisburg, Jacksonville, Las Vegas, Los Angeles, Miami, New York, Napa, Reno, San Diego, Santa Barbara, Seattle, Tampa, Tucson, and Ventura. The images in the dataset are cropped into 256 × 256 pixels, equally divided into 21 classes: agriculture, airplane, baseball diamond, beach, buildings, chaparral (shrubland / heathland), dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. The dataset is challenging with a variety of spatial land-use patterns with a significant overlap among several categories, such as medium residential, sparse residential and dense residential. These overlapping categories only differ in the density of structures.

WHU-RS19 is a publicly available dataset Sheng12j consisting of 950 high spatial resolution aerial images collected from Google Earth imagery. The images in the dataset are of size 600 × 600 pixels, with 50 samples per category, equally divided into 19 scene classes: airport, beach, bridge, river, forest, meadow, pond, parking, port, viaduct, residential area, industrial area, commercial area, desert, farmland, football field, mountain, park, and railway station. The dataset is challenging since images within each scene class are collected from different regions around the world with scale variations and different lighting conditions.

RSSCN7 is a publicly available dataset Qin15j , released in 2015, consisting of 2800 aerial scene images. The images are divided into 7 scene classes: grassland forest, farmland, parking lot, residential region, industrial region, river, and lake. Each scene class comprises 400 images, where each image has a size of 400 × 400 pixels. The dataset is challenging since images in each category are sampled at four different scales with different imaging angles.

AID is a recently introduced publicly available large-scale aerial image dataset Xia177 . The dataset consists of 10000 images and 30 aerial scene categories: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks and viaduct. Unlike other aerial scene datasets, such as the UC-Merced dataset, the images in the AID dataset are collected from Google Earth imagery using different remote imaging sensors. The dataset is challenging since images in each scene category are collected from different countries around the world including China, USA, UK, France, Italy, and Germany. Further, the images are collected under varying imaging conditions (time and seasons), thereby further complicating the task of aerial scene classification. Figure 6 shows example images from the four remote sensing scene classification datasets.

Experimental Setup: We follow the standard protocol Xia177 to evaluate our approach on the benchmark datasets. The performance is measured in terms of mean classification accuracy over all scene categories in a dataset. The classification accuracy is computed as A = N_c / N_t, where N_c is the number of correct predictions (images) in the test set and N_t is the total number of samples (images) in the test set. To compute the accuracy, each dataset is randomly split into training and test sets for evaluation. The evaluation procedure is then repeated ten times for a reliable performance comparison. The final results are reported as the mean and standard deviation over the ten runs. Following Xia177 , in the case of the UC-Merced dataset, the ratio of training to test images was set to 50:50 and 80:20 respectively, with the images randomly selected for each category. In the case of WHU-RS19, the ratio of training to test samples was set to 40:60 and 60:40 respectively. In the case of the RSSCN7 and AID datasets, the ratio of the training set was fixed to 20% and 50% per class respectively. As in texture recognition (section 4.1), we use the 4096-dimensional activations from the FC7 (second last) layer as image features, where the resulting image features are L2-normalised and input to a linear SVM classifier. Additionally, we fine-tuned both our late fusion based approach and the standard RGB ResNet architecture to perform end-to-end remote sensing scene classification. For fine-tuning the ResNet models, we used the same parameter settings as in the texture recognition experiments.
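
The split-and-average evaluation protocol described above can be sketched as follows on pre-extracted features; the stratified splitting and the linear SVM are simplifying stand-ins for the full pipeline, and all parameter values are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

def evaluate_splits(features, labels, train_ratio=0.2, runs=10, seed=0):
    """Repeat a random train/test split `runs` times and report the mean and
    standard deviation of the overall accuracy A = N_c / N_t."""
    accuracies = []
    for r in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, labels, train_size=train_ratio,
            stratify=labels, random_state=seed + r)
        clf = LinearSVC().fit(normalize(X_tr), y_tr)
        n_correct = int((clf.predict(normalize(X_te)) == y_te).sum())
        accuracies.append(n_correct / len(y_te))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```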

Method Architecture UC-Merced (50) UC-Merced (80) WHU-RS19 (40) WHU-RS19 (60) RSSCN7 (20) RSSCN7 (50) AID (20) AID (50)
Standard RGB VGG-M 94.13 95.40 96.01 96.57 86.0 88.8 87.70 90.29
TEX-Net Standard VGG-M 91.25 92.91 92.41 94.53 83.64 86.30 82.0 85.25
TEX-Net-EF-6ch VGG-M 94.36 95.27 94.71 96.0 85.65 88.70 86.84 89.68
TEX-Net-EF-4ch VGG-M 94.22 95.31 95.78 96.40 86.77 89.61 87.32 90.0
TEX-Net-LF VGG-M 95.89 96.62 97.61 98.0 88.61 91.25 90.87 92.96
Standard RGB ResNet 96.22 96.80 97.83 98.24 90.23 93.12 92.33 94.91
TEX-Net-LF ResNet 96.91 97.72 98.48 98.88 92.45 94.0 93.81 95.73
Table 3: Baseline comparison of our TEX-Net models (overall accuracy (OA) in %) with the standard RGB network on the UC-Merced, WHU-RS19, RSSCN7 and AID datasets. Our late fusion based two-stream deep ResNet architecture always outperforms the standard baseline RGB deep ResNet.

4.4 Baseline Comparison

Table 3 shows the baseline comparison on the four remote sensing scene classification datasets. In the case of the VGG-M architecture, the two early fusion based two-stream architectures provide slightly inferior performance compared to the baseline RGB network. As in texture recognition, the best results are obtained when using our late fusion based two-stream deep architecture, which provides consistent improvements over the baseline standard RGB deep network for both the VGG-M and ResNet architectures. A large gain in classification accuracy is achieved on the RSSCN7 and the large scale AID datasets. The RSSCN7 dataset comprises several natural scene categories, such as grassland forest and farmland, where texture features provide valuable complementary information to color features when other spectral channels besides RGB (like Near-Infrared) are not available. Similarly, the recently introduced large scale AID dataset consists of both natural scene types (farmland and forest) and man-made scene categories (medium residential, sparse residential and school). Our late fusion approach achieves favorable results compared to the baseline RGB deep network. Figure 7 shows the per-class classification performance comparison of our late fusion approach with the baseline RGB deep network, when using the VGG-M architecture. Our approach provides consistent improvements in performance on most scene categories.

Figure 7: Per-category performance comparison of our approach compared to the baseline RGB deep network on the AID dataset. Both the networks are based on the VGG-M architecture. Our approach improves the classification performance on most scene categories.

In the seminal work of Xia177 , it was shown that among different mid-level methods, the SIFT descriptors with the Improved Fisher Vector (IFK-SIFT) encoding provide improved results for remote sensing scene classification. Table 4 shows the comparison of our late fusion two-stream deep ResNet architecture with the best mid-level method, IFK-SIFT, and several existing high-level deep methods: the shallow CaffeNet and the very deep VGG-VD-16 and GoogleNet. All the baseline results are taken from Xia177 . The high-level deep feature approaches obtain consistently improved performance compared to the best mid-level method IFK-SIFT. Despite having only 8 layers, CaffeNet achieves competitive performance compared to the very deep VGG-VD-16 and GoogleNet. Our late fusion based two-stream ResNet architecture provides a consistent gain in performance compared to the existing high-level deep methods on all four datasets. In particular, a large gain in performance is achieved on the RSSCN7 and AID datasets. On the RSSCN7 dataset (20:80 training and test set ratio), the best mid-level method (IFK-SIFT) yields a mean recognition rate of 81.08%. The existing high-level deep methods, CaffeNet, VGG-VD-16 and GoogleNet, provide mean classification scores of 85.57%, 83.98% and 82.55% respectively. Our approach achieves a mean classification rate of 92.45%, outperforming the best existing deep feature methods. A similar gain of 5.8% in mean accuracy is achieved, compared to the best existing method, with the 50:50 training and test set ratio on this dataset. On the recently introduced AID dataset (20:80 training and test set ratio), the best mid-level method (IFK-SIFT) provides a mean recognition rate of 71.92%. The existing high-level deep methods, CaffeNet, VGG-VD-16 and GoogleNet, provide mean recognition rates of 86.86%, 86.59% and 83.44% respectively. Our approach provides superior performance compared to existing methods. Furthermore, a gain of 6.1% is obtained compared to the best existing deep feature method, with the 50:50 training and test set ratio on this dataset.

Method UC-Merced (50) UC-Merced (80) WHU-RS19 (40) WHU-RS19 (60) RSSCN7 (20) RSSCN7 (50) AID (20) AID (50)
IFK-SIFT 78.74 83.02 83.35 87.42 81.08 85.09 71.92 78.99
CaffeNet 93.98 95.02 95.11 96.24 85.57 88.25 86.86 89.53
VGG-VD-16 94.14 95.21 95.44 96.05 83.98 87.18 86.59 89.64
GoogleNet 92.70 94.31 93.12 94.71 82.55 85.84 83.44 86.39
Ours 96.91 97.72 98.48 98.88 92.45 94.0 93.81 95.73
Table 4: Comparison of our late fusion ResNet based approach (overall accuracy (OA) in %) with the best mid-level method, SIFT descriptors with Improved Fisher Vector (IFK-SIFT) encoding, and the existing high-level deep methods, CaffeNet, VGG-VD-16 and GoogleNet, on the UC-Merced, WHU-RS19, RSSCN7 and AID datasets. Our approach provides consistently improved accuracy compared to both the mid-level method and the high-level deep methods on all datasets.
Method UC-Merced WHU-RS19 RSSCN7 AID
BOVW + spatial co-occurrence kernel Yang10j 77.70 - - -
Color Gabor Yang10j 80.50 - - -
SPCK + SPM Yang11j 77.40 - - -
Structural texture similarity Risojevic11j 86.0 - - -
Wavelet BOVW Lijun14j 87.40 - - -
Unsupervised feature learning Cheriyadat14j 81.10 - - -
Saliency-guided feature learning Zhang15jj 82.70 - - -
Concentric circle-structured BOVW Lijun14jj 86.60 - - -
Multifeature concatenation Shao13j 89.50 - - -
Pyramid-of-spatial-relations Shizhi15jk 89.10 - - -
CLBP Chen16jk 85.50 - - -
MS-CLBP Chen16jk 90.60 - - -
HHCV Hang16jk 91.80 - 86.40 -
DBN based feature selection Qin15j - - 77.0 -
Dirichlet Kobayashi14j 92.80 - - -
VLAT Negrel14j 94.30 - - -
Deep CNN Transfer (Scenario I: FC features) Fan15jk 96.88 96.71 - -
Deep CNN Transfer (Scenario II: Conv features) Fan15jk 96.90 98.64 - -
Deep Filter Banks Hang16jjk 92.70 - 90.40 -
Class-Specific Codebook + Two-Step Classification Yan17jk 93.80 93.70 - -
CaffeNet Xia177 95.02 94.80 88.25 89.53
VGG-VD-16 Xia177 95.21 95.10 87.18 89.64
GoogleNet Xia177 94.31 92.92 85.84 86.39
This paper 97.72 98.20 94.0 95.70
Table 5: Comparison (overall accuracy in %) with the state-of-the-art approaches. Our approach provides a consistent improvement over the state-of-the-art on three datasets. Most notably, a significant gain of 6.1% is obtained, compared to the state-of-the-art, on the large scale AID dataset. Note that on the WHU-RS19 dataset, Deep CNN Transfer (Scenario II) Fan15jk achieves 98.64% by employing VLAD encoding on the Conv layer features from the VGG-VD-16. On the other hand, we do not employ any encoding scheme with the deep network.

4.5 State-of-the-art Comparison

Finally, we provide a comparison with the state-of-the-art approaches in the literature. Our final image representation is the late fusion two-stream ResNet architecture. Table 5 shows the comparison with the state-of-the-art methods in the literature. We follow the same sampling settings as Yang10j ; Sheng12j ; Fan15jk for fair comparisons, by taking 80 samples per class for training in the case of UC-Merced and 30 samples per class for training in the case of the WHU-RS19 dataset. In the case of the RSSCN7 and AID datasets, we use 50% of the samples per class for training. On the UC-Merced dataset, the approach of Yang10j integrating the spatial co-occurrence kernel within the bag-of-visual-words (BOVW) framework achieves a mean recognition rate of 77.70%. They also investigate integrating color information within Gabor features, leading to a mean accuracy of 80.50%. The work of Yang11j obtains a classification accuracy of 77.40% with a spatial pyramid co-occurrence based image representation that accounts for both the photometric and geometric aspects of an image. Several approaches Risojevic11j ; Chen16jk ; Lijun14j aim to exploit texture information. Among these approaches, the multi-scale completed LBP feature provides superior performance with a mean recognition rate of 90.60%. A considerable gain in performance on this dataset can be observed with the use of deep feature based methods. The deep filter banks based approach of Hang16jjk achieves an accuracy of 92.70%. Transferring deep CNN features from the FC layer of the deep network (Deep CNN Transfer Scenario I: FC features) Fan15jk obtains a mean classification accuracy of 96.88%. Transferring deep CNNs from the convolutional layers of the deep network encoded with the VLAD scheme (Scenario II: Conv features) achieves a recognition rate of 96.90%. Our approach achieves improved results (97.72%) on this dataset.

On the WHU-RS19 dataset, the recently introduced improved class-specific codebook using a kernel collaborative representation based classification framework Yan17jk achieves a mean accuracy of 93.70%. The CaffeNet and the very deep VGG-VD-16 and GoogleNet provide mean recognition rates of 94.80%, 95.10% and 92.92% respectively. Transferring deep CNN features from the FC layer of the deep network Fan15jk obtains a mean classification accuracy of 96.71%. Our approach achieves favorable results compared to existing methods. On this dataset, the best results (98.64%) are obtained when transferring deep CNNs from the convolutional layers of the deep network encoded using the VLAD scheme. It is worth mentioning that our approach is complementary to the (Scenario II: Conv features) method Fan15jk and combining the two approaches can be expected to provide a further gain in classification performance.

On the RSSCN7 dataset, the deep learning based feature selection approach (DBN) Qin15j achieves a mean recognition rate of 77.0%. The hierarchical coding vectors based classification approach Hang16jk achieves a classification result of 86.40%. The deep filter banks approach Hang16jjk provides a classification performance of 90.40%. Our approach outperforms the best existing method (deep filter banks) with a mean classification accuracy of 94.0%. Finally, on the recently introduced AID dataset, the CaffeNet and the very deep VGG-VD-16 and GoogleNet methods provide mean recognition rates of 89.53%, 89.64% and 86.39% respectively. Our approach achieves the best results on this dataset with a mean classification accuracy of 95.70%.

5 Conclusions

In this paper, we address the problem of learning a robust texture description within deep learning architectures for texture recognition and remote sensing scene classification. We design deep models by constructing a two-stream deep architecture where texture coded mapped images are used as a second stream and fused with the standard RGB stream. Furthermore, we investigate two fusion strategies, early and late fusion, to combine the RGB and texture streams in our two-stream deep architecture. Experiments are conducted on several benchmark texture recognition and remote sensing scene classification datasets. Our results clearly demonstrate that the proposed late fusion two-stream deep architecture always improves the overall performance compared to the standard RGB stream deep network architecture for both recognition tasks. Further, our final combination leads to improved results compared to the state-of-the-art for remote sensing scene classification. In this work, we investigated Local Binary Patterns (LBP) encoded CNNs and different deep network fusion architectures; future work involves investigating alternative texture description techniques and fusion strategies for texture coded deep CNNs. Another future direction is to train and test the proposed approach on actual full-sized satellite images containing all available spectral bands besides RGB (e.g. Near-Infrared).

Acknowledgements

This work has been funded by the Spanish project TIN2016-79717-R; the CHISTERA project M2CR (PCIN2015-251); SSF through a grant for the project SymbiCloud; the VR starting grant (2016-05543); the Strategic Area for ICT research ELLIIT; the CENIIT grant (18.14); the project AIROBEST (317387, 317388) funded by the Academy of Finland; and the project MegaMrt2, funded by the Electronic Component Systems for European Leadership (ECSEL) Joint Undertaking (grant agreement No. 737494) of the Horizon 2020 European Union funding programme. We acknowledge the computational resources provided by the Aalto Science-IT project and the CSC IT Center for Science, Finland. We also acknowledge the computational support from Nvidia and the NSC.

References

  • (1) T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, PAMI 24 (7) (2002) 971–987.
  • (2) J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, Local features and kernels for classification of texture and object categories: A comprehensive study, IJCV 73 (2) (2007) 213–218.
  • (3) M. Varma, A. Zisserman, A statistical approach to texture classification from single images, IJCV 32 (9) (2010) 1705–1720.
  • (4) L. Liu, P. Fieguth, X. Wang, M. Pietikainen, D. Hu, Evaluation of lbp and deep texture descriptors with a new robustness benchmark, in: ECCV, 2016.
  • (5) L. Liu, P. Fieguth, Y. Guo, X. Wang, M. Pietikainen, Local binary features for texture classification: Taxonomy and experimental study, PR 62 (2017) 135–160.
  • (6) T. Leung, J. Malik, Detecting, localizing and grouping repeated scene elements from an image, in: ECCV, 1996.
  • (7) T. Leung, J. Malik, Representing and recognizing the visual appearance of materials using three-dimensional textons, IJCV 43 (1) (2001) 29–44.
  • (8) G. Csurka, C. Bray, C. Dance, L. Fan, Visual categorization with bags of keypoints, in: Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
  • (9) F. Perronnin, C. Dance, Fisher kernels on visual vocabularies for image categorization, in: CVPR, 2007.
  • (10) T. Ojala, M. Pietikainen, D. Harwood, A comparative study of texture measures with classification based on featured distributions, PR 29 (1) (1996) 51–59.
  • (11) J. Zhang, K. Huang, Y. Yu, T. Tan, Boosted local structured hog-lbp for object localization, in: CVPR, 2011.
  • (12) T. Ahonen, A. Hadid, M. Pietikainen, Face recognition with local binary patterns, in: ECCV, 2004.
  • (13) C. Chen, B. Zhang, H. Su, W. Li, L. Wang, Land-use scene classification using multi-scale completed local binary patterns, SIVP 4 (2016) 745–752.
  • (14) J. A. dos Santos, O. A. B. Penatti, R. da Silva Torres, Evaluating the potential of texture and color descriptors for remote sensing image retrieval and classification, in: VISAPP, 2010.
  • (15) Y. Yang, S. Newsam, Geographic image retrieval using local invariant features, TGRS 51 (2) (2013) 818–832.
  • (16) B. Luo, S. Jiang, L. Zhang, Indexing of remote sensing images with different resolutions by multiple features, JSTARS 6 (4) (2013) 1899–1912.
  • (17) X. Chen, T. Fang, H. Huo, D. Li, Measuring the effectiveness of various features for thematic information extraction from very high resolution remote sensing imagery, TGRS 53 (9) (2015) 4837–4851.
  • (18) L. Chen, W. Yang, K. Xu, T. Xu, Evaluation of local features for scene classification using vhr satellite images, in: JURSE, 2011.
  • (19) Y. Yang, S. Newsam, Bag-of-visual-words and spatial extensions for land-use classification, in: GIS, 2010.
  • (20) Y. Yang, S. Newsam, Spatial pyramid co-occurrence for image classification, in: ICCV, 2011.
  • (21) S. Chen, Y. Tian, Pyramid of spatial relatons for scene-level land use classification, TGRS 53 (4) (2015) 1947–1957.
  • (22) R. Kusumaningrum, H. Wei, R. Manurung, A. Murni, Integrated visual vocabulary in latent dirichlet allocation based scene classification for ikonos image, JARS 8 (1) (2014) 083690–083690.
  • (23) Y. Zhong, Q. Zhu, L. Zhang, Scene classification based on the multifeature fusion probabilistic topic model for high spatial resolution remote sensing imagery, TGRS 53 (11) (2015) 6207–6222.
  • (24) F. Zhang, B. Du, L. Zhang, Saliency-guided unsupervised feature learning for scene classification, TGRS 53 (4) (2015) 2175–2184.
  • (25) F. Hu, G.-S. Xia, Z. Wang, X. Huang, L. Zhang, H. Sun, Unsupervised feature learning via spectral clustering of multidimensional patches for remotely sensed scene classification, JSTARS 8 (5) (2015) 2015–2030.
  • (26) M. Cimpoi, S. Maji, I. Kokkinos, A. Vedaldi, Deep filter banks for texture recognition, description, and segmentation, IJCV 118 (1) (2016) 65–94.
  • (27) F. Hu, G.-S. Xia, J. Hu, L. Zhang, Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery, Remote Sensing 7 (11) (2015) 680–707.
  • (28) O. Penatti, K. Nogueira, J. Santos, Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?, in: CVPRW, 2015.
  • (29) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, F.-F. Li, Imagenet: A large-scale hierarchical image database, in: CVPR, 2009.
  • (30) H. Azizpour, J. Sullivan, S. Carlsson, Cnn features off-the-shelf: An astounding baseline for recognition, in: CVPRW, 2014.
  • (31) M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, in: CVPR, 2014.
  • (32) G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, Aid: A benchmark dataset for performance evaluation of aerial scene classification, TGRS 55 (7) (2017) 3965–3981.
  • (33) G. Levi, T. Hassner, Emotion recognition in the wild via convolutional neural networks and mapped binary patterns, in: ICMI, 2015.
  • (34) K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: NIPS, 2014.
  • (35) G. Cheron, I. Laptev, C. Schmid, P-cnn: Pose-based cnn features for action recognition, in: ICCV, 2015.
  • (36) C. Feichtenhofer, A. Pinz, A. Zisserman, Convolutional two-stream network fusion for video action recognition, in: CVPR, 2016.
  • (37) J. Hoffman, S. Gupta, T. Darrell, Cross modal distillation for supervision transfer, in: CVPR, 2016.
  • (38) S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, Generative adversarial text to image synthesis, in: ICML, 2016.
  • (39) A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, Multimodal compact bilinear pooling for visual question answering and visual grounding, in: EMNLP, 2016.
  • (40) X. Tan, B. Triggs, Fusing gabor and lbp feature sets for kernel-based face recognition, in: AMFG, 2007.
  • (41) F. S. Khan, J. van de Weijer, R. M. Anwer, M. Felsberg, C. Gatta, Semantic pyramids for gender and action recognition, TIP 23 (8) (2014) 3633–3645.
  • (42) X. Wang, T. Han, S. Yan, An hog-lbp human detector with partial occlusion handling, in: ICCV, 2009.
  • (43) X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, TIP 19 (9) (2010) 1635–1650.
  • (44) Z. Guo, L. Zhang, D. Zhang, Rotation invariant texture classification using lbp variance (lbpv) with global matching, PR 43 (3) (2010) 706–719.
  • (45) A. Fathi, A. Nilchi, Noise tolerant local binary pattern operator for efficient texture analysis, PRL 33 (9) (2012) 1093–1100.
  • (46) Z. Guo, L. Zhang, D. Zhang, A completed modeling of local binary pattern operator for texture classification, TIP 19 (6) (2010) 1657–1663.
  • (47) L. Liu, L. Zhao, Y. Long, P. Fieguth, Extended local binary patterns for texture classification, IMAVIS 30 (2) (2012) 86–99.
  • (48) V. Ojansivu, E. Rahtu, J. Heikkila, Rotation invariant local phase quantization for blur insensitive texture analysis, in: ICPR, 2009.
  • (49) T. Maenpaa, M. Pietikainen, Classification with color and texture: jointly or separately?, PR 37 (8) (2004) 1629–1640.
  • (50) F. S. Khan, R. M. Anwer, J. van de Weijer, M. Felsberg, J. Laaksonen, Compact color-texture description for texture classification, PRL 51 (2015) 16–22.
  • (51) Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, L. Jackel, Handwritten digit recognition with a back-propagation network, in: NIPS, 1989.
  • (52) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: ICLR, 2015.
  • (53) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016.
  • (54) A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, W. Burgard, Multimodal deep learning for robust rgb-d object recognition, in: IROS, 2015.
  • (55) X. Chen, S. Xiang, C.-L. Liu, C.-H. Pan, Vehicle detection in satellite images by hybrid deep convolutional neural networks, LGRS 11 (10) (2014) 1797–1801.
  • (56) A. Romero, C. Gatta, G. Camps-Valls, Unsupervised deep feature extraction for remote sensing image classification, TGRS 54 (3) (2016) 1349–1362.
  • (57) M. Molinier, J. Laaksonen, T. Häme, Detecting man-made structures and changes in satellite imagery with a content-based information retrieval system built on self-organizing maps, TGRS 45 (4) (2007) 861–874.
  • (58) Y. Chen, Z. Lin, X. Zhao, G. Wang, Y. Gu, Deep learning-based classification of hyperspectral data, JSTARS 7 (6) (2014) 2094–2107.
  • (59) Y. Chen, X. Zhao, X. Jia, Spectral spatial classification of hyperspectral data based on deep belief network, JSTARS 8 (6) (2015) 2381–2392.
  • (60) D. Tuia, R. Flamary, N. Courty, Multiclass feature learning for hyperspectral image classification: Sparse and hierarchical solutions, JPRS 105 (2015) 272–285.
  • (61) J. Geng, J. Fan, H. Wang, X. Ma, B. Li, F. Chen, High-resolution sar image classification via deep convolutional autoencoders, LGRS 12 (11) (2015) 2351–2355.
  • (62) D. Marmanis, M. Datcu, T. Esch, U. Stilla, Deep learning earth observation classification using imagenet pretrained networks, LGRS 13 (1) (2016) 105–109.
  • (63) M. Pietikainen, G. Zhao, Two decades of local binary patterns: A survey, arXiv preprint arXiv:1612.06795.
  • (64) J. Chen, S. Shan, C. He, G. Zhao, M. Pietikainen, X. Chen, W. Gao, Wld: A robust local image descriptor, PAMI 32 (9) (2010) 1705–1720.
  • (65) S. ul Hussain, B. Triggs, Visual recognition using local quantized patterns, in: ECCV, 2012.
  • (66) L. Sharan, C. Liu, R. Rosenholtz, E. Adelson, Recognizing materials using perceptually inspired features, IJCV 103 (3) (2013) 348–371.
  • (67) Z. Guo, L. Zhang, D. Zhang, A completed modeling of local binary pattern operator for texture classification, TIP 19 (6) (2010) 1657–1663.
  • (68) J. Ylioinas, X. Hong, M. Pietikainen, Constructing local binary pattern statistics by soft voting, in: SCIA, 2013.
  • (69) J. Ylioinas, A. Hadid, Y. Guo, M. Pietikainen, Efficient image appearance description using dense sampling based local binary patterns, in: ACCV, 2012.
  • (70) M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, A. Vedaldi, Describing textures in the wild, in: CVPR, 2014.
  • (71) J. Bruna, S. Mallat, Invariant scattering convolution networks, PAMI 35 (8) (2013) 1872–1886.
  • (72) T.-H. Chan, K. Jia, S. Gao, Y. Ma, Pcanet: A simple deep learning baseline for image classification?, TIP 24 (12) (2014) 5017–5032.
  • (73) R. M. Anwer, F. S. Khan, J. van de Weijer, J. Laaksonen, Tex-nets: Binary patterns encoded convolutional neural networks for texture recognition, in: ICMR, 2017.
  • (74) Y. Rubner, C. Tomasi, L. Guibas, The earth mover’s distance as a metric for image retrieval, IJCV 40 (2) (2000) 99–121.
  • (75) I. Borg, F. Groenen, Modern Multidimensional Scaling: Theory and Applications, Springer, 2005.
  • (76) K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets, in: BMVC, 2014.
  • (77) M. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: ECCV, 2014.
  • (78) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016.
  • (79) B. Caputo, E. Hayman, P. Mallikarjuna, Class-specific material categorisation, in: ICCV, 2005.
  • (80) G. Sharma, S. ul Hussain, F. Jurie, Local higher-order statistics (lhs) for texture categorization and facial analysis, in: ECCV, 2012.
  • (81) A. Vedaldi, K. Lenc, Matconvnet: Convolutional neural networks for matlab, in: ACM Multimedia, 2015.
  • (82) G. Sheng, W. Yang, T. Xu, H. Sun, High-resolution satellite scene classification using a sparse coding based multiple feature combination, IJRS 33 (8) (2012) 2395–2412.
  • (83) Q. Zou, L. Ni, T. Zhang, Q. Wang, Deep learning based feature selection for remote sensing scene classification, LGRS 12 (11) (2015) 2321–2325.
  • (84) V. Risojevic, Z. Babic, Aerial image classification using structural texture similarity, in: ISSPIT, 2011.
  • (85) L. Zhao, P. Tang, L. Huo, A 2-d wavelet decomposition-based bag-of-visual-words model for land-use scene classification, IJRS 35 (2014) 2296–2310.
  • (86) A. Cheriyadat, Unsupervised feature learning for aerial scene classification, TGRS 52 (1) (2014) 439–451.
  • (87) L. Zhao, P. Tang, L. Huo, Land-use scene classification using a concentric circle-structured multiscale bag-of-visual-words model, JSTARS 7 (12) (2014) 4620–4631.
  • (88) W. Shao, W. Yang, G.-S. Xia, G. Liu, A hierarchical scheme of multiple feature fusion for high-resolution satellite scene categorization, in: ICCVS, 2013.
  • (89) S. Chen, Y. Tian, Pyramid of spatial relatons for scene-level land use classification, TGRS 53 (4) (2015) 1947–1957.
  • (90) H. Wu, B. Liu, W. Su, J. Sun, Hierarchical coding vectors for scene level land-use classification, Remote Sensing 8 (5) (2016) 436–453.
  • (91) T. Kobayashi, Dirichlet-based histogram feature transform for image classification, in: CVPR, 2014.
  • (92) R. Negrel, D. Picard, P.-H. Gosselin, Evaluation of second-order visual features for land-use classification, in: CBMIW, 2014.
  • (93) H. Wu, B. Liu, W. Su, W. Zhang, J. Sun, Deep filter banks for land-use scene classification, LGRS 13 (12) (2016) 1895–1899.
  • (94) L. Yan, R. Zhu, N. Mo, Y. Liu, Improved class-specific codebook with two-step classification for scene-level classification of high resolution remote sensing images, Remote Sensing 9 (3) (2017) 223–247.