Ground Truth Simulation for Deep Learning Classification of Mid-Resolution Venus Images Via Unmixing of High-Resolution Hyperspectral Fenix Data

11/24/2019, by Ido Faran, et al.

Training a deep neural network for classification constitutes a major problem in remote sensing due to the lack of adequate field data. Acquiring high-resolution ground truth (GT) by human interpretation is both cost-ineffective and inconsistent. We propose, instead, to utilize high-resolution, hyperspectral images for solving this problem, by unmixing these images to obtain reliable GT for training a deep network. Specifically, we simulate GT from high-resolution, hyperspectral FENIX images, and use it for training a convolutional neural network (CNN) for pixel-based classification. We show how the model can be transferred successfully to classify new mid-resolution VENuS imagery.




1 Introduction

Pixel-based classification of hyperspectral images is a major task in remote sensing, which involves assigning a class label to every pixel of an input image. This task, also known as pixel-wise classification or semantic segmentation, has attracted many studies over the years, and various methods have been proposed for it. The traditional approach classifies hand-crafted features, using support vector machines [12], morphological profiles [4], sparse representation [2], etc.

However, these methods typically rely on human expertise for tuning on a specific dataset, and they can extract only “shallow” features of the original data [1]. An alternative approach is to extract useful features directly from the image pixels.

Deep learning (DL) models have proven to be suitable for this kind of problem [19]. Such models are trained on image data sets, and are capable of learning both low-level and high-level feature representations directly from an input image, due to their deep hierarchical architectures. In addition, some DL models can exploit both spectral and spatial features of hyperspectral images, leading to improved classification results.

DL models can be categorized into supervised and unsupervised models. Unsupervised models (e.g., auto-encoders) are trained to extract features from large unlabeled data sets. By restricting the encoder-decoder structure, one can adjust the model to achieve the results of the required task [7] [9] [14]. Supervised models (e.g., convolutional neural networks (CNNs) and deep belief networks) are trained using ground truth (GT) information as the expected output of the network. In principle, supervised networks can learn more precise features by exploiting the label-specific information from the training data [5] [11] [18].

Although most supervised models achieve superior classification, they rely on a considerable amount of GT for training. Additionally, there is a limited number of labeled datasets in the remote sensing community [19], especially for a new source of information such as a new satellite.

The Vegetation and Environment monitoring New Micro-Satellite (VENuS) is a new satellite that was launched in August 2017. It acquires frequent, high-resolution multispectral images of over 100 sites of interest around the world. This enables monitoring of plant growth and health status, as well as the impact of environmental factors, such as human activities and climate change, on land surfaces of the Earth [16].

To date, a relatively small number of images have been acquired by VENuS, and there is virtually no GT for training a model on this data. Thus, to overcome the lack of labeled VENuS data and start using these images in supervised models, we need to generate GT correlated with the acquired satellite data.

To avoid the expensive task of obtaining a large number of labeled samples, we propose a novel method for simulating GT from a higher spectral resolution airborne sensor, and using it as initial training data for a CNN model. By applying a state-of-the-art spectral unmixing algorithm [8] to the above airborne data, and adapting the acquired images to the spatial and spectral resolutions of VENuS, we can train a CNN to classify VENuS images into several predefined endmembers (EMs). This approach may help provide initial classification of incoming VENuS images, without any GT, as part of a more comprehensive effort of processing time-series VENuS data on a continuous basis.

The paper’s contributions are as follows: (1) Introduction of a novel GT simulation for training a DL classification model without manual labeling; (2) presentation, for the first time, of classification results for the recently launched VENuS satellite over a Mediterranean region of much interest (as far as climate change is concerned), i.e., the results can serve as a baseline for comparison with further methods; and (3) provision of simulated data that will allow us to train more sophisticated models (such as fully convolutional networks [10]) or explore more complex tasks, such as spectral unmixing using neural networks, for further processing of VENuS data.

2 Background

2.1 FENIX

Figure 1: FENIX flight strip over Israel

A FENIX airborne scan took place on April 04, 2017 under clear sky conditions, along a transect from the Beit-Guvrin area (representing a semi-arid region with a rainfall rate of 450[mm/year]) to Lehavim (representing a desert fringe zone with 250[mm/year] rainfall). The scan was carried out over a 35-[km] long strip with a swath of 1.5[km] (Figure 1). A SPECIM AisaFENIX 1K airborne system was mounted on a Cessna 172 airplane flying at an altitude of 1,828[m] above sea level over a topographic area with an average height of 250[m]. The system consisted of VNIR and SWIR instruments, yielding a ground sampling distance of 1[m], and having wavelength ranges of 380-970[nm] (at 4.5[nm] spectral resolution) and 970-2500[nm] (at 12[nm] spectral resolution), respectively. The spectral bands were resampled into 41 bands of 5[nm] width in the wavelength range of 400-2400[nm].

2.2 VENuS

The VENuS satellite carries a super-spectral camera characterized by 12 narrow spectral bands ranging from 415[nm] to 910[nm]. The radiometric resolution for all bands is 10 bits and the spatial resolution is 5[m]. The spectral resolution at the Vis-NIR range is 40[nm], and 16[nm] and 20[nm], respectively, for the red edge and water vapor bands. Each experimental site is of size [km], with a 2-day revisit time.

2.3 Spectral Unmixing

Given a spectral image and the spectra of a set of distinct EM materials, the spectral unmixing process allows for extracting quantitative subpixel information by estimating the abundance fraction of each EM in each pixel. Assuming a linear mixture model (LMM), we write the spectral signature of each pixel as

x = Ea + n,

where x ∈ R^L is the signature of a mixed pixel, L is the number of spectral bands, E ∈ R^{L×p} is the matrix whose p columns are the EM signatures, a ∈ R^p is the vector of the corresponding fractions, and n represents the system noise, which is assumed to be Gaussian with zero mean. Requiring a fully constrained solution, the unmixing problem is solved subject to two constraints: a_i ≥ 0 for i = 1, …, p, and 1ᵀa = 1, where 1 is a vector of ones. In our case, we use the vectorized code projected gradient descent unmixing (VPGDU) method [8]. VPGDU combines projected gradient descent (PGD) and an exact line search strategy to optimize an objective function based on the spectral angle mapper (SAM).
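To make the fully constrained formulation concrete, the following sketch unmixes a single pixel by projected gradient descent with a Euclidean projection onto the probability simplex. It is a minimal illustration only: it minimizes a least-squares residual with a fixed step size, whereas VPGDU optimizes a SAM-based objective with an exact line search; the function names are ours.

```python
import numpy as np

def project_simplex(a):
    """Euclidean projection of a vector onto the simplex {a >= 0, sum(a) = 1}."""
    u = np.sort(a)[::-1]                      # sort in descending order
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(a) + 1) > 0)[0][-1]
    return np.maximum(a - css[rho] / (rho + 1.0), 0.0)

def unmix_pixel(x, E, steps=500, lr=0.01):
    """Fully constrained unmixing of one pixel x (L bands) against the EM
    matrix E (L x p), minimizing 0.5 * ||x - E a||^2 by projected gradient
    descent: gradient step, then projection back onto the simplex."""
    p = E.shape[1]
    a = np.full(p, 1.0 / p)                   # start from uniform fractions
    for _ in range(steps):
        grad = E.T @ (E @ a - x)              # gradient of the residual
        a = project_simplex(a - lr * grad)
    return a
```

For a pixel that is an exact mixture of two orthogonal EM signatures, the recovered fractions converge to the true ones while remaining nonnegative and summing to one.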

2.4 Convolutional Neural Network (CNN)

CNNs have shown excellent performance in various visual perception tasks, such as object detection, object classification, and semantic segmentation, by exploiting the local connectivity between adjacent pixels. Recently, CNNs have also been used successfully for classification of hyperspectral images; see, e.g., Hu et al. [5], Makantasis et al. [11], and Yue et al. [18].

3 Proposed Method

3.1 Overview

Figure 2 illustrates the framework of the proposed method. It consists of three parts: (1) ground truth simulation, (2) training a CNN model, and (3) evaluating its classification on a real VENuS image. In the first part, a spectral unmixing algorithm is executed on higher-resolution images, with respect to the spectral signatures of predefined labels, in order to extract fraction vectors. The original images are then aggregated and adjusted to match VENuS's spatial and spectral resolutions. In the second step, we use spatial patches around each labeled pixel to train a deep CNN. Finally, we apply the trained network to a calibrated VENuS L1 image to obtain its classification map. The proposed method is described below in detail.

Figure 2: Architecture of proposed method: CNN trained using simulated GT from hyperspectral image, and then used for classification of mid-resolution image.

3.2 Ground Truth Simulation

We simulate plausible GT for VENuS by converting a given FENIX image (at 1[m] resolution). Non-overlapping regions of 5 × 5 pixels are aggregated to a single pixel (at 5[m] resolution), and only the bands matching VENuS's spectral resolution are selected. To synthesize the GT labels (Figure 3), we first generate fraction maps of the high-resolution FENIX image (by applying VPGDU with respect to the seven EMs selected). The label assigned to a given pixel of the simulated VENuS image is the EM for which the aggregated fraction (over the corresponding region in the FENIX image) is the greatest.

Figure 3: Illustration of synthesized GT pixel labels based on unmixing results.
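A compact sketch of this labeling step, under the assumption of 5 × 5 aggregation blocks (matching the 1[m] to 5[m] resolution ratio); the fraction maps would come from the unmixing stage, and the function name is ours.

```python
import numpy as np

def simulate_gt(fractions, block=5):
    """fractions: (p, H, W) per-EM fraction maps at high (e.g. 1 m) resolution.
    Returns an (H//block, W//block) label map: each low-resolution pixel gets
    the EM whose fractions, summed over its block, are the greatest."""
    p, H, W = fractions.shape
    Hc, Wc = H // block, W // block
    f = fractions[:, :Hc * block, :Wc * block]     # crop to a whole number of blocks
    # sum the fractions over each (block x block) region, then take the argmax EM
    agg = f.reshape(p, Hc, block, Wc, block).sum(axis=(2, 4))
    return agg.argmax(axis=0)
```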

3.3 Training Neural Networks

The simulated input images are split into small patches [17] [3]. Each patch contains a spatially correlated area around a specific pixel, whose label is determined as explained above. This allows for creating a large number of labeled samples for training.
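A naive version of this patch extraction might look as follows; the patch width w is left as a parameter (the paper's exact patch size did not survive extraction), and the function name is ours.

```python
import numpy as np

def extract_patches(image, labels, w=5):
    """image: (H, W, d) spectral cube; labels: (H, W) GT map.
    Returns (N, w, w, d) patches centered on every interior pixel,
    each paired with the label of its center pixel."""
    r = w // 2
    H, W, _ = image.shape
    patches, ys = [], []
    for i in range(r, H - r):                 # skip border pixels whose
        for j in range(r, W - r):             # patch would fall off the image
            patches.append(image[i - r:i + r + 1, j - r:j + r + 1, :])
            ys.append(labels[i, j])
    return np.stack(patches), np.array(ys)
```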

We examined several DL models for the classification task; the best results were achieved for a deep CNN model. The proposed network receives a w × h × d input patch (where w, h are the patch dimensions and d is the number of spectral bands). It consists of several convolution layers with a decreasing number of filters per layer. Because of the relatively small input size, max-pooling layers are not necessary to reduce the dimensions of the model. Finally, the output of the last convolution layer is flattened and fed into a number of fully connected layers. The softmax activation function is applied to the last layer's output to produce classification fractions. The output label of the center pixel of the patch is determined by the class with the greatest fraction.

In addition, due to the unbalanced number of samples per label, data augmentation and denoising layers are added to prevent the network from over-fitting to the most frequent labels. Additive Gaussian noise, random rotations, and image mirroring are used for creating balanced label counts. Batch normalization [6] and dropout [13] layers are also added between hidden layers to insert noise into the training process.

3.4 Evaluating on New Dataset

The classification of the trained network can be “transferred” to new images acquired over similar geographical regions with the same EMs. This is applied to VENuS images at the L1 level (that were geographically calibrated to remove background noise while preserving spatial resolution). Image patches (from the real VENuS image) are then fed into the trained network, as before, to obtain a classification label for their center pixels.

4 Experimental Results

4.1 Datasets

The suggested procedure has been tested on simulated and real VENuS images for quantitative and visual assessment.

The simulated training data was acquired from the FENIX dataset by taking the six most relevant areas, with an average picture size of pixels.

The VENuS test data, with a spatial resolution of 5[m] and 11 spectral bands (band 6 is removed, as it is a duplication of band 5 for image quality), was acquired over the S02 polygon of Israel [15] on June 15, 2018. We worked with atmospherically corrected L1 products to maintain the original spatial resolution.

The following seven EMs that match the common land composition in this area were selected: Brown Soil, Light Soil, Rock, Tall Tree/Shrub, Dwarf Shrub, Herbaceous, and Dense Shrub/Burned Area.

4.2 Parameters and Details

To obtain reliable results, we conducted a 6-fold cross validation; each time one of the images was left out for testing, and the rest were used for training and validation. Specifically, all of the pixels of the latter images were randomly shuffled, and each time 90% of these pixels were used for training and the remaining 10% of the pixels for validation. Also, we normalized the images, as part of preprocessing, by standardizing the values of each spectral band to have zero mean and a standard deviation of 1.0.
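The per-band standardization described above can be written in a few lines of numpy; `standardize_bands` is our name for this hypothetical helper, and the small epsilon guards against constant bands.

```python
import numpy as np

def standardize_bands(image):
    """Standardize each spectral band of an (H, W, d) image cube to
    zero mean and unit standard deviation, as in the preprocessing step."""
    mu = image.mean(axis=(0, 1), keepdims=True)       # per-band mean
    sigma = image.std(axis=(0, 1), keepdims=True)     # per-band std
    return (image - mu) / (sigma + 1e-12)             # epsilon avoids /0
```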

After examining several patch configurations, we selected square patches around each pixel, labeled according to the center pixel of the patch. To train a balanced model with a similar count of samples per label, data augmentation was applied via horizontal/vertical flips, rotations by 90°, 180°, and 270°, and addition of Gaussian noise with zero mean and 0.1 standard deviation. During each epoch, 30,000 samples per label (for a total of 210,000 samples in each epoch) were created using a combination of the above techniques.
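As an illustration, the augmentation of a single patch might be sketched as follows; the sampling policy and function name are ours, not the paper's exact pipeline, though the noise parameters (zero mean, 0.1 standard deviation) match the text.

```python
import numpy as np

def augment(patch, rng):
    """Randomly augment one (w, w, d) patch: horizontal/vertical flips,
    a rotation by a multiple of 90 degrees, and additive Gaussian noise
    with zero mean and 0.1 standard deviation."""
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]                        # horizontal flip
    if rng.random() < 0.5:
        patch = patch[::-1, :, :]                        # vertical flip
    patch = np.rot90(patch, k=rng.integers(0, 4), axes=(0, 1))
    return patch + rng.normal(0.0, 0.1, patch.shape)     # additive noise
```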

The full network architecture is shown in Figure 2. For the CNN model, we used 4 convolution layers, with a different number of filters per layer, i.e., 64, 64, 32, and 16 filters, respectively. Batch normalization layers are used (before applying a ReLU activation function), as well as dropout layers with a rate of 25%. The CNN is followed by 3 fully connected layers with an output size of 7 neurons. (The hidden layers are activated using the ReLU function, while the output layer uses softmax activation.)

The following hyperparameters were arrived at after various tuning attempts: Batch size = 64, cross-entropy loss function, and the Adam optimizer with a learning rate of 0.001. All weights were randomly initialized. The deep neural network was implemented in Python with TensorFlow as the DL framework. The network was trained over 200 epochs using backpropagation on a PC equipped with an Intel Core i7 CPU and an Nvidia GeForce GTX 1080 Ti GPU.

(a) False color composite (b) Ground Truth (c) CNN classification
Figure 4: Visualization of the results obtained over the Avisure2 area.
(a) False color composite (b) CNN classification (c) ISODATA output
Figure 5: Visualization of the results on a real VENuS image.

4.3 Results

Table 1 reports the classification results obtained by the proposed network on the 6 test images. The average test accuracy obtained was 72.56% (Table 1). Although the quantitative results are not extremely high, visual assessment reveals a notable similarity between the GT and the classification maps for large parts of the simulated and real VENuS images (Figures 4 and 5, respectively). This attests to the promise of our baseline method for further classification of new VENuS images.

Testing Image Validation Accuracy Test Accuracy
Amazya1 80.67% 69.40%
Avisure1 82.65% 69.97%
Avisure2 80.61% 75.08%
Between1 80.48% 70.96%
Between2 82.49% 73.27%
Lehavim1 80.80% 76.71%
Overall 81.29% 72.56%
Table 1: Classification results on FENIX test images.

5 Conclusion

We proposed a novel method for GT simulation of mid-resolution data by applying unmixing to high-resolution hyperspectral images. This allows us to overcome a fundamental problem in remote sensing, i.e., a severe lack of labeled data. The simulated data was used for initial training of a CNN for pixel-based classification, as part of an ongoing project on temporally evolving CNNs for the ecological mapping of Mediterranean environments using VENuS images.


  • [1] J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N. Nasrabadi, and J. Chanussot (2013) Hyperspectral remote sensing data analysis and future challenges. IEEE Geosci. Remote Sens. Mag. 1 (2), pp. 6–36. Cited by: §1.
  • [2] Y. Chen, N. M. Nasrabadi, and T. D. Tran (2011) Hyperspectral image classification using dictionary-based sparse representation. IEEE Trans. Geosci. Remote Sens. 49 (10), pp. 3973–3985. Cited by: §1.
  • [3] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu (2014) Deep learning-based classification of hyperspectral data. IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 7 (6), pp. 2094–2107. Cited by: §3.3.
  • [4] M. Fauvel, J. A. Benediktsson, J. Chanussot, and J. R. Sveinsson (2008) Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles. IEEE Trans. Geoscience and Remote Sensing 46 (11), pp. 3804–3814. Cited by: §1.
  • [5] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li (2015) Deep convolutional neural networks for hyperspectral image classification. J. Sensors 2015. Cited by: §1, §2.4.
  • [6] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.3.
  • [7] R. Kemker and C. Kanan (2017) Self-taught feature learning for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 55 (5), pp. 2693–2705. Cited by: §1.
  • [8] F. Kizel, M. Shoshany, N. S. Netanyahu, G. Even-Tzur, and J. A. Benediktsson (2017) A stepwise analytical projected gradient descent search for hyperspectral unmixing and its code vectorization. IEEE Trans. Geosci. Remote Sens. 55 (9), pp. 4925–4943. Cited by: §1, §2.3.
  • [9] Z. Lin, Y. Chen, X. Zhao, and G. Wang (2013) Spectral-spatial classification of hyperspectral image using autoencoders. In 9th IEEE Int. Conf. Inf., Commun. & Signal Process., pp. 1–5. Cited by: §1.
  • [10] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 3431–3440. Cited by: §1.
  • [11] K. Makantasis, K. Karantzalos, A. Doulamis, and N. Doulamis (2015) Deep supervised learning for hyperspectral data classification through convolutional neural networks. In IEEE Int. Symp. Geosci. Remote Sens., pp. 4959–4962. Cited by: §1, §2.4.
  • [12] F. Melgani and L. Bruzzone (2004) Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 42 (8), pp. 1778–1790. Cited by: §1.
  • [13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958. Cited by: §3.3.
  • [14] C. Tao, H. Pan, Y. Li, and Z. Zou (2015) Unsupervised spectral–spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification. IEEE Geosci. Remote Sens. Lett. 12 (12), pp. 2438–2442. Cited by: §1.
  • [15] VENuS Israel Website. Cited by: §4.1.
  • [16] VENuS Website. Cited by: §1.
  • [17] J. Yang, Y. Zhao, and J. C. Chan (2017) Learning and transferring deep joint spectral–spatial features for hyperspectral classification. IEEE Trans. Geosci. Remote Sens. 55 (8), pp. 4729–4742. Cited by: §3.3.
  • [18] J. Yue, W. Zhao, S. Mao, and H. Liu (2015) Spectral–spatial classification of hyperspectral images using deep convolutional neural networks. Remote Sens. Lett. 6 (6), pp. 468–477. Cited by: §1, §2.4.
  • [19] L. Zhang, L. Zhang, and B. Du (2016) Deep learning for remote sensing data: a technical tutorial on the state of the art. IEEE Geosci. Remote Sens. Mag. 4 (2), pp. 22–40. Cited by: §1, §1.