Spectral Unmixing With Multinomial Mixture Kernel and Wasserstein Generative Adversarial Loss

12/12/2020 ∙ by Savas Ozkan, et al. ∙ 0

This study proposes a novel framework for spectral unmixing by using 1D convolution kernels and spectral uncertainty. High-level representations are computed from data and further modeled with the Multinomial Mixture Model to estimate fractions under severe spectral uncertainty. Furthermore, a new trainable uncertainty term based on a nonlinear neural network model is introduced in the reconstruction step. All uncertainty models are optimized by the Wasserstein Generative Adversarial Network (WGAN) to improve stability and capture uncertainty. Experiments are performed on both real and synthetic datasets. The results validate that the proposed method obtains state-of-the-art performance, especially on the real datasets, compared to the baselines. Project page at: https://github.com/savasozkan/dscn.




1 Introduction

Spectral data provides comprehensive knowledge about the Earth's surface in the form of narrow-band spectra. The limited spatial resolution of spectral sensors causes pure materials in a scene to be observed as mixtures in different fractions. Indeed, the foremost concern is that pure materials (i.e., endmembers) and their fractions have to be estimated blindly from spectral data. Furthermore, nonlinear relations and spectral uncertainty are ubiquitous under severe scattering conditions (see Appendix A).

Even if the mixture of data can be represented with linear/nonlinear formulations, these assumptions underestimate the solution by considering pure materials as a finite set towards spectral uncertainty. To overcome the limitations, the mathematical formulation is redefined as follows in Zhou et al. (2018):

$\mathbf{y} = \sum_{k=1}^{K} a_k \mathbf{m}_k + \mathbf{n}$   (1)

Here $K$ is the number of endmembers, and $\mathbf{y} \in \mathbb{R}^D$ denotes the spectral bands of a pixel. Also, $\mathbf{n}$ is the additive Gaussian noise to account for possible noise sources. Note that the endmember signature $\mathbf{m}_k$ varies for each pixel and material. Hence, this term virtually defines the uncertainty of the data.
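As a concrete illustration, the mixture model above can be simulated in a few lines of numpy. All signature values, noise levels, and names below are illustrative assumptions for the sketch, not data or code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

D, K = 50, 3  # spectral bands, number of endmembers

# Nominal endmember signatures (rows); values are purely illustrative.
M = rng.uniform(0.1, 0.9, size=(K, D))

def mix_pixel(fractions, sigma_m=0.02, sigma_n=0.005):
    """Generate one pixel: per-pixel endmember variability (spectral
    uncertainty) plus additive Gaussian noise."""
    # Each material signature is perturbed independently for this pixel,
    # so identical fractions never reproduce exactly the same spectrum.
    M_pixel = M + rng.normal(0.0, sigma_m, size=M.shape)
    noise = rng.normal(0.0, sigma_n, size=D)
    return fractions @ M_pixel + noise

a = np.array([0.6, 0.3, 0.1])  # fractions: non-negative, sum to one
y = mix_pixel(a)
```

The per-pixel perturbation of the signatures is precisely what makes a fixed, finite endmember set insufficient and motivates the distribution-based treatment that follows.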

In the scope of this paper, we present a novel framework that broadly uses this equation by computing fractions from pre-computed material signatures (i.e., only fractions are estimated from data). Specifically, we introduce a multinomial mixture kernel based on a neural network (NN) structure that obtains the fractions per pixel. The learnable parameters of this kernel are optimized with WGAN for the first time. This step achieves theoretically more accurate and stable solutions.

2 Related Work

As expected, the initial methods are based on linear models due to their simplicity. In particular, sum-to-one and non-negativity constraints are imposed on least-squares solutions Heylen and others (2011). However, these models are insufficient to resolve complex cases such as multiple scattering effects and microscopic-level material mixtures. Hence, bilinear models are used to handle nonlinear scenarios in which bilinear interactions of linear relations are considered in the models Halimi et al. (2011); Altmann and others (2011).

Even if the nonlinearity is retained, spectral uncertainty is still discarded in these models due to the finite set of materials. Distribution-based models represent endmember signatures as if they are sampled from some distributions. In the literature, various distributions are exploited Zhang et al. (2014); Du et al. (2014). Notice that these models approximate each fraction with a unimodal distribution. Ultimately, this undermines the representation capacity as expected. Furthermore, the Gaussian Mixture Model (GMM) is adopted in Zhou et al. (2018). However, the complexity and assumptions limit the applicability of this method for large-scale data.

Recently, the AutoEncoder (AE) network is redefined with additional layers and a spectral angle distance metric Ozkan et al. (2018) to unmix data, which improves robustness to spectral uncertainty. Moreover, Su et al. (2019) use a stacked deeper AE network, and a robust initialization is achieved by matrix factorization.

3 Proposed Model

We first describe the basics of the proposed method (see Appendix B). The architecture is equivalent to the standard AE formulation: it encodes the input spectra to a hidden latent representation and then decodes it to recompute the reconstructed version of the original spectra by optimizing the learnable encoder and decoder parameters $\theta_e$ and $\theta_d$ with some loss function.

3.1 Encoder Part

The dimensionality of data is one of the drawbacks, especially for unsupervised learning. The main reason is that a high-dimensional space causes irregular distributions of data in the learning process. Inevitably, overfitting occurs for the parameters. For additional background, we recommend Norouzi and Blei (2011); Jegou et al. (2011).

For this purpose, high-dimensional input data $\mathbf{y} \in \mathbb{R}^D$ is mapped to a low-dimensional space $\mathbb{R}^S$, where $S$ denotes the dimension of latent representations. Hence, a latent representation is computed with a set of learnable parameters $\theta_e$ as $\mathbf{h} = E(\mathbf{y}; \theta_e)$. Observe that the $S$ value should be selected as $S \ll D$ to extract sparse representations. In particular, 1D convolution layers are utilized with batch normalization and sparse activation layers in this function. Since fewer parameters are optimized, the stack of convolutions boosts the representation capacity with deep hierarchies. Details of the NN structure with input/output relations can be found in Appendix C.
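As a rough sketch of how stacked 1D convolutions and pooling shrink the spectral axis, the following numpy snippet mimics the encoder's dimensionality reduction. The kernels here are random stand-ins, not trained weights, and the layer sizes are chosen only to match the D to D/10 pattern:

```python
import numpy as np

def conv1d(x, kernel):
    # 'Same'-size 1D convolution over the spectral axis.
    return np.convolve(x, kernel, mode="same")

def avg_pool(x, k):
    # Non-overlapping average pooling with kernel size and stride k.
    n = (len(x) // k) * k
    return x[:n].reshape(-1, k).mean(axis=1)

rng = np.random.default_rng(0)
D = 200
spectrum = rng.random(D)

h = conv1d(spectrum, rng.normal(size=21) / 21)  # wide first kernel
h = np.maximum(h, 0.0)                          # sparse (ReLU-like) activation
h = avg_pool(h, 5)                              # length D -> D/5
h = conv1d(h, rng.normal(size=3) / 3)           # deeper, smaller kernel
h = avg_pool(np.maximum(h, 0.0), 2)             # length D/5 -> D/10
```

Each pooling stage trades spectral resolution for a more compact latent axis, which is what keeps the parameter count low in the real encoder.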

Later, fractions are estimated by using these latent representations with a multinomial function $\hat{\mathbf{a}} = F(\mathbf{h}; \theta_f)$, where $\theta_f$ is the set of learnable parameters. A NN multinomial mixture model is utilized for this case. As explained, distribution-based methods reproduce materials as if they are sampled from probability distributions. Initial methods employ a unimodal distribution as:

$\mathbf{m}_k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$   (2)

Fractions are estimated from $\mathbf{h}$ by using a Gaussian distribution, where $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ indicate the mean and covariance matrix, respectively. In fact, this formulation underestimates the solution by fitting a unimodal distribution for each material. Therefore, Eq. 2 is restored with the Gaussian Mixture Model (GMM) by computing fractions with the mixture of Gaussian distributions:


$\mathbf{m}_k \sim \sum_{j=1}^{N} \pi_{kj}\, \mathcal{N}(\boldsymbol{\mu}_{kj}, \boldsymbol{\Sigma}_{kj})$   (3)

Here, $N$ is the total number of mixture components, constrained by $\pi_{kj} \geq 0$ and $\sum_{j=1}^{N} \pi_{kj} = 1$. Multinomial distributions are adopted in our method because fractions can thereby be represented more richly, so that spectral uncertainty is adequately handled. Furthermore, $N$ should be selected as larger than the number of endmembers $K$ for effective representations. To this end, we mimic this approach with a set of NN models. We prefer NN models since the overall framework can be trained in an end-to-end manner. Note that similar attempts are also made in the literature Jalali et al. (2019).

If we extend Eq. 3 to separate the Gaussian term into two learnable parts, the mixture weights $\pi_{kj}$ and the component densities $\mathcal{N}(\boldsymbol{\mu}_{kj}, \boldsymbol{\Sigma}_{kj})$ can be handled independently. As noticed, at least two sets of parameters must be optimized; thus, two separate single-layer subnetworks are employed, one for each part.

To model these terms, the quadratic form of the Gaussian density is renounced in our formulation. Precisely, the multivariate Mahalanobis distance is normalized with a sigmoid function to calculate the probabilistic relations. To this end, the normal distribution resembles a cumulative distribution function. Similarly, the mixture weight term $\pi_{kj}$ is calculated with a NN model, and the sum-to-one constraint is retained with a softmax activation. In the end, all subparts of the encoder can be trained with differentiable layers.
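A minimal numpy sketch of this kernel follows. The exact weighting of components is an assumption for illustration (the paper's layer is learned end-to-end with trained means and covariances), but the sketch shows the three ingredients named above: Mahalanobis distances, sigmoid normalization, and a softmax for the sum-to-one constraint:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

rng = np.random.default_rng(1)
S, K, N = 8, 3, 6  # latent dim, endmembers, mixture components per material

h = rng.normal(size=S)           # latent representation of one pixel
mu = rng.normal(size=(K, N, S))  # component means (learnable in the paper)
inv_cov = np.eye(S)              # identity covariance for simplicity

def fractions(h):
    # Sigmoid-normalized Mahalanobis distance replaces the quadratic
    # Gaussian form; mixture weights come from a softmax (sum-to-one).
    scores = np.empty(K)
    for k in range(K):
        d = np.array([np.sqrt((h - m) @ inv_cov @ (h - m)) for m in mu[k]])
        pi = softmax(-d)                # closer components weigh more
        scores[k] = (pi * sigmoid(-d)).sum()
    return softmax(scores)             # abundances: non-negative, sum to one

a = fractions(h)
```

Because every operation (distances, sigmoid, softmax) is differentiable, the whole kernel can sit inside a network trained by backpropagation, which is the property the paper relies on.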

3.1.1 Parameter Updates for Multinomial Mixture Kernel

Expectation-Maximization (EM) is the most popular technique to estimate GMM parameters. This approach guarantees a stable solution by optimizing the negative log-likelihood function. Note that our loss function exploits the negative log-likelihood term for parameter updates. However, the studies show that the EM approach struggles to reach a true solution for several reasons, such as the random initialization of parameters and discontinuity. For further details, we recommend Kolouri et al. (2017); Jin et al. (2016).

To overcome these limitations, WGAN is employed, which provides several theoretical benefits over the classical approaches Kolouri et al. (2017). Specifically, the Wasserstein distance is continuous and differentiable almost everywhere. Also, since it yields a smooth loss function, the initialization is not critical. Formally, the Wasserstein distance measures the distance between two distributions $\mathbb{P}_r$ and $\mathbb{P}_g$ that generate variables $x \sim \mathbb{P}_r$ and $\tilde{x} \sim \mathbb{P}_g$. The Kantorovich-Rubinstein duality minimizes the distance as:

$W(\mathbb{P}_r, \mathbb{P}_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[f(\tilde{x})]$   (4)
where $f$ is a 1-Lipschitz function. Inspired by the structure of the Variational AutoEncoder (VAE) Kingma and Welling (2013), this duality can be adopted to our formulation in Eqs. 1 and 3 as:

$W(\mathbb{P}_r, \mathbb{P}_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{\mathbf{y} \sim \mathbb{P}_r}\big[f(\mathbf{y})\big] - \mathbb{E}_{\hat{\mathbf{a}} \sim F}\big[f\big(\sum_{k=1}^{K} \hat{a}_k \mathbf{m}_k\big)\big]$   (5)

Here the additive noise $\mathbf{n}$ is omitted for simplicity, and the endmember value $\mathbf{m}_k$ is accepted as not random. Note that this simplification is resolved in the decoder part. In addition, the generated sample from the multinomial components is denoted as $\hat{\mathbf{y}} = \sum_{k} \hat{a}_k \mathbf{m}_k$. With these observations, the formulation is simplified as:

$W(\mathbb{P}_r, \mathbb{P}_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{\mathbf{y}}\big[f(\mathbf{y})\big] - \mathbb{E}_{\hat{\mathbf{y}}}\big[f(\hat{\mathbf{y}})\big]$   (6)
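To build intuition for the quantity being minimized, the one-dimensional case has a closed form: the empirical 1-Wasserstein distance between two equal-size samples is the mean absolute difference of their sorted values. This toy example is not part of the paper's pipeline; it only illustrates why the distance behaves smoothly as distributions approach each other:

```python
import numpy as np

def wasserstein_1d(x, y):
    """Empirical 1-Wasserstein distance between two equal-size 1D samples.
    In 1D the optimal transport plan simply matches sorted samples."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=10_000)
fake_far = rng.normal(3.0, 1.0, size=10_000)
fake_near = rng.normal(0.1, 1.0, size=10_000)

# Shifting a distribution toward the target shrinks the distance smoothly,
# which is what makes the Wasserstein loss well-behaved for training.
d_far = wasserstein_1d(real, fake_far)
d_near = wasserstein_1d(real, fake_near)
```

In contrast to divergences that saturate when supports barely overlap, this distance keeps providing a useful gradient signal, which is the stability argument made above.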
To this end, this notion mathematically shows that the WGAN loss can be used for the parameter updates of the mixture terms. Furthermore, an additional gradient penalty term is used for stable solutions Gulrajani et al. (2017):

$L_{gp} = \lambda\, \mathbb{E}_{\bar{\mathbf{y}}}\big[(\|\nabla_{\bar{\mathbf{y}}} f_w(\bar{\mathbf{y}})\|_2 - 1)^2\big]$   (7)

Here, $f_w$ is the discriminative module and it is a 1-Lipschitz function. Furthermore, $\bar{\mathbf{y}}$ is uniformly sampled between real and reconstructed data. $\lambda$ scales the contribution of the penalty term. The details for the NN structure of the discriminative module can be found in Appendix C.
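The gradient penalty above can be sketched numerically as follows. The critic here is a toy linear function whose gradient is known analytically, and the default λ = 10 is the common WGAN-GP choice rather than a value taken from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_penalty(critic_grad, y_real, y_fake, lam=10.0):
    """WGAN-GP style penalty: evaluate the critic's gradient norm at points
    interpolated uniformly between real and reconstructed spectra."""
    eps = rng.uniform(size=(y_real.shape[0], 1))
    y_bar = eps * y_real + (1.0 - eps) * y_fake
    grads = critic_grad(y_bar)            # one gradient row per sample
    norms = np.linalg.norm(grads, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)

# Toy linear critic f(y) = w . y, whose gradient is w everywhere.
w = np.array([0.6, 0.8])                  # ||w|| = 1, so penalty vanishes
critic_grad = lambda y_bar: np.tile(w, (y_bar.shape[0], 1))

y_real = rng.normal(size=(4, 2))
y_fake = rng.normal(size=(4, 2))
penalty = gradient_penalty(critic_grad, y_real, y_fake)
```

A critic whose gradient norm drifts away from one is pushed back toward the 1-Lipschitz constraint, which is exactly the role of this term during training.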

3.2 Decoder Part

As stated, the formulation in Eq. 1 defines a term for spectral uncertainty. However, we omit this term in Eq. 6 for simplicity. Here, we remodel this term with extra NN layers.

For this purpose, we add an uncertainty NN model that takes the output of the multinomial mixture kernel and random noise as input. This model computes an additive full-size spectral signature. Observe that the random noise mimics the spectral uncertainty of different pixels in the formulation. Lastly, the WGAN loss is also used for these parameters. Moreover, we utilize a refinement NN model that allows updates and nonlinearity for the precomputed endmembers.

The final output of the decoder part is reformulated by rescaling the contributions of the refinement and uncertainty NN models with scalar weights $\alpha$ and $\beta$, respectively:

$\hat{\mathbf{y}} = \alpha\, R(\hat{\mathbf{a}}^{T}\mathbf{M}) + \beta\, U(\hat{\mathbf{a}}, \mathbf{z})$   (8)

where $\mathbf{M}$ denotes the precomputed endmember matrix, $R(\cdot)$ and $U(\cdot)$ are the refinement and uncertainty models, and $\mathbf{z}$ is the random noise input.
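The decoder's weighted combination of the refinement and uncertainty outputs can be sketched as follows. The two models here are hypothetical stand-ins (a small tanh correction and a noise-driven signature), and the scalar weights are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 50, 3

M = rng.uniform(0.1, 0.9, size=(K, D))  # precomputed endmember matrix
a = np.array([0.5, 0.3, 0.2])           # fractions from the encoder

def refine(y):
    # Stand-in for the refinement NN: a small nonlinear correction that
    # lets the precomputed endmembers be updated.
    return y + 0.01 * np.tanh(y)

def uncertainty(a, z):
    # Stand-in for the uncertainty NN: fractions and random noise are
    # mapped to a full-size additive spectral signature.
    return 0.01 * np.tanh(z * (1.0 + a.sum()))

alpha, beta = 1.0, 0.1                  # illustrative scaling weights
z = rng.normal(size=D)                  # random noise input
y_hat = alpha * refine(a @ M) + beta * uncertainty(a, z)
```

Feeding fresh noise per pixel lets the uncertainty branch produce a different additive signature each time, mirroring the per-pixel variability the term is meant to model.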
3.3 Implementation Details

For the parameter updates, a spectral angle distance based loss function is utilized Ozkan et al. (2018). The WGAN loss is also used for the parameters of the encoder and decoder parts, as explained. The Adam stochastic optimizer is used Kingma and Ba (2014).
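A common form of the spectral angle distance is shown below; this follows its usual definition rather than code from the paper:

```python
import numpy as np

def spectral_angle_distance(y, y_hat, eps=1e-12):
    """Angle between original and reconstructed spectra (radians).
    Scale-invariant, hence robust to brightness differences."""
    cos = np.dot(y, y_hat) / (np.linalg.norm(y) * np.linalg.norm(y_hat) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))

y = np.array([0.2, 0.5, 0.9])
angle_same = spectral_angle_distance(y, 2.0 * y)  # near zero: same shape, brighter
angle_diff = spectral_angle_distance(y, y[::-1])  # clearly positive
```

Because the angle ignores overall magnitude, the loss focuses on spectral shape, which is the robustness property cited from Ozkan et al. (2018).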

4 Experiments

Experiments to compare the performance of the proposed method (DSCN++) with the baseline techniques are conducted on two real datasets, Urban Zhu and others (2014a) and Jasper Zhu and others (2014b) (see Appendix D for additional experiments). For this purpose, the Linear Mixture Model (LMM) Heylen and Scheunders (2016), Generalized Bilinear Model (GBM) Halimi et al. (2011), Post-Nonlinear Mixing Model (PPNM) Altmann et al. (2012), Multilinear Mixing Model (MLM) Heylen and Scheunders (2016) and Normal Compositional Model (NCM) Stein (2003) are selected, as these are extensively used nonlinear and distribution-based methods in the literature. Scores of the baselines are reproduced.

Furthermore, the Spatial Compositional Model (SCM) Zhou et al. (2016), Distance-MaxD (DMaxD) Heylen et al. (2011) and Sparse AutoEncoder Network (EndNet) Ozkan et al. (2018) are exploited to estimate endmember spectra from the scenes. Scores are reported in terms of Root Mean Square Error (RMSE), and tests are repeated multiple times.

First, we test all individual steps and parameter configurations on different datasets. For clarity, EU stands for the NN models that correspond to the uncertainty and refinement models in the decoder part. Furthermore, WGAN indicates the use of the Wasserstein GAN in parameter optimization, and $N$ denotes the total number of mixture components. Experimental results are reported in Tab. 1. An important observation is that all these contributions ultimately improve performance. In particular, both the WGAN and EU steps improve the performance of the DMaxD method for both datasets. A similar observation can be made for SCM on the Urban dataset and EndNet on the Jasper dataset. Lastly, selecting an $N$ value larger than the number of materials ultimately advances the unmixing performance.

Second, we also compare our method with the baselines. Results are demonstrated in Tab. 2. Observe that no single baseline method can consistently yield accurate results for different scenes, even though these baselines cover both nonlinear and distribution-based methods. This observation shows the superiority of our method, which can capture both nonlinearity and uncertainty conditions. Indeed, our method obtains state-of-the-art performance compared to the baselines by a large margin.

            Urban Dataset                              Jasper Dataset
            DMaxD          SCM           EndNet        DMaxD          SCM           EndNet

            15.96 ± 2.6    13.26 ± 0.6   8.27 ± 0.4    12.03 ± 0.7    18.10 ± 0.9   6.38 ± 0.7
N           15.47 ± 1.9    13.29 ± 0.8   8.05 ± 0.5    12.29 ± 0.9    18.33 ± 1.1   6.61 ± 0.7
N           16.32 ± 3.2    13.24 ± 0.5   8.16 ± 0.4    12.12 ± 0.7    17.92 ± 0.8   6.31 ± 0.7
N           16.56 ± 2.1    13.23 ± 0.8   8.04 ± 0.3    12.27 ± 0.6    18.03 ± 1.1   6.62 ± 1.1
w/o EU      34.14 ± 0.3    15.91 ± 0.3   8.38 ± 0.4    14.78 ± 0.4    20.26 ± 0.4   8.36 ± 0.5
w/o WGAN    31.43 ± 1.6    15.76 ± 1.0   8.52 ± 0.5    15.63 ± 0.3    19.03 ± 0.4   10.15 ± 0.5

Table 1: RMSE performance (mean ± std) for the Urban and Jasper datasets. Best results are bold.
          Urban Dataset                Jasper Dataset
          DMaxD    SCM      EndNet    DMaxD    SCM      EndNet

LMM       31.04    28.00    27.59     15.31    19.78    22.60
GBM       31.04    27.00    27.08     15.98    19.95    21.91
PPNM      36.54    28.17    15.06     12.99    19.78    14.72
MLM       31.43    16.34    9.48      16.59    20.73    8.13
NCM       31.04    28.00    27.51     15.29    19.56    22.42
DSCN++    15.47    13.23    8.04      12.03    17.92    6.31

Table 2: RMSE performance on the Urban and Jasper datasets. Best results are bold.

5 Conclusion

In this paper, we introduce a novel neural network (NN) framework for spectral unmixing. It is composed of 1D convolutions and addresses spectral uncertainty, which is one of the severe physical conditions frequently observed in real data. For this purpose, a novel NN layer is proposed to estimate the fractions per pixel by leveraging the multinomial mixture kernel. The Wasserstein GAN is exploited for the parameter optimization, which leads to more accurate and stable performance than other loss functions. The experimental results validate that the proposed method outperforms the baselines by a large margin.


  • [1] Y. Altmann, A. Halimi, N. Dobigeon, and J. Tourneret (2012) Supervised nonlinear spectral unmixing using a postnonlinear mixing model for hyperspectral imagery. IEEE Transactions on Image Processing 21 (6), pp. 3017–3025. Cited by: §4.
  • [2] Y. Altmann et al. (2011) Bilinear models for nonlinear unmixing of hyperspectral images. In IEEE WHISPERS, pp. 1–4. Cited by: §2.
  • [3] X. Du, A. Zare, P. Gader, and D. Dranishnikov (2014) Spatial and spectral unmixing using the beta compositional model. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (6), pp. 1994–2003. Cited by: §2.
  • [4] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5769–5779. Cited by: §3.1.1.
  • [5] A. Halimi, Y. Altmann, N. Dobigeon, and J. Tourneret (2011) Nonlinear unmixing of hyperspectral images using a generalized bilinear model. IEEE Transactions on Geoscience and Remote Sensing 49 (11), pp. 4153–4162. Cited by: §2, §4.
  • [6] R. Heylen, D. Burazerovic, and P. Scheunders (2011) Non-linear spectral unmixing by geodesic simplex volume maximization. IEEE Journal of Selected Topics in Signal Processing 5 (3), pp. 534–542. Cited by: §4.
  • [7] R. Heylen et al. (2011) Fully constrained least squares spectral unmixing by simplex projection. IEEE TGRS, pp. 4112–4122. Cited by: §2.
  • [8] R. Heylen and P. Scheunders (2016) A multilinear mixing model for nonlinear spectral unmixing. IEEE transactions on geoscience and remote sensing 54 (1), pp. 240–251. Cited by: §4.
  • [9] S. Jalali, C. Nuzman, and I. Saniee (2019) Efficient deep approximation of gmms. In Advances in Neural Information Processing Systems, pp. 4552–4560. Cited by: §3.1.
  • [10] H. Jegou, M. Douze, and C. Schmid (2011) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §3.1.
  • [11] C. Jin, Y. Zhang, S. Balakrishnan, M. J. Wainwright, and M. I. Jordan (2016) Local maxima in the likelihood of gaussian mixture models: structural results and algorithmic consequences. In Advances in Neural Information Processing Systems, pp. 4116–4124. Cited by: §3.1.1.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
  • [13] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.1.1.
  • [14] S. Kolouri, G. K. Rohde, and H. Hoffman (2017) Sliced wasserstein distance for learning gaussian mixture models. arXiv preprint arXiv:1711.05376. Cited by: §3.1.1, §3.1.1.
  • [15] M. Norouzi and D. M. Blei (2011) Minimal loss hashing for compact binary codes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 353–360. Cited by: §3.1.
  • [16] S. Ozkan, B. Kaya, and G. B. Akar (2018) Endnet: sparse autoencoder network for endmember extraction and hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing 57 (1), pp. 482–496. Cited by: §2, §3.3, §4.
  • [17] D. Stein (2003) Application of the normal compositional model to the analysis of hyperspectral imagery. In Advances in techniques for analysis of remotely sensed data, 2003 IEEE Workshop on, pp. 44–51. Cited by: §4.
  • [18] Y. Su, J. Li, A. Plaza, A. Marinoni, P. Gamba, and S. Chakravortty (2019) DAEN: deep autoencoder networks for hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing 57 (7), pp. 4309–4321. Cited by: §2.
  • [19] B. Zhang, L. Zhuang, L. Gao, W. Luo, Q. Ran, and Q. Du (2014) PSO-em: a hyperspectral unmixing algorithm based on normal compositional model. IEEE Transactions on Geoscience and Remote Sensing 52 (12), pp. 7782–7792. Cited by: §2.
  • [20] Y. Zhou, A. Rangarajan, and P. D. Gader (2016) A spatial compositional model for linear unmixing and endmember uncertainty estimation. IEEE Transactions on Image Processing 25 (12), pp. 5987–6002. Cited by: §4.
  • [21] Y. Zhou, A. Rangarajan, and P. D. Gader (2018) A gaussian mixture model representation of endmember variability in hyperspectral unmixing. IEEE Transactions on Image Processing 27 (5), pp. 2242–2256. Cited by: Appendix D, §1, §2.
  • [22] F. Zhu et al. (2014) Spectral unmixing via data-guided sparsity. IEEE TIP, pp. 5412–5427. Cited by: §4.
  • [23] F. Zhu et al. (2014) Structured sparse method for hyperspectral unmixing. ISPRS Journal of Photogrammetry and Remote Sensing, pp. 101–118. Cited by: §4.

Appendix A Appendix For Spectral Uncertainty

As explained, spectral images can be significantly affected by variations in atmospheric, illumination, or environmental conditions. Hence, spectral uncertainty ultimately influences the mixture of materials, especially for real data. Fig. 1 illustrates pure material samples from the real datasets. Observe that this uncertainty can be severe for spectral data.

Figure 1: Spectral signatures from Jasper (first row) and Urban (second row) data correspond to two different pure materials.

Appendix B Appendix For Model Flow

The overall flow of the proposed method is illustrated in Fig. 2. Our model takes spectral data as input and computes a latent representation. Later, this representation is modeled with a multinomial mixture model to estimate the fractions. These fractions are fed to two NN models together with the matrix multiplication of the precomputed endmembers in the decoder part. In the end, the individual outputs are summed, and the input is reconstructed.

Figure 2: Flow of the proposed model.

Appendix C Appendix For NN Structures

In our method, two deep NN models are primarily utilized. For the encoder part, a stack of 1D convolution layers is used to transform high-dimensional spectral data to a low-dimensional space, as illustrated in Tab. 3. Observe that the order of convolution and normalization layers is modified. This modification enhances the selectivity of representations by feeding more statistical knowledge to the convolution layers. Furthermore, an Inception module that has multiple kernels of different sizes is also integrated. Hence, a rich representation from multiple kernel sizes can be computed.

Index  Index of Inputs  Operation(s)             Output Shape
(1)    -                Input Spectra            D
(2)    (1)              Conv1D(10, 21, 1)        D × 10
(3)    (2)              PReLU + AvgP(5) + Nrm    D/5 × 10
(4)    (3)              Conv1D(10, 3, 1)         D/5 × 10
(5)    (3)              Conv1D(10, 5, 1)         D/5 × 10
(6)    (3)              Conv1D(10, 7, 1)         D/5 × 10
(7)    (4),(5),(6)      Concat                   D/5 × 30
(8)    (7)              PReLU + AvgP(2) + Nrm    D/10 × 30
(9)    (8)              Conv1D(5, 3, 1)          D/10 × 10
(10)   (9)              PReLU + AvgP(2) + Nrm    D/20 × 10
(11)   (10)             Linear(M) + LReLU        M

Table 3: The NN structure of the encoder part. The notation Conv1D(W, K, S) indicates a 1D convolution whose output dimension, kernel size and stride are W, K and S, respectively. Furthermore, Nrm and PReLU denote normalization and activation layers. AvgP(K) corresponds to the average pooling layer with kernel size and stride K.
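The spectral-axis lengths in Tab. 3 can be traced programmatically. This helper simply applies the division pattern of the table to an arbitrary band count D (any D divisible by 20 works); the stage names are illustrative labels, not identifiers from the paper:

```python
def encoder_shapes(D):
    """Trace (stage, spectral length, channels) through the encoder of Tab. 3."""
    shapes = [("input", D, 1)]
    L = D
    shapes.append(("conv1d_21", L, 10))         # Conv1D(10, 21, 1): length kept
    L //= 5
    shapes.append(("avgpool_5", L, 10))         # AvgP(5): length / 5
    shapes.append(("inception_concat", L, 30))  # three Conv1D branches, concat
    L //= 2
    shapes.append(("avgpool_2", L, 30))         # AvgP(2): length / 2
    L //= 2
    shapes.append(("avgpool_2b", L, 10))        # Conv1D + AvgP(2): length / 2
    return shapes

for name, length, ch in encoder_shapes(200):
    print(f"{name:18s} {length:4d} x {ch}")
```

The overall reduction is D to D/20 before the final Linear(M) layer, confirming the aggressive compression of the spectral axis described in Section 3.1.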

To demonstrate the superiority of this model, Fig. 3 visualizes the distributions of input data and the latent representations transformed by our method. Note that colors discriminate samples based on their real fractions. Our method can blindly achieve a more separable space for spectral data.

Figure 3: Data distributions before (first row) and after (second row) applying the encoder part.

Moreover, we utilize a deep NN model for the discriminative part of WGAN. This structure is given in Tab. 4. In particular, PatchGAN is adopted in our method. Hence, local patches are fed to the discriminative model rather than full-dimensional data, so local characteristics (i.e., local spectral uncertainty) are practically captured.

Index  Index of Inputs  Operation(s)        Output Shape
(1)    -                Input Spectra       D
(2)    (1)              Conv1D(5, 21, 5)    D/5 × 5
(3)    (2)              Nrm + PReLU         D/5 × 5
(4)    (3)              Conv1D(10, 5, 2)    D/10 × 10
(5)    (4)              Nrm + PReLU         D/10 × 10
(6)    (5)              Conv1D(20, 5, 2)    D/20 × 20
(7)    (6)              Nrm + PReLU         D/20 × 20
(8)    (7)              Linear(5)           D/20 × 5

Table 4: The NN structure of the discriminative network.

Appendix D Appendix For Additional Experiments on Synthetic Data

Our method is also tested on a synthetic dataset released for the experiments in [21]. The dataset consists of four constituent materials from the ASTER spectral library: limestone, basalt, concrete and asphalt. For the scene, limestone is set as the background material; the rest are randomly placed in the image while maintaining Gaussian-shaped distributions.

Even if synthetic data is a conventional way to evaluate methods, it cannot exhibit all the adverse cases observed in real data. Hence, it is hard to simulate nonlinearity and spectral uncertainty on such data. In particular, additive noise is inserted to simulate spectral uncertainty, which cannot reflect the actual physical conditions. Fig. 4 illustrates samples from two pure materials of this synthetic dataset. Unlike real data, these samples merely resemble noise-corrupted signatures.

Experimental results are reported in Tab. 5. The model achieves the best performance when the EU and WGAN steps are not used. This result makes sense since spectral uncertainty is simulated as random additive noise; the WGAN loss eventually learns these noise characteristics instead of real spectral uncertainty, which drastically reduces performance and undermines the model. However, as explained, the actual capacity of these steps is evident on real data.

We also compare our method with the baselines. The results in Tab. 6 show that the proposed method performs better even if the full capacity of the model is not used.

Figure 4: Spectra of pixels from the Synthetic dataset.
            DMaxD         SCM          EndNet

            8.32 ± 1.4    4.47 ± 0.7   9.20 ± 1.4
N           8.15 ± 1.4    4.36 ± 0.9   9.14 ± 1.0
N           8.18 ± 1.0    3.97 ± 0.5   9.06 ± 1.1
N           7.55 ± 1.1    4.42 ± 0.6   9.51 ± 1.2
w/o EU      5.43 ± 0.3    3.81 ± 0.9   6.43 ± 0.4
w/o WGAN    5.23 ± 0.4    3.22 ± 1.2   8.88 ± 0.3

Table 5: RMSE performance (mean ± std) on the synthetic dataset. Best results are bold.
          DMaxD    SCM     EndNet

LMM       5.17     3.36    17.62
GBM       5.62     3.26    17.39
PPNM      7.66     4.01    9.34
MLM       6.52     3.85    6.71
NCM       4.09     3.85    11.17
DSCN++    5.23     3.22    6.43

Table 6: RMSE performance on the synthetic dataset. Best results are bold.