Improved Deep Spectral Convolution Network For Hyperspectral Unmixing With Multinomial Mixture Kernel and Endmember Uncertainty

08/03/2018 ∙ by Savas Ozkan, et al. ∙ 8

In this study, we propose a novel framework for hyperspectral unmixing by using a modular neural network structure while addressing the endmember uncertainty in our formulation. We present critical contributions throughout the manuscript: First, to improve the separability for hyperspectral data, we modify deep spectral convolution networks (DSCNs) that lead to more stable and accurate results. Second, we introduce a multinomial mixture kernel with a neural network (NN) which mimics the Gaussian Mixture Model (GMM) to estimate the abundances per-pixel by using the low-dimension representations obtained from the improved DSCN. Moreover, as formulated in the spectral variability assumption, a NN module is incorporated to capture the uncertainty term. Third, to optimize the coefficients of the multinomial model and the uncertainty term, Wasserstein GAN is exploited in particular since it has several theoretical benefits over the expectation maximization. Fourth, all neural network modules are formulated as an end-to-end hyperspectral unmixing pipeline that can be optimized with backpropagation by using a stochastic gradient-based solver. Experiments are performed on real and synthetic datasets. The results validate that the proposed method obtains state-of-the-art hyperspectral unmixing performance particularly on the real datasets compared to the baseline techniques.



There are no comments yet.


page 4

page 7

page 8

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

A hyperspectral imagery can provide substantial knowledge about the Earth surface in the form of narrow-band spectra that has been used in the variety of remote sensing applications. Indeed, the separation of constituent materials (i.e., endmembers) and their fractions (i.e., abundances) which contribute to the measured hyperspectral data is a fundamental step since the spectra of the materials can be subjected to complex interactions [1, 2]. As a consequence, they can be mixed in different fractions due to the limitations of hyperspectral sensors and aggravate the problem to proceed with under these scattering effects. Spectral unmixing aims to determine the fractions of the materials blindly from the mixture data that enables to analyze the data more easily [1, 3, 4, 5].

Even if the mixture data can be simplified with a linear formulation (i.e., linear mixing model

[6], in real applications, the model remains weak and underestimates the solution especially for the cases such as atmospheric/illumination conditions and intrinsic material spectra variability. To straighten the model, nonlinear models reformulate the solution by employing nonlinearity and sparsity in their assumptions [6, 7, 8, 9, 10, 11]. In the literature, the bilinear model is a commonly used technique to handle the non-linear scenarios where the bilinear interactions of linear abundances are taken into account in the model. As stated in [6], the methods based on the bilinear model mainly differ with the additional parameter terms and constraints exploited in their formulations.

However, both models assume that a hyperspectral scene is composed of a limited number of fixed endmember spectra. Hence, the performance of the methods for endmembers and their abundances might be limited due to not accounting the possible spectral variability / endmember uncertainty exhibited from the data. Fig. 1 illustrates the spectra of pixels from the Jasper and Urban datasets [12, 13]

that are labeled as pure materials. It can be seen that each material can form differently even for the same scene under the assumption of spectral variability. To this end, the conventional methods suffer from the shortcomings of the models (i.e., linear and non-linear models) and the representations are deteriorated by the usage of incorrect endmembers and their combinations. Recently, the methods with angler-based loss functions obtain state-of-the-art performance by which they can partially overcome these limitations due to their robustness to the illumination conditions 


. Similarly, the idea of stack autoencoders is adapted to learn blindly spectral signatures. This provides robustness to outliers and noise-prone data due to the large set of trainable parameters 

[14]. Moreover, the methods  [2, 15] explain the importance of sparsity for hyperspectral data.  [16] proposes a deep convolution network to estimate abundances while using both spectral and spatial neighborhoods. However, due to the lack of examples, the solution is overfitted.

Beside the linear and non-linear models, a large number of techniques incorporates spectral variability in their formulations. In general, the definition of spectral variability assumption is simply a linear/non-linear combination of true endmembers and abudances for each pixel such that there is a random uncertainty term which also contributes to the material response [17]:


where is the number of endmembers exhibited in the scene and is the spectral bands of a pixel. Also, is the additive Gaussian noise to account the possible noise sources. Note that can vary for each pixel and abundances regardless of the scene that practically defines the uncertainty exhibited from data.

In the literature, the studies that intend to be robust to the endmember uncertainty are based on two different sets of assumptions. One of them exploits a library of on-hand material spectra (i.e., the set-based assumption) so as to determine abundances [18, 19, 20, 21, 22] while the other uses the statistical distributions of the materials (i.e., the distribution-based assumption) [23, 24, 25, 17]. For the set-based assumption, the straightforward technique to represent the spectral variability is the multiple endmember spectral mixture analysis (MESMA) [18]. This algorithm intuitively unmixes data iteratively by trying all possible material spectra presented in the library until the error/conditions are completely satisfied or reached to the desired values. The main drawback is that this method is computationally demanding especially when a large body of a spectra library is used. Inevitably, performing a brute-force search for every possible models is intractable. Its variants have an objective to reduce the computation complexity and makes the assumption more tractable. Most of the variants in the literature focus to reduce the library size by selecting the representative spectra from the library [20, 21, 22]. The rest [19] modifies the search algorithm which incrementally determines the possible solutions from the subsets of endmembers so that it moderately reduces the complexity exhausted in the search step.

Similarly, the distribution-based models represent the endmembers as if they are drawn from some distributions so that the flexibility for the spectral variability can be improved. For this purpose, various distribution types can be exploited in the literature to represent the endmembers (i.e., normal composition model (NCM) [23] or beta composition model (BCM) [25]). However, Zhou et. al. [17] states that the true representation of endmembers might not be approximated to an unimodal distribution and the mixture of sub-distributions (i.e., Gaussian Mixture Model (GMM)) can be more convenient to model real data in particular. However, note that the method needs high memory requirements which make the algorithm impractical for any real scene even with the medium sized spatial resolutions. Furthermore, even if the expectation maximization is one of the prominent techniques that has been used to optimize the coefficients of the GMM [26], it suffers from to converge to global solutions and it is not reliable to learn the ideal multinomial distributions as explained in [26, 27, 28, 29, 30].

In this study, we extend the deep spectral convolution network (DSCN) framework which is combined with the endmember uncertainty and the multinomial mixture kernel for hyperspectral unmixing. For this purpose, first, the architecture is modified to reduce the chance of overfitting. Later, the prediction error caused by the spectral variability is alleviated with the multinomial mixture kernel since each abundance map for a material is overly expressed with multinomial distributions correlated to the assumption of spectral variability [17]). Moreover, we redefine the formulation in Eq. 1 with two additional neural network (NN) models (i.e., for reconstruction and uncertainty terms). To this end, hyperspectral data is unmixed and the abundances are estimated from a set of pre-computed endmember spectra in an end-to-end learning scheme by a modular neural network (NN) model.

In summary, the overall model takes hyperspectral data as input and computes high-level low-dimension representations by using Improved DSCN (DSCN++). Then, these representations are modeled with Multinomial Mixture Model to estimate abundance maps. In the reconstruction step, abundance maps are fed to two NN modules along with the multiplication of endmember signatures. To this end, individual outputs are summed and the input is reconstructed.

The contributions presented in the manuscript can be summarized as follows:

  • First, we improve the deep spectral convolution neural network (DSCN) model 


    by introducing critical modifications to the architecture which practically reduce the chance of vanishing gradient problem and enhance the selectivity of the filters. To this end, the trainable parameters tend to converge to a more stable solution which is one of the drawbacks highlighted in the baseline algorithm 


  • Second, we introduce a multinomial mixture kernel based on a neural network (NN) architecture that estimates the abundances per-pixel by mimicking the mixture of Gaussian distributions. To improve the effectiveness, the latent features obtained from the DSCN are used as the input for the multinomial mixture kernel. In particular, the parameter set used for the kernel is optimized with the Wasserstein GAN 

    [32, 33] which leads to more accurate and stable solutions than the expectation maximization, the Kullback-Leibler (KL) [34] divergence or the mean square/absolute error (ME) [35]. WGAN model mitigates the limitations of EM-like methods such as sensitivity of parameter initialization and continuity that reduce the performance of GMM models. To this end, a flexibility to the model is defined with the robustness to spectral variability.

    Fig. 1: Spectra of pixels from the Jasper (first row) and Urban (second row) datasets that correspondence to two different pure materials. It can be seen that hyperspectral data can be highly influenced by the assumption of spectral variability.
  • Third, we define a new trainable uncertainty term based on a nonlinear NN model which simply provides robustness to the assumption of endmember uncertainty for the model. To do so, the estimated abundances are conditioned with an additional noise distribution as in [32, 33] to synthesize possible uncertainty. Furthermore, the similar architecture is exploited to minimize the residue estimated from precomputed endmember spectra in the reconstruction step. To this end, this auxiliary model practically helps to converge to a better solution for abundance estimation.

  • Note that all these modifications are formulated as a full hyperspectral unmixing pipeline with NN modules and it is optimized by a stochastic gradient-based solver in an end-to-end learning scheme.

The rest of the manuscript is structured as follows. First, we define the concept of the DSCN and explain the modifications presented in the manuscript. Then, the details of the multinomial mixture kernel and endmember uncertainty are described. Later, the optimization step for learning trainable parameters are summarized. Lastly, the experimental results obtained on several real and synthetic datasets are presented and we conclude the manuscript.

Fig. 2: Flow of the proposed method. This model takes hyperspectral data as input and computes high-level representations by using Improved DSCN. Then, these representations are modeled with Multinomial Mixture Model to estimate abundance maps . In the reconstruction step, abundance maps are fed to two NN modules with the multiplication of endmember signatures. To this end, individual outputs are summed and the input is reconstructed.

Ii Improved Deep Spectral Convolution Network

In this section, we first provide a basic definition related to the deep spectral convolution networks (DSCNs). Then, the modifications for the architecture as well as the multinomial mixture kernel and the endmember uncertainty are explained in detail. Lastly, the optimization step is described. To ease the understandability, the flow of the proposed method is illustrated in Fig. 2. In this figure, and correspond to NN modules to capture the random uncertainty and auxiliary terms in the Equation 1 respectively. The further information about these modules will be provided in this section.

Ii-a Baseline Deep Spectral Convolution Network

The proposed method in [31] is equivalent to the standard autoencoder formulation where it initially encodes the input spectra to a hidden latent feature and then decodes this latent feature to recompute the reconstructed version of the original spectra . Ultimately, iterative transformations are performed and the trainable parameter sets for both encoder and decoder parts are updated based on the reconstruction error calculated by a loss function(s). To this end, corresponds to the abundances per-pixel.

In particular, the main objective of the deep spectral convolution networks (DSCNs) is that on the contrary to linear layers [2], it extracts discriminative latent features about data blindly by abstracting the local spectral characteristics. This is essential since the dimensionality of data limits the model to reach an ideal solution by the fact that affine transformations (i.e., corresponds to matrix operations as explained in [2]) learned from unsupervised data might lead to irregularities/holes in the course of partitioning of the feature space. To overcome this limitation, either supervised data [36, 37] or splitting the data into several overlapping/non-overlapping parts [38] is needed to be utilized (the detailed information about this discussion can be found in [36, 31]). Note that the assumption of splitting the data into several overlapping/non-overlapping parts directly shares similar objectives in the DSCNs with an unsupervised setup.

Furthermore, the DSCN architecture is mainly composed of 1D stacked convolutions (i.e., with normalization, non-linear activation and pooling layers) per-pixel to reveal the latent spectral characteristics of data. However, even if state-of-the-art performance is achieved, the ideal parameter solution can be highly influenced by the initialization, thus unstable results and performance variations can be observed in the solution [31]. Furthermore, since deep neural networks can be easily fooled with a random noise as described in [39], the baseline DSCN can be quite sensitive to noise.

For the clarity and understandability of the terminology, we divide the encoder part into two consecutive steps. First, the proposed DSCN module is applied to extract high-level latent features (i.e. DSCN++) where is the latent feature for each input spectra . Here, denotes the dimension of the latent feature and it can be tuned by the users (It is set to throughout the experiments). Then, these latent features are used to estimate the abundances (i.e. Multinomial Mixture Kernel) by exploiting the proposed multinomial mixture kernel for the second step. In the following section, we first describe the architecture details of the improved DSCN.

Ii-B Modifications In the Architecture

As previously explained, the lack of parameter convergence is an issue in order to obtain stable results for the DSCNs. Thus, the architecture should be restructured such that the error calculated from the loss function should be effectively propagated to the earlier layers in the model otherwise the overfitting risk can be multiplied for the architecture. Note that this problem is renowned as the vanishing gradient problem in the literature [37, 40].

Index Index of Inputs Operation(s) Output Shape
(1) - Input Spectra D
(2) (1) Conv1D(10, 21, 1) D 10
(3) (2) LReLU + AvgP(5) + BN D/5 10
(4) (3) Conv1D(10, 3, 1) D/5 10
(5) (3) Conv1D(10, 5, 1) D/5 10
(6) (3) Conv1D(10, 7, 1) D/5 10
(7) (4),(5),(6) Concat D/5 30
(8) (7) LReLU + AvgP(2) + SN D/10 30
(9) (8) Conv1D(10, 3, 1) D/10 10
(10) (9) LReLU + AvgP(2) + SN D/20 10
(11) (10) Linear(M) + LReLU M

Architecture of the improved DSCN. The notations of Conv1D(W, K, S) indicates 1D convolution whose output dimension, kernel size and stride are W, K and S respectively. Furthermore, SN denotes Spectral Normalization, BN indicates Batch Normalization and AvgP(K) corresponds to Average Pooling layer with kernel size and stride of K.

For this purpose, initially, the parameter-free ReLU activation (called ReLU) is replaced by leaky rectified units (LReLU) 

[41, 42] in the model. Precisely, this unit allows small yet non-zero gradients for negative responses of an input even if it is not completely active. Moreover, the influence/resolution of negative responses is adjusted by a constant value determined as [41].

Second, the normalization layers (i.e. spectral normalization [31] or batch normalization [43]

) and spectral convolutions are distinctly reversed to reduce the chance of getting small/no activation(s) calculated from the previous layers. Spectral normalization rescales the input by merely leveraging spectral statistics (i.e., mean and variance of spectral responses) of data than batch characteristics (As explained 

[44], for noise-prone data, use of batch normalization can saturate the responses due to the extra noise introduced in the learning step). Therefore, even if the normalized spectral values improve the selectivity of the representation particularly for the problem, it has also chance to increase the sparsity with the combination of ReLU [31] (in other words, the chance of vanishing gradient problem). To this end, some of the representative responses can be eliminated unintentionally. By this modification, the resolution of the inputs is improved while still preserving the filter selectivity. Furthermore, we observe that after the initial convolution layer, particularly applying batch normalization than spectral normalization leads to more accurate/stable representations since batch normalization can regularize the network as intended in the Dropout layer [40] that reduces the overfitting of the trainable parameters. Lastly, trainable parameters in convolution operators are normalized with L2-norm to eliminate small values.

Fig. 3: The data distributions before (first column) and after (second column) applying the improved DSCN to the input spectra for the Jasper (first row), Urban (second row) and Synthetic (third row) datasets. As seen, the latent feature space can be more separable than the original material spectra.

Moreover, if we consider the first convolution layer of the model as a low-level feature extractor, the consecutive layers target to compute higher/distinct representations about data. To enrich the capacity of the representation, the inception module 

[45] is adapted. Therefore, multiple filters with different kernel sizes are performed for the same input and the filters can select the best filters by reweighing the responses according to their importance for the data.

In addition, average pooling layers are leveraged to reduce the dimensionality of data at each layer to obtain a low-dimension latent representation for the input spectra. Note that the purpose of using an average operation instead of a max operation is to decrease the chance of overfitting to the constantly responding filters which is a well-known issue in the neural network (NN) community [40]. At the end, the modified architecture for is illustrated in Table I. Note that in the experiments, we observe that performance does not significantly change by parameter sizes in the model.

Moreover, WaveNet model applied for the synthesis of audio signals [46] has a shared notion as in the proposed model and proves the success for 1D signals.

To show the effectiveness of the improved DSCNs, Figure 3 visualizes the data distributions projected to 2D spaces before and after applying the improved DSCN (colors are discriminative based on their abundance values and the PCA is used to reduce the dimensionality). It can be seen that the improved DSCN transforms the representation of data such that the latent features become more representative and the materials can be easily separable from the others.

In the following section, we will explain the multinomial mixture kernel that calculates the abundances per-pixel by exploiting the hidden latent features obtained by the improved DSCN and their distribution characteristics.

Ii-C Multinomial Mixture Kernel For Abundance Maps

For the assumption of endmember uncertainty, compositional models sampled from an unimodal distribution estimate the distributions as follows:


Here, denotes the parameter set of a Gaussian distribution that is composed of mean and covariance matrix where .

As indicated, these kernels can be misleading since the data cannot be directly approximated to an unimodal distribution [17] and multinomial mixture models need to be utilized to model the data properly.

For this purpose, Eq. 2 is generalized with Gaussian Mixture Model (GMM) by computing the distribution from the statistical perspective:



is a random variable whose sample space is

. Moreover, the probabilistic values correspond to and . Hence, the model formulates the estimation of abundances such that each latent feature of a pixel for the endmember can be represented with the mixture of multivariate Gaussian distributions as follows:


where is the number of components in the GMM and assures that endmembers are overly represented with a mixture of Gaussian components that practically provides robustness to the endmember uncertainty. Moreover, is the weight term which determines the influence of the GMM component to the abundance value.

Inspired by the multinomial distribution-based methods, we claim that the distributions (i.e. components) can be imposed on the abundances such that each abundance for a material can be overly expressed with the distributions which is ultimately related to the idea of spectral variability. For this purpose, a novel NN kernel is structured such that it especially computes the material abundances from the latent features (i.e. Multinomial Mixture Kernel).

For our model, we mimic the GMM with a neural network (NN) module and a novel multinomial mixture kernel is proposed. Note that this model is trainable in an end-to-end manner and necessary parameters in the multinomial mixture kernel can be learned by error backpropagation.

In particular, the distance between the latent representation and distribution in Eq. 4 can be simply calculated by the multivariate mahalanobis distance:


By expanding the formulation in Eq. 5, it can be roughly rewritten as . Note that the formulation can be simplified with a linear neural network layer by losing the form of quadratic function (i.e., with trainable weights and bias ) as where and

. Furthermore, the distances for all endmembers is normalized with a sigmoid activation to produce a probabilistic relation (Please note that softmax distribution also works seamlessly/similarly for this model). To this end, the normal distribution is resembled with a cumulative distribution function as follows:


We should note that in the implementation of the kernel, we tested to utilize a model that directly uses and coefficients (i.e. as a quadratic form) rather than parameterizing them with a linear NN layer as described in the previous paragraph. However, we observed that even if this model yields stable results (i.e., small variations), the overall accuracy becomes low. As explained in [47], this issue can be summarized by the fact that parameterizing the formulation with a linear NN layer makes the operations differentiable which enables to obtain better performance. However, the capacity of representing true relations (i.e., covariance matrix) is also underestimated. Therefore, parameterized coefficients are used.

Similarly, the weight term is calculated by a NN model and the sum-to-one constraint is supplied with a softmax activation:


To this end, an abundance per-material is modeled with Eq. 4 by replacing the original components with the differentiable NN layers formulated as in Eq. 6 and Eq. 7. To assure the sum-to-one constraint for abundances of each pixel, L1-norm is applied at the end [48].

Last but not least, note that the proposed kernel decreases the drawback of NNs to noise, since multinomial mixture models can effectively capture the uncertainty in the data so that the trainable parameters yield more stable results as stated in [49, 50]

Ii-D Spectral Variability Assumption

In Eq. 1, the formulation encapsulating spectral variability has an extra term compared to the linear/nonlinear models which defines the uncertainty per-pixel subjected to the abundances. In particular, this term should be also modeled with an additional parameter set in the model in order to obtain state-of-the-art performance, since it captures the possible deformations of spectra due to the spectral variability which cannot be incorporated by the multinomial mixture kernel.

In the formulation, by simply expanding Eq. 1, it is easy to see that there are two sets of equations where denotes the projection of endmember spectra (i.e., precomputed material spectra) with the estimated abundances while the other term corresponds to that defines the influence of the uncertainty depending on the abundance. For this purpose, we introduce a trainable NN module which takes the estimated abundances and Gaussian noise as input and generates an additive nonlinear spectra that degrades the reconstructed spectra based on the uncertainty assumption. More precisely, the uncertainty module is composed of two consecutive nonlinear NN layers that incrementally increases the dimension of spectra (i.e., ) by exploiting the abundance values and Gaussian noise where is the dimension of the noise. In the last layer, the responses are rescaled with TanH activation as well as a trainable coefficient to limit the influence.

Furthermore, since the performance of hypersepctral unmixing is highly influenced by the precomputed material spectra (i.e., it can be computed from various endmember extraction techniques), the representation of endmember spectra can be also improved by the model for better abundance results. Rather than tuning the precomputed material spectra by backpropagation, we observed that introducing a new trainable term which refines the material spectra can yield better performance. Mainly, this stems by the fact that the alteration of material spectra allows one of the crucial information about data to be lost even if the endmembers are blindly estimated. Therefore, this enables us to preserve endmember information by enhancing their weakness with an extra term.

For this purpose, the true material spectra is decomposed into where is the precomputed material spectra and is the possible residue obtained by the endmember method. Similarly, this residue term can be modeled by exploiting the similar architecture as in the uncertainty with no shared parameters as and . Hence, Eq. 1 in the manuscript that corresponds to can be rewritten as follows:


In particular, parameters that provide robustness to the endmember uncertainty need to be optimized with a novel scheme in order to obtain state-of-the-art performance.

Ii-E Parameter Learning

To minimize the reconstruction error, we exploit the optimization scheme proposed in [2]. For this purpose, the Spectral Angle Distance between the input spectra and the reconstructed spectra is dominantly maximized with the KL-divergence. Note that the similar notion (i.e. maximize the cross entropy) is used in GAN optimization:


where determines the influence of the regularization term (i.e., sparsity) and is used for regularizing the parameters of the encoder module (except s, s) with norm. In this work, and are set to and respectively. Furthermore, defines the ratio of the mean square/absolute error (ME) contributed to the reconstruction error. Note that ME is particularly critical for synthetic datasets since the angler distinctiveness is quite small compared to real data (we will explain this statement in the experiment section in detail). Hence, in the default setting, is set to 0. Lastly, is the expectation value of the possible values for the mini-batch.

In particular, to estimate the GMM parameters, the expectation maximization (EM) is the most popular technique in the literature and it theoretically guarantees to obtain a stationary solution by minimizing the negative log-likelihood function. However, the log-likelihood function highly suffers from to converge to a global solution by random parameter initialization and this assumption is highlighted in various studies in the literature. As explained in [26, 27, 28, 29, 30], this stems from the fact that solving negative log-likelihood function is similar to employ the Kullback-Leibler (KL)-divergence and it has weakness that can be summarized as twofold. First, since GMM is continuous and locally Lipschitz [27], the optimization method should be also continuous and differentiable where this cannot be completely supplied by the KL-divergence. Second, the KL-divergence suffers from the local minimum due to the discontinuity of the function which becomes highly sensitive to the choice of initial parameters. Further discussions related to the KL-divergence can be found in [27].

Index Index of Inputs Operation(s) Output Shape
(1) - Input Spectra D
(2) (1) Conv1D(5, 21, 5) D/5 5
(3) (2) BN + PReLU D/5 5
(4) (3) Conv1D(10, 5, 2) D/10 10
(5) (4) BN + PReLU D/10 10
(6) (5) Conv1D(20, 5, 2) D/20 20
(7) (6) BN + PReLU D/20 20
(8) (7) Linear(5) D/20 5
TABLE II: Architecture of Discriminative module. Differently, batch normalization and strides are used to obtain latent representations.

Similarly, for various studies regardless of their applications, the shortcomings of ME for the manifold learning are also stated [35]. In short, the ME-based solutions yield overly smoothed results due to the averaging of all possible solutions in the manifold space so that the resultant outputs can be underestimated. Possible solutions for an inverse problem can be exponentially multiplied if only pixel-wise relations are used. Note that this assumption is built for the assessment of image quality [51, 52] and due to its weakness, ME cannot preserve the structural similarity of data.

As stated, the loss function utilized in our method is mainly composed of the KL-divergence and the ME by minimizing the reconstruction error (The details of the loss functions will be provided in the following section.). Inevitably, the drawbacks of the techniques as we previously discussed affect the performance of the multinomial mixture kernel as well as the proposed method if these techniques are merely used in the formulation. For this purpose, we exploit the Wasserstein GAN [32, 33] to optimize the trainable parameters especially for the mixture model and the uncertainty term that implicitly correspond to s, s and .

Briefly, generative adversarial networks (GANs) have shown impressive performance for learning data distributions in various machine learning (ML) problems by their ability to transform a known distribution

to an unknown data distribution . However, the original GANs are potentially not continuous to minimize the error as explained in [32] and the similar limitations can be observed as if the KL-divergence is used (i.e., discontinuity). Therefore, the improved Wasserstein GAN is exploited in the proposed model which has several theoretical benefits over the original GAN architecture [32, 33, 53, 54].

Formally, Wasserstein distance is constructed to measure the distance between and by using the Kantorovich-Rubinstein duality [55]. Note that the Wasserstein distance is continuous and differentiable almost everywhere. Moreover, compared to the KL-divergence, the loss function is much smoother so that the method converges to the ideal solution freely from the initialization. Additionally, as discussed in [35], more accurate results can be obtained, since it provides robustness to the averaging operations in the ME.

In our case, and correspond to the distributions of input spectra and reconstructed spectra respectively and the adversarial loss function can be defined as:


where is the discriminative module and it is composed of a family of 1-Lipschitz functions [27]. Moreover, is sampled uniformly from normalized input and reconstructed spectra (This is critical since the angler-based metric is used in the model). determines the influence of the gradient penalty [33] for stable parameter learning.

For the discriminative module , we adapt the terminology of the PatchGAN which is extensively used in the image translation problem [56]. More precisely, instead of penalizing the overall structure of spectra with the discriminative module, the spectral patches of spectra are used for the purpose of capturing local spectral characteristics of real and reconstructed spectra. The architecture details are presented in Table II.

Ii-F Implementation Details

For the parameter optimization, since the proposed model is composed of different modules, five individual parameter sets should be learned from data. These are (corresponds to s, s), , , and . Therefore, loss functions and corresponding scale values for each set can be summarized as follows:


From Eq. 11, it can be seen that the Wasserstein GAN is dominantly used to optimize the parameters of and by which these parameters are implicitly used to provide robustness to the assumption of spectral variability. GAN models are powerful to form the noise space based on the definition of a problem. This is critical since the uncertainty exhibited from hyperspectral data can be precisely captured by the Wasserstein distance and noise space, and Wasserstein distance has several theoretical benefits to obtain a stable and accurate solution as explained throughout the manuscript.

Moreover, to optimize the latent representations learned from data, the DSCN parameters are updated by exploiting only the reconstruction loss . In addition, the learning rates for multinomial mixture kernel and auxiliary functions are scaled with the constants to decrease the learning speed compared to the DSCN module, since the DSCN parameters should converge faster so as to obtain more reliable abundances as expected.

Lastly, Adam stochastic optimizer [57]

is utilized to update the parameters for each mini-batch by empirically defining the first-order moment term

as 0.7. Note that the mini-batch size is fixed to 64. Also, we set the learning rate and the number of training iterations to 0.002 and 10K respectively for all datasets. The proposed algorithm is implemented on Python by extensively leveraging Tensorflow framework

111Project Page: . Additional results are also provided.

Iii Experiments

Experiments to compare the performance of the proposed method (DSCN++) with the baseline techniques are conducted on one synthetic and two real datasets. For this purpose, Linear Mixture Model (LMM) [7], Generalized Bilinear Model (GBM) [58], Post-Nonlinear Mixing Model (PPNM) [59], Multilinear Mixing Model (MLM) [7] and Normal Compositional Model (NCM) [60] are selected for the hyperspectral unmixing baselines that are extensively used methods in the literature (codes are available on the web.). Note that the recent method in [17] needs high memory requirement which is impossible to compute the algorithm on real datasets especially by maintaining their original spatial resolutions.

Furthermore, Spatial Compositional Model (SCM) [61], Distance-MaxD (DMaxD) [3] and Sparse AutoEncoder Network (EndNet) [2] are exploited to estimate the constituent endmember spectra from the scenes.

Although no additional experiment is conducted, by comparing the reported results in [31], improved DSCN model improves performance for real datasets compared to the base model.

To evaluate the unmixing performance and compare the estimated abundances with the ground truth, Root Mean Square Error (RMSE) is utilized:


where denotes the true abundance per-pixel while indicates the estimated one by the baselines. Also, corresponds to the total number of pixels in the scene.

Fig. 4: Endmember spectra (first row) and abundance maps (second row) for the Synthetic dataset.

In following sections, we first provide the quantitative performance for each dataset that clearly demonstrates the impact of the proposed method by highlighting each individual step as well as different parameter configurations in detail. Note that is the only coefficient needed to be tuned by the user. Therefore is determined for each step based on their overall results in the experiments (As expected,

(the number of mixture components) can vary for each dataset as well as each endmember method based on its distinctiveness). Later, the comparisons with the baselines are explained. For reliability of assessments, tests are repeated 20 times for the proposed method, thus mean and standard deviation are reported.

To ease the understandability, some abbreviations used in the Tables can be summarized as follows: DSCN indicates the deep spectral convolution network as explained in the subsection II-B. EU stands for the usage of NN models for the uncertainty term and the auxiliary model described in the subsection II-D. Lastly, WGAN indicates the use of Wasserstein GAN in the parameter optimization as mentioned in the subsection II-E.

DMaxD [3] SCM [61] EndNet [2]

8.32 1.4 4.47 0.7 9.20 1.4
8.15 1.4 4.36 0.9 9.14 1.0
8.18 1.0 3.97 0.5 9.06 1.1
7.55 1.1 4.42 0.6 9.51 1.2
wo DSCN 10.78 1.2 10.96 1.5 13.37 1.9
wo EU 5.43 0.3 3.81 0.9 6.43 0.4
wo WGAN 5.23 0.4 3.22 1.2 8.88 0.3

TABLE III: RMSE performance on the synthetic dataset for different configurations.

Iii-a Synthetic Dataset

Fig. 5: Spectra of pixels from the Synthetic dataset. It can be seen that even if random noise is added in the construction of data, spectral variability cannot be exhibited truly and it can be misleading for learning-based methods since learning-based methods (generative-based models in particular) can learn this random noise easily.

The unmixing methods are tested on the synthetic dataset released for the experiments in [17]. The dataset consists of four constituent materials from the ASTER spectral library as limestone, basalt, concrete and asphalt. The spectral range of data is from m to m and the spectral dimension is resampled to 200. Furthermore, the spatial resolution of the scene is . For the scene, limestone is set as the background material, the rest is randomly placed to the scene by maintaining the Gaussian-shaped distributions. The endmember spectra and abundance maps for the scene are illustrated in Fig. 4.

Note that even if synthetic datasets are conventional to test the hyperspectral unmixing performance, it is not quite possible to capture all adverse conditions observed in the real data (such as spectral variability). Hence, it can be hard to validate the performance truly under this strong bias. To degrade the bias, the only practical solution is to add random noise on the data which cannot reflect the true physical conditions (i.e., spectral variability) observed in the real data. More precisely, Fig. 1 and Fig. 5 show the severe difference between real and synthetic datasets for pure materials. It can be seen that the angler distinctiveness is quite small (i.e., spectra of pixels are nearly concentrated on a single response) and use of angler-based metric can be misleading for the dataset. Hence, only for the synthetic dataset, the KL-divergence with SAD in the reconstruction loss is replaced (for SCM) or jointly used (for DmaxD, Endnet) with the ME (i.e., ).

The experimental results for each individual step and different parameter configuration settings are reported in Table III. From the results, the proposed method achieves significantly better performance without EU and WGAN steps. This is quite reasonable since the data is degraded by random noise to simulate the spectral variability and the proposed model ultimately learns the noise from data than the assumption of spectral variability. Hence, this leads to the corruption for the model. However, note that true impacts of these steps can be clearly observed for real datasets. Therefore, further discussions about these steps are reserved for the real datasets.

The RMSE performance for the baselines are shown in Table IV. From the results, the proposed method obtain slightly better performance compared to the baselines. Note that the proposed method is effective on the synthetic dataset even if it does not exploit all its features (i.e., EU and WGAN).

Lastly, note that  [17] obtains RMSE performance for this synthetic data. However, due to its computation requirement and difference for the endmember estimation step, it is not completely compatible with our method.

DMaxD [3] SCM [61] EndNet [2]

LMM [7]
5.17 3.36 17.62
GBM [58] 5.62 3.26 17.39
PPNM [59] 7.66 4.01 9.34
MLM [7] 6.52 3.85 6.71
NCM [60] 4.09 3.85 11.17
DSCN++ 5.23 3.22 6.43

TABLE IV: RMSE performance on the synthetic dataset for baselines. The best results are indicated as Bold-Red while the second best results are shown as Bold.
Fig. 6: Endmember spectra (first row) and abundance maps (second row) for the Urban dataset.

Iii-B Urban Dataset

The spatial resolution of the scene is . The spectral range covers from 10 nm to 2500 nm and it is composed of 210 spectral bands. Several spectral bands (1-4, 76, 87, 101-111, 136-153 and 198-210) are removed from data due to the water-vapor absorption and atmospheric effects, thus the final spectral band is reduced to 162. Four material types are observed in the scene: asphalt, grass, tree and roof [13]. The endmember spectra and abundance maps for the scene are illustrated in Fig. 6.

The performance for the proposed method under different configurations is illustrated in Table V. It can be seen that the selection of larger than improves the performance for all endmember techniques. Note that the optimum value can vary since the discriminative power of the representation can be highly influenced by the endmember extraction techniques regardless of how spectral variability affects the scene. In addition, the results confirms that use of DSCN in order to reduce the dimension of data and to obtain compact representations yields better results. Moreover, for the models without the DSCN, the variations in the performance are significantly increased due to the fact that high dimensionality can lead to irregularities/holes for the space as previously described.

DMaxD [3] SCM [61] EndNet [2]

15.96 2.6 13.26 0.6 8.27 0.4
15.47 1.9 13.29 0.8 8.05 0.5
16.32 3.2 13.24 0.5 8.16 0.4
16.56 2.1 13.23 0.8 8.04 0.3

22.12 2.6 16.63 4.6 11.42 0.8
wo EU 34.14 0.3 15.91 0.3 8.38 0.4
wo WGAN 31.43 1.6 15.76 1.0 8.52 0.5
TABLE V: RMSE performance on the Urban dataset for different configurations.
DMaxD [3] SCM [61] EndNet [2]

LMM [7]
31.04 28.00 27.59
GBM [58] 31.04 27.00 27.08
PPNM [59] 36.54 28.17 15.06
MLM [7] 31.43 16.34 9.48
NCM [60] 31.04 28.00 27.51
DSCN++ 15.47 13.23 8.04

TABLE VI: RMSE performance on the Urban dataset for baselines. The best results are indicated as Bold-Red while the second best results are shown as Bold.

Similarly, considering endmember uncertainty with additional NN models practically improves the performance. For the Wasserstein GAN, it helps to converge to the ideal solution more likely. Moreover, it achieves more stable results by which the influence of parameter initialization is decreased in the model.

The performance is compared with the baselines in Table VI and the proposed method outperforms all hyperspectral unmixing techniques significantly. In particular, the usage of WGAN and EU steps improve the performance of DmaxD such that it leads to a highly accurate solution where the baselines stuck to stationary results.

Fig. 7: Endmember spectra (first row) and abundance maps (second row) for the Jasper dataset.

Iii-C Jasper Dataset

The Jasper dataset is captured by the AVIRIS sensor and the dimension is pixels. Several channels (1-3, 108-112, 154-166 and 220-224) are discarded due to the atmospheric effects and water-vapor absorption. The spectral band is of 198. There are four constituent materials in the scene as tree, water, soil and road [12]. Fig. 6 shows the endmember spectra and abundance maps for the scene.

The detailed performance for different configuration settings and steps are reported in Table VII. It can be seen that similar results can be observed as stated for the Urban dataset. A single outlier is that SCM without the DSCN achieves slightly better performance compared to the baseline. This stems from the fact that the SCM representation space can be more discriminative than the DSCN for this dataset.

DMaxD [3] SCM [61] EndNet [2]

12.03 0.7 18.10 0.9 6.38 0.7
12.29 0.9 18.33 1.1 6.61 0.7
12.12 0.7 17.92 0.8 6.31 0.7
12.27 0.6 18.03 1.1 6.62 1.1
wo DSCN 12.47 0.6 17.49 1.2 6.87 2.2
wo EU 14.78 0.4 20.26 0.4 8.36 0.5
wo WGAN 15.63 0.3 19.03 0.4 10.15 0.5
TABLE VII: RMSE performance on the Jasper dataset for different configurations.
DMaxD [3] SCM [61] EndNet [2]

LMM [7]
15.31 19.78 22.60
GBM [58] 15.98 19.95 21.91
PPNM [59] 12.99 19.78 14.72
MLM [7] 16.59 20.73 8.13
NCM [60] 15.29 19.56 22.42
DSCN++ 12.03 17.92 6.31

TABLE VIII: RMSE performance on the Jasper dataset for baselines. The best results are indicated as Bold-Red while the second best results are shown as Bold.

For the baselines comparisons, the proposed method obtains state-of-the-art performance as demonstrated in Tab VIII. Compared to [31] which corresponds to the baseline DSCN, the improved model significantly improves the performance while small variations in the performance can be observed.

Iv Discussion and Conclusion

In this manuscript, we introduce a novel neural network (NN) framework for hyperspectral unmixing. It is mainly composed of deep spectral convolution networks (DSCNs) by addressing the assumption of spectral variability that is one of the severe physical conditions frequently observed in real data. More precisely, in real applications, pixel responses for each material can be varied due to the atmospheric/illumination conditions instead of fixed material spectra. Therefore, these conditions should be addressed by an effective model in order to obtain state-of-the-art unmixing performance.

For this purpose, we present critical contributions throughout the manuscript as follows: First, we make several modifications in the architecture of the DSCNs to obtain stationary yet more accurate results. Second, a novel NN layer is proposed to estimate the abundances per-pixel by leveraging the multinomial mixture kernel. This kernel particularly provides robustness for the assumption of spectral variability since the distribution-based assumptions can be more convenient to handle to these adverse conditions as explained in the literature. Moreover, an extra NN model is utilized to capture the uncertainty term in the formulation of the spectral variability. Third, the Wasserstein GAN is exploited in the parameter optimization which leads to more accurate and stable performance than the KL-divergence and the ME. Lastly, the model is formulated as a full pipeline based on a NN method that can be learned by backpropagation and a stochastic gradient solver.

The experimental results validate that the proposed method outperforms well-known hyperspectral unmixing baselines by a large margin. In particular, the methods obtains state-of-the-art performances on the real datasets where severe spectral variability can be observed.

Complexity: As analyzed in [2, 62], a NN model with the stochastic gradient algorithm significantly decreases the complexity and memory requirement needed by the model. More precisely, these parameters are independent from the data size due to the batch-based learning. Hence, the method is more applicable for large-scale data.


The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used for this research.


  • [1] Z. Yang, G. Zhou, S. Xie, S. Ding, J.-M. Yang, and J. Zhang, “Blind spectral unmixing based on sparse nonnegative matrix factorization,” IEEE Transactions on Image Processing, vol. 20, no. 4, pp. 1112–1125, 2011.
  • [2] S. Ozkan, B. Kaya, and G. B. Akar, “Endnet: Sparse autoencoder network for endmember extraction and hyperspectral unmixing,” arXiv preprint arXiv:1708.01894, 2017.
  • [3] R. Heylen, D. Burazerovic, and P. Scheunders, “Non-linear spectral unmixing by geodesic simplex volume maximization,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 3, pp. 534–542, 2011.
  • [4] J. J. Settle, “On the relationship between spectral unmixing and subspace projection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 34, no. 4, pp. 1045–1046, 1996.
  • [5] N. Keshava and J. F. Mustard, “Spectral unmixing,” IEEE signal processing magazine, vol. 19, no. 1, pp. 44–57, 2002.
  • [6] N. Dobigeon, Y. Altmann, N. Brun, and S. Moussaoui, “Linear and nonlinear unmixing in hyperspectral imaging,” in Data Handling in Science and Technology.   Elsevier, 2016, vol. 30, pp. 185–224.
  • [7] R. Heylen and P. Scheunders, “A multilinear mixing model for nonlinear spectral unmixing,” IEEE transactions on geoscience and remote sensing, vol. 54, no. 1, pp. 240–251, 2016.
  • [8] B. Somers, K. Cools, S. Delalieux, J. Stuckens, D. Van der Zande, W. W. Verstraeten, and P. Coppin, “Nonlinear hyperspectral mixture analysis for tree cover estimates in orchards,” Remote Sensing of Environment, vol. 113, no. 6, pp. 1183–1193, 2009.
  • [9] Y. Altmann, N. Dobigeon, and J.-Y. Tourneret, “Bilinear models for nonlinear unmixing of hyperspectral images,” in Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), 2011 3rd Workshop on.   IEEE, 2011, pp. 1–4.
  • [10] J. M. Nascimento and J. M. Bioucas-Dias, “Nonlinear mixture model for hyperspectral unmixing,” in Image and Signal Processing for Remote Sensing XV, vol. 7477.   International Society for Optics and Photonics, 2009, p. 74770I.
  • [11] N. Raksuntorn and Q. Du, “Nonlinear spectral mixture analysis for hyperspectral imagery in an unknown environment,” IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 4, pp. 836–840, 2010.
  • [12] F. Zhu, Y. Wang, S. Xiang, B. Fan, and C. Pan, “Structured sparse method for hyperspectral unmixing,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 88, pp. 101–118, 2014.
  • [13] F. Zhu, Y. Wang, B. Fan, S. Xiang, G. Meng, and C. Pan, “Spectral unmixing via data-guided sparsity,” IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 5412–5427, 2014.
  • [14] Y. Su, J. Li, A. Plaza, A. Marinoni, P. Gamba, and S. Chakravortty, “Daen: Deep autoencoder networks for hyperspectral unmixing,” IEEE Transactions on Geoscience and Remote Sensing, 2019.
  • [15] X.-R. Feng, H.-C. Li, J. Li, Q. Du, A. Plaza, and W. J. Emery, “Hyperspectral unmixing using sparsity-constrained deep nonnegative matrix factorization with total variation,” IEEE Transactions on Geoscience and Remote Sensing, 2018.
  • [16] X. Zhang, Y. Sun, J. Zhang, P. Wu, and L. Jiao, “Hyperspectral unmixing via deep convolutional neural networks,” IEEE Geoscience and Remote Sensing Letters, no. 99, pp. 1–5, 2018.
  • [17] Y. Zhou, A. Rangarajan, and P. D. Gader, “A gaussian mixture model representation of endmember variability in hyperspectral unmixing,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2242–2256, 2018.
  • [18] D. A. Roberts, M. Gardner, R. Church, S. Ustin, G. Scheer, and R. Green, “Mapping chaparral in the santa monica mountains using multiple endmember spectral mixture models,” Remote Sensing of Environment, vol. 65, no. 3, pp. 267–279, 1998.
  • [19] R. Heylen, A. Zare, P. Gader, and P. Scheunders, “Hyperspectral unmixing with endmember variability via alternating angle minimization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 8, pp. 4983–4993, 2016.
  • [20] C. Quintano, A. Fernández-Manso, and D. A. Roberts, “Multiple endmember spectral mixture analysis (mesma) to map burn severity levels from landsat images in mediterranean countries,” Remote Sensing of Environment, vol. 136, pp. 76–88, 2013.
  • [21] D. A. Roberts, P. E. Dennison, M. E. Gardner, Y. Hetzel, S. L. Ustin, and C. T. Lee, “Evaluation of the potential of hyperion for fire danger assessment by comparison to the airborne visible/infrared imaging spectrometer,” IEEE Transactions on Geoscience and Remote Sensing, vol. 41, no. 6, pp. 1297–1310, 2003.
  • [22] P. Dennison and D. Roberts, “Endmember selection for mapping chaparral species and fraction using multiple endmember spectral mixture analysis,” Remote Sensing of Environment, vol. 87, no. 2-3, pp. 295–309, 2003.
  • [23] B. Zhang, L. Zhuang, L. Gao, W. Luo, Q. Ran, and Q. Du, “Pso-em: A hyperspectral unmixing algorithm based on normal compositional model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 12, pp. 7782–7792, 2014.
  • [24] O. Eches, N. Dobigeon, and J.-Y. Tourneret, “Estimating the number of endmembers in hyperspectral images using the normal compositional model and a hierarchical bayesian algorithm,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 3, pp. 582–591, 2010.
  • [25] X. Du, A. Zare, P. Gader, and D. Dranishnikov, “Spatial and spectral unmixing using the beta compositional model,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 1994–2003, 2014.
  • [26] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the em algorithm,” Journal of the royal statistical society. Series B (methodological), pp. 1–38, 1977.
  • [27] S. Kolouri, G. K. Rohde, and H. Hoffman, “Sliced wasserstein distance for learning gaussian mixture models,” arXiv preprint arXiv:1711.05376, 2017.
  • [28] B. Jian and B. C. Vemuri, “Robust point set registration using gaussian mixture models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1633–1645, 2011.
  • [29] C. Améndola, M. Drton, and B. Sturmfels, “Maximum likelihood estimates for gaussian mixtures are transcendental,” in International Conference on Mathematical Aspects of Computer and Information Sciences.   Springer, 2015, pp. 579–590.
  • [30] C. Jin, Y. Zhang, S. Balakrishnan, M. J. Wainwright, and M. I. Jordan, “Local maxima in the likelihood of gaussian mixture models: Structural results and algorithmic consequences,” in Advances in Neural Information Processing Systems, 2016, pp. 4116–4124.
  • [31] S. Ozkan and G. B. Akar, “Deep spectral convolution network for hyperspectral unmixing,” arXiv preprint arXiv:1806.08562, 2018.
  • [32] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, 2017, pp. 214–223.
  • [33] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems, 2017, pp. 5769–5779.
  • [34]

    J. R. Hershey and P. A. Olsen, “Approximating the kullback leibler divergence between gaussian mixture models,” in

    Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 4.   IEEE, 2007, pp. IV–317.
  • [35] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al.

    , “Photo-realistic single image super-resolution using a generative adversarial network,”

    arXiv preprint, 2017.
  • [36] M. Norouzi and D. M. Blei, “Minimal loss hashing for compact binary codes,” in Proceedings of the 28th international conference on machine learning (ICML-11).   Citeseer, 2011, pp. 353–360.
  • [37]

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in

    Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [38] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 1, pp. 117–128, 2011.
  • [39] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , 2015, pp. 427–436.
  • [40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [41] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. icml, vol. 30, no. 1, 2013, p. 3.
  • [42] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [43] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [44] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 901–909.
  • [45] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
  • [46] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
  • [47] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307.
  • [48] B. Hapke, “Bidirectional reflectance spectroscopy: 1. theory,” Journal of Geophysical Research: Solid Earth, pp. 3039–3054, 1981.
  • [49] P. Melchior and A. D. Goulding, “Filling the gaps: Gaussian mixture models from noisy, truncated or incomplete samples,” arXiv preprint arXiv:1611.05806, 2016.
  • [50] A. Ozerov, M. Lagrange, and E. Vincent, “Uncertainty-based learning of gaussian mixture models from noisy data,” 2012.
  • [51] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, 2003.
  • [52] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli et al., “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, pp. 600–612, 2004.
  • [53] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio, “Learning with a wasserstein loss,” in Advances in Neural Information Processing Systems, 2015, pp. 2053–2061.
  • [54] G. Peyré, J. Fadili, and J. Rabin, “Wasserstein active contours,” in Image Processing (ICIP), 2012 19th IEEE International Conference on.   IEEE, 2012, pp. 2541–2544.
  • [55] C. Villani, Optimal transport: old and new.   Springer Science & Business Media, 2008, vol. 338.
  • [56]

    P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,”

    arXiv preprint, 2017.
  • [57] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [58] A. Halimi, Y. Altmann, N. Dobigeon, and J.-Y. Tourneret, “Nonlinear unmixing of hyperspectral images using a generalized bilinear model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 11, pp. 4153–4162, 2011.
  • [59] Y. Altmann, A. Halimi, N. Dobigeon, and J.-Y. Tourneret, “Supervised nonlinear spectral unmixing using a postnonlinear mixing model for hyperspectral imagery,” IEEE Transactions on Image Processing, vol. 21, no. 6, pp. 3017–3025, 2012.
  • [60] D. Stein, “Application of the normal compositional model to the analysis of hyperspectral imagery,” in Advances in techniques for analysis of remotely sensed data, 2003 IEEE Workshop on.   IEEE, 2003, pp. 44–51.
  • [61] Y. Zhou, A. Rangarajan, and P. D. Gader, “A spatial compositional model for linear unmixing and endmember uncertainty estimation,” IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5987–6002, 2016.
  • [62]

    L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in

    Proceedings of COMPSTAT’2010.   Springer, 2010, pp. 177–186.