1 Introduction
Even though hyperspectral data provides rich content information about the Earth surface and it has been used in a variety of remote sensing applications, the materials exhibited from the surface can be mixed per pixel in different fractions due to the lowspatial resolution of the sensors. Therefore, high dimensional material signatures (i.e., endmembers) and their fractions (i.e., abundance maps) for each pixel need to be extracted blindly from data. The mixture of materials can be formulated with a linear model which intuitively defines the necessary parameters for the problem:
(1) 
where is the number of materials in the scene and is the random noise to approximate the problem to nonlinearity. Moreover, there are two more constraints which bound the physical properties of the solution.
Primarily, the solutions in literature are highly influenced by the geometrical volumebased assumption where the vertices of data distribution correspond to endmembers, since all data can be reconstructed by the combination of these vertices with different fractions. This derivation is exploited in several approaches as the presence of purepixel[4, 20], projectionbased [11, 3], kernelbased [2, 19] in literature. However, even if these linear methods work seamlessly to some extent, e.g., on synthetic data / controlled environment, they strive to cope with some cases such as multiple scattering effects, microscopiclevel material mixtures and waterabsorbed environment on real data [13]. Similarly, various solutions based on nonlinear projection [17, 1] and nonlinear kernel function [9, 5] are derived to mitigate these issues.
Very recently, sparse neural networks
[13, 14] introduce significant performance improvements compared to the traditional blind linear/nonlinear approaches and supervised neural network methods [16, 15]. Modifications on the network architecture and loss function constitute the mainstream of these methods. Both methods explain that combination of ReLU activation with batch normalization ultimately improves the sparsity of abundance maps while endmember estimates lead to nearoptimum solutions. In addition,
[13] introduces a novel loss function which is in accordance with the problem by exploiting spectral angle similarities, regularization layers and additional constrains that boost the sparsity of abundance maps and parameter convergence.However, as explained [13], these methods should be combined with additional hyperspectral unmixing methods (i.e. even if they outperform the stateoftheart methods) in order to improve the performance even further. This limitation stems by the fact that highdimensionality of signatures, shallow network structure and averaging operations in batch normalization.
Project Website: https://github.com/savasozkan/dscn
In this paper, we propose a deep spectral convolutional network (DSCN) to unmix hyperspectral data with precomputed endmembers. To address these current limitations, we present several contributions to the architecture as follows:

[leftmargin=0.4cm,rightmargin=0cm]

To reduce the adverse effects of highdimensionality (we will discuss theoretical explanations/observations in detail in Section 2), we replace the fullyconnected linear operation with convolutions which enables to extract representative local spectral information from a signature rather than its full version.

Furthermore, use of convolutions allows us to promote a deeper architecture, in other words, a sequence of convolutions which improves the sparsity as well as highlevel abstract representation of signatures.

For convolution layers, we replace batch normalization with spectral normalization which aims to improve the spectral selectivity of layers. By this way, more beneficial spectral characteristics can be extracted from hyperspectral data to unmix the fractions of materials.

Lastly, two different fusion configurations that estimate ideal abundance map for each pixel are proposed as DSCNS and DSCNP which use the representations computed from previous layers. The main difference is that DSCNS configuration estimates more sparse abundance maps while DSCNP yields more probabilistic results due to their architecture variations.
2 HyperSpectral Unmixing with Spectral Convolutions
In this section, we initially formulate the problem to clarify the understandability of the proposed method. Later, we provide theoretical explanations/observations of the modifications that are introduced throughout this paper. Lastly, the details of the method and related information about the architecture are explained.
2.1 Preliminary
Formulation. Let be a pixel of hyperspectral data that is mixed by constituent materials with various fractions . Generally, two steps should be defined as quantifying of abundance maps and reconstruction of a pixel in order to extract abundance maps and endmembers respectively from data. Here,
is basically equal to vector multiplications as in Eq. 1. On the otherhand, abundance maps
are estimated by as follows:(2) 
where is the set of trainable parameters to obtain optimum abundance map estimates. Remark that we singly focus on to improve step throughout this paper by using precomputed endmembers .
Impact of Spectral Convolution. As explained in detail [13], after the elimination of bias terms, fullyconnected linear layer is a simple affine transformation which projects data to a more separable space to ease the estimation process. However, as previously discussed for feature hashing/indexing [12, 7], when the dimensionality of data increases, irregularity of data leads to holes which hardens to realize an unsupervised method for the problem. A straightforward solution is to use supervised data to learn a more robust projection to fill these holes [12]. Another solution is to divide data into several overlapping/nonoverlapping parts to increase the representation capacity per element [7].
In particular, convolution layer shares similar objective as in the second approach (i.e. small/local parts) [10] and it extracts discriminative responses which indicate the local characteristics of data. Note that we use only 1D convolutions per pixel to identify the spectral characteristics of data, not their spatial information.
Moreover, since respectively less number of trainable parameters are learned in the convolution layers compared to fully linear operations, a deeper architecture can be promoted in our method. Ultimately, it boosts the discriminative power of the representation by extracting a sequence of abstracts from lowerlevel to higher ones as explained [10, 6].
Spectral Normalization. Practically, combination of ReLU with batch normalization enables a network to select the sparse outputs of an affine transformation (i.e. the responses of fully linear / convolution) based on batch characteristics of data. However, this is not completely practical to reveal the latent spectral characteristics of a pixel.
For this purpose, we utilize spectral normalization with ReLU activation after each convolution layer. Intuitively, spectral normalization normalizes the responses of convolution by regarding their responses for spectral values along with the batch characteristics. By this way, each layer (combination of convolution, spectral norm and ReLU) computes the most representative spectral responses about data while preserving batch characteristic of data. Most relevant normalization type in literature can be seen in [18] used for style transfer.
Root Mean Square Error (RMSE) ()  
VCA  DMaxD  SCM  DgSNMF  EndNet  EndNetSPU  EndNetDSCNS  EndNetDSCNP  
#1  17.686.2  17.030.0  23.870.0  11.660.2  10.120.6  8.240.4  6.041.2  5.770.2 
#2  13.451.9  21.340.0  13.300.0  4.130.0  11.480.8  6.170.3  3.960.7  4.520.3 
#3  38.937.9  14.340.0  28.470.0  11.130.3  9.530.3  8.980.2  8.931.5  13.070.4 
#4  29.134.2  11.210.0  19.870.0  5.680.1  12.290.4  8.550.1  9.311.3  14.360.4 
Avg.  24.805.1  15.980.0  23.040.0  8.150.2  10.850.6  7.960.3  7.060.9  9.430.3 

RMSE results on Jasper Ridge dataset. Mean and standard deviation are reported. Best results are shown in bold.
Fusion Layer. As indicated [13, 14], combination of ReLU and batch normalization yields robustness for abundance estimates to obtain sparse abundance maps and endmembers. However, for finer abundance values, probabilistic/distancebased approaches might be practical for several datasets as in [4, 5].
For this purpose, throughout the paper, we introduce two difference fusion configurations. If we explain the purpose of fusion layer in detail, the aim is to fuse the responses of spectral convolutions from the previous layers to estimate the true abundance maps perpixel at the final layer. Note that instead of feeding highdimensional data directly to a fullyconnected linear layer as in
[13, 14], the proposed method reduces the dimension of data iteratively while enriching the representation capacity of the input with sparse transformations (i.e. convolutions) for both configurations.First configuration, i.e. DSCNS, computes the ideal abundance values for each pixel with the combination of fullyconnected linear, batch norm, ReLU and l1norm layers by taking the hidden representation computed from the spectral convolutions as an input. Note that due to joint usage of ReLU and batch normalization, it is expected that the estimated abundance maps are quite sparse.
Second configuration, i.e. DSCNP, consists of fullyconnected linear and softmax activation layers. The softmax activation function response is as follows:
(3) 
where is the outputs of fullyconnected linear layer. Due to the architecture, this configuration yields probabilistic results and finer abundance maps for highlymixtured scenes. Lastly, both of these configuration layers is the final layer of .
2.2 Deep Spectral Convolution Network
Architecture. First, a pixel is filtered by two consecutive spectral convolution blocks. Note that the number of blocks and inner structure can still be tuned for different datasets (i.e. deeper networks). Each block consists of spectral normalization and ReLU activation layers after a spectral convolution. To reduce the dimensionality of responses, maxpool layer is exploited at each block. To this end, these blocks ultimately behave like a feature extractor.
At the third block, batch normalization and ReLU are utilized with spectral convolution to determine the responses based on only their batch characteristics. This implicitly corresponds to the mutual distribution of data as in [4, 5]. This is critical since the final convolution block reduces the depth size regarding to the overall data batch characteristic. Lastly, fusion layer (i.e. either DSCNS or DSCNP) is used to compute abundance maps for a pixel.
Learning. For parameter optimization, we use the loss function that is proposed in [13] and the parameters are updated by backpropagation scheme. This full loss function is written as:
(4) 
where , and are set to 10, 0.4 and respectively.
is the KullbackLeibler divergence term and
is the normalized SAD score between the original and reconstructed version of signatures [13].Note that finetuning of precomputed endmember is not allowed during the training. Moreover, unlike [13]
, denoising autoencoder scheme is not used for the method, since our aim is to obtain actual/finer abundance values rather than coarse estimation of constituent endmembers from data.
Adam stochastic optimizer [8] is used with the previously explained parameter settings [13]
. The number of iteration is set to 5K and the parameters are randomly initialized. Codes are implemented on Python by extensively leveraging Tensorflow framework.
3 Experiments
3.1 Datasets, Evaluation Metric and Baselines
Root Mean Square Error (RMSE) ()  
VCA  DMaxD  SCM  DgSNMF  EndNet  EndNetSPU  EndNetDSCNS  EndNetDSCNP  
#1  42.147.2  30.680.0  32.790.0  13.180.1  13.040.3  10.410.2  12.851.3  9.870.2 
#2  48.465.6  47.260.0  36.250.0  12.950.0  14.430.3  12.240.3  13.671.5  12.080.3 
#3  17.183.7  26.710.0  32.610.0  9.570.1  8.710.5  8.350.3  8.530.7  7.540.1 
#4  16.942.1  19.490.0  32.860.0  6.270.0  7.590.2  5.920.1  7.030.9  6.650.1 
Avg.  31.184.7  31.040.0  33.590.0  10.490.1  10.940.4  9.230.2  10.521.1  9.040.2 

To make fair and realistic comparisons, we evaluate the proposed method on two real datasets, namely Jasper Ridge [22] and Urban [22], which are extensively used in literature. Briefly, for Jasper Ridge dataset, the spectral and spatial resolutions are 198 and 100 100, respectively. There are four main materials: Tree (#1), Water (#2), Soil (#3) and Road (#4). The spatial resolution of Urban dataset is 307 307 and its spectral resolution is 162. Similarly, it has four constituent materials in the scene: Asphalt (#1), Grass (#2), Tree (#3) and Roof (#4). For reliable assessments, tests are repeated 20 times for each method, thus mean and standard deviation of the results are reported.
Furthermore, we compare the performance of the method with several baseline algorithms such as VCA [11], DMaxD [4], SCM [21], DgSNMF [22], EndNet [13] and EndNetSPU [13]. For a performance metric, we utilize Root Mean Square Error (RMSE) to measure the error between estimated abundance maps and ground truth. To preserve nonlinearity for VCA and DMaxD abundance estimates, we compute their endmember estimates with Multilinear Mixing Model (MLM) [5] throughout the experiments. Lastly, the proposed unmixing method (i.e. either DSCNS or DSCNP) aims to improve the abundance map results of endmembers estimated by EndNet [13] which recently achieves the stateoftheart performance in literature. You can find further detail about EndNet from [13].
3.2 Experimental Results
Jasper Ridge. Experimental results for this dataset are illustrated in Table 1. From these results, the sparse version of the proposed method (i.e. DSCNS) achieves the best overall accuracy. It approximately introduces 1% improvements to the second best result which is obtained by EndNetSPU combination. As identified [13], Soil (#3) and Road (#4) materials are highly correlated and it is only practical to quantify the fractions with supervised data or spatial reasoning as in DgSNMF method. For Tree (#1) and Water (#2) materials in particular, the proposed method nearly attains ideal abundance performance for the materials.
In addition, Fig. 1 shows the quantitative results of the method (DSCNS) for each materials (i.e. each row). Perceptually impressive results are obtained especially for Soil and Water. Similarly, the error concentrates at the boundaries of waterground as well as roadsoil intersections.
However, there is an issue for DSCNS that the variances in the accuracies are a bit high while DSCNP obtains more stable results. The main drawback arises primarily due to the lack of parameter convergence to a global solution for every initialization. We believe that this issue can be reduced by the detail experiments on the network architecture.
Urban. Table 2 shows the experimental results on Urban dataset. The probabilistic configuration of the proposed method, DSCNP, obtains the best overall results with small improvements over EndNetSPU.
Note that the actual abundance maps of data is quite dense (i.e. highlymixtured), thus the probabilistic version can be more appropriate for this case. The experiment results also support this assumption by yielding better performance. Lastly, the variation in the scores is still an issue for DSCNS while DSCNP generates more consistent results.
4 Conclusion
In this paper, we propose a deep spectral convolution network to unmix hyperspectral data with precomputed endmembers. Throughout the paper, we introduce three critical contributions for the unmixing problem. First, instead of a single layer fullyconnected linear operation, a network that is composed of several spectral convolution layers with a deeper architecture is proposed. Later, we present a novel spectral normalization layer that is able to normalize responses of filters to improve the selectivity of layers. Lastly, we introduce two configurations for the fusion of the responses of previous layers and the computation of abundance maps. The experimental results validate that the proposed method in this paper obtains the new stateoftheart performance on two real datasets.
5 Acknowledgments
We gratefully acknowledge the support of NVIDIA Corporation with the donation of Quadro P5000 used for this research.
References
 [1] C. M. Bachmann, T. L. Ainsworth, and R. A. Fusina, “Improved manifold coordinate representations of largescale hyperspectral scenes,” IEEE TGRS, pp. 2786–2803, 2006.

[2]
V. F. Haertel and Y. E. Shimabukuro, “Spectral linear mixing model in low spatial resolution image data,”
IEEE TGRS, pp. 2555–2562, 2005.  [3] J. C. Harsanyi and C.I. Chang, “Hyperspectral image classification and dimensionality reduction: An orthogonal subspace projection approach,” IEEE TGRS, pp. 779–785, 1994.
 [4] R. Heylen, D. Burazerovic, and P. Scheunders, “Nonlinear spectral unmixing by geodesic simplex volume maximization,” IEEE JSTSP, pp. 534–542, 2011.
 [5] R. Heylen, P. Scheunders, A. Rangarajan, and P. Gader, “Nonlinear unmixing by using different metrics in a linear unmixing chain,” IEEE JSTARS, pp. 2655–2664, 2015.
 [6] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks.” in Science, 2006, pp. 504–507.
 [7] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search.” IEEE PAMI, pp. 117–128, 2011.
 [8] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [9] F. Kizel, M. Shoshany, N. S. Netanyahu, G. EvenTzur, and J. A. Benediktsson, “A stepwise analytical projected gradient descent search for hyperspectral unmixing and its code vectorization,” IEEE TGRS, 2017.

[10]
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in
NIPS, 2012, pp. 1097–1105.  [11] J. M. Nascimento and J. M. Dias, “Vertex component analysis: A fast algorithm to unmix hyperspectral data,” IEEE TGRS, pp. 898–910, 2005.
 [12] M. Norouzi and D. Blei, “Minimal loss hashing for compact binary codes.” ICML, pp. 353–360, 2011.
 [13] S. Ozkan, B. Kaya, and G. B. Akar, “Endnet: Sparse autoencoder network for endmember extraction and hyperspectral unmixing.” arXiv preprint arXiv:1708.01894, 2017.
 [14] F. Palsson, J. Sigurdsson, J. Sveinsson, and M. Ulfarsson, “Neural network hyperspectral unmixing with spectral information divergence objective,” IGARSS, 2017.

[15]
B. Pan, Z. Shi, and X. Xu, “Rvcanet: A new deeplearningbased hyperspectral image classification method.”
IEEE JSTAR, pp. 1975–1986.  [16] J. Plaza, A. Plaza, R. Pérez, and P. Martinez, “Joint linear/nonlinear spectral unmixing of hyperspectral image data,” in IGARSS. IEEE, 2007, pp. 4037–4040.
 [17] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, pp. 2323–2326, 2000.
 [18] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization.” in arXiv preprint arXiv:1701.02096, 2016.
 [19] F.Y. Wang, C.Y. Chi, T.H. Chan, and Y. Wang, “Nonnegative leastcorrelated component analysis for separation of dependent sources by volume maximization,” IEEE PAMI, pp. 875–888, 2010.
 [20] M. E. Winter, “Nfindr: An algorithm for fast autonomous spectral endmember determination in hyperspectral data,” in SPIE, 1999, pp. 266–275.
 [21] Y. Zhou, A. Rangarajan, and P. D. Gader, “A spatial compositional model for linear unmixing and endmember uncertainty estimation,” IEEE TIP, pp. 5987–6002, 2016.
 [22] F. Zhu, Y. Wang, B. Fan, S. Xiang, G. Meng, and C. Pan, “Spectral unmixing via dataguided sparsity,” IEEE TIP, pp. 5412–5427, 2014.
Comments
There are no comments yet.