I Introduction
The recent advances in optics and photonics have stimulated the deployment of hyperspectral imaging sensors of high spatial and spectral resolution. These sensors are now placed on satellite, unmanned aerial vehicle, and ground acquisition platforms and are used for material, object and terrain detection and classification [1]. Although high spatial and spectral resolution improves classification accuracy, it also imposes several research challenges deriving from the so-called "curse of dimensionality", i.e., the difficulties arising when we need to analyze and organize data in high-dimensional spaces. Although hyperspectral data are employed in a wide variety of applications, such as agriculture, surveillance, astronomy and biomedical imaging [2], they have their own unique characteristics: i) high dimensionality, ii) a limited number of labeled samples and iii) large spatial variability of spectral signatures
[3]. Most existing works on hyperspectral image classification follow the conventional pattern recognition workflow, which consists of two separate steps. First, features are extracted from the raw data, creating labeled training datasets. Second, classifiers, linear or nonlinear, such as Support Vector Machines (SVMs) and Neural Networks (NNs) [4], are used to map the extracted features to the target (desired) outputs. The key problem, however, in applying such conventional processes to high-dimensional hyperspectral data is that a large number of labeled training samples is required to model the statistical input diversity and, consequently, to train the classifier well. In remote sensing applications, collecting a large number of labeled data is an expensive and time-consuming process. Another drawback is that classifiers are often used as "black boxes" [5]. This means that there is no direct interpretation of how spatial and spectral bands contribute to the final classification outcome. One way to address issues deriving from the high dimensionality and heterogeneity of the data is to employ statistical learning methods [6]. However, even in this case, the problem of extracting a set of appropriate features remains. Features significantly affect the classification outcome, and for data lying in high-dimensional spaces it is an arduous task to estimate a suitable set of discriminant features so as to increase the accuracy of the classifier.
For this reason, deep learning paradigms have recently been investigated for classifying hyperspectral data [7, 8, 9, 10]. Deep learning machines receive as inputs the raw sensory data instead of features. Then, they nonlinearly transform the raw inputs into hierarchies of representations which are subsequently used as "the most suitable features" in a supervised mapping phase. Thus, deep learning tackles feature-related issues. This is also supported by current research outcomes [11, 12, 13, 14] indicating the superior performance of deep learning machines in accurately detecting various objects in hyperspectral imaging data. Examples include the detection of man-made constructions as opposed to natural ones [15, 16], vehicle detection [17], object tracking [18], land cover mapping [19] and critical infrastructure assessment [20]. However, a typical deep learning architecture contains a huge number of tunable parameters, implying that a large number of samples is also needed to accurately train the network. In addition, deep learning processes present high computational complexity.
Tensor-based machine learning models are promising alternatives for hyperspectral data classification [21, 22]. In particular, Zhou et al. [21] and Tan et al. [22] have introduced a linear model using tensor-based regression, with applications in neuroimaging data analysis [21] and classification [22]. These approaches are considered the first works to discuss the statistical inference procedure for general raw tensor regression. In conventional learning models, the inputs are usually vector data. Therefore, in the case of multidimensional input arrays, tensor vectorization is carried out first. However, vectorization destroys the inherent spatial and spectral structure of the input, which can offer a physical interpretation of how spatial information and spectral bands contribute to the classification outcome. Furthermore, tensor vectorization fails to address issues stemming from the high dimensionality of the data, since again a large number of tunable parameters is required. To handle these limitations, we need to consider the input data as tensors, keeping the spatial and spectral structure of the data, and then, using principles of tensor algebra, to find ways to reduce the number of parameters that need to be estimated during training.

I-A Our Contribution
In this work we propose a new machine learning model which receives as inputs multidimensional tensor signals, such as hyperspectral images, i.e., 3D tensor cubes. The model weight parameters satisfy the rank-1 canonical decomposition property, meaning that the weight parameters are decomposed as a linear combination of a minimal number of possibly non-orthogonal rank-1 terms [23]. Using such a decomposition of the model weights, we are able to significantly reduce the number of parameters required to train the classifier. Thus, a smaller labeled dataset is needed than in conventional learning approaches, where the tensor inputs are first vectorized.
Another advantage of the rank-1 canonical decomposition property for the model weights is that it retains the structure of the spatial and spectral band information, which is a very important aspect of hyperspectral data classification. This is because it permits the extraction of valuable information regarding the contribution of each of the hyperspectral bands to the classification. Thus, the proposed canonical decomposition provides a physical interpretation of the classification outcome, i.e., of how the location of the pixels (spatial information) and the spectral bands (spectral information) influence the final classification performance.
Our work is motivated by [21], which proposes a single-linear-output tensor regression model for binary classification. In contrast to [21], in this paper a multiclass classification problem is investigated using tensor-based logistic regression models with multiple outputs. In addition, the rank-1 canonical decomposition property is also applied, apart from high-order linear classifiers, to nonlinear ones, which is not a straightforward process. The proposed high-order nonlinear model relies on a modification of a feedforward neural network (FNN), while retaining the universal approximation principle: the capability of the network to approximate any unknown function, under some continuity assumptions, to any degree of accuracy. The main difference is that the model weights satisfy the rank-1 canonical decomposition property. Therefore, the number of parameters (and consequently the number of training samples) is significantly reduced, especially for cases where tensor inputs are considered. A new learning algorithm is also introduced to train the network without spoiling the canonical decomposition assumption. We call the proposed high-order nonlinear model the rank-1 FNN.
To sum up, the main contribution of this paper is threefold. First, we introduce new learning models, linear and nonlinear, that consider a rank-1 canonical decomposition of their weight parameters. This way, a considerably smaller number of weights is required compared to conventional learning paradigms. In turn, a smaller number of samples is needed to train the model, which is in line with remote sensing applications, where a limited number of samples is available. The new models are suitable for tensor input data of high dimensions. Second, the rank-1 canonical decomposition property allows for a physical interpretation of how each spatial and spectral band of the tensor inputs affects the classification outcome. This is an important attribute in analyzing hyperspectral data. Third, the introduction of a rank-1 FNN nonlinear classifier allows for the modelling of complex relationships, due to the universal approximation property of neural nets [24], while simultaneously keeping the aforementioned advantages.
The rest of this paper is organized as follows: Section II presents the problem formulation, as well as the notation and tensor algebra operations that will be used throughout the paper. Sections III and IV present the development of the high-order linear and nonlinear models. Experimental evaluation of the developed models is presented in Section V, and Section VI concludes this work.
II Problem Formulation and Tensor Algebra Notation
II-A Problem Formulation
Let us denote as $\mathbf{X}_i$ the $i$-th patch of a hyperspectral image that we would like to classify into one of $C$ available classes. Each class expresses, for example, the type of vegetation or soil properties of the patch $\mathbf{X}_i$. In this paper, we consider $\mathbf{X}_i$ as a 3D tensor, i.e., $\mathbf{X}_i \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where variables $I_1$ and $I_2$ refer to the spatial dimensions of the hyperspectral patch and $I_3$ to the number of spectral bands. In a more general case, $\mathbf{X}_i$ can be seen as a $D$-dimensional tensor, that is, $\mathbf{X}_i \in \mathbb{R}^{I_1 \times \cdots \times I_D}$.

Let us also denote as $p_j = f_{\mathbf{w}_j}(\mathbf{X})$, with $j = 1, \ldots, C$, a relationship (linear or nonlinear) that expresses the probability of the observation $\mathbf{X}$ to belong to the $j$-th class. The subscript $\mathbf{w}_j$ indicates dependence of the relationship on weight parameters. Aggregating the values $p_j$ over all classes, we form a classification vector, say $\mathbf{p} = [p_1, \ldots, p_C]^T$. Then, the maximum probability value over all classes indicates the class to which the hyperspectral image patch belongs,

  $\hat{c} = \arg\max_{j=1,\ldots,C} p_j$.   (1)

The values of the weights are estimated using machine learning algorithms. For this reason, a training dataset consisting of $N$ samples is considered,

  $\mathcal{D} = \{(\mathbf{X}_i, \mathbf{y}_i)\}_{i=1}^{N}$,   (2)

where $\mathbf{y}_i$ is a $C$-dimensional vector, the elements of which are all zero except for one, which equals unity, indicating the class to which $\mathbf{X}_i$ belongs. That is, $y_{i,j} \in \{0, 1\}$ and $\sum_{j=1}^{C} y_{i,j} = 1$. In the following, we omit the subscript $i$ for simplicity when we refer to an input sample.
Multidimensional arrays are also known as tensors. Since tensors are a key concept of the proposed high-order learning models (linear and nonlinear), in the following we present some basic notations and definitions regarding tensor algebra that will be used throughout this work.
II-B Tensor Algebra Notations and Definitions
In this paper, tensors are denoted by bold uppercase letters, vectors by bold lowercase letters, and scalars by lowercase letters.
Tensor vectorization. The operator $\mathrm{vec}(\cdot)$ stacks the entries of a $D$-dimensional tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times \cdots \times I_D}$ into a column vector. Specifically, the entry $x_{i_1, \ldots, i_D}$ maps to the $l$-th entry of $\mathrm{vec}(\mathbf{X})$, in which $l = 1 + \sum_{d=1}^{D} (i_d - 1) \prod_{d'=1}^{d-1} I_{d'}$.
Tensor products. Given two matrices $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{B} \in \mathbb{R}^{K \times L}$, the following products can be defined.

Kronecker product. The Kronecker product $\mathbf{A} \otimes \mathbf{B}$ is the $IK \times JL$ matrix

  $\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & \cdots & a_{1J}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{I1}\mathbf{B} & \cdots & a_{IJ}\mathbf{B} \end{bmatrix}$.

Khatri-Rao product. The Khatri-Rao product $\mathbf{A} \odot \mathbf{B}$ is defined as the column-wise Kronecker product, i.e., $\mathbf{A} \odot \mathbf{B} = [\mathbf{a}_1 \otimes \mathbf{b}_1, \ldots, \mathbf{a}_R \otimes \mathbf{b}_R]$, if $\mathbf{A}$ and $\mathbf{B}$ have the same number of columns $R$.

Outer product. The outer product $\mathbf{a}^{(1)} \circ \cdots \circ \mathbf{a}^{(D)}$ of the vectors $\mathbf{a}^{(d)} \in \mathbb{R}^{I_d}$ forms a $D$-dimensional tensor whose element at position $(i_1, \ldots, i_D)$ equals $\prod_{d=1}^{D} a^{(d)}_{i_d}$.
Tensor matricization. The mode-$d$ matricization, $\mathbf{X}_{(d)} \in \mathbb{R}^{I_d \times \prod_{d' \neq d} I_{d'}}$, maps a tensor $\mathbf{X}$ to a matrix by arranging the mode-$d$ fibers to be the columns of the resulting matrix. In particular, the element $x_{i_1, \ldots, i_D}$ maps to the element $(i_d, l)$ of $\mathbf{X}_{(d)}$, where $l$ indexes the remaining dimensions in column-major order.
Rank-R decomposition. A tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times \cdots \times I_D}$ admits a rank-$R$ decomposition if $\mathbf{X} = \sum_{r=1}^{R} \mathbf{a}^{(1)}_r \circ \cdots \circ \mathbf{a}^{(D)}_r$, where the symbol $\circ$ represents the vector outer product and $\mathbf{a}^{(d)}_r \in \mathbb{R}^{I_d}$, $d = 1, \ldots, D$, $r = 1, \ldots, R$. The decomposition can be represented by the factor matrices $\mathbf{A}^{(d)} = [\mathbf{a}^{(d)}_1, \ldots, \mathbf{a}^{(d)}_R] \in \mathbb{R}^{I_d \times R}$, $d = 1, \ldots, D$. When a tensor admits a rank-$R$ decomposition, the following results hold:

  $\mathbf{X}_{(d)} = \mathbf{A}^{(d)} \left( \mathbf{A}^{(D)} \odot \cdots \odot \mathbf{A}^{(d+1)} \odot \mathbf{A}^{(d-1)} \odot \cdots \odot \mathbf{A}^{(1)} \right)^T$   (3)

and

  $\mathrm{vec}(\mathbf{X}) = \left( \mathbf{A}^{(D)} \odot \cdots \odot \mathbf{A}^{(1)} \right) \mathbf{1}_R$,   (4)

where $\mathbf{1}_R$ is a vector of ones. For more information regarding tensor algebra please refer to [25].
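To make these operations concrete, the following sketch implements mode-$d$ matricization, the Khatri-Rao product and rank-$R$ reconstruction in NumPy for the 3D case; the helper names (`unfold`, `khatri_rao`, `cp_to_tensor`) and the specific index-ordering convention are our own assumptions, not notation from the paper.

```python
import numpy as np

def unfold(X, mode):
    # Mode-d matricization: mode-d fibers become the columns
    # (one common index-ordering convention).
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(A, B):
    # Column-wise Kronecker product of matrices with R columns each.
    assert A.shape[1] == B.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_to_tensor(A1, A2, A3):
    # Rank-R tensor as a sum of R outer products of the factor columns.
    return np.einsum('ir,jr,kr->ijk', A1, A2, A3)

rng = np.random.default_rng(0)
A1, A2, A3 = (rng.standard_normal((n, 2)) for n in (3, 4, 5))
X = cp_to_tensor(A1, A2, A3)

# Eq. (4): vec(X) = (A3 ⊙ A2 ⊙ A1) 1_R under column-major vectorization.
vec_X = X.flatten(order='F')
assert np.allclose(vec_X, khatri_rao(khatri_rao(A3, A2), A1).sum(axis=1))
```

The final assertion verifies the vectorization identity of Eq. (4) numerically for a small rank-2 tensor.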
III High-Order Linear Modelling
III-A Vector Logistic Regression
Let us first assume for simplicity that the input samples are of vector form. We denote these input vectors as $\mathbf{x} \in \mathbb{R}^{M}$, where $M$ expresses the vector dimension. Using the logistic regression framework [26], one can model the probability function as

  $P(c_j = 1 \mid \mathbf{x}) = \dfrac{\exp(\mathbf{w}_j^T \mathbf{x})}{\sum_{l=1}^{C} \exp(\mathbf{w}_l^T \mathbf{x})}$,   (5)

where $\mathbf{w}_j \in \mathbb{R}^{M}$ with $j = 1, \ldots, C$ stands for the weights with respect to the $j$-th class. The matrix $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_C]$ includes all the weights involved in the model. The rational meaning of the weight parameters is that they express the degree of confidence that the vector input belongs to the $j$-th of the $C$ available classes. In addition, the elements of $\mathbf{w}_j$ express the degree of significance of each element of the input vector with respect to the $j$-th class.

One simple way to extend Eq. (5) in case the inputs are tensors, i.e., $\mathbf{X} \in \mathbb{R}^{I_1 \times \cdots \times I_D}$, is to take their vectorized forms (see the respective text on tensor vectorization in Section II). The main limitations of such an approach are that (i) a large number of parameters needs to be estimated, particularly $C \prod_{d=1}^{D} I_d$, and (ii) vectorization spoils the spatial structure of the tensor inputs, i.e., the fact that pixels belonging to a neighboring region frequently present similar properties (spatial coherency). A large number of parameters also implies that a large number of labeled observations is needed in order to successfully complete the training procedure. However, the available labeled samples are usually limited due to the manual effort required to collect and annotate them.
III-B Matrix Logistic Regression
In the case of matrix input observations, $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2}$, one can reduce the number of model parameters by taking into consideration concepts of [27], applied to electroencephalogram data classification. Then, the logistic regression model is given by

  $P(c_j = 1 \mid \mathbf{X}) = \dfrac{\exp(\mathbf{a}_j^T \mathbf{X} \mathbf{b}_j)}{\sum_{l=1}^{C} \exp(\mathbf{a}_l^T \mathbf{X} \mathbf{b}_l)}$,   (6)

where $\mathbf{a}_j \in \mathbb{R}^{I_1}$ and $\mathbf{b}_j \in \mathbb{R}^{I_2}$. We also define $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_C]$ and $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_C]$ as the aggregate model parameters.

The key advantage of Eq. (6) is that the number of required parameters is $C(I_1 + I_2)$, instead of the $C I_1 I_2$ that would be needed if one vectorized the matrix observation input data $\mathbf{X}$. This means that the weight parameters $\mathbf{w}_j$ of relation (6) for the $j$-th class are rank-1 canonically decomposed into the weights $\mathbf{a}_j$ and $\mathbf{b}_j$, that is,

  $\mathbf{w}_j = \mathbf{b}_j \otimes \mathbf{a}_j$,   (7)

where the operator $\otimes$ is the Kronecker product as defined in Section II.
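The identity behind Eq. (7) can be checked numerically: under column-major vectorization, the bilinear form $\mathbf{a}^T \mathbf{X} \mathbf{b}$ equals the inner product of $\mathrm{vec}(\mathbf{X})$ with $\mathbf{b} \otimes \mathbf{a}$. The snippet below is a minimal sketch with arbitrary (assumed) dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2 = 4, 6
X = rng.standard_normal((I1, I2))   # matrix observation
a = rng.standard_normal(I1)         # row-mode weights
b = rng.standard_normal(I2)         # column-mode weights

bilinear = a @ X @ b                             # a^T X b
vec_form = X.flatten(order='F') @ np.kron(b, a)  # vec(X)^T (b ⊗ a)
assert np.allclose(bilinear, vec_form)

# Parameter count per class: I1 + I2 weights instead of I1 * I2.
assert (I1 + I2, I1 * I2) == (10, 24)
```

The decomposed form thus stores 10 numbers per class instead of 24 in this toy setting, without changing the value of the decision function.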
III-C Tensor-based Logistic Regression
The aforementioned concept can be extended to the case of tensor inputs, $\mathbf{X} \in \mathbb{R}^{I_1 \times \cdots \times I_D}$ [21]. In this way, we have that

  $P(c_j = 1 \mid \mathbf{X}) = \dfrac{\exp\left( \langle \mathbf{X}, \, \mathbf{w}^{(1)}_j \circ \cdots \circ \mathbf{w}^{(D)}_j \rangle \right)}{\sum_{l=1}^{C} \exp\left( \langle \mathbf{X}, \, \mathbf{w}^{(1)}_l \circ \cdots \circ \mathbf{w}^{(D)}_l \rangle \right)}$,   (9)

where $\mathbf{w}^{(d)}_j \in \mathbb{R}^{I_d}$ with $d = 1, \ldots, D$ is the $d$-th rank-1 canonical factor of the weight parameters for the $j$-th class. This property is referred to as CANDECOMP, stemming from CANonical DECOMPosition, or PARAFAC, stemming from PARAllel FACtors, in the literature [28, 29]. That is,

  $\mathcal{W}_j = \mathbf{w}^{(1)}_j \circ \cdots \circ \mathbf{w}^{(D)}_j$.   (10)

Eq. (9) can be seen as an expression of the Khatri-Rao product, denoted by the operator $\odot$ (see Section II), which is the column-wise Kronecker product of the rank-1 canonical decomposition weight parameters,

  $P(c_j = 1 \mid \mathbf{X}) = \dfrac{\exp\left( \mathrm{vec}(\mathbf{X})^T \left( \mathbf{w}^{(D)}_j \odot \cdots \odot \mathbf{w}^{(1)}_j \right) \right)}{\sum_{l=1}^{C} \exp\left( \mathrm{vec}(\mathbf{X})^T \left( \mathbf{w}^{(D)}_l \odot \cdots \odot \mathbf{w}^{(1)}_l \right) \right)}$.   (11)

In Eqs. (9) and (11), we can aggregate the total weight parameters as

  $\mathbf{W}^{(d)} = [\mathbf{w}^{(d)}_1, \ldots, \mathbf{w}^{(d)}_C]$,   (12)

with $d = 1, \ldots, D$. In other words, the matrix $\mathbf{W}^{(d)} \in \mathbb{R}^{I_d \times C}$ contains all the weight parameters related with the $d$-th dimension of the tensor over all classes.
The representation of Eqs. (9) and (11) significantly reduces the number of model parameters needed for classifying the tensor inputs $\mathbf{X}$. In particular, the vector-based logistic regression model derived through vectorization of the tensor requires the estimation of $C \prod_{d=1}^{D} I_d$ parameters, while the tensor-based representation of (9) and (11) reduces this number to $C \sum_{d=1}^{D} I_d$.

The advantages of the aforementioned representation for the classification of hyperspectral images are twofold. First, as we reduce the number of weight parameters, a smaller number of labeled data samples is required to train the logistic regression model. This is an important factor in developing classifiers that generalize better to unseen hyperspectral data. Usually, the manual effort for annotation is laborious, and therefore only a small number of labeled training data is available. Second, the rank-1 canonical decomposition of the model weights provides a physical interpretation of how the spatial and spectral information of the hyperspectral tensor input contributes to the classification. In particular, according to the statements made right after Eq. (5), the weights express the degree of importance of the tensor input belonging to the $j$-th class. Since these weights are canonically decomposed into separate weights $\mathbf{w}^{(d)}_j$, each indicating the contribution of tensor dimension $d$ to the $j$-th class, the proposed model provides a quantitative representation of how the elements of each tensor dimension tune the classification performance.

More specifically, in the case of hyperspectral imaging, the dimension of the tensor inputs equals 3, i.e., $D = 3$. The first two dimensions refer to the spatial properties of the pixel data, while the third one refers to the spectral bands. Therefore, the first two decomposed weights, i.e., $\mathbf{w}^{(1)}_j$ and $\mathbf{w}^{(2)}_j$, express how the pixel spatial coherency affects the classification outcome. On the other hand, the third decomposed weight vector $\mathbf{w}^{(3)}_j$ indicates how the spectral bands influence the classification and which of the spectral bands are the most salient. This property of the proposed model is very important for the analysis of hyperspectral data, since it facilitates the interpretation of the results and quantifies the importance of the spectral bands on the classification, compressing the influence of the bands that are of less importance.
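As a concrete illustration of this reduction, consider a $15 \times 15 \times 103$ patch; the patch size here is our own illustrative assumption, not a value taken from the paper.

```python
# Per-class weight counts for a 3D hyperspectral patch.
I1, I2, I3 = 15, 15, 103        # spatial x spatial x spectral (illustrative)

vectorized = I1 * I2 * I3       # full weight tensor, Eq. (5) after vec()
rank1 = I1 + I2 + I3            # rank-1 CP factors, Eqs. (9)-(11)

print(vectorized, rank1)        # prints: 23175 133
```

A two-orders-of-magnitude reduction per class, which directly translates into a smaller labeled training set.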
III-D Estimation of the Model Weights
The model weight parameters are estimated through a training set of $N$ samples, as Eq. (2) indicates. We recall that if $\mathbf{X}_i$ belongs to the $j$-th class, then $\mathbf{y}_i$ is a vector with all elements zero except the $j$-th element, which equals one, i.e., $y_{i,j} \in \{0, 1\}$ and $\sum_{j=1}^{C} y_{i,j} = 1$. By minimizing the negative log-likelihood function,

  $E = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_{i,j} \log P(c_j = 1 \mid \mathbf{X}_i)$,   (13)

the weight parameters can be estimated as

  $\{\hat{\mathbf{W}}^{(1)}, \ldots, \hat{\mathbf{W}}^{(D)}\} = \arg\min_{\mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(D)}} E$.   (14)

In Eq. (14), the matrices $\hat{\mathbf{W}}^{(d)}$ refer to the optimal estimates of the weight parameters, expressing the contribution of the $d$-th dimension of the tensor input to the classification for all available classes.

Based on the statements of Section II, presenting basic notations on tensor algebra (rank-R decomposition and tensor matricization), it holds that

  $\langle \mathbf{X}, \, \mathbf{w}^{(1)}_j \circ \cdots \circ \mathbf{w}^{(D)}_j \rangle = \mathbf{w}^{(d)T}_j \, \mathbf{X}_{(d)} \left( \mathbf{w}^{(D)}_j \odot \cdots \odot \mathbf{w}^{(d+1)}_j \odot \mathbf{w}^{(d-1)}_j \odot \cdots \odot \mathbf{w}^{(1)}_j \right)$.   (15)

The proof of Eq. (15) is given in Appendix A. In Eq. (15), $\mathbf{X}_{(d)}$ denotes the mode-$d$ matricization of the tensor, obtained by keeping the $d$-th dimension intact and concatenating the slices of the rest of the dimensions into a long matrix [25]. We also recall that the operator $\odot$ refers to the Khatri-Rao product, while $\langle \cdot, \cdot \rangle$ refers to the inner product between two tensors.
The left-hand side of Eq. (15) expresses the argument of the exponentials involved in the linear logistic regression model of Eq. (11). Therefore, (11) can be rewritten by taking into account the right-hand side of (15). In the sequel, the generalization of (15) over all available classes is obtained as

(16)

As we can see, Eq. (15) is actually the inner product of two vectors. The first is the weight vector $\mathbf{w}^{(d)}_j$, expressing the contribution of the $d$-th dimension of the tensor input as far as the $j$-th class is concerned. The second vector, on the other hand, is independent of the values of $\mathbf{w}^{(d)}_j$. Therefore, one approach for optimally estimating the weights $\mathbf{w}^{(d)}_j$, and consequently $\mathcal{W}_j$ [see Eq. (10)], is to consider all the weight parameters $\mathbf{w}^{(d')}_j$ with $d' \neq d$ fixed and then solve with respect to $\mathbf{w}^{(d)}_j$. This procedure is iteratively applied for all weight parameters until some termination criteria are met. Similar statements can be made for the matrix representation of Eq. (16).

Actually, the learning algorithm used to optimally estimate the model weights when Eq. (10) holds, i.e., the rank-1 canonical decomposition, simulates regression learning where we use as inputs the transformed data of the right-hand side of Eq. (15), that is,

  $\tilde{\mathbf{x}}^{(d)}_j = \mathbf{X}_{(d)} \left( \mathbf{w}^{(D)}_j \odot \cdots \odot \mathbf{w}^{(d+1)}_j \odot \mathbf{w}^{(d-1)}_j \odot \cdots \odot \mathbf{w}^{(1)}_j \right)$.   (17)

Therefore, the parameter estimation problem can be solved using conventional software such as scikit-learn in Python (http://scikit-learn.org/). The proposed learning procedure, considering the rank-1 canonical decomposition of the weight parameters, is described in Algorithm 1.
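The alternating scheme can be sketched as follows for a rank-1, binary, three-mode case. This is our own minimal illustration of the idea (fix all factors but one, build the transformed inputs, and fit a standard logistic regression for the free factor), not the authors' exact Algorithm 1; all function names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def unfold(X, mode):
    # mode-d matricization (one common index-ordering convention)
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fit_rank1_logreg(Xs, y, n_iter=10, seed=0):
    """Alternating estimation of rank-1 weights w1 o w2 o w3 for binary
    logistic regression on a batch of 3D tensors Xs (N x I1 x I2 x I3)."""
    rng = np.random.default_rng(seed)
    w = [rng.standard_normal(d) for d in Xs.shape[1:]]
    for _ in range(n_iter):
        for m in range(3):
            rest = [w[k] for k in range(3) if k != m]
            v = np.kron(rest[0], rest[1])          # fixed factors (cf. Eq. (17))
            Z = np.stack([unfold(X, m) @ v for X in Xs])
            clf = LogisticRegression(fit_intercept=False).fit(Z, y)
            w[m] = clf.coef_.ravel()               # update the free factor
    return w
```

On synthetic data whose labels are generated by a true rank-1 weight tensor, a few alternating sweeps typically recover a well-separating set of factors.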
Although the high-order linear model can provide physically interpretable results due to its structure, it is restricted to producing decision boundaries that are linear in the input space. This implies that the model is not able to cope with more complicated problems, where nonlinear decision boundaries are necessary to provide classification results of high accuracy. This motivates us to extend the previous linear regression model to a nonlinear one. The proposed nonlinear model should again assume a rank-1 canonical decomposition of the weight parameters in order to retain the aforementioned advantages in hyperspectral classification.

IV High-Order Nonlinear Modelling
The proposed high-order nonlinear model is based on the concepts of the previously discussed linear logistic regression, with the difference that a nonlinear transformation is applied to the input data. In other words, the probability of an input observation belonging to one of the available classes is nonlinearly interwoven with the input tensor data and the weight parameters that determine the importance of these data for classification performance, through a function $g(\cdot)$, i.e.,

  $P(c_j = 1 \mid \mathbf{X}) = g(\mathbf{X}; \mathbf{w}_j)$.   (18)

The main difficulty in implementing Eq. (18) is that the function $g(\cdot)$ is actually unknown. One way to parameterize the unknown function is to exploit the principles of the universal approximation theorem, which states that a function, under some continuity assumptions, can be approximated by a feedforward neural network with a finite number of neurons to any degree of accuracy [30]. Feedforward neural networks have proven to be a reliable framework for image classification [31]. However, the main difficulty in applying a feedforward neural network (FNN) to hyperspectral classification problems is twofold. First, a large number of weight parameters needs to be learned, proportional to $Q \prod_{d=1}^{D} I_d$, where $Q$ refers to the number of hidden neurons of the network. This outcome derives from the structure of the network, as briefly described below (see Section IV-A). This, in turn, implies that a large number of labeled samples is needed to successfully train the network. Second, the weights of the network are not directly related to the physical properties of the hyperspectral data and how these properties affect the classification performance, since the inputs are vectorized and thus do not preserve their structure. Networks are often treated as "black boxes" when applied to the classification of hyperspectral data. To overcome these problems, we propose a modification of conventional feedforward neural network structures so that the network weights from the input to the hidden layer satisfy the rank-1 canonical decomposition, according to the statements of the previous section. In addition, we introduce a learning algorithm to train the so-called rank-1 canonical decomposition feedforward neural network (rank-1 FNN). Before proposing the rank-1 FNN, we briefly describe how $g(\cdot)$ is modelled through an FNN.
IV-A Feed-Forward Neural Network Modelling
Fig. 1 illustrates a feedforward neural network nonlinearly approximating the probability $P(c_j = 1 \mid \mathbf{X})$. Initially, the tensor input $\mathbf{X}$ is vectorized, creating a vector $\mathbf{x} = \mathrm{vec}(\mathbf{X})$ of size $\prod_{d=1}^{D} I_d$. The network is assumed to have $Q$ hidden neurons. Each neuron is associated with an activation function $f(\cdot)$. In this paper, the sigmoid function $f(z) = 1 / (1 + e^{-a z})$ is selected, where the factor $a$ regulates the steepness of the function. The activation function of the $q$-th out of $Q$ hidden neurons receives as input the inner product of $\mathbf{x}$ and a weight vector $\mathbf{w}^{(q)}$ associated with the $q$-th neuron, and produces as output a scalar given by [32]

  $u_q = f\left( \langle \mathbf{x}, \mathbf{w}^{(q)} \rangle \right)$,   (19)

where we recall that the operator $\langle \cdot, \cdot \rangle$ expresses the inner product. It should be mentioned that in the current notation the superscript $(q)$ of the weights refers to the $q$-th hidden neuron. Gathering the responses of all hidden neurons in one vector $\mathbf{u} = [u_1, \ldots, u_Q]^T$, we have that

  $\mathbf{u} = f\left( \mathbf{W}_h^T \mathbf{x} \right)$,   (20)

where $\mathbf{W}_h = [\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(Q)}]$ is a matrix containing the weights of all hidden neurons and $f(\cdot)$ is applied element-wise. Thus, the output of the network is given as

  $P(c_j = 1 \mid \mathbf{X}) = s_j\left( \mathbf{V} \mathbf{u} \right)$,   (21)

where $s_j(\cdot)$ stands for the softmax function, $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_C]^T$ for the weights between the hidden and the output layer, and the subscript $j$ of the weights for the $j$-th class. The softmax function corresponds to the following conditional probability and is defined as follows for a class $j$:

  $s_j(\mathbf{V}\mathbf{u}) = \dfrac{\exp(\mathbf{v}_j^T \mathbf{u})}{\sum_{l=1}^{C} \exp(\mathbf{v}_l^T \mathbf{u})}$.   (22)
IV-B Rank-1 Canonical Decomposition Feedforward Neural Networks (Rank-1 FNN)
To reduce the number of parameters of the network and to relate the classification results to the spatial and spectral properties of the hyperspectral input data, we rank-1 canonically decompose the weight parameters as in Eq. (10). We stress that the vector $\mathbf{w}^{(d,q)}$ relates the input tensor data with the $q$-th hidden neuron, and therefore $\mathbf{w}^{(q)} = \mathbf{w}^{(D,q)} \otimes \cdots \otimes \mathbf{w}^{(1,q)}$.

Then, taking into account the properties of Eq. (15), the output of the $q$-th hidden neuron can be written as

  $u_q = f\left( \mathbf{w}^{(d,q)T} \, \tilde{\mathbf{x}}^{(d,q)} \right)$.   (23)

In Eq. (23), the vector $\tilde{\mathbf{x}}^{(d,q)}$ is a transformed version of the tensor input $\mathbf{X}$ and is independent of $\mathbf{w}^{(d,q)}$. That is,

  $\tilde{\mathbf{x}}^{(d,q)} = \mathbf{X}_{(d)} \left( \mathbf{w}^{(D,q)} \odot \cdots \odot \mathbf{w}^{(d+1,q)} \odot \mathbf{w}^{(d-1,q)} \odot \cdots \odot \mathbf{w}^{(1,q)} \right)$,   (24)

where we recall that $\mathbf{X}_{(d)}$ is the mode-$d$ matricization of $\mathbf{X}$.

Eq. (23) actually resembles the operation of a single perceptron with inputs the weights $\mathbf{w}^{(d,q)}$ and the transformed version of the input data. That is, if the rank-1 canonically decomposed weights $\mathbf{w}^{(d',q)}$ with $d' \neq d$ are known, then $\tilde{\mathbf{x}}^{(d,q)}$ will also be known. Fig. 2 shows the structure of the proposed rank-1 FNN. The model consists of one input layer, one hidden layer and one output layer. The main modification of this structure compared to a conventional FNN is in the hidden layer. More specifically, the weights of a hidden neuron are first decomposed into canonical factors, each of which expresses the spatial or spectral band contribution to the classification performance. In this figure, we have shaded the weight vector $\mathbf{w}^{(d,q)}$, since all the remaining weight vectors are used to estimate the transformed input $\tilde{\mathbf{x}}^{(d,q)}$.

IV-C The Learning Algorithm
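A minimal forward pass of such a network can be sketched as follows; the function and parameter names are our own assumptions, and the einsum contraction realizes the rank-1 inner products of Eq. (23) for all hidden neurons at once, followed by the softmax output of Eqs. (21)-(22).

```python
import numpy as np

def rank1_fnn_forward(X, W1, W2, W3, V, a=1.0):
    """Rank-1 FNN forward pass sketch for a 3D input X (I1 x I2 x I3).
    W1, W2, W3: CP factors per hidden neuron, shapes (Q, I1), (Q, I2), (Q, I3).
    V: hidden-to-output weights, shape (C, Q). Returns class probabilities."""
    # Hidden activations: sigmoid of the rank-1 inner product <X, w1 o w2 o w3>
    s = np.einsum('ijk,qi,qj,qk->q', X, W1, W2, W3)
    u = 1.0 / (1.0 + np.exp(-a * s))
    # Softmax output layer (shifted for numerical stability)
    logits = V @ u
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

Note that the full vectorized weight $\mathbf{w}^{(q)}$ is never materialized; only the three short factor vectors per neuron are stored.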
To train the proposed rank-1 FNN model, the set $\mathcal{D}$ containing $N$ samples is used. The learning algorithm minimizes the negative log-likelihood [see relation (13)] with respect to the network responses $P(c_j = 1 \mid \mathbf{X}_i)$, with $j = 1, \ldots, C$, and the targets $\mathbf{y}_i$ over all training samples.

If a conventional neural network training algorithm is used, the estimated weights do not satisfy the rank-1 canonical decomposition assumption expressed by Eq. (10). For this reason, in the proposed learning algorithm we initially fix all the weights $\mathbf{w}^{(d',q)}$ with $d' \neq d$. This way, we are able to estimate the transformed input vector $\tilde{\mathbf{x}}^{(d,q)}$. Therefore, the only unknown parameter of the hidden layer is the vector $\mathbf{w}^{(d,q)}$. This can be derived through a gradient-based optimization algorithm, assuming that the derivative of the error with respect to $\mathbf{w}^{(d,q)}$ is known. Then, the weights are updated in the negative direction of the partial derivative.

One way to estimate the partial derivative is to exploit the principles of the backpropagation algorithm, which actually implements the chain rule for estimating the derivative of the error with respect to the network weights. In particular, using the backpropagation algorithm, we compute the partial derivatives

  $\dfrac{\partial E}{\partial \mathbf{w}^{(d,q)}} = \dfrac{\partial E}{\partial u_q} \, \dfrac{\partial u_q}{\partial \mathbf{w}^{(d,q)}}$.   (25)

The partial derivative can be estimated if we assume all the weights $\mathbf{w}^{(d',q)}$ with $d' \neq d$ fixed, since in this case we can estimate the vector $\tilde{\mathbf{x}}^{(d,q)}$. Therefore, the estimation of the parameters of the rank-1 FNN is obtained by iteratively solving with respect to one of the canonically decomposed weight vectors, assuming the remaining ones fixed. Algorithm 2 presents the main steps of the proposed algorithm, applied for the calculation of the hidden layer weights of the network.
V Experimental Results on Hyperspectral Data
A natural question that arises is whether such a reduction in the number of parameters limits the descriptive power of the models in (9) and (21). To answer this question, we present quantitative results regarding classification accuracy. Furthermore, we compare the performance of the linear high-order model (tensor-based logistic regression) against two other well-known linear models, vector-based logistic regression and linear SVMs, and the performance of the high-order nonlinear model against fully connected feedforward neural networks (FCFFNN), nonlinear SVMs and two deep learning approaches for classifying hyperspectral data: an approach based on stacked autoencoders [11] and an approach based on convolutional neural networks [12]. These methods are the current state of the art in the literature.

The model in (9) is linear in the feature space, and the estimated parameters can be used to estimate the importance of the spectral bands. In particular, the most important feature elements should have the highest, in absolute value, weight coefficients, while feature elements uncorrelated with the output variables should have coefficient values close to zero. This way, we can reduce the dimensionality of the raw data, keeping the feature elements with the highest weight coefficients.
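This band-ranking step can be sketched as follows; the matrix layout of the spectral-mode factors and the helper name `top_bands` are our own assumptions, not the paper's notation.

```python
import numpy as np

def top_bands(W3, k=10):
    """Rank spectral bands by the magnitude of the spectral-mode weights.
    W3: (I3, C) spectral factors over C classes. Returns the k band
    indices with the largest absolute coefficients."""
    saliency = np.abs(W3).max(axis=1)     # strongest weight over any class
    return np.argsort(saliency)[::-1][:k]
```

Aggregating by the maximum over classes keeps a band that is decisive for even a single class; averaging over classes would be an equally defensible choice.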
V-A Datasets
In our study, we used hyperspectral datasets acquired by the AVIRIS and ROSIS sensors. (The code for the rank-1 FNN is available at https://bit.ly/2GRMd85.) In particular, we used i) the Indian Pines dataset, which depicts a test site in Northwestern Indiana and consists of 224 spectral reflectance bands, and ii) the Pavia University dataset, which has 103 spectral bands.

A hyperspectral image is represented as a 3D tensor of dimensions $H \times W \times I_3$, where $H$ and $W$ correspond to the height and width of the image and $I_3$ to the spectral bands. In order to classify a pixel at location $(x, y)$ on the image plane and successfully fuse spectral and spatial information, we use a square patch of size $s \times s$ centered at pixel $(x, y)$. Let us denote as $l_{x,y}$ the class label of the pixel at location $(x, y)$ and as $\mathbf{X}_{x,y}$ the tensor patch centered at pixel $(x, y)$. Then, we can form a dataset $\mathcal{D} = \{(\mathbf{X}_{x,y}, l_{x,y})\}$ over the labeled pixels. Each one of the patches is also a 3D tensor, of dimension $s \times s \times I_3$, which contains spectral and spatial information for the pixel located at $(x, y)$. The dataset $\mathcal{D}$ is used to train the classifiers. The Pavia University and the Indian Pines datasets contain and labeled pixels respectively.
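The patch-dataset construction described above can be sketched in a few lines; the padding mode, the convention that label 0 means "unlabeled" (common for these benchmarks), and the function name are our own assumptions.

```python
import numpy as np

def extract_patches(cube, labels, s=5):
    """Build patches of size s x s x bands centered at every labeled pixel.
    cube: (H, W, B) hyperspectral image; labels: (H, W) ground truth,
    where 0 is assumed to mean 'unlabeled'. Returns (patches, classes)."""
    r = s // 2
    # Reflect-pad the borders so edge pixels also get full-size patches.
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode='reflect')
    X, y = [], []
    for i, j in zip(*np.nonzero(labels)):
        X.append(padded[i:i + s, j:j + s, :])   # window centered at (i, j)
        y.append(labels[i, j])
    return np.stack(X), np.array(y)
```

Each returned patch is a small 3D tensor that can be fed directly to the tensor-based models without vectorization.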
V-B Classification Accuracy of the Tensor-based Logistic Regression
The performance of the tensor-based logistic regression method is evaluated in four different experiments, each with a different number of training samples, to assess classification accuracy even in cases where a small number of samples is used. The evaluation is compared against vector-based logistic regression and linear SVMs. For each of the four experiments, we use as training dataset a subset of $\mathcal{D}$ that contains , , and samples from each class respectively. The remaining samples are used as test data.

The same training data are used to train all models. The vector-based logistic regression as well as the linear SVMs are trained using the vectorized versions of the patches that belong to the training datasets. The tensor-based logistic regression model is trained using the same patches without applying any transformation to them, and thus keeps their spatial structure intact.

Figure 3 presents the classification accuracy, on the test set, for all tested methods. In both datasets and in all cases, the tensor-based model outperforms linear SVMs and vector-based logistic regression, despite the fact that it employs the smallest number of parameters. These results quantitatively answer the question regarding the capacity of the model in (9) under a small-sample setting.
V-C Tensor-based Dimensionality Reduction
In the following, we evaluate the quality of the dimensionality reduction that can be conducted by the tensor-based logistic regression model. Towards this direction, we utilize the model presented in [12], where a deep Convolutional Neural Network (CNN) is used for classifying hyperspectral data. The authors reduce the dimensionality of the data along the spectral dimension by applying principal components analysis (PCA) and by utilizing the principal components that preserve at least of the total dataset variance.

In this work, we utilize the same CNN structure, but we reduce the dimensionality of the raw data by selecting the most significant spectral bands, i.e., the spectral bands to which the tensor-based logistic regression model assigns the largest, in terms of absolute value, coefficients. Because this method does not take the variation of the estimation into account, we first normalized the data so as to suppress the effect of variance. The results, in terms of classification accuracy on the test set, using the same CNN for both dimensionality reduction methods, are presented in Table I. Each experiment has been executed over 10 different runs, and the standard deviation across all runs is also depicted in Table I. In this table, we denote as TB-CNN the tensor-based dimensionality reduction followed by a CNN classifier, and as PCA-CNN the PCA dimensionality reduction followed again by a CNN model. We conducted three experiments, using 10%, 20% and 40% of the datasets as training samples.

Pavia University
Splitting ratio    10%           20%           40%
TB-CNN             97.33 ± 0.5   98.41 ± 0.3   99.31 ± 0.2
PCA-CNN            97.58 ± 0.3   98.53 ± 0.3   99.42 ± 0.1

Indian Pines
Splitting ratio    10%           20%           40%
TB-CNN             82.29 ± 1.1   90.09 ± 0.7   95.98 ± 0.6
PCA-CNN            84.34 ± 0.9   91.72 ± 0.7   96.37 ± 0.5
PCA-CNN performs slightly better than TB-CNN. However, our proposed method has an advantage over PCA: although PCA maps the raw data to a lower dimensional feature space, the resulting features are not interpretable. With the tensor-based dimensionality reduction, the features in the lower dimensional space correspond to the most significant spectral bands, giving a physical meaning to how spectral information affects classification performance.
Furthermore, we compare the features extracted by the proposed tensor-based logistic regression model (high-order linear modelling, Section III) with the features extracted using the Recursive Feature Elimination (RFE) technique [33]. RFE is an iterative procedure: at each iteration it trains a model using all available spectral bands and then discards the least informative ones, until a predefined number of features is reached. Fig. 4 presents the 10 and 20 most important spectral bands selected by both techniques (the tensor-based one, denoted TB in the figure, and RFE). In this figure, the common bands detected by both techniques are also identified for clarity. The experiments have been conducted on the Indian Pines and Pavia University datasets. To verify the quality of the selected spectral bands, we measure the classification accuracy obtained when only these bands are kept. Specifically, the bands extracted by both techniques are used to train the CNN model presented in [12]. Please note that the training set used for both feature selection techniques (TB and RFE) is the same and was built by selecting 100 random samples per class. The results are presented in Table III. In addition, Fig. 5 shows the confusion matrices for the Pavia dataset when 10 or 20 salient bands are selected by the TB and RFE approaches. The overall classification accuracy of the two dimensionality reduction techniques is almost the same, indicating that the proposed TB method is appropriate for dimensionality reduction in addition to its classification capabilities (see Section V-B).

DR + Classifier | OA (%)
TDLA + FCFFNN | 71.22
TDLA + SVM | 89.54
TB + FCFFNN (10 bands) | 64.53 ± 1.4
TB + SVM (10 bands) | 68.14 ± 0.6
TB + FCFFNN (15 bands) | 73.28 ± 0.8
TB + SVM (15 bands) | 73.77 ± 0.7
TB + FCFFNN (20 bands) | 74.62 ± 0.7
TB + SVM (20 bands) | 73.95 ± 0.6
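The RFE loop described above (train, rank, discard) can be sketched with a simple linear stand-in model; the least-squares fit and all shapes here are illustrative assumptions, not the exact estimator of [33]:

```python
import numpy as np

def rfe_select(X, y, n_keep):
    """Iteratively drop the spectral band whose linear weight is smallest in
    absolute value, until n_keep bands remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # fit a simple linear model on the surviving bands
        w, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # discard the least informative band
        remaining.pop(int(np.argmin(np.abs(w))))
    return remaining

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))              # 200 pixels x 30 spectral bands
y = (X[:, 3] + 0.1 * rng.normal(size=200) > 0).astype(float)  # toy labels
selected = rfe_select(X, y, n_keep=10)      # keeps 10 bands
```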
The features extracted by the tensor-based logistic regression model are also compared against the features selected by Tensor Discriminative Locality Alignment (TDLA) [34], a state-of-the-art method for tensor data dimensionality reduction (see Table II). At this point, we should mention that TDLA has been designed for dimensionality reduction, while our method is a tensor-based classification framework; feature selection is an interesting side effect of our methodology.
Again, the quality of the dimensionality reduction is measured through classification accuracy. In this experiment we use different classification models, instead of the CNN adopted previously, to indicate the robustness of the proposed method in selecting salient spectral bands under different classification frameworks. In particular, the experiments have been conducted using an FCFFNN and an SVM classifier, which receive as inputs the salient bands selected by the TB and TDLA approaches. Three experiments, selecting the 10, 15 and 20 most important spectral bands, are conducted. Each experiment has been executed over 10 different runs, and the standard deviation across runs is reported to indicate the robustness of the classification accuracy. As observed, the best overall classification accuracy is achieved by TDLA with an SVM classifier; however, our approach gives better results when nonlinear models (such as the FCFFNN) are adopted.
The conclusion of the aforementioned experiments is that, although the proposed method has been designed for classification (see the respective experiments of Section V-D), it also performs quite well for data dimensionality reduction, compared with state-of-the-art methods specifically designed for this purpose.
Feature selection | Number of bands | Accuracy (%) Pavia University | Accuracy (%) Indian Pines
Tensor-based | 10 | 92.82 | 77.48
RFE | 10 | 92.61 | 83.42
Tensor-based | 20 | 93.02 | 77.46
RFE | 20 | 93.25 | 83.56
V-D High-Order Nonlinear Classification Model
For evaluating the performance of the high-order nonlinear classification model, that is, the rank-1 FNN, we again use the Pavia University and Indian Pines datasets. A procedure similar to that of Section V-B is followed to form the training sets. We conduct experiments using training sets of 50, 100, 150 and 200 random samples per class, respectively. A similar training set selection was made in [35] for the Indian Pines and Pavia University datasets. Please note that if, for a specific class, the number of available samples is less than the number required for the experiment, we randomly select 50% of the total available class samples. Fig. 6 depicts the effect of the number of training samples on classification accuracy for different window sizes, assuming a constant number of neurons in the hidden layer; particularly, we set Q = 100 for the Pavia dataset and Q = 75 for the Indian Pines. These values are derived from the analysis below (see also Fig. 7). Additionally, Fig. 7 presents the effect of the number of training samples on the misclassification error for different numbers of hidden neurons on both datasets.
Then, we evaluate the performance of the proposed tensor-based high-order nonlinear classifier in relation to the size of the spatial window around a pixel. We conduct experiments with window sizes equal to 3, 5, 7, 9 and 11, as depicted in Figure 6. These results suggest that the best overall classification accuracy is achieved for a 5×5 window, i.e., the classification accuracy is mostly affected by the 24 closest neighbors of a pixel. For a 3×3 window the spatial information is not adequate for achieving highly accurate classification results, while for windows larger than 5×5 the neighbourhood region is very likely to contain pixels that belong to different classes, thus deteriorating the classification accuracy.
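The role of the spatial window can be illustrated with a simple patch-extraction helper; border handling (reflect padding) and shapes are assumptions of the sketch:

```python
import numpy as np

def extract_patch(cube, row, col, s=5):
    """Return the s x s spectral patch centered on (row, col); borders are
    reflect-padded so edge pixels also get full patches."""
    r = s // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + s, col:col + s, :]

cube = np.arange(10 * 10 * 3, dtype=float).reshape(10, 10, 3)
patch = extract_patch(cube, 4, 4, s=5)      # 5x5 window: center + 24 neighbors
assert patch.shape == (5, 5, 3)
assert patch[2, 2, 0] == cube[4, 4, 0]      # center pixel is preserved
```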
Moreover, we evaluate the performance of the high-order nonlinear classifier in relation to its complexity, i.e., the value of the parameter Q indicating the number of hidden neurons. Particularly, we set Q equal to 50, 75, 100 and 125, respectively. By increasing the value of Q, the complexity and the capacity of the model also increase. The results of this evaluation are presented in Fig. 7.
Regarding the Indian Pines dataset, we observe that the model with Q = 75 outperforms all other models. When the training set size is very small, i.e., 50 samples per class, the model with Q = 50 does not have the capacity to capture the statistical relation between the input and the output and thus underfits the data. On the other hand, the models with Q = 100 and Q = 125, due to their higher complexity, slightly overfit the data. As the training set grows, the misclassification error of all models decreases, because a larger training set carries more information for estimating the coefficients of the models.
As far as the Pavia University dataset is concerned, we observe that the model with Q = 100 outperforms all other models when the training set is larger than 50 samples per class. When the training set contains only 50 samples per class, a less complex model performs best: the more complex models overfit the data, while the smallest model underfits them. When the size of the training set increases, the classification accuracy of all models also increases.
In the following, we compare the performance of the proposed rank-1 FNN against an FCFFNN, a Radial Basis Function SVM (RBF-SVM), and two deep learning approaches that have been proposed for classifying hyperspectral data: the first is based on Stacked Autoencoders (SAE) [11], while the second exploits Convolutional Neural Networks (CNN) [12]. The FCFFNN receives as input the vectorized version of the rank-1 FNN inputs; the same applies to the RBF-SVM. Moreover, the FCFFNN has one hidden layer with the same number of hidden neurons as the rank-1 FNN, namely 75 for the Indian Pines dataset and 100 for the Pavia University dataset (as derived from Fig. 7). The SAE and CNN were developed as proposed in the original papers. The architecture of the network that exploits Stacked Autoencoders consists of three hidden layers, each containing 10% fewer neurons than its input; we gradually reduce the number of hidden neurons from one layer to the next so that the network cannot learn the identity function during pretraining. Regarding the CNN, we utilize exactly the same architecture as in [12]. The performance of all these models is evaluated on training sets of varying size, containing 50, 100, 150 and 200 samples from each of the available classes.

Pavia University
Samples per class | 50 | 100 | 150 | 200
rank-1 FNN (Q=100) | 89.95 ± 0.7 | 93.50 ± 0.4 | 93.89 ± 0.3 | 95.11 ± 0.3
FCFFNN | 67.79 ± 3.6 | 76.53 ± 2.7 | 78.48 ± 2.2 | 82.59 ± 2.1
RBF-SVM | 86.98 ± 1.6 | 88.99 ± 1.3 | 89.86 ± 1.2 | 91.82 ± 1.2
SAE | 86.54 ± 2.1 | 91.90 ± 1.8 | 92.38 ± 1.4 | 93.29 ± 1.1
CNN | 88.89 ± 0.6 | 92.74 ± 0.3 | 94.68 ± 0.2 | 95.89 ± 0.2

Indian Pines
Samples per class | 50 | 100 | 150 | 200
rank-1 FNN (Q=75) | 85.20 ± 1.2 | 91.63 ± 0.8 | 92.82 ± 0.5 | 94.15 ± 0.4
FCFFNN | 73.88 ± 4.2 | 81.10 ± 3.9 | 84.14 ± 2.7 | 85.86 ± 2.5
RBF-SVM | 73.18 ± 2.3 | 77.86 ± 2.0 | 82.11 ± 1.5 | 84.99 ± 1.5
SAE | 65.51 ± 4.4 | 70.66 ± 3.6 | 74.03 ± 3.6 | 76.49 ± 3.2
CNN | 82.43 ± 0.9 | 85.48 ± 0.7 | 92.28 ± 0.2 | 94.81 ± 0.2
Indian Pines
Method | STM | STM-MPCA | Rank-1 FNN
OA | 62.6 ± 2.6 | 80.6 ± 1.9 | 73.9 ± 1.4

Pavia University
Method | STM | STM-MPCA | Rank-1 FNN
OA | 79.5 ± 1.4 | 89.4 ± 0.5 | 88.2 ± 0.8
Table IV presents the outcome of this comparison. In this table, we also depict the standard deviation across 10 different experiment runs to better characterize the classification accuracy. When the training set size is small, our approach outperforms all other models. This stems from the fact that the proposed high-order nonlinear model exploits tensor algebra operations to reduce the number of coefficients that must be estimated during training, while at the same time retaining the spatial information of the input. Although the FCFFNN utilizes the same number of hidden neurons as our proposed model, it tends to overfit small training sets, because it employs a larger number of coefficients. The RBF-SVM performs better than the FCFFNN on the Pavia University dataset, but slightly worse on the Indian Pines dataset. The deep learning architecture based on SAE is essentially an FCFFNN with three hidden layers that employs an unsupervised pretraining phase to initialize its coefficients. The full connectivity of this architecture implies very high complexity, which is responsible for its poor performance, due to overfitting, on the Indian Pines dataset. Finally, the deep learning architecture based on the CNN performs better than the FCFFNN, the RBF-SVM and the SAE, mainly due to its sparse connectivity (low complexity) and the fact that it can inherently exploit the spatial information of the input. When the training set consists of 150 and 200 samples per class for the Pavia University dataset, and 200 samples per class for the Indian Pines dataset, the CNN-based deep learning architecture outperforms even the proposed high-order nonlinear classification model.
This happens because the CNN-based deep learning model has a higher capacity than the proposed model, and is thus capable of better capturing the statistical relation between the input and the output when the training set contains sufficient information. However, when the size of the training set is small, our proposed model, due to its lower complexity, outperforms the CNN-based deep learning model. Fig. 8 and Fig. 10 depict the classification results for all tested models when the training set contains 50 and 100 samples per class (white labeled pixels correspond to misclassified samples), while Fig. 9 presents the convergence curves for the rank-1 FNN.
Finally, we compare our proposed method against the Support Tensor Machine (STM) proposed in [36], using the Indian Pines and Pavia University datasets. The work in [36] evaluates the performance of the STM using both raw data and data of reduced dimensionality derived through Multilinear Principal Component Analysis (MPCA) [37]. The results of this comparison are presented in Table V, together with the standard deviation of the results. As observed, the proposed rank-1 FNN outperforms the STM model when raw hyperspectral data are used. STM-MPCA, instead, works slightly better than our approach, mainly because better features are used as its inputs.
VI Conclusions
In this work, we present a linear and a nonlinear tensor-based scheme for hyperspectral data classification and analysis. The advantages of the presented models are that i) they reduce the number of parameters required for classification and thus the respective number of training samples, ii) they provide a physical interpretation of how the model coefficients affect the classification output and iii) they retain the spatial and spectral coherency of the input samples. By utilizing tensor algebra operations, we introduce learning algorithms to train the tensor-based classifiers (linear and nonlinear). The linear classifier is based on a modification of a logistic regression model, while the nonlinear one on a modification of a feedforward neural network. Both proposed models assume a rank-1 canonical decomposition of the weight parameters. For this reason, we properly modified existing learning schemes so that training preserves the rank-1 canonical decomposition property. We call the new nonlinear classifier the rank-1 FNN.
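The rank-1 weight structure mentioned above can be illustrated as follows; the shapes are illustrative, and the sketch only shows how the rank-1 factors reproduce the full inner product at a fraction of the parameter count:

```python
import numpy as np

s, B = 5, 103                                # patch side and number of bands
rng = np.random.default_rng(0)
w_rows, w_cols = rng.normal(size=s), rng.normal(size=s)   # spatial factors
w_bands = rng.normal(size=B)                              # spectral factor

# Full weight tensor implied by the rank-1 factors (built here only to
# verify the factorized computation; the models never store it).
W = np.einsum("i,j,k->ijk", w_rows, w_cols, w_bands)

# The linear response <W, X> to a patch factorizes into cheap contractions,
# with s + s + B parameters instead of s * s * B.
X = rng.normal(size=(s, s, B))
resp_full = float(np.sum(W * X))
resp_fact = float(w_rows @ (X @ w_bands) @ w_cols)
assert np.isclose(resp_full, resp_fact)
```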
We have evaluated the performance of the presented models in terms of classification accuracy and dimensionality reduction. As far as dimensionality reduction is concerned, the tensor-based scheme allows selection of the most discriminative spectral bands. It produces an interpretable dimensionality reduction, which can be used with more complicated classifiers without deteriorating their performance. In terms of classification, the linear classifier outperforms conventional linear models such as logistic regression and SVM.
However, the linear classifier has low capacity, since it produces classification rules that are linear in the input space. For this reason, we have introduced in this paper nonlinear classification models whose weights satisfy the rank-1 canonical decomposition property, together with suitable learning algorithms to train them. The performance of the nonlinear model was compared against other nonlinear classifiers, including state-of-the-art deep learning classifiers. The main finding is that, when the size of the training set is small, our newly proposed model presents superior performance over the compared methods.
As future work, one can evaluate the performance of the proposed tensor-based classifier on fused datasets that contain both hyperspectral and LiDAR data [38]. In this case, we need to investigate the extraction of new features that better model how LiDAR data relate to hyperspectral particularities [39].
Appendix A
A-A Proof of relation (15)
Proof: For simplicity, we omit the superscripts of . Let us denote as the tensor having the same dimensions as the tensor . Then,
We also have that
for any . Furthermore, from relation (3) of the main manuscript, we have
(A.1) 
Now, by using (A.1) and the property of inner and dot products, which states that for any three matrices , , and , the following holds, we conclude that
The proof is completed.
References
 [1] E. Wycoff, T.-H. Chan, K. Jia, W.-K. Ma, and Y. Ma, “A non-negative sparse promoting algorithm for high resolution hyperspectral imaging,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 1409–1413.
 [2] C.I. Chang, Hyperspectral data processing: algorithm design and analysis. John Wiley & Sons, 2013.
 [3] G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 6, pp. 1351–1362, 2005.
 [4] ——, Kernel methods for remote sensing data analysis. John Wiley & Sons, 2009.
 [5] J. M. Benítez, J. L. Castro, and I. Requena, “Are artificial neural networks black boxes?” IEEE Transactions on neural networks, vol. 8, no. 5, pp. 1156–1164, 1997.
 [6] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. Atli Benediktsson, “Advances in hyperspectral image classification: Earth monitoring with statistical learning methods,” Signal Processing Magazine, IEEE, vol. 31, no. 1, pp. 45–54, 2014.
 [7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [8] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
 [9] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
 [10] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle et al., “Greedy layerwise training of deep networks,” Advances in neural information processing systems, vol. 19, p. 153, 2007.
 [11] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learningbased classification of hyperspectral data,” IEEE Journal of Selected topics in applied earth observations and remote sensing, vol. 7, no. 6, pp. 2094–2107, 2014.

 [12] K. Makantasis, K. Karantzalos, A. Doulamis, and N. Doulamis, “Deep supervised learning for hyperspectral data classification through convolutional neural networks,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2015), July 2015.
 [13] M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, “Building detection in very high resolution multispectral data with deep learning features,” in Geoscience and Remote Sensing Symposium (IGARSS), 2015 IEEE International. IEEE, 2015, pp. 1873–1876.
 [14] M. Papadomanolaki, M. Vakalopoulou, S. Zagoruyko, and K. Karantzalos, “Benchmarking deep learning frameworks for the classification of very high resolution satellite multispectral data,” ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, vol. 3, no. 7, 2016.
 [15] V. Mnih and G. E. Hinton, “Learning to label aerial images from noisy data,” in Proceedings of the 29th International Conference on Machine Learning (ICML12), 2012, pp. 567–574.
 [16] K. Makantasis, K. Karantzalos, A. Doulamis, and K. Loupos, “Deep learningbased manmade object detection from hyperspectral data,” in International Symposium on Visual Computing. Springer, 2015, pp. 717–727.
 [17] G. B. Orr and K.-R. Müller, Neural networks: tricks of the trade. Springer, 2003.
 [18] Z. Kandylakis, K. Karantzalos, A. Doulamis, and N. Doulamis, “Multiple object tracking with background estimation in hyperspectral video sequences,” in IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), 2015.
 [19] N. Kussul, A. Shelestov, M. Lavreniuk, I. Butko, and S. Skakun, “Deep learning approach for large scale land cover mapping based on remote sensing data fusion,” in Geoscience and Remote Sensing Symposium (IGARSS), 2016 IEEE International. IEEE, 2016, pp. 198–201.
 [20] K. Makantasis, E. Protopapadakis, A. Doulamis, N. Doulamis, and C. Loupos, “Deep convolutional neural networks for efficient vision based tunnel inspection,” in Intelligent Computer Communication and Processing (ICCP), 2015 IEEE International Conference on. IEEE, 2015, pp. 335–342.
 [21] H. Zhou, L. Li, and H. Zhu, “Tensor regression with applications in neuroimaging data analysis,” Journal of the American Statistical Association, vol. 108, no. 502, pp. 540–552, 2013.
 [22] X. Tan, Y. Zhang, S. Tang, J. Shao, F. Wu, and Y. Zhuang, “Logistic tensor regression for classification,” in International Conference on Intelligent Science and Intelligent Data Engineering. Springer, 2012, pp. 573–581.
 [23] L. De Lathauwer, B. De Moor, and J. Vandewalle, “Computation of the canonical decomposition by means of a simultaneous generalized schur decomposition,” SIAM journal on Matrix Analysis and Applications, vol. 26, no. 2, pp. 295–327, 2004.
 [24] B. C. Csáji, “Approximation with artificial neural networks,” Faculty of Sciences, Eötvös Loránd University, Hungary, vol. 24, p. 48, 2001.
 [25] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
 [26] C. M. Bishop, Pattern recognition and machine learning. springer, 2006.
 [27] H. Hung and C.C. Wang, “Matrix variate logistic regression model with application to eeg data,” Biostatistics, vol. 14, no. 1, pp. 189–202, 2013.
 [28] R. A. Harshman, “Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multimodal factor analysis,” 1970.
 [29] J. D. Carroll and J.-J. Chang, “Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition,” Psychometrika, vol. 35, no. 3, pp. 283–319, 1970.
 [30] A. D. Doulamis, N. D. Doulamis, and S. D. Kollias, “Online retrainable neural networks: improving the performance of neural networks in image analysis problems,” IEEE Transactions on Neural Networks, vol. 11, no. 1, pp. 137–155, 2000.
 [31] A. Doulamis, N. Doulamis, K. Ntalianis, and S. Kollias, “An efficient fully unsupervised video object segmentation scheme using an adaptive neuralnetwork classifier architecture,” IEEE Transactions on Neural Networks, vol. 14, no. 3, pp. 616–630, 2003.
 [32] A. D. Doulamis, N. D. Doulamis, and S. D. Kollias, “An adaptable neuralnetwork model for recursive nonlinear traffic prediction and modeling of mpeg video sources,” IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 150–166, 2003.
 [33] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
 [34] L. Zhang, L. Zhang, D. Tao, and X. Huang, “Tensor discriminative locality alignment for hyperspectral image spectral–spatial feature extraction,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 1, pp. 242–256, 2013.
 [35] J. Li, X. Huang, P. Gamba, J. BioucasDias, L. Zhang, J. Benediktsson, and A. Plaza, “Multiple feature learning for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 3, pp. 1592–1606, 2015.
 [36] X. Guo, X. Huang, L. Zhang, L. Zhang, A. Plaza, and J. A. Benediktsson, “Support tensor machines for classification of hyperspectral remote sensing imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 6, pp. 3248–3264, 2016.
 [37] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “MPCA: Multilinear principal component analysis of tensor objects,” IEEE Transactions on Neural Networks, vol. 19, no. 1, pp. 18–39, 2008.
 [38] F. Pacifici, Q. Du, and S. Prasad, “Report on the 2013 ieee grss data fusion contest: Fusion of hyperspectral and lidar data [technical committees],” IEEE Transactions on Geoscience and Remote Sensing, vol. 1, no. 3, pp. 136–38, 2013.
 [39] E. Maltezos, N. Doulamis, A. Doulamis, and C. Ioannidis, “Deep convolutional neural networks for building extraction from orthoimages and dense image matching point clouds,” Journal of Applied Remote Sensing, vol. 11, no. 4, 2017.