1 Introduction
Recent advances in sensing technologies have stimulated the development and deployment of sensors that can generate large amounts of highorder data. Interdependencies between information from different data modalities can improve the performance of data classification techniques [1]. However, exploitation of highorder data raises new research challenges mainly due to the high dimensionality of the acquired information and, depending on the application at hand, the limited number of labeled examples [2].
Tensor subspace learning methods, such as HOSVD, Tucker decomposition and CANDECOMP [3], MPCA [4] and probabilistic decompositions [5, 6, 7] have been proposed to tackle the dimensionality problem. These methods project the raw data to a lower dimensional space, in which the projected data can be considered as highly descriptive features of the raw information. The key problem in applying such methods in classifying highorder data, is that they do not take into consideration the data labels; therefore, the resulting features may not be sufficiently discriminative with respect to the classification task. Tensorbased classifiers capable of mapping highorder data to desired outputs have also been proposed [8, 9, 10, 11, 12]. However, these methods are restricted to producing linear decision boundaries in feature space, and are therefore unable to cope with complex problems, where nonlinear decision boundaries are necessary to obtain classification results of high accuracy. In order to better disentangle the inputoutput statistical relationships, deep learning approaches [13, 14] have been investigated for highorder data classification [15, 16, 17, 18]. Nevertheless, a typical deep learning architecture contains a huge number of tunable parameters, implying that a large number of labeled samples is also needed for accurate training.
The present work draws its inspiration from [9], which proposes a linear tensor regression model for binary classification. In contrast to [9]
, the paper at hand investigates a multiclass classification problem using a nonlinear tensorbased classifier. The proposed classifier is able to (i) handle raw highorder data without vectorizing them, and (ii) produce nonlinear decision boundaries, thus capturing complex statistical relationships between the data. The proposed scheme, henceforth called
Rank1 FNN, is based on a modification of a feedforward neural network (FNN), such that its weights satisfy the rank1 canonical decomposition property, i.e., the weights are decomposed as a linear combination of a minimal number of possibly nonorthogonal rank1 terms [19]. Thence, the number of model parameters, and thus of training samples required, can be significantly reduced. We also introduce a new learning algorithm to train the network without violating the canonical decomposition property.2 Problem Formulation and Tensor algebra Notation
2.1 Problem Formulation
Let us denote as the th order tensor example that we aim at classifying into one of classes. Let us also denote as
the probability of
belonging to the th class. Aggregating the values over all classes, we form a classification vector, , the elements of which . Then, the maximum value over all classes indicates the class to which the belongs. The values ofare estimated by minimizing a loss function over a dataset
during the training phase of a machine learning model. Vector
and its elements are all zero except for one which equals unity indicating the class to which belongs. In the following, we omit subscript for simplicity purposes if we refer to an input sample.2.2 Tensor Algebra Notations and Definitions
In this paper, tensors, vectors and scalars are denoted in bold uppercase, bold lowercase and lowercase letters, respectively. We hereby present some definitions that will be used through out this work.
Tensor vectorization. The operator stacks the entries of a order tensor into a column vector.
Tensor matricization.The moded matricization, , maps a tensor into a matrix by arranging the moded fibers to be the columns of the resulting matrix.
RankR decomposition. A tensor admits a rankR decomposition if , where . The decomposition can be represented by , where . When a tensor admits a rankR decomposition, it holds that:
(1) 
where stands for the KhatriRao product. For more information on tensor algebra see [3].
3 Highorder nonlinear modeling
The proposed Rank1 FNN is based on the concepts of [9]; however, in our case, the probability of an input example belonging to the th class is nonlinearly interwoven with respect to the input tensor data and the weight parameters through a function , i.e., . The main difficulty in implementing is that is actually unknown. One way to parameterize
is to exploit the principles of the universal approximation theorem, stating that a function can be approximated by a FNN with a finite number of neurons within any degree of accuracy.
However, applying a FNN for highorder data classification involves two drawbacks. First, a large number of weights has to be learned; , where refers to the number of hidden neurons. This, in the sequel, implies that a large number of labeled samples are needed to successfully train the network. Second, the weights of the network are not directly related to the physical properties of the information belonging to different modes of the data, since the inputs are vectorized and thus they do not preserve their structure.
To overcome these problems, we propose a modification of FNN so that network weights from the input to the hidden layer satisfy the rank1 canonical decomposition. Before presenting the Rank1 FNN, we briefly describe how is modeled through a FNN.
3.1 FNN Modeling
A FNN, with hidden neurons, nonlinearly approximates the probability
by associating a nonlinear activation function
with each one of its hidden neurons. In this paper, the sigmoid function
is selected. The activation function of the th neuron receives as input the inner product of and a weight vector and produces as output a scalar given by(2) 
Gathering the responses of all hidden neurons in one vector , we have that
(3) 
where is a matrix containing the weights . Thus, the output of the network is given as
(4) 
where stands for the softmax function, the weights between the hidden and the output layer and the superscript for the th class.
3.2 Rank1 FNN Modeling
To reduce the number of parameters of the network and to relate the classification results to the information belonging to different modes of the input data, we rank1 canonically decompose the weight parameters as:
(5) 
Eq. (5) can be seen as an expression of the KhatriRao product, which is the columnwise Kronecker product, denoted as , of the rank1 canonical decomposition weight parameters . Thus, and the total number of Rank1 FNN is . Based on the statements of Section 2.2, it holds that
(6) 
In Eq. (6), denotes the mode matricization of tensor .Then, taking into account the properties of Eq. (6), the output of the th hidden neuron can be written as
(7) 
Vector is a transformed version of input , that is,
(8) 
and is independent from . Eq. (7
) actually resembles the operation of a single perceptron having as inputs the weights
and the transformed version of the input data. In other words, if the rank1 canonically decomposed weights with are known, then will be also known. The main modification of this structure compared to a typical FNN lies in the hidden layer, where the weights of a hidden neuron are first decomposed into canonical factors.3.3 The Learning Algorithm
Let us aggregate the total Rank1 FNN weight parameters as
(9) 
with . In order to train the proposed model a set is used. The learning algorithm minimizes the negative loglikelihood
(10) 
with respect to network responses , with , and targets over all training samples.
The weights of the Rank1 FNN must satisfy the rank1 canonical decomposition expressed by Eq. (5). Assuming that all weights and with are fixed, vector can be estimated; therefore, vector is the only unknown parameter of the network. This vector can be derived through a gradient based optimization algorithm, assuming that the derivative
is known. This derivative can be computed using the backpropagation algorithm. Therefore, an estimation of the parameters of the
Rank1 FNN is obtained by iteratively solving with respect to one of the canonical decomposed weight vectors, assuming the remaining fixed. Algorithm 1 presents the steps of the proposed algorithm.4 Evaluation on Hyperspectral Data
To investigate whether the reduced number of parameters would limit the descriptive power of the Rank1 FNN, we conduct experiments and present quantitative results regarding its classification accuracy on 3order hyperspectral data. In our study, we used (i) the Indian Pines dataset [20], which consists of 224 spectral bands and labeled pixels and (ii) the Pavia University dataset [21], consisting of 103 spectral bands and labeled pixels.
A hyperspectral image is represented as a 3order tensor of dimensions , where and correspond to the height and width of the image and to the spectral bands. In order to classify a pixel at location on image plane and fuse spectral and spatial information, we use a square patch of size centered at . Let us denote as the label of and as the tensor patch centered at . Then, we can form a dataset , which is used to train the classifier.
To evaluate the performance of the Rank1 FNN, we conducted different experiments using training datasets of 50, 100, 150 and 200 samples from each class respectively. Initially, we evaluate the performance of the Rank1 FNN with respect to its complexity, i.e., the value of , indicating the number of hidden neurons. Particularly, we set to be equal to 50, 75, 100 and 125. Greater values of imply a more complex model. The results of this evaluation are presented in Fig.1. Regarding the Indian Pines dataset, we observe that the model with outperforms all other models. When the training set size is very small, i.e. 50 samples per class, the model with underfits the data. On the other hand, the models with and slightly overfit the data due to their high complexity. As far as the Pavia University dataset is concerned, we observe that the model with outperforms all other models, when the dataset size is larger than 50 samples per class. When the training dataset size is 50 samples per class the model with outperforms all other models. The model with overfits the data, while the model with underfits them. In both datasets, as training set increases, the misclassification error decreases.
In the following, we compare the performance of Rank
1 FNN against FNN, RBFSVM, and two deep learning approaches that have been proposed for classifying hyperspectral data; the first one is based on StackedAutoencoders (SAE)
[15], while the second on the exploitation of Convolutional Neural Networks (CNN)
[16]. The FNN consists of one hidden layer with 75 hidden neurons when trained on Indian Pines dataset and 100 hidden neurons for the Pavia University dataset (as derived from Fig.1). The architecture of the network based on SAE consists of three hidden layers, while each hidden layer contains 10% less neurons than its input. The number of hidden neurons from one hidden layer to the next is gradually reduced, so as not to allow the network to learn the identity function during pretraining. Regarding CNN, we utilize exactly the same architecture as the one presented in [16]. The performance of all these models is evaluated on varying size training sets.Pavia University  
Samples per class  50  100  150  200 
Rank1 FNN (Q=100)  89.95  93.50  93.89  95.11 
FCFFNN  67.79  76.53  78.48  82.59 
RBFSVM  86.98  88.99  89.86  91.82 
SAE  86.54  91.90  92.38  93.29 
CNN  88.89  92.74  94.68  95.89 
Indian Pines  
Samples per class  50  100  150  200 
Rank1 FNN (Q=75)  85.20  91.63  92.82  94.15 
FCFFNN  73.88  81.10  84.14  85.86 
RBFSVM  73.18  77.86  82.11  84.99 
SAE  65.51  70.66  74.03  76.49 
CNN  82.43  85.48  92.28  94.81 
Table 1 presents the outcome of this comparison. When the training set size is small, our approach outperforms all other models. This stems from the fact that the proposed Rank1 FNN exploits tensor algebra operations to reduce the number of coefficients that need to be estimated during training, while at the same time it is able to retain the spatial structure of the input. Although the FNN utilizes the same number of hidden neurons as our proposed model, it seems to overfit training sets when a small size dataset is used, due to the fact that it employs a larger number of coefficients. RBFSVM performs better than the FNN on the Pavia University dataset, but slightly worse on the Indian Pines dataset. The full connectivity property of SAE implies very high complexity, which is responsible for its poor performance, due to overfitting in the Indian Pines dataset. Finally, the CNNbased approach performs better than FNN, RBFSVM and SAE mainly because of its sparse connectivity (low complexity) and the fact that it can exploit the spatial information of the input. When the training set consists of 150 and 200 samples per class, for the Pavia University dataset, and 200 samples for the Indian Pines dataset, the CNNbased approach seems to even outperform the Rank1 FNN. This happens because the CNNbased model has higher capacity than the proposed model, which implies that it is capable of better capturing the statistical relationships between the input and the output, when the training set contains sufficient information. However, when the size of the training set is small, which is often the case, Rank1 FNN, due to its lower complexity, consistently outperforms the CNNbased model.
5 Conclusions
In this work, we present a nonlinear tensorbased scheme for highorder data classification. The proposed model is characterized by (i) the small number of weight parameters and (ii) its ability to retain the spatial structure of the highorder input samples. We have evaluated the performance of the model on 3order hyperspectral data in terms of classification accuracy by comparing it against other nonlinear classifiers, including stateoftheart deep learning models. The results indicate that in cases where the size of the training set is small, the proposed Rank1 FNN presents superior performance against the compared methods, including deep learning based ones.
References
 [1] Guoxu Zhou, Qibin Zhao, Yu Zhang, Tülay Adalı, Shengli Xie, and Andrzej Cichocki, “Linked component analysis from matrices to highorder tensors: Applications to biomedical data,” Proceedings of the IEEE, vol. 104, no. 2, pp. 310–331, 2016.
 [2] Gustavo CampsValls and Lorenzo Bruzzone, “Kernelbased methods for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 6, pp. 1351–1362, 2005.
 [3] Tamara G Kolda and Brett W Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.

[4]
Haiping Lu, Konstantinos N Plataniotis, and Anastasios N Venetsanopoulos,
“Mpca: Multilinear principal component analysis of tensor objects,”
IEEE Transactions on Neural Networks, vol. 19, no. 1, pp. 18–39, 2008.  [5] Piyush Rai, Yingjian Wang, Shengbo Guo, Gary Chen, David Dunson, and Lawrence Carin, “Scalable bayesian lowrank decomposition of incomplete multiway tensors,” in International Conference on Machine Learning, 2014, pp. 1800–1808.
 [6] Wei Chu and Zoubin Ghahramani, “Probabilistic models for incomplete multidimensional arrays,” in Artificial Intelligence and Statistics, 2009, pp. 89–96.
 [7] Zenglin Xu, Feng Yan, and Yuan Qi, “Bayesian nonparametric models for multiway data analysis,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 2, pp. 475–487, 2015.
 [8] Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang, “Logistic tensor regression for classification,” in International Conference on Intelligent Science and Intelligent Data Engineering. Springer, 2012, pp. 573–581.
 [9] Hua Zhou, Lexin Li, and Hongtu Zhu, “Tensor regression with applications in neuroimaging data analysis,” Journal of the American Statistical Association, vol. 108, no. 502, pp. 540–552, 2013.
 [10] Qun Li and Dan Schonfeld, “Multilinear discriminant analysis for higherorder tensor data classification,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 12, pp. 2524–2537, 2014.
 [11] Dacheng Tao, Xuelong Li, Xindong Wu, and Stephen J Maybank, “General tensor discriminant analysis and gabor features for gait recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, 2007.
 [12] Peter D Hoff, “Multilinear tensor regression for longitudinal relational data,” The annals of applied statistics, vol. 9, no. 3, pp. 1169, 2015.
 [13] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradientbased learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [14] Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al., “Greedy layerwise training of deep networks,” Advances in neural information processing systems, vol. 19, pp. 153, 2007.
 [15] Yushi Chen, Zhouhan Lin, Xing Zhao, Gang Wang, and Yanfeng Gu, “Deep learningbased classification of hyperspectral data,” IEEE Journal of Selected topics in applied earth observations and remote sensing, vol. 7, no. 6, pp. 2094–2107, 2014.

[16]
K. Makantasis, K. Karantzalos, A. Doulamis, and N. Doulamis,
“Deep Supervised Learning for Hyperspectral Data Classification through Convolutional Neural Networks,”
in IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2015), July 2015.  [17] M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, “Building detection in very high resolution multispectral data with deep learning features,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2015), July 2015.
 [18] K. Makantasis, K. Karantzalos, A. Doulamis, and K. Loupos, “Deep learningbased manmade object detection from hyperspectral data,” in International Symposium on Visual Computing. Springer, 2015, pp. 717–727.
 [19] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle, “Computation of the canonical decomposition by means of a simultaneous generalized schur decomposition,” SIAM journal on Matrix Analysis and Applications, vol. 26, no. 2, pp. 295–327, 2004.
 [20] Marion F Baumgardner, Larry L Biehl, and David A Landgrebe, “220 band aviris hyperspectral image data set: June 12, 1992 indian pine test site 3,” Purdue University Research Repository, 2015.
 [21] Antonio Plaza, Jon Atli Benediktsson, Joseph W Boardman, Jason Brazile, Lorenzo Bruzzone, Gustavo CampsValls, Jocelyn Chanussot, Mathieu Fauvel, Paolo Gamba, Anthony Gualtieri, et al., “Recent advances in techniques for hyperspectral image processing,” Remote sensing of environment, vol. 113, pp. S110–S122, 2009.