Recent advances in sensing technologies have stimulated the development and deployment of sensors that can generate large amounts of high-order data. Interdependencies between information from different data modalities can improve the performance of data classification techniques . However, exploitation of high-order data raises new research challenges mainly due to the high dimensionality of the acquired information and, depending on the application at hand, the limited number of labeled examples .
Tensor subspace learning methods, such as HOSVD, Tucker decomposition and CANDECOMP , MPCA  and probabilistic decompositions [5, 6, 7] have been proposed to tackle the dimensionality problem. These methods project the raw data to a lower dimensional space, in which the projected data can be considered as highly descriptive features of the raw information. The key problem in applying such methods in classifying high-order data, is that they do not take into consideration the data labels; therefore, the resulting features may not be sufficiently discriminative with respect to the classification task. Tensor-based classifiers capable of mapping high-order data to desired outputs have also been proposed [8, 9, 10, 11, 12]. However, these methods are restricted to producing linear decision boundaries in feature space, and are therefore unable to cope with complex problems, where nonlinear decision boundaries are necessary to obtain classification results of high accuracy. In order to better disentangle the input-output statistical relationships, deep learning approaches [13, 14] have been investigated for high-order data classification [15, 16, 17, 18]. Nevertheless, a typical deep learning architecture contains a huge number of tunable parameters, implying that a large number of labeled samples is also needed for accurate training.
, the paper at hand investigates a multi-class classification problem using a nonlinear tensor-based classifier. The proposed classifier is able to (i) handle raw high-order data without vectorizing them, and (ii) produce nonlinear decision boundaries, thus capturing complex statistical relationships between the data. The proposed scheme, henceforth calledRank-1 FNN, is based on a modification of a feedforward neural network (FNN), such that its weights satisfy the rank-1 canonical decomposition property, i.e., the weights are decomposed as a linear combination of a minimal number of possibly non-orthogonal rank-1 terms . Thence, the number of model parameters, and thus of training samples required, can be significantly reduced. We also introduce a new learning algorithm to train the network without violating the canonical decomposition property.
2 Problem Formulation and Tensor algebra Notation
2.1 Problem Formulation
Let us denote as the -th -order tensor example that we aim at classifying into one of classes. Let us also denote as
the probability ofbelonging to the -th class. Aggregating the values over all classes, we form a classification vector, , the elements of which . Then, the maximum value over all classes indicates the class to which the belongs. The values of
during the training phase of a machine learning model. Vectorand its elements are all zero except for one which equals unity indicating the class to which belongs. In the following, we omit subscript for simplicity purposes if we refer to an input sample.
2.2 Tensor Algebra Notations and Definitions
In this paper, tensors, vectors and scalars are denoted in bold uppercase, bold lowercase and lowercase letters, respectively. We hereby present some definitions that will be used through out this work.
Tensor vectorization. The operator stacks the entries of a -order tensor into a column vector.
Tensor matricization.The mode-d matricization, , maps a tensor into a matrix by arranging the mode-d fibers to be the columns of the resulting matrix.
Rank-R decomposition. A tensor admits a rank-R decomposition if , where . The decomposition can be represented by , where . When a tensor admits a rank-R decomposition, it holds that:
where stands for the Khatri-Rao product. For more information on tensor algebra see .
3 High-order nonlinear modeling
The proposed Rank-1 FNN is based on the concepts of ; however, in our case, the probability of an input example belonging to the -th class is nonlinearly interwoven with respect to the input tensor data and the weight parameters through a function , i.e., . The main difficulty in implementing is that is actually unknown. One way to parameterize
is to exploit the principles of the universal approximation theorem, stating that a function can be approximated by a FNN with a finite number of neurons within any degree of accuracy.
However, applying a FNN for high-order data classification involves two drawbacks. First, a large number of weights has to be learned; , where refers to the number of hidden neurons. This, in the sequel, implies that a large number of labeled samples are needed to successfully train the network. Second, the weights of the network are not directly related to the physical properties of the information belonging to different modes of the data, since the inputs are vectorized and thus they do not preserve their structure.
To overcome these problems, we propose a modification of FNN so that network weights from the input to the hidden layer satisfy the rank-1 canonical decomposition. Before presenting the Rank-1 FNN, we briefly describe how is modeled through a FNN.
3.1 FNN Modeling
A FNN, with hidden neurons, nonlinearly approximates the probability
by associating a nonlinear activation function
with each one of its hidden neurons. In this paper, the sigmoid functionis selected. The activation function of the -th neuron receives as input the inner product of and a weight vector and produces as output a scalar given by
Gathering the responses of all hidden neurons in one vector , we have that
where is a matrix containing the weights . Thus, the output of the network is given as
where stands for the softmax function, the weights between the hidden and the output layer and the superscript for the -th class.
3.2 Rank-1 FNN Modeling
To reduce the number of parameters of the network and to relate the classification results to the information belonging to different modes of the input data, we rank-1 canonically decompose the weight parameters as:
Eq. (5) can be seen as an expression of the Khatri-Rao product, which is the column-wise Kronecker product, denoted as , of the rank-1 canonical decomposition weight parameters . Thus, and the total number of Rank-1 FNN is . Based on the statements of Section 2.2, it holds that
Vector is a transformed version of input , that is,
and is independent from . Eq. (7
) actually resembles the operation of a single perceptron having as inputs the weightsand the transformed version of the input data. In other words, if the rank-1 canonically decomposed weights with are known, then will be also known. The main modification of this structure compared to a typical FNN lies in the hidden layer, where the weights of a hidden neuron are first decomposed into canonical factors.
3.3 The Learning Algorithm
Let us aggregate the total Rank-1 FNN weight parameters as
with . In order to train the proposed model a set is used. The learning algorithm minimizes the negative log-likelihood
with respect to network responses , with , and targets over all training samples.
The weights of the Rank-1 FNN must satisfy the rank-1 canonical decomposition expressed by Eq. (5). Assuming that all weights and with are fixed, vector can be estimated; therefore, vector is the only unknown parameter of the network. This vector can be derived through a gradient based optimization algorithm, assuming that the derivative
is known. This derivative can be computed using the backpropagation algorithm. Therefore, an estimation of the parameters of theRank-1 FNN is obtained by iteratively solving with respect to one of the canonical decomposed weight vectors, assuming the remaining fixed. Algorithm 1 presents the steps of the proposed algorithm.
4 Evaluation on Hyperspectral Data
To investigate whether the reduced number of parameters would limit the descriptive power of the Rank-1 FNN, we conduct experiments and present quantitative results regarding its classification accuracy on 3-order hyperspectral data. In our study, we used (i) the Indian Pines dataset , which consists of 224 spectral bands and labeled pixels and (ii) the Pavia University dataset , consisting of 103 spectral bands and labeled pixels.
A hyperspectral image is represented as a 3-order tensor of dimensions , where and correspond to the height and width of the image and to the spectral bands. In order to classify a pixel at location on image plane and fuse spectral and spatial information, we use a square patch of size centered at . Let us denote as the label of and as the tensor patch centered at . Then, we can form a dataset , which is used to train the classifier.
To evaluate the performance of the Rank-1 FNN, we conducted different experiments using training datasets of 50, 100, 150 and 200 samples from each class respectively. Initially, we evaluate the performance of the Rank-1 FNN with respect to its complexity, i.e., the value of , indicating the number of hidden neurons. Particularly, we set to be equal to 50, 75, 100 and 125. Greater values of imply a more complex model. The results of this evaluation are presented in Fig.1. Regarding the Indian Pines dataset, we observe that the model with outperforms all other models. When the training set size is very small, i.e. 50 samples per class, the model with underfits the data. On the other hand, the models with and slightly overfit the data due to their high complexity. As far as the Pavia University dataset is concerned, we observe that the model with outperforms all other models, when the dataset size is larger than 50 samples per class. When the training dataset size is 50 samples per class the model with outperforms all other models. The model with overfits the data, while the model with underfits them. In both datasets, as training set increases, the misclassification error decreases.
In the following, we compare the performance of Rank
-1 FNN against FNN, RBF-SVM, and two deep learning approaches that have been proposed for classifying hyperspectral data; the first one is based on Stacked-Autoencoders (SAE)
, while the second on the exploitation of Convolutional Neural Networks (CNN). The FNN consists of one hidden layer with 75 hidden neurons when trained on Indian Pines dataset and 100 hidden neurons for the Pavia University dataset (as derived from Fig.1). The architecture of the network based on SAE consists of three hidden layers, while each hidden layer contains 10% less neurons than its input. The number of hidden neurons from one hidden layer to the next is gradually reduced, so as not to allow the network to learn the identity function during pre-training. Regarding CNN, we utilize exactly the same architecture as the one presented in . The performance of all these models is evaluated on varying size training sets.
|Samples per class||50||100||150||200|
|Rank-1 FNN (Q=100)||89.95||93.50||93.89||95.11|
|Samples per class||50||100||150||200|
|Rank-1 FNN (Q=75)||85.20||91.63||92.82||94.15|
Table 1 presents the outcome of this comparison. When the training set size is small, our approach outperforms all other models. This stems from the fact that the proposed Rank-1 FNN exploits tensor algebra operations to reduce the number of coefficients that need to be estimated during training, while at the same time it is able to retain the spatial structure of the input. Although the FNN utilizes the same number of hidden neurons as our proposed model, it seems to overfit training sets when a small size dataset is used, due to the fact that it employs a larger number of coefficients. RBF-SVM performs better than the FNN on the Pavia University dataset, but slightly worse on the Indian Pines dataset. The full connectivity property of SAE implies very high complexity, which is responsible for its poor performance, due to overfitting in the Indian Pines dataset. Finally, the CNN-based approach performs better than FNN, RBF-SVM and SAE mainly because of its sparse connectivity (low complexity) and the fact that it can exploit the spatial information of the input. When the training set consists of 150 and 200 samples per class, for the Pavia University dataset, and 200 samples for the Indian Pines dataset, the CNN-based approach seems to even outperform the Rank-1 FNN. This happens because the CNN-based model has higher capacity than the proposed model, which implies that it is capable of better capturing the statistical relationships between the input and the output, when the training set contains sufficient information. However, when the size of the training set is small, which is often the case, Rank-1 FNN, due to its lower complexity, consistently outperforms the CNN-based model.
In this work, we present a nonlinear tensor-based scheme for high-order data classification. The proposed model is characterized by (i) the small number of weight parameters and (ii) its ability to retain the spatial structure of the high-order input samples. We have evaluated the performance of the model on 3-order hyperspectral data in terms of classification accuracy by comparing it against other nonlinear classifiers, including state-of-the-art deep learning models. The results indicate that in cases where the size of the training set is small, the proposed Rank-1 FNN presents superior performance against the compared methods, including deep learning based ones.
-  Guoxu Zhou, Qibin Zhao, Yu Zhang, Tülay Adalı, Shengli Xie, and Andrzej Cichocki, “Linked component analysis from matrices to high-order tensors: Applications to biomedical data,” Proceedings of the IEEE, vol. 104, no. 2, pp. 310–331, 2016.
-  Gustavo Camps-Valls and Lorenzo Bruzzone, “Kernel-based methods for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 6, pp. 1351–1362, 2005.
-  Tamara G Kolda and Brett W Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
Haiping Lu, Konstantinos N Plataniotis, and Anastasios N Venetsanopoulos,
“Mpca: Multilinear principal component analysis of tensor objects,”IEEE Transactions on Neural Networks, vol. 19, no. 1, pp. 18–39, 2008.
-  Piyush Rai, Yingjian Wang, Shengbo Guo, Gary Chen, David Dunson, and Lawrence Carin, “Scalable bayesian low-rank decomposition of incomplete multiway tensors,” in International Conference on Machine Learning, 2014, pp. 1800–1808.
-  Wei Chu and Zoubin Ghahramani, “Probabilistic models for incomplete multi-dimensional arrays,” in Artificial Intelligence and Statistics, 2009, pp. 89–96.
-  Zenglin Xu, Feng Yan, and Yuan Qi, “Bayesian nonparametric models for multiway data analysis,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 2, pp. 475–487, 2015.
-  Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang, “Logistic tensor regression for classification,” in International Conference on Intelligent Science and Intelligent Data Engineering. Springer, 2012, pp. 573–581.
-  Hua Zhou, Lexin Li, and Hongtu Zhu, “Tensor regression with applications in neuroimaging data analysis,” Journal of the American Statistical Association, vol. 108, no. 502, pp. 540–552, 2013.
-  Qun Li and Dan Schonfeld, “Multilinear discriminant analysis for higher-order tensor data classification,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 12, pp. 2524–2537, 2014.
-  Dacheng Tao, Xuelong Li, Xindong Wu, and Stephen J Maybank, “General tensor discriminant analysis and gabor features for gait recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, 2007.
-  Peter D Hoff, “Multilinear tensor regression for longitudinal relational data,” The annals of applied statistics, vol. 9, no. 3, pp. 1169, 2015.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al., “Greedy layer-wise training of deep networks,” Advances in neural information processing systems, vol. 19, pp. 153, 2007.
-  Yushi Chen, Zhouhan Lin, Xing Zhao, Gang Wang, and Yanfeng Gu, “Deep learning-based classification of hyperspectral data,” IEEE Journal of Selected topics in applied earth observations and remote sensing, vol. 7, no. 6, pp. 2094–2107, 2014.
K. Makantasis, K. Karantzalos, A. Doulamis, and N. Doulamis,
“Deep Supervised Learning for Hyperspectral Data Classification through Convolutional Neural Networks,”in IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2015), July 2015.
-  M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, “Building detection in very high resolution multispectral data with deep learning features,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2015), July 2015.
-  K. Makantasis, K. Karantzalos, A. Doulamis, and K. Loupos, “Deep learning-based man-made object detection from hyperspectral data,” in International Symposium on Visual Computing. Springer, 2015, pp. 717–727.
-  Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle, “Computation of the canonical decomposition by means of a simultaneous generalized schur decomposition,” SIAM journal on Matrix Analysis and Applications, vol. 26, no. 2, pp. 295–327, 2004.
-  Marion F Baumgardner, Larry L Biehl, and David A Landgrebe, “220 band aviris hyperspectral image data set: June 12, 1992 indian pine test site 3,” Purdue University Research Repository, 2015.
-  Antonio Plaza, Jon Atli Benediktsson, Joseph W Boardman, Jason Brazile, Lorenzo Bruzzone, Gustavo Camps-Valls, Jocelyn Chanussot, Mathieu Fauvel, Paolo Gamba, Anthony Gualtieri, et al., “Recent advances in techniques for hyperspectral image processing,” Remote sensing of environment, vol. 113, pp. S110–S122, 2009.