Dimensionality reduction is effective for learning from high-dimensional data. One important benefit is that it helps avoid overfitting by revealing a few essential parameters. Another is that it provides a better understanding of decision boundaries in a human-readable low-dimensional subspace: we can understand the reason for a prediction by analyzing the positional relationship between data points and decision boundaries. Several dimensionality-reduction methods, such as t-SNE Maaten08 , kernel PCA Barshan11 ; Hosseini19 ; Scholkopf98 , and Isomap tenenbaum_global_2000 , nonlinearly reduce dimensionality; however, the relationship between the original variables and the projections is not straightforward and is difficult to understand in many cases. On the other hand, variable-selection methods such as the celebrated LASSO approach tibshirani96 may have difficulty explaining phenomena, such as biological mechanisms, that should be described as interactions between sets of variables that correlate only slightly with each other. Linear methods without variable selection, such as principal component analysis (PCA) and linear discriminant analysis (LDA), are desirable from the interpretation point of view because we can easily understand the correlations among many variables in the projected low-dimensional subspace. We can apply any prediction model to the projected data points; however, there is no guarantee that the features important for prediction are kept in the projected subspace. To highlight these issues, we give a vivid example using artificial data for a two-class classification task with high-dimensional variables. These data are created by embedding a two-dimensional spiral distribution (Fig. 1(a)) in a high-dimensional space by adding noise dimensions (Fig. 1(b)) and performing a rotation transformation (see details in Section 5). Figure 1(c) shows the distribution of the data points projected to a two-dimensional subspace using LDA, a supervised method.
However, LDA overfits the noise, and the spiral distribution cannot be observed at all. By contrast, PCA does not seem to overfit the noise, but it is still difficult to recognize the spiral distribution (Fig. 1(e)). Thus, it is difficult for linear projection methods to deal with strong nonlinearities in high-dimensional space. In this paper, we propose a learning method that searches for a subspace maximizing the prediction accuracy while retaining as much of the original data variance as possible; that is, we optimize the projection matrix $\mathbf{W}$ to minimize a regularized-by-reconstruction loss such as

$$ \min_{\mathbf{W}, \boldsymbol{\theta}} \sum_{i=1}^{N} \left\{ f(\mathbf{W}^{\top} \mathbf{x}_i, y_i; \boldsymbol{\theta}) + \lambda \, \|\mathbf{x}_i - \mathbf{W} \mathbf{W}^{\top} \mathbf{x}_i\|^2 \right\} \quad \mathrm{s.t.} \quad \mathbf{W}^{\top} \mathbf{W} = \mathbf{I}, \tag{1} $$

where $\mathbf{x}_i$ is the $i$-th data point, $f$ is the loss of the prediction model that predicts the scalar response $y_i$ in the subspace, $\boldsymbol{\theta}$ are the parameters of the prediction model, $\mathbf{I}$ is an identity matrix, and $\lambda$ is the regularization parameter. This strategy meets two requirements simultaneously. First, the subspace keeps the variance of the data points as large as possible by minimizing the reconstruction errors $\|\mathbf{x}_i - \mathbf{W} \mathbf{W}^{\top} \mathbf{x}_i\|^2$, the same as with PCA. Second, $y_i$ can be predicted as accurately as possible by minimizing $f$, even if the prediction model requires strong nonlinearity. We can successfully find the spiral distribution, as shown in Fig. 1(d), by calculating an optimal $\mathbf{W}$ satisfying Eq. 1.
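This objective can be sketched in a few lines of numpy. The squared prediction loss and the linear predictor below are illustrative stand-ins for the general loss $f$ and its model; all function and variable names are ours, not the paper's:

```python
import numpy as np

def trip_vector_loss(W, theta, X, y, lam):
    """Regularized-by-reconstruction loss, vector case (a sketch of Eq. 1).

    W     : (D, R) projection matrix (columns assumed orthonormal)
    theta : (R,)   parameters of an illustrative linear prediction model
    X     : (N, D) data points; y : (N,) scalar responses
    lam   : regularization parameter trading prediction vs. reconstruction
    """
    Z = X @ W                                   # projected points, shape (N, R)
    pred_loss = np.mean((Z @ theta - y) ** 2)   # prediction term (squared loss here)
    X_rec = Z @ W.T                             # reconstruction from the subspace
    rec_loss = np.mean(np.sum((X - X_rec) ** 2, axis=1))  # PCA-like term
    return pred_loss + lam * rec_loss
```

With `lam = 0` the loss reduces to the pure prediction term; larger `lam` pulls the subspace toward the principal subspace of the data.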
The issues described above also apply to tensor data. Several tensor-regression methods have been proposed for data represented by matrices or tensors that describe the relationships between variables guhaniyogi_bayesian_2017 ; he_boosted_2018 ; imaizumi_doubly_2016 ; yu_learning_2016 . However, these methods cannot cope with strong nonlinearities. On the other hand, several methods using the kernel trick he_kernelized_2017 ; Tao07 lose the correlations between the many variables of the original data. We extend Eq. 1 to a tensor expression so that it can be applied to tensor data.
To the best of our knowledge, this is the first study to propose a method that allows searching for a linear projection without variable selection that maintains the maximum data variation and allows high-accuracy prediction even with strong nonlinearity, including matrix and tensor data. We report the results of evaluating the proposed method using artificial data and real-world data including higher-order data. We empirically show that our method can classify high-dimensional data with high accuracy while having high interpretability.
2 Related Works
PCA is the most well-known unsupervised dimensionality-reduction method; it represents a high-dimensional space by a few orthogonal variables that retain a large portion of the data variance jolliffe_principal_2002 . Several methods have been proposed to find a low-dimensional subspace in a supervised manner, such as LDA and partial least squares (PLS) Martinez01 ; Rosipal06 . However, these methods strongly assume that the distribution of samples in each class is normal, and they cannot be applied to datasets with strong nonlinearity such as that shown in Fig. 1(a). Several dimensionality-reduction methods, such as t-SNE Maaten08 , kernel PCA Barshan11 ; Hosseini19 ; Scholkopf98 , and Isomap tenenbaum_global_2000 , nonlinearly reduce dimensionality; however, the relationship between the original variables and the projections is not straightforward and is difficult to understand in many cases.
Tucker decomposition Kolda09 is a well-known unsupervised tensor-decomposition method. Several studies on tensor-regression tasks have been reported guhaniyogi_bayesian_2017 ; he_boosted_2018 ; imaizumi_doubly_2016 ; yu_learning_2016 . Most of these methods reduce the number of parameters by expressing the partial regression coefficients in low-rank form and can be regarded as multi-linear projections into a lower-dimensional tensor subspace. However, these methods cannot cope with the nonlinearity inherent in the data. Kernel methods have been applied to deal with nonlinearities he_kernelized_2017 ; Tao07 ; however, they lose the correlations between the many variables.
Prediction interpretation and justification is one of the most important topics in recent machine-learning research Adadi18 ; Lipton18 . One effective approach is to produce interpretable models. Ribeiro et al. introduced a model-agnostic approach that provides explanations for black-box models by learning interpretable models Ribeiro16 . Our method allows us to better understand the interpretable model by visualizing it along with the decision boundaries in a human-readable low-dimensional subspace.
3.1 Notations
Scalars are denoted as lowercase letters ($a$), vectors as boldface lowercase letters ($\mathbf{a}$), matrices as boldface capital letters ($\mathbf{A}$), and tensors as boldface Euler script letters ($\mathcal{X}$). The transpose of $\mathbf{A}$ is $\mathbf{A}^{\top}$. $\mathbf{I}$ is an identity matrix, and $\mathbf{0}$ and $\mathbf{1}$ are square matrices in which all values are $0$ and $1$, respectively. The square root of the sum of the squared values in a tensor $\mathcal{X}$ (its Frobenius norm) is denoted as $\|\mathcal{X}\|$. We refer to the dimensionality of a tensor as its order and to each dimension as a mode. The inner product of two same-sized tensors is denoted as $\langle \mathcal{X}, \mathcal{Y} \rangle$. The outer product of vectors $\mathbf{a}^{(1)}, \dots, \mathbf{a}^{(M)}$ is denoted as $\mathbf{a}^{(1)} \circ \cdots \circ \mathbf{a}^{(M)}$, whose $(i_1, \dots, i_M)$-th element is $\prod_{m=1}^{M} a^{(m)}_{i_m}$, where $a^{(m)}_{i_m}$ is the $i_m$-th element of $\mathbf{a}^{(m)}$. Element-wise multiplication and division are denoted as $*$ and $\oslash$, respectively. The Khatri-Rao product $\mathbf{A} \odot \mathbf{B}$ between $\mathbf{A}$ of size $I \times K$ and $\mathbf{B}$ of size $J \times K$ is a matrix of size $IJ \times K$ whose $k$-th column is the Kronecker product $\mathbf{a}_k \otimes \mathbf{b}_k$ of the corresponding columns. The vectorization of a tensor $\mathcal{X}$ creates a vector $\mathrm{vec}(\mathcal{X})$. The mode-$m$ matricization of a tensor $\mathcal{X}$ of size $I_1 \times \cdots \times I_M$ creates a matrix $\mathbf{X}_{(m)}$ of size $I_m \times \prod_{m' \neq m} I_{m'}$. An $m$-mode product of a tensor $\mathcal{X}$ with a matrix $\mathbf{U}$ of size $J \times I_m$ is a tensor $\mathcal{X} \times_m \mathbf{U}$ of size $I_1 \times \cdots \times I_{m-1} \times J \times I_{m+1} \times \cdots \times I_M$, where $(\mathcal{X} \times_m \mathbf{U})_{i_1 \cdots i_{m-1} j i_{m+1} \cdots i_M} = \sum_{i_m} x_{i_1 \cdots i_M} u_{j i_m}$. In case $\mathbf{u}$ is a vector of length $I_m$, we remove the $m$-th mode and resize the result accordingly. A vector $\partial L / \partial \mathbf{a}$ and a matrix $\partial L / \partial \mathbf{A}$ denote the differentials of $L$ over the elements of $\mathbf{a}$ and $\mathbf{A}$, respectively.
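For concreteness, the mode-$m$ matricization and the $m$-mode product can be sketched in numpy as follows; the function names are ours, and `np.moveaxis`/`np.tensordot` handle the reindexing:

```python
import numpy as np

def unfold(T, mode):
    """Mode-m matricization: move `mode` to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, U, mode):
    """m-mode product T x_m U: multiply U (J x I_m) into the given mode of T."""
    Tm = unfold(T, mode)                          # (I_m, prod of other dims)
    out = U @ Tm                                  # (J, prod of other dims)
    new_shape = (U.shape[0],) + tuple(np.delete(T.shape, mode))
    return np.moveaxis(out.reshape(new_shape), 0, mode)
```

The unfold/fold pair is self-consistent, so `mode_product` agrees with a direct contraction of the chosen mode.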
3.2 Problem Definition
As we mentioned in Section 1, we extend our problem to a tensor expression. Among the several ways to project a tensor $\mathcal{X}$ of size $I_1 \times \cdots \times I_M$ to a tensor $\mathcal{Z}$ of size $R_1 \times \cdots \times R_M$, we propose, from the interpretation point of view, projecting by the $m$-mode product of each mode using a projection matrix $\mathbf{W}^{(m)}$ of size $I_m \times R_m$, i.e., $\mathcal{Z} = \mathcal{X} \times_1 \mathbf{W}^{(1)\top} \times_2 \cdots \times_M \mathbf{W}^{(M)\top}$. Thus, our problem can be defined as follows:
Problem 1 (Linear Tensor Projection for Nonlinear Prediction (LTPNP))
Given a set of training samples $\{(\mathcal{X}_i, y_i)\}_{i=1}^{N}$, where $\mathcal{X}_i$ is an $M$-th-order tensor of size $I_1 \times \cdots \times I_M$ and $y_i$ is a scalar response, how can we learn the projection matrices $\mathbf{W}^{(m)}$ that project $\mathcal{X}_i$ into $\mathcal{Z}_i$, along with a prediction model $f(\mathcal{Z}_i; \boldsymbol{\theta})$ with strong nonlinearity that can accurately predict the response based on the projected tensor $\mathcal{Z}_i$, where $\boldsymbol{\theta}$ are the parameters of the prediction model?
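A minimal numpy sketch of this multi-mode projection; the function and variable names are ours, for illustration:

```python
import numpy as np

def project(X, Ws):
    """Project tensor X into the subspace via m-mode products with each W^(m)T.

    X  : data tensor of shape (I_1, ..., I_M)
    Ws : list of projection matrices, Ws[m] of shape (I_m, R_m)
    """
    Z = X
    for mode, W in enumerate(Ws):
        # contract mode `mode` of Z (length I_m) with the rows of W (I_m x R_m),
        # then move the new R_m axis back into position `mode`
        Z = np.moveaxis(np.tensordot(Z, W, axes=(mode, 0)), -1, mode)
    return Z
```

For a second-order tensor (a matrix) this reduces to the familiar bilinear form $\mathbf{W}^{(1)\top} \mathbf{X} \mathbf{W}^{(2)}$.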
4 Proposed Method
We propose a practical method of solving the LTPNP problem. We will refer to this method as Tensor Reconstruction based Interpretable Prediction (TRIP).
As we explain later, we propose learning the projection matrices $\mathbf{W}^{(m)}$ and the prediction model $f$ by using stochastic gradient descent (SGD). Typical prediction models that can be trained by using SGD include neural networks (NNs) and linear/softmax regression. Such a prediction model can be written as

$$ f(\mathcal{Z}; \boldsymbol{\theta}) = g\left( \langle \mathcal{W}, \mathcal{Z} \rangle; \boldsymbol{\theta}' \right), $$

where $\langle \mathcal{W}, \mathcal{Z} \rangle$ collects the inner products between $\mathcal{Z}$ and the slices of a weight tensor $\mathcal{W}$ of size $R_1 \times \cdots \times R_M \times K$, $g$ is the prediction model obtained by removing $\mathcal{W}$ from $f$, taking a vector of size $K$ as input, and $\boldsymbol{\theta}'$ are the remaining parameters other than $\mathcal{W}$. If the prediction model is linear/softmax regression, $\mathcal{W}$ holds the partial regression coefficients. Moreover, if the prediction model is an NN, $\mathcal{W}$ holds the parameters between $\mathcal{Z}$ and the first layer. In any case, the number of elements of $\mathcal{W}$ grows multiplicatively with the number of modes, so overfitting easily occurs. As a solution to this problem, we set $\mathcal{W}$ to be a low-rank tensor,

$$ \mathcal{W} = \sum_{r=1}^{R} \mathbf{v}_r^{(1)} \circ \cdots \circ \mathbf{v}_r^{(M)} \circ \mathbf{v}_r^{(M+1)}, $$

by using factor vectors $\mathbf{v}_r^{(m)}$, so that the number of parameters can be reduced.
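Building such a low-rank weight tensor from per-mode factor vectors (a CP-style sum of rank-one outer products) can be sketched as follows; the names here are illustrative:

```python
import numpy as np

def cp_weight(factors):
    """Assemble a weight tensor as a sum of R rank-one outer products.

    factors[m] has shape (dim_m, R): one column per rank-one component.
    """
    R = factors[0].shape[1]
    W = 0
    for r in range(R):
        comp = factors[0][:, r]
        for V in factors[1:]:
            # successive outer products grow the rank-one component mode by mode
            comp = np.multiply.outer(comp, V[:, r])
        W = W + comp
    return W
```

Storing only the factor columns costs $R \sum_m \mathrm{dim}_m$ parameters instead of $\prod_m \mathrm{dim}_m$ for the full tensor.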
We show a method of learning the projection matrices, the weight tensor, and the remaining parameters using SGD. We extend the regularized-by-reconstruction loss (Eq. 1) to a tensor expression as

$$ L = \sum_{i=1}^{N} \left\{ f(\mathcal{Z}_i, y_i; \boldsymbol{\theta}) + \lambda \, \big\| \mathcal{X}_i - \mathcal{Z}_i \times_1 \mathbf{W}^{(1)} \cdots \times_M \mathbf{W}^{(M)} \big\|^2 \right\}, \tag{4} $$

where $\mathcal{Z}_i = \mathcal{X}_i \times_1 \mathbf{W}^{(1)\top} \cdots \times_M \mathbf{W}^{(M)\top}$ and $\lambda$ is the regularization parameter. It is necessary to satisfy the orthonormal condition $\mathbf{W}^{(m)\top} \mathbf{W}^{(m)} = \mathbf{I}$. We adopt a similar idea to Maruhashi18 , which uses a matrix $\mathbf{V}^{(m)}$ of the same size as $\mathbf{W}^{(m)}$ as latent variables and calculates $\mathbf{W}^{(m)}$ as a matrix that satisfies the orthonormal condition derived from $\mathbf{V}^{(m)}$. That is, the SVD of $\mathbf{V}^{(m)}$
is calculated as $\mathbf{V}^{(m)} = \mathbf{U}^{(m)} \mathbf{S}^{(m)} \mathbf{Q}^{(m)\top}$, and we set

$$ \mathbf{W}^{(m)} = \mathbf{U}^{(m)} \mathbf{Q}^{(m)\top}, \tag{5} $$

where $\mathbf{U}^{(m)}$ and $\mathbf{Q}^{(m)}$ are matrices having the left and right singular vectors as column vectors, and $\mathbf{S}^{(m)}$ is a diagonal matrix having the singular values as diagonal components. Our idea is to calculate the gradient of the loss with respect to each element of $\mathbf{W}^{(m)}$, transform it into the gradient with respect to $\mathbf{V}^{(m)}$, and thereby update $\mathbf{W}^{(m)}$ by updating $\mathbf{V}^{(m)}$. As a result, we can calculate the required gradients,
where the gradient with respect to $\mathcal{Z}_i$ is calculated using methods specific to the prediction model, such as the back-propagation algorithm in NNs, by which the gradients with respect to $\boldsymbol{\theta}$ can also be calculated. Moreover, the gradient with respect to $\mathbf{V}^{(m)}$ can be calculated through the chain rule over the SVD (Eq. 7; see Appendix A).
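The SVD-based construction of an orthonormal $\mathbf{W}^{(m)}$ from an unconstrained latent matrix can be sketched in numpy; the function name is ours:

```python
import numpy as np

def orthonormalize(V):
    """Map a latent matrix V to W = U Q^T, where V = U S Q^T is the SVD,
    so that W^T W = I exactly."""
    U, s, Qt = np.linalg.svd(V, full_matrices=False)
    return U @ Qt
```

This $\mathbf{U}\mathbf{Q}^{\top}$ factor is also the orthonormal matrix closest to $\mathbf{V}$ in the Frobenius norm (the polar/orthogonal-Procrustes solution), which keeps the parameterization well behaved while SGD moves $\mathbf{V}$ freely.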
The prediction and training algorithms are shown in Algorithm 1.
4.3 Computational Complexity
The computation mainly consists of two parts: the projection and the propagation in the NN. The calculation of the projection scales linearly with the total number of non-zero elements in the dataset and with the size of the projected tensor, which bounds its time and space complexity. The complexity of forward- and back-propagation in an NN is linear in time and space in the number of edges $E$ in the NN, where $E$ comprises the edges between the projected tensor and the first layer, whose count equals the size of the weight tensor, plus the remaining edges.
4.4 Understanding Subspace
4.4.1 Extraction of Independent Components
Even if $\mathbf{W}^{(m)}$ and the corresponding weight factors are changed by a rotation transform $\mathbf{R}^{(m)}$, the value of Eq. 4 does not change. We propose to determine $\mathbf{R}^{(m)}$ so that each axis can be understood as the coefficient of an independent component after regression in the modes other than the $m$-th mode. When the prediction model is linear, $\mathbf{R}^{(m)}$ can be calculated so that the regressed components become independent of each other; that is, the column vectors of $\mathbf{R}^{(m)}$ are the left singular vectors of the matrix having the normalized regression coefficients as its column vectors. If the prediction model is a nonlinear model such as an NN, we can use the same strategy as in the linear case by learning a linear surrogate model $\hat{f}(\mathcal{Z}) = \langle \mathcal{B}, \mathcal{Z} \rangle + b$, where $\mathcal{B}$ holds the regression coefficients and $b$ is a bias. $\mathcal{B}$ and $b$ are calculated so that $\hat{f}$ produces results similar to $f$ by minimizing $\sum_i \{\hat{f}(\mathcal{Z}_i) - f(\mathcal{Z}_i; \boldsymbol{\theta})\}^2$. We can use the alternating least squares (ALS) method Kolda09 , with which the coefficients of each mode are alternately calculated by fixing the coefficients of the other modes.
4.4.2 Local Regression Coefficients
It might be difficult to understand a strongly nonlinear decision boundary using a global linear surrogate model. We propose to learn a local linear surrogate model $\hat{f}_i(\mathcal{Z}) = \langle \mathcal{B}_i, \mathcal{Z} \rangle + b_i$, which applies only to the data points around the $i$-th point, where $\mathcal{B}_i$ holds the local regression coefficients (LRC) and $b_i$ is a bias. $\mathcal{B}_i$ and $b_i$ are calculated by minimizing $\sum_j s_{ij} \{\hat{f}_i(\mathcal{Z}_j) - f(\mathcal{Z}_j; \boldsymbol{\theta})\}^2$, where $s_{ij}$ is a similarity between $\mathcal{Z}_i$ and $\mathcal{Z}_j$ controlled by a bandwidth parameter. This is almost the same as the local interpretable model proposed by Ribeiro et al. Ribeiro16 , except that we apply it to the projected tensor data. We can calculate the parameters using the ALS method, the same as for the global linear surrogate model. The LRC can be observed using the coefficients calculated under the local linear surrogate model; for tensor data, the LRC of the $m$-th mode are obtained by projecting the coefficients in the modes other than the $m$-th mode. Also, we can interpret the LRC in the language of the original variables by mapping them back through the projection matrices.
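For vector-shaped projected data, fitting LRC reduces to a weighted least-squares problem. The sketch below assumes a Gaussian similarity kernel; the kernel form and all names are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def local_regression_coefficients(Z, y, i, sigma):
    """Fit a local linear surrogate around the i-th projected point.

    Z     : (N, R) projected data points; y : (N,) model outputs to mimic
    i     : index of the anchor point; sigma : similarity bandwidth
    Minimizes sum_j s_ij * (beta . z_j + b - y_j)^2 with
    s_ij = exp(-||z_i - z_j||^2 / sigma^2). Returns (beta, bias).
    """
    d2 = np.sum((Z - Z[i]) ** 2, axis=1)
    s = np.exp(-d2 / sigma ** 2)                    # Gaussian similarity weights
    A = np.hstack([Z, np.ones((len(Z), 1))])        # append a bias column
    sw = np.sqrt(s)[:, None]                        # weighted least squares via sqrt-weights
    coef, *_ = np.linalg.lstsq(A * sw, y * np.sqrt(s), rcond=None)
    return coef[:-1], coef[-1]
```

When the model being explained is globally linear, the local fit recovers the global coefficients regardless of the anchor point.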
5 Empirical Results
5.1 Evaluation Settings
5.1.1 Datasets
|Rand1 (artificial)||1,000||100||32||(2 class)|
|Connectionist¹||Dua:2019||60||208||64|
|Colposcopies¹||Dua:2019 ; fernandes_transfer_2017||62||287||64|
|Alizadeh¹||Dua:2019 ; alizadehsani_data_2013||57||303||64|
|Sports¹||Dua:2019 ; Hajj2018ASC||59||1,000||256|
|Ozone¹||Dua:2019||72||1,848||256|
|Theorem¹||Dua:2019 ; Bridge14||51||6,118||1,024|
|Epileptic¹||Dua:2019 ; Andrzejak02||178||11,500||2,048|
|Rand2 (artificial)||1,000 × 1,000||100||32||(2 class)|
|Rand3 (artificial)||1,000 × 1,000 × 1,000||100||32||(2 class)|

¹ Connectionist: Connectionist Bench (Sonar, Mines vs. Rocks); Colposcopies: Quality Assessment of Digital Colposcopies; Alizadeh: Z-Alizadeh Sani; Sports: Sports articles for objectivity analysis; Ozone: Ozone Level Detection; Theorem: First-order theorem proving; Epileptic: Epileptic Seizure Recognition.
Table 1 shows the summary of the datasets used in this study. All the vector datasets, except the artificial datasets, are downloaded from the website of the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) Dua:2019 . We obtain Genes from http://bonsai.hgc.jp/~shima/NetworkProfiler/ Shimamura_11 . We normalize all the variables of the vector datasets so that the average values are zero and the standard deviations are one.
5.1.2 Comparison Methods
We compare TRIP with methods that can be applied to the LTPNP problem, i.e., methods that linearly project data points into a low-dimensional subspace and predict scalar responses using prediction models with strong nonlinearity. As in TRIP, NNs are used as the prediction models. The compared methods for vector datasets use PCA and LDA for the linear projection; the methods combined with NNs are called PCANN and LDANN, respectively. For tensor datasets, we compare TRIP with a method that applies NNs to the tensors projected into the low-dimensional subspace by Tucker decomposition Kolda09 , i.e., to the core tensors. This method is called TuckerNN.
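A PCANN-style baseline can be assembled from off-the-shelf components. This sketch assumes scikit-learn is available, and the hyperparameters (two components, two hidden layers of 32 units) are illustrative, not the paper's settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# PCANN baseline: unsupervised linear projection followed by a nonlinear NN.
# The projection is fixed by PCA first, so (unlike TRIP) it is not influenced
# by the prediction loss.
pcann = make_pipeline(
    PCA(n_components=2),
    MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
)
```

Swapping `PCA` for `LinearDiscriminantAnalysis` yields the corresponding LDANN-style baseline.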
In all the experiments other than the evaluation using Spiral and the scalability tests, we select the parameters of each method based on the average accuracies over repeated trials of $k$-fold cross validation. For Spiral, we use the average test accuracies over repeated trials. The number of hidden layers of the NNs is chosen from zero (linear/softmax regression) upward, and the regularization parameters $\lambda$ for TRIP are chosen from a fixed candidate set. The number of epochs in SGD is determined for each dataset based on sufficient convergence. We use the Adam optimization algorithm, with a learning rate tuned per dataset (Genes uses a different rate from the others).
5.2 Evaluation using Artificial Data
5.2.1 Data Generation
We create artificial data Spiral in which essential decision boundaries are carefully hidden in high-dimensional space. First, we generate a two-dimensional distribution in which the data points of the two classes are distributed in a two-dimensional spiral, as shown in Fig. 1(a). Specifically, we randomly select
the angle parameter from a uniform distribution and calculate the two-dimensional coordinates of the spiral, with the samples of the second class mirrored against those of the first class. We also add samples to each class as noise samples, with two-dimensional coordinates selected from a uniform distribution. We normalize the standard deviation of the coordinates in each dimension of these samples to one. We also create the coordinates of one additional dimension as a major noise dimension from a normal distribution with the same standard deviation, which cannot be distinguished from the dimensions with the spiral distribution by PCA because of the same variance. In addition, we add the coordinates of further dimensions as minor noise dimensions from a normal distribution with a smaller standard deviation, which should cause model overfitting when the regularization by reconstruction errors does not work sufficiently (Fig. 1(b)). Finally, the spiral distribution is hidden in the high-dimensional space by transforming the data using a randomly generated rotation transform. As a result, we generate an artificial dataset of samples with high-dimensional coordinates. We generate two such sets of samples with independently selected coordinates, i.e., a training dataset and a test dataset, using the procedure described above with the same rotation transform.
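The generation procedure can be sketched as follows. All counts, radii, and noise scales here are illustrative placeholders, not the paper's actual values:

```python
import numpy as np

def make_spiral(n_per_class=500, noise_dims=30, seed=0):
    """Generate a Spiral-like dataset: a 2-D two-class spiral hidden in a
    higher-dimensional space by noise dimensions and a random rotation."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0, 4 * np.pi, size=2 * n_per_class)   # angle parameter
    r = t / (4 * np.pi)                                   # radius grows along the spiral
    x = np.stack([r * np.cos(t), r * np.sin(t)], axis=1)
    x[n_per_class:] *= -1                                 # second class: mirrored spiral
    y = np.repeat([0, 1], n_per_class)
    x = (x - x.mean(0)) / x.std(0)                        # unit std per informative dim
    # one major noise dim (same variance as the spiral dims) + minor noise dims
    major = rng.normal(0.0, 1.0, size=(2 * n_per_class, 1))
    minor = rng.normal(0.0, 0.1, size=(2 * n_per_class, noise_dims - 1))
    X = np.hstack([x, major, minor])
    # hide the spiral with a random rotation (orthogonal matrix via QR)
    Q, _ = np.linalg.qr(rng.normal(size=(2 + noise_dims, 2 + noise_dims)))
    return X @ Q, y
```

Because the rotation is orthogonal, the total variance is preserved, so no single original coordinate reveals the spiral.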
5.2.2 Accuracy Evaluation and Visualization
Because we know that the essential decision boundary is a two-dimensional spiral, we conduct the experiments using two projection vectors. The highest test accuracies are achieved with an NN with multiple hidden layers and an intermediate $\lambda$. Figure 2(a) shows the fitting accuracies on the training dataset and the test accuracies on the test dataset with this NN for several values of $\lambda$, along with the results of LDANN and PCANN. Overall, lower $\lambda$ causes overfitting, in which fitting accuracies are high whereas test accuracies are low. On the other hand, higher $\lambda$ balances the fitting accuracies and the test accuracies, which indicates that the reconstruction errors work as a regularizer. The result of LDANN is similar to those with low $\lambda$, and the result of PCANN is similar to those with high $\lambda$.

Figures 2(b)-(g) plot the projected data points of the training dataset with the same NN for several values of $\lambda$. For each setting, the plot shown is from the trial that reduces the objective function the most. We can recognize the spiral boundaries for intermediate values of $\lambda$ (Fig. 2(d)(e)). For smaller $\lambda$, the samples of the two classes are well separated (Fig. 2(b)(c)); however, the test accuracies are very low (Fig. 2(a)), suggesting that the models overfit the training dataset. For larger $\lambda$, the samples of the two classes are not clearly separated (Fig. 2(f)(g)), and the distributions are similar to that of PCANN (Fig. 1(e)); this suggests that the projection vectors are learned in an almost unsupervised manner. Thus, with an appropriate $\lambda$, the reconstruction error works properly as regularization, and TRIP extracts the subspace that captures the inherently important features for separation in high-dimensional data.
5.3 Evaluation by Real-World Data
5.3.1 Evaluation of Accuracies
To quantitatively show the effectiveness of TRIP, the areas under the curves (AUCs) calculated by $k$-fold cross validation for various two-class classification tasks are shown in Fig. 3. TRIP achieves higher accuracies than PCANN and LDANN on most of the datasets, especially when the data points are projected into very low dimensions. Also, datasets with many samples require NNs with many hidden layers (Fig. 3(g)(h)(i)), which indicates that the decision boundaries of these datasets contain relatively strong nonlinearities. These results suggest that TRIP achieves linear projection while maintaining high prediction accuracy in a low-dimensional subspace that is easy for humans to understand, even when the decision boundary is highly nonlinear.
5.3.2 Visualizing Subspace
We show how to understand the reason for a prediction using two multi-class classification datasets, Iris and Wine. We use a two-dimensional subspace to present the results in two-dimensional plots. Figure 4(a)(e) shows the accuracies of PCANN, LDANN, and TRIP using NNs with varying numbers of hidden layers. Though the best accuracies are almost the same among the methods, TRIP achieves high accuracies even with an NN without hidden layers. For each dataset, we choose the number of hidden layers and the value of $\lambda$ with which TRIP achieves the highest accuracies.
We show the softmax values of TRIP as contours for each class of Iris (Fig. 4(b)(c)(d)) and Wine (Fig. 4(f)(g)(h)). The two axes are rotated as described in Section 4. We also represent the LRC of each data point with an arrow. In Iris, the data points of Versicolor and Virginica are very close, and the LRC of most of these data points point in almost the same direction (Fig. 4(c)(d)). These LRC can be interpreted in the language of the original variables. Roughly speaking, the direction of the LRC means that Virginica has larger petals and smaller sepals than Versicolor. In Wine, it can be observed that the large LRC of the data points near the decision boundaries in Fig. 4(f) and Fig. 4(g) point in almost the same direction. In short, for a wine close to the decision boundary to be assigned to a neighboring class, at least the values of Color intensity, Proline, Alcohol, and Ash must be higher and the value of Hue must be lower. In addition, the value of Alcalinity of ash needs to be smaller for one neighboring class, whereas the value of Flavanoids must be smaller for the other. Thus, the LRC make it easy to understand the reason for a prediction in a linearly projected low-dimensional subspace.
5.3.3 Results on Higher-Order Dataset
As a higher-order dataset, we use Genes, in which each data point is a matrix, i.e., a second-order tensor. This dataset describes varying coefficients (VC) between target genes (TG) and regulator genes (RG) in cell lines Shimamura_11 . VC are computed based on the expression levels of genes, where the strength of the relationship between genes is allowed to vary with the epithelial-mesenchymal transition (EMT)-related modulator value (hereinafter called the EMT value), which describes cell lines as epithelial-like or mesenchymal-like. We set VC with small absolute values to zero, resulting in sparse data. We focus on the regression task of predicting the EMT values. We project each data point of a cell line to a small subspace with two axes for each mode. Figure 5(a) shows the root mean squared errors (RMSE) of TRIP and TuckerNN for predicting the EMT value using NNs with different numbers of hidden layers. TRIP shows much better accuracy than TuckerNN, even when using NNs with no or only one hidden layer. We choose the model with one hidden layer as the best setting. Figure 5(b)(c) plots the data points projected onto the two axes. As described in Section 4, the two axes of each mode are rotated, and the LRC of every data point, indicated by the arrows, are calculated. Note that the predicted values have a high Pearson's correlation coefficient with TG and RG on the first axis, but a low one on the second axis. Interestingly, there is a positive correlation between the values of the first and second axes for lower values of the first axis, and a negative correlation for higher values. Moreover, a mixture of LRC in two different directions is observed throughout the data. This suggests that there are at least two gene groups in both TG and RG, which interact in a complex manner.
We show the TG and RG that have the highest absolute scores on the two axes in Table 2. Note that we consider only absolute scores because the sign of the scores is unstable, since the projection is performed by the outer product of the projection vectors of each mode. We analyze the top genes with the bioinformatics tool DAVID (https://david.ncifcrf.gov/) huang_systematic_2009 , which provides functional annotation clustering of genes and their biological interpretation. The annotation summary results of Tissue expression (t_e) show a clear difference between the axes: Platelet, Thyroid, and Thymus are derived as annotations related to the first axis of TG, whereas Ovary and Eye are extracted for the second axis of TG. On the other hand, the results for RG include only Epithelium on the first axis, but Muscle, Pancreatic cancer, Colon cancer, Uterus, Colon adenocarcinoma, Colon tumor, and Peripheral blood lymphocyte on the second axis. These observations suggest that other biomechanisms are involved with EMT in various ways, independent of the relationships between TG and RG extracted on the first axis, which can directly predict the EMT value. We hope that these results will be analyzed more deeply by biological researchers, leading to the discovery of new biological mechanisms.
|Target #1||Target #2||Regulator #1||Regulator #2|
|0.083||SLC6A8||0.025||KIAA*¹||0.118||IKZF1||0.035||m451¹|
|0.072||MPP1||TM||0.024||TIGD1L||0.110||LYL1||0.034||SAP*¹||MS,UT|
|0.062||CHKA||0.022||LOC*¹||0.090||AIP||0.032||NAT8¹||CT|
|0.058||SASH1||TR||0.022||PRR11||EY||0.089||SOX10||0.032||m422a*¹|
|0.055||AFTPH||0.022||ERN2||0.088||m223¹||0.031||LMCD1|

¹ KIAA*: KIAA0894; LOC*: LOC100272216; m223: hsa-miR-223; m451: hsa-miR-451; SAP*: SAP30BP; NAT8: NAT8 /// NAT8B; m422a*: hsa-miR-422a*.
5.4 Evaluation of Scalability
To empirically demonstrate the scalability of TRIP, we create artificial datasets of first-, second-, and third-order tensors, i.e., Rand1, Rand2, and Rand3 in Table 1, respectively, by randomly selecting non-zero elements. We show only the results of the model using two hidden layers of the NN; we obtained almost the same results for the other settings (not shown). Figure 6(a) shows that the training time of TRIP increases linearly with the number of non-zero elements of the datasets. Figure 6(b) also shows that the training time increases linearly with the size of the projected tensor, where we change the size of one dimension of the projected tensor while fixing the sizes of the other dimensions; note that the number of non-zero elements is fixed.
6 Conclusion
We proposed a method, TRIP, for searching for a linear projection that maximizes prediction accuracy while retaining the original data variance as much as possible. Experiments using artificial data showed that TRIP correctly finds decision boundaries among many variables even with strong nonlinearity. Experiments with real-world data, including higher-order tensors, also showed that TRIP makes it possible to find various events behind the data using a human-readable low-dimensional subspace. Moreover, we empirically showed that TRIP scales linearly with the number of non-zero elements and with the size of the projected tensor.
- (1) Laurens van der Maaten and Geoffrey Hinton. Viualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
- (2) Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognit., 44(7):1357–1371, 2011.
- (3) Babak Hosseini and Barbara Hammer. Interpretable discriminative dimensionality reduction and feature selection on the manifold. CoRR, abs/1909.09218, 2019.
- (4) Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
- (5) Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
- (6) Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- (7) Rajarshi Guhaniyogi, Shaan Qamar, and David B. Dunson. Bayesian tensor regression. J. Mach. Learn. Res., 18:79:1–79:31, 2017.
- (8) Lifang He, Kun Chen, Wanwan Xu, Jiayu Zhou, and Fei Wang. Boosted sparse and low-rank tensor regression. In NeurIPS 2018, pages 1017–1026, 2018.
- (9) Masaaki Imaizumi and Kohei Hayashi. Doubly decomposing nonparametric tensor regression. In ICML, pages 727–736, 2016.
- (10) Rose Yu and Yan Liu. Learning from multiway data: Simple and efficient tensor regression. In ICML, pages 373–381, 2016.
- (11) Lifang He, Chun-Ta Lu, Guixiang Ma, Shen Wang, Linlin Shen, Philip S. Yu, and Ann B. Ragin. Kernelized support tensor machines. In ICML, pages 1442–1451, 2017.
- (12) Dacheng Tao, Xuelong Li, Xindong Wu, Weiming Hu, and Stephen J. Maybank. Supervised tensor learning. Knowl. Inf. Syst., 13(1):1–42, 2007.
- (13) I.T. Jolliffe and Springer-Verlag. Principal Component Analysis. Springer Series in Statistics. Springer, 2002.
- (14) Aleix M. Martínez and Avinash C. Kak. PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell., 23(2):228–233, 2001.
- (15) Roman Rosipal and Nicole Krämer. Overview and recent advances in partial least squares. In Subspace, Latent Structure and Feature Selection, pages 34–51. Springer, 2006.
- (16) Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
- (17) Amina Adadi and Mohammed Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138–52160, 2018.
- (18) Zachary C. Lipton. The mythos of model interpretability. Commun. ACM, 61(10):36–43, 2018.
- (19) Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In KDD, pages 1135–1144, 2016.
- (20) Koji Maruhashi, Masaru Todoriki, Takuya Ohwa, Keisuke Goto, Yu Hasegawa, Hiroya Inakoshi, and Hirokazu Anai. Learning multi-way relations via tensor decomposition with neural networks. In AAAI, pages 3770–3777, 2018.
- (21) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
- (22) Kelwin Fernandes, Jaime S. Cardoso, and Jessica Fernandes. Transfer learning with partial observability applied to cervical cancer screening. In Iberian Conference on Pattern Recognition and Image Analysis, pages 243–250. Springer, 2017.
- (23) Roohallah Alizadehsani, Jafar Habibi, Mohammad Javad Hosseini, Hoda Mashayekhi, Reihane Boghrati, Asma Ghandeharioun, Behdad Bahadorian, and Zahra Alizadeh Sani. A data mining approach for diagnosis of coronary artery disease. Computer Methods and Programs in Biomedicine, 111(1):52 – 61, 2013.
- (24) Nadine Hajj, Yara Rizk, and Mariette Awad. A subjectivity classification framework for sports articles using improved cortical algorithms. Neural Computing and Applications, 31:8069–8085, 2018.
- (25) James Bridge, Sean Holden, and Lawrence Paulson. Machine learning for first-order theorem proving. Journal of Automated Reasoning, 53, 2014.
- (26) Ralph Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian Elger. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical review. E, Statistical, nonlinear, and soft matter physics, 64:061907, 2002.
- (27) Teppei Shimamura, Seiya Imoto, Yukako Shimada, Yasuyuki Hosono, Atsushi Niida, Masao Nagasaki, Rui Yamaguchi, Takashi Takahashi, and Satoru Miyano. A novel network profiling analysis reveals system changes in epithelial-mesenchymal transition. PloS one, 6:e20804, 2011.
- (28) Da Wei Huang, Brad T. Sherman, and Richard A. Lempicki. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols, 4(1):44–57, 2009.
Appendix A Calculating Gradient in Orthonormal Conditions
The former equation of Eqs. 11, rewritten using Eq. 5, can be regarded as an SVD in which the diagonal components are the singular values and the column vectors are the left singular vectors; thus the required decomposition can be obtained. By differentiating both sides of both equations of Eqs. 11 with respect to each component of $\mathbf{V}^{(m)}$, we can obtain the derivatives of the factors of the decomposition. The Kronecker delta that appears is $1$ for the matching component and $0$ otherwise. Thus, we can derive Eq. 7.