1 Introduction
Dimensionality reduction is effective for learning from high-dimensional data. One important effect is avoiding overfitting by revealing a few essential parameters. Another is providing a better understanding of decision boundaries in a human-readable low-dimensional subspace: we can understand the reason for a prediction by analyzing the positional relationship between data points and decision boundaries. Several dimensionality-reduction methods, such as t-SNE Maaten08 , kernel PCA Barshan11 ; Hosseini19 ; Scholkopf98 , and Isomap tenenbaum_global_2000 , nonlinearly reduce dimensionality; however, the relationship between the original variables and the projections is not straightforward and is difficult to understand in many cases. On the other hand, it may be difficult for variable-selection methods, such as the celebrated LASSO approach tibshirani96 , to explain phenomena such as biological mechanisms that should be described as interactions between sets of variables that correlate only slightly with each other. Linear methods without variable selection, such as principal component analysis (PCA) and linear discriminant analysis (LDA), are desirable from the interpretation point of view because we can easily understand the correlation between many variables in the projected low-dimensional subspace. We can apply any prediction model to the projected data points; however, there is no guarantee that the features important for prediction are kept in the projected subspace. To highlight these issues, we give a vivid example using artificial data of a two-class classification task with 100-dimensional variables. These data are created by embedding a two-dimensional spiral distribution (Fig. 1(a)) in a 100-dimensional space by adding noise dimensions (Fig. 1(b)) and performing a rotation transformation (see details in Section 5). Figure 1(c) shows the distribution of the data points projected to a two-dimensional subspace using LDA, a supervised method.
However, LDA overfits the noise, and the spiral distribution cannot be observed at all. By contrast, PCA does not seem to overfit the noise; however, it is difficult to recognize the spiral distribution (Fig. 1(e)). Thus, it is difficult for linear projection methods to deal with strong nonlinearities in high-dimensional space. In this paper, we propose a learning method that searches for a subspace maximizing the prediction accuracy while retaining as much of the original data variance as possible. We optimize the projection matrix
to minimize the regularized-by-reconstruction loss

$$\sum_{i=1}^{N} \left[ \ell\!\left(f(\mathbf{V}^\top \mathbf{x}_i; \theta), y_i\right) + \lambda \left\| \mathbf{x}_i - \mathbf{V}\mathbf{V}^\top \mathbf{x}_i \right\|^2 \right] \quad \text{subject to } \mathbf{V}^\top \mathbf{V} = \mathbf{I}, \tag{1}$$

where $\mathbf{x}_i$ is the $i$-th data point, $\ell$ is a loss of the prediction model $f$ that predicts a scalar response $y_i$ in the subspace, $\theta$ are the parameters of the prediction model, $\mathbf{I}$ is an identity matrix, and $\lambda$ is the regularization parameter. This strategy meets two requirements simultaneously. First, the subspace keeps the variance of the data points as large as possible by minimizing the reconstruction errors $\|\mathbf{x}_i - \mathbf{V}\mathbf{V}^\top \mathbf{x}_i\|^2$, the same as with PCA. Second, $y_i$ can be predicted as accurately as possible by minimizing the prediction loss, even if the prediction model requires strong nonlinearity. We can successfully find the spiral distribution, as shown in Fig. 1(d), by calculating an optimal $\mathbf{V}$ satisfying Eq. 1.

The issues described above also apply to tensor data. Several tensor-regression methods have been proposed for data represented by matrices or tensors that describe the relationships between variables guhaniyogi_bayesian_2017 ; he_boosted_2018 ; imaizumi_doubly_2016 ; yu_learning_2016 . However, these methods cannot cope with strong nonlinearities. On the other hand, several methods using the kernel method he_kernelized_2017 ; Tao07 lose the correlation between many variables of the original data. We extend Eq. 1 to tensor expressions so that it can be applied to tensor data.
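As a concrete illustration, the vector-case objective of Eq. 1 can be sketched in PyTorch as follows. The network sizes, learning rate, λ value, and the QR-based re-orthonormalization step are all assumptions made for this toy example; the paper's actual training procedure (Section 4) instead parameterizes the projection through a latent matrix and an SVD.

```python
import torch

def trip_loss(V, model, X, y, lam):
    # Project data into the subspace: Z = X V  (X: N x D, V: D x R)
    Z = X @ V
    # Prediction loss (binary classification used here for the demo)
    pred = model(Z).squeeze(-1)
    pred_loss = torch.nn.functional.binary_cross_entropy_with_logits(pred, y)
    # Reconstruction error ||x - V V^T x||^2, averaged over samples
    X_rec = Z @ V.T
    rec_loss = ((X - X_rec) ** 2).sum(dim=1).mean()
    return pred_loss + lam * rec_loss

# Toy usage: D = 10 input variables, R = 2 subspace, a small nonlinear predictor
torch.manual_seed(0)
N, D, R = 64, 10, 2
X = torch.randn(N, D)
y = (X[:, 0] * X[:, 1] > 0).float()        # a nonlinear label for the demo
V = torch.linalg.qr(torch.randn(D, R)).Q    # start from an orthonormal V
V.requires_grad_(True)
model = torch.nn.Sequential(torch.nn.Linear(R, 10), torch.nn.ReLU(),
                            torch.nn.Linear(10, 1))
opt = torch.optim.Adam([V] + list(model.parameters()), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = trip_loss(V, model, X, y, lam=0.1)
    loss.backward()
    opt.step()
    with torch.no_grad():                   # re-orthonormalize V after each step
        V.copy_(torch.linalg.qr(V).Q)
```

Jointly optimizing the projection and the predictor is what distinguishes this objective from running PCA first and fitting a model afterwards.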
To the best of our knowledge, this is the first study to propose a method that allows searching for a linear projection without variable selection that maintains the maximum data variation and allows high-accuracy prediction even with strong nonlinearity, including for matrix and tensor data. We report the results of evaluating the proposed method using artificial data and real-world data, including higher-order data. We empirically show that our method can classify high-dimensional data with high accuracy while retaining high interpretability.
2 Related Works
PCA is the most well-known unsupervised dimensionality-reduction method; it represents a high-dimensional space by a few orthogonal variables that retain a large portion of the data variance jolliffe_principal_2002 . Several methods have been proposed to find a low-dimensional subspace in a supervised manner, such as LDA and partial least squares (PLS) Martinez01 ; Rosipal06 . However, these methods strongly assume that the samples in each class follow a normal distribution, and they cannot be applied to datasets with strong nonlinearity such as that shown in Fig. 1(a). Several dimensionality-reduction methods, such as t-SNE Maaten08 , kernel PCA Barshan11 ; Hosseini19 ; Scholkopf98 , and Isomap tenenbaum_global_2000 , nonlinearly reduce dimensionality; however, the relationship between the original variables and the projections is not straightforward and is difficult to understand in many cases.

Tucker decomposition Kolda09 is a well-known unsupervised tensor-decomposition method. Several studies on tensor regression tasks have been reported guhaniyogi_bayesian_2017 ; he_boosted_2018 ; imaizumi_doubly_2016 ; yu_learning_2016 . Most of these methods reduce the number of parameters by expressing the partial regression coefficients in a low rank and can be regarded as multilinear projections into a lower-dimensional tensor subspace. However, these methods cannot cope with the nonlinearity inherent in the data. Kernel methods have been applied to deal with nonlinearities he_kernelized_2017 ; Tao07 ; however, they lose the correlation between many variables.
Prediction interpretation and justification is one of the most important topics in recent machine-learning research Adadi18 ; Lipton18 . One effective approach is to produce interpretable models. Ribeiro et al. introduced a model-agnostic approach that provides explanations for black-box models by learning interpretable models Ribeiro16 . Our method allows us to better understand the interpretable model by visualizing it along with the decision boundaries in a human-readable low-dimensional subspace.

3 Preliminaries
3.1 Notations
Scalars are denoted as lowercase letters ($a$), vectors as boldface lowercase letters ($\mathbf{a}$), matrices as boldface capital letters ($\mathbf{A}$), and tensors as boldface Euler script letters ($\boldsymbol{\mathscr{X}}$). The transpose of $\mathbf{A}$ is $\mathbf{A}^\top$. $\mathbf{I}$ is an identity matrix, and $\mathbf{0}$, $\mathbf{1}$ are square matrices in which all values are $0$, $1$. The square root of the sum of the squared values in a tensor (the Frobenius norm) is denoted as $\|\boldsymbol{\mathscr{X}}\|$. We refer to the dimensionality of a tensor as its order and to each dimension as a mode. The inner product is denoted as $\langle \cdot, \cdot \rangle$. The outer product of vectors is denoted as $\mathbf{a}^{(1)} \circ \cdots \circ \mathbf{a}^{(M)}$, where the $(i_1,\ldots,i_M)$th element is $a^{(1)}_{i_1} \cdots a^{(M)}_{i_M}$. Element-wise multiplication and division are denoted as $\circledast$ and $\oslash$, respectively. The Khatri-Rao product between $\mathbf{A}$ of size $I \times K$ and $\mathbf{B}$ of size $J \times K$ is a matrix $\mathbf{A} \odot \mathbf{B}$ of size $IJ \times K$ whose $k$th column is the Kronecker product $\mathbf{a}_k \otimes \mathbf{b}_k$. The vectorization of a tensor creates a vector $\mathrm{vec}(\boldsymbol{\mathscr{X}})$. The mode-$n$ matricization of a tensor of size $I_1 \times \cdots \times I_N$ creates a matrix $\mathbf{X}_{(n)}$ of size $I_n \times \prod_{m \neq n} I_m$ whose columns are the mode-$n$ fibers of the tensor. A mode-$n$ product $\boldsymbol{\mathscr{X}} \times_n \mathbf{M}$ with $\mathbf{M}$ of size $J \times I_n$ is a tensor of size $I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N$, where $(\boldsymbol{\mathscr{X}} \times_n \mathbf{M})_{i_1 \cdots j \cdots i_N} = \sum_{i_n} x_{i_1 \cdots i_n \cdots i_N}\, m_{j i_n}$. In case $\mathbf{M}$ is a vector of length $I_n$, we remove the $n$th mode and resize accordingly. We abbreviate the sequence of mode products over all modes, $\boldsymbol{\mathscr{X}} \times_1 \mathbf{V}^{(1)} \cdots \times_M \mathbf{V}^{(M)}$, as $\boldsymbol{\mathscr{X}} \times \{\mathbf{V}\}$. A vector $\partial L / \partial \mathbf{u}$ and a matrix $\partial L / \partial \mathbf{U}$ are the differentials of $L$ over the elements of $\mathbf{u}$ and $\mathbf{U}$, respectively.

3.2 Problem Definition
As mentioned in Section 1, we extend our problem to a tensor expression. Among the several possible ways to project a tensor $\boldsymbol{\mathscr{X}}$ of size $I_1 \times \cdots \times I_M$ to a tensor $\boldsymbol{\mathscr{Z}}$ of size $R_1 \times \cdots \times R_M$, we propose projecting by the mode product of each mode using a projection matrix per mode, from the interpretation point of view. Thus, our problem can be defined as follows:
Problem 1 (Linear Tensor Projection for Nonlinear Prediction (LTPNP))
Given a set of training samples $\{(\boldsymbol{\mathscr{X}}_i, y_i)\}$, where $\boldsymbol{\mathscr{X}}_i$ is an $M$th-order tensor of size $I_1 \times \cdots \times I_M$ and $y_i$ is a scalar response, how can we learn the projection matrices $\mathbf{V}^{(1)}, \ldots, \mathbf{V}^{(M)}$ to project $\boldsymbol{\mathscr{X}}_i$ into $\boldsymbol{\mathscr{Z}}_i$, along with a prediction model with strong nonlinearity, parameterized by $\theta$, that can accurately predict the response based on the projected tensor $\boldsymbol{\mathscr{Z}}_i$?
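The per-mode projection used in Problem 1 can be sketched in numpy; the helper names `unfold` and `mode_product` are illustrative, not from the paper.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization: arrange the mode-n fibers of X as columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, M, mode):
    """Mode-n product X x_n M: multiply matrix M into the n-th mode of X."""
    Xn = unfold(X, mode)                       # (I_n, product of other dims)
    rest = [s for i, s in enumerate(X.shape) if i != mode]
    out = (M @ Xn).reshape([M.shape[0]] + rest)
    return np.moveaxis(out, 0, mode)           # put the new mode back in place

# Project mode 1 of a 2 x 3 x 4 tensor from size 3 to size 5
X = np.arange(24, dtype=float).reshape(2, 3, 4)
V = np.random.default_rng(0).standard_normal((5, 3))
Y = mode_product(X, V, 1)
print(Y.shape)   # (2, 5, 4)
```

Projecting a data tensor into the subspace then amounts to applying `mode_product` once per mode with the (transposed) projection matrices.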
4 Proposed Method
We propose a practical method of solving the LTPNP problem. We will refer to this method as Tensor Reconstruction based Interpretable Prediction (TRIP).
4.1 Prediction
As we explain later, we propose learning the projection matrices
and the prediction model by using stochastic gradient descent (SGD). Typical prediction models that can be trained by using SGD include neural networks (NNs) and linear/softmax regression. Such a prediction model can be written as
(2) 
where $\boldsymbol{\mathscr{W}}$ is a weight tensor of size $R_1 \times \cdots \times R_M$, $\tilde{f}$ is the prediction model obtained by removing $\boldsymbol{\mathscr{W}}$ from $f$, and $\tilde{\theta}$ are the remaining parameters other than $\boldsymbol{\mathscr{W}}$. If the prediction model is linear/softmax regression, $\boldsymbol{\mathscr{W}}$ contains the partial regression coefficients. If the prediction model is an NN, $\boldsymbol{\mathscr{W}}$ contains the parameters between the projected tensor and the first layer. In either case, the number of elements of $\boldsymbol{\mathscr{W}}$ grows multiplicatively with the number of modes, so overfitting easily occurs. As a solution to this problem, we set $\boldsymbol{\mathscr{W}}$ to be a low-rank tensor as

$$\boldsymbol{\mathscr{W}} = \sum_{r=1}^{R} \mathbf{w}_r^{(1)} \circ \cdots \circ \mathbf{w}_r^{(M)} \tag{3}$$

by using vectors $\mathbf{w}_r^{(m)}$, so that the number of parameters can be reduced.
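The low-rank (CP-style) parameterization of Eq. 3 can be sketched as follows; the sizes and rank are assumptions for illustration.

```python
import numpy as np

def low_rank_weight(factors):
    """Build W = sum_r w_r^(1) o ... o w_r^(M) from per-mode factor matrices.
    factors[m] has shape (R_m, K): one length-R_m vector per rank-1 component."""
    K = factors[0].shape[1]
    W = 0
    for r in range(K):
        comp = factors[0][:, r]
        for f in factors[1:]:
            comp = np.multiply.outer(comp, f[:, r])   # chain of outer products
        W = W + comp
    return W

# A 4 x 5 x 6 weight tensor expressed with only K = 2 rank-1 components:
rng = np.random.default_rng(0)
factors = [rng.standard_normal((d, 2)) for d in (4, 5, 6)]
W = low_rank_weight(factors)
print(W.shape)   # (4, 5, 6)
```

Here the full tensor would need 4 × 5 × 6 = 120 parameters, whereas the factors need only (4 + 5 + 6) × 2 = 30, which is the point of the low-rank form.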
4.2 Training
We show a method of learning $\mathbf{V}^{(m)}$, $\mathbf{w}_r^{(m)}$, and $\tilde{\theta}$ using SGD. We extend the regularized-by-reconstruction loss (Eq. 1) to a tensor expression as

$$\sum_{i=1}^{N} \left[ \ell\!\left(f(\boldsymbol{\mathscr{X}}_i \times_1 \mathbf{V}^{(1)\top} \cdots \times_M \mathbf{V}^{(M)\top}; \theta), y_i\right) + \lambda \left\| \boldsymbol{\mathscr{X}}_i - \boldsymbol{\mathscr{X}}_i \times_1 \mathbf{V}^{(1)}\mathbf{V}^{(1)\top} \cdots \times_M \mathbf{V}^{(M)}\mathbf{V}^{(M)\top} \right\|^2 \right], \tag{4}$$

where $\lambda$ is the regularization parameter. It is necessary to satisfy the orthonormal condition $\mathbf{V}^{(m)\top}\mathbf{V}^{(m)} = \mathbf{I}$. We adopt an idea similar to that of Maruhashi18 : we use matrices $\mathbf{U}^{(m)}$ of the same size as $\mathbf{V}^{(m)}$ as latent variables and calculate $\mathbf{V}^{(m)}$ as a matrix that satisfies the orthonormal conditions derived from $\mathbf{U}^{(m)}$. That is, the singular value decomposition (SVD)

$$\mathbf{U}^{(m)} = \mathbf{A}^{(m)} \boldsymbol{\Sigma}^{(m)} \mathbf{B}^{(m)\top} \tag{5}$$

is calculated and we set

$$\mathbf{V}^{(m)} = \mathbf{A}^{(m)} \mathbf{B}^{(m)\top}, \tag{6}$$

where $\mathbf{A}^{(m)}$ and $\mathbf{B}^{(m)}$ are matrices having the left and right singular vectors as column vectors, and $\boldsymbol{\Sigma}^{(m)}$ is a diagonal matrix having the singular values as diagonal components. Our idea is to calculate the derivative of each element of $\mathbf{V}^{(m)}$ with respect to each element of $\mathbf{U}^{(m)}$, and to update $\mathbf{V}^{(m)}$ by updating $\mathbf{U}^{(m)}$. As a result, we can calculate
(7) 
by using
(8)  
where $\mathbf{A}^{(m)}$, $\boldsymbol{\Sigma}^{(m)}$, and $\mathbf{B}^{(m)}$ are obtained by an SVD in the same manner as Eq. 5. See Appendix A, in which we show how to derive Eqs. 7 and 8. Here, the gradient of the loss with respect to $\mathbf{V}^{(m)}$ can be obtained by differentiating Eq. 4, resulting in
(9) 
where the gradient of the prediction loss is calculated using methods specific to the prediction model, such as the backpropagation algorithm in NNs, by which the gradients with respect to $\mathbf{w}_r^{(m)}$ and $\tilde{\theta}$ can also be calculated. Moreover, the gradient of the reconstruction term can be calculated as
(10) 
The prediction and training algorithms are shown in Algorithm 1.
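The orthonormalization of Eqs. 5 and 6 can be sketched as follows. This is a minimal numpy illustration; the analytic gradient through the SVD (Eqs. 7 and 8) is omitted, and in practice an autodiff framework can backpropagate through the SVD automatically.

```python
import numpy as np

def orthonormalize(U):
    """Map a latent matrix U to an orthonormal-column matrix V = A B^T,
    where U = A diag(S) B^T is the thin SVD (Eqs. 5-6)."""
    A, S, Bt = np.linalg.svd(U, full_matrices=False)
    return A @ Bt

rng = np.random.default_rng(0)
U = rng.standard_normal((10, 2))   # latent variables, same size as V
V = orthonormalize(U)
print(np.allclose(V.T @ V, np.eye(2)))   # True: V^T V = I
```

Because V depends smoothly on U (away from degenerate singular values), the unconstrained latent matrix U can be updated freely by SGD while V always satisfies the orthonormal constraint.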
4.3 Computational Complexity
The computation mainly consists of two parts: the projection and the propagation in an NN. The calculation of the projection scales linearly with the total number of nonzero elements in the dataset and with the size of the projected tensor. The complexity of forward and backpropagation in an NN is linear in time and space in the number of edges of the NN; the number of edges between the projected tensor and the first layer equals the size of the weight tensor, which the low-rank form of Eq. 3 reduces further. Hence, the overall time and space complexity is linear in the number of nonzero elements, the size of the projected tensor, and the number of edges of the NN.
4.4 Understanding Subspace
4.4.1 Extraction of Independent Components
Even if the projection matrices $\mathbf{V}^{(m)}$ and the weight vectors $\mathbf{w}_r^{(m)}$ are changed by a rotation transform, the value of Eq. 4 does not change. We propose to determine the rotation so that the rotated weights can be understood as the coefficients of independent components after regression in the modes other than the $m$th mode. When the prediction model is a linear model, the rotation can be calculated so that the resulting components are independent of each other; that is, the column vectors of the rotation matrix are the left singular vectors of the matrix having the normalized coefficient vectors as its column vectors. If the prediction model is a nonlinear model such as an NN, we can use the same strategy as for the linear model by learning a linear surrogate model, with regression coefficients and a bias, whose parameters are calculated so that the surrogate produces results similar to those of the nonlinear model by minimizing the squared difference between their outputs. We can use the alternating least squares (ALS) method Kolda09 , with which the coefficients of each mode are alternately calculated while fixing the coefficients of the other modes.
4.4.2 Local Regression Coefficients
It might be difficult to understand a strongly nonlinear decision boundary using a single linear surrogate model. We propose to learn a local linear surrogate model, which applies only to the data points around the $i$th point, with local regression coefficients (LRC) and a bias. The parameters are calculated by minimizing a similarity-weighted squared error, where the weight is a similarity between the $i$th and $j$th projected points controlled by a bandwidth parameter. This is almost the same as the local interpretable model proposed by Ribeiro et al. Ribeiro16 , except that we apply it to the projected tensor data. We can calculate the parameters using the ALS method, the same as for the global linear surrogate model. The LRC can be observed using the coefficients calculated under the local linear surrogate model; for each mode, the LRC should be projected in the modes other than that mode. We can also map the LRC back to the language of the original variables.
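The local surrogate fit can be sketched as a similarity-weighted least squares problem. This is illustrative only: the Gaussian kernel and the bandwidth `sigma` are assumptions, and the paper solves the tensor version by ALS rather than one joint solve.

```python
import numpy as np

def local_regression_coefficients(Z, preds, z0, sigma=1.0):
    """Fit a local linear model around z0 by minimizing
    sum_j k(z_j, z0) * (preds_j - beta^T z_j - b)^2."""
    w = np.exp(-np.sum((Z - z0) ** 2, axis=1) / (2 * sigma ** 2))  # similarities
    Z1 = np.hstack([Z, np.ones((Z.shape[0], 1))])                  # bias column
    # Weighted least squares: solve (Z1^T W Z1) coef = Z1^T W preds
    A = Z1.T * w
    coef = np.linalg.solve(A @ Z1, A @ preds)
    return coef[:-1], coef[-1]   # local coefficients (LRC) and bias

rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 2))           # projected data points
preds = np.sin(Z[:, 0]) + Z[:, 1]           # stand-in for model outputs
beta, b = local_regression_coefficients(Z, preds, Z[0], sigma=0.5)
```

The coefficient vector `beta` plays the role of the arrow plotted for each data point: it is the locally linear direction in which the prediction increases fastest around that point.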
5 Empirical Results
5.1 Evaluation Settings
5.1.1 Datasets
| order | dataset | cite | size | #samples |   | task |
|---|---|---|---|---|---|---|
| 1 | Spiral (artificial) |  | 100 | 220 | 20 | classification (2 class) |
| 1 | Rand1 (artificial) |  | 1,000 | 100 | 32 | classification (2 class) |
| 1 | Connectionist¹ | Dua:2019 | 60 | 208 | 64 | classification (2 class) |
| 1 | Colposcopies¹ | Dua:2019 ; fernandes_transfer_2017 | 62 | 287 | 64 | classification (2 class) |
| 1 | Alizadeh¹ | Dua:2019 ; alizadehsani_data_2013 | 57 | 303 | 64 | classification (2 class) |
| 1 | Sports¹ | Dua:2019 ; Hajj2018ASC | 59 | 1,000 | 256 | classification (2 class) |
| 1 | SECOM | Dua:2019 | 590 | 1,567 | 256 | classification (2 class) |
| 1 | Ozone¹ | Dua:2019 | 72 | 1,848 | 256 | classification (2 class) |
| 1 | Spambase | Dua:2019 | 57 | 4,601 | 1,024 | classification (2 class) |
| 1 | Theorem¹ | Dua:2019 ; Bridge14 | 51 | 6,118 | 1,024 | classification (2 class) |
| 1 | Epileptic¹ | Dua:2019 ; Andrzejak02 | 178 | 11,500 | 2,048 | classification (2 class) |
| 1 | Iris | Dua:2019 | 4 | 150 | 32 | classification (3 class) |
| 1 | Wine | Dua:2019 | 13 | 178 | 32 | classification (3 class) |
| 2 | Genes | Shimamura_11 | 13,508 × 1,732 | 762 | 128 | regression |
| 2 | Rand2 (artificial) |  | 1,000 × 1,000 | 100 | 32 | classification (2 class) |
| 3 | Rand3 (artificial) |  | 1,000 × 1,000 × 1,000 | 100 | 32 | classification (2 class) |

¹ Connectionist: Connectionist Bench (Sonar, Mines vs. Rocks); Colposcopies: Quality Assessment of Digital Colposcopies; Alizadeh: Z-Alizadeh Sani; Sports: Sports articles for objectivity analysis; Ozone: Ozone Level Detection; Theorem: First-order theorem proving; Epileptic: Epileptic Seizure Recognition.
Table 1 shows the summary of the datasets used in this study. All the vector datasets, except the artificial ones, are downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) Dua:2019 . We obtain Genes from http://bonsai.hgc.jp/~shima/NetworkProfiler/ Shimamura_11 . We normalize all the variables of the vector datasets by making the average values zero and the standard deviations one.
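This per-variable standardization can be written as a one-liner (a minimal numpy sketch):

```python
import numpy as np

def standardize(X):
    """Normalize each variable (column) to zero mean and unit std."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 60.0]])
Xn = standardize(X)
```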
5.1.2 Comparison Methods
We compare TRIP with methods that can be applied to the LTPNP problem, i.e., methods that linearly project data points into a low-dimensional subspace and predict scalar responses using prediction models with strong nonlinearity. As in TRIP, NNs are used as the prediction models. The compared methods for vector datasets use PCA and LDA for linear projection; the methods combined with NNs are called PCA-NN and LDA-NN, respectively. For tensor datasets, we compare TRIP with a method of applying NNs to the tensors projected to the low-dimensional subspace by Tucker decomposition Kolda09 , i.e., to the core tensors. This method is called Tucker-NN.
In all the experiments other than the evaluation using Spiral and the scalability tests, we select the parameters of each method based on the average accuracies over repeated trials of cross-validation. For Spiral, we use the average test accuracies over repeated trials. The number of hidden layers of the NNs is chosen from zero (linear/softmax regression) upward, with 10 neurons in each layer. We use ReLU as the activation function, softmax cross-entropy loss for classification, and squared error loss for regression. The best λ for TRIP is chosen in the same manner. The number of epochs in SGD is determined for each dataset based on sufficient convergence. We use the Adam optimization algorithm, with a smaller learning rate for Genes.

TRIP, PCA-NN, LDA-NN, and Tucker-NN are implemented in Python 3.7 using PyTorch 1.4 (https://pytorch.org/). All the experiments are conducted on an Intel(R) Xeon(R) Gold 6130 CPU 2.10GHz with 32GB of memory, running Linux.

5.2 Evaluation using Artificial Data
5.2.1 Data Generation
We create the artificial dataset Spiral, in which essential decision boundaries are carefully hidden in a high-dimensional space. First, we generate a two-dimensional distribution in which the data points of the two classes are distributed in a two-dimensional spiral, as shown in Fig. 1(a). Specifically, we randomly select a parameter from a uniform distribution and calculate the two-dimensional coordinates along a spiral for the samples of the first class and along the opposing spiral for the samples of the second class. We also add noise samples to each class, with two-dimensional coordinates selected from a uniform distribution. We normalize the coordinates so that each of these two dimensions has the same standard deviation. We then create the coordinates of one additional dimension as a major noise dimension from a normal distribution with that same standard deviation, which cannot be distinguished from the spiral dimensions by PCA because of the equal variance. In addition, we add the coordinates of the remaining dimensions as minor noise dimensions from a normal distribution with a smaller standard deviation, which should cause model overfitting when the regularization by the reconstruction errors does not work sufficiently (Fig. 1(b)). Finally, the spiral distribution is hidden in the 100-dimensional space by transforming the data using a randomly generated rotation transform. As a result, we generate an artificial dataset containing 220 samples with 100-dimensional coordinates. We generate two such sets of samples, i.e., a training dataset and a test dataset, by selecting coordinates independently using the procedure described above with the same rotation transform.

5.2.2 Accuracy Evaluation and Visualization
Because we know that the essential decision boundaries lie in a two-dimensional distribution with spiral shape, we conduct the experiments using two projection vectors. The highest test accuracies are achieved with a multi-hidden-layer NN. Figure 2(a) shows the fitting accuracies on the training dataset and the test accuracies on the test dataset for several values of λ, along with the results of LDA-NN and PCA-NN. Overall, lower λ causes overfitting, in which fitting accuracies are high whereas test accuracies are low. On the other hand, higher λ can balance the fitting accuracies and the test accuracies, which indicates that the reconstruction errors work as a regularizer. The result of LDA-NN is similar to those of low λ, and the result of PCA-NN is similar to those of high λ.
Figure 2(b)-(g) plots the projected data points of the training dataset for several values of λ. All the plots are chosen from repeated trials as those that reduce the objective function the most. We can recognize the spiral boundaries for intermediate values of λ (Fig. 2(d),(e)). For smaller λ, the samples of the two classes are well separated (Fig. 2(b),(c)); however, the test accuracies are very low (Fig. 2(a)), suggesting that the models overfit the training dataset. For larger λ, the samples of the two classes are not clearly separated (Fig. 2(f),(g)), and the distributions are similar to that of PCA-NN (Fig. 1(e)), suggesting that the projection vectors are learned in an almost unsupervised manner. Thus, with an appropriate λ, the reconstruction error works properly as regularization, and TRIP extracts the subspace that captures the inherently important features for separation in high-dimensional data.
5.3 Evaluation by RealWorld Data
5.3.1 Evaluation of Accuracies
To quantitatively show the effectiveness of TRIP, the areas under the curves (AUCs) calculated by cross-validation for the two-class classification tasks are shown in Fig. 3. TRIP achieves higher accuracies than PCA-NN and LDA-NN on most of the datasets, especially when the data points are projected into very low-dimensional subspaces. Also, datasets with many samples require NNs with many hidden layers (Fig. 3(g),(h),(i)), indicating that the decision boundaries of these datasets contain relatively strong nonlinearities. These results suggest that TRIP maintains high prediction accuracy under linear projection into a low-dimensional subspace that is easy for humans to understand, even when the decision boundary is highly nonlinear.
5.3.2 Visualizing Subspace
We show how to understand the reason for a prediction using two three-class classification datasets, Iris and Wine. We use a two-dimensional subspace so that the results can be understood in two-dimensional plots. Figure 4(a),(e) shows the accuracies of PCA-NN, LDA-NN, and TRIP using NNs with different numbers of hidden layers. Though the best accuracies are almost the same among the methods, TRIP achieves high accuracies even with no hidden layers in the NN. We choose the number of hidden layers and λ for each dataset as those by which TRIP achieves the highest accuracies.
We show the softmax values of TRIP as contours for each class of Iris (Fig. 4(b),(c),(d)) and Wine (Fig. 4(f),(g),(h)). The two axes are rotated as described in Section 4. We also represent the LRC of each data point with an arrow. In Iris, the data points of Versicolor and Virginica are very close, and the LRC of most of these data points have almost the same direction (Fig. 4(c),(d)). These LRC can be interpreted in the language of the original variables; roughly speaking, the direction of the LRC means that Virginica has larger petals and smaller sepals than Versicolor. In Wine, it can be observed that the large LRC of data points near the decision boundary separating one pair of classes (Fig. 4(f)) or another pair (Fig. 4(g)) have almost the same direction. In short, for a wine close to the decision boundary to move to a neighboring class, at least the values of Color intensity, Proline, Alcohol, and Ash must be higher and the value of Hue must be lower; in addition, the value of Alcalinity of ash needs to be smaller for one of the neighboring classes, whereas the value of Flavanoids must be smaller for the other. Thus, LRC makes it easy to understand the reason for a prediction in a linearly projected low-dimensional subspace.
5.3.3 Results on HigherOrder Dataset
As a higher-order dataset, we use Genes, in which each data point is a matrix, i.e., a second-order tensor. This dataset describes varying coefficients (VC) between target genes (TG) and regulator genes (RG) in 762 cell lines Shimamura_11 . VC are computed based on the expression levels of genes, where the strength of the relationship between genes is allowed to vary with an epithelial-mesenchymal transition (EMT)-related modulator value (hereinafter called the EMT value), which describes cell lines as epithelial-like or mesenchymal-like. We set VC with small absolute values to zero, yielding sparse data points. We focus on the regression task of predicting the EMT values. We project each data point of a cell line to a subspace of size 2 × 2. Figure 5(a) shows the root mean squared errors (RMSE) of TRIP and Tucker-NN for predicting the EMT value using NNs with different numbers of hidden layers. TRIP shows much better accuracy than Tucker-NN, even when using NNs with no or only one hidden layer. We choose the model with one hidden layer as the best. Figure 5(b),(c) plots the data points projected to the two axes. As described in Section 4, the two axes of each mode are rotated, and the LRC of every data point, indicated by the arrows, are calculated. Note that the predicted values have a high Pearson's correlation coefficient with the TG and RG projections on the first axis, but a low one on the second axis. Interestingly, there is a positive correlation between the values of the first and second axes for lower values of the first axis, and a negative correlation for higher values. Moreover, a mixture of LRC in two different directions is observed throughout the data. This suggests that there are at least two gene groups in both TG and RG, which interact in a complex manner.
We show the TG and RG that have the highest absolute scores on the two axes in Table 2. Note that we consider only absolute scores because the sign of the scores becomes unstable, since the projection is performed by the outer product of the projection vectors of each mode. We analyze the top genes with the bioinformatics tool DAVID (https://david.ncifcrf.gov/) huang_systematic_2009 , which provides functional annotation clustering of genes and their biological interpretation. The annotation summary results of Tissue expression (t_e) show a clear difference between the axes: Platelet, Thyroid, and Thymus are derived as annotations related to the first axis of TG, whereas Ovary and Eye are extracted for the second axis of TG. On the other hand, the results for RG include only Epithelium on the first axis, but Muscle, Pancreatic cancer, Colon cancer, Uterus, Colon adenocarcinoma, Colon tumor, and Peripheral blood lymphocyte on the second axis. These observations suggest that there are other biomechanisms involved in various ways with EMT, independent of the relationships between TG and RG extracted on the first axis that directly predict the EMT value. We hope that these results will be analyzed more deeply by biological researchers, leading to the discovery of new biological mechanisms.
Columns, left to right: Target #1, Target #2, Regulator #1, Regulator #2.

| score | gene | t_e | score | gene | t_e | score | gene | t_e | score | gene | t_e |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.084 | C17orf39 |  | 0.025 | GTF2H3 | EY | 0.128 | NCOR1 | EP | 0.097 | DIP2A | UT |
| 0.083 | SLC6A8 |  | 0.025 | KIAA* |  | 0.118 | IKZF1 |  | 0.035 | m451 |  |
| 0.072 | MPP1 | TM | 0.024 | TIGD1L |  | 0.110 | LYL1 |  | 0.034 | SAP* | MS,UT |
| 0.067 | PLOD3 |  | 0.023 | TCF7 |  | 0.109 | DIP2A |  | 0.032 | HSF1 | MS |
| 0.062 | CHKA |  | 0.022 | LOC* |  | 0.090 | AIP |  | 0.032 | NAT8 | CT |
| 0.058 | SASH1 | TR | 0.022 | PRR11 | EY | 0.089 | SOX10 |  | 0.032 | m422a* |  |
| 0.055 | AFTPH |  | 0.022 | ERN2 |  | 0.088 | m223 |  | 0.031 | LMCD1 |  |
| 0.054 | GSPT1 |  | 0.022 | MBTD1 | EY | 0.086 | ELF3 |  | 0.030 | TAF12 |  |
| 0.053 | MSX2 |  | 0.021 | MTRF1L | EY | 0.086 | TTF1 | EP | 0.030 | LBA1 |  |
| 0.051 | COASY | PL | 0.021 | UBQLN4 | OV | 0.081 | CREB3 |  | 0.030 | NOTCH2 |  |

Gene-name abbreviations: KIAA*: KIAA0894; LOC*: LOC100272216; m223: hsa-miR-223; m451: hsa-miR-451; SAP*: SAP30BP; NAT8: NAT8 /// NAT8B; m422a*: hsa-miR-422a*.
5.4 Evaluation of Scalability
To empirically demonstrate the scalability of TRIP, we create artificial datasets of first-, second-, and third-order tensors, i.e., Rand1, Rand2, and Rand3 in Table 1, respectively, by randomly selecting nonzero elements. We show only the results of the model using an NN with two hidden layers; we obtained almost the same results for other settings (not shown). Figure 6(a) shows that the training time of TRIP increases linearly with the number of nonzero elements in the datasets. Figure 6(b) also shows that the training time increases linearly with the size of the projected tensor, where we change the size of one mode of the projected tensor while fixing the sizes of the other modes. Note that the number of nonzero elements is held fixed in this experiment.
6 Conclusion
We proposed TRIP, a method for searching for a linear projection that maximizes prediction accuracy while retaining the original data variance as much as possible. Experiments using artificial data showed that TRIP correctly finds decision boundaries among many variables even under strong nonlinearity. Experiments with real-world data, including higher-order tensors, also showed that TRIP makes it possible to find various events behind the data using a human-readable low-dimensional subspace. Moreover, we empirically showed that TRIP scales linearly with the number of nonzero elements and the size of the projected tensor.
References
 (1) Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
 (2) Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognit., 44(7):1357–1371, 2011.
 (3) Babak Hosseini and Barbara Hammer. Interpretable discriminative dimensionality reduction and feature selection on the manifold. CoRR, abs/1909.09218, 2019.

 (4) Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
 (5) Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
 (6) Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
 (7) Rajarshi Guhaniyogi, Shaan Qamar, and David B. Dunson. Bayesian tensor regression. J. Mach. Learn. Res., 18:79:1–79:31, 2017.
 (8) Lifang He, Kun Chen, Wanwan Xu, Jiayu Zhou, and Fei Wang. Boosted sparse and low-rank tensor regression. In NeurIPS 2018, pages 1017–1026, 2018.
 (9) Masaaki Imaizumi and Kohei Hayashi. Doubly decomposing nonparametric tensor regression. In ICML, pages 727–736, 2016.
 (10) Rose Yu and Yan Liu. Learning from multiway data: Simple and efficient tensor regression. In ICML, pages 373–381, 2016.
 (11) Lifang He, Chun-Ta Lu, Guixiang Ma, Shen Wang, Linlin Shen, Philip S. Yu, and Ann B. Ragin. Kernelized support tensor machines. In ICML, pages 1442–1451, 2017.
 (12) Dacheng Tao, Xuelong Li, Xindong Wu, Weiming Hu, and Stephen J. Maybank. Supervised tensor learning. Knowl. Inf. Syst., 13(1):1–42, 2007.
 (13) I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer-Verlag, 2002.
 (14) Aleix M. Martínez and Avinash C. Kak. PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell., 23(2):228–233, 2001.
 (15) Roman Rosipal and Nicole Krämer. Overview and recent advances in partial least squares. In Subspace, Latent Structure and Feature Selection, pages 34–51. Springer, 2006.
 (16) Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
 (17) Amina Adadi and Mohammed Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138–52160, 2018.
 (18) Zachary C. Lipton. The mythos of model interpretability. Commun. ACM, 61(10):36–43, 2018.
 (19) Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In KDD, pages 1135–1144, 2016.
 (20) Koji Maruhashi, Masaru Todoriki, Takuya Ohwa, Keisuke Goto, Yu Hasegawa, Hiroya Inakoshi, and Hirokazu Anai. Learning multi-way relations via tensor decomposition with neural networks. In AAAI, pages 3770–3777, 2018.
 (21) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
 (22) Kelwin Fernandes, Jaime S. Cardoso, and Jessica Fernandes. Transfer learning with partial observability applied to cervical cancer screening. In Iberian Conference on Pattern Recognition and Image Analysis, pages 243–250. Springer, 2017.
 (23) Roohallah Alizadehsani, Jafar Habibi, Mohammad Javad Hosseini, Hoda Mashayekhi, Reihane Boghrati, Asma Ghandeharioun, Behdad Bahadorian, and Zahra Alizadeh Sani. A data mining approach for diagnosis of coronary artery disease. Computer Methods and Programs in Biomedicine, 111(1):52 – 61, 2013.
 (24) Nadine Hajj, Yara Rizk, and Mariette Awad. A subjectivity classification framework for sports articles using improved cortical algorithms. Neural Computing and Applications, 31:8069–8085, 2018.
 (25) James Bridge, Sean Holden, and Lawrence Paulson. Machine learning for first-order theorem proving. Journal of Automated Reasoning, 53, 2014.
 (26) Ralph Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian Elger. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64:061907, 2002.
 (27) Teppei Shimamura, Seiya Imoto, Yukako Shimada, Yasuyuki Hosono, Atsushi Niida, Masao Nagasaki, Rui Yamaguchi, Takashi Takahashi, and Satoru Miyano. A novel network profiling analysis reveals system changes in epithelial-mesenchymal transition. PLoS ONE, 6:e20804, 2011.
 (28) Da Wei Huang, Brad T. Sherman, and Richard A. Lempicki. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols, 4(1):44–57, 2009.
Appendix A Calculating Gradient in Orthonormal Conditions
We show below how Eqs. 7 and 8 are derived. For readability, superscripts are omitted. The  defined by Eqs. 5 and 6 is calculated as a matrix that satisfies
(11) 
because the former equation of Eqs. 11 can be written as  by using Eq. 5, and this can be regarded as an SVD, where the diagonal components of  are singular values and the column vectors of  are left singular vectors. Thus  can be obtained. By differentiating both sides of both equations of Eqs. 11 with respect to the th component of , we obtain
(12) 
The  that satisfies Eqs. 12 is  by using Eq. 8. Using the property of , we obtain
(13) 
The  is  for the th component and  otherwise. Thus, we can derive Eq. 7.
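The derivation above relies on a standard fact about the SVD: the matrix of left singular vectors has orthonormal columns, which is exactly what the orthonormal condition on the projection requires. A minimal numerical check of that fact (using NumPy; this is an illustration of the SVD property, not the paper's implementation):

```python
import numpy as np

# For any matrix A with SVD A = U diag(s) V^T (thin form),
# the left singular vectors U satisfy U^T U = I, so a projection
# built from U automatically meets the orthonormal condition.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(U.T @ U, np.eye(3))      # orthonormal columns
assert np.allclose(U @ np.diag(s) @ Vt, A)  # exact reconstruction
print("orthonormality and reconstruction verified")
```

Because U is orthonormal by construction, a gradient step computed through the SVD stays consistent with the constraint, which is the role Eqs. 11–13 play in the derivation.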