Linear Tensor Projection Revealing Nonlinearity

07/08/2020 ∙ by Koji Maruhashi, et al. ∙ FUJITSU

Dimensionality reduction is an effective method for learning high-dimensional data, and it can provide a better understanding of decision boundaries in a human-readable low-dimensional subspace. Linear methods, such as principal component analysis and linear discriminant analysis, make it possible to capture the correlation between many variables; however, there is no guarantee that the correlations that are important for prediction will be captured. Moreover, if the decision boundary has strong nonlinearity, capturing them becomes increasingly difficult. This problem is exacerbated when the data are matrices or tensors that represent relationships between variables. We propose a learning method that searches for a subspace that maximizes the prediction accuracy while retaining as much of the original data information as possible, even if the prediction model in the subspace has strong nonlinearity. This makes it easier to interpret the mechanism of the groups of variables behind the prediction problem that the user wants to understand. We show the effectiveness of our method by applying it to various types of data, including matrices and tensors.


1 Introduction

Dimensionality reduction is effective for learning high-dimensional data. One important effect is avoiding overfitting by revealing a few essential parameters. Another is providing a better understanding of decision boundaries in a human-readable low-dimensional subspace: we can understand the reason for a prediction by analyzing the positional relationship between data points and decision boundaries. Several dimensionality reduction methods, such as t-SNE Maaten08 , kernel PCA Barshan11 ; Hosseini19 ; Scholkopf98 , and Isomap tenenbaum_global_2000 , reduce dimensionality nonlinearly; however, the relationship between the original variables and the projections is not straightforward and is difficult to understand in many cases. On the other hand, it may be difficult for variable selection methods, such as the celebrated LASSO approach tibshirani96 , to explain phenomena such as biological mechanisms that should be described as interactions between sets of variables that correlate only slightly with each other. Linear methods without variable selection, such as principal component analysis (PCA) and linear discriminant analysis (LDA), are desirable from the interpretation point of view because we can easily understand the correlation between many variables in the projected low-dimensional subspace. We can apply any prediction model to the projected data points; however, there is no guarantee that the features important for prediction are kept in the projected subspace. To highlight these issues, we give a vivid example using artificial data of a two-class classification task with 100-dimensional variables. These data are created by embedding a two-dimensional spiral distribution (Fig. 1(a)) in a 100-dimensional space by adding noise dimensions (Fig. 1(b)) and performing a rotation transformation (see details in Section 5). Figure 1(c) shows the distribution of the data points projected onto a two-dimensional subspace using LDA, a supervised method. However, it overfits the noise, and the spiral distribution cannot be observed at all. By contrast, PCA does not seem to overfit the noise; however, it is difficult to recognize the spiral distribution (Fig. 1(e)).

Thus, it is difficult for linear projection methods to deal with strong nonlinearities in high-dimensional space. In this paper, we propose a learning method that searches for a subspace that maximizes the prediction accuracy while retaining as much of the original data variance as possible: we optimize the projection matrix $\mathbf{W}$ to minimize a regularized-by-reconstruction loss such as

$\sum_{i=1}^{N} \big[ \, l\big(y_i, f(\mathbf{W}^{\top}\mathbf{x}_i; \theta)\big) + \lambda \, \| \mathbf{x}_i - \mathbf{W}\mathbf{W}^{\top}\mathbf{x}_i \|^2 \, \big]$ s.t. $\mathbf{W}^{\top}\mathbf{W} = \mathbf{I}$,   (1)

where $\mathbf{x}_i$ is the $i$-th data point, $l$ is the loss of the prediction model $f$ that predicts a scalar response $y_i$ in the subspace, $\theta$ are the parameters of the prediction model, $\mathbf{I}$ is an identity matrix, and $\lambda$ is the regularization parameter. This strategy meets two requirements simultaneously. First, the subspace keeps the variance of the data points as large as possible by minimizing the reconstruction errors $\|\mathbf{x}_i - \mathbf{W}\mathbf{W}^{\top}\mathbf{x}_i\|^2$, the same as with PCA. Second, $y_i$ can be predicted as accurately as possible by minimizing $l$, even if the prediction model requires strong nonlinearity. We can successfully find the spiral distribution, as shown in Fig. 1(d), by computing an optimal $\mathbf{W}$ that minimizes Eq. 1.
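To make the objective concrete, the following is a minimal PyTorch-style sketch of the vector (first-order) case of Eq. 1, assuming a classification loss and an externally supplied predictor; the function and variable names are ours, not from the paper.

```python
import torch
import torch.nn.functional as F

def trip_objective(x, y, W, model, lam):
    """Sketch of the regularized-by-reconstruction loss of Eq. 1 (vector case).

    x: (batch, D) data, y: (batch,) class labels, W: (D, R) projection with
    orthonormal columns, model: any predictor on the R-dim subspace, lam: lambda.
    """
    z = x @ W                                   # project into the subspace
    pred_loss = F.cross_entropy(model(z), y)    # prediction loss l
    x_rec = z @ W.T                             # reconstruction W W^T x
    rec_loss = ((x - x_rec) ** 2).sum(dim=1).mean()
    return pred_loss + lam * rec_loss
```

The orthonormality constraint on $\mathbf{W}$ is handled separately (Eqs. 5-6 in Section 4.2); here $\mathbf{W}$ is simply assumed to already have orthonormal columns.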

Figure 1: Two-dimensional spiral distribution (a) buried in high-dimensional noise (b) by a rotation transform is extracted by the proposed method (d), whereas LDA overfits the noise (c) and PCA fails to extract the spiral distribution (e). Contours represent the softmax values for one class of the neural network applied to the projection.

The issues described above also apply to tensor data. Several tensor-regression methods have been proposed for data represented as matrices or tensors that describe relationships between variables guhaniyogi_bayesian_2017 ; he_boosted_2018 ; imaizumi_doubly_2016 ; yu_learning_2016 . However, these methods cannot cope with strong nonlinearities. On the other hand, several methods using the kernel method he_kernelized_2017 ; Tao07 lose the correlation between many variables of the original data. We extend Eq. 1 to a tensor expression so that it can be applied to tensor data.

To the best of our knowledge, this is the first study to propose a method that searches for a linear projection, without variable selection, that maintains the maximum data variation while allowing high-accuracy prediction even under strong nonlinearity, including for matrix and tensor data. We report the results of evaluating the proposed method using artificial data and real-world data, including higher-order data. We empirically show that our method can classify high-dimensional data with high accuracy while having high interpretability.

2 Related Works

PCA is the most well-known unsupervised dimensionality-reduction method; it represents a high-dimensional space by a few orthogonal variables that retain a large portion of the data variance jolliffe_principal_2002 . Several methods have been proposed to find a low-dimensional subspace in a supervised manner, such as LDA and partial least squares (PLS) Martinez01 ; Rosipal06 . However, these methods strongly assume that the samples in each class follow a normal distribution and cannot be applied to datasets with strong nonlinearity, as shown in Fig. 1(a). Several dimensionality-reduction methods, such as t-SNE Maaten08 , kernel PCA Barshan11 ; Hosseini19 ; Scholkopf98 , and Isomap tenenbaum_global_2000 , reduce dimensionality nonlinearly; however, the relationship between the original variables and the projections is not straightforward and is difficult to understand in many cases.

Tucker decomposition Kolda09 is a well-known unsupervised tensor-decomposition method. Several studies on the tensor regression task have been reported guhaniyogi_bayesian_2017 ; he_boosted_2018 ; imaizumi_doubly_2016 ; yu_learning_2016 . Most of these methods reduce the number of parameters by expressing the partial regression coefficients in a low-rank form and can be regarded as multi-linear projections into a lower-dimensional tensor subspace. However, these methods cannot cope with the nonlinearity inherent in the data. Kernel methods have been applied to deal with nonlinearities he_kernelized_2017 ; Tao07 ; however, they lose the correlation between many variables.

Prediction interpretation and justification is one of the most important topics in recent machine learning research Adadi18 ; Lipton18 . One effective approach is to produce interpretable models. Ribeiro et al. introduced a model-agnostic approach that provides explanations for black-box models by learning interpretable models Ribeiro16 . Our method allows us to better understand the interpretable model by visualizing it along with the decision boundaries in a human-readable low-dimensional subspace.

3 Preliminaries

3.1 Notations

Scalars are denoted as lowercase letters ($a$), vectors as boldface lowercase letters ($\mathbf{a}$), matrices as boldface capital letters ($\mathbf{A}$), and tensors as boldface Euler script letters ($\mathcal{X}$). The transpose of $\mathbf{A}$ is $\mathbf{A}^{\top}$. $\mathbf{I}$ is an identity matrix, and $\mathbf{0}$ and $\mathbf{1}$ are square matrices in which all values are $0$ and $1$, respectively. The square root of the sum of the squared values in a tensor $\mathcal{X}$ is denoted as $\|\mathcal{X}\|$. We refer to the dimensionality of a tensor as its order and to each dimension as a mode. The inner product is denoted as $\langle \mathcal{X}, \mathcal{Y} \rangle$. The outer product of vectors $\mathbf{a}^{(1)}, \dots, \mathbf{a}^{(K)}$ is denoted as $\mathbf{a}^{(1)} \circ \cdots \circ \mathbf{a}^{(K)}$, whose $(i_1, \dots, i_K)$-th element is $\prod_k a^{(k)}_{i_k}$, where $a^{(k)}_{i_k}$ is the $i_k$-th element of $\mathbf{a}^{(k)}$. Element-wise multiplication and division are denoted as $\ast$ and $\oslash$, respectively. The Khatri-Rao product $\mathbf{A} \odot \mathbf{B}$ between $\mathbf{A}$ of size $I \times R$ and $\mathbf{B}$ of size $J \times R$ is a matrix of size $IJ \times R$ whose columns are the Kronecker products of the corresponding columns. The vectorization of a tensor $\mathcal{X}$ creates a vector $\mathrm{vec}(\mathcal{X})$. The mode-$k$ matricization of a tensor $\mathcal{X}$ of size $I_1 \times \cdots \times I_K$ creates a matrix $\mathbf{X}_{(k)}$ of size $I_k \times \prod_{m \neq k} I_m$ whose columns are the mode-$k$ fibers of $\mathcal{X}$. The $k$-mode product of a tensor $\mathcal{X}$ with a matrix $\mathbf{U}$ of size $J \times I_k$ is a tensor $\mathcal{X} \times_k \mathbf{U}$ of size $I_1 \times \cdots \times I_{k-1} \times J \times I_{k+1} \times \cdots \times I_K$. In case $\mathbf{u}$ is a vector of length $I_k$, we remove the $k$-th mode and resize the result to $I_1 \times \cdots \times I_{k-1} \times I_{k+1} \times \cdots \times I_K$. We denote the sequence of mode products over all modes, $\mathcal{X} \times_1 \mathbf{U}^{(1)} \cdots \times_K \mathbf{U}^{(K)}$, as $\mathcal{X} \times \{\mathbf{U}\}$, and the product over all modes except the $k$-th mode as $\mathcal{X} \times_{-k} \{\mathbf{U}\}$. A vector $\partial f / \partial \mathbf{a}$ and a matrix $\partial f / \partial \mathbf{A}$ denote the differentials of $f$ over the elements of $\mathbf{a}$ and $\mathbf{A}$, respectively.
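As a concrete illustration of the mode-$k$ matricization and $k$-mode product defined above, here is a short NumPy sketch; the tensor sizes are arbitrary examples, and the column ordering of the matricization follows one common convention.

```python
import numpy as np

X = np.random.rand(3, 4, 5)                 # a 3rd-order tensor of size 3 x 4 x 5

# Mode-2 matricization: a 4 x (3*5) matrix whose columns are mode-2 fibers.
X_2 = np.moveaxis(X, 1, 0).reshape(4, -1)

# 2-mode product with U of size 2 x 4: the result has size 3 x 2 x 5.
U = np.random.rand(2, 4)
X_times2_U = np.einsum('ijk,rj->irk', X, U)
```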

3.2 Problem Definition

As mentioned in Section 1, we extend our problem to a tensor expression. Among several ways to project a tensor $\mathcal{X}$ of size $I_1 \times \cdots \times I_K$ to a tensor $\mathcal{Z}$ of size $R_1 \times \cdots \times R_K$, we propose, from the interpretation point of view, projecting by the $k$-mode product of each mode using a projection matrix $\mathbf{W}^{(k)}$ of size $I_k \times R_k$, i.e., $\mathcal{Z} = \mathcal{X} \times_1 \mathbf{W}^{(1)\top} \cdots \times_K \mathbf{W}^{(K)\top}$. Thus, our problem can be defined as follows:

Problem 1 (Linear Tensor Projection for Nonlinear Prediction (LTPNP))

Given a set of training samples $\{(\mathcal{X}_i, y_i)\}_{i=1}^{N}$, where $\mathcal{X}_i$ is a $K$th-order tensor of size $I_1 \times \cdots \times I_K$ and $y_i$ is a scalar response, how can we learn the projection matrices $\{\mathbf{W}^{(k)}\}_{k=1}^{K}$ that project $\mathcal{X}_i$ into $\mathcal{Z}_i$, together with a prediction model $f(\mathcal{Z}_i; \theta)$ with strong nonlinearity that can accurately predict the response $y_i$ based on the projected tensor $\mathcal{Z}_i$, where $\theta$ are the parameters of the prediction model?
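For instance, for a second-order tensor ($K = 2$) the LTPNP projection and the corresponding reconstruction reduce to two matrix products; the following NumPy sketch uses sizes chosen only for illustration.

```python
import numpy as np

I1, I2, R1, R2 = 20, 30, 3, 2
X_i = np.random.rand(I1, I2)                  # one data point (a matrix)
W1 = np.linalg.qr(np.random.rand(I1, R1))[0]  # projection matrices with
W2 = np.linalg.qr(np.random.rand(I2, R2))[0]  # orthonormal columns

Z_i = W1.T @ X_i @ W2          # projected tensor: X_i x_1 W1^T x_2 W2^T (R1 x R2)
X_rec = W1 @ Z_i @ W2.T        # reconstruction used in the regularization term
```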

4 Proposed Method

We propose a practical method of solving the LTPNP problem. We will refer to this method as Tensor Reconstruction based Interpretable Prediction (TRIP).

4.1 Prediction

As we explain later, we propose learning the projection matrices $\{\mathbf{W}^{(k)}\}$ and the prediction model by using stochastic gradient descent (SGD). Typical prediction models that can be trained with SGD include neural networks (NNs) and linear/softmax regression. Such a prediction model can be written as

$f(\mathcal{Z}_i; \theta) = \tilde{f}\big(\langle \mathcal{W}, \mathcal{Z}_i \rangle; \tilde{\theta}\big)$,   (2)

where $\langle \mathcal{W}, \mathcal{Z}_i \rangle$ is the inner product of the projected tensor with a weight tensor $\mathcal{W}$ of size $R_1 \times \cdots \times R_K \times H$ taken over the first $K$ modes, $\tilde{f}$ is the prediction model obtained by removing $\mathcal{W}$ from $f$, whose input is a vector of size $H$, and $\tilde{\theta}$ are the remaining parameters other than $\mathcal{W}$. If the prediction model is linear/softmax regression, $\mathcal{W}$ contains the partial regression coefficients. Moreover, if the prediction model is an NN, $\mathcal{W}$ contains the parameters between $\mathcal{Z}_i$ and the first layer. In any case, the number of elements of $\mathcal{W}$ grows multiplicatively with the number of modes, so over-fitting easily occurs. As a solution to this problem, we restrict $\mathcal{W}$ to a low-rank tensor,

$\mathcal{W} = \sum_{r=1}^{R} \mathbf{v}^{(1)}_r \circ \cdots \circ \mathbf{v}^{(K)}_r \circ \mathbf{v}^{(K+1)}_r$,   (3)

by using factor matrices $\mathbf{V}^{(k)} = [\mathbf{v}^{(k)}_1, \dots, \mathbf{v}^{(k)}_R]$, so that the number of parameters can be reduced.
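As an illustration of such a low-rank parameterization (our reading of Eq. 3, assuming a rank-$R$ sum of outer products over the factor matrices; the sizes are placeholders):

```python
import torch

R1, R2, H, R = 3, 2, 10, 4                      # placeholder sizes
V1, V2, V3 = torch.randn(R1, R), torch.randn(R2, R), torch.randn(H, R)

# W[a, b, h] = sum_r V1[a, r] * V2[b, r] * V3[h, r]  (sum of R outer products)
W = torch.einsum('ar,br,hr->abh', V1, V2, V3)

# First-layer input for a projected sample Z of size R1 x R2.
Z = torch.randn(R1, R2)
h = torch.einsum('ab,abh->h', Z, W)
```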

4.2 Training

We show a method of learning $\{\mathbf{W}^{(k)}\}$, $\{\mathbf{V}^{(k)}\}$, and $\tilde{\theta}$ using SGD. We extend the regularized-by-reconstruction loss (Eq. 1) to a tensor expression as

$\sum_{i=1}^{N} \big[ \, l\big(y_i, f(\mathcal{Z}_i; \theta)\big) + \lambda \, \| \mathcal{X}_i - \mathcal{Z}_i \times_1 \mathbf{W}^{(1)} \cdots \times_K \mathbf{W}^{(K)} \|^2 \, \big]$,   (4)

where $\mathcal{Z}_i = \mathcal{X}_i \times_1 \mathbf{W}^{(1)\top} \cdots \times_K \mathbf{W}^{(K)\top}$ and $\lambda$ is the regularization parameter. It is necessary to satisfy the orthonormal condition $\mathbf{W}^{(k)\top}\mathbf{W}^{(k)} = \mathbf{I}$. We adopt an idea similar to  Maruhashi18 , which uses latent matrices $\mathbf{L}^{(k)}$ of the same size as $\mathbf{W}^{(k)}$ and calculates $\mathbf{W}^{(k)}$ as a matrix that satisfies the orthonormal conditions derived from $\mathbf{L}^{(k)}$. That is, the singular value decomposition (SVD)

$\mathbf{L}^{(k)} = \mathbf{P}^{(k)} \mathbf{\Sigma}^{(k)} \mathbf{Q}^{(k)\top}$   (5)

is calculated and $\mathbf{W}^{(k)}$ is set as

$\mathbf{W}^{(k)} = \mathbf{P}^{(k)} \mathbf{Q}^{(k)\top}$,   (6)

where $\mathbf{P}^{(k)}$ and $\mathbf{Q}^{(k)}$ are matrices having the left and right singular vectors as column vectors, and $\mathbf{\Sigma}^{(k)}$ is a diagonal matrix having the singular values as diagonal components. Our idea is to calculate the derivative of $\mathbf{W}^{(k)}$ with respect to each element of $\mathbf{L}^{(k)}$ and to update $\mathbf{W}^{(k)}$ by updating $\mathbf{L}^{(k)}$. As a result, we can calculate the gradient with respect to $\mathbf{L}^{(k)}$ as

(7)

by using

(8)

where $\mathbf{P}^{(k)}$, $\mathbf{\Sigma}^{(k)}$, and $\mathbf{Q}^{(k)}$ are obtained by the same SVD as in Eq. 5 (see Appendix A for the derivation of Eqs. 7 and 8). Here, the gradient with respect to $\mathbf{W}^{(k)}$ can be obtained by differentiating Eq. 4, resulting in

(9)

where the gradient of the prediction loss is calculated using methods specific to the prediction model, such as the back-propagation algorithm in NNs, by which the gradients with respect to $\tilde{\theta}$ can also be calculated. Moreover, the gradient with respect to $\mathbf{V}^{(k)}$ can be calculated as

(10)

The prediction and training algorithms are shown in Algorithm 1.

procedure PREDICTION($\mathcal{X}_i$)
    $\mathcal{Z}_i \leftarrow \mathcal{X}_i \times_1 \mathbf{W}^{(1)\top} \cdots \times_K \mathbf{W}^{(K)\top}$
    $\hat{y}_i \leftarrow f(\mathcal{Z}_i; \theta)$    ▷ Eqs. 2, 3
    return $\hat{y}_i$
procedure TRAINING($\{\mathcal{X}_i\}$, $\{y_i\}$)
    initialize $\{\mathbf{L}^{(k)}\}$, $\{\mathbf{V}^{(k)}\}$, $\tilde{\theta}$
    $\{\mathbf{W}^{(k)}\} \leftarrow$ orthonormalization($\{\mathbf{L}^{(k)}\}$)    ▷ Eqs. 5, 6
    for epoch $= 1$ to $T$ do
        set the gradients to zero
        for $i = 1$ to $N$ do
            $\hat{y}_i \leftarrow$ PREDICTION($\mathcal{X}_i$)
            gradient of the prediction loss $\leftarrow$ grad($\hat{y}_i$, $y_i$)    ▷ e.g., back-propagation in NNs
            gradients of $\{\mathbf{W}^{(k)}\}$, $\{\mathbf{V}^{(k)}\}$ $\leftarrow$ gradCG($\cdot$)    ▷ Eqs. 9, 10
        gradients of $\{\mathbf{L}^{(k)}\}$ $\leftarrow$ gradients of $\{\mathbf{W}^{(k)}\}$    ▷ Eqs. 7, 8
        $\{\mathbf{L}^{(k)}\}$, $\{\mathbf{V}^{(k)}\}$, $\tilde{\theta}$ $\leftarrow$ update(gradients)
        $\{\mathbf{W}^{(k)}\}$ $\leftarrow$ orthonormalization($\{\mathbf{L}^{(k)}\}$)    ▷ Eqs. 5, 6
    return $\{\mathbf{W}^{(k)}\}$, $\{\mathbf{V}^{(k)}\}$, $\tilde{\theta}$
Algorithm 1 TRIP
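The orthonormalization step of Eqs. 5-6 can be sketched as follows, assuming it takes the polar-factor form implied by the description (keep the left and right singular vectors and drop the singular values); the gradient through this step is what Eqs. 7-8 provide in closed form.

```python
import torch

def orthonormalize(L):
    """Sketch of Eqs. 5-6: map a latent matrix L (I x R) to W with W^T W = I.

    P and Qt hold the left and right singular vectors of L; multiplying them
    gives the closest matrix with orthonormal columns.
    """
    P, S, Qt = torch.linalg.svd(L, full_matrices=False)
    return P @ Qt
```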

4.3 Computational Complexity

The computation mainly consists of two parts: the projection and the propagation in an NN. The calculation of the projection scales linearly with the total number of non-zero elements in the dataset and with the size of the projected tensor, so its time and space complexity is linear in both. The complexity of forward- and back-propagation in an NN is linear in time and space in the number of edges of the NN; because the number of edges between the projected tensor and the first layer equals the number of elements of $\mathcal{W}$, the overall cost is linear in the size of $\mathcal{W}$ plus the number of remaining edges.

4.4 Understanding Subspace

4.4.1 Extraction of Independent Components

Even if $\mathbf{W}^{(k)}$ and $\mathbf{V}^{(k)}$ are changed to $\mathbf{W}^{(k)}\mathbf{R}^{(k)}$ and $\mathbf{R}^{(k)\top}\mathbf{V}^{(k)}$ by a rotation transform $\mathbf{R}^{(k)}$, the value of Eq. 4 does not change. We propose to determine $\mathbf{R}^{(k)}$ so that the rotated axes can be understood as the coefficients of independent components after regression in the modes other than the $k$-th mode. When the prediction model is a linear model, $\mathbf{R}^{(k)}$ can be calculated so that the corresponding regression coefficients are independent of each other; that is, the column vectors of $\mathbf{R}^{(k)}$ are the left singular vectors of the matrix having the normalized coefficient vectors as its column vectors. If the prediction model is a nonlinear model such as an NN, we can use the same strategy as in the linear case by learning a linear surrogate model whose regression coefficients and bias are calculated so that it produces results similar to those of the NN, by minimizing the squared difference between the two models' outputs. We can use the alternating least squares (ALS) method Kolda09 , with which the coefficients of each mode are alternately calculated while fixing the coefficients of the other modes.
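A sketch of this rotation for the first-order case, under our reading of the procedure (fit a linear surrogate to the model outputs, normalize the coefficient columns, and take their left singular vectors as the rotation):

```python
import numpy as np

def rotate_axes(Z, y_pred):
    """Rotate projected axes using a linear surrogate model (sketch, K = 1).

    Z: (N, R) projected points; y_pred: (N, C) model outputs (e.g. softmax).
    """
    A = np.hstack([Z, np.ones((len(Z), 1))])            # add a bias column
    coef = np.linalg.lstsq(A, y_pred, rcond=None)[0]    # (R+1, C) surrogate fit
    C = coef[:-1]
    C = C / np.linalg.norm(C, axis=0, keepdims=True)    # normalize columns
    R_rot = np.linalg.svd(C, full_matrices=True)[0]     # left singular vectors
    return Z @ R_rot, R_rot
```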

4.4.2 Local Regression Coefficients

It might be difficult to understand a strongly nonlinear decision boundary by using a single linear surrogate model. We therefore propose to learn a local linear surrogate model that is applied only to the data points around the $i$-th point, with local regression coefficients (LRC) and a bias. These are calculated by minimizing the squared difference between the surrogate and the prediction model, weighted by the similarity between each data point and the $i$-th point, which is controlled by a bandwidth parameter. This is almost the same as the local interpretable model proposed by Ribeiro et al. Ribeiro16 , except that we apply it to the projected tensor data. We can calculate the parameters by using the ALS method, the same as for the global linear surrogate model. The LRC can be observed using the coefficients calculated under the local linear surrogate model. For first-order data, we can observe the coefficient vector directly. For higher-order data, the LRC should be projected in the modes other than the $k$-th mode. Also, by mapping the LRC back through $\mathbf{W}^{(k)}$, we can understand them in the language of the original variables.
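A sketch of the LRC computation for a single projected point in the first-order case, assuming a Gaussian similarity kernel (the kernel form and the bandwidth are our assumptions):

```python
import numpy as np

def local_regression_coefficients(Z, y_pred, i, sigma=1.0):
    """Fit a local linear surrogate around the i-th projected point (sketch).

    Z: (N, R) projected data, y_pred: (N,) model outputs for one class/response,
    sigma: assumed kernel width. Returns the LRC and the bias.
    """
    d2 = ((Z - Z[i]) ** 2).sum(axis=1)
    s = np.exp(-d2 / (2.0 * sigma ** 2))          # similarity to the i-th point
    A = np.hstack([Z, np.ones((len(Z), 1))])      # add a bias column
    Aw = A * s[:, None]                           # similarity-weighted design
    coef = np.linalg.lstsq(Aw.T @ A, Aw.T @ y_pred, rcond=None)[0]
    return coef[:-1], coef[-1]
```

For higher-order data, the paper alternates this fit over the modes (ALS); the sketch covers only the flattened first-order case.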

5 Empirical Results

5.1 Evaluation Settings

5.1.1 Datasets

K | dataset | cite | variables | samples | batch | task
1 | Spiral (artificial) | — | 100 | 220 | 20 | classification (2-class)
1 | Rand1 (artificial) | — | 1,000 | 100 | 32 | classification (2-class)
1 | Connectionist | Dua:2019 | 60 | 208 | 64 | classification (2-class)
1 | Colposcopies | Dua:2019 ; fernandes_transfer_2017 | 62 | 287 | 64 | classification (2-class)
1 | Alizadeh | Dua:2019 ; alizadehsani_data_2013 | 57 | 303 | 64 | classification (2-class)
1 | Sports | Dua:2019 ; Hajj2018ASC | 59 | 1,000 | 256 | classification (2-class)
1 | SECOM | Dua:2019 | 590 | 1,567 | 256 | classification (2-class)
1 | Ozone | Dua:2019 | 72 | 1,848 | 256 | classification (2-class)
1 | Spambase | Dua:2019 | 57 | 4,601 | 1,024 | classification (2-class)
1 | Theorem | Dua:2019 ; Bridge14 | 51 | 6,118 | 1,024 | classification (2-class)
1 | Epileptic | Dua:2019 ; Andrzejak02 | 178 | 11,500 | 2,048 | classification (2-class)
1 | Iris | Dua:2019 | 4 | 150 | 32 | classification (3-class)
1 | Wine | Dua:2019 | 13 | 178 | 32 | classification (3-class)
2 | Genes | Shimamura_11 | 13,508 x 1,732 | 762 | 128 | regression
2 | Rand2 (artificial) | — | 1,000 x 1,000 | 100 | 32 | classification (2-class)
3 | Rand3 (artificial) | — | 1,000 x 1,000 x 1,000 | 100 | 32 | classification (2-class)
Full dataset names: Connectionist: Connectionist Bench (Sonar, Mines vs. Rocks); Colposcopies: Quality Assessment of Digital Colposcopies; Alizadeh: Z-Alizadeh Sani; Sports: Sports articles for objectivity analysis; Ozone: Ozone Level Detection; Theorem: First-order theorem proving; Epileptic: Epileptic Seizure Recognition.
Table 1: Summary of datasets. K: order of the dataset; variables: number of variables in each mode; samples: number of samples; batch: mini-batch size.

Table 1 shows a summary of the datasets used in this study. All the vector datasets ($K = 1$), except the artificial ones, are downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) Dua:2019 . We obtain Genes from http://bonsai.hgc.jp/~shima/NetworkProfiler/ Shimamura_11 . We normalize all the variables of the vector datasets to zero mean and unit standard deviation.

5.1.2 Comparison Methods

We compare TRIP with methods that can be applied to the LTPNP problem, i.e., methods that linearly project data points into a low-dimensional subspace and predict scalar responses using prediction models with strong nonlinearity. As in TRIP, NNs are used as the prediction models. The compared methods for the vector datasets use PCA and LDA for the linear projection, and the methods combined with NNs are called PCANN and LDANN, respectively. For the tensor datasets ($K \geq 2$), we compare TRIP with a method that applies NNs to the tensors projected into the low-dimensional subspace by Tucker decomposition Kolda09 , i.e., to the core tensors. This method is called TuckerNN.

In all the experiments other than the evaluation using Spiral and the scalability tests, we select the parameters of each method based on the average accuracies over repeated trials of k-fold cross validation. For Spiral, we use the average test accuracies over repeated trials. The number of hidden layers of the NNs is chosen from zero (linear/softmax regression) upward, with 10 neurons in each layer. We use ReLU as the activation function, the softmax cross-entropy loss for classification, and the squared-error loss for regression. The best $\lambda$ for TRIP is chosen from a set of candidate values. The number of epochs in SGD is determined for each dataset and method based on sufficient convergence. We use the Adam optimization algorithm, with a smaller learning rate for Genes.

TRIP, PCANN, LDANN, and TuckerNN are implemented in Python 3.7 using PyTorch 1.4 (https://pytorch.org/). All the experiments are conducted on an Intel(R) Xeon(R) Gold 6130 CPU at 2.10 GHz with 32 GB of memory, running Linux.

5.2 Evaluation using Artificial Data

5.2.1 Data Generation

We create the artificial dataset Spiral, in which the essential decision boundaries are carefully hidden in a high-dimensional space. First, we generate a two-dimensional distribution in which the data points of the two classes are distributed along a two-dimensional spiral, as shown in Fig. 1(a). Specifically, we randomly select angles from a uniform distribution and calculate the two-dimensional coordinates, with one spiral arm for the samples of the first class and the opposite arm for the samples of the second class. We also add noise samples to each class, with two-dimensional coordinates selected from a uniform distribution. We normalize the standard deviation of the coordinates in each dimension of these samples to one. We also create the coordinates of one additional dimension as a major noise dimension from a normal distribution with the same standard deviation, which cannot be distinguished from the dimensions with the spiral distribution by PCA because of the equal variance. In addition, we add the coordinates of further dimensions as minor noise dimensions from a normal distribution with a smaller standard deviation, which should cause the model to overfit when the regularization by reconstruction errors does not work sufficiently (Fig. 1(b)). Finally, the spiral distribution is hidden in the 100-dimensional space by transforming the data using a randomly generated rotation transform. As a result, we generate an artificial dataset containing 220 samples with 100-dimensional coordinates. We generate two sets of samples by selecting coordinates independently, i.e., a training dataset and a test dataset, using the procedure described above with the same rotation transform.
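For reference, here is a simplified sketch of this generation procedure; the angle range, noise standard deviations, and sample counts lost in extraction are replaced with assumed placeholder values, and the extra uniform noise samples per class are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_spiral(n_per_class=110, n_minor_noise=97, minor_std=0.1):
    """Two-class spiral buried in noise (sketch; counts and ranges are assumed)."""
    t = rng.uniform(0.5, 3.0 * np.pi, size=n_per_class)        # assumed angle range
    arm0 = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)     # first class
    arm1 = -arm0                                                # opposite arm, second class
    X2d = np.vstack([arm0, arm1])
    X2d /= X2d.std(axis=0)                                      # unit std per spiral dimension
    y = np.repeat([0, 1], n_per_class)
    n = len(X2d)
    major = rng.normal(scale=1.0, size=(n, 1))                  # same variance as the spiral dims
    minor = rng.normal(scale=minor_std, size=(n, n_minor_noise))
    X = np.hstack([X2d, major, minor])                          # 2 + 1 + n_minor_noise dims
    Q, _ = np.linalg.qr(rng.normal(size=(X.shape[1], X.shape[1])))
    return X @ Q, y                                             # hide the spiral by a random rotation
```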

5.2.2 Accuracy Evaluation and Visualization

Because we know that the essential decision boundaries form a two-dimensional distribution with a spiral shape, we conduct the experiments using two projection vectors, i.e., $R = 2$. The highest test accuracies are achieved with an NN with several hidden layers and an appropriately chosen $\lambda$. Figure 2(a) shows the fitting accuracies on the training dataset and the test accuracies on the test dataset for several values of $\lambda$, along with the results of LDANN and PCANN. Overall, lower $\lambda$ causes overfitting, in which the fitting accuracies are high whereas the test accuracies are low. On the other hand, higher $\lambda$ can balance the fitting accuracies and the test accuracies, which indicates that the reconstruction errors work as a regularizer. The result of LDANN is similar to those of low $\lambda$, and the result of PCANN is similar to those of high $\lambda$.

Figure 2(b-g) plots the projected data points of the training dataset for several values of $\lambda$. All the plots are chosen from repeated trials as the ones that reduce the objective function the most. We can recognize the spiral boundaries for intermediate values of $\lambda$ (Fig. 2(d)(e)). For smaller $\lambda$, the samples of the two classes are well separated (Fig. 2(b)(c)); however, the test accuracies are very low (Fig. 2(a)), suggesting that the models overfit the training dataset for these $\lambda$. When we use larger $\lambda$, the samples of the two classes are not clearly separated (Fig. 2(f)(g)), and the distributions are similar to that of PCANN (Fig. 1(e)). The results for the largest $\lambda$ suggest that the projection vectors are learned in an almost unsupervised manner. Thus, with an appropriate $\lambda$, the reconstruction error works properly as regularization, and TRIP extracts the subspace that captures the inherently important features for separation in high-dimensional data.

Figure 2: (a) Fitting and test accuracies on Spiral for several values of $\lambda$, using NNs applied to the two-dimensional subspace. The line plot shows the value of $\lambda$. The number of hidden layers of the NNs is indicated by the number of asterisks (*). (b-g) Distribution of the projected data points of the training data of Spiral for several values of $\lambda$. Contours represent the softmax value for one class. Blue circles: first class; red triangles: second class.

5.3 Evaluation by Real-World Data

5.3.1 Evaluation of Accuracies

To quantitatively show the effectiveness of TRIP, the areas under the curve (AUCs) calculated by cross validation for the various two-class classification tasks are shown in Fig. 3. TRIP achieves higher accuracies than PCANN and LDANN on most of the datasets, especially when the data points are projected into only a few dimensions. Also, the datasets with many samples require NNs with many hidden layers (Fig. 3(g)(h)(i)), which indicates that the decision boundaries of these datasets contain relatively strong nonlinearities. These results suggest that TRIP finds a linear projection that maintains high prediction accuracy in a low-dimensional subspace that is easy for humans to understand, even when the decision boundary has high nonlinearity.

Figure 3: AUCs when the data are projected into low-dimensional subspaces using PCANN, LDANN, and TRIP on the two-class classification tasks. The line plot shows the value of $\lambda$. The number of hidden layers of the NNs is indicated by the number of asterisks (*).

5.3.2 Visualizing Subspace

We show how to understand the reason for a prediction using two datasets of three-class classification, Iris and Wine. We use a two-dimensional subspace, i.e., $R = 2$, so that the results can be understood in two-dimensional plots. Figure 4(a)(e) shows the accuracies of PCANN, LDANN, and TRIP using NNs with zero to several hidden layers. Although the best accuracies are almost the same among the methods, TRIP achieves high accuracies even with the model without hidden layers. We choose the numbers of hidden layers and the values of $\lambda$ for Iris and Wine with which TRIP achieves the highest accuracies.

We show the softmax value of TRIP as contours for each class of Iris (Fig. 4(b)(c)(d)) and Wine (Fig. 4(f)(g)(h)). The two axes are rotated as described in Section 4. We also represent the LRC of each data point with an arrow. In Iris, the data points of Versicolor and Virginica are very close, and the LRC of most of these data points have almost the same direction (Fig. 4(c)(d)). These LRC can be interpreted in the language of the original variables by mapping them back through the projection matrix. Roughly speaking, the direction of the LRC means that Virginica has larger petals and smaller sepals than Versicolor. In Wine, it can be observed that the large LRC of the data points near the decision boundary separating two of the classes (Fig. 4(f)), or the other two classes (Fig. 4(g)), have almost the same direction. In short, in order for a wine close to the decision boundary to be assigned to a neighboring class, at least the values of Color intensity, Proline, Alcohol, and Ash must be higher and the value of Hue must be lower; in addition, the value of Alcalinity of ash needs to be smaller for one of the neighboring classes, whereas the value of Flavanoids must be smaller for the other. Thus, the LRC make it easy to understand the reason for a prediction in a linearly projected low-dimensional subspace.

Figure 4: (a)(e) Accuracies of PCANN, LDANN, and TRIP on Iris and Wine. We use NNs whose number of hidden layers is indicated by the number of asterisks (*), applied to the two-dimensional subspace. The line plot shows the value of $\lambda$. (b)(c)(d)(f)(g)(h) Softmax value of TRIP for each class of Iris and Wine is represented as contours. Arrows indicate the LRC of the data points calculated using the local linear surrogate models. Red, blue, and green arrows are Setosa, Versicolor, and Virginica for Iris, and the three classes for Wine, respectively.

5.3.3 Results on Higher-Order Dataset

As a higher-order dataset, we use Genes, in which each data point is a matrix, i.e., a second-order tensor. This dataset describes varying coefficients (VC) between 13,508 target genes (TG) and 1,732 regulator genes (RG) in 762 cell lines Shimamura_11 . VC are computed based on the expression levels of genes, where the strength of the relationship between genes is allowed to vary with an epithelial-mesenchymal transition (EMT)-related modulator value (hereinafter called the EMT value), which describes cell lines as epithelial-like or mesenchymal-like. We set VC whose absolute values are below a threshold to zero, which leaves a sparse set of non-zero elements per cell line on average. We focus on the regression task of predicting the EMT values. We project each data point of a cell line to a subspace of size $2 \times 2$, i.e., $R_1 = R_2 = 2$. Figure 5(a) shows the root mean squared error (RMSE) of TRIP and TuckerNN for predicting the EMT value using NNs with different numbers of hidden layers. TRIP shows much better accuracy than TuckerNN, even when using NNs with no or only one hidden layer. We choose the model with one hidden layer and the best-performing $\lambda$. Figure 5(b)(c) plots the data points projected onto the two axes. As described in Section 4, the two axes of each mode are rotated, and the LRC of every data point, indicated by the arrows, are calculated. Note that the predicted values have a high Pearson's correlation coefficient with the TG and RG scores on the first axis, and a low correlation on the second axis. Interestingly, there is a positive correlation between the values on the first and second axes for lower values on the first axis, and a negative correlation for higher values. Moreover, a mixture of LRC in two different directions is observed throughout the data. This suggests that there are at least two gene groups in both TG and RG, which interact in a complex manner.

We show the TG and RG that have the highest absolute scores on the two axes in Table 2. Note that we only consider absolute scores because the signs of the scores are unstable, since the projection is performed via the outer product of the projection vectors of each mode. We analyze the top-ranked genes with the bioinformatics tool DAVID (https://david.ncifcrf.gov/) huang_systematic_2009 , which provides functional annotation clustering of genes and their biological interpretation. The annotation summary results for Tissue expression (t_e) show a clear difference between the axes: Platelet, Thyroid, and Thymus are derived as annotations related to the first axis of TG, whereas Ovary and Eye are extracted for the second axis of TG. On the other hand, the results for RG include only Epithelium on the first axis, whereas Muscle, Pancreatic cancer, Colon cancer, Uterus, Colon adenocarcinoma, Colon tumor, and Peripheral blood lymphocyte appear on the second axis. These observations suggest that there are other biomechanisms involved in various ways with EMT, independent of the relationships between TG and RG extracted on the first axis that directly predict the EMT value. We hope that these results will be analyzed more deeply by biological researchers, leading to the discovery of new biological mechanisms.

Figure 5: (a) RMSE of TuckerNN and TRIP on Genes. We use NNs whose number of hidden layers is indicated by the number of asterisks (*), applied to the $2 \times 2$ subspace. The line plot shows the value of $\lambda$. (b)(c) Values of the data points in Genes projected onto the first (horizontal) and second (vertical) axes by TRIP. Arrows indicate the LRC of the data points calculated using the local linear surrogate models. Color indicates the predicted EMT values.
Target # 1              | Target # 2            | Regulator # 1        | Regulator # 2
score  gene      t_e    | score  gene     t_e   | score  gene    t_e   | score  gene     t_e
0.084  C17orf39         | 0.025  GTF2H3   EY    | 0.128  NCOR1   EP    | 0.097  DIP2A    UT
0.083  SLC6A8           | 0.025  KIAA*          | 0.118  IKZF1         | 0.035  m451
0.072  MPP1      TM     | 0.024  TIGD1L         | 0.110  LYL1          | 0.034  SAP*     MS,UT
0.067  PLOD3            | 0.023  TCF7           | 0.109  DIP2A         | 0.032  HSF1     MS
0.062  CHKA             | 0.022  LOC*           | 0.090  AIP           | 0.032  NAT8     CT
0.058  SASH1     TR     | 0.022  PRR11    EY    | 0.089  SOX10         | 0.032  m422a*
0.055  AFTPH            | 0.022  ERN2           | 0.088  m223          | 0.031  LMCD1
0.054  GSPT1            | 0.022  MBTD1    EY    | 0.086  ELF3          | 0.030  TAF12
0.053  MSX2             | 0.021  MTRF1L   EY    | 0.086  TTF1    EP    | 0.030  LBA1
0.051  COASY     PL     | 0.021  UBQLN4   OV    | 0.081  CREB3         | 0.030  NOTCH2
Abbreviated gene names: KIAA*: KIAA0894, LOC*: LOC100272216, m223: hsa-miR-223, m451: hsa-miR-451, SAP*: SAP30BP, NAT8: NAT8 /// NAT8B, m422a*: hsa-miR-422a*.
Table 2: Genes with high absolute scores on the two axes. Tissue expression (t_e) abbreviations: PL: Platelet, TR: Thyroid, TM: Thymus, OV: Ovary, EY: Eye, EP: Epithelium, MS: Muscle, UT: Uterus, and CT: Colon tumor.

5.4 Evaluation of Scalability

To empirically demonstrate the scalability of TRIP, we create artificial datasets of first-, second-, and third-order tensors, i.e., Rand1, Rand2, and Rand3 in Table 1, respectively, by randomly selecting the non-zero elements. We only show the results for the model using two hidden layers of the NN and a fixed $\lambda$; we obtained almost the same results for other settings (not shown). Figure 6(a) shows that the training time of TRIP increases linearly with the number of non-zero elements in the datasets. Figure 6(b) also shows that the training time increases linearly with the size of the projected tensor, where we change the size of one dimension of the projected tensor while fixing the sizes of the other dimensions; the number of non-zero elements is also fixed.

Figure 6: Scalability of TRIP. The training time of TRIP scales linearly with the number of non-zero elements (a) and the size of the projected data (b).

6 Conclusion

We proposed a method, TRIP, for searching for a linear projection that maximizes prediction accuracy while retaining as much of the original data variance as possible. Experiments using artificial data showed that TRIP correctly finds decision boundaries among many variables even under strong nonlinearity. Experiments with real-world data, including higher-order tensors, also showed that TRIP makes it possible to find various events behind the data, using a human-readable low-dimensional subspace. Moreover, we empirically showed that TRIP scales linearly with the number of non-zero elements and the size of the projected tensor.

References

  • (1) Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • (2) Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognit., 44(7):1357–1371, 2011.
  • (3) Babak Hosseini and Barbara Hammer. Interpretable discriminative dimensionality reduction and feature selection on the manifold. CoRR, abs/1909.09218, 2019.
  • (4) Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
  • (5) Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
  • (6) Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • (7) Rajarshi Guhaniyogi, Shaan Qamar, and David B. Dunson. Bayesian tensor regression. J. Mach. Learn. Res., 18:79:1–79:31, 2017.
  • (8) Lifang He, Kun Chen, Wanwan Xu, Jiayu Zhou, and Fei Wang. Boosted sparse and low-rank tensor regression. In NeurIPS 2018, pages 1017–1026, 2018.
  • (9) Masaaki Imaizumi and Kohei Hayashi. Doubly decomposing nonparametric tensor regression. In ICML, pages 727–736, 2016.
  • (10) Rose Yu and Yan Liu. Learning from multiway data: Simple and efficient tensor regression. In ICML, pages 373–381, 2016.
  • (11) Lifang He, Chun-Ta Lu, Guixiang Ma, Shen Wang, Linlin Shen, Philip S. Yu, and Ann B. Ragin. Kernelized support tensor machines. In ICML, pages 1442–1451, 2017.
  • (12) Dacheng Tao, Xuelong Li, Xindong Wu, Weiming Hu, and Stephen J. Maybank. Supervised tensor learning. Knowl. Inf. Syst., 13(1):1–42, 2007.
  • (13) I.T. Jolliffe and Springer-Verlag. Principal Component Analysis. Springer Series in Statistics. Springer, 2002.
  • (14) Aleix M. Martínez and Avinash C. Kak. PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell., 23(2):228–233, 2001.
  • (15) Roman Rosipal and Nicole Krämer. Overview and recent advances in partial least squares. In Subspace, Latent Structure and Feature Selection, pages 34–51. Springer, 2006.
  • (16) Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
  • (17) Amina Adadi and Mohammed Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138–52160, 2018.
  • (18) Zachary C. Lipton. The mythos of model interpretability. Commun. ACM, 61(10):36–43, 2018.
  • (19) Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In KDD, pages 1135–1144, 2016.
  • (20) Koji Maruhashi, Masaru Todoriki, Takuya Ohwa, Keisuke Goto, Yu Hasegawa, Hiroya Inakoshi, and Hirokazu Anai. Learning multi-way relations via tensor decomposition with neural networks. In AAAI, pages 3770–3777, 2018.
  • (21) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
  • (22) Kelwin Fernandes, Jaime S. Cardoso, and Jessica Fernandes. Transfer learning with partial observability applied to cervical cancer screening. In Iberian Conference on Pattern Recognition and Image Analysis, pages 243–250. Springer, 2017.
  • (23) Roohallah Alizadehsani, Jafar Habibi, Mohammad Javad Hosseini, Hoda Mashayekhi, Reihane Boghrati, Asma Ghandeharioun, Behdad Bahadorian, and Zahra Alizadeh Sani. A data mining approach for diagnosis of coronary artery disease. Computer Methods and Programs in Biomedicine, 111(1):52 – 61, 2013.
  • (24) Nadine Hajj, Yara Rizk, and Mariette Awad. A subjectivity classification framework for sports articles using improved cortical algorithms. Neural Computing and Applications, 31:8069–8085, 2018.
  • (25) James Bridge, Sean Holden, and Lawrence Paulson. Machine learning for first-order theorem proving. Journal of Automated Reasoning, 53, 2014.
  • (26) Ralph Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian Elger. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical review. E, Statistical, nonlinear, and soft matter physics, 64:061907, 2002.
  • (27) Teppei Shimamura, Seiya Imoto, Yukako Shimada, Yasuyuki Hosono, Atsushi Niida, Masao Nagasaki, Rui Yamaguchi, Takashi Takahashi, and Satoru Miyano. A novel network profiling analysis reveals system changes in epithelial-mesenchymal transition. PloS one, 6:e20804, 2011.
  • (28) Da Wei Huang, Brad T. Sherman, and Richard A. Lempicki. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols, 4(1):44–57, 2009.

Appendix A Calculating Gradient in Orthonormal Conditions

We show below how Eqs. 7 and 8 are derived. For readability, the superscripts $(k)$ are omitted. The $\mathbf{W}$ defined by Eqs. 5 and 6 is calculated as a matrix that satisfies

(11)

because the former equation of Eqs. 11 can be rewritten by using Eq. 5 and regarded as an SVD, in which the diagonal components of $\mathbf{\Sigma}$ are the singular values and the column vectors of $\mathbf{P}$ are the left singular vectors; thus $\mathbf{W}$ can be obtained. By differentiating both sides of both equations of Eqs. 11 with respect to each component of $\mathbf{L}$, we obtain

(12)

The derivative that satisfies Eqs. 12 is given by using Eq. 8. If we use the properties of these matrices, we obtain

(13)

Here, the indicator term equals one for the corresponding component and zero otherwise. Thus, we can derive Eq. 7.