Gene Shaving using influence function of a kernel method

09/05/2018, by Md. Ashad Alam et al.

Gene shaving, the identification of significant subsets of genes, is an essential and challenging task in biomedical research because of the huge number of genes and the complex nature of biological networks. Since positive definite kernel based methods on genomic information can improve the prediction of diseases, in this paper we propose a new method, "kernel gene shaving" (kernel canonical correlation analysis (kernel CCA) based gene shaving). The problem is addressed using the influence function of kernel CCA. To investigate the performance of the proposed method in comparison with three popular gene selection methods (T-test, SAM and LIMMA), we used extensive simulated and real microarray gene expression datasets. The performance measure AUC was computed for each of the methods. The proposed method outperformed the three well-known gene selection methods. In the real data analysis, the proposed method identified a subset of 210 genes out of 2000 genes. The network of these genes has significantly more interactions than expected, which indicates that they may function in a concerted effort on colon cancer.

1 Introduction

Gene shaving (GS), which identifies subsets of genes, is an important research area in the analysis of DNA microarray gene expression data for biomedical discovery. It leads to gene discovery relevant for a particular target annotation, and it differs from hierarchical clustering and the other methods widely used for analyzing gene expression in genome-wide association studies. The selected genes therefore play an important role in the analysis of gene expression data, since they are able to differentiate samples from different populations. Despite their successes, these studies are often hampered by their relatively low reproducibility and by nonlinearity (Hastie et al., 2000; Ruan & Yuan, 2011; Chen & Ishwaran, 2012; Castellanos-Garzón & Romos).

The incorporation of statistical machine learning methods into genomic analysis is a relatively recent development. Large-scale DNA microarray data present significant challenges for statistical analysis, since the high dimensionality of genomic features makes the classical framework no longer feasible. Kernel methods are appropriate tools for such datasets: they map the data from the input space into a feature space using a nonlinear feature map. The main advantage of these methods is that they combine statistics and geometry in an effective way (Hofmann et al., 2008; Alam & Fukumizu, 2014; Charpiat et al.). Kernel canonical correlation analysis (kernel CCA) has been studied extensively for decades (Akaho, 2001; Alam & Fukumizu, 2013, 2015).

Nowadays, sensitivity analysis based on the influence function (IF) is used to detect influential observations. A visualization method for detecting influential observations using the IF of kernel PCA has been proposed (Debruyne et al., 2010), and Filzmoser et al. (2008) developed a method for outlier identification in high dimensions. However, these methods are limited to a single data set. Due to the properties of its eigen-decomposition, kernel CCA and its variants remain widely used methods for biomedical data analysis (Alam et al., 2008, 2016, 2018).

The contribution of this paper is three-fold. First, we derive the IF of kernel CCA. Second, we use distribution based methods to confirm the influential observations. Finally, the proposed method is applied to identify a set of genes in both simulated and real DNA microarray gene expression data.

The remainder of the paper is organized as follows. In the next section, we provide a brief review of positive definite kernels, kernel CCA and the IF of kernel CCA. The utility of the proposed method is demonstrated by both simulated data and a real colon cancer microarray data analysis in Section 3. In Section 4, we summarize our findings and give a perspective for future research.

2 Method

2.1 Positive definite kernel

In kernel methods, a nonlinear feature map is defined by a positive definite kernel. It is known (Aronszajn, 1950) that a positive definite kernel $k$ on a space $\mathcal{X}$ is associated with a Hilbert space $\mathcal{H}_k$, called the reproducing kernel Hilbert space (RKHS), consisting of functions on $\mathcal{X}$ so that the function value is reproduced by the kernel. For any function $f \in \mathcal{H}_k$ and a point $x \in \mathcal{X}$, the function value is $f(x) = \langle f, k(\cdot, x)\rangle_{\mathcal{H}_k}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ is the inner product of $\mathcal{H}_k$; this is called the reproducing property. Replacing $f$ with $k(\cdot, \tilde{x})$ yields $k(\tilde{x}, x) = \langle k(\cdot, \tilde{x}), k(\cdot, x)\rangle_{\mathcal{H}_k}$ for any $x, \tilde{x} \in \mathcal{X}$. A symmetric kernel $k(\cdot, \cdot)$ defined on a space $\mathcal{X}$ is called positive definite if, for an arbitrary number of points $x_1, \ldots, x_n \in \mathcal{X}$, the Gram matrix $(k(x_i, x_j))_{i,j}$ is positive semi-definite. To transform data for extracting nonlinear features, the mapping $\Phi: \mathcal{X} \to \mathcal{H}_k$ is defined as

$$\Phi(x) = k(\cdot, x),$$

which is a function of the first argument. This map is called the feature map, and the vector $\Phi(x)$ in $\mathcal{H}_k$ is called the feature vector. The inner product of two feature vectors is then $\langle \Phi(x), \Phi(\tilde{x})\rangle_{\mathcal{H}_k} = k(x, \tilde{x})$. This is known as the kernel trick. By this trick the kernel can evaluate the inner product of any two feature vectors efficiently without knowing an explicit form of $\Phi$ (Hofmann et al., 2008; Alam & Fukumizu, 2014; Charpiat et al.).
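As a concrete illustration, the following base-R sketch computes a Gaussian kernel Gram matrix with the median-of-pairwise-distances bandwidth (the heuristic mentioned in Section 4) and the centered Gram matrix used below for kernel CCA. The function names and the toy data are our own illustrative choices, not code from the paper.

```r
# Minimal sketch (illustrative): Gaussian kernel Gram matrix with the
# median-heuristic bandwidth, and Gram-matrix centering.

gaussian_gram <- function(X, sigma = NULL) {
  # X: n x p data matrix (rows are samples)
  D <- as.matrix(dist(X))                # pairwise Euclidean distances
  if (is.null(sigma)) {
    sigma <- median(D[upper.tri(D)])     # median heuristic bandwidth
  }
  exp(-D^2 / (2 * sigma^2))              # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
}

center_gram <- function(K) {
  # M = C K C with C = I - (1/n) 1 1^T
  n <- nrow(K)
  C <- diag(n) - matrix(1 / n, n, n)
  C %*% K %*% C
}

set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20)    # 20 samples, 5 features
K <- gaussian_gram(X)
M <- center_gram(K)
```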

2.2 Kernel canonical correlation analysis

Kernel CCA has been proposed as a nonlinear extension of linear CCA (Akaho, 2001). Researchers have extended the standard kernel CCA with an efficient computational algorithm (Bach & Jordan, 2002). Over the last decade, kernel CCA has been used for various tasks (Alzate & Suykens, 2008; Huang et al., 2009; Richfield et al., 2017; Alam & Fukumizu, 2015). Given two sets of random variables $X$ and $Y$ and two functions in the RKHSs, $f \in \mathcal{H}_X$ and $g \in \mathcal{H}_Y$, the optimization problem for the random variables $f(X)$ and $g(Y)$ is

$$\rho = \sup_{\substack{f \in \mathcal{H}_X,\, g \in \mathcal{H}_Y \\ f \neq 0,\, g \neq 0}} \frac{\mathrm{Cov}[f(X), g(Y)]}{\mathrm{Var}[f(X)]^{1/2}\, \mathrm{Var}[g(Y)]^{1/2}}. \qquad (1)$$

The optimizing functions $f$ and $g$ are determined up to scale.

Using a finite sample, we are able to estimate the desired functions. Given an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from a joint distribution $F_{XY}$, by taking the inner product with elements or "parameters" in the RKHS, we have features $f(\cdot) = \sum_{i=1}^{n} a_i k_X(\cdot, X_i)$ and $g(\cdot) = \sum_{i=1}^{n} b_i k_Y(\cdot, Y_i)$, where $k_X$ and $k_Y$ are the associated kernel functions for $\mathcal{H}_X$ and $\mathcal{H}_Y$, respectively. The kernel Gram matrices are defined as $\mathbf{K}_X := (k_X(X_i, X_j))_{i,j=1}^{n}$ and $\mathbf{K}_Y := (k_Y(Y_i, Y_j))_{i,j=1}^{n}$. We need the centered kernel Gram matrices $\mathbf{M}_X = \mathbf{C}\mathbf{K}_X\mathbf{C}$ and $\mathbf{M}_Y = \mathbf{C}\mathbf{K}_Y\mathbf{C}$, where $\mathbf{C} = \mathbf{I}_n - \frac{1}{n}\mathbf{B}_n$ with $\mathbf{B}_n = \mathbf{1}_n\mathbf{1}_n^{\top}$ and $\mathbf{1}_n$ is the vector with $n$ ones. The empirical estimate of Eq. (1) is then given by

$$\hat{\rho} = \sup_{\mathbf{a},\, \mathbf{b} \in \mathbb{R}^n} \frac{\widehat{\mathrm{Cov}}[f(X), g(Y)]}{\widehat{\mathrm{Var}}[f(X)]^{1/2}\, \widehat{\mathrm{Var}}[g(Y)]^{1/2}},$$

where

$$\widehat{\mathrm{Cov}}[f(X), g(Y)] = \frac{1}{n}\, \mathbf{a}^{\top} \mathbf{M}_X \mathbf{M}_Y \mathbf{b}, \qquad \widehat{\mathrm{Var}}[f(X)] = \frac{1}{n}\, \mathbf{a}^{\top} \mathbf{M}_X^2 \mathbf{a}, \qquad \widehat{\mathrm{Var}}[g(Y)] = \frac{1}{n}\, \mathbf{b}^{\top} \mathbf{M}_Y^2 \mathbf{b},$$

where $\mathbf{a}$ and $\mathbf{b}$ are the directions of $f$ and $g$, respectively. In practice a small regularization term is added to the variance estimates to make the problem well-posed.
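A minimal base-R sketch of the empirical problem above, solved as an eigenvalue problem on the centered Gram matrices. The small ridge term kappa (standard in regularized kernel CCA) and all names are our own illustrative choices, not the paper's implementation.

```r
# Illustrative sketch of (regularized) kernel CCA on centered Gram matrices
# Mx and My; returns the first canonical correlation and its directions a, b.
kernel_cca <- function(Mx, My, kappa = 1e-3) {
  n  <- nrow(Mx)
  Rx <- Mx %*% Mx + kappa * diag(n)                # regularized variance term for f
  Ry <- My %*% My + kappa * diag(n)                # regularized variance term for g
  A  <- solve(Rx, Mx %*% My) %*% solve(Ry, My %*% Mx)
  ea <- eigen(A)
  a  <- Re(ea$vectors[, 1])                        # direction of f
  b  <- as.vector(Re(solve(Ry, My %*% Mx %*% a)))  # direction of g (up to scale)
  list(rho = sqrt(max(Re(ea$values[1]), 0)), a = a, b = b)
}

# Example with the helpers sketched in Section 2.1:
# Mx <- center_gram(gaussian_gram(X)); My <- center_gram(gaussian_gram(Y))
# fit <- kernel_cca(Mx, My)
```

In this construction the largest eigenvalue of the matrix A equals the squared first canonical correlation, and the remaining eigenvectors give the higher-order canonical pairs.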

2.3 Influence function of the kernel canonical correlation analysis

By using the IFs of kernel PCA, linear PCA and linear CCA, we can derive the IF of kernel CCA, that is, of the kernel canonical correlation (kernel CC) and the kernel canonical variates (kernel CVs).

Theorem 2.1

Given two sets of random variables $(X, Y)$ having the joint distribution $F_{XY}$, and the $j$-th kernel CC ($\rho_j$) and kernel CVs ($f_j$ and $g_j$), the influence functions of the kernel CC and kernel CVs at a point $(\tilde{x}, \tilde{y})$ are

(2)

The above theorem is proved on the basis of previously established results: the IF of linear PCA (Tanaka, 1988, 1989), the IF of linear CCA (Romanazzi, 1992), and the IF of kernel PCA (Debruyne et al., 2010). The detailed proof is given in Alam et al. (2018).

Using the above result, we can establish some properties of kernel CCA: robustness, asymptotic consistency and its standard error. In addition, we are able to identify a set of genes based on the influence of the data.

For sample data, let $(X_i, Y_i)_{i=1}^{n}$ be a sample from the empirical joint distribution $F_n$. The EIF (the IF evaluated at the empirical distribution) of the kernel CC and kernel CVs is computed at every data point $(X_i, Y_i)$, $i = 1, \ldots, n$, giving an empirical influence value of each observation on the kernel CC and on the two kernel CVs, respectively.

For bounded kernels, the IFs defined in Theorem 2.1 have three desirable robustness properties: bounded gross error sensitivity, bounded local shift sensitivity, and a rejection point. For unbounded kernels, such as the linear or polynomial kernels, the IFs are not bounded.
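Since the closed form of Eq. (2) is not reproduced here, the following sketch uses a simple case-deletion (leave-one-out) change in the first kernel canonical correlation as a rough proxy for the empirical influence of each observation. This is our own illustrative stand-in, not the analytic IF of Theorem 2.1, and it reuses the kernel_cca sketch from Section 2.2.

```r
# Case-deletion proxy for the empirical influence of each observation on the
# first kernel canonical correlation: |rho(without i) - rho(full sample)|.
# Kx, Ky are (uncentered) Gram matrices on the same n observations.
loo_influence <- function(Kx, Ky, kappa = 1e-3) {
  center <- function(K) {
    n <- nrow(K)
    C <- diag(n) - matrix(1 / n, n, n)
    C %*% K %*% C
  }
  rho_full <- kernel_cca(center(Kx), center(Ky), kappa)$rho
  sapply(seq_len(nrow(Kx)), function(i) {
    rho_i <- kernel_cca(center(Kx[-i, -i]), center(Ky[-i, -i]), kappa)$rho
    abs(rho_i - rho_full)
  })
}
```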

3 Experiments

To demonstrate the performance of the proposed method in comparison with three popular gene selection methods (T-test, SAM and LIMMA), we used both simulated and real microarray gene expression datasets. The competing methods were run with the R packages stats, samr and limma, and the performance measure AUC was computed for each method using the ROC package. All R packages are available from the Comprehensive R Archive Network (CRAN) or Bioconductor.
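For orientation, here is a sketch of how the gene ranking and AUC computation could be wired up for two of the competitors. The object names (exprs_mat, group, is_de) are ours, SAM (via the samr package) is omitted for brevity, and the AUC is computed with a rank formula rather than the ROC package.

```r
library(limma)

# exprs_mat: p x n expression matrix; group: two-level factor of length n;
# is_de: logical vector of length p marking the truly DE genes (simulation truth).
# These names are illustrative, not from the paper.

# Gene-wise two-sample t-test statistics (stats package)
t_stat <- apply(exprs_mat, 1, function(x) abs(t.test(x ~ group)$statistic))

# Moderated t-statistics from limma
design     <- model.matrix(~ group)
fit        <- eBayes(lmFit(exprs_mat, design))
limma_stat <- abs(fit$t[, 2])

# Rank-based (Mann-Whitney) AUC of a score against the true DE labels
auc <- function(score, truth) {
  r  <- rank(score)
  n1 <- sum(truth)
  n0 <- sum(!truth)
  (sum(r[truth]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

c(t_test = auc(t_stat, is_de), limma = auc(limma_stat, is_de))
```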

3.1 Simulation study

To investigate the performance of the proposed method in comparison with the three popular methods mentioned above for k = 2 groups, we considered gene expression profiles generated from both the normal distribution and the t-distribution. We also considered datasets for both small- and large-sample cases with different percentages of differentially expressed (DE) genes.

3.2 Simulated gene expression profiles generated from the normal distribution

The following one-way ANOVA model was used to generate simulated datasets from the normal distribution:

$$y_{gik} = \mu_{gk} + \epsilon_{gik}, \qquad (3)$$

where $y_{gik}$ is the expression of the $g$-th gene for the $i$-th sample in the $k$-th group, $\mu_{gk}$ is the mean of all expressions of the $g$-th gene in the $k$-th group, and $\epsilon_{gik}$ is a random error term that follows a normal distribution with mean zero and variance $\sigma_g^2$.
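To make Eq. (3) concrete, here is a sketch of how one simulated dataset could be generated. The effect size delta and variance sigma2 are placeholders, since the exact values used in the paper are not reproduced in the text above.

```r
# Generate one simulated dataset under the two-group model of Eq. (3):
# p genes, n1/n2 samples per group, a fraction pDEG of genes differentially
# expressed. The effect size `delta` and variance `sigma2` are placeholders.
simulate_expr <- function(p = 1000, n1 = 3, n2 = 3, pDEG = 0.02,
                          delta = 2, sigma2 = 1) {
  n_de  <- round(p * pDEG)
  is_de <- c(rep(TRUE, n_de), rep(FALSE, p - n_de))
  mu1   <- rep(0, p)                      # group-1 means
  mu2   <- ifelse(is_de, delta, 0)        # group-2 means shifted for DE genes
  X1 <- matrix(rnorm(p * n1, mean = mu1, sd = sqrt(sigma2)), p, n1)
  X2 <- matrix(rnorm(p * n2, mean = mu2, sd = sqrt(sigma2)), p, n2)
  list(exprs = cbind(X1, X2),
       group = factor(rep(c(1, 2), c(n1, n2))),
       is_de = is_de)
}

dat <- simulate_expr()
```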

To investigate the performance of the proposed method in comparison with the other three popular methods mentioned earlier for k = 2 groups, we generated datasets using 100 simulation runs for both the small- and large-sample cases using Eq. (3). The means of the two groups and the common variance were fixed at preset values. Each dataset represented the gene expression profiles of p = 1000 genes, with n1 = n2 = 3 samples per group in the small-sample case and n1 = n2 = 15 in the large-sample case. The proportions of DE genes (pDEG) were set to 0.02 and 0.06 for each of the datasets. We computed average values of the performance measures TPR, TNR, FPR, FNR, MER, FDR and AUC based on the 20 (pDEG = 0.02) and 60 (pDEG = 0.06) estimated DE genes obtained by the four methods (T-test, SAM, LIMMA and Proposed) for each dataset. Fig. 1a and Fig. 1b show the ROC curves based on the estimated DE genes for the four methods in the small- and large-sample cases, respectively. From this figure we observe that the proposed method performed better than the other three methods in the small-sample case (see Fig. 1a). On the other hand, in the large-sample case (see Fig. 1b) the proposed method performs almost equally with the other three methods (T-test, SAM and LIMMA). Fig. 2 shows the boxplots of AUC values based on the 100 simulated datasets for each of the four methods in both the small- and large-sample cases; Fig. 2a and Fig. 2b correspond to pDEG = 0.02 and pDEG = 0.06, respectively. From these boxplots we obtain results similar to the ROC curves for every pDEG value, and we also notice that the performance of all methods increases when the pDEG value is increased to 0.06. Furthermore, we estimated the average values of the performance measures TPR, TNR, FPR, FNR, MER, FDR and AUC based on the 20 (pDEG = 0.02) and 60 (pDEG = 0.06) estimated DE genes for each method. The results are summarized in Table 1, where the values outside and inside the brackets are the averages for the small- and large-sample cases, respectively. Table 1 leads to interpretations similar to those of the ROC curves and boxplots.
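Given the genes selected by a method and the simulation truth, the tabulated measures follow directly from the confusion counts; a small sketch with our own naming (the expansion of MER as misclassification error rate is our reading of the abbreviation):

```r
# Compute TPR, TNR, FPR, FNR, MER and FDR from a set of selected genes
# against the true DE labels; `selected` and `is_de` are logical vectors.
perf_measures <- function(selected, is_de) {
  TP <- sum(selected & is_de);  FP <- sum(selected & !is_de)
  FN <- sum(!selected & is_de); TN <- sum(!selected & !is_de)
  c(TPR = TP / (TP + FN),
    TNR = TN / (TN + FP),
    FPR = FP / (FP + TN),
    FNR = FN / (FN + TP),
    MER = (FP + FN) / (TP + FP + TN + FN),   # misclassification error rate
    FDR = FP / max(TP + FP, 1))
}
```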

Figure 1: Performance evaluation using ROC curves produced by the four methods (T-test, SAM, LIMMA and Proposed) based on 100 datasets with pDEG = 0.02. Datasets were generated from the normal distribution for (a) and (b) and from the t-distribution for (c) and (d), where (a) and (c) represent the ROC curves for the small-sample case (n1 = n2 = 3) and (b) and (d) represent the ROC curves for the large-sample case (n1 = n2 = 15).
With proportion of DE genes (pDEG) = 0.02:

| Method | TPR | TNR | FPR | FNR | MER | FDR | AUC |
|---|---|---|---|---|---|---|---|
| T-test | 0.702 (0.932) | 0.006 (0.001) | 0.994 (0.999) | 0.298 (0.068) | 0.012 (0.003) | 0.298 (0.068) | 0.702 (0.932) |
| SAM | 0.775 (0.935) | 0.005 (0.001) | 0.995 (0.999) | 0.225 (0.065) | 0.009 (0.003) | 0.225 (0.065) | 0.775 (0.935) |
| LIMMA | 0.810 (0.935) | 0.004 (0.001) | 0.996 (0.999) | 0.190 (0.065) | 0.008 (0.003) | 0.190 (0.065) | 0.810 (0.935) |
| Proposed | 0.890 (0.935) | 0.002 (0.001) | 0.998 (0.999) | 0.110 (0.050) | 0.004 (0.002) | 0.110 (0.050) | 0.890 (0.950) |

With proportion of DE genes (pDEG) = 0.06:

| Method | TPR | TNR | FPR | FNR | MER | FDR | AUC |
|---|---|---|---|---|---|---|---|
| T-test | 0.772 (0.933) | 0.012 (0.004) | 0.988 (0.996) | 0.228 (0.067) | 0.023 (0.007) | 0.228 (0.067) | 0.771 (0.933) |
| SAM | 0.810 (0.933) | 0.010 (0.004) | 0.990 (0.996) | 0.190 (0.067) | 0.019 (0.007) | 0.190 (0.067) | 0.809 (0.933) |
| LIMMA | 0.823 (0.933) | 0.009 (0.004) | 0.991 (0.996) | 0.177 (0.067) | 0.018 (0.007) | 0.177 (0.067) | 0.823 (0.933) |
| Proposed | 0.911 (0.959) | 0.005 (0.002) | 0.995 (0.996) | 0.089 (0.041) | 0.009 (0.004) | 0.089 (0.041) | 0.911 (0.933) |

Table 1: Performance evaluation of different methods based on simulated gene expression datasets generated from the normal distribution. Values outside the brackets are averages for the small-sample case; values inside the brackets are for the large-sample case.

3.3 Simulated gene expression profiles generated from the t-distribution

We also investigated the performance of the proposed method in comparison with the other three methods (T-test, SAM and LIMMA) for the non-normal case; accordingly, we generated 100 simulated datasets from the t-distribution with 10 degrees of freedom. We set the means and variance as before. We estimated the performance measures TPR, TNR, FPR, FNR, MER, FDR and AUC based on the 20 estimated DE genes obtained by the four methods for each of the 100 datasets. The average values of the performance measures are summarized in Table 2. From this table we notice that the performance of all the methods deteriorated when the datasets came from the t-distribution. We also observe that the proposed method performed better than the other three methods (T-test, SAM and LIMMA). For example, the proposed method produces AUC = 0.469 (0.887), which is larger than 0.316 (0.830), 0.326 (0.832) and 0.411 (0.880) for the competitors T-test, SAM and LIMMA, respectively. The boxplots in Fig. 3 and the ROC curves in Fig. 1(c-d) reveal results similar to Table 2. We also notice from the boxplots that the proposed method has less variability than the other three methods. From this analysis we may conclude that the proposed method outperforms the three well-known gene selection methods.
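The same generator sketched in Section 3.2 can be reused for this setting by swapping the normal errors for scaled t errors. A small sketch (our own, with the scaling chosen so the error variance stays sigma2):

```r
# t-distributed errors with df = 10, rescaled so that Var(error) = sigma2,
# to be used in place of the rnorm() calls in simulate_expr() above.
rt_errors <- function(n, sigma2, df = 10) {
  sqrt(sigma2 * (df - 2) / df) * rt(n, df = df)
}
```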

Figure 2: Performance evaluation using boxplots of AUC values produced by the four methods (T-test, SAM, LIMMA and Proposed) based on 100 datasets generated from the normal distribution for the small- and large-sample cases. (a) Boxplot of AUC values with proportion of DE genes = 0.02. (b) Boxplot of AUC values with proportion of DE genes = 0.06. Each dataset contains p = 1000 genes.
With proportion of DE genes (pDEG) = 0.02:

| Method | TPR | TNR | FPR | FNR | MER | FDR | AUC |
|---|---|---|---|---|---|---|---|
| T-test | 0.318 (0.830) | 0.014 (0.003) | 0.986 (0.997) | 0.682 (0.170) | 0.027 (0.007) | 0.682 (0.170) | 0.316 (0.830) |
| SAM | 0.328 (0.832) | 0.014 (0.003) | 0.986 (0.997) | 0.672 (0.168) | 0.027 (0.007) | 0.672 (0.168) | 0.326 (0.832) |
| LIMMA | 0.412 (0.880) | 0.012 (0.002) | 0.988 (0.998) | 0.588 (0.120) | 0.024 (0.005) | 0.588 (0.120) | 0.411 (0.880) |
| Proposed | 0.470 (0.888) | 0.011 (0.002) | 0.988 (0.998) | 0.530 (0.112) | 0.021 (0.004) | 0.530 (0.112) | 0.469 (0.887) |

Table 2: Performance evaluation of different methods based on simulated gene expression datasets generated from the t-distribution.
Figure 3: Performance evaluation using boxplots of AUC values produced by the four methods (T-test, SAM, LIMMA and Proposed) based on 100 datasets generated from the t-distribution for the small- and large-sample cases. (a) Boxplot of AUC values with proportion of DE genes = 0.02. (b) Boxplot of AUC values with proportion of DE genes = 0.06. Each dataset contains p = 1000 genes.

3.4 Application to colon cancer microarray data

The data consist of expression levels of 2000 genes obtained from a microarray study on 62 colon tissue samples collected from colon-cancer patients. Among the 62 tissue samples, 40 are tumor tissues (coded 2) and 22 are normal tissues (coded 1) (Alon et al., 1999). The goal here is to characterize the underlying interactions between genetic markers for their association with colon-cancer patients and healthy persons.

To calculate the influence value of each gene, we used three methods: PCOut, linear CCA (LCCAOut) and the proposed kernel CCA based method (KCCAOut). Figure 4 visualizes the absolute influence values for the 2000 genes. Applying a univariate outlier detection technique to the one-dimensional influence values of each method, we obtained a subset of genes from each method; the proposed KCCAOut method selected 210 genes. To compare the selected genes, we made a Venn diagram of the genes selected by the three methods. Figure 5 presents the Venn diagram of the PCOut, LCCAOut and KCCAOut methods, showing the numbers of genes selected exclusively by each method. The numbers of common genes between PCOut and LCCAOut, PCOut and KCCAOut, and LCCAOut and KCCAOut are 7, 1, and 61, respectively. All three methods selected four common genes: J00231, T57780, M94132 and M87789.
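The text does not spell out which univariate outlier rule was applied to the influence values; one common choice, sketched here purely as an assumption, is a robust z-score cutoff:

```r
# Flag genes whose absolute influence value is an outlier under a robust
# z-score rule (|value - median| > 3 MAD). The rule and the cutoff of 3 are
# our assumptions; the paper does not state the exact criterion used.
flag_influential <- function(infl, cutoff = 3) {
  z <- (infl - median(infl)) / mad(infl)
  which(abs(z) > cutoff)
}

# selected <- flag_influential(abs_influence_values)  # indices of selected genes
```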

Figure 4: The absolute influence values of the genes using three methods: principal component analysis (PCOut), linear canonical correlation analysis (LCCAOut), and kernel canonical correlation analysis (KCCAOut).

Figure 5: The Venn diagram of the selected genes using three methods: principal component analysis (PCOut), linear canonical correlation analysis (LCCAOut), and kernel canonical correlation analysis (KCCAOut).

Genes do not function alone; rather, they interact with each other. When genes share a similar set of GO annotation terms, they are most likely to be involved in similar biological mechanisms. To verify this, we extracted the gene-gene networks using STRING (Szklarczyk et al., 2007). STRING imports protein association knowledge from databases of both physical interactions and curated biological pathways. In STRING, the simple interaction unit is the functional relationship between two proteins/genes that can contribute to a common biological purpose. Figure 6 shows the gene-gene network based on the protein interactions among the selected genes. In this figure, the color saturation of the edges represents the confidence score of a functional association. Further network analysis reports the number of nodes, the number of edges, the average node degree, the clustering coefficient and the PPI enrichment p-value of the network of selected genes. This network has significantly more interactions than expected, which indicates that the selected genes may function in a concerted effort.

Figure 6: The network of the selected genes.

4 Concluding remarks

Kernel based methods provide more powerful and reproducible outputs, but the interpretation of the results remains challenging. Incorporating biological knowledge (e.g., GO annotations) can provide additional evidence for the results. The performance of the proposed method was evaluated on both simulated and real data. The extensive simulation studies show the power gain of the proposed method relative to the alternative methods.

The utility of the proposed method is further demonstrated with the application to colon cancer microarray data. According to the influence values, the proposed method is able to rank the influence of each gene, and the genes so identified are highly related to the disease. Using an outlier detection method, the proposed method extracts 210 genes out of the 2000 genes, which are considered to have a significant impact on the patients. By conducting gene ontology, pathway and network analyses, including visualization, we find evidence that the selected genes have a significant influence on the manifestation of colon cancer and can serve as distinct features for classifying colon cancer patients from healthy controls.

Although the Gaussian kernel has a free parameter (the bandwidth), in this study we used the median of the pairwise distances as the bandwidth, which appears to be practical. It must be emphasized that choosing a suitable kernel is indispensable and will be addressed in future work.

Acknowledgments

The authors wish to thank the University Grants Commission of Bangladesh for support.

References

  • Akaho, S. (2001). A kernel method for canonical correlation analysis. In Proceedings of the International Meeting of the Psychometric Society, 35, 321-377.
  • Alam, M.A., Calhoun, V., & Wang, Y.P. (2016). Influence function of multiple kernel canonical analysis to identify outliers in imaging genetics data. In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 210-219).
  • Alam, M.A., & Fukumizu, K. (2013). Higher-order regularized kernel CCA. In 12th International Conference on Machine Learning and Applications, 374-377.
  • Alam, M.A., & Fukumizu, K. (2014). Hyperparameter selection in kernel principal component analysis. Journal of Computer Science, 10(7), 1139-1150.
  • Alam, M.A., & Fukumizu, K. (2015). Higher-order regularized kernel canonical correlation analysis. International Journal of Pattern Recognition and Artificial Intelligence, 29(4), 1551005 (1-24).
  • Alam, M.A., Fukumizu, K., & Wang, Y.P. (2018). Influence function and robust variant of kernel canonical correlation analysis. Neurocomputing, 304, 12-29.
  • Alam, M.A., Nasser, M., & Fukumizu, K. (2008). Sensitivity analysis in robust and kernel canonical correlation analysis. In 11th International Conference on Computer and Information Technology, Bangladesh, IEEE, 399-404.
  • Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., & Levine, A.J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745-6750.
  • Alzate, C., & Suykens, J.A.K. (2008). A regularized kernel CCA contrast function for ICA. Neural Networks, 21, 170-181.
  • Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337-404.
  • Bach, F.R., & Jordan, M.I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1-48.
  • Castellanos-Garzón, J., & Romos, J.
  • Charpiat, G., Hofmann, M., & Schölkopf, B. Kernel methods in medical imaging. In Handbook of Biomedical Imaging (pp. 63-81). Berlin, Germany: Springer.
  • Chen, X., & Ishwaran, H. (2012). Random forests for genomic data analysis. Genomics, 99, 323-329.
  • Debruyne, M., Hubert, M., & Horebeek, J. (2010). Detecting influential observations in kernel PCA. Computational Statistics and Data Analysis, 54, 3007-3019.
  • Filzmoser, P., Maronna, R., & Werner, M. (2008). Outlier identification in high dimensions. Computational Statistics & Data Analysis, 52, 1694-1711.
  • Hastie, T., Tibshirani, R., Eisen, M.B., Alizadeh, A., Levy, R., Staudt, L., ... Brown, P. (2000). "Gene shaving" as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 1(2).
  • Hofmann, T., Schölkopf, B., & Smola, A.J. (2008). Kernel methods in machine learning. The Annals of Statistics, 36, 1171-1220.
  • Huang, S.Y., Lee, M., & Hsiao, C. (2009). Nonlinear measures of association with kernel canonical correlation analysis and applications. Journal of Statistical Planning and Inference, 139, 2162-2174.
  • Richfield, O., Alam, M.A., Calhoun, V., & Wang, Y.P. (2017). Learning schizophrenia imaging genetics data via multiple kernel canonical correlation analysis. In Proceedings of the 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2016), Shenzhen, China, 5507-5011.
  • Romanazzi, M. (1992). Influence in canonical correlation analysis. Psychometrika, 57(2), 237-259.
  • Ruan, L., & Yuan, M. (2011). An empirical Bayes' approach to joint analysis of multiple microarray gene expression studies. Biometrics, 67, 1617-1626.
  • Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., ... von Mering, C. (2007). STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43, 531-543.
  • Tanaka, Y. (1988). Sensitivity analysis in principal component analysis: influence on the subspace spanned by principal components. Communications in Statistics - Theory and Methods, 17(9), 3157-3175.
  • Tanaka, Y. (1989). Influence functions related to eigenvalue problem which appear in multivariate analysis. Communications in Statistics - Theory and Methods, 18(11), 3991-4010.