1 Introduction
In recent years, a growing number of studies attempt to link clinical outcomes, such as cancer and other diseases, with gene expression or other types of profiling data. It is of great interest to develop new computational methods to predict disease outcomes based on profiling datasets that contain tens of thousands of variables. The major challenges in these data lie in the heterogeneity of the samples and the sample size being much smaller than the number of predictors (genes), i.e. the n ≪ p issue, as well as the complex correlation structure among the predictors. Thus the prediction task has been formulated as a classification problem combined with selection of predictors, solved by modern machine learning algorithms such as regression-based methods [31, 1, 38, 25, 5] and neural networks [6]. While these methods aim at accurate classification performance, major efforts have also been devoted to selecting significant genes that effectively contribute to the prediction [25, 5]. However, such feature selection is based on fitted predictive models and is conducted after parameter estimation, which causes the selection to rely on the classification methods rather than the structure of the feature space itself. Besides building robust predictive models, feature selection also serves another important purpose: the functionality of the selected features (genes) can help unravel the underlying biological mechanisms of the disease outcome.
Given the nature of the data, i.e. functionally associated genes tend to be statistically dependent and to contribute to the biological outcome in a synergistic manner, a branch of gene expression classification research has focused on integrating the relations between genes into classification methods, which helps in terms of both classification performance and learning the structure of the feature space. A critical data source for achieving this goal has been gene networks. A gene network is a graph-structured dataset with genes as the graph vertices and their functional relations as graph edges. The functional relations are largely curated from existing biological knowledge [7, 37]. Each vertex in the network corresponds to a predictor in the classification model. Thus, it is expected that the gene network can provide useful information for a learning process in which genes serve as predictors. Motivated by this fact, certain procedures have been developed where gene networks are employed to conduct feature selection prior to classification [8, 39]. Moreover, methods that integrate gene network information directly into classifiers have also been developed. For example, [12] proposes a random forest-based method, where the feature subsampling is guided by graph search on the gene network when constructing decision trees. [42, 27] modify the objective function of the support vector machine with penalty terms defined according to pairwise distances between genes in the network. Similarly, [21] develops a logistic regression-based classifier using regularization, where again a relational penalty term is introduced into the loss function. The authors of these methods have demonstrated that embedding expression data into a gene network results in both better classification performance and more interpretable selected feature sets.
With the clear evidence that gene networks can lead to novel variants of traditional classifiers, we are motivated to incorporate gene networks into deep feedforward networks (DFN), which are closely related to the state-of-the-art technique of deep learning [28]. Although deep learning has constantly been shown to be one of the most powerful tools in classification, its application in bioinformatics is limited [32]. This is due to many reasons, including the n ≪ p issue, the large heterogeneity in cell populations and clinical subject populations, and inconsistent data characteristics across different laboratories, which make it difficult to merge datasets. Consequently, the relatively small number of samples compared to the large number of features in a gene expression dataset obstructs the use of deep learning techniques, whose training process usually requires a large number of samples, as in image classification [35]. Therefore, there is a need to modify deep learning models for disease outcome classification using gene expression data, which naturally leads us to the development of a variant of deep learning models specifically fitting this practical situation with the help of gene networks.
Incorporating gene networks, as relational information in the feature space, into DFN classifiers is a natural option to achieve sparse learning with fewer parameters compared to the usual DFN. However, to the best of our knowledge, little existing work has been done along this track. [4, 16] started the direction of sparse deep neural networks for graph-structured data. The authors developed hierarchical, locally connected network architectures with newly defined convolution operations on graph-structured data. The methods have novel mathematical formulations; however, the applications are yet to be generalized. In both papers, using the benchmark datasets MNIST [29] and ImageNet [35] respectively, the authors treated 2D grid images as a special form of graph-structured data in their experiments. This is based on the fact that an image can be regarded as a graph in which each pixel is a vertex connected with its four neighbors in the four directions. However, graph-structured data can be much more complex in general, as the degree of each vertex can vary widely, and the edges do not have orientations as in image data. For a gene network, the degree of the vertices is power-law distributed as the network is scale-free [24]. In this case, convolution operations are not easy to define. In addition, with tens of thousands of vertices in the graph, applying multiple convolution operations results in a huge number of parameters, which easily leads to overfitting given the small number of training samples. By taking the alternative approach of modifying a usual DFN, our newly proposed graph-embedded DFN can serve as a convenient tool to fill the gap: it avoids overfitting in the n ≪ p scenario and achieves good feature selection using the capabilities of DFN.

The paper is organized as follows: Section 2 reviews usual deep feedforward networks and illustrates our network-embedded architecture. Section 3 compares the performance of our method with two related approaches using synthetic datasets, followed by a real application to a breast cancer dataset in Section 4. Finally, conclusions and discussion are presented in Section 5.
2 Methods
2.1 Deep feedforward networks
A deep feedforward network (DFN; also called a deep neural network (DNN) or multilayer perceptron (MLP)) with L hidden layers has the standard architecture

Z_1 = σ(X W_in + b_1),
Z_{l+1} = σ(Z_l W_l + b_{l+1}),  l = 1, …, L − 1,
ŷ = φ(Z_L W_out + b_{L+1}),

where X is the n × p input data matrix with n samples and p features, y is the outcome vector containing classification labels, θ denotes all the parameters in the model, and Z_1, …, Z_L are hidden neurons with corresponding weight matrices W_in, W_1, …, W_{L−1}, W_out and bias vectors b_1, …, b_{L+1}. The dimensions of the W's and b's depend on the numbers of hidden neurons h_1, …, h_L, as well as the input dimension p and the number of classes for classification problems. In this paper we mainly focus on binary classification problems, hence the elements of y simply take the binary values 0 and 1. σ(·) is the activation function, such as the sigmoid, hyperbolic tangent, or rectifier.
φ(·) is the softmax function converting the values of the output layer into probability predictions, i.e.

φ(z_j) = exp(z_j) / Σ_{k=1}^{K} exp(z_k),  j = 1, …, K,

where K is the number of classes; for binary classification, K = 2.
The parameters to be estimated in this model are all the weights and biases. For a training dataset with true labels y, the model is trained using a stochastic gradient descent (SGD) based algorithm [15] by minimizing the cross-entropy loss function

ℓ(θ) = − Σ_{i=1}^{n} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ],

where θ again denotes all the model parameters and ŷ_i is the fitted value of y_i. More details about DFN can be found in [15].
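As a concrete illustration, the forward pass and the cross-entropy loss above can be sketched in a few lines of NumPy. This is a minimal sketch with ReLU hidden layers and a softmax output; the loss is averaged over samples here rather than summed, and the helper names are ours:

```python
import numpy as np

def relu(z):
    # rectifier activation, applied element-wise
    return np.maximum(0.0, z)

def softmax(z):
    # subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dfn_forward(X, weights, biases):
    """Forward pass of a fully connected DFN: ReLU hidden layers,
    softmax output converting scores to class probabilities."""
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)
    return softmax(a @ weights[-1] + biases[-1])

def cross_entropy(y, y_hat, eps=1e-12):
    """Average cross-entropy for integer labels y against predicted
    class probabilities y_hat (one row per sample)."""
    n = len(y)
    return -np.log(y_hat[np.arange(n), y] + eps).mean()
```

Each row of the returned matrix is a probability vector over the classes, so the class prediction is simply its argmax.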
2.2 Graphembedded deep feedforward networks
Our newly proposed DNN model is based on two main assumptions. The first assumption is that neighboring features on a known scale-free feature network, or feature graph (since in this paper we interchangeably discuss feature networks and neural networks, to avoid confusion the equivalent term “graph” is used to refer to the feature network from now on, while “network” naturally refers to neural networks), tend to be statistically dependent. The second assumption is that only a small number of features are true predictors for the outcome, and the true predictors tend to form cliques in the feature graph. These assumptions have been commonly used and justified in previous works reviewed in Section 1.
To incorporate the known feature graph information into the DNN, we propose the graph-embedded deep feedforward network (GEDFN) model. The key idea is that, instead of letting the input layer and the first hidden layer be fully connected, we embed the feature graph in the first hidden layer so that a fixed, informative sparse connection can be achieved.
Let G = (V, E) be a known graph of the p features, with V the collection of vertices and E the collection of all edges connecting vertices. A common representation of a graph is the corresponding adjacency matrix A. Given a graph with p vertices, the adjacency A is a p × p matrix with

A_{ij} = 1 if vertices i and j are connected, and A_{ij} = 0 otherwise.

In our case A is symmetric since the graph is undirected. Also, we require A_{ii} = 1, meaning each vertex is regarded as connecting to itself.
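As a small illustration, a symmetric adjacency matrix with the required self-connections can be built as follows (a minimal NumPy sketch; the helper name is ours):

```python
import numpy as np

def adjacency(p, edges):
    """p x p symmetric adjacency matrix for an undirected feature graph,
    with self-connections on the diagonal as the model requires."""
    A = np.eye(p, dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1
    return A
```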
Now, to mathematically formulate our idea, we construct the DNN such that the dimension of the first hidden layer Z_1 is the same as the original input, i.e. p, hence the weight matrix W_in has dimension p × p. Between the input layer X and the first hidden layer Z_1, instead of fully connecting the two layers with Z_1 = σ(X W_in + b_1), we have

Z_1 = σ(X (W_in ∘ A) + b_1),

where the operation ∘ is the Hadamard (element-wise) product. Thus, the connections between the first two layers of the feedforward network are “filtered” by the feature graph adjacency matrix. Through this one-to-one correspondence, every feature has its own hidden neuron in the first hidden layer, and a feature can only feed information to the hidden neurons corresponding to features connected to it in the feature graph.
Specifically, let x be any instance (one row) of the input matrix X. In a usual DFN, the first hidden layer of this instance is calculated as

z_1 = σ(x W_in + b_1),

where W_in and b_1 are the weight matrix and bias vector for this layer. Now in our model, each weight w_ij is multiplied by an indicator function, i.e.

z_1 = σ(x (W_in ∘ A) + b_1),  with (W_in ∘ A)_{ij} = w_ij 1(A_{ij} = 1).
Therefore, the feature graph helps achieve sparsity for the connection between the input layer and the first hidden layer.
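A minimal sketch of this graph-embedded first layer, assuming NumPy arrays and a ReLU activation (the function name is ours):

```python
import numpy as np

def graph_embedded_layer(X, W_in, b, A, sigma=lambda z: np.maximum(0.0, z)):
    """First hidden layer of GEDFN: the p x p weight matrix is filtered by
    the adjacency matrix via a Hadamard product, so feature i can feed
    hidden neuron j only when A[i, j] == 1."""
    return sigma(X @ (W_in * A) + b)
```

Because the mask zeroes the weights of non-adjacent feature pairs, perturbing a feature leaves every hidden neuron it is not connected to unchanged, which is exactly the sparsity pattern described above.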
2.3 Detailed model settings
For the choice of activation functions in the DNN, the rectified linear unit (ReLU) [34], with the form (in the scalar case) σ(x) = max(0, x), is employed. This activation has an advantage over the sigmoid and hyperbolic tangent as it can avoid the vanishing gradient problem [17] during training with SGD. To train the DNN model, we choose the Adam optimizer [22], which is the most widely used variant of traditional gradient descent algorithms in deep learning. Also, we use the mini-batch training strategy, by which the optimizer randomly trains a small proportion of the samples in each iteration. Details about the Adam optimizer and mini-batch training can be found in [15, 22].

The classification performance of a DNN model is associated with many hyperparameters, including architecture-related parameters such as the number of layers and the number of hidden neurons in each layer, regularization-related parameters such as the dropout proportion and the penalty scale of regularizers, and training-related parameters such as the learning rate and the batch size. These hyperparameters can be fine-tuned using advanced hyperparameter tuning algorithms such as Bayesian optimization [33]; however, as the hyperparameters are not of primary interest in our work, in later sections we simply tune them using grid search over a feasible hyperparameter space. A visualization of our fine-tuned GEDFN model for the simulation and real data experiments is shown in Fig. 1.
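The grid search mentioned above can be sketched generically as an exhaustive sweep over hyperparameter combinations; the hyperparameter names and values below are illustrative assumptions, not the paper's actual grid:

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustive grid search: evaluate every hyperparameter combination
    and keep the one with the best validation score."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice `evaluate` would train the model with the given setting and return a validation metric such as AUC.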
3 Simulation Experiments
We conduct extensive simulation experiments based on our assumptions about the data. The goal of the experiments is to mimic disease outcome classification using gene expression and network data, and to explore the performance of our new method compared to the usual DFN. Robustness is also tested, as we simulate datasets that do not fully satisfy the main assumptions and examine whether the method can still achieve a reasonable performance (i.e. at least as good as a usual DFN).
3.1 Synthetic data generation
For a given number of features p, we employ the preferential attachment algorithm proposed by [3] to generate a scale-free feature graph. The distance matrix D recording pairwise shortest-path distances among all vertices is then calculated. Next, we transform the distance matrix into a covariance matrix Σ by letting

Σ_{ij} = ρ^{D_{ij}},  0 < ρ < 1.

Here, by convention, the diagonal elements of D are all zeros, meaning the distance between a vertex and itself is zero, so that Σ_{ii} = 1.
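The distance-to-covariance construction might be sketched as below. The Floyd-Warshall helper and the decay constant rho = 0.7 are illustrative assumptions (the text leaves the constant unspecified):

```python
import numpy as np

def shortest_path_lengths(A):
    """All-pairs shortest-path distances on an unweighted graph
    (Floyd-Warshall; fine for small illustrative graphs)."""
    p = A.shape[0]
    D = np.where(A > 0, 1.0, np.inf)
    np.fill_diagonal(D, 0.0)
    for k in range(p):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

def distance_to_covariance(D, rho=0.7):
    """Sigma[i, j] = rho ** D[i, j]: covariance decays with graph distance,
    and the zero diagonal of D gives unit variances. rho = 0.7 is an
    assumed illustrative value."""
    return rho ** D
```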
After simulating the feature graph and obtaining the covariance matrix Σ of the features, we generate multivariate Gaussian samples as the input matrix X, i.e.

x_i ∼ MVN(0, Σ),  i = 1, …, n,

for imitating gene expression data. To generate outcome variables, we first select a subset of features to be the “true” predictors. Following our assumptions mentioned in Section 2.2, we intend to select cliques in the feature graph. Among vertices with relatively high degrees, part of them are randomly selected as “cores”, and part of the neighboring vertices of the cores are also selected. Denoting the number of true predictors as p_0, we sample a set of parameters β = (β_1, …, β_{p_0}) and an intercept β_0 within a certain range. In our experiments, we first sample the β_j's from a uniform distribution over a moderate range, so that the signal will be neither too strong nor too weak. Also, some of the parameters are randomly turned negative, so that we accommodate both positive and negative coefficients. Finally, the outcome variable is generated through a logistic regression model

y_i = 1 if logit^{−1}(β_0 + Σ_j β_j x_{ij}) > t, and y_i = 0 otherwise,

where t is a threshold, the sum runs over the true predictors, and

logit(p) = log( p / (1 − p) )

is the logit function. The inverse logit is equivalent to a binary-class softmax function.
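The outcome-generation step (a logistic model thresholded on the inverse logit) might be sketched as below; the sample sizes, the uniform coefficient range, and the variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_outcome(X, true_idx, beta, intercept, threshold=0.5):
    """Binary outcomes from a logistic model on the true predictors:
    y = 1 when the inverse logit of the linear score exceeds a threshold."""
    eta = intercept + X[:, true_idx] @ beta
    prob = 1.0 / (1.0 + np.exp(-eta))   # inverse logit
    return (prob > threshold).astype(int)

# sketch with assumed sizes and coefficient range
n, p = 200, 50
X = rng.normal(size=(n, p))              # stand-in for MVN(0, Sigma) samples
true_idx = np.array([3, 7, 11, 19])      # indices of the "true" predictors
beta = rng.uniform(0.1, 0.2, size=4) * rng.choice([-1, 1], size=4)  # mixed signs
y = generate_outcome(X, true_idx, beta, intercept=0.0)
```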
Following the above procedure, we simulate a set of synthetic datasets with 5,000 features and 400 samples. We compare our method with the usual DFN and another feature graph-embedded classification method, network-guided forest (NGF) [12], mentioned in Section 1. In gene expression data, the true predictors account for only a small proportion of the features, i.e. the signal-to-noise ratio is extremely low. Taking this aspect into consideration, we examine different numbers of true predictors, i.e. 40, 80, 120, 160, and 200, corresponding to 2, 4, 6, 8 and 10 cores among all the high-degree vertices in the feature graph. However, in reality, the true predictors may not be perfectly distributed in the feature graph as cliques. Instead, some of the true predictors, which we call “singletons”, can be quite scattered. To create this possible circumstance, we simulate three series of datasets with singleton proportions of 0%, 50% and 100% among all the true predictors. Therefore, we investigate three situations where all true predictors are in cliques, half of the true predictors are singletons, and all of the true predictors are scattered in the graph, respectively.
3.2 Simulation results
In our simulation studies, as shown in Fig. 1, the GEDFN had three hidden layers, where the first hidden layer was the graph-adjacency-embedded layer; thus the dimension of its output is the same as the input, namely 5,000. The second and third hidden layers had 64 and 16 hidden neurons respectively, the same as for the usual DFN. The number of first-layer hidden neurons in the usual DFN was tuned using grid search: when the number of true predictors was 40, the first hidden layer had 512 neurons; as we increased the number of true predictors, 1,024 hidden neurons achieved slightly better performance. For each of the data generation settings, 10 datasets were generated, and the GEDFN, DFN, and NGF methods were applied. For each simulated dataset, we randomly split the data into training and testing sets at a 4:1 ratio. The final testing-set classification results were then averaged across the ten datasets, and all classification results were evaluated by the area under the receiver operating characteristic (ROC) curve (AUC).
Table 1 and Fig. 2 show the results of the simulation experiments. Corresponding to the case where the singleton proportion is 0%, Fig. 2(a) shows that the two feature-graph-integrated methods outperformed the DFN with no feature graph information. As the number of true predictors increased, all of the methods performed better, as there were more signals in the feature set. As the singleton proportion increased to 50% (Fig. 2(b)), GEDFN was still the best among the three, while the performance of NGF decreased. In Fig. 2(c), the difference between GEDFN and DFN is not very obvious, since there was essentially no feature graph information with the singleton proportion at 100%; at the same time, NGF performed worse than the neural network methods. It is also noted that with the increase of singleton proportions, the performance of DFN became worse as well. This is because neural network methods inherently tackle correlated features, and the correlation among true predictors decreased as the number of singletons increased. Therefore, although not as directly affected as GEDFN and NGF, increased singleton proportions also deteriorated the performance of DFN.
% singletons            0%                                  50%                                 100%
# true predictors   40     80     120    160    200    40     80     120    160    200    40     80     120    160    200
DFN                 0.836  0.872  0.875  0.902  0.909  0.811  0.858  0.910  0.923  0.923  0.738  0.822  0.865  0.887  0.922
GEDFN               0.876  0.921  0.924  0.944  0.923  0.868  0.879  0.922  0.922  0.939  0.821  0.847  0.896  0.902  0.922
NGF                 0.871  0.890  0.902  0.928  0.913  0.844  0.865  0.896  0.873  0.918  0.786  0.814  0.845  0.894  0.903
In summary, the simulation experiments demonstrated that, compared to the NGF method and the usual DFN, our newly proposed GEDFN model had better classification accuracy in cases where the true predictors were concentrated in cliques of the feature graph. When the number of singletons increased, the feature graph could hardly provide any signal, and the performance of GEDFN declined to the same level as the usual DFN model.
4 Real data application
4.1 Datasets
We applied our GEDFN method to The Cancer Genome Atlas (TCGA) breast cancer (BRCA) RNA-seq dataset [23]. The dataset consists of a gene expression matrix with 20,532 genes for 707 cancer patients, as well as clinical data containing various disease status measurements. The gene network came from the HINT database [10]. In this proof-of-concept study, we were interested in the relation between gene expression and a molecular subtype of breast cancer: the tumor's Estrogen Receptor (ER) status. ER is expressed in more than two-thirds of breast tumors, and plays a critical role in tumor growth [36].
After screening out genes that were not involved in the gene network, a total of 9,211 genes were used as the final feature set in our classification. For each gene, the expression value was Z-score transformed, i.e. the expression value minus the mean across all patients, divided by the standard deviation.
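The per-gene Z-score transformation can be written compactly (a sketch assuming patients in rows and genes in columns):

```python
import numpy as np

def z_transform(expr):
    """Per-gene Z-score: subtract each gene's mean across patients and
    divide by its standard deviation (rows = patients, columns = genes)."""
    return (expr - expr.mean(axis=0)) / expr.std(axis=0)
```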
4.2 Model fitting and evaluation of feature importance
In the analysis of gene expression data, not only are we concerned about the classification result, but it is also of great interest to find features that significantly contribute to the classification, as they can reveal the underlying biological mechanisms. In our example, besides training a well-performing classification model, we also intended to figure out which genes play important roles in prediction. After fitting the GEDFN model, we employed the partial derivative method proposed in [11] to calculate an approximate score for each gene as the variable importance.
The main idea of the partial derivative method is that the importance of a specific variable is reflected by its impact on the prediction when its value changes. The ratio between the change in prediction and the change in the variable is thus the partial derivative of the predicted probability with respect to the variable, i.e. ∂ŷ/∂x_j for gene j. To obtain the partial derivative, one straightforward approximation is to compute it numerically. The procedure is as follows: after fitting the model using the training data and obtaining the testing data predictions, for each gene in the testing dataset, we increase its expression value by a small proportion (say, 5%) in all the testing samples while holding the other gene values unchanged. Rerunning the model on this perturbed testing set, a new prediction is obtained for each sample. Next, the impact of changing the gene value is computed as the difference between the new prediction and the original prediction. Finally, the average ratio between the prediction difference and the change in expression value across samples can serve as the numerical partial derivative. Equation 1 shows a mathematical expression of this numerical partial derivative for gene j:
s_j = (1/n) Σ_{i=1}^{n} (ỹ_i − ŷ_i) / (ζ x_{ij}),   (1)

where ỹ_i and ŷ_i are the perturbed prediction and the original prediction respectively, x_{ij} is the expression value of gene j in sample i, and ζ is the small percentage of the change in the gene expression value. This calculation procedure is repeated for every gene. To compare the effect sizes of genes, we use the absolute value |s_j| as the importance score. A ranked importance list is then obtained by sorting the genes according to their scores.
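A sketch of this numerical partial-derivative procedure, assuming the fitted model is exposed as a `predict` function returning probabilities (the function and parameter names are ours):

```python
import numpy as np

def importance_scores(predict, X_test, zeta=0.05):
    """Numerical partial-derivative importance: perturb each gene by a small
    fraction zeta, re-predict, and average (change in prediction) /
    (change in value) over the test samples; the absolute value is the score."""
    base = predict(X_test)
    scores = np.empty(X_test.shape[1])
    for j in range(X_test.shape[1]):
        X_pert = X_test.copy()
        X_pert[:, j] *= 1.0 + zeta
        delta = zeta * X_test[:, j]
        # guard against division by zero for zero-valued entries
        ratio = (predict(X_pert) - base) / np.where(delta == 0, 1.0, delta)
        scores[j] = np.abs(ratio.mean())
    return scores
```

For a linear model the scores recover the absolute coefficients, which is a quick sanity check on the implementation.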
4.3 Results
Using the HINT network as the feature graph and the same model architecture as in Section 3.2, we tested our GEDFN method on the BRCA data with ten repeated experiments. The average testing AUC is 0.938, indicating that the method is well suited for gene expression-based classification problems. The ranked list of effective genes was obtained by averaging gene importance scores across the 10 repeated analyses. We conducted further functional analyses for the top 5% ranked genes to interpret the feature importance of our model from the biological point of view. We conducted community detection on the network containing the top-ranked genes and their one-step neighbors [9]. Edges were weighted such that edges between top-ranked genes received a weight of 15, while other edges received a weight of 1. Sixteen modules were selected, and the corresponding top gene groups were tested for functionality through GO enrichment analysis [13]. Fig. 3 shows two examples of the 16 modules.
The largest module, shown in Fig. 3a, consists of 34 positive genes, i.e. genes whose increased expression increases the probability of the sample being classified into the “ER+” group, and 27 negative genes, i.e. genes whose increased expression decreases the probability of the sample being classified as “ER+”. Functional analysis using GOstats reveals several key biological processes over-represented by the top-ranked genes in this module. The most significant biological process is regulation of the execution phase of apoptosis. ER is known to be closely related to the regulation of the apoptosis process. In the presence of estrogen, chemotherapy-induced apoptosis is suppressed in ER+ breast cancer cells [2]. It has been shown that siRNA-mediated suppression of protein tyrosine phosphatase (PTP) induces apoptosis in ER− but not in ER+ breast cancer cells [41].
The second biological process found was centrosome localization. Centrosome amplification mediated by estrogen leads to chromosomal instability, which is a critical process in breast oncogenesis [30, 19]. The results indicate that the lack of estrogen receptor in the ER− group is associated with a distinct gene expression pattern in centrosome localization genes.
As shown by the third and fourth biological processes, a number of regulators of phosphorylation, many of which are involved in the mitogen-activated protein kinase (MAPK) cascade, were associated with ER status. The MAPK cascade plays an integral role in estrogen signaling. The Ras-MAPK cascade modulates the activity of ER [20], and MAPK can be activated by estradiol (E2) independently of transcriptional changes [18].
GOBPID      Term                                          P-value   Genes in the GO term
GO:1900117  regulation of execution phase of apoptosis    1.7E-05   598; 2021; 3297; 5062
GO:0051642  centrosome localization                       0.00042   4682; 5048; 23224
GO:0042326  negative regulation of phosphorylation        0.00049   207; 863; 5048; 5062; 8661; 51562; 55333; 57103; 57689; 84962; 374403
GO:0043409  negative regulation of MAPK cascade           0.0015    207; 5048; 8661; 51562; 55333; 374403
GO:0043087  regulation of GTPase activity                 0.0020    2664; 3383; 5048; 6281; 7410; 8787; 9411; 9744; 23616; 27237; 84962; 374403
A number of genes involved in GTPase activity regulation are also transcriptionally associated with ER status. Through the G protein-coupled receptor GPR30, estrogen triggers the activation of the MAPKs Erk1 and Erk2 [14]. At the same time, activation of GPR30 by the receptor-specific agonist G1 induces mitochondria-related apoptosis [40].
The module in Fig. 3b is much smaller. Two of the positive genes (197; 1019) are involved in the insulin receptor signaling pathway, which has been shown to cross-talk with the estrogen signaling pathway in estrogen-dependent breast cancer [26]. Five of the selected genes (1019; 1880; 3265; 5741; 6850) belong to the process “positive regulation of cell proliferation”, three of which (1880; 3265; 6850) are part of the ERK1 and ERK2 cascade, which, as discussed above, is triggered by estrogen through GPR30 [14].
Overall, the real data results based on RNA-seq data and the HINT database clearly demonstrated that the method can achieve good classification performance while selecting biologically relevant genes.
5 Conclusion
We presented a new deep feedforward network classifier that embeds feature graph information. It achieves sparsely connected neural networks by constraining the connections between the input layer and the first hidden layer according to the feature graph. Simulation experiments have shown its higher classification accuracy compared to existing methods, and the real data application demonstrated the utility of the new model.
Acknowledgements
This study was partially funded by National Institutes of Health [grant number R01GM124061]. The authors thank Dr. Hao Wu and Dr. Jian Kang for helpful discussions.
References
 [1] Algamal, Z.Y., Lee, M.H.: Penalized logistic regression with the adaptive lasso for gene selection in high-dimensional cancer classification. Expert Systems with Applications 42(23), 9326–9332 (2015)
 [2] Bailey, S.T., Shin, H., Westerling, T., Liu, X.S., Brown, M.: Estrogen receptor prevents p53-dependent apoptosis in breast cancer. Proc. Natl. Acad. Sci. U.S.A. 109(44), 18060–18065 (Oct 2012)
 [3] Barabási, A.L., Albert, R.: Emergence of scaling in random networks. science 286(5439), 509–512 (1999)
 [4] Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013)
 [5] Cai, Z., Xu, D., Zhang, Q., Zhang, J., Ngai, S.M., Shao, J.: Classification of lung cancer using ensemblebased feature selection and machine learning methods. Molecular BioSystems 11(3), 791–800 (2015)
 [6] Chen, Y.C., Ke, W.C., Chiu, H.W.: Risk classification of cancer survival using ANN with gene expression data from multiple laboratories. Computers in Biology and Medicine 48, 1–7 (2014)
 [7] Chowdhury, S., Sarkar, R.R.: Comparison of human cell signaling pathway databases–evolution, drawbacks and challenges. Database (Oxford) 2015 (2015)
 [8] Chuang, H.Y., Lee, E., Liu, Y.T., Lee, D., Ideker, T.: Network-based classification of breast cancer metastasis. Molecular Systems Biology 3(1), 140 (2007)
 [9] Clauset, A., Newman, M.E., Moore, C.: Finding community structure in very large networks. Physical review E 70(6), 066111 (2004)
 [10] Das, J., Yu, H.: HINT: High-quality protein interactomes and their applications in understanding human disease. BMC Syst Biol 6, 92 (Jul 2012)
 [11] Dimopoulos, Y., Bourret, P., Lek, S.: Use of some sensitivity criteria for choosing networks with good generalization ability. Neural Processing Letters 2(6), 1–4 (1995)
 [12] Dutkowski, J., Ideker, T.: Protein networks as logic functions in development and cancer. PLoS computational biology 7(9), e1002180 (2011)
 [13] Falcon, S., Gentleman, R.: Using GOstats to test gene lists for GO term association. Bioinformatics 23(2), 257–258 (Jan 2007)
 [14] Filardo, E.J., Quinn, J.A., Frackelton, A.R., Bland, K.I.: Estrogen action via the G protein-coupled receptor, GPR30: stimulation of adenylyl cyclase and cAMP-mediated attenuation of the epidermal growth factor receptor-to-MAPK signaling axis. Mol. Endocrinol. 16(1), 70–84 (Jan 2002)
 [15] Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. MIT press (2016)
 [16] Henaff, M., Bruna, J., LeCun, Y.: Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163 (2015)
 [17] Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J., et al.: Gradient flow in recurrent nets: the difficulty of learning longterm dependencies (2001)
 [18] Improta-Brears, T., Whorton, A.R., Codazzi, F., York, J.D., Meyer, T., McDonnell, D.P.: Estrogen-induced activation of mitogen-activated protein kinase requires mobilization of intracellular calcium. Proc. Natl. Acad. Sci. U.S.A. 96(8), 4686–4691 (Apr 1999)
 [19] Jung, Y.S., Chun, H.Y., Yoon, M.H., Park, B.J.: Elevated estrogen receptor α in VHL-deficient condition induces microtubule organizing center amplification via disruption of BRCA1/Rad51 interaction. Neoplasia 16(12), 1070–1081 (Dec 2014)
 [20] Kato, S., Endoh, H., Masuhiro, Y., Kitamoto, T., Uchiyama, S., Sasaki, H., Masushige, S., Gotoh, Y., Nishida, E., Kawashima, H., Metzger, D., Chambon, P.: Activation of the estrogen receptor through phosphorylation by mitogenactivated protein kinase. Science 270(5241), 1491–1494 (Dec 1995)
 [21] Kim, S., Pan, W., Shen, X.: Networkbased penalized regression with application to genomic data. Biometrics 69(3), 582–593 (2013)
 [22] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014), http://arxiv.org/abs/1412.6980
 [23] Koboldt, D.C., Fulton, R.S., McLellan, M.D., Schmidt, H., KalickiVeizer, J., McMichael, J.F., Fulton, L.L., Dooling, D.J., Ding, L., Mardis, E.R., et al.: Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012)
 [24] Kolaczyk, E.D.: Statistical Analysis of Network Data: Methods and Models. Springer Publishing Company, Incorporated, 1st edn. (2009)
 [25] Kursa, M.B.: Robustness of random forest-based gene selection methods. BMC Bioinformatics 15(1), 8 (2014)
 [26] Lanzino, M., Morelli, C., Garofalo, C., Panno, M.L., Mauro, L., Ando, S., Sisci, D.: Interaction between estrogen receptor alpha and insulin/IGF signaling in breast cancer. Curr Cancer Drug Targets 8(7), 597–610 (Nov 2008)
 [27] Lavi, O., Dror, G., Shamir, R.: Network-induced classification kernels for gene expression profile analysis. Journal of Computational Biology 19(6), 694–709 (2012)
 [28] LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
 [29] LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010), http://yann.lecun.com/exdb/mnist/
 [30] Li, J.J., Weroha, S.J., Lingle, W.L., Papa, D., Salisbury, J.L., Li, S.A.: Estrogen mediates Aurora-A overexpression, centrosome amplification, chromosomal instability, and breast cancer in female ACI rats. Proc. Natl. Acad. Sci. U.S.A. 101(52), 18123–18128 (Dec 2004)
 [31] Liang, Y., Liu, C., Luan, X.Z., Leung, K.S., Chan, T.M., Xu, Z.B., Zhang, H.: Sparse logistic regression with an L1/2 penalty for gene selection in cancer classification. BMC Bioinformatics 14(1), 198 (2013)
 [32] Min, S., Lee, B., Yoon, S.: Deep learning in bioinformatics. Briefings in bioinformatics p. bbw068 (2016)
 [33] Mockus, J.: Bayesian approach to global optimization: theory and applications, vol. 37. Springer Science & Business Media (2012)

 [34] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). pp. 807–814 (2010)
 [35] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015)
 [36] Sorlie, T., Tibshirani, R., Parker, J., Hastie, T., Marron, J.S., Nobel, A., Deng, S., Johnsen, H., Pesich, R., Geisler, S., Demeter, J., Perou, C.M., Lønning, P.E., Brown, P.O., Børresen-Dale, A.L., Botstein, D.: Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci. U.S.A. 100(14), 8418–8423 (Jul 2003)
 [37] Szklarczyk, D., Jensen, L.J.: Proteinprotein interaction databases. Methods Mol. Biol. 1278, 39–56 (2015)
 [38] Vanitha, C.D.A., Devaraj, D., Venkatesulu, M.: Gene expression data classification using support vector machine and mutual informationbased gene selection. Procedia Computer Science 47, 13–21 (2015)
 [39] Wei, P., Pan, W.: Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinformatics 24(3), 404–411 (2007)
 [40] Wei, W., Chen, Z.J., Zhang, K.S., Yang, X.L., Wu, Y.M., Chen, X.H., Huang, H.B., Liu, H.L., Cai, S.H., Du, J., Wang, H.S.: The activation of G protein-coupled receptor 30 (GPR30) inhibits proliferation of estrogen receptor-negative breast cancer cells in vitro and in vivo. Cell Death Dis 5, e1428 (Oct 2014)
 [41] Zheng, X., Resnick, R.J., Shalloway, D.: Apoptosis of estrogen-receptor negative breast cancer and colon cancer cell lines by PTP alpha and src RNAi. Int. J. Cancer 122(9), 1999–2007 (May 2008)
 [42] Zhu, Y., Shen, X., Pan, W.: Network-based support vector machine for classification of microarray samples. BMC Bioinformatics 10(1), S21 (2009)