Dropout Feature Ranking for Deep Learning Models

12/22/2017 ∙ by Chun-Hao Chang, et al.

Deep neural networks are a promising technology achieving state-of-the-art results in biological and healthcare domains. Unfortunately, DNNs are notorious for their non-interpretability. Clinicians are averse to black boxes and thus interpretability is paramount to broadly adopting this technology. We aim to close this gap by proposing a new general feature ranking method for deep learning. We show that our method outperforms LASSO, Elastic Net, Deep Feature Selection and various heuristics on a simulated dataset. We also evaluate our method on a multivariate clinical time-series dataset and demonstrate that our ranking rivals or outperforms other methods in a recurrent neural network setting. Finally, we apply our feature ranking to the Variational Autoencoder recently proposed to predict drug response in cell lines and show that it identifies meaningful genes corresponding to the drug response.




1 Introduction

Deep neural networks (DNNs) have emerged as top performers in biology and healthcare, including genomics (Xiong et al., 2015), medical imaging (Esteva et al., 2017), EEG (Rajpurkar et al., 2017) and EHR (Futoma et al., 2017). However, DNNs are black-box models and notorious for their non-interpretability. In the fields of biology and healthcare, to derive hypotheses that could be experimentally verified, it is paramount to provide information about which biological or clinical features are driving the prediction. The desired data may be very expensive to collect, thus it is also important to generate experimental designs that will collect the most effective data leading to the highest accuracy within a reasonable budget. Therefore, there is a strong need for feature ranking for deep learning methods to advance their use in biology and healthcare. We aim to close this gap by proposing a new general feature ranking method for deep learning.

In this work we propose to rank features by variational dropout (Gal et al., 2017). Dropout is an effective technique commonly used to regularize neural networks by randomly removing a subset of hidden node values and setting them to $0$. In this work we use the dropout concept on the input feature layer and optimize the corresponding feature-wise dropout rate. Since each feature is removed stochastically, our method creates a similar effect to feature bagging (Ho, 1995) and manages to rank correlated features better than other non-bagging methods such as LASSO. We compare our method to Random Forest (RF), LASSO, Elastic Net, marginal ranking and several techniques to derive importance in DNNs, such as Deep Feature Selection and various heuristics. We first test it on simulated datasets and show that our method can rank features correctly under non-linear feature interactions, especially among the important features. Then we test it on real-world datasets and show that our method achieves higher performance with the same number of features in a deep neural network. Next, we test it on a multivariate clinical time-series dataset and show that our method also rivals or outperforms other methods in a recurrent neural network setting. Finally, we test our method on a real-world drug response prediction problem using a previously proposed Variational Autoencoder (VAE) (Kingma and Welling, 2013). In this proof-of-concept application, we show that our method identifies genes relevant to the drug response.

2 Related Work

Many previously proposed approaches to interpret DNNs focus on interpreting a decision (such as assigning a particular classification label in an image) for a specific example at hand (e.g. Simonyan et al., 2013; Zeiler and Fergus, 2014; Ribeiro et al., 2016; Zhou et al., 2016; Selvaraju et al., 2016; Shrikumar et al., 2017; Zintgraf et al., 2017; Fong and Vedaldi, 2017; Dabkowski and Gal, 2017). In this case, a method would aim to figure out which parts of a given image make the classifier think that this particular image should be classified as a dog. These methods are unfortunately not easy to use for the purpose of feature selection or ranking, where the importance of the feature should be gleaned across the whole dataset.

Several works use variational dropout to achieve better performance (Gal et al., 2017; Kingma et al., 2015), give a Bayesian interpretation of dropout (Maeda, 2014), or compress the model architecture (Molchanov et al., 2017). These works focus on tuning the dropout rate to automatically get the best performance, but do not consider applying it to feature ranking problems.

Li et al. (2016) proposed Deep Feature Selection (Deep FS). Deep FS adds another hidden layer to the network with one connection per input node to this hidden layer (of the same size as the input) and uses an $\ell_1$ penalty on this layer. The weights of this layer are not constrained to a bounded range after initialization, so they can become large positive or negative values. Thus, this additional layer can amplify a particular input and will need to be balanced within the original network architecture. Additionally, using the $\ell_1$ penalty prevents Deep FS from selecting correlated features, which are important in many biological and health applications.

Finally, several works also targeted interpreting features in a clinical setting. Che et al. (2015) use Gradient-Boosted Trees to mimic a recurrent neural network on a healthcare dataset to achieve comparable performance. Nezhad et al. (2016) interpret the clinical features with an autoencoder and random forest. Suresh et al. (2017) use a recurrent neural network for prediction on a clinical dataset and use the ranking heuristic called ‘Mean’ in our settings. These approaches rely on an additional decision tree architecture to learn the features, or use heuristics which have a weaker ranking performance in our experiments.

3 Methods

3.1 Variational Dropout

Dropout (Srivastava et al., 2014) is one of the most effective and widely used regularization techniques for neural networks. The mechanism is to inject multiplicative Bernoulli noise for each hidden unit within a neural network. Specifically, during the forward pass, for each hidden unit $h$ in a layer, a binary dropout mask $z \in \{0, 1\}$ is sampled. The original hidden node value is then multiplied by this mask, which stochastically sets the hidden node value to $0$ or leaves it as $h$.

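Variational dropout (Maeda, 2014) optimizes the dropout rate $\theta$ as a parameter instead of it being a fixed hyperparameter. For a neural network $f$, given a mini-batch of size $M$ (sampled from a training set of $N$ samples) and a dropout mask $z$, the loss objective function that follows from the variational interpretation of dropout can be written as:

$$\mathcal{L}(\theta) = -\frac{1}{M} \sum_{i=1}^{M} \log p\left(y_i \mid f(x_i, z_i)\right) + \frac{1}{N} \mathrm{KL}\left(q_\theta(z) \,\|\, p(z)\right) \qquad (1)$$

Here, $z_i \sim q_\theta(z)$, where $q_\theta(z)$ is the variational mask distribution and $p(z)$ is a prior distribution.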

Figure 1: Dropout feature ranking diagram. Before training (left), the dropout rate for each feature is initialized to the same value. After training (right), each feature gets a different dropout rate. We then rank all features based on the magnitude of the dropout rate: the lower the magnitude, the higher the rank.

3.2 Feature Ranking Using Variational Dropout

Figure 1 shows our approach. To analyze which features are important for a given pre-trained model $f$ to correctly predict its target variable $y$, we introduce the Dropout Feature Ranking (Dropout FR) method. In our method we add variational dropout regularization to the input layer of $f$. To achieve minimum loss, the Dropout FR model should learn a small dropout rate for features that are important for correct target prediction by the analyzed model $f$, while increasing the dropout rate for the remaining, unimportant features. Specifically, given $D$ features, we set the variational mask distribution to be a fully factorized distribution over the features. This gives us a feature-wise dropout rate $\theta_j$ whose magnitude indicates the importance of feature $j$.

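Instead of having a $\mathrm{KL}$ term as in Equation 1 to regularize the dropout distribution $q_\theta(z)$, we directly penalize the number of existing features (features not dropped out). This avoids the need to set the prior dropout rate and is aligned with the $\ell_0$ penalty for linear regression (Murphy, 2012). Our loss function over a mini-batch of $M$ samples can thus be written as:

$$\mathcal{L}(\theta) = -\frac{1}{M} \sum_{i=1}^{M} \log p\left(y_i \mid f(x_i \odot z_i)\right) + \frac{\lambda}{M} \sum_{i=1}^{M} \sum_{j=1}^{D} z_{ij} \qquad (2)$$

where $z_i \sim q_\theta(z)$ is the mask over the $D$ input features for sample $i$ (with $z_{ij} = 1$ if feature $j$ is kept), $\odot$ is the element-wise product, and $\lambda$ is determined by cross validation.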
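The objective above can be sketched as a Monte Carlo estimate on one mini-batch. This is a minimal illustration under stated assumptions, not the reference implementation: `predict_log_prob` is a hypothetical callable standing in for the pre-trained model's per-sample log-likelihood, `theta` is taken to be the per-feature dropout rate (so the keep probability is `1 - theta`), and the function names are ours.

```python
import numpy as np

def dropout_fr_loss(predict_log_prob, x, y, theta, lam, rng):
    """Monte Carlo estimate of the penalized objective on one mini-batch.

    predict_log_prob(x_masked, y) is a hypothetical callable returning the
    per-sample log p(y | f(x_masked)) of the model being analyzed.
    theta holds the per-feature dropout rates, shape (D,).
    """
    keep_prob = 1.0 - np.asarray(theta, dtype=float)
    # One Bernoulli keep mask per sample and feature (1 = kept, 0 = dropped).
    z = (rng.random(x.shape) < keep_prob).astype(x.dtype)
    nll = -np.mean(predict_log_prob(x * z, y))
    # Penalty: average number of kept features per sample, scaled by lam.
    penalty = lam * np.mean(z.sum(axis=1))
    return nll + penalty, z

def rank_features(theta):
    """Feature indices ordered from lowest to highest dropout rate."""
    return np.argsort(theta)
```

The hard Bernoulli samples used here are not differentiable with respect to `theta`, which is why a continuous relaxation is needed for gradient-based training.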

Concrete relaxation

To optimize w.r.t. the parameters $\theta$, we need to backpropagate through the discrete variable $z$. We adopt the same approach as in Gal et al. (2017) to optimize our dropout rate. Specifically, instead of sampling the discrete Bernoulli variable, we sample from the Concrete distribution (Jang et al., 2016; Maddison et al., 2016) with some temperature $t$, which produces values between $0$ and $1$. This distribution places most of its mass near $0$ and $1$ to approximate the discrete distribution. For a feature with dropout rate $\theta$, the Concrete relaxation $\tilde{z}$ of the Bernoulli keep mask is:

$$\tilde{z} = \mathrm{sigmoid}\left(\frac{1}{t}\left(\log(1 - \theta) - \log \theta + \log u - \log(1 - u)\right)\right)$$

where $u \sim \mathrm{Uniform}(0, 1)$. We fix $t$ to a single value and find it works well in all our experiments. Compared to the traditional REINFORCE estimator (Williams, 1992), the Concrete distribution has lower variance and better performance (data not shown), so we apply it in all our experiments.
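A minimal NumPy sketch of sampling such a relaxed mask is given below. It assumes that `theta` is the dropout rate, so that the relaxed mask is close to $1$ with probability $1 - \theta$; the function name, the clipping constants and the temperature passed in the usage are illustrative choices, not values taken from the paper.

```python
import numpy as np

def concrete_keep_mask(theta, t, rng, shape=None):
    """Sample a relaxed keep mask in [0, 1] for dropout rates theta.

    Assumption: theta is the probability of dropping a feature, so the
    relaxed mask should be near 1 with probability 1 - theta.
    """
    eps = 1e-7
    theta = np.clip(np.asarray(theta, dtype=float), eps, 1.0 - eps)
    size = theta.shape if shape is None else shape
    u = rng.uniform(eps, 1.0 - eps, size=size)
    logit = np.log(1.0 - theta) - np.log(theta) + np.log(u) - np.log(1.0 - u)
    # Clip the scaled logit so that exp() cannot overflow at low temperature.
    scaled = np.clip(logit / t, -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(-scaled))

rng = np.random.default_rng(0)
mask = concrete_keep_mask(np.array([0.1, 0.9]), t=0.1, rng=rng, shape=(5, 2))
```

Because the mask is a smooth function of `theta` for a fixed `u`, an automatic differentiation framework can propagate gradients from the loss to the dropout rates.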


We adopt an annealing trick to avoid the model being penalized heavily before it is fully optimized. Specifically, we increase $\lambda$ linearly from $0$ to its specified value over the first few epochs of optimization. This is similar to the KL annealing trick (Bowman et al., 2015) in the VAE.
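The annealing schedule can be sketched as a small helper. The starting value of $0$ and the function and parameter names are assumptions made for illustration.

```python
def annealed_lambda(epoch, lam, anneal_epochs):
    """Penalty weight at a given epoch under linear annealing.

    Assumption: the weight ramps from 0 at epoch 0 to lam at
    epoch anneal_epochs, and stays at lam afterwards.
    """
    if anneal_epochs <= 0:
        return lam
    return lam * min(1.0, epoch / anneal_epochs)
```

For example, `annealed_lambda(15, 0.1, 30)` gives half of the final penalty weight.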

Relation to reinforcement learning

Our method can be seen as a policy gradient based method (Sutton et al., 2000), one of the reinforcement learning techniques, applied to the feature selection setting. From this perspective, our policy is the factorized Bernoulli distribution, and the reward consists of the log probability of the targets and the number of features used. We optimize the policy that outputs the best feature combination in a large feature space with $2^D$ combinations, where $D$ is the total number of features. To get a feature-wise explanation, we adopt the factorized Bernoulli distribution, which provides a per-feature importance value as our ranking.

3.3 Training details of neural networks

We list all the hyperparameters of our experiments (No Interaction, Interaction, Support2, MiniBooNE, Online News, YearPredictionMSD, and Physionet) in Table 1. For all the feed-forward neural networks, we add dropout and batch normalization in every hidden layer, and use learning rate decay and early stopping to train the classifier. For the recurrent neural network, we apply dropout and batch normalization at the output. We use grid search and cross validation to select $\lambda$. We add a small $\ell_2$ penalty to reduce overfitting. We use Adam (Kingma and Ba, 2014) as the optimizer in all the experiments. All the hyperparameters are selected by hand without much tuning.

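| | No Interaction | Interaction | Support2 | MiniBooNE | Online News | YearPredictionMSD | Physionet |
|---|---|---|---|---|---|---|---|
| Loss | MSE | MSE | CE loss | CE loss | MSE | MSE | CE loss |
| Input layer | 40 | 40 | 42 | 50 | 59 | 90 | 37 |
| Hidden layers | 40-20 | 40-20 | 170-170 | 150-100-50-25 | 80-80-80 | 200-100-80-40-20 | 64 |
| Dropout | every hidden layer (0.5) | every hidden layer (0.5) | every hidden layer (0.5) | every hidden layer (0.5) | every hidden layer (0.5) | every hidden layer (0.5) | input (0.3), output (0.5) |
| BatchNorm | after every hidden layer | after every hidden layer | after every hidden layer | after every hidden layer | after every hidden layer | after every hidden layer | output |
| Learning rate | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.0005 | 0.001 |
| L2 penalty | 1e-05 | 1e-05 | 1e-04 | 1e-04 | 1e-05 | 1e-05 | 1e-06 |
| Patience | 3 | 3 | 5 | 2 | 4 | 2 | 2 |
| Lookahead | 10 | 10 | 12 | 6 | 9 | 5 | 5 |
| Epochs | 100 | 100 | 600 | 100 | 100 | 100 | 100 |
| Dropout FR: annealing | 30 | 30 | 30 | 30 | 30 | 30 | 30 |
| Dropout FR: $\lambda$ | 0.1 | 1 | 0.01 | 0.1 | 0.01 | 1 | 0.001 |
| Dropout FR: learning rate | 0.001 | 0.001 | 0.002 | 0.002 | 0.002 | 0.002 | 0.01 |
| Dropout FR: epochs | 200 | 200 | 400 | 150 | 200 | 100 | 300 |

Table 1: Neural network hyperparameters for each dataset. The last four rows are the Dropout feature ranking parameters.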

4 Results

First, we test and compare our dropout feature ranking method (Dropout FR) in simulation settings. Second, we test it on real-world classification and regression datasets using feed-forward neural networks. We then compare our approach on a clinical time-series dataset by interpreting a recurrent neural network. Finally, we apply our approach to a drug-response task to understand which genes contribute to drug response in a variational autoencoder (VAE) prediction model.

Table 2: Comparisons of different ranking methods in simulations (No Interaction, Interaction). Columns: Top 40, Top 20 and Top 5 for each dataset; listed rows: Dropout FR, Deep FS, Elastic Net.

4.1 Compared methods

We compare our approach to a variety of strawman methods as well as approaches commonly used for feature ranking. LASSO uses an $\ell_1$ penalty while Elastic Net uses a mix of $\ell_1$ and $\ell_2$ penalties (we pick the mixing ratio to balance between $\ell_1$ and $\ell_2$); in both, the feature importance is derived from the order in which each feature's coefficient goes to $0$ as the penalty increases. Random Forest derives its feature ranking from the average decrease of impurity across different trees (we use the same number of trees for all experiments). Marginal ranking refers to univariate feature analysis that ignores the interaction between features. We use the t-test probability as the ranking criterion in the binary classification tasks, and the Pearson correlation coefficient in the regression tasks. Random ranking means that we randomly assign ranks to different features, serving as a baseline in the real-world dataset evaluation in Sections 4.3 and 4.4.
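As an illustration of the marginal baseline for regression tasks, the sketch below ranks features by their Pearson correlation with the target. Ranking by the absolute value of the correlation is our assumption, and the function name is hypothetical.

```python
import numpy as np

def marginal_ranking(X, y):
    """Rank features by |Pearson correlation| with the target.

    Returns (order, corr): feature indices from most to least important,
    and the signed correlation of each feature with y.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    corr = (Xc * yc[:, None]).sum(axis=0) / denom
    return np.argsort(-np.abs(corr)), corr
```

Each feature is scored in isolation, so a feature that matters only through a product with another feature can receive a correlation near zero.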

Deep FS (Li et al., 2016) was proposed specifically for interpreting deep learning models. It adds another hidden layer to the network with one connection per input node to this hidden layer (of the same size as the input) and uses an $\ell_1$ penalty on this layer. After the optimization, the magnitude of the connection weight is used as a proxy for the importance of each variable. Note that to correctly evaluate the importance of each feature and to ultimately rank features, in theory the method should examine the order in which weights drop to $0$ as the penalty increases. However, this would require hundreds of manual settings of the penalty hyperparameter, which is not scalable, so we follow the authors and use the connection weight instead. We also pick the penalty coefficient by cross validation.

Finally, we use two heuristics to rank features in a DNN. We call the first approach the ‘Mean’ method: we replace one feature at a time with the mean of that feature and rank feature importance based on the corresponding increase in the training loss. Our second method is called ‘Shuffle’: for each feature we permute its values across the samples and evaluate importance by the increase in the training loss.
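Both heuristics can be sketched in a few lines of NumPy. Here `loss_fn(X, y)` is a hypothetical callable that returns the training loss of the already trained network on the given inputs; everything else follows the description above.

```python
import numpy as np

def mean_importance(loss_fn, X, y):
    """'Mean' heuristic: loss increase when a feature is set to its mean."""
    base = loss_fn(X, y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_mod = X.copy()
        X_mod[:, j] = X[:, j].mean()
        scores[j] = loss_fn(X_mod, y) - base
    return scores

def shuffle_importance(loss_fn, X, y, rng):
    """'Shuffle' heuristic: loss increase when a feature is permuted."""
    base = loss_fn(X, y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_mod = X.copy()
        X_mod[:, j] = rng.permutation(X[:, j])
        scores[j] = loss_fn(X_mod, y) - base
    return scores
```

A larger score means a more important feature, so `np.argsort(-scores)` gives the ranking from most to least important.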

4.2 Simulation

We simulate two datasets to show that a multi-layer neural network can capture non-linear interactions. First, we simulate a dataset without any feature interactions (called ‘No Interaction’). We sample $40$ features, of which only the top $20$ are informative of the target $y$. These top features have decreasing importance, obtained with increasing Bernoulli noise that stochastically sets each feature to $0$. The target $y$ is then computed from these informative features. Thus, among the informative features, the most important feature is the 1st and the least important is the 20th. The ground truth ranking decreases from the 1st feature to the 20th feature, with the noisy features (21st to 40th) as the least important. We then calculate the Spearman coefficient with respect to the ground truth as our performance metric.
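The metric can be sketched as follows. Two choices here are assumptions on our part: tied ground-truth entries (the noisy features) receive average ranks, and a "Top k" score is computed by restricting both vectors to the first k features. Function names are illustrative.

```python
import numpy as np

def average_ranks(v):
    """Ranks starting at 1, with tied values sharing their average rank."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]

def top_k_spearman(scores, truth, k):
    """Spearman coefficient restricted to the first k features."""
    return spearman(np.asarray(scores)[:k], np.asarray(truth)[:k])

# Ground-truth importance: 20 informative features in decreasing order,
# followed by 20 noisy features tied at the lowest importance.
truth = np.concatenate([np.arange(20, 0, -1), np.zeros(20)])
```

Importance scores from any method (higher meaning more important) can then be compared with `top_k_spearman(scores, truth, k)`.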

To study the effect of second order interactions, we simulate another dataset with second order feature interactions (called ‘Interaction’). Namely, we use the product of feature pairs instead of each individual feature to affect the target $y$. Specifically, the target depends on the product of each pair of consecutive informative features, $x_{2k-1}$ and $x_{2k}$. Thus, among the informative features, the most important feature pair is the 1st and 2nd features, with importance decreasing down to the 19th and 20th feature pair. The least important features are still the noisy features (21st to 40th).
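A hedged sketch of a generator for both simulated datasets is shown below. Several details are our assumptions because they are not specified in the text above: features are drawn from a standard normal distribution, the zeroing probability grows linearly with the feature index, the target is an unweighted sum (of features or of pair products), and the target is computed from the noised features.

```python
import numpy as np

def simulate(n, interaction, rng, d=40, d_informative=20):
    """Simulate a 'No Interaction' or 'Interaction' style dataset.

    Assumptions: standard normal features, a linear noise schedule,
    and an unweighted sum as the target.
    """
    X = rng.normal(size=(n, d))
    # Feature j is zeroed with a probability that increases with j,
    # so earlier features carry more signal.
    drop_prob = np.linspace(0.0, 0.9, d_informative)
    keep = rng.random((n, d_informative)) >= drop_prob
    X[:, :d_informative] *= keep
    informative = X[:, :d_informative]
    if interaction:
        # Products of consecutive pairs: (1st, 2nd), (3rd, 4th), ...
        y = (informative[:, 0::2] * informative[:, 1::2]).sum(axis=1)
    else:
        y = informative.sum(axis=1)
    return X, y
```

The remaining `d - d_informative` columns never enter the target and play the role of the noisy features.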

Training Details

We simulate enough samples in both datasets for the neural network to perform reasonably well. We train a feed-forward neural network with hidden layers (exact architecture shown in Section 3.3), then rank features by our Dropout FR, Mean, Shuffle and Deep FS. We train the random forest with a fixed number of trees, since we find that increasing the number of trees does not improve performance. We use 5-fold cross validation for all our experiments.


In Table 2, we show the Spearman coefficients for these two datasets when comparing all features (Top 40), only the informative features (Top 20) and the five most informative features (Top 5). In the No Interaction dataset, all methods except Elastic Net and LASSO perform well when comparing all features. However, Elastic Net and LASSO perform well when only the top informative features are considered. This shows that these methods cannot distinguish noisy from true features, but are able to rank the strengths of informative features. In the Interaction dataset, Elastic Net, LASSO and the Marginal method perform much worse, showing that a simple linear layer and single-feature statistical tests cannot capture second order interaction effects. The deep learning based methods (Dropout FR, Mean, Shuffle), with the exception of Deep FS, perform the best across all settings. Random Forest (RF) is able to distinguish noisy features from important features (see Top 40). However, it performs much worse when only the top features are considered, showing that it cannot correctly rank the most important features and fails to capture complicated feature interactions.

Table 3: UCI dataset classifier performance. Columns: Samples, Features, NN, RF, Elastic Net, LASSO; rows: Support2 (accuracy), MiniBooNE (accuracy), OnlineNewsPopularity (MSE), YearPredictionMSD (MSE; 90 features).

4.3 Evaluation in the Real-World datasets

Evaluation Criteria

To understand which feature ranking is the best in the real-world datasets, we evaluate each feature ranking by the test set performance of its top $k$ features. A better feature ranking should reach higher performance using the same number of features. We evaluate the feature ranking in two settings. We call the first setting ‘zero-out’: after taking the top $k$ features, we set the rest of the features to $0$ and evaluate the test performance using the already trained neural network. This represents how well we interpret a given classifier. The second setting is called ‘retrain’: we retrain the neural network using only the top $k$ features. This represents in general which features are important under this neural network architecture.
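The ‘zero-out’ setting can be sketched as below. `metric_fn(X, y)` is a hypothetical callable that scores the already trained network on the given inputs, and `ranking` lists feature indices from most to least important.

```python
import numpy as np

def zero_out_except_top_k(X, ranking, k):
    """Keep the k top-ranked feature columns and set all others to 0."""
    X_masked = np.zeros_like(X)
    top = np.asarray(ranking)[:k]
    X_masked[:, top] = X[:, top]
    return X_masked

def zero_out_curve(metric_fn, X_test, y_test, ranking, ks):
    """Test performance of the trained model for each number of features."""
    return [metric_fn(zero_out_except_top_k(X_test, ranking, k), y_test)
            for k in ks]
```

The ‘retrain’ setting would instead select the same columns and fit a new network on them, which is why it is considerably more expensive.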

In this experiment, we evaluate our ranking approaches on 2 classification tasks (the clinical dataset Support2, available at http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets, and the UCI MiniBooNE dataset) and 2 UCI regression tasks (Online News Popularity (Fernandes et al., 2015) and YearPredictionMSD). Here we describe each dataset. Support2 is a multivariate clinical dataset which aims to predict in-hospital death from a patient’s demographics, clinical assessments and lab tests. MiniBooNE aims to distinguish electron neutrinos (signal) from muon neutrinos (background). The Online News Popularity dataset describes articles on the website Mashable by article topics, word compositions and timestamps, and the goal is to predict the number of times each article is shared. YearPredictionMSD is a task to predict the release year of songs published in the 19th to 20th century from various sound features.

We preprocess all the continuous features by clipping the values of outliers to the outlier threshold defined by the interquartile range (IQR) method. Then we normalize them to zero mean and unit variance. We also remove the outliers and normalize the target variable in the Online News Popularity dataset. For categorical variables, we use one-hot encoding. We use 5-fold cross validation, with part of the training set held out as the validation set, in all our experiments.
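The preprocessing of continuous features can be sketched as follows. The whisker length of 1.5 times the IQR is the common convention and is an assumption here, as the exact threshold is not stated above.

```python
import numpy as np

def clip_iqr_and_standardize(X, whisker=1.5):
    """Clip outliers to the IQR fences, then standardize each column.

    Assumption: the fences are Q1 - whisker * IQR and Q3 + whisker * IQR.
    """
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    X_clipped = np.clip(X, q1 - whisker * iqr, q3 + whisker * iqr)
    mean = X_clipped.mean(axis=0)
    std = X_clipped.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X_clipped - mean) / std
```

In a cross-validation setup, the quartiles, mean and standard deviation would normally be estimated on the training fold only and then applied to the validation and test folds.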

We show each dataset's summary statistics and the classifiers’ performance in Table 3. We select datasets that have a relatively large number of instances, the scenario where neural networks commonly outperform their competitors. With the exception of the largest dataset in this experiment (YearPredictionMSD), neural network performance is relatively close to the performance of random forest. RF even outperforms NN on Support2, the smallest dataset. As expected, neural network performance gets better as the datasets get larger.

In Figure 2, we compare our ranking method with all other methods mentioned in Section 4.1. In the ‘zero-out’ setting (first row), our method compares favorably on all the datasets we tested, with a significant difference on the larger dataset YearPredictionMSD. We note that we get slightly inferior performance to that of random forest when only the top few features are used in the MiniBooNE dataset. We observe a similar phenomenon relative to the Shuffle method in the YearPredictionMSD dataset. However, the overall performance on these and other datasets with only the top few features is much worse than the performance with more features, indicating that so few features are not sufficient to model any of these datasets. Our method has significantly higher performance once more features are used, on these and other datasets. We deduce that Dropout FR selects better combinations of features (since it gets lower loss as the number of features gets larger) at the cost of performance when given just the top few features.

In the ‘retrain’ setting (second row), we only compare the first three datasets due to the time it takes to retrain models for the YearPredictionMSD dataset. In this setting, we find that our method rivals or outperforms other methods. This demonstrates that the Dropout FR method can retrieve feature combinations better suited to the neural network architecture than many other approaches across a wide variety of datasets.

In both settings, we find that marginal ranking (green) performs much better in the Support2 and News Popularity datasets and much worse in the more complicated datasets, MiniBooNE and YearPredictionMSD. This might also be the reason why Dropout FR performs relatively close to other baselines in these simpler datasets, since using marginally important features is sufficient to explain the outcome. However, as datasets get bigger and more complicated, our method achieves significantly better results than other baselines, as seen in the MiniBooNE (only RF is close) and YearPredictionMSD datasets. Note that these comparisons also help us to infer the complexity of the datasets, thus it may be beneficial to gain more insights into the data by always evaluating the performance of the strawman methods alongside Dropout FR.

Figure 2: Comparison of methods on four datasets (panels: Support2, MiniBooNE, News Popularity, YearPredictionMSD) with two evaluation settings (zero-out, retrain). The x-axis of each panel is the number of features.

4.3.1 Stability

In this experiment, we show that our algorithm is robust to the random initialization of the neural network. Figure 3 shows runs with different random seeds on the Support2 dataset for a fixed $\lambda$. They all converge to a similar dropout rate for each feature after optimization (shown by the complete overlap of the curves corresponding to different seeds on the graph).

Figure 3: The keep probability (one minus the dropout rate) of features in the Support2 dataset for different random seeds (shown in different colors) at a fixed $\lambda$. The x-axis is the feature index.

4.3.2 Regularization Coefficient Effect

In Figure 4, we examine the effect of different values of the regularization coefficient $\lambda$ on the final dropout rate in our algorithm. We note that when we have strong regularization (high $\lambda$), most of the features get pruned and have a high dropout rate (low keep probability). On the other hand, when the regularization is too weak, every feature has a dropout rate close to $0$. It is crucial to select a proper $\lambda$ that preserves the important features while pruning the noisy features.

Figure 4: The keep probability of features in the Support2 dataset for different values of the regularization coefficient $\lambda$. The x-axis is the number of features.

4.4 Predicting in-hospital mortality

In this experiment, we evaluate the performance of our method using a multivariate time-series clinical dataset to determine the importance of clinical covariates in predicting in-hospital mortality. This dataset, from the PhysioNet 2012 Challenge (Goldberger et al., 2000), is a publicly available collection of multivariate clinical time series from intensive care unit (ICU) patients. It contains patient measurements within the first 48 hours in the ICU. The goal is to predict in-hospital mortality as a binary classification problem. We use only the publicly available Training Set A subset, which contains 4,000 patients.

Figure 5: Comparison of methods on the Physionet dataset with both AUROC and AUPR (one panel each). The x-axis of each panel is the number of features.

We follow the preprocessing of Lipton et al. (2016). First, we use binary features indicating whether or not a feature was measured at a given time point. If a feature was not measured, we set the binary variable to $1$, and if it was measured, we set it to $0$. We concatenate these reverse-indicator variables with the original features to form the full input feature set. Second, we normalize each feature to zero mean and unit variance, except for the binary features. Finally, we bin the input features into 1-hour intervals, take the average of multiple measurements within each 1-hour time window, and impute missing values. This yields one multivariate time series per patient. We split the dataset randomly into training, validation and test sets, and repeat the procedure multiple times.
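The binning and reverse-indicator steps can be sketched for a single patient as follows. The input layout (parallel arrays of timestamps in hours, feature indices and values), the fill value used for imputation and the function name are assumptions for illustration, and the normalization step is omitted.

```python
import numpy as np

def bin_hourly(times, feats, values, n_hours, n_features, fill_value=0.0):
    """Average measurements per 1-hour bin and append reverse indicators.

    times: measurement times in hours; feats: integer feature index of
    each measurement; values: measured values.
    Returns an array of shape (n_hours, 2 * n_features).
    """
    sums = np.zeros((n_hours, n_features))
    counts = np.zeros((n_hours, n_features))
    hours = np.minimum(np.asarray(times).astype(int), n_hours - 1)
    feats = np.asarray(feats)
    # Unbuffered accumulation handles repeated (hour, feature) pairs.
    np.add.at(sums, (hours, feats), np.asarray(values, dtype=float))
    np.add.at(counts, (hours, feats), 1)
    measured = counts > 0
    binned = np.full((n_hours, n_features), fill_value)
    binned[measured] = sums[measured] / counts[measured]
    missing = (~measured).astype(float)  # 1 = not measured, 0 = measured
    return np.concatenate([binned, missing], axis=1)
```

For example, two measurements of the same feature at 0.2 h and 0.7 h fall into the first bin and are averaged, while every bin without a measurement gets the fill value and an indicator of 1.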

We follow the RNN architecture used in Che et al. (2016) to predict mortality. We use cross validation to select $\lambda$. For random forest, we use a fixed number of trees and sum the feature importance across all the time points (since in RF each time point is considered independently for each of the features), including each original feature and its corresponding reverse-indicator feature.
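The aggregation of random forest importances can be sketched as below. It assumes the time series was flattened time step by time step, with each step holding the original features followed by their reverse indicators; this layout and the function name are assumptions.

```python
import numpy as np

def sum_importance_over_time(importances, n_time, n_features):
    """Collapse per-(time, column) importances into one score per feature.

    Assumed layout: importances has length n_time * 2 * n_features, ordered
    by time step, each step holding n_features values then n_features
    reverse indicators.
    """
    per_step = np.asarray(importances, dtype=float).reshape(n_time, 2 * n_features)
    total = per_step.sum(axis=0)
    # Add each feature's indicator importance to its value importance.
    return total[:n_features] + total[n_features:]
```

Sorting the result in decreasing order gives one ranking over the original clinical features.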

First, we compare the neural network performance with other commonly used classifiers on the PhysioNet dataset and show in Table 4 that the RNN is better in test set AUPR and AUROC.

In Figure 5, we compare Dropout FR, Mean, RF ranking and Random ranking in the zero-out and retrain settings with both AUROC and AUPR. We find that Dropout FR performs better overall than Random Forest, with a significant difference when using only one feature, across all the settings we evaluate. We also find that Dropout FR performs significantly better than the Mean method across all settings. Overall, we show that our method performs well in the recurrent neural network architecture, capturing feature importance in time-series datasets.

In Table 5, we show the top 10 features selected by RF and Dropout FR. Overall these two approaches rank features somewhat differently, though many of the features in the two lists are the same. We find that the reason for the inferior RF performance observed in Figure 5 when only one feature is used is the different ranking of the ‘Urine’ and ‘GCS’ features (RF selects ‘GCS’ as second). The table also demonstrates that feature importance does not simply follow the frequency of the features in the dataset for either of the methods.

4.5 Drug Response Prediction

We apply our method to a real-world drug response dataset to find which genes determine drug response, using the semi-supervised variational autoencoder (SSVAE) (Kingma et al., 2014) model applied to this task by Rampasek et al. (2017), who kindly shared the dataset and code with us. The SSVAE takes the gene expression of 903 preselected genes as input and performs a binary classification to predict whether the given cell line responds to the drug.

In Table 6, we examine genes contributing to the response to bortezomib, a drug commonly used in multiple myeloma patients. We choose this drug since the model performs best on it and it is widely investigated in the biological research literature. The gene that was ranked the highest by our algorithm (with the lowest dropout probability), NR1H2, was previously found to be indicative of Multiple Myeloma (MM) non-response to anti-MM agents such as bortezomib (Agarwal et al., 2014). The second ranked gene, BLVRA, is known to be amplified in cells sensitive to anti-MM treatment, such as bortezomib (Soriano et al., 2016). Interestingly, BLVRA was also ranked second by RF (and not ranked highly by the t-test). The gene ranked first by RF is FOSL1, which was not directly found to be linked to bortezomib response, but is tangentially related through osteoclast processes (FOSL1 helps with differentiation into bone cells, and a secondary effect of bortezomib is to prevent bone loss during inflammation processes). Overall, we found that the RF ranking follows the t-test ranking rather closely. The Dropout FR ranking was significantly different, capturing the importance of the features for the SSVAE classification.

Table 4: Test set performance of different methods on Physionet. Columns: AUPR, AUROC; rows: SVM-linear, SVM-RBF, RF, RNN.

5 Discussion

In this work we proposed a new general approach for understanding the importance of features in deep learning. This simple approach has been previously shown to be very powerful for regularizing DNNs and preventing them from overfitting, but thus far has never been used on the input layer or applied to the task of feature ranking, i.e. to understand the performance of DNNs. We believe that variational dropout works well because it acts similarly to feature bagging (Ho, 1995), subsetting the features during training. This allows the model to decouple correlated variables in certain instances while optimizing the corresponding feature-wise dropout rate. This may also be the reason for the good performance of random forest which we have observed in our experiments, and the reason for the poor performance of the $\ell_1$ penalty used in LASSO and Deep FS.

In our simulation experiment, we showed that deep learning based methods (Dropout FR, Mean, and Shuffle) capture the second-order interactions well. Among the other methods, Random Forest performs worse when considering the order of the more important features, showing that it is not able to capture the correct ranking among important interacting features. Methods such as Marginal, LASSO and Elastic Net perform much worse in our simulation, indicating that simple univariate testing or a linear layer is not sufficient to capture complicated nonlinear effects across both simulated and real datasets.

Further, we tested our approach in feed-forward networks, a recurrent neural network and a semi-supervised variational autoencoder, showing that Dropout FR is applicable to various deep learning architectures. In most scenarios it performs better than other commonly used baselines, and in the remaining scenarios it performs as well as some of the best alternatives. In particular, our experiments with feed-forward neural networks show that our method outperforms other methods significantly on the MiniBooNE and YearPredictionMSD datasets when ranking beyond the first few top features. Although our approach is not the best performer for some small numbers of features, we consider this reasonable: it is not a greedy approach, and thus might find a ranking that sacrifices the performance of fewer features in exchange for a larger performance gain for a bigger combination of features. In addition, in the time-series setting (Physionet), our approach outperforms other methods, including Random Forest when only using the top one feature. We see the same phenomenon in simulation: Random Forest is not good at ranking the top-ranked features, which is important for experimental design. Overall, we found it useful to compare multiple strawman methods, such as marginal ranking, to gain further insights into the complexity of the data.

Table 5: Top features selected by RF and by Dropout FR (RNN) in Physionet.

| Rank | RF | Dropout FR (RNN) |
|---|---|---|
| 1 | Urine | GCS |
| 2 | GCS | Urine |
| 3 | HR | BUN |
| 4 | SysABP | MechVent |
| 5 | Temp | Temp |
| 6 | NISysABP | HR |
| 7 | NIMAP | Lactate |
| 8 | Weight | Weight |
| 9 | NIDiasABP | NIDiasABP |
| 10 | MAP | SysABP |

Table 6: Top genes selected by RF and by Dropout FR (NN) in drug response. We also list the t-test p-value for each gene.

| Rank | RF gene | p-value | Dropout FR (NN) gene | p-value |
|---|---|---|---|---|
| 1 | FOSL1 | 1.48e-07 | NR1H2 | 2.26e-05 |
| 2 | BLVRA | 1.30e-01 | BLVRA | 1.30e-01 |
| 3 | TRAM2 | 1.18e-10 | PIK3CA | 6.14e-02 |
| 4 | CD44 | 1.21e-07 | ATP6V1D | 4.34e-01 |
| 5 | DRAP1 | 1.30e-06 | USP20 | 8.47e-02 |
| 6 | FKBP4 | 4.99e-04 | TRAM2 | 1.18e-10 |
| 7 | UBE2L6 | 2.77e-06 | CD58 | 2.20e-05 |
| 8 | PLEKHM1 | 8.83e-04 | DHX8 | 3.95e-02 |
| 9 | SERPINE1 | 8.34e-05 | PPP1R13B | 5.40e-03 |
| 10 | BAX | 3.05e-05 | CASP10 | 6.07e-03 |

6 Conclusion

We propose a new general feature ranking method for deep learning to interpret feature importance. When it is impossible to measure all the features under various constraints, such as limited time or undue physical or emotional burden on the patient, it is paramount to design a system that collects the right subset of features leading to the highest performance. Our method can be used to design a standard diagnostic procedure that uses the smallest number of clinical tests with the highest or comparable predictive power. In conclusion, we provide a new and effective method that addresses the resource-constrained setting widely seen in the healthcare industry, and an effective solution to the common need in biology to interpret predictive systems, especially deep learning models, which are commonly thought of as complex black boxes.