The interpretability of a machine learning (ML) algorithm is critical to many data analysis tasks. Perhaps the most popular motivating example involves ML interpretability in medical data analysis. Let us imagine we have a set of patients in our dataset, some with heart disease, some without. Given a set of features about these patients, one may be concerned not only with the accurate prediction of heart disease, but also with the features most important to the success of this prediction task. Let us denote these important features as the feature importances of our medical dataset. Such feature importances can help medical professionals gather useful insights regarding the prevention of heart disease.
Currently, one of the most popular ways to obtain such insights is to use the feature importances calculated when training a Random Forest (RF) [Breiman2001]. RF is an ensemble learning technique which utilizes numerous decision trees for regression and classification tasks. As described in the original paper, RF allows us to quantify feature importance given the interpretable nature of the underlying decision tree structure.
RF solves the interpretability problem by providing knowledge of feature importances on a global scale. That is, RF inherently answers the question “What features best separate all data points?” Yet, what if one desires to know how important a feature is to the prediction of a specific sample? For example, what if we wish to gather feature importance insights for a specific patient in our medical dataset? Maybe he or she had an outlying characteristic we wish to explore at a low level. This is the type of insight Single Sample Feature Importance (SSFI) aims to provide.
In general, SSFI builds on RF by answering the question, “What features are contributing most to the prediction of the target variable for a single sample in our dataset?” Our SSFI algorithm exploits the existing properties of RF to quantify the contribution of each feature to the prediction of a given sample.
2 Related Work
When it comes to analyzing how all samples are impacted by the feature space during prediction, there are a number of existing methodologies. Various forms of Multiple Linear Regression coefficient analysis [standardize_coef] [Dominance_Analysis] [MR_PY] [heuristic_importance] [RIA] [multifaceted] have been utilized to answer questions pertaining to feature importance. Upon the development of ensemble-tree learning algorithms [Breiman2001], RF has become a popular tool for feature importance calculation. The specifics of RF’s default feature importance calculation are discussed in Section 3.1. Variations of this approach, such as Permutation Importance [perm_import], combat bias produced by RF in the presence of categorical class imbalance. However, none of the aforementioned approaches allow for investigation of feature importances for a single sample. This is the problem SSFI aims to investigate.
The notion of using RF for low-level feature analysis was first introduced by [Palczewska2014]. However, SSFI expands on this idea by redefining the way one calculates the importance of the feature at a certain node in a Decision Tree. Our node importance function is detailed in Section 4.1.
A general solution to the low-level model interpretability problem is LIME [lime], which can explain the prediction of any classifier or regressor via local approximation of an interpretable model. Both LIME and SSFI provide a relative ranking of feature importances during the prediction of a given sample. However, LIME is built to provide a general solution to the interpretability problem, while SSFI should be viewed as an extension of RF. We compare the performance of both the general and RF-based solutions within a classification setting in Section 5.2.
When analyzing image data, [CMAP] shows how Class Activation Maps (CMAP) can be used to identify image regions with the greatest impact on a prediction made by a Convolutional Neural Network. Since SSFI is built upon RF, which is generally unable to learn complex image data, CMAPs can be difficult to compare with SSFI. However, when presented with small grayscale images learnable by RF, we are able to compare the important pixels extracted from both algorithms. We compare the results of SSFI with CMAPs in Section 5.3.
3 Random Forest
RF is a popular algorithm used for both classification and regression tasks. RF is an ensemble of Decision Trees, which are attractive predictors for a variety of reasons. First, Decision Trees are non-parametric, allowing for the modeling of very complex relationships without the use of a prior. Furthermore, Decision Trees efficiently utilize both categorical and numeric data, are robust to outliers, and provide an interpretable modeling of the data [1407.7502].
More formally, Decision Trees can be understood as recursive partition classifiers. Let us denote a learning set $L = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ where each $x_i$ is a $1 \times m$ input vector containing $m$ explanatory variables and each $y_i$ is a continuous value (regression) or class value (classification) corresponding to the respective target. In the case where $y_i \in \{1, \ldots, C\}$, Decision Trees aim to recursively partition the inputs $X$ into the subsets which minimize

$$G = \sum_{k=1}^{C} p(k)\,\bigl(1 - p(k)\bigr) \qquad (1)$$

where $p(k)$ is the probability of picking a sample with class label $k$. $G$, known as the Gini impurity [Breiman1983ClassificationAR], quantifies the quality of a split by how mixed the classes are in the groups created by the split. A perfect split results in $G = 0$, which means the probability of picking a single class $k$ in each subset is 1.
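The Gini impurity described above can be sketched in a few lines of Python. This is only an illustration: the function names `gini_impurity` and `split_impurity` and the toy labels are assumptions introduced here, and the size-weighted average over the two groups is the usual way a candidate split is scored.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity sum_k p_k * (1 - p_k) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum((c / n) * (1.0 - c / n) for c in counts.values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two groups created by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Toy labels (hypothetical): a pure group and an evenly mixed group.
pure = gini_impurity(["sick", "sick", "sick"])
mixed = gini_impurity(["sick", "healthy"])
perfect_split = split_impurity(["sick", "sick"], ["healthy", "healthy"])
```

A group containing a single class has impurity 0, while a two-class group split evenly has impurity 0.5, the maximum for two classes.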
Decision Trees suffer from high variance as a single tree is highly dependent on its given training data. To combat this issue, one can apply a technique called Bootstrap Aggregating (Bagging) [Breiman1996_]. Given our learning set $L$, bagging creates $B$ Decision Trees which are trained on a random subsample of $L$ (with replacement) of size $n'$. Thus, given $B$ Decision Trees, our Bagging-based prediction of our target variable is

$$\hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(x) \qquad (2)$$

where $T_b(x)$ is the output of Decision Tree $b$ given some input vector $x$ of size $1 \times m$. This approach reduces the inherent variance problem related to single Decision Tree learners by training many trees on slightly different datasets.
A limitation of Bagging is that every tree is built with the same sampling procedure, allowing for highly correlated trees which may have equivalent (or very similar) split points. RF solves this by limiting the number of features to be considered by each tree during the bagging procedure. For classification problems, the number of features used in each split in a dataset with $m$ features is usually $\sqrt{m}$. This results in lower correlation between trees, higher diversity in predictions, and an improvement in overall accuracy.
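The two sources of randomness discussed above, bootstrap resampling of rows and restriction of the candidate features, can be sketched as follows. The helper names are hypothetical, the bootstrap size is assumed equal to the learning set size, the square-root rule is the common default for classification, and averaging is shown as the regression-style aggregation.

```python
import math
import random


def bootstrap_sample(n, rng):
    """Draw n row indices with replacement (one bagging resample)."""
    return [rng.randrange(n) for _ in range(n)]


def candidate_features(m, rng):
    """Pick floor(sqrt(m)) distinct feature indices to consider at a split."""
    k = max(1, int(math.sqrt(m)))
    return sorted(rng.sample(range(m), k))


def bagged_prediction(tree_outputs):
    """Average the per-tree outputs (regression-style aggregation)."""
    return sum(tree_outputs) / len(tree_outputs)


rng = random.Random(0)                       # fixed seed for repeatability
rows = bootstrap_sample(10, rng)             # indices into a 10-sample learning set
feats = candidate_features(16, rng)          # 4 of 16 features
y_hat = bagged_prediction([1.0, 2.0, 3.0])   # outputs of three hypothetical trees
```

Because each tree sees a different resample and a different feature subset, two trees rarely choose identical split points, which is what lowers the correlation between them.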
3.1 Feature Importance
The most popular feature importance measure utilizing RF is known as the Gini Importance [Breiman1983ClassificationAR]. The Gini Importance is implicitly calculated during training as it is a product of the Gini Impurity used to calculate Decision Tree splits. That is, when training a Decision Tree, one can calculate how the selection of a feature at node $j$ in tree $t$ contributes to the minimization of the Gini Impurity. We can extend this idea to RF and simply observe a feature’s total impact on the Gini Impurity’s decrease throughout all trees in a RF.
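Let $F$ represent a trained RF consisting of $T$ trees. For each tree $t$, let there be $N_t$ nodes in tree $t$. The importance of the feature used at node $j$ is computed as:

$$n_j = w_j G_j - w_{\mathrm{left}(j)} G_{\mathrm{left}(j)} - w_{\mathrm{right}(j)} G_{\mathrm{right}(j)} \qquad (3)$$

where $w_j$ is the weighted number of samples (i.e. number of samples for a node divided by total number of samples) that reach $j$, $G_j$ is the Gini Impurity of $j$, $w_{\mathrm{left}(j)}$ and $w_{\mathrm{right}(j)}$ represent the weighted number of samples reaching the left child node and right child node of $j$ respectively, and $G_{\mathrm{left}(j)}$ and $G_{\mathrm{right}(j)}$ represent the Gini Impurity of the left and right child nodes respectively. Thus the importance $n_j$ represents the Gini decrease provided by the feature used to split at node $j$.

Perhaps more intuitively, it might be useful to think about the Gini decrease as:

$$n_j = G_{\mathrm{parent}} - G_{\mathrm{children}} \qquad (4)$$

where $G_{\mathrm{parent}} = w_j G_j$ and $G_{\mathrm{children}} = w_{\mathrm{left}(j)} G_{\mathrm{left}(j)} + w_{\mathrm{right}(j)} G_{\mathrm{right}(j)}$, as the importance of a node is quantified by how much it contributes to the minimization of the Gini Impurity. Thus, given some feature $f$ used to split node $j$, the larger $n_j$ is, the greater impact $f$ has on Impurity minimization.

One can thus calculate the total Gini Importance for variable $f$ as:

$$I(f) = \sum_{t=1}^{T} \sum_{j=1}^{N_t} n_j^{(t)} \qquad (5)$$

which is the total decrease in impurity across all nodes and all trees in a forest contributed by feature $f$, with $n_j^{(t)}$ denoting the importance of node $j$ in tree $t$. Note that Eq. 5 assumes that $n_j^{(t)}$ at a node for which feature $f$ is not present produces 0 change in Gini decrease.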
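As a hedged illustration of the node-level Gini decrease, the sketch below recomputes the importances of a single fitted tree from its internal arrays. It assumes scikit-learn as the implementation, which the text does not specify, and uses the Iris dataset and one `DecisionTreeClassifier` purely as stand-ins; scikit-learn additionally normalizes the importances so they sum to 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and model (assumptions, not the datasets used in the paper).
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
tree = clf.tree_

w = tree.weighted_n_node_samples   # samples reaching each node
n_root = w[0]                      # total samples, used to turn counts into weights
importance = np.zeros(X.shape[1])

for j in range(tree.node_count):
    left, right = tree.children_left[j], tree.children_right[j]
    if left == -1:                 # leaf node: no split, so no contribution
        continue
    # Weighted impurity of the parent minus weighted impurity of its children.
    decrease = (w[j] * tree.impurity[j]
                - w[left] * tree.impurity[left]
                - w[right] * tree.impurity[right]) / n_root
    importance[tree.feature[j]] += decrease

importance /= importance.sum()     # scikit-learn reports normalized importances
```

For a forest, the same per-tree quantity would be accumulated over every tree; in scikit-learn the per-tree normalized importances are averaged.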
4 Single Sample Feature Importance
A limitation of the variable importance calculation as described in Section 3.1 is that it only provides insight into how important a feature is in the context of global Gini Impurity minimization. What if one desires to know how important a feature is to the prediction of a specific sample? In this section, we detail the SSFI algorithm and outline all underlying assumptions used during our calculation.
Consider a trained RF model with $T$ trees, where each tree $t$ has depth $D_t$ and $N_t$ total nodes. Let $P_t$ be the sequence of nodes used during the prediction of sample $x$ given tree $t$. For all nodes $j$ there exists a split value $s_j$ which has been calculated using the methods described in Section 3. We predict the output of a test sample $x$ which has $m$ features, each with some feature value $x_f$. SSFI defines the importance of feature $f$ at node $j$ as:
where $x_f$ is the feature value of $x$ which corresponds to the feature used to split $j$, $d_j$ is the depth of $j$, $s_j$ is the computed split value of node $j$, and $\alpha$ is a free tuning parameter. In our experiments, $\alpha$ was set to the value we found to be optimal.
The overall importance of $f$ to the prediction of $x$ is the sum of the node-level importances of $f$ over all nodes $j \in P_t$ and all trees $t$:

$$\mathrm{SSFI}(f) = \sum_{t=1}^{T} \sum_{j \in P_t} \mathrm{SSFI}(f, j) \qquad (7)$$

Note that Eq. 7 assumes a contribution of 0 by feature $f$ at node $j$ if $f$ does not correspond to the feature used to split $j$. Eq. 6 highlights how SSFI quantifies importance. The first component implies that a feature which occurs earlier in the tree (i.e. closer to the root) will have a greater impact on the prediction. This follows one’s intuition as the first split in a Decision Tree influences the longest sequence of split decisions in the tree. The latter component can be understood as the “confidence” of a split, where “confidence” simply means the distance of the feature value from the split value. Altogether, Eq. 6 rewards features that were used early on in prediction and lie far from the computed node split values. Eq. 7 has a similar interpretation to Eq. 5, as we simply sum these importance values across all predictions made by a RF.
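The accumulation over decision paths can be sketched on a hand-built forest. The node-level score used below, the distance from the split value divided by the node depth raised to a tuning exponent, is an assumed functional form chosen only to match the two components described in the text (a depth discount and a split "confidence"); it is not claimed to be the exact definition. The dictionary-based tree layout, the function names, and the value `alpha=1.0` are likewise hypothetical.

```python
def ssfi_tree(root, x, alpha, scores):
    """Walk one tree along the path taken by sample x, accumulating scores."""
    node, depth = root, 1
    while "feature" in node:                      # internal node
        f, s = node["feature"], node["split"]
        # ASSUMED form: depth discount times distance from the split value.
        scores[f] = scores.get(f, 0.0) + abs(x[f] - s) / (depth ** alpha)
        node = node["left"] if x[f] <= s else node["right"]
        depth += 1
    return scores


def ssfi(forest, x, alpha=1.0):
    """Sum node-level scores over every tree; unused features contribute 0."""
    scores = {}
    for root in forest:
        ssfi_tree(root, x, alpha, scores)
    return scores


# A hypothetical two-tree forest over two features.
leaf = {"value": 0}
tree_a = {"feature": 0, "split": 0.5,
          "left": leaf,
          "right": {"feature": 1, "split": 2.0, "left": leaf, "right": leaf}}
tree_b = {"feature": 1, "split": 1.0, "left": leaf, "right": leaf}

scores = ssfi([tree_a, tree_b], x=[1.5, 4.0], alpha=1.0)
```

In this toy forest feature 1 receives the larger total because it is used at the root of one tree, far from its split value, and again deeper in the other tree.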
It is important to note the assumptions that need to be made when utilizing SSFI. First, we assume the samples in our learning set and our test sample come from the same dataset. The trained RF model used during calculation has never seen the SSFI sample being analyzed. Yet, since we derive SSFI from this trained model, the test sample must share the underlying structure of the training dataset. In practice, we generate SSFI for all samples by performing Leave-One-Out (LOO) cross validation [ref1]: given some learning set in which each sample is from the same distribution, one can iteratively calculate SSFI for the held-out sample while training the RF model on the remaining samples. This is displayed in Algorithm 1.
Furthermore, it is important for SSFI that the trained RF model is able to partition the data well and produce accurate predictions. Proper testing of RF’s ability to separate a given dataset must be performed prior to SSFI’s utilization. If utilizing LOO cross validation, one may verify the viability of SSFI by analyzing the error during LOO and ensuring the model is able to predict test samples accurately.
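The Leave-One-Out procedure can be sketched generically. In the sketch below, `fit` and `explain` are hypothetical placeholders for training a model and computing per-sample importances; the toy versions (a per-feature mean and a distance from that mean) stand in for RF training and SSFI only so the loop can run on its own.

```python
def leave_one_out(X, y, fit, explain):
    """For each sample, train on all other samples and explain the held-out one."""
    results = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)      # the model never sees sample i
        results.append(explain(model, X[i]))
    return results


def toy_fit(X_train, y_train):
    """Placeholder model: the per-feature mean of the training rows."""
    n = len(X_train)
    return [sum(col) / n for col in zip(*X_train)]


def toy_explain(model, x):
    """Placeholder importance: distance of each feature from the training mean."""
    return [abs(v - mean) for v, mean in zip(x, model)]


X = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
y = [0, 1, 0]
per_sample = leave_one_out(X, y, toy_fit, toy_explain)
```

Recording the prediction error for each held-out sample inside the same loop would give the LOO error used to check that the model separates the data well.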
Assessing the performance of SSFI is non-trivial as one generally cannot know the ground truth single-sample feature importances for a given dataset. Thus, we have contrived a series of experiments that aim to test SSFI’s validity. In this section, we first show how SSFI as a method of feature selection compares to both LIME and the traditional Gini Importance measure. Later, we perform a more qualitative analysis by examining the pixels SSFI deems most important during image classification.
5.1 Evaluation Metrics
To evaluate model performance on experiments with numeric data, we use the Coefficient of Determination ($R^2$):
where $n$ is the number of samples in our test set and $\hat{y}_i$ is the predicted value of ground truth value $y_i$. $\bar{\hat{y}}$ and $\bar{y}$ represent the mean of the predicted values and ground truth values respectively.
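For image classification, we quantify accuracy as:

$$\mathrm{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[\hat{y}_i = y_i\right]$$

where $n$ is the number of samples in our test set and $\hat{y}_i$ is our predicted class for sample $i$, which has ground truth label $y_i$.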
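Both metrics are short enough to sketch directly. The $R^2$ below assumes the common definition, one minus the ratio of residual to total sum of squares, which may differ from the exact form used in the experiments; accuracy is the fraction of matching labels. The function names are illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, assuming the 1 - SS_res / SS_tot form."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
acc = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
```

Under this definition a model that always predicts the mean of the targets scores 0, and a perfect model scores 1.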
5.2 Feature Selection using SSFI
To begin our evaluation, we pose the question: How do SSFI features compare to the top features calculated by RF and LIME? To investigate, let us construct four different experiments:
1. Evaluate model performance using Leave-One-Out (LOO) cross validation, where at each iteration of LOO, the feature space used to predict the target variable dynamically adjusts to utilize the pre-computed SSFI features for the test sample.
2. Evaluate model performance using LOO cross validation, where at each iteration of LOO, the feature space used to predict the target variable dynamically adjusts to utilize the pre-computed LIME features for the test sample.
3. Evaluate model performance using LOO where, throughout the cross validation process, the feature space remains static. The static feature set is chosen by a RF model previously trained on the same dataset.
4. Evaluate model performance using LOO and a random feature set. This experiment will establish a baseline result for comparison with 1, 2, and 3.
Each experiment is run 50 times to account for slight variations in performance due to the random component of RF models. Thus, all results are the average score over 50 runs. Furthermore, each model is trained only on the top-3 features produced by each selection method. We evaluate feature performance using both linear (Linear Regression [regresh]) and non-linear (RF) models. All RF models used for prediction are fit using 100 estimators and all model inputs are normalized.
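The dynamic feature space used in these experiments amounts to keeping, for each held-out sample, only its highest-ranked features. A minimal sketch follows, with hypothetical helper names and made-up importance scores; any per-sample ranking (SSFI, LIME, or a static global ranking) could supply the scores.

```python
def top_k_features(scores, k=3):
    """Indices of the k highest-scoring features (ties broken by index)."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


def project(row, features):
    """Keep only the selected feature columns of one sample."""
    return [row[f] for f in features]


# Made-up sample and per-sample importance scores for five features.
sample = [5.1, 0.2, 3.3, 7.4, 1.0]
sample_scores = {0: 0.10, 1: 0.90, 2: 0.50, 3: 0.70, 4: 0.05}

selected = top_k_features(sample_scores)   # the top-3 features for this sample
reduced = project(sample, selected)        # what the downstream model would see
```

In the dynamic experiments the training rows would be projected onto the same selected columns before the downstream model is fit for that iteration.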
| Feature Selection Method | Random Forest | Linear Regression |
|---|---|---|
| Random Forest | 66.9% | 62.6% |
The results in Table 1 highlight how the dynamic SSFI feature set selects features that outperform both LIME and RF. By validating performance on both RF and Linear Regression, we ensure our features generalize beyond RF, the model from which SSFI is generated. In general, this suggests SSFI is identifying important sample-level information about the feature space not captured by the other approaches.
| Feature Selection Method | Random Forest | Linear Regression |
The results in Table 2 were obtained under the same conditions as Table 1, but with the Breast Cancer Dataset [Dua:2019]. Once again, we find that SSFI-calculated features vastly outperform the static feature set produced by the Random Forest model. However, in this setting, LIME and SSFI generally select overlapping feature sets that produce very similar results. It may be the case that LIME performs better on binary classification tasks; nonetheless, this result shows that SSFI and LIME are at least in agreement.
These experiments highlight the existence and accessibility of low-level feature information. Furthermore, SSFI is clearly not guessing as the dynamic feature space significantly outperforms both globally important features and randomly selected features. Thus, SSFI features have been shown to be reliable predictors for their given test samples when presented with a dataset learnable by RF. However, we must note that we do not wish to compare the overall quality of LIME and SSFI. LIME’s intention is to provide a general solution to the low-level interpretability problem. Furthermore, LIME provides more information than SSFI, such as what features caused a certain class not to be predicted. Thus, the SSFI-LIME comparison simply serves to help provide a quantitative analysis in the absence of ground-truth labels. Nonetheless, the Wine classification results may suggest that, in the situations where RF can separate a dataset extremely well, RF-based interpretability methods may allow for greater knowledge extraction.
5.3 Visual Analysis
We may also interpret the SSFI feature selection choices visually by training a RF classifier on image data. RF is generally not a viable choice for image classification, but in the presence of simple grayscale images, can produce accurate results. For example, when training a RF classifier on the MNIST Handwritten Digit dataset [lecun-mnisthandwrittendigit-2010], we achieve accuracy. When we run SSFI on this trained classifier, we can visualize the most important pixels used during classification.
Figure 1 shows SSFI’s effectiveness in extracting useful features when classifying MNIST digits. We find that SSFI consistently identifies pixels that construct each digit during its feature extraction.
We perform a second level of visual analysis with the Fashion MNIST dataset [xiao2017/online] which contains 60,000 grayscale clothing images. This time, we compare SSFI with the Class Activation Map (CMAP) [CMAP], which is a deep learning strategy that visualizes the Global Average Pooling layers of a trained Convolutional Neural Network (CNN) to highlight the regions of an image which are important for the classification of that image. To do so, we use each 28×28 image in Fashion MNIST to train a ResNet16 [resnet] CNN which can classify Fashion MNIST with 95% accuracy. Next, we train a RF classifier on Fashion MNIST which obtains 85% accuracy. Given our trained CNN, we visualize the CMAP for a sample of images and compare these regions to the pixels extracted by SSFI for the same sample.
Figure 2 highlights how the SSFI extracted pixels generally fall in the extracted CMAP regions. The displayed results are what we deemed a representative sample of our findings. For example, long thin tops, as shown in row 1 of Figure 2, often had a dense region near a corner of the clothing deemed most important. Furthermore, rows 2 and 4 highlight footwear, where both SSFI and CMAP almost always found the back of the shoe/boot to be important. Finally, row 3 shows the prediction of a shirt, where the sleeves were often important for classification. These results show that when RF is able to accurately learn how to separate image data, it is in agreement with the CNN regarding pixels important for classification.
In an attempt to quantify the SSFI and CMAP comparison, we also calculated how often SSFI pixels appeared in CMAP regions. To do so, we define an evaluation metric which returns 1 if any of the top 5 pixels from SSFI appear in the red “most important” CMAP regions, and 0 otherwise. This experiment assumes the CMAP to be the ground truth, and is performed only to explore if SSFI agrees with the deep learning solution to this problem. In our experiment, we use 1000 random samples from Fashion MNIST. However, only trials where both the RF used to predict the test sample and the CNN used to generate the CMAP make a correct prediction are considered in this analysis. This constraint resulted in only 647 data points being used.
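The agreement metric can be sketched over flattened images. How the red "most important" CMAP region is delimited is not specified in the text, so the fixed `threshold` on the activation values below is an assumption, as are the function name and the toy six-pixel inputs.

```python
def cmap_agreement(ssfi_scores, cmap_values, k=5, threshold=0.8):
    """Return 1 if any of the k highest-SSFI pixels lies in the hot CMAP region."""
    top_pixels = sorted(range(len(ssfi_scores)),
                        key=lambda i: ssfi_scores[i], reverse=True)[:k]
    # ASSUMPTION: the "most important" region is every pixel at or above threshold.
    hot_region = {i for i, v in enumerate(cmap_values) if v >= threshold}
    return int(any(i in hot_region for i in top_pixels))


# Toy flattened "images" with six pixels each (hypothetical values).
ssfi_scores = [0.0, 0.9, 0.1, 0.4, 0.0, 0.7]
cmap_values = [0.1, 0.2, 0.1, 0.95, 0.3, 0.1]

hit_k2 = cmap_agreement(ssfi_scores, cmap_values, k=2)   # top pixels 1 and 5
hit_k3 = cmap_agreement(ssfi_scores, cmap_values, k=3)   # pixel 3 is now included
```

Averaging this indicator over the retained trials gives the overall agreement rate, and grouping by class gives the per-class rates.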
Table 3 displays the results of the SSFI vs CMAP comparison. In general, we find that both algorithms consider similar regions of the image to be most important to classification 73% of the time. Certain classes, such as Dress, Coat, and Shirt, brought about disagreement between algorithms. However, both algorithms are strongly aligned in their region importance for Trouser, Sandal, and Ankle Boot.
By comparing SSFI with CMAP, we may only conclude that, when RF is able to learn a dataset, it tends to identify the same regions as being important to classification as the corresponding deep learning solution. In general, our qualitative visual analysis has served to highlight the validity of the SSFI algorithm in the absence of ground-truth single-sample feature importances.
The low-level quantification of feature importance is highly desirable in practices where data samples require individualized inspection. The conditions for SSFI’s success are of course stringent. The requirement of data to be well understood by RF reduces generality but provides a tool for analysis when used in the correct setting.
This brings about interesting questions for future work. Might we validate SSFI’s ability to identify the root cause of anomaly detection problems or perhaps the improvement of classification using a dynamic feature space? In this study, our approach looked to verify the quality of the individualized features extracted by SSFI in the general sense, but future work should investigate the use of single sample importances as tools for solving more specific machine learning problems.
The research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. The work of Joseph Gatto was sponsored by JPL Summer Internship Program and the National Aeronautics and Space Administration.