Predictive machine learning models are almost everywhere in industry, and complex models such as random forests, gradient boosted trees, and deep neural networks are widely used due to their high prediction accuracy. However, interpreting predictions from these complex models remains an important challenge, and interpretations at the individual sample level are often of the most interest Ribeiro et al. (2016). There exist several state-of-the-art sample-level model interpretation approaches, e.g., SHAP Lundberg and Lee (2017), LIME Ribeiro et al. (2016), and Integrated Gradients Sundararajan et al. (2017). Among them, SHAP (SHapley Additive exPlanations) calculates SHAP values quantifying the contribution of each feature to the model prediction, by incorporating concepts from game theory and local explanations. In contrast to other approaches, SHAP has been justified as the only consistent feature attribution approach with several unique properties which agree with human intuition. Due to its solid theoretical guarantees, SHAP has become one of the top recommended model interpretation approaches in industry Lundberg et al. (2018b); Zhang et al. (2020); Agius et al. (2020); Ariza-Garzón et al. (2020).
There exist several variants of SHAP. The general version is KernelSHAP Lundberg and Lee (2017), which is model-agnostic and generally takes exponential time to compute the exact SHAP values. Its variants include TreeSHAP Lundberg et al. (2018a, 2019) and DeepSHAP Lundberg and Lee (2017), which are designed specifically for tree-based models (e.g., decision trees, random forests, gradient boosted trees) and neural-network models, respectively. In these variants, the special model structures may lead to improvements in computational efficiency. For example, the tree structure enables a polynomial time complexity for TreeSHAP. In this paper, we choose TreeSHAP for further exploration, as tree-based models are widespread in industry.
After looking into many TreeSHAP use cases, we find that despite its algorithmic complexity improvement, computing SHAP values for a large sample size (e.g., tens of millions of samples) or a large model size (e.g., a large maximum tree depth) remains a computational concern in practice. For example, explaining 20 million samples for a random forest model with 400 trees and maximum tree depth 12 can take as long as 15 hours even on a 100-core server (more details in Appendix A.8). In fact, the need to explain tens of millions of samples or more is widespread in industry-level predictive models, e.g., feed ranking models, ads targeting models, and subscription propensity models. Spending tens of hours in model interpretation becomes a significant bottleneck in these modeling pipelines, causing long delays in post-hoc model diagnosis via important feature analysis, as well as long waiting times in preparing feature reasoning for model end users.
In this paper, we conduct a thorough inspection of the TreeSHAP algorithm, with a focus on improving its computational efficiency when a large number of samples is to be explained. Accordingly, we take the number of samples into consideration in our time complexity analysis of the TreeSHAP algorithm. We propose two new algorithms: Fast TreeSHAP v1 and Fast TreeSHAP v2. Fast TreeSHAP v1 builds on TreeSHAP by redesigning the way certain computationally expensive components are calculated. In practice, Fast TreeSHAP v1 consistently boosts the computational speed by about 1.5x with the same memory cost as TreeSHAP. Fast TreeSHAP v2 builds on Fast TreeSHAP v1 by further pre-computing certain computationally expensive components which are independent of the samples to be explained. It largely reduces the time complexity of calculating SHAP values, leading to 2-3x faster computation than TreeSHAP in practice with just a small sacrifice in memory. The pre-computation step makes Fast TreeSHAP v2 well suited to multi-time model interpretation, where the model is pre-trained and remains unchanged while new scoring data arrive on a regular basis.
2 Related Works
Since the introduction of SHAP Lundberg and Lee (2017) and TreeSHAP Lundberg et al. (2018a, 2019), much related work has been done in this area. Most of it focuses on applications in a variety of areas, including medical science Lundberg et al. (2018b); Zhang et al. (2020); Agius et al. (2020), social science Ayush et al. (2020), finance Ariza-Garzón et al. (2020) and sports Rommers et al. (2020). There are also many papers focusing on the design of SHAP/TreeSHAP implementations Wang et al. (2019); Murdoch et al. (2019); Kaur et al. (2020); Yang et al. (2021), as well as on the theoretical justification of SHAP values Janzing et al. (2020); Kumar et al. (2020); Sundararajan and Najmi (2020); Aas et al. (2019); Frye et al. (2020); Merrick and Taly (2020); Chen et al. (2020).
Besides these works, only a few focus on the computational efficiency of SHAP/TreeSHAP. The authors of Greenwell (2019) developed the R package fastshap to approximate SHAP values for arbitrary models by calling the scoring function as few times as possible. In Gupta and Joseph (2020), a model-agnostic version of SHAP value approximation was implemented on Spark. Neither work was specifically designed for tree-based models, and thus the polynomial time complexity available for trees may not be leveraged. The authors of Maksymiuk et al. (2020); Komisarczyk et al. (2020) built R packages shapper and treeshap as R wrappers of the SHAP Python library, which achieved comparable speed. However, no algorithmic improvements were made. In Messalas et al. (2020, 2019), MASHAP was proposed to compute SHAP values for arbitrary models in an efficient way, where an arbitrary model is first approximated by a surrogate XGBoost model, and TreeSHAP is then applied on this surrogate model to calculate SHAP values. As far as we know, the work most related to improving the computational efficiency of TreeSHAP is Mitchell et al. (2020), where the authors presented GPUTreeShap as a GPU implementation of the TreeSHAP algorithm. Our work differs from GPUTreeShap in that it focuses on improving the computational complexity of the algorithm itself, whereas the speedup of GPUTreeShap comes from the parallelism of the GPU rather than from an improvement in algorithmic complexity. In future work, it is possible to further combine Fast TreeSHAP v1 and v2 with GPUTreeShap.
3 Background
In this section, we review the definition of SHAP values Lundberg and Lee (2017), and then walk through the derivation of the TreeSHAP algorithm Lundberg et al. (2018a, 2019), which directly leads to the development of Fast TreeSHAP v1 and v2.
3.1 SHAP Values
Let $f$ be the predictive model to be explained, and let $\mathcal{N}$ be the set of all input features, with $N = |\mathcal{N}|$. The model $f$ maps an input feature vector $x$ to an output $f(x)$, where $f(x) \in \mathbb{R}$ in regression and $f(x) \in [0, 1]$ in (binary) classification. SHAP values are defined as the coefficients of the additive surrogate explanation model $g(z') = \phi_0 + \sum_{i \in \mathcal{N}} \phi_i z'_i$. Here $z'_i \in \{0, 1\}$ represents a feature being observed ($z'_i = 1$) or unknown ($z'_i = 0$), and $\phi_i \in \mathbb{R}$ are the feature attribution values. Note that the surrogate explanation model $g(z')$ is a local approximation to the model prediction $f(x)$ given an input feature vector $x$.
As described in Lundberg and Lee (2017), SHAP values are the unique solution for the $\phi_i$'s in the class of additive surrogate explanation models that satisfies three desirable properties: local accuracy, missingness, and consistency. To compute SHAP values, we denote by $f_S(x)$ the model output restricted to the feature subset $S \subseteq \mathcal{N}$, and the SHAP values are then computed based on the classic Shapley values:
$$\phi_i = \sum_{S \subseteq \mathcal{N} \setminus \{i\}} \frac{|S|!\,(N - |S| - 1)!}{N!} \left[ f_{S \cup \{i\}}(x) - f_S(x) \right]. \tag{1}$$
It remains to define $f_S(x)$. In recent literature, there exist two main options for defining $f_S(x)$: the conditional expectation $f_S(x) = \mathbb{E}[f(X) \mid X_S = x_S]$, and the marginal expectation $f_S(x) = \mathbb{E}[f(x_S, X_{\bar{S}})]$, where $x_S$ is the sub-vector of $x$ restricted to the feature subset $S$, and $\bar{S} = \mathcal{N} \setminus S$. While which option is more appropriate is still under debate Janzing et al. (2020); Kumar et al. (2020); Sundararajan and Najmi (2020); Aas et al. (2019); Frye et al. (2020); Merrick and Taly (2020); Chen et al. (2020), both options need exponential time to compute, as Equation 1 considers all possible subsets of $\mathcal{N}$.
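To make the exponential cost of the classic Shapley formula concrete, the following minimal sketch enumerates every feature subset directly. The names `shapley_values` and `toy_value` are hypothetical and introduced only for illustration; `toy_value` stands in for the restricted model output on a feature subset and is not a model from this paper.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every subset of the other features.

    value_fn(S) must return the restricted model output for a frozenset S
    of feature indexes. The cost grows as 2**n_features.
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight |S|! (N - |S| - 1)! / N!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(s):
    """Hypothetical value function: additive effects plus one interaction."""
    base = {0: 1.0, 1: 2.0, 2: -0.5}
    total = sum(base[j] for j in s)
    if 0 in s and 1 in s:
        total += 4.0  # interaction shared equally by features 0 and 1
    return total


phi = shapley_values(toy_value, 3)
```

In this toy example the interaction term is split evenly between features 0 and 1, and the attributions sum to the difference between the value of the full feature set and the value of the empty set, which is the local accuracy property.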
3.2 SHAP Values for Trees
For a tree-based model $f$, it is sufficient to investigate how to calculate SHAP values on a single tree, since the SHAP values of a tree ensemble equal the sums of the SHAP values of its individual trees, according to the additivity property of SHAP values Lundberg et al. (2018a, 2019). The authors of Lundberg et al. (2018a, 2019) define a conditional expectation $f_S(x)$ for tree-based models. The basic idea is to calculate $f_S(x)$ by recursively following the decision path for $x$ if the split feature in the decision path is in $S$, and taking the weighted average of both branches if the split feature is not in $S$. The weights are the proportions of training data that flow down the left and right branches.
Algorithm 1 proposed in Lundberg et al. (2018a, 2019) (Appendix A.1) provides the details to calculate $f_S(x)$ for tree-based models. A tree is specified as a tuple of six vectors $\{v, a, b, t, r, d\}$: $v$ contains the values of each leaf node; $a$ and $b$ represent the left and right node indexes for each internal node; $t$ contains the thresholds for each internal node; $d$ contains the feature indexes used for splitting in internal nodes; and $r$ represents the cover of each node (i.e., how many training samples fall in that node). The time complexity of calculating $f_S(x)$ in Algorithm 1 is $O(L)$, where $L$ is the number of leaves in a tree, since we need to loop over each node in the tree. This leads to an exponential complexity of $O(MTL2^N)$ for computing SHAP values for $M$ samples with a total of $T$ trees and $N$ features. We next show how TreeSHAP reduces this time complexity from exponential to polynomial.
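The recursion described above can be sketched in a few lines. This is an illustrative reading of Algorithm 1, not the reference implementation: the function name `expvalue`, the encoding of leaves by a child index of -1, and the small example tree are all assumptions made for this sketch.

```python
def expvalue(x, S, tree):
    """Estimate the restricted output for feature subset S on a single tree.

    tree is a tuple of six lists (v, a, b, t, r, d): node values, left child,
    right child, thresholds, covers, and split features. Leaves are encoded
    here with a[j] == -1 (an assumption of this sketch).
    """
    v, a, b, t, r, d = tree

    def g(j):
        if a[j] < 0:  # leaf node: return its value
            return v[j]
        if d[j] in S:  # known feature: follow the decision path of x
            return g(a[j]) if x[d[j]] <= t[j] else g(b[j])
        # unknown feature: average both branches, weighted by training cover
        return (g(a[j]) * r[a[j]] + g(b[j]) * r[b[j]]) / r[j]

    return g(0)


# Hypothetical tree: node 0 splits on feature 0 at 0.5, node 2 on feature 1 at 2.0.
TREE = (
    [0.0, 1.0, 0.0, 3.0, 5.0],   # v: leaf values (internal entries unused)
    [1, -1, 3, -1, -1],          # a: left child indexes
    [2, -1, 4, -1, -1],          # b: right child indexes
    [0.5, 0.0, 2.0, 0.0, 0.0],   # t: thresholds
    [100, 40, 60, 20, 40],       # r: covers
    [0, -1, 1, -1, -1],          # d: split features
)
x = [0.7, 1.0]
f_empty = expvalue(x, set(), TREE)      # no feature known: cover-weighted mean
f_full = expvalue(x, {0, 1}, TREE)      # all features known: ordinary prediction
```

With the empty subset the recursion returns the cover-weighted average of the leaf values, and with all features known it returns the tree's ordinary prediction for `x`. Each call visits at most every node once, which is where the per-subset cost linear in the number of leaves comes from.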
TreeSHAP proposed in Lundberg et al. (2018a, 2019) runs in $O(MTLD^2)$ time and $O(D^2 + N)$ memory, where $M$ is the total number of samples to be explained, $N$ is the number of features, $T$ is the number of trees, $D$ is the maximum depth of any tree, and $L$ is the maximum number of leaves in any tree. The intuition of the polynomial-time algorithm is to recursively keep track of the proportion of all possible feature subsets that flow down into each leaf node of the tree, which is similar to running Algorithm 1 simultaneously for all $2^N$ feature subsets. We refer readers to Lundberg et al. (2018a, 2019) for the algorithm details. In this paper, we present our understanding of the derivation of the TreeSHAP algorithm, which also leads to our proposed Fast TreeSHAP algorithms with computational advantages.
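We introduce some additional notation. Assume that in a given tree $f$ there are $L$ leaves with values $v_1$ to $v_L$, corresponding to paths $P_1$ to $P_L$, where path $P_k$ is the set of internal nodes starting from the root node and ending at the leaf node $v_k$ (the leaf node is not included in the path). We use $j \in P_k$ to denote the $j$th internal node in path $P_k$ hereafter. We also use $D_k = \{d_j : j \in P_k\}$ to denote the feature set used for splitting in internal nodes in path $P_k$. Moreover, we denote the feature subspaces restricted by the thresholds along the path $P_k$ as $\{T_{kj} : j \in P_k\}$, where $T_{kj} = (-\infty, t_{kj}]$ if node $j$ connects to its left child in path $P_k$ or $T_{kj} = (t_{kj}, \infty)$ if node $j$ connects to its right child in path $P_k$ ($\{t_{kj} : j \in P_k\}$ are the thresholds for internal nodes in path $P_k$), and we also denote the covering ratios along the path $P_k$ as $\{R_{kj} : j \in P_k\}$, where $R_{kj} = r_{k,j+1} / r_{kj}$ is the fraction of the training samples at node $j$ that continue to the next node along the path ($\{r_{kj} : j \in P_k\}$ are the covers of the nodes in path $P_k$). The formula of $f_S(x)$ in Algorithm 1 can be simplified as:
$$f_S(x) = \sum_{k=1}^{L} \Big( \prod_{j \in P_k,\, d_j \in S} \mathbb{1}\{x_{d_j} \in T_{kj}\} \prod_{j \in P_k,\, d_j \notin S} R_{kj} \Big) v_k,$$
where $\mathbb{1}\{\cdot\}$ is an indicator function. Let $W_{k,S} := \prod_{j \in P_k,\, d_j \in S} \mathbb{1}\{x_{d_j} \in T_{kj}\} \prod_{j \in P_k,\, d_j \notin S} R_{kj}$, i.e., $W_{k,S}$ is the "proportion" of subset $S$ that flows down into the leaf node $v_k$; then $f_S(x) = \sum_{k=1}^{L} W_{k,S} v_k$. Plugging it into Equation 1 leads to the SHAP value $\phi_i = \sum_{S \subseteq \mathcal{N} \setminus \{i\}} \frac{|S|!\,(N - |S| - 1)!}{N!} \sum_{k=1}^{L} (W_{k,S \cup \{i\}} - W_{k,S}) v_k$, where $\mathcal{N}$ is the set of all $N$ input features.
Theorem 1. $\phi_i$ can be computed by only considering the paths which contain feature $i$ and all the subsets within these paths (i.e., instead of considering all $2^N$ subsets). Specifically,
$$\phi_i = \sum_{k:\, i \in D_k} \left[ \sum_{m=0}^{|D_k|-1} \frac{m!\,(|D_k|-m-1)!}{|D_k|!} \sum_{\substack{S \subseteq D_k \setminus \{i\} \\ |S| = m}} \prod_{j:\, d_j \in S} \mathbb{1}\{x_{d_j} \in T_{kj}\} \prod_{j:\, d_j \in D_k \setminus (S \cup \{i\})} R_{kj} \right] \left( \prod_{j:\, d_j = i} \mathbb{1}\{x_{d_j} \in T_{kj}\} - \prod_{j:\, d_j = i} R_{kj} \right) v_k. \tag{2}$$
The proof of Theorem 1 can be found in Appendix A.2. For convenience, we call $\frac{m!\,(|D_k|-m-1)!}{|D_k|!}$ the "Shapley weight" for subset size $m$ and path $P_k$. We point out that the TreeSHAP algorithm is built exactly on Theorem 1. Specifically, the EXTEND method in the TreeSHAP algorithm keeps track of the sum of the "proportion"s of all subsets that flow down into a certain leaf node, weighted by its Shapley weight, for each possible subset size. When descending a path, EXTEND is called repeatedly to take a new feature in the path and add its contribution to the sum of the "proportion"s of all feature subsets of size up to the current depth. At the leaf node, EXTEND has produced one such weighted sum for each possible subset size. The UNWIND method in the TreeSHAP algorithm is used to undo a previous call to EXTEND, i.e., to remove the contribution of a feature previously added via EXTEND, and is indeed commutative with EXTEND. UNWIND can be used when duplicated features are encountered in the path, or at the leaf node when calculating the contribution of each feature in the path to SHAP values according to Equation 2.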
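As a sanity check on the path-wise view in Theorem 1, the sketch below compares a brute-force Shapley computation over all features with a computation that only visits the paths containing the feature of interest and the subsets within each path. It is an illustration under stated assumptions, not the TreeSHAP implementation: the three-path tree `PATHS`, the helper names, and the representation of each internal node as `(feature, lower, upper, covering ratio)` are all hypothetical.

```python
from itertools import combinations
from math import factorial, inf, prod

# Each path: (internal nodes, leaf value). A node is (feature, lower, upper, ratio),
# meaning the sample must satisfy lower < x[feature] <= upper to follow the path.
PATHS = [
    ([(0, -inf, 0.5, 0.4)], 1.0),
    ([(0, 0.5, inf, 0.6), (1, -inf, 2.0, 1 / 3)], 3.0),
    ([(0, 0.5, inf, 0.6), (1, 2.0, inf, 2 / 3)], 5.0),
]


def proportion(x, nodes, S):
    """'Proportion' of subset S along the given nodes: indicator if the
    feature is in S, covering ratio otherwise."""
    return prod(float(lo < x[f] <= hi) if f in S else R for f, lo, hi, R in nodes)


def f_S(x, S):
    return sum(proportion(x, nodes, S) * v for nodes, v in PATHS)


def shap_brute_force(x, i, n_features):
    """Classic Shapley formula: enumerate all subsets of the other features."""
    others = [j for j in range(n_features) if j != i]
    phi = 0.0
    for m in range(len(others) + 1):
        w = factorial(m) * factorial(n_features - m - 1) / factorial(n_features)
        for S in combinations(others, m):
            phi += w * (f_S(x, set(S) | {i}) - f_S(x, set(S)))
    return phi


def shap_by_paths(x, i):
    """Path-wise form: only paths containing feature i, subsets within each path."""
    phi = 0.0
    for nodes, v in PATHS:
        feats = {f for f, _, _, _ in nodes}
        if i not in feats:
            continue
        rest = [n for n in nodes if n[0] != i]
        own = [n for n in nodes if n[0] == i]
        others = sorted(feats - {i})
        inner = 0.0
        for m in range(len(others) + 1):
            w = factorial(m) * factorial(len(feats) - m - 1) / factorial(len(feats))
            inner += w * sum(proportion(x, rest, set(S)) for S in combinations(others, m))
        delta = proportion(x, own, {i}) - proportion(x, own, set())
        phi += inner * delta * v
    return phi


x = [0.7, 1.0, 9.9]  # feature 2 is never used by the tree
phi_brute = [shap_brute_force(x, i, 3) for i in range(3)]
phi_paths = [shap_by_paths(x, i) for i in range(3)]
```

The two lists agree up to floating-point error, and the unused feature 2 receives zero attribution. The path-wise version never needs to know the total number of features, which is the property that EXTEND and UNWIND exploit.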
4 Fast TreeSHAP
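We now further simplify Equation 2 for an even faster computational speed. Consider the subset $E_k \subseteq D_k$ which consists of all the features in $D_k$ not satisfying the thresholds along the path $P_k$, i.e., $E_k = \{d_j : j \in P_k,\ x_{d_j} \notin T_{kj}\}$. Also define $W_{k,m,C} := \frac{m!\,(|D_k|-m-1)!}{|D_k|!} \sum_{S \subseteq C,\, |S| = m} \prod_{j:\, d_j \in C \setminus S} R_{kj}$ for $m = 0, \ldots, \min(|C|, |D_k|-1)$, where $C$ is a subset of $D_k$. We see that when $C = D_k \setminus E_k$, $W_{k,m,C}$ can be interpreted as the sum of the "proportion"s of all subsets that flow down into a leaf node $v_k$ (each feature in the subsets must satisfy the thresholds along the path $P_k$) weighted by its Shapley weight for subset size $m$. Finally, we define $R_{k,C} := \prod_{j:\, d_j \in C} R_{kj}$. Theorem 2 further simplifies Equation 2 (proof in Appendix A.3):
Theorem 2. $\phi_i$ can be computed by only considering the subsets within the paths where each feature in the subsets satisfies the thresholds along the path. Specifically, writing $C_k = D_k \setminus E_k$,
$$\phi_i = \sum_{k:\, i \in C_k} \Big( \sum_{m=0}^{|C_k|-1} W_{k,m,C_k \setminus \{i\}} \Big) R_{k,E_k} \big( 1 - R_{k,\{i\}} \big) v_k \;-\; \sum_{k:\, i \in E_k} \Big( \sum_{m=0}^{|C_k|} W_{k,m,C_k} \Big) R_{k,E_k} v_k. \tag{3}$$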
4.1 Fast TreeSHAP v1 Algorithm
We propose the Fast TreeSHAP v1 algorithm based on Theorem 2, which runs in $O(MTLD^2)$ time and $O(D^2 + N)$ memory. This computational complexity looks the same as in the original TreeSHAP. However, we will show in Sections 4.1.1 and 5 that the average running time can be largely reduced.
In the Fast TreeSHAP v1 algorithm (Algorithm A.4 in Appendix A.4), we follow a similar setting to the original TreeSHAP algorithm, where both the EXTEND and UNWIND methods are used. The EXTEND method keeps track, for each subset size, of the sum of the "proportion"s of all subsets that flow down into a leaf node, weighted by the Shapley weight for that subset size, where each feature in the subsets must satisfy the thresholds along the path. Compared with the EXTEND method in the original TreeSHAP algorithm, the main difference is this constraint on the subsets, which largely reduces the number of subset sizes to be considered. Specifically, in Fast TreeSHAP v1, when descending a path, EXTEND is called only when a new feature in the path satisfies the threshold, and it then adds the contribution of this feature to the sum of "proportion"s of all feature subsets of size up to the number of features satisfying the thresholds until the current depth. When reaching the leaf node, the number of possible subset sizes considered by EXTEND is determined by the number of features in the path that satisfy the thresholds in Fast TreeSHAP v1, rather than by the number of all features in the path as in the original TreeSHAP. The UNWIND method is still used to undo a previous call to EXTEND. Specifically, it is used when duplicated features are encountered in the path, or at the leaf node when calculating the weighted sum that excludes each feature satisfying the thresholds. Besides EXTEND and UNWIND, we also keep track of the product of covering ratios of all features not satisfying the thresholds along the path (the corresponding factor in Equation 3), which is trivial.
4.1.1 Complexity Analysis
In the original TreeSHAP, the complexity of EXTEND and UNWIND is bounded by $O(D)$, since both need to loop over the possible subset sizes, whose number is determined by the number of features in the path and is therefore at most of the order of the maximum depth $D$. At each internal node, EXTEND is called once, while at each leaf node, UNWIND is called $D$ times to update SHAP values for each of the $D$ features in the path. This leads to a complexity of $O(LD^2)$ for the entire tree, because the work done at the leaves dominates the work at the internal nodes.
In Fast TreeSHAP v1, both EXTEND and UNWIND need to loop over the possible subset sizes under the constraint that each feature in the subsets satisfies the thresholds, whose number is determined by the number of features in the path that satisfy the thresholds. Thus, although the complexity of EXTEND and UNWIND is still bounded by $O(D)$, the average running time can be reduced to 50%, which equals the average ratio between the number of possible subset sizes under the constraint and the number of all possible subset sizes. Moreover, according to Equation 3, at the leaf node of a path, UNWIND is called once for each of the features satisfying the thresholds in the path, and only once for all other features in the path combined. Therefore, although the number of UNWIND calls is still bounded by $O(D)$, the actual number can also be lowered by 50% on average. As a result, although we still have the complexity of $O(LD^2)$ for the entire tree, the average running time can be reduced to 25% of that of the original TreeSHAP. Finally, the complexity is $O(MTLD^2)$ for the entire ensemble of $T$ trees and $M$ samples to be explained, with the running time reduced to 25% on average compared with the original TreeSHAP.
4.2 Fast TreeSHAP v2 Algorithm
We propose the Fast TreeSHAP v2 algorithm, which runs in $O(TL2^DD + MTLD)$ time and $O(L2^D)$ memory. For balanced trees it becomes $O(TL^2D + MTLD)$ time and $O(L^2)$ memory. Compared with the $O(MTLD^2)$ time and $O(D^2 + N)$ memory of the original TreeSHAP and Fast TreeSHAP v1, Fast TreeSHAP v2 outperforms in computational speed when the number of samples $M$ exceeds $2^{D+1}/D$, where $D$ is the maximum depth of any tree (more details in the next paragraph). Fast TreeSHAP v2 has a stricter restriction on tree size due to memory concerns. In practice, it works well for trees with maximum depth as large as 16 on an ordinary laptop, which covers most of the use cases of tree-based models. We discuss the memory concerns in detail in Sections 4.3 and 5.
The design of the Fast TreeSHAP v2 algorithm is inspired by the Fast TreeSHAP v1 algorithm. Recall that the loops of UNWIND at the leaves dominate the complexity of Fast TreeSHAP v1, where the length of each loop is $O(D)$, and each call to UNWIND also takes $O(D)$ time, resulting in $O(D^2)$ time complexity at each leaf node. While looping over each of the $D$ features in the path is inevitable in updating SHAP values at the leaf node (i.e., the loop of length $O(D)$ is necessary), our question is: is it possible to get rid of calling UNWIND? From Equation 3 we see that the ultimate goal of calling UNWIND is to calculate the Shapley-weighted sums over the set of features satisfying the thresholds, with and without each such feature. We also note that, although the set of features satisfying the thresholds may vary from sample to sample, all the possible values of these sums fall in a set of size at most $2^D$ per path, with one value for each subset of the features in the path. Therefore, a natural idea to reduce the computational complexity is, instead of calling UNWIND to calculate these sums every time we explain a sample, to pre-compute all the values in this set, which depend only on the tree itself, and then extract the corresponding value when looping over features at leaf nodes to calculate $\phi_i$ for each specific sample to be explained. In effect, this trades space complexity for time complexity. It should significantly save computation when there are redundant computations of these sums across samples, which generally happens when $MD/2 > 2^D$, i.e., $M > 2^{D+1}/D$ (for each sample, around $D/2$ sums should be calculated for each path, thus on average $MD/2$ calculations are needed for $M$ samples). This commonly occurs in a moderate-sized dataset, e.g., $M > 22$ when $D = 6$, $M > 205$ when $D = 10$, and $M > 2341$ when $D = 14$ (thresholds rounded up). We show the appropriateness of trading space complexity for time complexity in practice in Section 5.
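The sample-size requirement can be tabulated with a short helper. This sketch assumes the break-even condition reads $M \cdot D / 2 > 2^D$, i.e., $M > 2^{D+1}/D$, which is how the counting argument above is interpreted here; the function name `v2_break_even_samples` is hypothetical.

```python
def v2_break_even_samples(max_depth):
    """Smallest integer M satisfying M > 2**(D + 1) / D for maximum depth D.

    Assumes the break-even rule M * D / 2 > 2**D from the argument above.
    """
    if max_depth < 1:
        raise ValueError("max_depth must be a positive integer")
    # Integer arithmetic avoids floating-point rounding for large depths.
    return (2 ** (max_depth + 1)) // max_depth + 1


# Break-even sample counts for a few tree depths.
thresholds = {depth: v2_break_even_samples(depth) for depth in (6, 8, 10, 12, 14, 16)}
```

Under this assumption the required number of samples grows roughly as $2^D$, so it stays in the hundreds for depth 12 and in the low thousands for depth 16, small compared with typical scoring datasets.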
We split the Fast TreeSHAP v2 algorithm into two parts: Fast TreeSHAP Prep and Fast TreeSHAP Score. Fast TreeSHAP Prep (Algorithm A.5 in Appendix A.5) calculates the set of pre-computed values for every path in the tree, and Fast TreeSHAP Score (Algorithm A.5 in Appendix A.5) calculates the $\phi_i$'s for all samples to be explained based on these pre-computed values. The main output of Fast TreeSHAP Prep is an $L \times 2^D$ matrix in which each row records the pre-computed values for one path. To calculate this matrix, similar to the original TreeSHAP and Fast TreeSHAP v1, both the EXTEND and UNWIND methods are used. The EXTEND method keeps track of the Shapley-weighted "proportion" sums for all possible feature subsets of the path simultaneously, and the UNWIND method undoes a previous call to EXTEND when duplicated features are encountered in the path. At the leaf node, the value for each subset is obtained by summing up across subset sizes, again for all possible subsets simultaneously. In Fast TreeSHAP Score, given a feature vector $x$, we need to find, for each path, the feature subset within the path where each feature satisfies the thresholds along the path, and then extract the corresponding values from the pre-computed matrix.
4.2.1 Complexity Analysis
In Fast TreeSHAP Prep, the time complexities of both EXTEND at the internal node and the calculation at the leaf node are bounded by $O(2^D \cdot D)$, where $2^D$ comes from the number of possible subsets within each path, and $D$ comes from the number of possible subset sizes. Thus the time complexity is $O(TL2^DD)$ for the entire ensemble of $T$ trees. Note that this time complexity is independent of the number of samples to be explained, so this entire part can be pre-computed, and the resulting matrix can be stored together with other tree properties to facilitate future SHAP value calculation. The space complexity is dominated by the pre-computed matrix, which is $O(L2^D)$. Note that this complexity is for one tree. In practice, there are two ways to achieve this complexity for an ensemble of $T$ trees: i) sequentially calculate the matrix for each tree, and update SHAP values for all samples immediately after each matrix is calculated; ii) pre-calculate the matrices for all trees and store them on the local disk, then sequentially read each matrix into memory and update SHAP values for all samples accordingly.
In Fast TreeSHAP Score, it takes $O(1)$ time at each internal node to update the set of features satisfying the thresholds, and $O(D)$ time at each leaf node to loop over each of the $D$ features in the path and extract its corresponding value from the pre-computed matrix (each lookup takes $O(1)$ time). Therefore, the loops at the leaves dominate the complexity of Fast TreeSHAP Score, which is $O(LD)$ per tree. Finally, the complexity is $O(MTLD)$ for the entire ensemble of $T$ trees and $M$ samples to be explained. Compared with the $O(MTLD^2)$ complexity of the original TreeSHAP and Fast TreeSHAP v1, this is a $D$-time improvement in computational complexity.
4.3 Fast TreeSHAP Summary
Table 1 summarizes the time and space complexities of each variant of the TreeSHAP algorithm ($M$ is the number of samples to be explained, $N$ is the number of features, $T$ is the number of trees, $L$ is the maximum number of leaves in any tree, and $D$ is the maximum depth of any tree).
|TreeSHAP Version||Time Complexity||Space Complexity|
|Original TreeSHAP||$O(MTLD^2)$||$O(D^2 + N)$|
|Fast TreeSHAP v1¹||$O(MTLD^2)$||$O(D^2 + N)$|
|Fast TreeSHAP v2 (general case)||$O(TL2^DD + MTLD)$||$O(L2^D)$|
|Fast TreeSHAP v2 (balanced trees)||$O(TL^2D + MTLD)$||$O(L^2)$|
¹ Average running time is reduced to 25% of original TreeSHAP.
Fast TreeSHAP v1 strictly outperforms the original TreeSHAP in average running time and performs comparably with it in space allocation. We therefore recommend at least replacing the original TreeSHAP with Fast TreeSHAP v1 in any tree-based model interpretation use case.
We consider two scenarios in model interpretation use cases to compare Fast TreeSHAP v1 and v2:
One-time usage: We explain all the samples once, which usually occurs in ad-hoc model diagnosis. In this case, as mentioned in Section 4.2, Fast TreeSHAP v2 is preferred when $M > 2^{D+1}/D$ (this commonly occurs in a moderate-sized dataset, as most tree-based models produce trees with depth $\le 16$). Also, Fast TreeSHAP v2 is under a stricter memory constraint: $O(L2^D)$. For reference, for a double-type matrix (assuming complete balanced trees, i.e., $L = 2^D$), the space allocation is 32KB for $D = 6$, 8MB for $D = 10$, and 2GB for $D = 14$. In practice, when $D$ becomes larger, it becomes harder to build a complete balanced tree, i.e., $L$ will be much smaller than $2^D$, leading to a much smaller memory allocation than the theoretical upper bound. We will see this in Section 5.
Multi-time usage: We have a stable model in the backend, and we receive new data to be scored on a regular basis. This happens in most use cases of predictive modeling in industry, where the model is retrained at a monthly or yearly frequency but the scoring data are generated at a daily or weekly frequency. One advantage of Fast TreeSHAP v2 is that it is well suited to this multi-time usage scenario. In Fast TreeSHAP v2, we only need to calculate the pre-computed matrix once, store it on the local disk, and read it when new samples arrive, which leads to a $D$-time computational speedup over Fast TreeSHAP v1.
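The memory figures quoted for the one-time usage scenario can be reproduced with a small calculation. The sketch assumes the pre-computed matrix has one row per leaf and $2^D$ columns of 8-byte doubles, which is consistent with the 32KB, 8MB, and 2GB figures in the text; the function name `prep_matrix_bytes` is hypothetical.

```python
def prep_matrix_bytes(num_leaves, max_depth, bytes_per_entry=8):
    """Approximate size in bytes of the pre-computed matrix for one tree.

    Assumes num_leaves rows and 2**max_depth columns of fixed-size entries
    (8 bytes for double precision).
    """
    return num_leaves * (2 ** max_depth) * bytes_per_entry


def complete_tree_bytes(max_depth):
    """Worst case: a complete balanced tree with 2**max_depth leaves."""
    return prep_matrix_bytes(2 ** max_depth, max_depth)


KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3
sizes = {depth: complete_tree_bytes(depth) for depth in (6, 10, 14)}
```

Because real trees at large depth are rarely complete, passing the actual leaf count to `prep_matrix_bytes` usually gives a much smaller figure than `complete_tree_bytes`.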
5 Evaluation
We train random forest models of different sizes for evaluation on the datasets listed in Table 4 in Appendix A.6, with the goal of evaluating a wide range of tree ensembles representative of different real-world settings. While the first three datasets in Table 4, Adult Kohavi and others (1996), Superconductor Hamidieh (2018), and Crop Khosravi and Alavipanah (2019); Khosravi et al. (2018), are publicly available, we also include one LinkedIn internal dataset, "Upsell", to better illustrate the TreeSHAP implementation in industry. The Upsell dataset is used to predict how likely each LinkedIn customer is to purchase more Recruiters products by using features including product usage, recruiter activity, and company attributes. For each dataset, we fix the number of trees at 100, and we train a small, medium, large, and extra-large model variant by setting the maximum depth of trees to 4, 8, 12, and 16 respectively. Other hyperparameters in the random forest are left at their defaults. Summary statistics for each model variant are listed in Table 5 in Appendix A.6.
We compare the execution times of Fast TreeSHAP v1 and v2 against the existing TreeSHAP implementation in the open source SHAP package (https://github.com/slundberg/shap). For a fair comparison, we directly modify the C file treeshap.h in the SHAP package to incorporate both Fast TreeSHAP v1 and v2. All the evaluations were run on a single core of an Azure Virtual Machine of size Standard_D8_v3 (8 cores and 32GB memory). We ran each evaluation on 10,000 samples. In Table 2, results are averaged over 5 runs and standard deviations are also presented. To verify the correctness of Fast TreeSHAP v1 and v2, in each run we also compare the SHAP values calculated by Fast TreeSHAP v1 and v2 with the SHAP values from the original TreeSHAP; the maximal element-wise difference observed over the entire evaluation is small enough to be attributed to numerical error.
|Model||Original TreeSHAP (s)||Fast TreeSHAP v1 (s)||Speedup||Fast TreeSHAP v2 (s)||Speedup|
|Adult-Small||2.40 (0.03)||2.11 (0.02)||1.14||1.30 (0.04)||1.85|
|Adult-Med||61.04 (0.61)||44.09 (0.61)||1.38||26.62 (1.08)||2.29|
|Adult-Large||480.33 (3.60)||333.94 (4.20)||1.44||161.43 (3.95)||2.98|
|Adult-xLarge||1805.54 (13.75)||1225.20 (8.97)||1.47||827.62 (16.17)||2.18|
|Super-Small||2.50 (0.03)||2.04 (0.02)||1.23||1.28 (0.08)||1.95|
|Super-Med||89.93 (3.58)||60.04 (3.56)||1.50||35.65 (2.06)||2.52|
|Super-Large||1067.18 (10.52)||663.02 (5.79)||1.61||384.14 (4.78)||2.78|
|Super-xLarge||3776.44 (28.77)||2342.44 (35.23)||1.61||1988.48 (15.19)||1.90|
|Crop-Small||3.53 (0.07)||2.90 (0.04)||1.22||3.15 (0.02)||1.12|
|Crop-Med||69.88 (0.71)||50.13 (0.91)||1.39||34.57 (1.49)||2.02|
|Crop-Large||315.27 (6.37)||216.05 (8.64)||1.46||130.66 (3.80)||2.41|
|Crop-xLarge||552.23 (10.37)||385.51 (8.48)||1.43||290.49 (3.19)||1.90|
|Upsell-Small||2.80 (0.04)||2.23 (0.05)||1.26||2.20 (0.06)||1.27|
|Upsell-Med||90.64 (4.59)||63.34 (1.82)||1.43||34.02 (0.93)||2.66|
|Upsell-Large||790.83 (5.79)||515.16 (1.66)||1.54||282.98 (4.89)||2.79|
|Upsell-xLarge||2265.82 (17.44)||1476.56 (4.20)||1.53||1166.98 (15.02)||1.94|
We conduct pairwise comparisons between these three algorithms:
Original TreeSHAP vs Fast TreeSHAP v1: For medium, large, and extra-large models, we observe speedups consistently around 1.5x. We observe a lower speedup (around 1.2x) for small models, probably because the computationally expensive parts account for too little of the total work. These speedups are also much lower than the theoretical upper bound (4x, corresponding to an average running time of 25%) discussed in Section 4.1, which is probably due to the existence of other tasks with slightly lower computational complexity in the algorithm.
Original TreeSHAP vs Fast TreeSHAP v2: For medium and large models, we observe speedups around 2.5-3x, while the speedups drop to around 2x for extra-large models. This is because the first step of Fast TreeSHAP v2, Fast TreeSHAP Prep, takes much longer for larger models, and the execution time of Fast TreeSHAP v2 listed in Table 2 is the sum of its two steps. Later in this section, we examine the execution times of Fast TreeSHAP Prep and Fast TreeSHAP Score separately.
Fast TreeSHAP v1 vs Fast TreeSHAP v2: The speedups of Fast TreeSHAP v2 are consistently higher than the speedups of Fast TreeSHAP v1 except for small models, showing the effectiveness of Fast TreeSHAP v2 in improving the computational complexity. Their comparable performance for small models is also due to the insufficient computation.
|Model||Original TreeSHAP (s)||Fast TreeSHAP Prep (s)||Fast TreeSHAP Score (s)||Speedup (Large $M$)||Space Allocation of Pre-computed Matrix|
|Adult-Small||2.40 (0.03)||<0.01 (<0.01)||1.30 (0.04)||1.85||2KB|
|Adult-Med||61.04 (0.61)||0.20 (0.01)||26.42 (1.07)||2.31||368KB|
|Adult-Large||480.33 (3.60)||11.32 (0.14)||150.11 (3.81)||3.20||24.9MB|
|Adult-xLarge||1805.54 (13.75)||268.90 (8.29)||558.72 (7.88)||3.23||955MB|
|Super-Small||2.50 (0.03)||<0.01 (<0.01)||1.28 (0.08)||1.95||2KB|
|Super-Med||89.93 (3.58)||0.36 (0.01)||35.29 (2.05)||2.55||462KB|
|Super-Large||1067.18 (10.52)||30.30 (0.34)||353.84 (4.34)||3.02||45.2MB|
|Super-xLarge||3776.44 (28.77)||673.04 (8.35)||1315.44 (6.84)||2.87||1.76GB|
|Crop-Small||3.53 (0.07)||<0.01 (<0.01)||3.15 (0.02)||1.12||2KB|
|Crop-Med||69.88 (0.71)||0.23 (0.01)||34.34 (1.48)||2.03||370KB|
|Crop-Large||315.27 (6.37)||8.08 (0.09)||122.58 (3.71)||2.57||15.1MB|
|Crop-xLarge||552.23 (10.37)||75.28 (2.34)||215.21 (2.02)||2.57||323MB|
|Upsell-Small||2.80 (0.04)||<0.01 (<0.01)||2.20 (0.06)||1.27||2KB|
|Upsell-Med||90.64 (4.59)||0.33 (0.01)||33.69 (0.92)||2.69||452KB|
|Upsell-Large||790.83 (5.79)||24.59 (0.36)||258.39 (4.53)||3.06||33.7MB|
|Upsell-xLarge||2265.82 (17.44)||442.74 (14.26)||724.24 (7.89)||3.13||996MB|
Table 3 shows the execution times of Fast TreeSHAP Prep and Fast TreeSHAP Score in Fast TreeSHAP v2. We see that the execution time of Fast TreeSHAP Prep is almost negligible for small models, but increases dramatically as the model size increases. This coincides with our discussion in Section 4.3 that for Fast TreeSHAP v2, a larger model needs a larger set of samples to offset the computational cost of Fast TreeSHAP Prep. The column "Speedup" shows the ratio between the execution times of the original TreeSHAP and Fast TreeSHAP Score. For one-time usage scenarios, this column approximates the speedup when the sample size $M$ is sufficiently large (i.e., when Fast TreeSHAP Score dominates the execution time of Fast TreeSHAP v2). For multi-time usage scenarios, this column reflects the exact speedup when the matrix is pre-computed and newly incoming samples are being explained. We observe that the speedup increases as the model size increases, which reflects the $D$-time improvement in computational complexity between Fast TreeSHAP Score and the original TreeSHAP. Finally, the last column shows the space allocation of the pre-computed matrix, which dominates the memory usage of Fast TreeSHAP v2 (the space allocation is calculated as $L \cdot 2^D \cdot 8$B for double-type entries). We can see that, although Fast TreeSHAP v2 costs more memory than the other two algorithms in theory, in practice the memory constraint is quite loose, as none of the space allocations in Table 3 causes memory issues even on an ordinary laptop. Moreover, the maximum depth of trees in most tree-based models in industry does not exceed 16.
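The relationship between the two tables can be checked with a few lines of arithmetic. The sketch below copies three Adult rows from Table 3 and recomputes the one-time speedup (Prep and Score paid together, as reported in Table 2) and the multi-time speedup (Prep already cached, the "Speedup" column of Table 3); the names `ROWS` and `speedups` are hypothetical.

```python
# (original TreeSHAP, Fast TreeSHAP Prep, Fast TreeSHAP Score) in seconds,
# copied from Table 3 for 10,000 samples.
ROWS = {
    "Adult-Med": (61.04, 0.20, 26.42),
    "Adult-Large": (480.33, 11.32, 150.11),
    "Adult-xLarge": (1805.54, 268.90, 558.72),
}


def speedups(original, prep, score):
    """Return (one-time speedup, multi-time speedup), rounded to 2 decimals."""
    one_time = original / (prep + score)  # Prep cost is paid with this batch
    multi_time = original / score         # Prep was done earlier and cached
    return round(one_time, 2), round(multi_time, 2)


results = {name: speedups(*times) for name, times in ROWS.items()}
```

For the extra-large model the gap between the two numbers is widest, since Prep accounts for roughly a third of the total Fast TreeSHAP v2 time at 10,000 samples; with more samples per batch the one-time speedup moves toward the multi-time value.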
Figure 1 plots the execution time versus the number of samples for models Adult-Med and Adult-Large. The confidence interval of the execution time is indicated by the shaded area. Here we consider the one-time usage scenario to better showcase the two steps in Fast TreeSHAP v2. For Adult-Med, Fast TreeSHAP v2 starts at almost the same place as the other two algorithms, since the first step, Fast TreeSHAP Prep, takes only 0.2s. For Adult-Large, Fast TreeSHAP Prep takes much longer due to the larger model size, resulting in a higher starting point of the green curve. However, the green curve crosses the other two curves once the number of samples exceeds 500, which coincides with our previous discussion on the sample size requirement of Fast TreeSHAP v2. In both plots, Fast TreeSHAP v2 consistently performs the best while the original TreeSHAP consistently performs the worst once the number of samples exceeds a certain threshold.
The above evaluations are all based on 100 trees, 10,000 samples, and 1 core for fast and fair comparisons. In real-life scenarios, the number of trees can be as large as several thousand, hundreds of millions of samples can be encountered in model scoring, and multi-core machines can be used for parallel computing. Since parallel computing is left to future work, we only briefly discuss potential ways to parallelize Fast TreeSHAP v1 and v2 in Appendix A.7. Based on these proposed approaches, we can reasonably expect both Fast TreeSHAP v1 and v2 to significantly improve the computational efficiency in real-life scenarios (e.g., reducing the execution time of explaining 20 million samples on a 100-core server from 15h to 5h, and reducing the execution time of explaining 320 thousand samples on a 4-core laptop from 3.5h to 1.4h). More details on the analysis of real-life scenarios can be found in Appendix A.8.
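One generic way to use several cores is to split the samples into chunks and explain each chunk independently, since the explanation of one sample does not depend on the others. The sketch below shows only this sample-level decomposition and is not necessarily the scheme described in Appendix A.7. The per-chunk explainer `toy_explain_chunk` is a hypothetical stand-in, and a thread pool is used purely for illustration: pure-Python CPU-bound work would need a process pool or native multi-threading to obtain a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(samples, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal parts."""
    n_chunks = max(1, min(n_chunks, len(samples)))
    size, extra = divmod(len(samples), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        chunks.append(samples[start:stop])
        start = stop
    return chunks


def explain_parallel(samples, explain_chunk, n_workers=4):
    """Explain chunks of samples concurrently and concatenate the results in order."""
    chunks = split_into_chunks(samples, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in the order of the input chunks.
        parts = list(pool.map(explain_chunk, chunks))
    return [row for part in parts for row in part]


def toy_explain_chunk(chunk):
    """Hypothetical placeholder explainer: one attribution list per sample."""
    return [[2.0 * value for value in sample] for sample in chunk]


samples = [[float(i), float(i + 1)] for i in range(10)]
result = explain_parallel(samples, toy_explain_chunk, n_workers=3)
print(len(result), result[0], result[-1])
```

Because the chunks are contiguous and results are concatenated in chunk order, the output rows line up with the input samples regardless of how many workers are used.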
TreeSHAP has been widely used for explaining tree-based models due to its desirable theoretical properties and polynomial computational complexity. In this paper, we presented Fast TreeSHAP v1 and Fast TreeSHAP v2, two new algorithms that further improve the computational efficiency of TreeSHAP, with an emphasis on explaining a large number of samples. Specifically, Fast TreeSHAP v1 shrinks the computational scope for the features along each path of the tree, which consistently improves the computational speed while maintaining the low memory cost. Fast TreeSHAP v2 further moves part of the computationally expensive components into a pre-computation step, which significantly reduces the time complexity from $O(MTLD^2)$ to $O(TL2^D D + MTLD)$ at a small cost in memory, and is well-suited for multi-time model explanation scenarios.
As future work, we are currently implementing parallel computation in Fast TreeSHAP v1 and v2. Another future direction is to implement Fast TreeSHAP in Spark, which naturally fits the environment of Hadoop clusters and datasets stored in HDFS.
We would like to express our sincere thanks to Humberto Gonzalez, Diana Negoescu, and Wenrong Zeng for their helpful comments and feedback, and Parvez Ahammad for his support throughout this project.
[1] (2019) Explaining individual predictions when features are dependent: more accurate approximations to Shapley values. arXiv preprint arXiv:1903.10464.
[2] (2020) Machine learning can identify newly diagnosed patients with CLL at high risk of infection. Nature Communications 11 (1), pp. 1–17.
[3] (2020) Explainability of a machine learning granting scoring model in peer-to-peer lending. IEEE Access 8, pp. 64873–64890.
[4] (2020) Generating interpretable poverty maps using object detection in satellite images. arXiv preprint arXiv:2002.01612.
[5] (2020) True to the model or true to the data? arXiv preprint arXiv:2006.16234.
[6] (2020) Shapley explainability on the data manifold. arXiv preprint arXiv:2006.01272.
[7] (2019) Fastshap. GitHub. https://github.com/bgreenwell/fastshap
[8] (2020) Shparkley: scaling Shapley values with Spark. GitHub. https://github.com/Affirm/shparkley
[9] (2018) A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science 154, pp. 346–354.
[10] (2020) Feature relevance quantification in explainable AI: a causal problem. In International Conference on Artificial Intelligence and Statistics, pp. 2907–2916.
[11] (2020) Interpreting interpretability: understanding data scientists’ use of interpretability tools for machine learning. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–14.
[12] (2019) A random forest-based framework for crop mapping using temporal, spectral, textural and polarimetric observations. International Journal of Remote Sensing 40 (18), pp. 7221–7251.
[13] (2018) MSMD: maximum separability and minimum dependency feature selection for cropland classification from optical and radar data. International Journal of Remote Sensing 39 (8), pp. 2159–2176.
[14] (1996) Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In KDD, Vol. 96, pp. 202–207.
[15] (2020) Treeshap. GitHub. https://github.com/ModelOriented/treeshap
[16] (2020) Problems with Shapley-value-based explanations as feature importance measures. In International Conference on Machine Learning, pp. 5491–5500.
[17] (2019) Explainable AI for trees: from local explanations to global understanding. arXiv preprint arXiv:1905.04610.
[18] (2018) Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888.
[19] (2017) A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4768–4777.
[20] (2018) Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature Biomedical Engineering 2 (10), pp. 749–760.
[21] (2020) Shapper. GitHub. https://github.com/ModelOriented/shapper
[22] (2020) The explanation game: explaining machine learning models using Shapley values. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, pp. 17–38.
[23] (2020) Evaluating MASHAP as a faster alternative to LIME for model-agnostic machine learning interpretability. In 2020 IEEE International Conference on Big Data (Big Data), pp. 5777–5779.
[24] (2019) Model-agnostic interpretability with Shapley values. In 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–7.
[25] (2020) GPUTreeShap: fast parallel tree interpretability. arXiv preprint arXiv:2010.13972.
[26] (2019) Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences 116 (44), pp. 22071–22080.
[27] (2016) “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
[28] (2020) A machine learning approach to assess injury risk in elite youth football players. Medicine and Science in Sports and Exercise 52 (8), pp. 1745–1751.
[29] (2020) The many Shapley values for model explanation. In International Conference on Machine Learning, pp. 9269–9278.
[30] (2017) Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328.
[31] (2019) Designing theory-driven user-centric explainable AI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–15.
[32] (2021) Intellige: a user-facing model explainer for narrative explanations. arXiv preprint arXiv:2105.12941.
[33] (2020) Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography. Cell 181 (6), pp. 1423–1433.
Appendix A Appendix
A.1 Algorithm Details of Estimating $E[f(x) \mid x_S]$
Algorithm 1 proposed in [18, 17] provides the details of calculating $E[f(x) \mid x_S]$ for tree-based models. Here a tree is specified as $\{v, a, b, t, r, d\}$, where $v$ is a vector of node values, which takes the value $internal$ for internal nodes (a node in a tree can be either an internal node or a leaf node). The vectors $a$ and $b$ represent the left and right node indexes for each internal node. The vector $t$ contains the thresholds for each internal node, and $d$ is a vector of indexes of the features used for splitting in internal nodes. The vector $r$ represents the cover of each node (i.e., how many data samples fall in that node from its parent).
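The recursion behind this estimate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference implementation: the field names `v`, `a`, `b`, `t`, `r`, `d` follow the notation of Algorithm 1 in [18, 17], while the 0-based root index, the use of `None` to mark internal nodes, the `-1` child placeholders for leaves, and the toy tree are choices made here for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Set


@dataclass
class Tree:
    v: List[Optional[float]]  # node values; None marks an internal node (assumption)
    a: List[int]              # left child index (-1 for leaves, assumption)
    b: List[int]              # right child index (-1 for leaves, assumption)
    t: List[float]            # split threshold of each internal node
    r: List[float]            # cover: number of training samples reaching the node
    d: List[int]              # index of the feature used to split at the node


def exp_value(x: Sequence[float], S: Set[int], tree: Tree) -> float:
    """Estimate the expected prediction given only the features in S."""

    def g(j: int) -> float:
        if tree.v[j] is not None:      # leaf: return its value
            return tree.v[j]
        if tree.d[j] in S:             # known feature: follow the decision path
            child = tree.a[j] if x[tree.d[j]] <= tree.t[j] else tree.b[j]
            return g(child)
        # unknown feature: average both children, weighted by their cover
        left, right = tree.a[j], tree.b[j]
        return (g(left) * tree.r[left] + g(right) * tree.r[right]) / tree.r[j]

    return g(0)  # node 0 is the root in this sketch


# Toy tree: the root splits on feature 0, its right child splits on feature 1.
toy = Tree(
    v=[None, 1.0, None, 2.0, 4.0],
    a=[1, -1, 3, -1, -1],
    b=[2, -1, 4, -1, -1],
    t=[0.5, 0.0, 0.5, 0.0, 0.0],
    r=[100.0, 30.0, 70.0, 20.0, 50.0],
    d=[0, -1, 1, -1, -1],
)
x = [0.7, 0.2]
for subset in (set(), {0}, {1}, {0, 1}):
    print(sorted(subset), round(exp_value(x, subset, toy), 4))
```

With all features in `S` the function reduces to an ordinary tree prediction, and with an empty `S` it returns the cover-weighted average of the leaf values.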
A.2 Proof of Theorem 1
Plugging in into Equation 1 leads to the SHAP value
It is easy to see that if (i.e., , ). Therefore, the above equation can be simplified as
Similarly, where , , we have and , thus . Therefore, ,
We repeat the above process times for each , each time on a feature , and we finally have
Plugging in into the above equation, we have
A.3 Proof of Theorem 2
Let , i.e., is the subset of where the features in do not satisfy the thresholds along the path . Note that if and otherwise, Equation 2 can then be simplified as
That is, only subsets need to be included in the equation.
Recall that , i.e., is the subset of where the features in do not satisfy the thresholds along the path . Also recall that , , and , where is a subset of .
Then it is easy to see that when (i.e., ), we have , and