1 Introduction
Machine learning has achieved great success in modeling data and making predictions automatically. In many real-world applications, however, we need an explanation rather than a black-box model. For example, when customers apply for a loan, loan officers compute their credit scores based on their historical behavior. In this case, showing the customers only the final scores is far from enough; the loan officers should also give detailed reasons. While most efforts in data mining have gone into improving accuracy and efficiency, which yields better models, little attention has been paid to interpreting these models.
Several common measures of variable significance have been proposed. Gini importance, which is derived from the Gini index [2], is one of the most commonly used importance measures for random forest. The Gini index measures the change in impurity between a parent node and its two descendant nodes after a split. The final importance of a feature is accumulated from its Gini changes over all the trees in the forest. This general feature importance (FI), also known as global interpretation, shows the important factors for the target and unpacks the general information in the trained model. However, it does not take the feature values of an individual instance into consideration, which is sometimes insufficient.
Local interpretation, on the other hand, places particular emphasis on a specific case and reveals the main causes behind each record. This type of interpretation makes up for the shortcomings of the global one. One approach [12] defines the feature contribution (FC), which is accumulated from label distribution changes, as a measure of a feature's impact on the output. The magnitude of a feature contribution reveals how much the feature contributes, and its sign indicates whether the impact is positive or negative.
GBDT [6] is an ensemble model built on a collection of regression decision trees. It has some appealing characteristics. For example, GBDT can naturally handle nonlinearity and tolerate missing values. As a winning model in many data mining challenges [7, 1, 3], GBDT is a good option for regression, classification and ranking problems, with a well-known ability to generalize. Besides its wide range of applications, GBDT is also flexible in allowing users to define their own loss functions. Furthermore, there are many implementations [4][8], and much work has been done to speed up the training process.
In most cases, GBDT outperforms linear models and random forest. Given the popularity and high quality of GBDT, it is important to uncover the internals of the model. For GBDT, global feature importance is widely used for feature selection. For example, Friedman proposed a method to estimate feature importance [6]. However, existing work has largely ignored local interpretation, which is the focus of this paper. Specifically, we study feature contributions for GBDT. We start from a previous approach to model interpretation for random forest [12] and update the definition of the feature contribution. The proposed mechanism is flexible enough to interpret all versions of GBDT. The original definition based on label distribution changes is shown to be a special case of ours under a particular loss function.
The rest of the paper is organized as follows. Section 2 provides a brief review of related work on local interpretations. Section 3 gives the formal definition of feature contribution as a preliminary and presents the approach for calculating feature contributions for random forest. In Section 4, we describe the rationale behind, as well as the main steps in, interpreting GBDT. Section 5 contains the experiment settings and the process used to examine the proposed methodology. Finally, Section 6 concludes our work.
2 Related Work
Local model interpretation provides convincing reasons for model outputs. One line of work seeks both the good performance of complex models and the interpretability of simple ones. Such a pipeline first uses an advanced model as a black box and then extracts useful information from it with the help of a more interpretable model. For example, the approach in [5] formally treats the interpretation of additive tree models as extracting the optimal actionable plan. It models the optimization problem as an integer linear program and uses an existing toolkit as the solver. The constraints are based on both the output score and the objective function. Note that approaches of this kind need an extra training process dedicated to interpretation and introduce new models or tasks to solve.
Other researchers have proposed model-independent local interpretations. They mainly perturb feature values and test the resulting effect on the prediction loss. The loss is then taken as the measure of the local importance of the feature [11]. This method relies only on evaluating the output and provides a unified way to check feature contributions for black-box models. By replacing the actual feature value with a missing, zero or average value, the impact of that feature on the prediction is removed. The instance-level contributions of all the features can be calculated separately and compared with each other. Moreover, this method also works for global feature importance.
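The perturbation idea can be sketched as follows. This is a minimal illustration rather than the procedure of [11]: the function name `perturbation_contributions`, the toy model and the feature names are all hypothetical, and the average value is taken over a small background sample.

```python
from statistics import mean

def perturbation_contributions(predict, instance, background):
    """Per feature, the prediction change when that feature is replaced by its average."""
    baseline = predict(instance)
    contributions = {}
    for name in instance:
        # Replace one feature with its average over the background data.
        perturbed = dict(instance)
        perturbed[name] = mean(row[name] for row in background)
        contributions[name] = baseline - predict(perturbed)
    return contributions

# Hypothetical black-box model: any callable mapping a feature dict to a score.
def toy_model(x):
    return 2.0 * x["amount"] + 0.5 * x["age"]

background = [{"amount": 1.0, "age": 20.0}, {"amount": 3.0, "age": 40.0}]
instance = {"amount": 5.0, "age": 30.0}
contribs = perturbation_contributions(toy_model, instance, background)
```

Because only `predict` is called, the same routine applies to any model, which is what makes this family of methods model-independent.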
Built on decision trees, random forest has received more attention in model interpretation than GBDT. The method in [10, 12] computes feature contributions so as to show informative results about the structure of the model and provide valuable information for designing new compounds. This method makes full use of the available information: not only the training data but also the model structure. It is natural to design interpretations around the model structure to get more reasonable results.
This work proposes an easy way to obtain feature contributions at the instance level. It can be applied to all GBDT implementations with little preprocessing and little modification to the prediction process.
3 Preliminary
Additive tree models are a powerful branch of machine learning but are often used as black boxes. Though they enjoy high accuracy, it is hard to explain their predictions from a feature-based point of view. Different ensemble strategies produce different models while sharing the tree structure as a basis. The model interpretations for different additive tree models therefore share some key ideas and can be carried over from one model to another with appropriate adaptation. In this section, we first review a practical interpretation method for random forest (for binary classification) and introduce the general definition of feature contribution to better illustrate the proposed model interpretation for GBDT.
3.1 Interpretation for Random Forest
Random forest is one of the most popular machine learning models due to its extraordinary accuracy with categorical or numerical features on regression and classification problems. A random forest is a collection of decision trees that are generated independently and vote together to produce a final prediction. Every tree is trained on randomly sampled data with a subsampled set of feature columns. This introduces diversity for better generalization, which is the key weakness of single decision tree models. Random forest is a typical bagging model, and the bagging strategy works by averaging out the noise to obtain a lower-variance model.
An instance follows a path from the root node all the way down to a leaf node according to its feature values. All the instances in the training data fall into several nodes, and different nodes have quite different label distributions over the instances in them. At every step along the path, the probability of being in the positive class changes with the label distribution of the node. All the features along the path contribute to the final prediction of a single tree.
A practical way to evaluate feature contributions is explored in [12]. The key idea is to take the changes in the positive-class distribution as the feature contributions. Concretely, it takes four steps:

1. Computing the percentage of the positive class at every node in a tree;

2. Recording the percentage difference between every parent node and its children;

3. Accumulating the contributions for every feature on each tree;

4. Averaging the feature contributions over all the trees in the forest.
The method consists of an offline preparation embedded in training (steps 1 and 2) and an online computation during the prediction process (steps 3 and 4). It is easy to record the local contribution (or local increment) and the related split feature on every edge of a tree.
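The online part of this procedure can be sketched as below. The node layout (a dict holding the positive-class fraction `p` and, for internal nodes, `feature`, `threshold`, `left` and `right`) is a hypothetical representation chosen for illustration, and the fractions are assumed to have been computed offline during training.

```python
def tree_contributions(node, x):
    """Accumulate local increments along the path of instance x in one tree."""
    fc = {}
    while "feature" in node:
        child = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
        # Local increment: change in positive-class fraction from parent to child,
        # credited to the feature the parent splits on.
        fc[node["feature"]] = fc.get(node["feature"], 0.0) + child["p"] - node["p"]
        node = child
    return fc

def forest_contributions(trees, x):
    """Average the per-tree contributions over all trees of the forest."""
    total = {}
    for tree in trees:
        for f, v in tree_contributions(tree, x).items():
            total[f] = total.get(f, 0.0) + v
    return {f: v / len(trees) for f, v in total.items()}

# Two toy trees with precomputed positive-class fractions (illustrative values).
tree_a = {"p": 0.5, "feature": "f1", "threshold": 1.0,
          "left": {"p": 0.2},
          "right": {"p": 0.8, "feature": "f2", "threshold": 0.0,
                    "left": {"p": 0.6}, "right": {"p": 1.0}}}
tree_b = {"p": 0.5, "feature": "f2", "threshold": 0.0,
          "left": {"p": 0.4}, "right": {"p": 0.6}}
fc = forest_contributions([tree_a, tree_b], {"f1": 2.0, "f2": 1.0})
```

Within one tree, the contributions plus the root fraction add up to the fraction at the reached leaf, so the decomposition is exact per tree.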
3.2 Gradient Boosting Decision Tree
GBDT is another type of ensemble model that consists of a collection of regression decision trees. Its ensemble, however, is based on gradient boosting, which improves the prediction gradually by reducing the residual. In every iteration, a new model is built to fit the negative gradient of the loss function, until the loss converges below an acceptable threshold. The final prediction is the sum of all stagewise model predictions. Gradient boosting is a general framework into which different models can be embedded; GBDT uses the decision tree as the basic weak learner. When the squared error is chosen as the loss function, the negative gradient is simply the residual between the target label and the current prediction, which is computationally friendly.
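A toy version of this loop under the squared error loss is sketched below. It is an illustration only, not the training algorithm used in this paper: the weak learner is a depth-1 tree on a single feature, a fixed number of iterations replaces the convergence test, and `learning_rate` is a shrinkage factor assumed here for stability.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_gbdt(xs, ys, n_trees=20, learning_rate=0.5):
    """Gradient boosting with squared error: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For the loss 1/2 (y - F)^2 the negative gradient is the residual y - F.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    # The final prediction is the sum of all stagewise predictions.
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.5, 3.0, 3.5]
model = fit_gbdt(xs, ys)
```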
From the above definition, we can see the differences between random forest and GBDT, some of which are the main obstacles that prevent us from adapting the model interpretation for random forest to GBDT:

Random forest aggregates trees by voting, while GBDT sums up the scores from all the trees. This means that the trees in GBDT are not equal and the trees have to be trained in sequential order. The interpretation should make proper adaptations to deal with this problem.

Each decision tree in GBDT outputs a score instead of a majority class for classification problems. Though we can get the label distribution changes as in the random forest interpretation, the output scores in GBDT should be carefully taken into account.
3.3 Problem Statement
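Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the total number of training samples, $x$ denotes a $d$-dimensional feature vector, $x_i$ is the feature vector of the $i$-th sample and $y_i$ is the related label. The training process of GBDT is illustrated in Algorithm 1, where $r_{im}$ is the residual for sample $i$ in the $m$-th iteration.
Besides the basics of the model, the feature contribution (FC), the key concept for local interpretation, is clarified below. We introduce the notation of FC by formalizing the model interpretation for random forest from Section 3.1:
$$ LI_f^c = \begin{cases} Y_{mean}^{c} - Y_{mean}^{p} & \text{if the split in the parent } p \text{ is performed over feature } f \\ 0 & \text{otherwise} \end{cases}, \qquad FC_{i,t}^{f} = \sum_{c \in path(i)} LI_f^c, \qquad FC_i^f = \frac{1}{T} \sum_{t=1}^{T} FC_{i,t}^{f} \tag{1} $$
$LI_f^c$ in Equation 1 is the Local Increment (LI) of feature $f$ for node $c$ defined before, $p$ is the parent of $c$, $path(i)$ is the path of instance $i$ in tree $t$, and $T$ is the number of trees. For binary classification, $Y_{mean}^{n}$ represents the percentage of the instances belonging to the positive class in node $n$.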
4 Mechanism
Looking back at model interpretation for random forest, its central idea is the feature contribution: by computing label distributions, a measure of the change is obtained and associated with the split feature. In the case of GBDT, we can extend this computation with a slight modification. Because the targets of the later trees are residuals, the residual should replace the instance label when computing the label distribution. Nevertheless, the problem with this version is that the average of the labels on a leaf node is not always equal to the score on it. The valuable model information in these scores is therefore not utilized, and the method is not appropriate for different GBDT versions [6, 4].
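In fact, the loss function determines the optimal leaf coefficient, and Table 1 shows some common examples. LS and LAD stand for Least Squares and Least Absolute Deviation respectively. $r_{im} = y_i - F_{m-1}(x_i)$ is the residual updated after each iteration. $F_m$ is the approximation at iteration $m$. $g_i$ and $h_i$ are the first- and second-order gradient statistics of the loss, $G$ and $H$ are their sums over the instances in a leaf, and $\lambda$ is the regularization coefficient of [4]. Unlike the numerical optimization that computes the negative gradient (for LS and LAD), XGB [4] first approximates the loss function with its second-order Taylor expansion and then obtains an analytic solution. It therefore involves no negative gradient computation, and its leaf weights are far from the label average. In particular, the label averages equal the scores only if the LS loss function and the traditional GBDT training process are used.
Table 1: Loss function, negative gradient and leaf weight under common settings.
Settings | Loss Function | Negative Gradient | Leaf weight
LS | $\frac{1}{2}(y_i - F(x_i))^2$ | $y_i - F_{m-1}(x_i)$ | mean of $r_{im}$ over the leaf
LAD | $\lvert y_i - F(x_i) \rvert$ | $\mathrm{sign}(y_i - F_{m-1}(x_i))$ | median of $r_{im}$ over the leaf
XGB | $\sum_i [g_i f_m(x_i) + \frac{1}{2} h_i f_m^2(x_i)] + \Omega(f_m)$ | / | $-\frac{G}{H + \lambda}$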
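To make the three settings of Table 1 concrete, the sketch below computes the leaf weight for one leaf under each of them. The function names and the toy data are hypothetical; `reg_lambda` stands for the L2 regularization term of [4], and the XGB case is evaluated with the squared error loss, for which the gradient is the prediction minus the label and the second-order statistic is 1.

```python
from statistics import mean, median

def leaf_weight_ls(residuals):
    """Least squares: the leaf weight is the average residual in the leaf."""
    return mean(residuals)

def leaf_weight_lad(residuals):
    """Least absolute deviation: the leaf weight is the median residual in the leaf."""
    return median(residuals)

def leaf_weight_xgb(grads, hessians, reg_lambda=1.0):
    """Second-order solution -G / (H + lambda), with G and H summed over the leaf."""
    return -sum(grads) / (sum(hessians) + reg_lambda)

labels = [1.0, 2.0, 6.0]   # instances falling into one leaf (toy values)
preds = [0.0, 0.0, 0.0]    # current model prediction for each of them
residuals = [y - p for y, p in zip(labels, preds)]
# Squared error 1/2 (y - p)^2: gradient is p - y, second derivative is 1.
grads = [p - y for y, p in zip(labels, preds)]
hessians = [1.0] * len(labels)

w_ls = leaf_weight_ls(residuals)
w_lad = leaf_weight_lad(residuals)
w_xgb = leaf_weight_xgb(grads, hessians)
```

On the same leaf the three weights differ (3.0, 2.0 and 2.25 here), and only the LS weight coincides with the average of the targets in the leaf.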
Without loss of generality, the interpretation for GBDT needs to work on the leaf scores. Since scores are assigned only to leaf nodes, we have to find a way to propagate them all the way back to the root. The left tree of Fig 1 shows an example tree in a GBDT model, with the split feature and split value marked on the arcs. Consider the three nodes in the rounded rectangle: depending on the branch taken, the instances in node 6 receive scores that differ by $S_{c_1} - S_{c_2}$, where $S_k$ is the score on node $k$ and $c_1$, $c_2$ are the two children of node 6. This difference is caused by the split feature of node 6, which branches at a threshold of 1.5. We can allocate this difference to the two branches by assigning the average score of the child nodes to their parent node, for instance $S_6 = \frac{1}{2}(S_{c_1} + S_{c_2})$. The local increments can then be calculated from the scores, e.g. $S_{c_1} - S_6$ for the branch leading to $c_1$. In the same way, the leaf scores and the local increments can be propagated through the whole tree.
The interpretation process during prediction is the same as that of random forest. On the right-hand side of Fig 1, all the node average scores and feature contributions on the tree are marked. Suppose an instance gets its final prediction on leaf node 14 of a tree. The local increments are then accumulated along the path from the root to node 14, and the increment on each edge is added to the contribution of the split feature on that edge.
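The back-propagation of leaf scores and the accumulation along a path can be sketched as follows. The tree here is a hypothetical three-leaf example, not the tree of Fig 1, and the dict layout with `feature`, `threshold`, `children` and `score` is an assumption made for illustration.

```python
def propagate_average(node):
    """Assign every internal node the plain average of its children's scores."""
    if "children" in node:
        child_scores = [propagate_average(c) for c in node["children"]]
        node["score"] = sum(child_scores) / len(child_scores)
    return node["score"]

def path_contributions(root, x):
    """Walk instance x from root to leaf, crediting each score change to the split feature."""
    fc, node = {}, root
    while "children" in node:
        left, right = node["children"]
        child = left if x[node["feature"]] < node["threshold"] else right
        # Local increment on this edge: child score minus parent score.
        fc[node["feature"]] = fc.get(node["feature"], 0.0) + child["score"] - node["score"]
        node = child
    return fc

# Hypothetical tree: only leaves carry scores, as in a trained GBDT tree.
tree = {"feature": "f1", "threshold": 1.5, "children": [
    {"score": -0.25},
    {"feature": "f2", "threshold": 0.5, "children": [{"score": 0.25}, {"score": 0.75}]},
]}
propagate_average(tree)
fc = path_contributions(tree, {"f1": 2.0, "f2": 1.0})
```

The root score plus all contributions on the path equals the score of the reached leaf, so the tree output is fully attributed to the root baseline and the split features.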
With this propagation strategy, the plain average score is assigned to node 6, which assumes that an instance falls into the left or the right branch with equal probability. The expectation at an intermediate node $p$ can therefore be revised as in Equation 4:
$$ S_p = \frac{N_{c_1} S_{c_1} + N_{c_2} S_{c_2}}{N_{c_1} + N_{c_2}} \tag{4} $$
where $S_k$ is the score on node $k$, and $N_{c_1}$ and $N_{c_2}$ are the numbers of instances that fall into the child nodes $c_1$ and $c_2$ of $p$. These statistics need extra information from the training process.
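One reading of the revised expectation is the instance-count-weighted average of the child scores, sketched below. The `count` field holding the number of training instances per leaf and the function name are assumptions for illustration.

```python
def propagate_weighted(node):
    """Return (score, count); internal nodes get the count-weighted mean of child scores."""
    if "children" in node:
        results = [propagate_weighted(c) for c in node["children"]]
        node["count"] = sum(n for _, n in results)
        node["score"] = sum(s * n for s, n in results) / node["count"]
    return node["score"], node["count"]

# Hypothetical subtree: 30 training instances reach the left leaf, 10 the right one.
parent = {"children": [{"score": 1.0, "count": 30}, {"score": -1.0, "count": 10}]}
score, count = propagate_weighted(parent)
```

With these counts the parent receives 0.5 rather than the plain average 0.0, reflecting that most instances go left.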
Viewing the computation in this new way, we get a flexible interpretation mechanism that uses only the leaf node scores and instance distributions, regardless of the implementation settings of GBDT. Under the LS loss function, not only does the label distribution match the prediction score on each leaf node, but the label distribution of each intermediate node also matches our back-propagated score. That is to say, the label distribution method is a special case of our mechanism under this particular setting. Furthermore, this method also supports multi-class classification problems.
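The special-case claim can be checked with a short derivation. It assumes the count-weighted propagation, no shrinkage, and a leaf score equal to the mean target in the leaf (the LS setting); $t_i$ denotes the target of instance $i$ for the current tree, a symbol introduced here for illustration.

```latex
% Leaf scores under LS: the mean target of the instances in each child.
S_{c_1} = \frac{1}{N_{c_1}} \sum_{i \in c_1} t_i ,
\qquad
S_{c_2} = \frac{1}{N_{c_2}} \sum_{i \in c_2} t_i .

% Count-weighted back-propagation to the parent p, whose instances are
% exactly those of c_1 and c_2 (N_p = N_{c_1} + N_{c_2}):
S_p = \frac{N_{c_1} S_{c_1} + N_{c_2} S_{c_2}}{N_{c_1} + N_{c_2}}
    = \frac{\sum_{i \in c_1} t_i + \sum_{i \in c_2} t_i}{N_p}
    = \frac{1}{N_p} \sum_{i \in p} t_i .
```

The back-propagated score of the parent is thus the mean target of its own instances, and applying the same step recursively gives the label-distribution value at every intermediate node.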
5 Experiment
In this section, we present experiments on the proposed interpretation. First, we show that the mechanism is reliable and generally agrees with global feature importance. Then we compare our interpretations with those of random forest and find that ours accord better with the global feature importance. Finally, we study the interpretations of real cases in our scenario and obtain a satisfactory analysis of them.
5.1 Experiment setup
The GBDT version in our experiment is the Scalable Multiple Additive Regression Tree (SMART) [13], a distributed algorithm built on a parameter server. The algorithm can train on hundreds of billions of samples with thousands of features. Both the storage usage and the running time are optimized without loss of accuracy.
The training data is drawn from transactions in the Fast Pay (FP) scenario of Alipay (https://global.alipay.com/). A transaction is marked as positive if it is reported as a fraud by the customer. To keep a balanced ratio between positive and negative cases, only 1% of normal transactions are retained by random sampling.
Fig 2 shows a fragment of the GBDT model in Predictive Model Markup Language (PMML) format (http://dmg.org/pmml/v43/GeneralStructure.html), and the tree embedded in it can be translated into the one shown in Fig 1. The Node element is an encapsulation of a tree node and contains a predicate rule that decides whether to choose the node itself or one of its siblings. The id attribute assigns a unique number to each node in a tree. The value of score in a Node is the predicted value for an instance falling into it. SimplePredicate is a simple boolean expression indicating the split information. Our pretrained model is stored as a PMML file. JPMML (https://github.com/jpmml/jpmmlevaluator) is employed as the evaluator, and we implement the proposed interpretation on top of it.
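To illustrate how the tree structure can be read from such a file, the sketch below parses a small node hierarchy with Python's standard XML library. It is not the JPMML-based implementation used in our experiment: the fragment is hypothetical and namespace-free, whereas real PMML files typically declare an XML namespace that would have to be handled when searching for elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the spirit of a PMML tree: a root with two leaf children.
PMML_FRAGMENT = """
<Node id="1">
  <True/>
  <Node id="2" score="-0.25">
    <SimplePredicate field="f1" operator="lessThan" value="1.5"/>
  </Node>
  <Node id="3" score="0.75">
    <SimplePredicate field="f1" operator="greaterOrEqual" value="1.5"/>
  </Node>
</Node>
"""

def parse_node(elem):
    """Convert a Node element into a plain dict with id, score, predicate and children."""
    pred = elem.find("SimplePredicate")
    score = elem.get("score")
    return {
        "id": elem.get("id"),
        "score": float(score) if score is not None else None,
        "predicate": dict(pred.attrib) if pred is not None else None,
        "children": [parse_node(child) for child in elem.findall("Node")],
    }

tree = parse_node(ET.fromstring(PMML_FRAGMENT.strip()))
```

The resulting nested dicts carry exactly what the interpretation needs from the model file: the leaf scores and the split feature and threshold on each branch.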
5.2 Consistency check
We implement the feature contribution following the previous description in [6]. To keep the interpretation independent of the training process of GBDT, the training algorithm is not changed in our experiment. To obtain the distribution of instances in Equation 4, we use JPMML to predict the training instances and record the instance distribution on every node. Based on the tree structure in the model and the instance distributions, the preprocessing is done by back-propagating the local increments as shown in Section 4. With the local increments, the feature contributions of new instances can be computed. After interpreting a large number of instances, we get a distribution of feature contributions over the instances. The median is a robust estimator of the expected feature contribution and should be broadly consistent with the global feature importance metrics [12].
Fig 3 plots the global Feature Importance (FI) for GBDT and the Feature Contribution (FC) medians for every feature. As we can see, these two statistics have similar distributions and are in good agreement. This indicates that the interpretation for GBDT is practical and reasonable.
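The aggregation over instances can be sketched as below. The function name and the toy values are hypothetical, and treating a feature that does not appear on an instance's path as contributing zero is an assumption of this sketch.

```python
from statistics import median

def contribution_medians(contributions):
    """Median feature contribution per feature over many interpreted instances."""
    features = {f for fc in contributions for f in fc}
    # A feature absent from an instance's path contributes 0 for that instance.
    return {f: median(fc.get(f, 0.0) for fc in contributions) for f in features}

# Feature contributions of three interpreted instances (toy values).
per_instance = [{"f1": 0.4, "f2": 0.1}, {"f1": 0.2}, {"f1": 0.3, "f2": -0.1}]
medians = contribution_medians(per_instance)
```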
5.3 Comparison to Random Forest
Following the experiment of the last section, we get a ranking of the features by feature contribution median. This ranking is a measure of feature importance and reflects the quality of the local interpretation. We implement the work for random forest in [12] and compare it with our ranking. For fairness, we replace the GBDT Feature Importance with Information Value (IV) as the importance metric. IV is a concept from information theory and shows the predictive strength of the features [9].
In Fig 4, we compute the intersection size at different variable coverages (i.e. the top 10 to 50 features by IV). The three compared methods are the random forest method explained in Section 3.1, the simple average strategy that uses only the information in the PMML file, and the revised version in Equation 4. The results show that our interpretations capture the importance better and that the revised version works best.
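The intersection size at a given coverage can be computed as in the sketch below. The function name and the scores are hypothetical, and ranking features by their raw score (rather than, say, by absolute value) is an assumption of this sketch.

```python
def top_k_overlap(reference_scores, candidate_scores, k):
    """Number of features shared by the top-k of two rankings given as score dicts."""
    ref_top = set(sorted(reference_scores, key=reference_scores.get, reverse=True)[:k])
    cand_top = set(sorted(candidate_scores, key=candidate_scores.get, reverse=True)[:k])
    return len(ref_top & cand_top)

# Toy scores: IV as the reference ranking, FC medians as the candidate ranking.
iv = {"a": 0.9, "b": 0.5, "c": 0.3, "d": 0.1}
fc_median = {"a": 0.4, "b": 0.05, "c": 0.2, "d": 0.1}
overlap_2 = top_k_overlap(iv, fc_median, 2)
overlap_3 = top_k_overlap(iv, fc_median, 3)
```

A larger overlap at the same `k` means the candidate ranking recovers more of the features that the reference metric considers important.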
5.4 Case Study
Besides the general evaluation, we analyze 300 specific instances in the test data. Fig 5 shows one case; we list only some representative fields and divide them into 4 parts. The variables are ranked by IV (general feature importance). Domain experts check the feature risk manually and draw the following conclusions:

Part I: Variables in this part have high IV. Our interpretation is able to capture the features that are judged to be high risk (marked as blue fields). A feature with high IV but low risk (judging from the feature value) is assigned a lower score, so the interpretation works well for instance-level contributions.

Part II: Two variables (colored pink) with high IV and marked as high risk are missed by the interpretation, mainly due to their low occurrence in split features. The global importance of these two variables is also low, and model interpretations are limited by the model quality.

Part III: Variables with medium or low IVs are not flagged by mistake and are assigned low feature contributions for this case.

Part IV: Several variables are considered to be high risk for this particular instance, even though their general IVs are low. Our interpretation identifies them, which shows the advantage of the local feature contribution over the global feature importance.
Furthermore, if we conduct interpretations on a batch of fraud cases that are missed by the model, the local feature contributions will help analysts improve the model.
6 Conclusion
Employing models as black boxes is not enough. A measure of the impact of a feature on the prediction convinces analysts in an intuitive way. Local interpretation provides an explanation when necessary and contributes to the improvement of the models. We describe a method to unpack the interpretation of the advanced model GBDT. Conveniently for analysts, the whole process is independent of the training details and technical optimizations. Only the tree structure and the instance distribution are needed, and both can easily be extracted by postprocessing after training. The label distribution based method for random forest is shown to be a special case of our method. We explore the distribution of local feature contributions and show that it agrees with global feature importance. The method is applied to real case studies in different scenarios and serves as a good translator of our models.
References
 [1] (2007) The Netflix prize. In Proceedings of KDD Cup and Workshop, Vol. 2007, pp. 35.
 [2] (2001) Random forests. Machine Learning 45 (1), pp. 5–32.
 [3] (2011) Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge, pp. 1–24.
 [4] (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
 [5] (2015) Optimal action extraction for random forests and boosted trees. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 179–188.
 [6] (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
 [7] (2014) Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9.
 [8] (2017) LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3149–3157.
 [9] (1997) Information theory and statistics. Courier Corporation.
 [10] (2011) Interpretation of QSAR models based on random forest methods. Molecular Informatics 30 (6-7), pp. 593–603.
 [11] (2017) Distribution-free predictive inference for regression. Journal of the American Statistical Association (just-accepted).
 [12] (2013) Interpreting random forest models using a feature contribution method. In Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on, pp. 112–119.
 [13] (2017) PSMART: parameter server based multiple additive regression trees system. In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 879–880.