Explaining Classification Models Built on High-Dimensional Sparse Data

by   Julie Moeyersoms, et al.
NYU college

Predictive modeling applications increasingly use data representing people's behavior, opinions, and interactions. Fine-grained behavior data often has a different structure from traditional data, being very high-dimensional and sparse. Models built from these data are quite difficult to interpret, since they contain many thousands or even many millions of features. Listing features with large model coefficients is not sufficient, because the model coefficients do not incorporate information on feature presence, which is key when analysing sparse data. In this paper we introduce two alternatives for explaining predictive models by listing important features. We evaluate these alternatives in terms of explanation "bang for the buck," i.e., how many examples' inferences are explained for a given number of features listed. The bottom line: (i) The proposed alternatives have double the bang-for-the-buck as compared to just listing the high-coefficient features, and (ii) interestingly, although they come from different sources and motivations, the two new alternatives provide strikingly similar rankings of important features.




1 Introduction

Recent studies show that fine-grained behavior data can yield accurate predictive models (Junqué de Fortuny et al., 2014). What you “like” on Facebook, for example, allows remarkably accurate prediction of personal characteristics (Kosinski et al., 2013), as well as of product interest or credit default behavior (De Cnudde et al., 2015). Models built on fine-grained behavior data have been shown to be more accurate than models built on traditional, structured and engineered data, such as socio-demographic data (Martens et al., 2016). However, accurate predictions are just one important facet in the process of developing and assessing predictive models. Business stakeholders often need to interpret the model or use it to draw insights. As attention shifts toward explaining how and why models make their predictions, modelers need to balance predictive accuracy and explainability. Prior work suggests that when users do not understand the workings of a classification model, they can be reluctant to use it, even if the model is known to improve decision performance (Kayande et al., 2009; Martens & Provost, 2014). Furthermore, when pushing to deploy machine learning models, we need to consider that stakeholders often need more than just holdout evaluations to justify changing their decision-making strategies. The need for explanations encompasses various perspectives, including those of managers, customer-facing employees, and the technical team (Martens & Provost, 2014).

An important aspect of fine-grained behavior data is its very high dimensionality and sparsity (Junqué de Fortuny et al., 2014). Returning to our running example: the Facebook like data can be represented as a matrix where each row (data instance) corresponds to a person, and each column (feature) corresponds to a page on Facebook that one can like. If someone likes a page, the entry is 1, and 0 otherwise. There is a huge number of possible pages to like, and hence a huge number of dimensions. On the other hand, a particular user will like only a very small proportion of these pages. In “traditional” data mining (working with non-behavior data), feature selection and dimensionality reduction techniques are often employed to cope with high dimensionality. However, with fine-grained behavior data it has been shown that feature selection can lead to substantially reduced predictive performance (Junqué de Fortuny et al., 2013). Further, dimensionality reduction via singular value decomposition often does not improve predictive performance (and often reduces it) (Clark & Provost, 2015), and is dubious for improving explainability in any case.

Linear models are often used for large, sparse behavior data (see references above) as these typically achieve relatively strong predictive performance, while being fast both to train and to deploy. The latter benefit is cited as the prime motivating benefit for their use in large-scale production systems McMahan et al. (2013). When one tries to explain or interpret such models, the natural tendency is to look at the input features (Facebook pages in our example) with the highest coefficients, as one would do with traditional data. For example, Kosinski et al. (2013) find that the best predictors for high intelligence include Facebook pages “Thunderstorms”, “The Colbert Report”, “Science”, and “Curly Fries”.

In this paper we investigate two alternatives for explaining predictive models from sparse behavior data. The intuition is as follows: the coefficients do not take into account the coverage of the feature (how many users actually like the page). So, to explain the predictions of such classification models, we need to consider both the coefficients and the coverage of the features. We do this by aggregating across instance-based explanations, using two very different approaches. We evaluate these alternatives in terms of explanation “bang for the buck,” i.e., how many examples’ inferences are explained for a given number of features listed. The bottom line: (i) The proposed alternatives have double the bang-for-the-buck as compared to just listing the high-coefficient features, and (ii) interestingly, although they come from different sources and motivations, the two new alternatives provide strikingly similar rankings of important features.

2 From Instance-Based Explanations to Global Ranking

We propose alternatives to find the best-explaining features by starting from the instance level. We define two different approaches to obtain such instance-level solutions: one based on the minimum set of features without which the model would not have made the prediction (the “evidence counterfactual” or EC for short), and one based on the Shapley value from cooperative game theory. Creating a global model “explanation” is then a simple procedure. EC or Shapley values are first calculated per feature per instance. Afterwards these scores are summed over all instances and normalized (by the total sum). In this way, the coverage per feature is also taken into account. Features with larger weights will generally get higher values when observed on an instance, but the final explanatory importance of a feature will also depend on how often it is seen.
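The aggregation step described above (sum the per-instance scores for each feature, then normalize by the total sum) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `global_ranking`, the dictionary-based input format, and the toy scores are all assumptions made for the example.

```python
from collections import defaultdict


def global_ranking(instance_scores):
    """Aggregate per-instance feature scores into a normalized global ranking.

    instance_scores: iterable of dicts, one per instance, mapping a feature
    name to its instance-level score (e.g. an EC or Shapley score). Features
    absent from an instance simply do not appear in that instance's dict.
    Returns a list of (feature, normalized_score), highest score first.
    """
    totals = defaultdict(float)
    for scores in instance_scores:
        for feature, score in scores.items():
            totals[feature] += score  # sum over all instances

    grand_total = sum(totals.values())
    if grand_total == 0:
        return [(feature, 0.0) for feature in totals]

    ranking = [(feature, total / grand_total) for feature, total in totals.items()]
    ranking.sort(key=lambda pair: pair[1], reverse=True)
    return ranking


# Hypothetical toy data: "page_a" scores highly but appears once,
# "page_b" scores lower per instance but appears three times.
toy_scores = [
    {"page_a": 0.9},
    {"page_b": 0.3, "page_c": 0.1},
    {"page_b": 0.4},
    {"page_b": 0.3},
]
ranking = global_ranking(toy_scores)
```

In the toy data the frequently observed feature overtakes the feature with the largest single score, which is the coverage effect the paragraph describes.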

2.1 The Evidence Counterfactual

We draw the first alternative from prior work on explaining document classifications (Martens & Provost, 2014). Here, an “explanation” is defined as a minimal set of features (words in the prior work), such that removing these features from the instance changes the predicted class. Only when all the words in the explanation are removed does the class change (the set is minimal). In our running Facebook example, if Anna had not liked “Data Mining”, “the Deer Hunter” and “I love reviewing ICML papers”, then her predicted class would change from highly intelligent to medium intelligent. Hence, for this individual user, these three pages are the explanation of why she was classified as highly intelligent. The definition used in the paper is as follows (Martens & Provost, 2014):

Definition 1

Consider a document $D$ consisting of $m_D$ unique words $W_D$ from the vocabulary of $m$ words: $W_D = \{w_i,\ i = 1, 2, \ldots, m_D\}$, which is classified by classifier $C_M$ as class $c$. An explanation for document $D$’s classification is a set $E$ of words such that removing all words in $E$ from the document leads $C_M$ to produce a different classification. Further, an explanation $E$ is minimal in the sense that removing any proper subset of $E$ does not yield a change in class. Specifically:
$E$ is an explanation for $C_M(D)$ if and only if
1. $E \subseteq W_D$ (the words are in the document),
2. $C_M(D \setminus E) \neq c$ (the class changes), and
3. $\nexists\, E' \subset E$ such that $C_M(D \setminus E') \neq c$ ($E$ is minimal).
$D \setminus E$ denotes the result of removing the words in $E$ from document $D$.

One can interpret the minimal subset as the features that caused the prediction to be made. (Footnote 1: This interpretation is based on the I/O behavior of the classifier. Examining causality more deeply requires assessing whether it actually makes sense in the domain to assume that the observation can be treated as fixed, so that changing one feature does not change another.)

Returning to our Facebook example, when a linear model is being used, one could argue simply to list the top $k$ Facebook pages with the highest positive weights that appear in the liking history of a certain user as an explanation for the class. Here $k$ would be the minimal number of top pages such that removing these pages leads to a class change. The evidence counterfactual method (EC) produces minimum-sized combinations for linear models by ranking all pages liked by the user according to the product $\beta_j x_{ij}$, where $\beta_j$ is the linear model coefficient of page $j$ and $x_{ij}$ is a binary indicator that denotes whether or not page $j$ was liked by user $i$. The combination of the $k$ top-ranked pages is an explanation of smallest size (the proof can be found in Martens & Provost, 2014).
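For a linear model with binary features, the ranking idea above can be sketched in a few lines: sort the pages a user liked by their coefficient (the product of coefficient and presence, with presence equal to 1) and remove them from the top until the score no longer exceeds the threshold. This is an illustrative sketch under stated assumptions, not the implementation from Martens & Provost (2014): the function name, the convention that the positive class means a score strictly above the threshold, and all weights below are hypothetical.

```python
def evidence_counterfactual(active_features, beta, intercept=0.0, threshold=0.0):
    """Smallest set of active features whose removal flips a positive prediction.

    active_features: features present (value 1) for this instance.
    beta: dict mapping feature -> linear model coefficient.
    Assumes the instance is classified positive when its score > threshold.
    Returns the explanation as a list, or None if the instance is not
    predicted positive or the class cannot be flipped by removing features.
    """
    score = intercept + sum(beta[f] for f in active_features)
    if score <= threshold:
        return None  # nothing to explain: not predicted positive

    # Only positively weighted features push the score up; rank them by
    # coefficient, largest first.
    ranked = sorted(
        (f for f in active_features if beta[f] > 0),
        key=lambda f: beta[f],
        reverse=True,
    )

    explanation = []
    for feature in ranked:
        explanation.append(feature)
        score -= beta[feature]  # setting the feature to 0 removes its weight
        if score <= threshold:
            return explanation
    return None


# Hypothetical coefficients for pages from the running example.
toy_beta = {
    "Data Mining": 1.2,
    "the Deer Hunter": 0.8,
    "Curly Fries": 0.3,
    "Reality TV": -0.5,
}
liked = ["Curly Fries", "Reality TV", "the Deer Hunter", "Data Mining"]
explanation = evidence_counterfactual(liked, toy_beta, intercept=-0.5)
```

With these toy weights, removing “Data Mining” alone is not enough to change the class, so the explanation contains two pages.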

However, it could be interesting to find alternative combinations in addition to the minimum-size subset. A straightforward approach would be to conduct a complete search through the space of all page combinations, starting with one page, and increasing the number of pages until a subset is found. The algorithm starts by checking whether removing one page from the customer’s liking history would cause a change in class label. If so, an irreducible subset is found (in the linear case). If the class does not change based on only one page, the algorithm considers all combinations of size 2, 3 and so on. Note that for a liking history of $m$ pages, a combination of $k$ pages requires $\binom{m}{k} = \frac{m!}{(m-k)!\,k!}$ evaluations. This complete search scales exponentially with the number of Facebook pages. For data sets with a high dimensionality, this is impractical if one wants to find multiple combinations. Therefore, a greedy implementation was used (Martens & Provost, 2014).

2.2 Approximate Shapley Value

A second way to find the best explaining features for a single instance can be obtained by using concepts from cooperative game theory, and more specifically the Shapley value. A cooperative game is one in which a set of players engage in a game that results in some non-negative payoff for the set.

Let us first define some core concepts before we present the second method:

  1. $N$ is the complete set of players, with cardinality $n = |N|$.

  2. $S$ is a subset of players, with $S \subseteq N$ and $|S| \le n$.

  3. $v(S)$ is a value function that represents the total utility (money, points, etc.) the set $S$ generates when playing the game.

  4. $\Delta_i(S) = v(S \cup \{i\}) - v(S)$ is the marginal utility of adding player $i$ to a set $S$.

The Shapley value (SV) is defined on an individual player $i$, and represents how much of the total value it should be allocated upon realization of the game. Formally, the SV ($\phi_i$) is the expected marginal utility $\Delta_i(S)$ of adding player $i$ to a set $S$, where $S$ is the set of players that precede $i$ in a random permutation of $N$. Or, in other words, the Shapley value ($\phi_i$) of the game for player $i$ is the average of its marginal contribution to all possible coalitions. This can be expressed as such (Shapley, 1988):

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr)$$

The method for finding a player’s Shapley value depends on the definition of the gain function $v$. This function is different depending on the type of game, but in our case we will approach this problem as a weighted voting game. A weighted voting game is a type of game consisting of $n$ players, where each player $i$ is defined by a weight $w_i$. The payoff for any weighted voting game played by a subset $S$ is defined as (Banasri et al., 2010):

$$v(S) = \begin{cases} 1 & \text{if } \sum_{i \in S} w_i > q \\ 0 & \text{otherwise} \end{cases}$$

In words, if we think of each player $i$ as a voter whose vote is worth $w_i$, a game is won if the total sum of weighted votes is greater than some threshold $q$. From this definition, we can define $\Delta_i(S) = v(S \cup \{i\}) - v(S)$.

A player’s marginal utility is 1 if that player’s vote swung the total above the threshold ($v(S) = 0$ and $v(S \cup \{i\}) = 1$). If the threshold was not met, or the value was already above the threshold, player $i$ adds no marginal value. Bringing this back to our core problem of explaining a model, at an instance level, we can think of the features as playing a weighted voting game, where each feature’s weight in the game is the weight learned by a linear model. (Footnote 2: Note that this current setup only applies to linear models with binary features.) Given a set classification threshold, a feature accumulates value if it “swings” the classification from negative to positive given randomly chosen features within the instance summed up before it.
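The “swing” interpretation can be made concrete with a short sketch that computes Shapley values for a small weighted voting game, once exactly by enumerating all orderings and once approximately by sampling random orderings. The sampling scheme here is a generic permutation-sampling estimator chosen for illustration; it is not claimed to be the approximation used by the authors, and the function names, the toy weights, and the threshold are assumptions. Non-negative weights are assumed, in line with the setup described above.

```python
import random
from itertools import permutations


def marginal_swings(order, weights, threshold):
    """Yield (player, marginal utility) as players join in the given order."""
    total = 0.0
    for player in order:
        before = total > threshold  # value of the coalition before joining
        total += weights[player]
        after = total > threshold   # value of the coalition after joining
        yield player, int(after) - int(before)


def exact_shapley(weights, threshold):
    """Average marginal utility over all orderings (small games only)."""
    players = list(weights)
    sums = {p: 0 for p in players}
    n_orders = 0
    for order in permutations(players):
        n_orders += 1
        for player, delta in marginal_swings(order, weights, threshold):
            sums[player] += delta
    return {p: sums[p] / n_orders for p in players}


def sampled_shapley(weights, threshold, n_samples=2000, seed=0):
    """Monte Carlo estimate: average marginal utility over random orderings."""
    rng = random.Random(seed)
    players = list(weights)
    sums = {p: 0 for p in players}
    for _ in range(n_samples):
        rng.shuffle(players)
        for player, delta in marginal_swings(players, weights, threshold):
            sums[player] += delta
    return {p: sums[p] / n_samples for p in players}


# Hypothetical game: three features with weights 3, 2, 1; the coalition
# "wins" when the summed weight is strictly greater than 3.
toy_weights = {"a": 3, "b": 2, "c": 1}
exact = exact_shapley(toy_weights, threshold=3)
approx = sampled_shapley(toy_weights, threshold=3)
```

In this toy game the heaviest feature is the swing voter in four of the six orderings, so it receives two thirds of the total value, and the other two features split the rest equally. The exact version enumerates $n!$ orderings and is only feasible for a handful of features; sampling trades exactness for a cost that grows with the number of sampled orderings.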

We refer the interested reader to Moeyersoms et al. (2016) for certain necessary practical details (such as negative weights and approximation methods to address scalability).

Figure 1: Explanation curves for different ranking approaches.

3 Empirical Evaluation

| Shapley | EC | $\beta$ | Coverage |
|---|---|---|---|
| www.bcbg.com | www.ebay.com | www.bcbg.com | www.answers.com |
| www.katespade.com | www.katespade.com | www.stuartweitzman.com | www.ebay.com |
| www.ebay.com | www.bcbg.com | www.katespade.com | www.huffingtonpost.com |
| www.stuartweitzman.com | www.stuartweitzman.com | www.talbots.com | abcnews.go.com |
| www.restorationhardware.com | www.restorationhardware.com | us.christianlouboutin.com | www.about.com |
| www.huffingtonpost.com | www.gilt.com | www.dior.com | www.forbes.com |
| www.gilt.com | www.huffingtonpost.com | www.restorationhardware.com | www.cnn.com |
| www.wayfair.com | www.forever21.com | www.forever21.com | www.legacy.com |
| www.talbots.com | www.colehaan.com | www.brooksbrothers.com | www.weather.com |
| www.forever21.com | www.talbots.com | www.selfridges.com | www.allrecipes.com |

Table 1: Top 10 highest ranked features according to the (Approximate) Shapley, Evidence Counterfactual (EC), coefficient ($\beta$), and coverage methods.

For the empirical evaluation, consider predicting product interest based on online browsing data, where a data instance corresponds to an online user, and a feature corresponds to a website. For each user (customer) the feature vector shows the websites visited by that user. One typical application of such data is targeted online advertising: who should you target with a certain ad, given the history of all the websites visited by the users. This data is characterized by its high dimensionality and feature sparsity.

The advertising example that we are using is one from a luxury retail store. The data set consists of several million binary features which represent URLs visited by each customer. We assume a linear classification model is given, such as logistic regression (see Dalessandro et al., 2014).

The above EC and Shapley methods provide us ways to rank the features, but how can we empirically evaluate what is to some degree a subjective task? We propose “explanation curves”, which show how many data instances are explained if one considers only the top ranked features. The X-axis denotes the number of “top” (top-$k$) features according to that ranking method (in log scale), and the Y-axis shows the number of instances that would be correctly classified as positive (i.e., get a score larger than the threshold) when only those features are used and all other features are set to zero.
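One point on an explanation curve can be computed by zeroing out every feature outside the top of a ranking and counting the instances whose score still exceeds the threshold. The sketch below illustrates this under stated assumptions: the function name `explanation_curve`, the set-based instance format, and the toy coefficients are hypothetical, and the instances passed in are assumed to be the ones that should be classified as positive.

```python
def explanation_curve(instances, beta, ranking, ks, intercept=0.0, threshold=0.0):
    """Number of instances still scored above the threshold using top-k features.

    instances: list of sets of active (value 1) features, one set per instance.
    beta: dict mapping feature -> linear model coefficient.
    ranking: features ordered from most to least important.
    ks: the values of k (number of top features kept) to evaluate.
    """
    counts = []
    for k in ks:
        kept = set(ranking[:k])  # all other features are set to zero
        explained = 0
        for active in instances:
            score = intercept + sum(beta[f] for f in active if f in kept)
            if score > threshold:
                explained += 1
        counts.append(explained)
    return counts


# Hypothetical toy model: "rare" has the largest coefficient but low coverage.
toy_beta = {"rare": 3.0, "common": 1.5, "other": 1.2}
toy_instances = [
    {"rare"},
    {"common"},
    {"common", "other"},
    {"common"},
    {"other"},
]
ks = [1, 2, 3]
by_coefficient = explanation_curve(
    toy_instances, toy_beta, ["rare", "common", "other"], ks, intercept=-1.0
)
coverage_aware = explanation_curve(
    toy_instances, toy_beta, ["common", "other", "rare"], ks, intercept=-1.0
)
```

In the toy data the coefficient-based ranking explains a single instance with its top feature, while a ranking that also reflects coverage explains three, which mirrors the gap between the curves discussed in the evaluation.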

Figure 1 compares the explanation curves for both methods, as well as for choosing the features with the largest coefficients (Betas; $\beta$), and choosing the features with the highest coverage. One can see that taking into account only the largest coefficients ($\beta$s) of the prediction model explains only about half as many instances as the proposed methods at almost any point on the explanation curve: a feature may have a very high weight in the prediction model but rarely appear in an instance. Choosing by coverage accounts for this effect, as do both the EC and (Approximate) Shapley value methods.

Next, Table 1 shows the top 10 highest ranked features according to the different methods. This would be the typical way of explaining such a model (usually with $k$ being larger than 10). As can be seen from these results, “ebay.com” is ranked first by the EC method, implying that this is the best explaining feature according to this method. Although this feature has a large coverage, it is ranked only 17th in terms of its $\beta$. The EC method takes both into account, and therefore its ranking is different. Moreover, the explanatory value of “ebay.com” is about 40 times larger according to the EC method (7%) than according to the (proportional) $\beta$ of the URL (0.17%). Lastly, and possibly most interestingly of all, Table 2 shows the correlations between the top 1000 ranked features. Shapley and EC are almost identical in the rankings they provide (as was seen from Table 1 as well). The correlation with $\beta$, however, is much smaller.

| | Shapley | EC | $\beta$ | Coverage |
|---|---|---|---|---|
| Shapley | \ | 0.97 | 0.60 | 0.55 |
| EC | 0.97 | \ | 0.62 | 0.50 |
| $\beta$ | 0.74 | 0.75 | \ | 0.39 |
| Coverage | 0.16 | 0.16 | 0.05 | \ |

Table 2: Spearman’s rank correlation coefficients between the different methods.
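For readers who want to reproduce this kind of comparison, Spearman’s rank correlation between two feature rankings can be sketched as below. How the paper restricts the comparison to the top 1000 features is not specified in the text, so the choice here (compare only the features that occur in both lists, re-ranked among themselves, with no ties) is an assumption, as are the function name and the toy rankings.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman's rank correlation between two rankings of features.

    Each ranking is a list of distinct features, most important first.
    Only features present in both lists are compared, and ranks are
    recomputed among those shared features (so there are no ties).
    """
    in_a, in_b = set(ranking_a), set(ranking_b)
    shared_a = [f for f in ranking_a if f in in_b]
    shared_b = [f for f in ranking_b if f in in_a]
    n = len(shared_a)
    if n < 2:
        raise ValueError("need at least two shared features")

    rank_in_b = {feature: rank for rank, feature in enumerate(shared_b)}
    # Sum of squared rank differences over the shared features.
    d_squared = sum((rank - rank_in_b[f]) ** 2 for rank, f in enumerate(shared_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical rankings that differ by one swap of neighbouring features.
rho = spearman_rho(["w", "x", "y", "z"], ["w", "y", "x", "z"])
```

Identical rankings give a coefficient of 1 and fully reversed rankings give −1; a single swap of neighbours among four features gives 0.8.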

4 Conclusion

We introduced two alternatives for explaining predictive models by listing important features, one drawn from prior work on explaining document classifications, the other derived from the Shapley value used in cooperative game theory. We evaluate these alternatives in terms of explanation “bang for the buck,” i.e., how many examples’ inferences are explained for a given number of features listed, as illustrated by “explanation curves.” The bottom line conclusions are: (i) Across almost the entire range of the explanation curves, the new alternative explanation methods have double the bang-for-the-buck as compared to just listing the high-coefficient features. (ii) Very interestingly, although they are derived from quite different sources and motivations, the two new alternatives provide strikingly similar rankings of important features.


  • Banasri et al. (2010) Banasri, Basu, Chakrabarti, Bikas K., Chakravarty, Satya R., and Gangopadhyay, Kausik. Econophysics and Economics of Games, Social Choices and Quantitative Techniques. Springer, 2010. ISBN 978-88-470-1500-5.
  • Clark & Provost (2015) Clark, Jessica and Provost, Foster. Dimensionality reduction via matrix factorization for predictive modeling from large, sparse behavioral data. Working Paper 2451/33970, NYU, 2015.
  • Dalessandro et al. (2014) Dalessandro, Brian, Chen, Daizhuo, Raeder, Troy, Perlich, Claudia, Han Williams, Melinda, and Provost, Foster. Scalable hands-free transfer learning for online advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1573–1582. ACM, 2014.
  • De Cnudde et al. (2015) De Cnudde, S., Moeyersoms, J., Stankova, M., Tobback, E., Javaly, V., and Martens, D. Who cares about your facebook friends? credit scoring for microfinance. Research Paper D/2015/1169/018, University of Antwerp. Faculty of Applied Economics, 2015.
  • Junqué de Fortuny et al. (2014) Junqué de Fortuny, E., Stankova, M., Moeyersoms, J., Minnaert, B., Provost, F., and Martens, D. Corporate residence fraud detection. In Proceedings of 20th ACM SIGKDD (KDD ’14), pp. 1650–1659, 2014.
  • Junqué de Fortuny et al. (2013) Junqué de Fortuny, Enric, Martens, David, and Provost, Foster. Predictive modeling with big data: Is bigger really better? Big Data, 1(4):215–226, 2013.
  • Kayande et al. (2009) Kayande, U., De Bruyn, A., Lilien, G. L., Rangaswamy, A., and van Bruggen, G. H. How incorporating feedback mechanisms in a DSS affects DSS evaluations. Information Systems Research, 20:527–546, 2009.
  • Kosinski et al. (2013) Kosinski, M., Stillwell, D., and Graepel, T. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 110(15):5802–5805, 2013.
  • Martens et al. (2016) Martens, D., Provost, F., Clark, J., and Junqué de Fortuny, E. Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly, Forthcoming, 2016.
  • Martens & Provost (2014) Martens, David and Provost, Foster. Explaining data driven document classification. MIS Quarterly, 38(1):73–99, 2014.
  • McMahan et al. (2013) McMahan, H Brendan, Holt, Gary, Sculley, David, Young, Michael, Ebner, Dietmar, Grady, Julian, Nie, Lan, Phillips, Todd, Davydov, Eugene, Golovin, Daniel, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222–1230. ACM, 2013.
  • Moeyersoms et al. (2016) Moeyersoms, Julie, Dalessandro, Brian, Provost, Foster, and Martens, David. The economic value of customer behavior data. Working paper, University of Antwerp, 2016.
  • Shapley (1988) Shapley, L.S. A value for n person games. University of Cambridge Press, pp. 31–40, 1988.