For decision-making tasks in high-risk domains, machine learning (ML) methods are required to have a high level of interpretability. Many feature-importance-based post-hoc explanation methods, such as SHapley Additive exPlanations (SHAP) shap and Local Interpretable Model-Agnostic Explanations (LIME) lime, have been proposed to explain black-box models. However, fool showed that feature-importance-based explanations can neither reflect the real behavior of the black-box model nor improve human understanding of the model. Thus, interpretable models have become an increasingly active research direction, and high-order Generalized Additive Models (GAMs) such as Explainable Boosting Machine (EBM) ebm and NODE-GAM nodegam have been proposed to analyze how individual features or interactions between two features affect the prediction target. Additionally, a line of research has focused on high-order features and feature grouping methods autocross; deep. However, the aforementioned methods are purely data-driven and do not include domain knowledge from experts, and the feature interactions they produce are sometimes hard for humans to understand.
To improve the interpretability of the above methods, argumentation-based methods have been proposed to increase human understanding of and trust in the model by injecting human-level knowledge during the decision-making stage. Formal argumentation, as a formalism for representing and reasoning with knowledge handbookargumentation, is capable of providing various ways of justifying and explaining why a claim or a decision is made Contrastive; explanation. Among these formalisms, quantitative argumentation frameworks (QAFs) have a greater advantage over qualitative argumentation when combined with data-driven methods. To construct the argumentative structure, most existing argumentation-based methods are bespoke to a specific problem ant; tweet and depend on knowledge from domain experts, which significantly limits their use since the specific argumentation structures are hard to migrate to a new problem.
One approach to address the above issue is to automate the generation of the argumentation structure with data mining techniques, using orthogonal in-domain knowledge from external data. For tabular data in high-risk domains, data descriptions are a valuable source of in-domain knowledge, since domain experts need to make decisions based directly on raw feature values DBLP:conf/aies/LakkarajuKCL19. In this paper, we propose a concept and argumentation based model (CAM) that generates human-understandable concepts by mining data descriptions automatically and represents the generated concepts and the reasoning paths with an argumentation structure. To illustrate CAM, Fig. 1 shows a concrete example from a real-world high-risk application fico: to explain the decision-making process, the concept Installment is generated automatically from the two underlying features based on their similar data descriptions, and the process is repeated twice to generate the concepts Delinquency and Inquiry. On top of them, the concept risk is generated to represent the final decision-making process. The resulting concept-based knowledge can be properly represented and reasoned about using QAF: each concept can be viewed as an abstract argument, the inter-concept edges can be understood as supports or attacks between arguments, and a quantitative argumentation-based field-wise learning algorithm is designed to evaluate and filter the generated concepts.
With quantitative argumentation, CAM can be represented as stacked QAFs with weighted edges that represent the strength of inter-concept relationships, and a quantitative argumentation-based method is designed as the reasoning machine inside the stacked QAFs to conduct knowledge reasoning by aggregating the strengths of lower-level concepts or features into the strengths of higher-level concepts. The reasoning machine can output a knowledge reasoning path that can be used as a global model interpretation. As a result, CAM can be treated as an interpretable white-box model, since the decision-making process is transparent to users through the visualization of the reasoning path in the form of a dialogical tree.
Our contributions. To summarize, our contributions are listed as follows:
Conceptually, we propose CAM to automatically generate and evaluate concepts from tabular data, and utilize quantitative argumentation to represent and reason about concepts. Furthermore, we provide explanations as the key reasoning paths within CAM.
Empirically, we evaluate CAM on both open-source benchmarks and real-world business datasets in high-risk domains. Experimental results show that (1) CAM is competitive with other ML models; (2) CAM is both globally and locally interpretable, and the knowledge inside the model is coherent with human understanding.
Explainable artificial intelligence work for tabular data
For tabular data, a group of post-hoc methods, such as LIME and SHAP, explain models by computing or approximating feature importance. However, the explanations can be complex and difficult for humans to understand because their granularity is too fine beyond. Recently, researchers have focused on the interaction of features rather than individual features by constructing interpretable models. NODE-GAM and EBM are a class of interpretable models that can provide analysis of individual features or interactions between two features. These two methods apply neural network models and boosting tree models, respectively, to fit a functional relationship between a single feature or an interaction feature and the output. However, badgam argue that different fitting models can have different or even contradictory interpretations for the same data. This is because the fitted model relies solely on the data and is therefore prone to overfitting. DANETs deep are able to abstract higher-level tabular features with a neural network, but the higher-level features only contribute to classification performance. Their semantics are not fully explicit, which may lead to confusing features.
Interpretable Concept Mining
Recent research has focused on generating high-level human concepts from data. TCAV tcav and VCEC vcec estimate how important a concept is for the prediction. ProtoPNet protype is trained to learn visual prototype vectors and compute similarities for prediction. ACE ace proposed a method to automatically extract visual concepts from the images of a given class. However, all the above-mentioned methods are designed specifically for image data. Our goal is to make a similar effort for interpretable tabular ML.
Quantitative argumentation-based work in explainable artificial intelligence
Quantitative argumentation frameworks (QAFs) are a knowledge representation formalism that can be used to solve decision problems in a very intuitive way by weighing up pro and contra arguments qaf. QAFs are based on Bipolar Argumentation Frameworks (BAFs) baf and quantify the semantics of arguments and the relations between them. Many reasoning methods have been proposed for evaluating their semantics, including the DF-QuAD algorithm dfqaf, the O-QuAD algorithm ant, and the Multi-Layer Perceptron (MLP)-based algorithm mlpqaf. These models perform well in solving real-life problems, such as fraud detection ant, opinion polling polling, and review aggregation review. However, argumentation models rely on a concrete knowledge structure. It can be the inherent structure of the data, such as the tree-shaped conversation structure in social media tweet, or it can be constructed manually by human experts ant. To the best of our knowledge, existing work has not introduced a method that can automatically construct an argumentative tree from tabular data.
Concepts and argumentation based model
Knowledge definition and representation
Knowledge in tabular data
In tabular data, the knowledge may be divided into two categories: human-level knowledge and knowledge learned from data. The former can be a data description, which expresses the semantics of features in natural language and is intuitively understandable to human beings. From a cognitive perspective concept, we believe that a concept should be knowledge that abstracts common characteristics from a set of features or lower-level concepts and that is consistent with human cognition, as shown in Fig. 1. The other kind of knowledge can be learned from the underlying data and expressed as the value of higher-level concepts relevant to the prediction target, which is aggregated from the correlated features or lower-level concepts in each group. A concept tree is then constructed from the concepts, their children, and the correlations between them.
Representing knowledge in a quantitative argumentation framework
A QAF is a quadruple , , consisting of a set of arguments , edges between these arguments, a function that assigns a base score to each argument and a function that assigns a weight to each edge. Furthermore, for every argument , we let and
A concept tree rooted by concept with the global semantics (denoted as ) can be represented in the form of a QAF as , , , each argument in set represents a concept or a feature , edges between these arguments describe the positive and negative correlations between concepts and features, a function that assigns a base score to each argument which represents a concept, and a function that assigns a weight to each edge. ( are not defined until the field-wise learning algorithm is introduced.) It is worth noting that features, as units of knowledge, are also represented as arguments. However, in the QAF, the arguments representing features are at the leaf positions, because the knowledge embedded in features has the finest granularity. In the tabular database, features have been mapped to the underlying data. Therefore, the strength of the arguments representing features can be obtained directly from the data, without needing the base score function to assign initial values. We denote features as , and is a set of features contained in tabular data.
Definition of concept
From the perspective of knowledge representation, the knowledge contained in a concept can be represented by two characterisations of the concept: a semantic-based and an argumentation-based characterisation. The former can be described in natural language for humans or as a semantic vector for machines, and represents the knowledge learned from the data description. The latter can be represented in the form of a QAF rooted at this concept, in which the nodes and edges are learned from the data description while the base scores and edge weights are learned from the underlying data.
Let be a set of concepts, a concept .A semantic-based characterisation of c is meaning . An argumentation-based characterisation of c is , where is a finite set of meanings of concepts, and is the subtree of .
A concept consists of two characterisations , where represents the semantic information and describes its argumentative information. As shown in Fig. 1, we take the concept ’Installment’ as an example.
Concept has the meaning ‘installment is a sum of money due as one of several equal payments for something, spread over an agreed period of time’. Its argumentative structure is , , , in which , , and can be instantiated by learning from the underlying data.
Knowledge acquisition: concept mining method
In this section, we present the concept mining method, which can be decoupled into two processes: (i) semantic knowledge mining and (ii) quantitative knowledge mining, as shown in Fig. 2. Semantic knowledge mining automatically searches for lower-level knowledge units (such as features and lower-level concepts) with similar meanings and abstracts higher-level concepts from them. Quantitative knowledge mining then mines the correlations between concepts and their children.
Semantic knowledge mining
To simulate the process of abstracting concepts in human cognitive learning, we need to combine features with similar meanings and extract their shared characteristics as the meaning of the generated concepts. To achieve this goal, semantic knowledge mining consists of three main procedures: 1) Vectorization, 2) Grouping, and 3) Abstraction.
Vectorization maps the natural language information into a vector space by leveraging a pretrained multilingual language model. In this way, the meaning of each feature or concept is embedded into a vector from its natural-language sentences. Grouping then partitions the semantic vectors into groups by combining the features or concepts with high similarity. (In our task, every group contains two elements, which can be features or concepts.) Finally, in the Abstraction process, the shared descriptive part of the natural-language statements in each group is extracted and vectorized as the meaning of the newly generated higher-order concept. Note that the structures of the new concepts are also mined in this process, which can be represented in form of and in their QAFs.
For example, in Fig. 1, of in description of dataset is ‘number of trades with installment, installment is one of several equal payments for something, spread over an agreed period of time’. And is described as ‘percent of trades with installment, installment is …’. Through the procedures of Vectorization and Grouping, and are grouped together. Then, is abstracted from them, and the meaning of is represented as ‘installment is one of several equal payments for something, spread over an agreed period of time’. So that machines can process it, will also be represented as a semantic vector. Also, and of are captured as shown in Example 1.
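The three procedures can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the paper: a bag-of-words count vector stands in for the pretrained multilingual sentence embedding, the greedy pairing strategy is an assumption, and the helper names (`embed`, `group_pairs`, `abstract`) are hypothetical. The threshold of 0.55 is the value reported in the Appendix, and the descriptions are taken from the Fico data description table.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher

def embed(text):
    """Stand-in for a sentence embedding (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def group_pairs(descriptions, threshold=0.55):
    """Greedily pair the most similar descriptions at or above the threshold."""
    vecs = {name: embed(text) for name, text in descriptions.items()}
    names = sorted(vecs)
    scored = sorted(
        ((cosine(vecs[a], vecs[b]), a, b) for i, a in enumerate(names) for b in names[i + 1:]),
        reverse=True,
    )
    used, pairs = set(), []
    for sim, a, b in scored:
        if sim >= threshold and a not in used and b not in used:
            pairs.append((a, b))
            used.update((a, b))
    return pairs

def abstract(text_a, text_b):
    """Abstraction: the longest run of words shared by both descriptions."""
    wa, wb = text_a.lower().split(), text_b.lower().split()
    m = SequenceMatcher(None, wa, wb, autojunk=False).find_longest_match(0, len(wa), 0, len(wb))
    return " ".join(wa[m.a:m.a + m.size])

installment = ("Installment is one of several equal payments for something, "
               "spread over an agreed period of time.")
descriptions = {
    "NumInstallTradesWBalance": "Number Installment Trades with Balance. " + installment,
    "PercentInstallTrades": "Percent Installment Trades. " + installment,
    "NumInqLast6M": "Number of Inquiries Last 6 Months.",
    "AverageMInFile": "Average Months in File.",
}

pairs = group_pairs(descriptions)
concept_meaning = abstract(descriptions[pairs[0][0]], descriptions[pairs[0][1]])
print(pairs)
print(concept_meaning)
```

With these four descriptions only the two installment features are similar enough to be grouped, and the shared text becomes the meaning of the new concept. Swapping `embed` for a real sentence-embedding model would change the similarity values but not the overall flow.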
Quantitative knowledge mining
After the process of semantic knowledge mining, as shown in Fig. 2, assume that we obtain concepts from features, together with their structural information. In this part, the quantitative knowledge from the underlying data needs to be mined and attached to the generated concepts. Logistic regression (LR) is a suitable choice. First, it is one of the most widely used interpretable models and is fast at inference. Second, the parameters of an LR model carry the same semantics as those of a QAF. However, an LR model only learns the mapping from the features to the prediction target, so the strengths of the concept nodes and of the associated edges are not directly accessible. Therefore, we propose a field-wise learning algorithm to learn the values of all nodes and edges in the QAF.
The field-wise learning algorithm runs in two steps. In the first step, we train an LR model to learn the strengths of the nodes and edges of from the original arguments (without the concepts newly generated in the last process) to the concept with global semantics . is denoted as a QAF rooted with whose children contain only original arguments. In the second step, we link the newly generated concept and its children as a sub-tree to , and delete the edges that link the children of directly to ; we thus get a new by adding the structure of a newly generated concept and removing the repeated arguments. Then we use a net that has the same structure as to learn the unknown strengths of the nodes and edges of . The parts shared by and have been learned in the first step, so the net only learns the strengths of the edges and nodes related to . Hence, the learning process is ‘field-wise’. We repeat the second step until the strengths of all edges and nodes related to newly generated concepts are mined, and we obtain a list of fully learned QAFs .
Formally, denote the original argument set as , where may be features or concepts. Specifically, in the first round of concept mining, only represents features. Denote the children of a newly generated concept as . In the first step, the LR model can be described as:
where is the logistic function, is the weight and is the bias.
To represent the knowledge learned from data in the form of a QAF, edges between arguments and the concept with global semantics are represented as , is the strength of edge , thus function in is instantiated as , where and . represents the initial score of . But in QAF, , thus we define . In the second step, the net model can be described as:
where in sum function , , and is learned in the last step, thus we fix as a constant score during the net model training process. is the weight of the newly generated concept , are the new weights of and , because their structure has changed. is the bias of . All the weights and biases can be represented in to instantiate the and functions.
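The two steps can be illustrated with a small pure-Python sketch. It is written under several assumptions that are not taken from the paper: the data are synthetic, plain batch gradient descent replaces the L-BFGS optimizer mentioned in the Appendix, and all function and variable names are hypothetical. Step 1 fits a logistic regression from the features to the root; step 2 reroutes two features through a new concept node and learns only the concept's bias, its incoming edge weights, and its edge to the root, while the root bias and the remaining edge stay frozen.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(X, y, epochs=300, lr=0.5):
    """Step 1: logistic regression from the original arguments to the root."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def forward(x, p, children, root_w, root_b):
    """Return (concept strength, root strength) for one instance."""
    c = sigmoid(sum(w * x[i] for w, i in zip(p["child_w"], children)) + p["concept_b"])
    z = root_b + p["concept_w"] * c
    z += sum(w * xi for i, (w, xi) in enumerate(zip(root_w, x)) if i not in children)
    return c, sigmoid(z)

def train_concept(X, y, children, root_w, root_b, epochs=300, lr=0.5):
    """Step 2: learn only the weights and bias attached to the new concept."""
    p = {"child_w": [0.1] * len(children), "concept_b": 0.0, "concept_w": 0.1}
    n = len(X)
    for _ in range(epochs):
        g_child, g_b, g_w = [0.0] * len(children), 0.0, 0.0
        for x, t in zip(X, y):
            c, out = forward(x, p, children, root_w, root_b)
            err = out - t                                # log-loss gradient at the root
            back = err * p["concept_w"] * c * (1.0 - c)  # gradient at the concept node
            g_w += err * c
            g_b += back
            g_child = [g + back * x[i] for g, i in zip(g_child, children)]
        p["concept_w"] -= lr * g_w / n
        p["concept_b"] -= lr * g_b / n
        p["child_w"] = [w - lr * g / n for w, g in zip(p["child_w"], g_child)]
    return p

def log_loss(preds, y):
    eps = 1e-12
    return -sum(t * math.log(q + eps) + (1 - t) * math.log(1 - q + eps)
                for q, t in zip(preds, y)) / len(y)

# Synthetic data (assumption): three features already scaled to [0, 1].
random.seed(0)
X = [[random.random() for _ in range(3)] for _ in range(200)]
y = [1 if x[0] + x[1] - x[2] > 0.5 else 0 for x in X]

root_w, root_b = train_lr(X, y)
frozen_copy = list(root_w)
children = [0, 1]  # hypothetical: features 0 and 1 were grouped into one concept
init = {"child_w": [0.1, 0.1], "concept_b": 0.0, "concept_w": 0.1}
loss_before = log_loss([forward(x, init, children, root_w, root_b)[1] for x in X], y)
params = train_concept(X, y, children, root_w, root_b)
loss_after = log_loss([forward(x, params, children, root_w, root_b)[1] for x in X], y)
print(round(loss_before, 3), round(loss_after, 3))
```

Because the parameters learned in step 1 are passed in and never updated, only the field around the new concept changes, which is the sense in which the learning is field-wise.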
Knowledge reasoning: quantitative argumentation-based method
To ensure the consistency of the strengths in the QAF between the learning process and the inference process, we define a Net-based reasoning method as our quantitative argumentation-based method for knowledge reasoning. In the Net-based reasoning algorithm, we have a strength . is the strength value of argument . The strength values are then updated by performing the following two steps for all from bottom to top, up to the concept with global semantics:
An instance is denoted as , corresponding to all features in the dataset. It is worth noting that is preprocessed, such that . For simplicity of presentation, we take the outputs of the first round of concept mining as an example to illustrate the reasoning process of the QAF. Thus, in , the children of the concept with global semantics are all features, represented as , and the edges are denoted as
where and function are instantiated by the field-wise learning algorithm in the previous process.
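A bottom-up evaluation of this kind can be sketched as below. The sketch assumes that each non-leaf argument aggregates the weighted strengths of its children, adds a bias, and passes the sum through the logistic function, which mirrors the net used during learning; the tree, weights, biases and instance values are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def strength(node, tree, x):
    """Bottom-up strength of `node`; leaf features read their value from x."""
    if node not in tree:  # a feature: its strength comes straight from the data
        return x[node]
    bias, edges = tree[node]
    return sigmoid(bias + sum(w * strength(child, tree, x) for child, w in edges.items()))

# Hypothetical stacked structure: node -> (bias, {child: edge weight}).
# Positive weights act as supports, negative weights as attacks.
tree = {
    "Risk": (-1.0, {"Installment": 2.0, "NumSatisfactoryTrades": -1.5}),
    "Installment": (-0.5, {"NumInstallTradesWBalance": 1.0, "PercentInstallTrades": 1.2}),
}
# Hypothetical preprocessed instance with feature values in [0, 1].
x = {"NumInstallTradesWBalance": 0.8, "PercentInstallTrades": 0.6, "NumSatisfactoryTrades": 0.3}

risk = strength("Risk", tree, x)
print(round(risk, 3))
```

Raising the value of a supporting leaf can only raise the strength of the root, and raising an attacking leaf can only lower it, since the logistic function is monotone.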
By completing the inference for all samples in the evaluation dataset using the quantitative argumentation-based method, we can use the resulting Area-Under-Curve (AUC) metric to evaluate the performance of . When the concept mining process generates new concepts, the reasoning method can evaluate whether the concepts are useful for decision making and thus filter out the irrelevant ones. The process is described as follows.
When the evaluation is over, the kept concepts and the features not grouped during concept mining enter the knowledge mining process as a new round of input. CAM performs the concept mining method and the quantitative argumentation-based method repeatedly to generate all the important concepts from tabular data for the decision-making task, as shown in Fig. 2. When the concept mining method can no longer mine a new concept, its output is . In , every layer of important concepts is stacked up to the concept with global semantics. In this case, the quantitative argumentation-based method does not perform evaluation, but only acts as a reasoning machine. CAM is then constructed by combining and the reasoning machine. Thus, CAM can make decisions based on human-level knowledge.
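One way to picture the AUC-based evaluation and filtering is the sketch below. The acceptance rule is an assumption, since the exact criterion is not stated here: a candidate concept is kept when the AUC of the QAF that includes it is at least the AUC of the baseline QAF. The function names and the numbers in the usage lines are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def keep_concepts(baseline_auc, candidate_aucs, min_gain=0.0):
    """Keep concepts whose QAF does not score below the baseline (assumed rule)."""
    return [name for name, a in candidate_aucs.items() if a - baseline_auc >= min_gain]

# Root strengths produced by the reasoning step for four evaluation samples.
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
kept = keep_concepts(0.75, {"Installment": 0.78, "Noise": 0.70})
print(example_auc, kept)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation would be preferable for datasets of the size used in the experiments.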
Dialogical explanation within CAM
CAM is capable of providing the underlying structure for generating dialogical explanations for users. A user may interact with CAM by requesting an explanation of an argument (a concept or a feature).
Given a of an instance x, and its with strength , an argumentation dialogue between a user and CAM consists of explanation requests for from user, to which CAM responds with explanation
Inspired by review, we provide a simple argumentation dialogue as follows.
Let be functions giving positive primary, negative primary, positive secondary and negative secondary, for any argument :
For any , if , let ; else, let , , where refers to the argument , at which the value of () is as large as possible. Then, an argumentation dialogue is such that for any :
Our intuition here is that the dialogical explanation is simpler than, but consistent with, CAM, giving at most two paths that contribute most to the concept with global semantics. The explanation of may consist of its supporters or attackers that have significant impact on , depending on whether is accepted or not. The dialogue is fairly repetitive, tracing down other important arguments in support of the result.
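The tracing behaviour can be sketched as follows. Several choices here are assumptions rather than the definitions above: a child's contribution is taken to be its edge weight times its strength, an argument counts as accepted when its strength is at least 0.5, and only the primary supporter or attacker is returned (secondary ones are omitted). The tree, weights and strengths are hypothetical.

```python
def contributions(node, tree, strengths):
    """Signed contribution of each child: edge weight times child strength (assumed)."""
    return {child: w * strengths[child] for child, w in tree[node].items()}

def explain(node, tree, strengths, accept_at=0.5):
    """Strongest supporter if `node` is accepted, otherwise strongest attacker."""
    contrib = contributions(node, tree, strengths)
    accepted = strengths[node] >= accept_at
    side = {c: v for c, v in contrib.items() if v != 0 and (v > 0) == accepted}
    return max(side, key=lambda c: abs(side[c])) if side else None

def dominant_path(root, tree, strengths):
    """Follow the child with the largest absolute contribution down to a feature."""
    path = [root]
    while path[-1] in tree:
        contrib = contributions(path[-1], tree, strengths)
        path.append(max(contrib, key=lambda c: abs(contrib[c])))
    return path

# Hypothetical learned edge weights and strengths for one instance.
tree = {
    "Risk": {"Installment": 2.0, "NumSatisfactoryTrades": -1.5},
    "Installment": {"NumInstallTradesWBalance": 1.0, "PercentInstallTrades": 1.2},
}
strengths = {
    "Risk": 0.7, "Installment": 0.8, "NumSatisfactoryTrades": 0.3,
    "NumInstallTradesWBalance": 0.8, "PercentInstallTrades": 0.9,
}

print(explain("Risk", tree, strengths))
print(dominant_path("Risk", tree, strengths))
```

Each user request corresponds to one call to `explain` on the argument just named, so the dialogue walks the same path that `dominant_path` returns in one go.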
Introduction of dataset
We evaluate CAM on two real-world benchmark datasets from high-risk applications, denoted as Fico and Mimic3, and two in-domain anti-fraud datasets collected from two Alibaba e-commerce applications, denoted as data1 and data2. These datasets are medium-sized, with 10K-100K samples, and Table 1 summarizes them. Details about the datasets are described in the Appendix.
|Source Type||Name||Domain||#Samples||#Features||Positive rate|
|In-domain dataset||data1||E-commerce||96452||33||3.2%|
|Interpretable models||Black-box models|
Experimental settings
We use 80-20 splits for the training and evaluation sets, and we repeat the experiments with five random seeds. All the datasets are for binary classification, and we use AUC as the evaluation metric. CAM is compared against the interpretable methods LR, EBM and NODE-GA2M, and the black-box models MLP and XGBoost (XGB) xgb. The compared models are selected because they are commonly used classification tools for tabular data. Among them, XGB is widely regarded as a classification model with excellent performance, and EBM and NODE-GA2M are considered state-of-the-art interpretable models in recent years. In the Appendix, we provide details about data preprocessing and the hyperparameter selection for the models.
Analysis of classification results
In Table 2, we present the comparative results of the proposed CAM and the LR, EBM, NODE-GA2M, MLP and XGB models in terms of the mean and std of AUC. The mean is the average of the five experimental results and indicates the average performance of the model, and the std measures the spread of the five results and indicates the stability of the model. The best experimental results are in bold font. The results show that CAM achieves the best mean AUC on the Fico dataset, while XGB performs best on the other three datasets. On average, CAM ranks third behind XGB and EBM, and EBM has only a small lead over CAM. As for the std of AUC, CAM performs best on Mimic3, EBM achieves the best score on data1, and NODE-GA2M runs most stably on Fico and data2. On average, CAM outperforms the other models as the most consistent model. Overall, the results show that CAM achieves competitive performance among all the interpretable and black-box models and has high stability.
Interpretability Analysis on Fico Dataset
Here we interpret the CAM instantiated for risk prediction on the Fico dataset. The Fico dataset contains 10K credit bureau reports of consumers that are used for predicting their loan default risk. We provide the details in the Appendix.
First, we analyze the interpretability of the global model from both semantic and argumentative perspectives. We can see that the top layer of the concept tree has 11 nodes, of which 6 concepts are generated by grouping and abstracting 18 features, and each concept can be treated as a risk factor. From the semantic perspective, taking the concept inquiry as an example, the two child concepts MSinceMostRecInqexcl7days and NumInquiry describe different aspects of inquiry, and the features NumInqLast6Mexcl7days and NumInqLast6M in turn describe different aspects of NumInquiry. The resulting semantic tree for inquiry is reasonable, since all related features are grouped into concepts with different granularities. Additionally, we can see that irrelevant features or concepts are not mixed in during grouping, and that the abstraction process makes the semantics of the nodes more general. From an argumentative perspective, in Fig. 3, the red and blue nodes represent the supporters and attackers of the concept risk, respectively. That is, as the strength of a blue node increases, the value of the concept risk decreases; the opposite is true for the red nodes. Specifically, we can see that when the values of the features NumInqLast6Mexcl7days and NumInqLast6M increase, the value of the concept NumInquiry increases, then the value of the concept Inquiry increases, and finally the value of the concept Risk increases. Combined with the semantic information in the description, our observation is reasonable, as it can be summarized as ‘when the lending institution pulls a consumer’s credit bureau report more frequently, the consumer’s risk of defaulting on a loan increases’.
Similarly, we can obtain other observations in line with human intuition, such as ‘when the consumer’s revolving balance increases, the consumer’s risk of defaulting on a loan increases’ and ‘when the number of credit agreements on a consumer credit bureau report with on-time payments (satisfactory) increases, the consumer’s risk of defaulting on a loan decreases’.
Second, we start with a local example shown in Fig. 4, for which CAM gives the result of the reasoning and an explanation in the form of a dialogical tree, shown as follows.
Users can end the conversation at any time, depending on their understanding of CAM, and the key reasoning path we can get from the above conversation is that: . With the description, we can understand the reasoning path as ‘in this case, the installment balance of the consumer is 471% of his original loan amount, which leads the fraction of installment risk factor to increase by 68.7%. The significantly increased fraction of installment risk factor leads the installment risk factor to increase by 25%, since the fraction of installment is significantly higher than that of other individuals. Finally, with the highest feature weight, the installment risk factor contributes most to the risk decision’. This explanation is intuitive and easy for humans to understand.
Conclusions and future work
In this work, we have proposed CAM as an interpretable model for tabular data in high-risk domains. CAM consists of a concept mining method to automatically acquire human-level knowledge as concepts in the form of QAF from both data descriptions and underlying data, and a quantitative argumentation-based method to evaluate the discovered concepts. We have also provided explanations for decisions by showing a dominant path for each reasoning process within CAM in the form of a dialogical tree.
In our experiments, we have applied CAM to four datasets in high-risk domains. The classification results indicate that CAM can reach competitive results compared with other state-of-the-art models, and the interpretation results show that CAM is a globally interpretable model and is able to provide explanations. The knowledge inside the model is coherent with human understanding.
As a new attempt at combining knowledge mining and quantitative argumentation in interpretability research for tabular data, some aspects of this article are still preliminary. First, the abstraction of concept names still relies on humans to provide abstracted descriptions of concepts. Second, CAM has been applied in Alibaba applications and has helped users to understand the decisions, but we did not collect and evaluate users’ feedback. In future work, we are going to collect data about the feedback of users who receive answers and explanations for their queries, and use it in our empirical study.
A: Details in Experiments
Preprocessing and acquisition of datasets
For all datasets used in the experiment, we use 80-20 as our train-val splits, and we split all datasets with five random seeds. After that, a standardization scaler is applied to three datasets, respectively. For the FICO dataset, we fill the missing values with the mean values and drop entirely empty rows. Links to the benchmark datasets used in the experiments are listed in Table 3.
For CAM training with tabular data, we target-encode categorical features, apply a quantile transform to all features, and encode the binned features with a one-hot encoder. For the data descriptions, we first remove all special characters from the text and then use pretrained multilingual sentence embedding models from the Sentence-BERT package reimers-2020-multilingual-sentence-bert.
For CAM, we use cosine distance to calculate the similarities between data descriptions and set the threshold to 0.55 to generate potential concepts. For the argumentation-based field-wise filter, we freeze the weights of the original model and fine-tune the argumentation-based model for 5 epochs. For the Logistic Regression backbone model, we use L-BFGS liu89 as the optimizer and train the model with one single large batch.
For XGB, we select the hyperparameters from NODE-GAM to make sure the model is fully converged.
For LR, we use L2 regularization and the default regularization weight from the scikit-learn package.
For EBM, we use the default parameters and set up the number of max rounds to 20K to make sure the model is fully converged.
For MLP, we use a three-layer MLP with 128 and 64 hidden units, apply batch normalization Ioffe2015BatchNA, and use LeakyReLU Xu2015EmpiricalEO as the activation function.
For NODE-GA2M, we use the second-order interaction mode and use default values for all other hyperparameters.
B: Details in Fico dataset
The dataset contains 23 financial features that include trade, inquiry, delinquency, satisfactory, and utilization information. Every credit agreement between the consumer and a lending institution is represented by a separate line of information called a trade line, which is often shortened to the term ‘trade’. An ‘inquiry’ is also a line of information, but captures when a lending institution has pulled a consumer’s credit bureau report in order to make a credit decision. The term ‘delinquency’ refers to a payment received some period of time past its due date. This is typically measured in 30-day intervals, such as 60 days delinquent or 90 days delinquent. NumSatisfactoryTrades counts the number of credit agreements on a consumer credit bureau report with on-time payments (satisfactory). NumTradesWHighUtilization counts the number of credit cards on a consumer credit bureau report carrying a balance that is at 75% of its limit or greater. The ratio of balance to limit is referred to as ‘utilization’. The data description is listed in Table 4.
|ExternalRiskEstimate||Consolidated version of risk markers.|
|MSinceOldestTradeOpen||Months Since Oldest Trade Open.|
|MSinceMostRecentTradeOpen||Months Since Most Recent Trade Open.|
|AverageMInFile||Average Months in File.|
|NumSatisfactoryTrades||Number Satisfactory Trades.|
|NumTrades60Ever2DerogPubRec||Number Trades 60+ Ever.|
|NumTrades90Ever2DerogPubRec||Number Trades 90+ Ever.|
|PercentTradesNeverDelinquency||Percent Trades Never Delinquent.|
|MSinceMostRecentDelinquency||Months Since Most Recent Delinquency.|
|MaxDelinquency2PublicRecLast12M||Max Delinquency/Public Records Last 12 Months. And value of it from 0 to 7 indicates a drop in delinquent time from 120 days+ to never.|
|MaxDelinquencyEver||Max Delinquency Ever. And value of it from 0 to 7 indicates a drop in delinquent time from 120 days+ to never.|
|NumTotalTrades||Number of Total Trades (total number of credit accounts).|
|NumTradesOpeninLast12M||Number of Trades Open in Last 12 Months.|
|PercentInstallTrades||Percent Installment Trades. Installment is one of several equal payments for something, spread over an agreed period of time.|
|MSinceMostRecentInqexcl7days||Months Since Most Recent Inquiries excluding 7days.|
|NumInqLast6M||Number of Inquiries Last 6 Months.|
|NumInqLast6Mexcl7days||Number of Inquiries Last 6 Months excluding 7days.|
|NetFractionRevolvingBurden||Net Fraction Revolving Burden. This is revolving balance divided by credit limit.|
|NetFractionInstallBurden||Net Fraction Installment Burden. This is installment balance divided by original loan amount.|
|NumRevolvingTradesWBalance||Number Revolving Trades with Balance.|
|NumInstallTradesWBalance||Number Installment Trades with Balance. Installment is one of several equal payments for something, spread over an agreed period of time.|
|NumBank2NatlTradesWHighUtilization||Number of Trades with high utilization ratio.|
|PercentTradesWBalance||Percent Trades with Balance.|
References for appendix