This paper aims to investigate the possible effects of cognitive biases on human understanding of machine learning models, in particular inductively learned rules. We use the term “cognitive bias” as a representative for various cognitive phenomena that manifest themselves in the form of occasionally irrational reasoning patterns, which are thought to allow humans to make fast judgments and decisions.
Their cumulative effect on human reasoning should not be underestimated, since early work in this area already showed that “cognitive biases seem reliable, systematic, and difficult to eliminate” (Kahneman and Tversky, 1972). The effect of some cognitive biases is more pronounced when people do not have well-articulated preferences (Tversky and Simonson, 1993), which is often the case in explorative machine learning.
Previous works have analysed the impact of cognitive biases on multiple types of human behaviour and decision making. A specific example is the seminal book “Social cognition” by Kunda (1999), which is concerned with the impact of cognitive biases on social interaction. Another, more recent work by Serfas (2011) is focused on the context of capital investment. Closer to the domain of machine learning, in their article entitled “Psychology of Prediction”, Kahneman and Tversky (1973)
warned that cognitive biases can lead to violations of Bayes’ theorem when people make fact-based predictions under uncertainty. These results directly relate to inductively learned rules, since these are associated with measures such as confidence and support expressing the (un)certainty of the prediction they make. Despite some early works (Michalski, 1969, 1983) showing the importance of studying cognitive phenomena for rule induction and machine learning in general, there has been a paucity of follow-up research. In previous work (Fürnkranz et al., 2018), we have evaluated a selection of cognitive biases in the very specific context of whether minimizing the complexity or length of a rule will also lead to increased interpretability, which is often taken for granted in machine learning research.
In this paper, we attempt to systematically relate cognitive biases to the interpretation of machine learning results. To that end, we review twenty cognitive biases that can distort interpretation of inductively learned rules. The review is intended to help to answer questions such as: Which cognitive biases affect understanding of symbolic machine learning models? What could help as a “debiasing antidote”?
This paper is organized as follows. Section 2 provides a brief review of related work published at the intersection of rule learning and psychology, defining rule induction and cognitive bias along the way. Section 3 motivates our study on an example of the insensitivity to sample size effect. Section 4 describes the criteria that we applied to select a subset of cognitive biases into our review. The twenty selected biases are covered in Section 5. The individual cognitive biases have disparate effects and causes. Section 6 provides a concise set of recommendations aimed at developers of rule learning algorithms and user interfaces. In Section 7 we state the limitations of our review and outline directions for future work. The conclusions summarize the contributions of the paper.
2 Background and Related Work
We selected individual rules as learnt by many machine learning algorithms as the object of our study. Focusing on simple artefacts (individual rules) as opposed to entire models such as rule sets or rule lists allows a deeper, more focused analysis, since a rule is a small self-contained item of knowledge. Making a small change in one rule, such as adding a new condition, makes it possible to test the effect of an individual factor that can influence perception of rule plausibility. In this section, we will briefly introduce the inductively learned rule. Then, we will focus on rule plausibility as a measure of rule comprehension.
2.1 Decision Rules in Machine Learning
An example of an inductively learned decision rule which we focus on here is shown in Figure 1. Following the terminology of Fürnkranz et al. (2012), the symbols in the rule represent an arbitrary number of literals, i.e., Boolean expressions which are composed of an attribute name (e.g., veil) and its value (e.g., white). The conjunction of literals on the left side of the rule is called the antecedent, and the single literal predicted by the rule is called the consequent. Literals in the antecedent are sometimes referred to as conditions throughout the text, and the consequent as the target. While this rule definition is restricted to conjunctive rules, other definitions, e.g., the formal definition given by Slowinski et al. (2006, page 2), also allow for negation and disjunction as connectives.
Rules in the output of rule learning algorithms are most commonly characterized by two parameters, confidence and support. The confidence of a rule is defined as $a/(a+b)$, where $a$ is the number of objects that match the conditions of the rule antecedent as well as the rule consequent, and $b$ is the number of objects that match the antecedent but not the consequent. The support of a rule is either defined as $a/n$, where $n$ is the number of all objects (relative support), or simply as $a$ (absolute support).
In the special case of classification rule learning, the consequent of a rule consists only of a single literal, the so-called class. In this case, $a$ is also known as the true positives, and $b$ as the false positives.
Some rule learning frameworks, in particular association rule learning (Agrawal et al., 1995; Zhang and Zhang, 2002), require the user to set thresholds for minimum confidence and support. Only rules with confidence and support values meeting or exceeding these thresholds are included in the output of rule learning and presented to the user.
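The two measures and the threshold filter can be sketched in a few lines of Python. This is a minimal illustration rather than the interface of any particular rule learner: the class `RuleStats`, the function `passes_thresholds`, and the count names `a` (objects matching antecedent and consequent), `b` (objects matching the antecedent only) and `n` (all objects) are assumptions introduced here, and the counts are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleStats:
    """Counts describing one rule (hypothetical helper, not a library API)."""
    a: int  # objects matching both the antecedent and the consequent
    b: int  # objects matching the antecedent but not the consequent
    n: int  # all objects in the data set

    @property
    def confidence(self) -> float:
        covered = self.a + self.b
        return self.a / covered if covered else 0.0

    @property
    def relative_support(self) -> float:
        return self.a / self.n

    @property
    def absolute_support(self) -> int:
        return self.a


def passes_thresholds(rule: RuleStats, min_confidence: float, min_support: float) -> bool:
    """Keep a rule only if it meets or exceeds both user-set thresholds."""
    return rule.confidence >= min_confidence and rule.relative_support >= min_support


# Made-up counts for two rules learnt from 1000 objects.
frequent_rule = RuleStats(a=80, b=20, n=1000)
rare_rule = RuleStats(a=9, b=1, n=1000)
kept = [r for r in (frequent_rule, rare_rule)
        if passes_thresholds(r, min_confidence=0.7, min_support=0.05)]
```

With these assumed thresholds only `frequent_rule` survives: `rare_rule` has the higher confidence but falls below the minimum support.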
2.2 Study of Rules in Cognitive Science
Rules closely relate to Bayesian inference, which also frequently occurs in models of human reasoning. Consider the following rule learnt from data: “IF A AND B THEN C”. This rule can be interpreted as a hypothesis corresponding to the logical implication $A \wedge B \Rightarrow C$. We can express the plausibility of such a hypothesis in terms of Bayesian inference as the conditional probability $P(C \mid A, B)$. This corresponds to the confidence of the rule, a term used in rule learning, and to strength of evidence, a term used by cognitive scientists (Tversky and Kahneman, 1974).
Footnote 1: In the terminology used within the scope of cognitive science (Griffin and Tversky, 1992), confidence corresponds to the strength of the evidence and support to the weight of the evidence. Interestingly, balancing of the likelihood of the judgment and the weight of the evidence in the assessed likelihood was already studied by Keynes (1922) (according to Camerer and Weber (1992)).
Since $P(C \mid A, B)$ is a probability estimate computed on a sample, another relevant piece of information for determining the plausibility of the hypothesis is the robustness of this estimate. This corresponds to the number of instances for which the rule has been observed to be true. The size of the sample (typically expressed as a ratio) is known as rule support in machine learning and as weight of the evidence in cognitive science (Tversky and Kahneman, 1974).
Psychological research on hypothesis testing in rule discovery tasks has been performed in cognitive science at least since the 1960s. The seminal article by Wason (1960) introduced what is widely referred to as Wason’s 2-4-6 task. Participants are given the sequence of numbers 2, 4 and 6 and asked to find out the rule that generated this sequence. In search of the hypothesized rule, they provide the experimenter with other sequences of numbers, such as 3-5-7, and the experimenter answers whether the provided sequence conforms to the rule or not. While the target rule is simply “ascending sequence”, people find it difficult to discover this specific rule, presumably because they use the positive test strategy, a strategy of testing a hypothesis by examining evidence confirming the hypothesis at hand rather than searching for disconfirming evidence (Klayman and Ha, 1987).
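The reason why the positive test strategy stalls in the 2-4-6 task can be illustrated with a small simulation. The sketch below is not taken from Wason (1960): the participant's guess (“each number is 2 more than the previous one”), the test sequences, and all function names are assumptions chosen for illustration.

```python
def true_rule(seq):
    """Experimenter's rule: any strictly ascending sequence."""
    return all(x < y for x, y in zip(seq, seq[1:]))


def hypothesis(seq):
    """Participant's (too narrow) guess: each number is 2 more than the previous one."""
    return all(y - x == 2 for x, y in zip(seq, seq[1:]))


def informative(tests):
    """Return the tests whose feedback contradicts the participant's guess."""
    return [t for t in tests if true_rule(t) != hypothesis(t)]


# Positive tests are sequences the guess predicts to conform to the rule.
positive_tests = [(3, 5, 7), (10, 12, 14), (1, 3, 5)]
# Negative tests are sequences the guess predicts NOT to conform.
negative_tests = [(1, 2, 5), (6, 4, 2)]

print(informative(positive_tests))
print(informative(negative_tests))
```

Every positive test receives a “yes” from the experimenter, so none of them can reveal that the guess is too narrow. Only the negative test (1, 2, 5), which the guess rejects but the true rule accepts, yields disconfirming evidence.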
2.3 Cognitive Bias
According to the Encyclopedia of Human Behavior (Wilke and Mata, 2012), the term cognitive bias was introduced in the 1970s by Amos Tversky and Daniel Kahneman (Tversky and Kahneman, 1974), and is defined as:
Systematic error in judgment and decision-making common to all human beings which can be due to cognitive limitations, motivational factors, and/or adaptations to natural environments.
The narrow initial definition of cognitive bias as a shortcoming of human judgment was criticized by German psychologist Gerd Gigerenzer, who in the late 1990s started the “Fast and frugal heuristic” program to emphasize the ecological rationality (validity) of judgmental heuristics, the use of which results in some notable cognitive biases.
In the present review, we define cognitive biases and associated phenomena broadly. We include cognitive biases related to thinking, judgment, and memory. We also include descriptions of thinking strategies and judgmental heuristics that may result in cognitive biases, even if they are not necessarily biases themselves.
An important aspect related to the study of cognitive biases is the validation of strategies for mitigation of their effects in cases when they lead to incorrect judgment. A number of such debiasing techniques have been developed, with researchers focusing intensely on the clinical and judicial domains (cf., e.g., Lau and Coiera, 2009; Croskerry et al., 2013; Martire et al., 2014), apparently due to costs associated with any erroneous judgment in these domains. Nevertheless, general debiasing techniques can often be derived from such studies.
The choice of an appropriate debiasing technique typically depends on the source of an error induced by the bias, since the type of the error implies an appropriate debiasing strategy (Arkes, 1991). Larrick (2004) recognizes the following three categories: psychophysically-based error, association-based error, and strategy-based error. The first two are attributable to unconscious, automatic processes, sometimes referred to as “System 1”. The last one is attributed to reasoning processes (System 2). For biases attributable to System 1, the most generic debiasing strategy is to shift processing to the conscious System 2 (Lilienfeld et al., 2009; Shafir, 2013, p. 491).
Another perspective on debiasing is provided by Croskerry et al. (2013), who organize debiasing techniques by their way of functioning, rather than the bias they address, into the following three categories: educational strategies, workplace strategies and forcing functions. While Croskerry et al. (2013) focus on clinicians, their practitioner-oriented review of debiasing can be used as a starting point for analogous guidelines aimed at a machine learning audience. For example, the general workplace strategies applicable in the machine learning context include group decision making, personal accountability, and planning time-out sessions to help slow down.
Function and validity of cognitive biases
In the introduction, we briefly characterized cognitive biases as seemingly irrational reasoning patterns that are thought to allow humans to make fast and risk-averse decisions. In fact, the function of cognitive biases is a subject of scientific debate. According to the review of functional views by Pohl (2017), there are three fundamental positions among researchers. The first group considers them as dysfunctional errors of the system, the second group as faulty by-products of otherwise functional processes, and the third group as adaptive and thus functional responses. According to Pohl (2017), most researchers are in the second group, where cognitive biases are considered to be “built-in errors of the human information-processing systems”.
In this work, we consider cognitive biases as strategies that evolved to improve the fitness and chances of survival of the individual in particular situations or are consequences of such strategies. This defense of biases is succinctly expressed by Haselton and Nettle (2006): “Both the content and direction of biases can be predicted theoretically and explained by optimality when viewed through the long lens of evolutionary theory. Thus, the human mind shows good design, although it is design for fitness maximization, not truth preservation.”
According to the same paper, empirical evidence shows that cognitive biases are triggered or their effect strengthened by environmental cues and context (Haselton and Nettle, 2006). Given that the interpretation of machine learning results is a task unlike the simple automatic cognitive processes to which a human mind is adapted, cognitive biases are likely to have an influence upon it.
2.4 Measures of Interpretability, Perceived and Objective Plausibility
We claim that cognitive biases can affect the interpretation of rule-based models. However, how does one measure interpretability? According to our literature review, there is no generally accepted measure of interpretability of machine learning models. Model size, which was used in several studies, has recently been criticized (Freitas, 2014; Stecher et al., 2016; Fürnkranz et al., 2018) primarily on the grounds that the model’s syntactic size does not capture any aspect of the model’s semantics. A particular problem related to semantics is the compliance with pre-existing expert knowledge, such as domain-specific monotonicity constraints.
In our work, we embrace the concept of plausibility to measure interpretability. The word “plausible” is defined according to the Oxford Dictionary of US English as “seeming reasonable or probable” and according to the Cambridge dictionary of UK English as “seeming likely to be true, or able to be believed”. We can link the inductively learned rule of machine learning to the concept of “hypothesis” used in cognitive science. There is a body of work in cognitive science on analyzing the perceived plausibility of hypotheses (Gettys et al., 1978, 1986; Anderson and Fleming, 2016).
In a recent review of interpretability definitions by Bibal and Frénay (2016), the term plausibility is not explicitly covered, but a closely related concept of justifiability is stated to depend on interpretability. Martens et al. (2011) define justifiability as “intuitively correct and in accordance with domain knowledge”. By adopting plausibility, we address the concern expressed in Freitas (2014) regarding the need to reflect domain semantics when interpretability is measured.
We are aware of the fact that if a decision maker finds a rule plausible, it does not necessarily mean that the rule is correctly understood; it can be quite the contrary in many cases. Nevertheless, we believe that the alignment of the perceived plausibility with the objective, data-driven plausibility of a hypothesis should be at the heart of an effort that strives for interpretable machine learning.
3 Motivational Example
It is well known in machine learning that chance rules with a deceptively high confidence can appear in the output of rule learning algorithms (Azevedo and Jorge, 2007). For this reason, the rule learning process typically outputs both confidence and support for the analyst to make an informed choice about merits of each rule.
In the example above, both rules are associated with values of confidence and support to inform about the strength and weight of evidence for both rules. While the first rule is less strong (80% vs 90% correct), its weight of the evidence is ten times higher than that of the second rule.
According to the insensitivity to sample size effect (Tversky and Kahneman, 1974), there is a systematic bias in human thinking that makes humans overweigh the strength of evidence (confidence) and underweigh the weight of evidence (support). The bias has also been shown in psychologists knowledgeable in statistics (Tversky and Kahneman, 1971) and thus is likely to be applicable to the growing number of professions that use rule learning to obtain insights from data.
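One way to make the weight of evidence visible is to accompany each confidence value with an interval whose width depends on the number of covered objects. The sketch below uses the Wilson score interval, which is one common choice and not a method prescribed by the cited studies. The counts are assumptions: 80 correct out of 100 covered objects for the first rule and 9 out of 10 for the second, matching the 80% and 90% confidence values and the tenfold difference in evidence.

```python
import math


def wilson_interval(correct: int, covered: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if covered == 0:
        return (0.0, 1.0)
    p = correct / covered
    denom = 1 + z**2 / covered
    centre = (p + z**2 / (2 * covered)) / denom
    half = z * math.sqrt(p * (1 - p) / covered + z**2 / (4 * covered**2)) / denom
    return (centre - half, centre + half)


# Assumed counts: 80 correct out of 100 covered vs. 9 correct out of 10 covered.
strong_evidence = wilson_interval(80, 100)
weak_evidence = wilson_interval(9, 10)

print(f"80/100: {strong_evidence[0]:.2f} to {strong_evidence[1]:.2f}")
print(f" 9/10:  {weak_evidence[0]:.2f} to {weak_evidence[1]:.2f}")
```

Under these assumed counts, the interval for the rule with 90% confidence is more than twice as wide as the interval for the rule with 80% confidence, and its lower bound is lower.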
The analysis of relevant literature from cognitive science not only reveals applicable biases, but also sometimes provides methods for limiting their effect (debiasing). The standard way used in rule learning software for displaying rule confidence and support metrics is to use percentages, as in our example. Extensive research in psychology has shown that if frequencies are used instead, then the number of errors in judgment drops (Gigerenzer and Goldstein, 1996; Gigerenzer and Hoffrage, 1995). Reflecting these suggestions, the first rule in our example could be presented as follows:
Rules can be presented in different ways (as shown), and depending on the way the information is presented, humans may perceive their plausibility differently. In this particular example, confidence is no longer conveyed as a percentage “80%” but using the expression “80 out of 100”. Support is presented as an absolute number (100) rather than as a percentage (10%).
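The two presentation formats can be generated from the same counts. The following sketch is hypothetical: the function names and the wording of the frequency sentence are assumptions, support is read here as the share of objects matched by the rule's conditions (which is what makes 100 objects correspond to 10%), and the total of 1000 objects is inferred from those two figures.

```python
def as_percentages(correct: int, covered: int, total: int) -> str:
    """Conventional display: confidence and support as percentages."""
    confidence = 100 * correct / covered
    support = 100 * covered / total
    return f"confidence = {confidence:.0f}%, support = {support:.0f}%"


def as_frequencies(correct: int, covered: int) -> str:
    """Frequency display: the same information expressed as counts."""
    return (f"The rule's conditions match {covered} objects in the data; "
            f"the prediction is correct for {correct} out of {covered} of them.")


# 80 out of 100, where 100 objects are 10% of an assumed 1000 objects.
print(as_percentages(80, 100, 1000))
print(as_frequencies(80, 100))
```

A user interface could offer both formats and default to the frequency one, leaving the percentage view available to analysts who prefer it.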
A correct understanding of machine learning models can be difficult even for experts. In this section, we tried to motivate why addressing cognitive biases can play an important role in making the results of inductive rule learning more understandable. In the remainder of this paper, the bias applied to our example will be revisited in greater depth, along with 19 other biases.
4 Selection Criteria
A number of cognitive biases have been discovered, experimentally studied, and extensively described in the literature. As Pohl (2017) states in a recent authoritative book on cognitive illusions: “There is a plethora of phenomena showing that we deviate in our thinking, judgment and memory from some objective and arguably correct standard.”
We first selected a subset of biases which would be reviewed. To select applicable biases, we considered those that have some relation to the following properties of inductively learned rules: 1. rule length (the number of literals in an antecedent), 2. rule interest measures (especially support and confidence), 3. position (ordering) of conditions in a rule and ordering of rules in the rule list, 4. specificity and predictive power of conditions (correlation with a target variable), 5. use of additional logical connectives (conjunction, disjunction, negation), 6. treatment of missing information (inclusion of conditions referring to missing values), and 7. conflict between rules in the rule list.
Through selection of appropriate learning heuristics, the rule learning algorithm can influence these properties. For example, most heuristics implement some form of a trade-off between the coverage or support of a rule, and its implication strength or confidence (Fürnkranz and Flach, 2005; Fürnkranz et al., 2012).
5 Review of Cognitive Biases
In this section, we cover a selection of twenty cognitive biases. For all of them, we include a short description including an example of a study demonstrating the bias and its proposed explanation. We pay particular attention to their potential effect on the interpretability of rule learning results, which has not been covered in previous works.
For all cognitive biases we also suggest a debiasing technique that could be effective in aligning the perceived plausibility of the rule with its objective plausibility.
In a recent scientometric survey of research on cognitive biases in the Information Systems (IS) field, no papers aiming at machine learning are mentioned. For general IS research, “most articles’ research goal [is] to provide an explanation of the cognitive bias phenomenon rather than to develop ways and strategies for its avoidance or targeted use” (Fleischmann et al., 2014). Our research aims at advancement of the field beyond explanation of applicable phenomena, by discussing specific debiasing techniques. However, as Lilienfeld et al. (2009) note, there is a general paucity of research on debiasing in psychological literature, and the existing techniques suffer from a lack of theoretical coherence and mixed research evidence concerning their efficacy. Consequently, in some cases the debiasing techniques that we describe have only limited grounding in applicable psychological research and require further validation.
An overview of the main features of the reviewed cognitive biases is presented in Table 1.
|phenomenon||implications for rule learning||debiasing technique|
|Representativeness Heuristic||Overestimate the probability of condition representative of consequent||Use natural frequencies instead of ratios or probabilities|
|Averaging Heuristic||Probability of antecedent as the average of probabilities of conditions||Reminder of probability theory|
| ||Prefer more specific conditions over less specific||Inform on taxonomical relation between conditions; explain benefits of higher support|
| ||Emphasis on confidence, neglect for support||Express confidence and support in natural frequencies|
|Insensitivity to Sample Size||Analyst does not realize the increased reliability of confidence estimate with increasing value of support||Present support as absolute number rather than percentage; use support to compute confidence (reliability) intervals for the value of confidence|
| ||Ease of recollection of instances matching the rule||Explain to analyst why instances matching the particular rule are (not) easily recalled|
| ||Presentation of redundant rules or conditions increases plausibility||Rule pruning; clustering; explaining overlap|
| ||Rules confirming analyst’s prior hypothesis are “cherry picked”||Explicit guidance to consider evidence for and against hypothesis; education about the bias; interfaces making users slow down|
|Mere Exposure Effect||Repeated exposure (even subconscious) results in increased preference||Changes to user interfaces that limit subliminal presentation of rules|
|Overconfidence and Underconfidence||Rules with small support and high confidence are “overrated”||Present less information when not relevant via pruning, feature selection, limiting rule length; actively present conflicting rules/knowledge|
| ||Recognition of attribute or its value increases preference||More time; knowledge of attribute/value|
| ||Belief that more information (rules, conditions) will improve decision making even if it is irrelevant||Communicate attribute importance|
| ||Prefer rules without unknown conditions||Increase user motivation; instruct users to provide textual justifications|
|Confusion of the Inverse||Confusing the confidence of the rule, $P(\text{consequent} \mid \text{antecedent})$, with $P(\text{antecedent} \mid \text{consequent})$||Training in probability theory; unambiguous wording|
|Misunderstanding of “and”||“and” is understood as disjunction||Unambiguous wording; visual representation|
|Context and Tradeoff Contrast||Preference for a rule is influenced by other rules||Removal of rules, especially of those that are strong, yet irrelevant|
| ||Words with negative valence in the rule make it appear more important||Review words with negative valence in data, and possibly replace with neutral alternatives|
| ||Information presented first has the highest impact||Education on the bias; re-sorting; rule annotation|
|Weak Evidence Effect||Condition only weakly perceived as predictive of target decreases plausibility||Numerical expression of strength of evidence; omission of weak predictors (conditions)|
| ||Conditions are perceived to have same importance||Inform on discriminatory power of conditions|
5.1 Conjunction Fallacy and Representativeness Heuristic
The conjunction fallacy refers to a judgment that is inconsistent with the conjunction rule: the probability of a conjunction, $P(X \wedge Y)$, cannot exceed the probability of its constituents, $P(X)$ and $P(Y)$. It is often illustrated with the “Linda” problem in the literature (Tversky and Kahneman, 1983, page 299).
In the Linda problem (Figure 2), subjects are asked to compare conditional probabilities $P(B \mid L)$ and $P(F \wedge B \mid L)$, where $B$ refers to “bank teller”, $F$ to “active in feminist movement” and $L$ to the description of Linda (Bar-Hillel, 1991).
Multiple studies have shown that people tend to consistently select the second hypothesis as more probable, which is in conflict with the conjunction rule. In other words, it always holds for the Linda problem that $P(F \wedge B \mid L) \leq P(B \mid L)$.
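The inequality follows directly from the product rule of probability. In the short derivation below, the symbols are chosen for illustration: $B$ stands for “bank teller”, $F$ for “active in feminist movement” and $L$ for the description of Linda.

```latex
\begin{align*}
P(F \wedge B \mid L) &= P(B \mid L)\, P(F \mid B, L) \\
                     &\leq P(B \mid L) \qquad \text{since } 0 \leq P(F \mid B, L) \leq 1 .
\end{align*}
```

Equality holds only if every bank teller fitting the description is also active in the feminist movement, i.e., if $P(F \mid B, L) = 1$, or if $P(B \mid L) = 0$.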
The conjunction fallacy is historically explained by use of the representativeness heuristic (Kahneman and Tversky, 1972). The representativeness heuristic refers to the tendency to make judgments based on similarity, based on the rule “like goes with like”, which is typically used to determine whether an object belongs to a specific category. When people use the representativeness heuristic, “probabilities are evaluated by the degree to which A is representative of B, that is by the degree to which A resembles B” (Tversky and Kahneman, 1974). This heuristic provides people with means for assessing a probability of an uncertain event. It is used to answer questions such as “What is the probability that object A belongs to class B? What is the probability that event A originates from process B?” (Tversky and Kahneman, 1974).
The representativeness heuristic is not the only explanation for the results of the conjunction fallacy experiments. Hertwig et al. (2008) hypothesized that the fallacy is caused by “a misunderstanding about conjunction”, in other words by a different interpretation of “probability” and “and” by the subjects than assumed by the experimenters. The validity of this alternative hypothesis has been subject to criticism (Tentori and Crupi, 2012), nevertheless some empirical evidence suggests that the problem of correct understanding of “and” is of particular importance to rule learning (Fürnkranz et al., 2018).
Recent research has provided several explanations for conjunctive and disjunctive (cf. Section 5.4) fallacies: the Configural Weighting and Adding (CWA) theory (Nilsson et al., 2009), the application of principles of quantum cognition (Bruza et al., 2015), and inductive confirmation theory (Tentori et al., 2013). In the following, we will focus on the CWA theory. CWA essentially assumes that the conjunctive and disjunctive fallacies arise because decision makers compute a weighted average of the component probabilities instead of multiplying them. For conjunctions, weights are set so that more weight is assigned to the lower component probability. For disjunctive probabilities, more weight is assigned to the likely component. This assumption was verified in at least one study (Fisk, 2002). For more discussion of the related averaging heuristic, cf. Section 5.3.
Implications for rule learning
Rules are not composed only of conditions, but also of an outcome (the value of a target variable in the consequent). A higher number of conditions generally allows the rule to filter a purer set of objects with respect to the value of the target variable than a smaller number of conditions.
Application of the representativeness heuristic can affect human perception of rule plausibility, in that rules that are more “representative” of the user’s mental image of the concept may be preferred even in cases when their objective discriminatory power may be lower.
A number of factors that decrease the proportion of subjects exhibiting the conjunction fallacy have been identified: Charness et al. (2010) found that the number of participants committing the fallacy is reduced under a monetary incentive. Such an addition was reported to drop the fallacy rate in their study from 58% to 33%. The observed rate under a monetary incentive suggests smaller importance of this problem for important real-life decisions. Zizzo et al. (2000) found that unless the decision problem is simplified, neither monetary incentives nor feedback can ameliorate the fallacy rate. A reduced task complexity is a precondition for monetary incentives and feedback to be effective.
Stolarz-Fantino et al. (1996) observed that the rate of fallacies is reduced but still strongly present when the subjects receive training in logic. Gigerenzer and Goldstein (1996) as well as Gigerenzer and Hoffrage (1995) showed that the rate of fallacies can be reduced or even eliminated by presenting the problems in terms of frequencies rather than probabilities.
Nilsson et al. (2009) present a computer simulation showing that when the component probabilities are not precisely known, averaging often provides an equally good alternative to the normative computation of probabilities (cf. also Juslin et al. (2009)). This computational model could possibly be adapted to detect a high risk of fallacy, corresponding to the case when the deviation between the perceived probability and the normative probability is high.
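The idea of flagging cases with a large gap between the perceived and the normative probability can be sketched as follows. This is a strongly simplified illustration and not the model of Nilsson et al. (2009): the weight of 0.8 on the lower component, the threshold of 0.1, the independence assumption in the normative computation, and all function names are assumptions made for this example.

```python
def weighted_average_judgment(p1: float, p2: float, weight_on_lower: float = 0.8) -> float:
    """Perceived conjunction probability: a weighted average favouring the lower component."""
    low, high = min(p1, p2), max(p1, p2)
    return weight_on_lower * low + (1 - weight_on_lower) * high


def normative_conjunction(p1: float, p2: float) -> float:
    """Normative conjunction probability, assuming the two components are independent."""
    return p1 * p2


def high_fallacy_risk(p1: float, p2: float, threshold: float = 0.1) -> bool:
    """Flag a pair of conditions when perceived and normative probability diverge strongly."""
    deviation = weighted_average_judgment(p1, p2) - normative_conjunction(p1, p2)
    return deviation > threshold


print(high_fallacy_risk(0.9, 0.1))  # one likely and one unlikely condition
print(high_fallacy_risk(0.9, 0.9))  # two likely conditions
```

Under these assumed parameters, the pair with one likely and one unlikely condition is flagged, while the pair of two likely conditions is not. A rule learning interface could use such a flag to decide when to show an explicit reminder of how probabilities combine.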
5.2 Misunderstanding of “and”
The misunderstanding of “and” refers to a phenomenon affecting the syntactic comprehensibility of the logical connective “and”. As discussed by Hertwig et al. (2008), “and” in natural language can express several relationships, including temporal order, causal relationship, and most importantly, can also indicate a collection of sets (as in “He invited friends and colleagues to the party”) as well as their intersection. People can therefore interpret “and” in a different meaning than intended.
For example, according to the two experiments reported by Hertwig et al. (2008), the conjunction “bank teller and active in the feminist movement” used in the Linda problem (cf. Section 5.1) was found ambiguous by about half of the subjects: they explicitly asked the experimenter how “and” was to be understood. Furthermore, when participants indicated how they understood “and” by shading Venn diagrams, it turned out that about a quarter of them interpreted “and” as union rather than intersection, which is usually assumed by experimenters using the Linda problem.
Implications for rule learning
The formation of conjunctions via “and” is a basic building block of rules. Its correct understanding is thus important for effective communication of results of rule learning. Existing studies suggest that the most common type of error is understanding “and” as a union rather than intersection. In such a case, a rule containing multiple “ands” will be perceived as having a higher support than it actually has. Each additional condition will be incorrectly perceived as increasing the coverage of the rule. This implies higher perceived plausibility of the rule. Misunderstanding of “and” will thus generally increase the preference of rules with more conditions.
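The difference between the two readings of “and” can be made concrete on a toy data set. In the sketch below, the five objects, the attribute `odor` and all function names are made up for illustration; only the attribute-value pair veil = white is borrowed from the running example.

```python
# Toy data: each object is a dictionary of attribute values (made-up).
data = [
    {"veil": "white", "odor": "foul"},
    {"veil": "white", "odor": "none"},
    {"veil": "brown", "odor": "foul"},
    {"veil": "brown", "odor": "none"},
    {"veil": "white", "odor": "foul"},
]


def matching(attribute, value):
    """Indices of objects that satisfy a single condition."""
    return {i for i, row in enumerate(data) if row[attribute] == value}


def covered_as_intersection(conditions):
    """Intended reading: an object must satisfy every condition."""
    return set.intersection(*(matching(a, v) for a, v in conditions))


def covered_as_union(conditions):
    """Misreading: an object needs to satisfy at least one condition."""
    return set.union(*(matching(a, v) for a, v in conditions))


one_condition = [("veil", "white")]
two_conditions = [("veil", "white"), ("odor", "foul")]

for conditions in (one_condition, two_conditions):
    print(len(conditions),
          len(covered_as_intersection(conditions)),
          len(covered_as_union(conditions)))
```

Adding the second condition shrinks the set of covered objects under the intended reading (from three objects to two) but enlarges it under the misreading (from three to four), which is how the misunderstanding can inflate the perceived support of longer rules.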
According to Sides et al. (2002) “and” ceases to be ambiguous when it is used to connect propositions rather than categories. The authors give the following example of a sentence which is not prone to misunderstanding: “IBM stock will rise tomorrow and Disney stock will fall tomorrow.” Similar wording of rule learning results may be, despite its verbosity, preferred.
Mellers et al. (2001) showed that using “bank tellers who are feminists” or “feminist bank tellers” rather than “bank tellers and feminists” as a category in the Linda problem (Figure 2) might reduce the likelihood of committing the conjunction fallacy. It follows that using different wording instead of “and” such as “and also” might also help reduce the misunderstanding of “and”.
Representations that visually express the semantics of “and” such as decision trees may be preferred over rules, which do not provide such visual guidance.
Footnote 3: We find limited grounding for this theory in the following: conditions connected with an arc in a tree are to be interpreted as simultaneously valid (i.e., an arc means conjunction). A recent empirical study on the comprehensibility of decision trees (Piltaver et al., 2016) does not consider ambiguity of this notation to be a systematic problem among the surveyed users.
5.3 Averaging Heuristic
While the conjunction fallacy is most commonly explained by operation of the representativeness heuristic, the averaging heuristic provides an alternative explanation: it suggests that people evaluate the probability of a conjoint event as the average of probabilities of the component events (Fantino et al., 1997). As reported by Fantino et al. (1997), in their experiment “approximately 49% of variance in subjects’ conjunctions could be accounted for by a model that simply averaged the separate component likelihoods that constituted a particular conjunction.”
Implications for rule learning
Due to the application of the averaging heuristic the analyst may not fully realize the consequences of the presence of a low-probability condition for the overall likelihood of the set of conditions in the antecedent of the rule.
Consider the following example: Let us assume that the learning algorithm only adds independent conditions that have a probability of $0.8$, and we compare a 3-condition rule to a 2-condition rule. Averaging would evaluate both rules equally, because both have an average probability of $0.8$. A correct computation of the joint probability, however, shows that the longer rule is considerably less likely ($0.8^3 = 0.512$ vs. $0.8^2 = 0.64$, because all conditions are assumed to be independent).
Averaging can also affect same-length rules. Fantino et al. (1997) derive from their experiments on the averaging heuristic that humans tend to judge “unlikely information [to be] relatively more important than likely information.” Continuing our example, if we compare the above 2-condition rule with another rule with two features with more diverse probability values, e.g., one condition has $1.0$ and the other has $0.6$, then averaging would again evaluate both rules the same, but in fact the correct interpretation would be that the rule with equal probabilities is more likely than the other ($0.8 \times 0.8 = 0.64 > 1.0 \times 0.6 = 0.6$). In this case, the low 0.6 probability in the new rule would “knock down” the normative conjoint probability below the one of the rule with two 0.8 conditions.
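The contrast between averaging and the normative joint probability can be made concrete with a short calculation. The sketch below is only an illustration: it assumes independent conditions with probabilities of 0.8 for the equal-probability rules and 1.0 and 0.6 for the mixed rule, and the function names are hypothetical.

```python
from math import prod


def averaged(probabilities):
    """Averaging heuristic: mean of the component probabilities."""
    return sum(probabilities) / len(probabilities)


def joint(probabilities):
    """Normative joint probability, assuming independent conditions."""
    return prod(probabilities)


# Illustrative rules; the probability values are assumptions for this example.
rules = {
    "three conditions at 0.8": [0.8, 0.8, 0.8],
    "two conditions at 0.8": [0.8, 0.8],
    "conditions at 1.0 and 0.6": [1.0, 0.6],
}

for name, probs in rules.items():
    print(f"{name}: averaged={averaged(probs):.3f}, joint={joint(probs):.3f}")
```

All three rules receive the same averaged value, while the joint probability separates them, which is exactly the information an analyst relying on averaging would miss.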
Experiments conducted by Zizzo et al. (2000) showed that prior knowledge of probability theory, and a direct reminder of how probabilities are combined, are effective tools for decreasing the incidence of conjunction fallacy, which is the hypothesized consequence of the averaging heuristic. A specific countermeasure for the biases caused by linear additive integration (weighted averaging) is the use of logarithm formats. Experiments conducted by Juslin et al. (2011) show that recasting probability computation in terms of logarithm formats, thus requiring additive rather than multiplicative integration, improves probabilistic reasoning.
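The arithmetic behind a logarithm format can be sketched as follows. This is not the presentation format used by Juslin et al. (2011), whose details are not given here; it only illustrates, under the assumption of independent conditions with non-zero probabilities, that adding logarithms yields the same result as multiplying probabilities. The function name is hypothetical.

```python
import math


def joint_probability_via_logs(probabilities):
    """Combine independent probabilities by adding their base-2 logarithms.

    Returns the individual log terms, their sum, and the joint probability
    recovered from that sum. All probabilities must be greater than zero.
    """
    log_terms = [math.log2(p) for p in probabilities]
    total_log = sum(log_terms)
    return log_terms, total_log, 2 ** total_log


log_terms, total_log, joint = joint_probability_via_logs([0.8, 0.8])
print(log_terms, total_log, round(joint, 4))
```

In such a format, each additional condition visibly lowers the (negative) total, so the additive integration that people apply intuitively leads to the normatively correct ordering of rules.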
5.4 Disjunction Fallacy
The disjunction fallacy refers to a judgment that is inconsistent with the disjunction rule, which states that the probability $P(X)$ cannot be higher than the probability $P(Z)$, where $Z = X \cup Y$ is a union of event $X$ with another event $Y$.
In experiments reported by Bar-Hillel and Neter (1993), $X$ and $Z$ were nested pairs of categories, such as Switzerland and Europe. Subjects read descriptions of people such as: “Writes letter home describing a country with snowy wild mountains, clean streets, and flower decked porches. Where was the letter written?” Since Europe contains Switzerland, Europe must be at least as likely as Switzerland. However, Switzerland was chosen as the more likely place by about 75% of participants (Bar-Hillel and Neter, 1993).
The disjunction fallacy is considered as another consequence of the representativeness heuristic (Bar-Hillel and Neter, 1993): “Which of two events—even nested events—will seem more probable is better predicted by their representativeness than by their scope, or by the level in the category hierarchy in which they are located.”
The description in the example is more representative of Switzerland than of Europe, so when people use representativeness as the basis for their judgment, they judge Switzerland to be a more likely answer than Europe even though this judgment breaks the disjunction rule.
Implications for rule learning
In the context of data mining, it can be the case that the feature space is hierarchically ordered. The analyst can thus be confronted with rules containing attributes (literals) on multiple levels of granularity. Following the disjunction fallacy, the analyst will generally prefer rules containing more specific attributes, which can result in preference for rules with fewer backing instances and thus weaker statistical validity.
When asked to assign categories to concepts (such as land of origin of a letter) under conditions of certainty, people are known to prefer a specific category to a more general category that subsumes it, but only if the specific category is considered representative (Bar-Hillel and Neter, 1993): “whenever an ordering of events by representativeness differs from their ordering by set inclusion, there is a potential for an extension fallacy to occur.” From this observation a possible debiasing strategy emerges: making the analysts aware of the taxonomical relation of the individual attributes and their values. For example, the user interface can work with the information that Europe contains Switzerland, possibly actively notifying the analyst of the risk of the disjunction fallacy. This intervention can be complemented by “training in rules” (Larrick, 2004). In this case, the benefits of the larger supporting sample associated with more general attributes should be explained to the analysts.
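A taxonomy-aware notification of this kind could be sketched as follows. The `PARENT` mapping, the function names and the warning text are hypothetical and serve only as an illustration; an actual interface would obtain the taxonomy from the metadata of the dataset.

```python
# Hypothetical value taxonomy: each value maps to the more general
# category that directly contains it (assumed to be acyclic).
PARENT = {"Switzerland": "Europe", "Austria": "Europe", "Europe": "World"}


def ancestors(value, parent=PARENT):
    """Return all more general categories that subsume `value`."""
    chain = []
    while value in parent:
        value = parent[value]
        chain.append(value)
    return chain


def disjunction_warnings(values_a, values_b, parent=PARENT):
    """Warn about value pairs from two rules where one value subsumes the other."""
    warnings = []
    for a in values_a:
        for b in values_b:
            if b in ancestors(a, parent):
                specific, general = a, b
            elif a in ancestors(b, parent):
                specific, general = b, a
            else:
                continue  # unrelated values, nothing to report
            warnings.append(
                f"'{general}' contains '{specific}': a condition on "
                f"'{general}' cannot match fewer instances than one on '{specific}'."
            )
    return warnings


print(disjunction_warnings(["Switzerland"], ["Europe"]))
```

The check is deliberately limited to pairs of attribute values; whether and how prominently such a warning should be shown is a design decision for the user interface.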
5.5 Base-rate Neglect
People tend to underweight the evidence provided by base rates, which results in the so-called base-rate neglect. For example, Kahneman and Tversky (1973) gave participants a description of a person who was selected randomly from a group and asked them whether the person is an engineer or a lawyer. Participants based their judgment mostly on the description of the person and paid little consideration to the occupational composition of the group, which was also given to them as a part of the task, even though the composition should play a significant role in the judgment.
Kahneman and Tversky (1973) view the base-rate neglect as a possible consequence of the representativeness heuristic (Kahneman and Tversky, 1972). When people base their judgment of an occupation of a person mostly on similarity of the person to a prototypical member of the occupation, they ignore other relevant information such as base rates, which results in the base-rate neglect.
Implications for rule learning
Base-rate neglect suggests that when facing two otherwise identical rules with different values of confidence and support metrics, an analyst’s preferences will be primarily shaped by the confidence of the rule. Support corresponds to the “base rate”, which is sometimes almost completely ignored (Kahneman and Tversky, 1973).
It follows that by increasing preference for higher confidence, base-rate neglect will generally contribute to a positive correlation between rule length and plausibility, since longer rules can better adapt to a particular group in data and thus have a higher confidence than more general, shorter rules. This is in contrast to the general bias for simple rules that is implemented by state-of-the-art rule learning algorithms, because simple rules tend to be more general, have a higher support, and are thus statistically more reliable.
Gigerenzer and Hoffrage (1995) show that representations in terms of natural frequencies, rather than conditional probabilities, facilitate the computation of a cause’s probability. To the authors’ knowledge, confidence is typically presented as a percentage in current software systems. The support rule quality metric is sometimes presented as a percentage and sometimes as a natural number. It would foster correct understanding if analysts were consistently presented with natural frequencies in addition to percentages.
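A presentation that combines percentages with natural frequencies could look like the following sketch. The function name, its parameters and the wording of the output are hypothetical; support is assumed here to be the share of instances in the dataset that match both the antecedent and the consequent of the rule.

```python
def describe_rule(covered, correct, dataset_size):
    """Describe rule quality with percentages and natural frequencies.

    covered      -- number of instances matching the antecedent
    correct      -- number of covered instances also matching the consequent
    dataset_size -- total number of instances in the dataset
    """
    if covered <= 0 or dataset_size <= 0:
        raise ValueError("covered and dataset_size must be positive")
    confidence = correct / covered
    support = correct / dataset_size  # assumed definition, see lead-in
    return (
        f"confidence {confidence:.0%} "
        f"({correct} of {covered} instances matching the antecedent), "
        f"support {support:.1%} "
        f"({correct} of {dataset_size} instances in the dataset)"
    )


print(describe_rule(covered=50, correct=40, dataset_size=1000))
```

Showing the counts next to the percentages keeps the size of the underlying sample visible even when the analyst focuses on confidence.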
5.6 Insensitivity to Sample Size
People tend to underestimate the increased benefit of higher robustness of estimates that are made on a larger sample, which is called insensitivity to sample size. The effect can be illustrated by the so-called hospital problem. In this problem, subjects are asked which hospital is more likely to record more days in which more than 60 percent of the newborns are boys. The options are a larger hospital, a smaller hospital, or both hospitals with about the same probability. The correct answer, the smaller hospital, was chosen by only 22% of participants in an experiment reported by Tversky and Kahneman (1974). Insensitivity to sample size may be another bias resulting from use of the representativeness heuristic (Kahneman and Tversky, 1972). When people use the representativeness heuristic, they compare the proportion of newborns who are boys to the proportion expected in the population, ignoring other relevant information. Since the proportion is similarly representative of the whole population for both hospitals, most of the participants believed that both hospitals are equally likely to record days in which more than 60 percent of the newborns are boys (Tversky and Kahneman, 1974).
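The normative answer to the hospital problem can be checked with an exact binomial computation. The sketch below assumes, purely for illustration, 15 births per day in the smaller hospital, 45 births per day in the larger one, and a probability of 0.5 that a newborn is a boy; these figures are not given in the text above.

```python
from math import comb


def prob_more_than_60_percent_boys(births, p_boy=0.5):
    """Exact probability that strictly more than 60% of `births` are boys."""
    return sum(
        comb(births, k) * p_boy**k * (1 - p_boy) ** (births - k)
        for k in range(births + 1)
        if 10 * k > 6 * births  # integer form of k / births > 0.6
    )


small = prob_more_than_60_percent_boys(15)  # assumed smaller hospital
large = prob_more_than_60_percent_boys(45)  # assumed larger hospital
print(f"smaller hospital: {small:.3f}, larger hospital: {large:.3f}")
```

Under these assumptions the smaller hospital has the higher probability of such a day, because proportions estimated from fewer births deviate from 50% more often.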
Implications for rule learning
This effect implies that analysts may be unable to appreciate the increased reliability of the confidence estimate with increasing value of support, i.e., they may fail to appreciate that the strength of the connection between antecedent and consequent of a rule rises with an increasing number of observations. If confronted with two rules, where one has a slightly higher confidence and the other a higher support, this cognitive bias suggests that the analyst will prefer the rule with higher confidence (all other factors being equal).
In the context of this bias, it is important to realize that population size is statistically irrelevant for determination of sample size for large populations (Kachelmeier and Messier Jr, 1990). However, previous research (Bar-Hillel, 1979) has shown that the perceived sample accuracy can incorrectly depend on the sample-to-population ratio rather than on the absolute sample size. For a small population, a 10% sample can be considered more reliable than a 1% sample drawn from a much larger population.
This observation has substantial consequences for the presentation of rule learning results. The support of a rule is typically presented as a percentage of the dataset size. Assuming that support corresponds to sample size in the research of Bar-Hillel (1979) and the number of instances in the dataset to population size, it follows that the presentation of support as a percentage (relative support) induces the insensitivity to sample size effect. The recommended alternative is to present support as an absolute number (absolute support).
There have been successful experiments with providing decision aids to overcome the insensitivity to sample size bias. In particular, Kachelmeier and Messier Jr (1990) experimented with providing auditors with a formula for computing the appropriate sample size for substantive tests of details based on the description of a case and the tolerable error. Provision of the aid resulted in larger sample sizes being selected by the auditors in comparison to intuitive judgment without the aid. Just as the auditor can choose the sample size, a user of an association rule learning algorithm can specify the minimum support threshold. To leverage the debiasing strategy validated by Kachelmeier and Messier Jr (1990), the rule learning interface should also inform the user of the effects of the chosen support threshold on the accuracy of the confidence estimate of the resulting rules. For algorithms and workflows where the user cannot influence the support of a discovered rule, relevant information should be available as a part of the rule learning results. In particular, the value of rule support can be used to compute a confidence (accuracy) interval for the value of confidence. Such supplementary information is already provided by Bayesian Decision List (Letham et al., 2015), a recently proposed algorithmic framework positively evaluated with respect to interpretability, for example by De Laat (2017).
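One way to derive such an interval is sketched below. It uses the Wilson score interval for a binomial proportion, which is a frequentist substitute chosen here for illustration and not the Bayesian procedure of Letham et al. (2015). The function name and the default of z = 1.96 (an approximately 95% interval) are assumptions.

```python
from math import sqrt


def wilson_interval(correct, covered, z=1.96):
    """Wilson score interval for rule confidence.

    correct -- covered instances for which the consequent also holds
    covered -- instances matching the antecedent of the rule
    z       -- normal quantile (1.96 gives an approximately 95% interval)
    """
    if covered <= 0:
        raise ValueError("covered must be positive")
    p = correct / covered
    denom = 1 + z**2 / covered
    centre = (p + z**2 / (2 * covered)) / denom
    half_width = z * sqrt(p * (1 - p) / covered + z**2 / (4 * covered**2)) / denom
    return centre - half_width, centre + half_width


# Same confidence of 0.8, estimated from 10 and from 1000 covered instances.
print(wilson_interval(8, 10))
print(wilson_interval(800, 1000))
```

Both example rules have the same confidence, but the interval for the rule backed by 1000 instances is much narrower, which makes the effect of the sample size directly visible to the analyst.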
5.7 Confirmation Bias and Positive Test Strategy
Confirmation bias refers to the notion that people tend to look for evidence supporting the current hypothesis, disregarding conflicting evidence. According to Evans (1989, p. 552) confirmation bias is “the best known and most widely accepted notion of inferential error of human reasoning.” (Footnote 4: Cited according to Nickerson (1998).)
Research suggests that even neutral or unfavorable evidence can be interpreted to support existing beliefs, or, as Trope et al. (1997, p. 115-116) put it, “the same evidence can be constructed and reconstructed in different and even opposite ways, depending on the perceiver’s hypothesis.”
A closely related phenomenon is the Positive Test Strategy (PTS) described by Klayman and Ha (1987). This reasoning strategy suggests that when trying to test a specific hypothesis, people examine cases which they expect to confirm the hypothesis rather than the cases which have the best chance of falsifying it. The difference between PTS and confirmation bias is that PTS is applied to test a candidate hypothesis while confirmation bias is concerned with hypotheses that are already established (Pohl, 2004, p. 93). The experimental results of Klayman and Ha (1987) show that under realistic conditions, PTS can be a very good heuristic for determining whether a hypothesis is true or false, but it can also lead to systematic errors if applied to an inappropriate task.
Implications for rule learning
This bias can have significant impact depending on the purpose for which the rule learning results are used. If the analyst has some prior hypothesis before she obtains the rule learning results, according to the confirmation bias she will tend to “cherry pick” rules confirming this prior hypothesis and disregard rules that contradict it. Given that some rule learners may output contradicting rules, the analyst can select only the rules conforming to the hypothesis, disregarding applicable rules with the opposite conclusion, which could otherwise turn out to be more relevant.
Delaying final judgment and slowing down work has been found to decrease confirmation bias in several studies (Spengler et al., 1995; Parmley, 2006). User interfaces for rule learning should thus give the user not only the opportunity to save or mark interesting rules, but also allow the user to review and edit the model at a later point in time. An example rule learning system with this specific functionality is EasyMiner (Vojíř et al., 2018).
Wolfe and Britt (2008) successfully experimented with providing subjects with explicit guidelines for considering evidence both for and against a hypothesis. Provision of “balanced” instructions to search for evidence for and against a given hypothesis reduced the incidence of myside bias from the 50% exhibited by the control group to a significantly lower 27.5%.
Similarly, providing explicit guidance combined with modifications of the user interface of the system presenting the rule learning results could also be considered. The assumption that educating users about cognitive illusions can be an effective debiasing technique for positive test strategy has been empirically validated on a cohort of adolescents by Barberia et al. (2013).
5.8 Availability Heuristic
The availability heuristic is a judgmental heuristic in which a person evaluates the frequency of classes or the probability of events by the ease with which relevant instances come to mind. This heuristic is explained by its discoverers, Tversky and Kahneman (1973), as follows: “That associative bonds are strengthened by repetition is perhaps the oldest law of memory known to man. The availability heuristic exploits the inverse form of this law, that is, it uses the strength of the association as a basis for the judgment of frequency.” The availability heuristic is not itself a bias, but it may lead to biased judgments when availability is not a valid cue.
In one of the original experiments, participants were asked whether the letter “R” appears more frequently in the first or third position in English texts (Tversky and Kahneman, 1973). About 70% of participants answered incorrectly that it appears more frequently in the first position, presumably because they estimated the frequency by recalling words containing “R” and it is easier to recall words starting with R than words with R in the third position.
To determine availability, it is sufficient to assess the ease with which instances or associations could be brought to mind – it is not necessary to perform the actual operations of retrieval or construction (Schwarz et al., 1991). An illustration of this phenomenon by Tversky and Kahneman (1973) is: “One may estimate the probability that a politician will lose an election by considering the various ways he may lose support.”
Implications for rule learning
An application of the availability heuristic in rule learning would be based on the ease of recollection of instances (examples) matching the complete rule (all conditions and consequent) by the analyst. Rules containing conditions for which instances can be easily recalled would be found more plausible compared to rules not containing such conditions. As an example, consider the rule pair
Rule 1: IF latitude 44.189 AND longitude 6.3333 AND longitude 1.8397 THEN Unemployment is high
Rule 2: IF population 5 million THEN Unemployment is high.
It is arguably easier to recall specific countries matching the second rule, than countries matching the conditions of the first rule.
It is conceivable that the availability heuristic could be applied also in the case when the easily recalled instances match only some of the conditions in the antecedent of the rule, such as only latitude in the example above. The remaining conditions would be ignored.
Several studies have found that people use ease of recollection in judgment only when they cannot attribute it to a source that should not influence their judgment (Schwarz, 2004). Alerting an analyst to the reason why instances matching the conditions in the rule under consideration are easily recalled should therefore reduce the impact of the availability heuristic as long as the reason is deemed irrelevant to the task at hand.
5.9 Reiteration Effect, Effects of Validity and Illusory Truth
The reiteration effect refers to the observation that repeated statements tend to become more believable. For example, in one experiment Hasher et al. (1977) presented subjects with general statements and asked them to assess their validity. Some of the statements were false and some were true. The experiment was conducted in several sessions, where some of the statements were repeated in subsequent sessions. The average perceived validity of both true and false repeated statements rose between the sessions, while for non-repeated statements it dropped slightly.
The effect is usually explained by use of processing fluency in judgment. Statements that are processed fluently (easily) tend to be judged as true and repetition makes processing easier. A recent alternative account argues that repetition makes the referents of statements more coherent and people judge truth based on coherency (Unkelbach and Rom, 2017).
The reiteration effect is also known under other labels, such as “frequency-validity” and “illusory truth” (Hertwig et al., 1997, p. 195). However, some research suggests that these are not identical phenomena.
For example, the truth effect “disappears when the actual truth status is known” (Pohl, 2017, p. 253), which does not hold for validity effect in general. There is also a clear distinction between the effects covered here, and the mere exposure effect covered in Section 5.10: the truth effect has been found largely independent of duration of stimulus exposure (Dechêne et al., 2010, p. 245).
Implications for rule learning
In the rule learning context, a repeating statement which becomes more believable corresponds to the entire rule or possibly a “subrule” consisting of the consequent of the rule and a subset of conditions in its antecedent. A typical rule learning result contains multiple rules that are substantially overlapping. If the analyst is exposed to multiple similar statements, the reiteration effect will increase the analyst’s belief in the repeating subrule. Especially in the area of association rule learning, a very large set of redundant rules, covering the same or nearly the same set of examples, is routinely included in the output.
Schwarz et al. (2007) suggest that a mere 30 minutes of delay can be enough for information originally seen as negative to have a positive influence. Applying this to a data exploration task, consider an analyst who is presented with a large number of “weak” rules corresponding to highly speculative patterns in the data. Even if the analyst rejects a rule (for example based on the presented metrics, pre-existing domain knowledge or common sense), the validity and truth effects will make the analyst more prone to accept a similar rule later.
The reiteration effect can be suppressed already on the algorithmic level by ensuring that the rule learning output does not contain redundant rules. This can be achieved by pruning algorithms (Fürnkranz, 1997). Another possible technique is presenting the result of rule learning in several layers, where only clusters of rules (“rule covers”) summarizing multiple subrules are presented at first (Ordonez et al., 2006). The user can expand the cluster to obtain more similar rules. A more recent algorithm that can be used for summarizing multiple rules is the meta-learning method proposed by Berka (2018).
Several lessons can be learnt from Hess and Hagen (2006), who studied the role of the reiteration effect in the spreading of gossip. Interestingly, already simple reiteration was found to increase gossip veracity, but only for those who found the gossip relatively uninteresting. Multiple sources of gossip were found to increase its veracity, especially when these sources were independent. Information that explained the gossip by providing a benign interpretation decreased the veracity of the gossip. These findings suggest, first, that it is important to explain to the analyst which rules share the same source, i.e. what is the overlap in their coverage in terms of specific instances. Second, explanations can be improved by utilisation of recently proposed techniques that use domain knowledge to filter or explain rules, such as expert deduction rules proposed by Rauch (2018).
The research related to debiasing validity and truth effects has been largely centered around the problem of debunking various forms of misinformation, cf. e.g. (Schwarz et al., 2007; Lewandowsky et al., 2012; Ecker et al., 2017). The currently widely accepted recommendation is that to correct misinformation, it is better to retract it directly, i.e., to repeat the misinformation along with arguments against it (Lewandowsky et al., 2012; Ecker et al., 2017). This can be applied, for example, in incremental machine learning settings, when the results of learning are revised when new data arrive, or when mining with formalized domain knowledge. Generally, when the system knows that the analyst was previously presented with a rule (a hypothesis) that has been falsified according to the current state of knowledge, the system can explicitly notify the analyst, listing the rule in question and explaining why it does not hold.
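The overlap in coverage between rules can be quantified, for example, with the Jaccard similarity of the sets of instances they cover. The following sketch is one possible way to flag rules that largely share the same “source”; the function names, the threshold of 0.8 and the toy coverage sets are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets of covered instance identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def overlapping_rule_pairs(coverage, threshold=0.8):
    """Return (rule, rule, similarity) for pairs with strongly overlapping coverage."""
    pairs = []
    for (name_a, ids_a), (name_b, ids_b) in combinations(coverage.items(), 2):
        similarity = jaccard(ids_a, ids_b)
        if similarity >= threshold:
            pairs.append((name_a, name_b, similarity))
    return pairs


# Toy example: identifiers of the instances covered by each rule.
coverage = {
    "rule_1": {1, 2, 3, 4, 5},
    "rule_2": {1, 2, 3, 4, 5, 6},
    "rule_3": {7, 8, 9},
}
print(overlapping_rule_pairs(coverage))
```

Pairs reported by such a check could be grouped in the interface or annotated with a note that they are backed by nearly the same instances and therefore do not constitute independent evidence.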
5.10 Mere Exposure Effect
According to the mere exposure effect, repeated exposure to an object results in an increased preference (liking, affect) for that object. When a concrete stimulus is repeatedly exposed, the preference for that stimulus increases logarithmically as a function of the number of exposures (Bornstein, 1989). The size of the mere exposure effect also depends on whether the stimulus the subject is exposed to is exactly the same as in prior exposure or only similar to it (Monahan et al., 2000); identical stimuli are associated with a larger mere exposure effect. The mere exposure effect is another consequence of increased fluency of processing associated with repeated exposure (cf. Section 5.9) (Winkielman et al., 2003). While the reiteration effect referred to the use of processing fluency in judgment of truth, the mere exposure effect relates to the positive feeling that is associated with fluent processing.
Exposure durations below 1 second produce the strongest effects; with increasing time of exposure the effect drops, and repeating exposures decrease the mere exposure effect. The liking induced by the effect drops more quickly with increasing exposures when the presented stimulus is simple (e.g. an ideogram) as opposed to complex (e.g. a photograph) (Bornstein, 1989). A recent meta-analysis suggests that there is an inverted-U shaped relation between exposure and affect (Montoya et al., 2017).
Implications for rule learning
The extent to which the mere exposure effect can affect interpretation of rule learning results is limited by the fact that its magnitude decreases with extended exposure to the stimuli. It can be expected that the analysts inspect the rule learning results for a much longer period of time than the 1 second below which exposure results in the strongest effects (Bornstein, 1989). However, it is not unusual for rule-based models to be composed of several thousand rules (Alcala-Fdez et al., 2011). When the user scrolls through a list of rules, each rule can be shown only for a fraction of a second. The analyst is not aware of having seen the rule, yet the rule can influence the analyst’s judgment through the mere exposure effect.
The mere exposure effect can also play a role when rules from the text mining or sentiment analysis domains are interpreted. The initial research on the mere exposure effect by Zajonc (1968) included experimental evidence on the positive correlation between word frequency and affective connotation of the word. From this it follows that a rule containing frequently occurring words can induce the mere exposure effect.
While there is a considerable body of research focusing on the mere exposure effect, our literature review did not surface any directly applicable debiasing techniques. Only recently, Becker and Rinck (2016) reported the first reversal of the mere exposure effect. This was achieved by presenting threatening materials (spider pictures) to people fearful of spiders in an unpleasant detection situation. This result, although interesting, is difficult to transpose to the domain of rules.
Nevertheless, there are some conditions known to decrease the mere exposure effect that can be utilized in machine learning interfaces. The effect is strongest for repeated, “flash-like” presentation of information. A possible workaround is avoiding the subliminal exposure completely, by changing the mode of operation of the corresponding user interfaces, for example to search as proposed by Škrabal et al. (2012).
5.11 Overconfidence and Underconfidence
A decision maker’s judgment is normally associated with the belief that the judgment is true, i.e., confidence in the judgment. Griffin and Tversky (1992) argue that confidence in judgment is based on a combination of the strength of evidence and its weight (credibility). According to their studies, people tend to combine strength with weight in suboptimal ways, resulting in the decision maker being more or less confident about the hypothesis at hand than would be normatively appropriate given the information available. This discrepancy between the normative confidence and the decision maker’s confidence is called overconfidence or underconfidence.
People use the provided data to assess a hypothesis, but they insufficiently regard the quality of the data. Griffin and Tversky (1992) describe this manifestation of bounded rationality as follows: “If people focus primarily on the warmth of the recommendation with insufficient regard for the credibility of the writer, or the correlation between the predictor and the criterion, they will be overconfident when they encounter a glowing letter based on casual contact, and they will be underconfident when they encounter a moderately positive letter from a highly knowledgeable source.”
Implications for rule learning
Research has revealed systematic patterns of overconfidence and underconfidence (Griffin and Tversky, 1992, p. 426): If the estimated difference between two hypotheses is large, it is easy to say which one is better and there is a pattern of underconfidence. As the degree of difficulty rises (the difference between the normative confidence of two competing hypotheses is decreasing), there is an increasing pattern of overconfidence.
The strongest overconfidence was recorded for problems where the weight of evidence is low and the strength of evidence is high. This directly applies to rules with high value of confidence and low value of support. The empirical results related to the effect of difficulty therefore suggest that the predictive ability of such rules will be substantially overrated by analysts. This is particularly interesting because rule learning algorithms often suffer from a tendency to unduly prefer overly specific rules that have a high confidence on small parts of the data to more general rules that have a somewhat lower confidence, a phenomenon also known as overfitting. The above-mentioned results seem to indicate that humans suffer from a similar problem (albeit presumably for different reasons), which, e.g., implies that a human-in-the-loop solution may not alleviate this problem.
Research applicable to debiasing of overconfidence originated in the 1950s; nevertheless, most initial efforts to reduce overconfidence failed (Fischoff, 1981; Arkes et al., 1987). Some recent research focuses on the hypothesis that the feeling of confidence reflects factors indirectly related to choice processes (Fleisig, 2011; Hall et al., 2007). For example, in a sports betting experiment performed by Hall et al. (2007), participants underweighted statistical cues while betting when they knew the names of players. This research leads to the conclusion that “more knowledge can decrease accuracy and simultaneously increase prediction confidence” (Hall et al., 2007). Applying this to debiasing in the rule learning context, presenting less information can be achieved by reducing the number of rules and removing some conditions in the remaining rules. This can be achieved by a number of methods, ranging from feature selection to an external setting of the maximum antecedent length, which is permitted by some algorithms. Also, rules and conditions that do not pass a statistical significance test can be removed from the output.
As with other biases, research on debiasing overconfidence points at the importance of educating the experts on principles of subjective probability judgment and the associated biases (Clemen and Lichtendahl, 2002). Shafir (2013, p. 487) recommends debiasing overconfidence (in policy making) by making the subject hear both sides of an argument. In the rule learning context, this would correspond to the user interface also making easily accessible rules and knowledge that are in an “unexpectedness” or “exception” relation with the rule in question, as experimented with, e.g., in frameworks for postprocessing association rule learning results (Kliegr et al., 2011).
5.12 Recognition Heuristic
Pachur et al. (2011) define the recognition heuristic as follows: “For two-alternative choice tasks, where one has to decide which of two objects scores higher on a criterion, the heuristic can be stated as follows: If one object is recognized, but not the other, then infer that the recognized object has a higher value on the criterion.” In contrast with the availability heuristic, which is based on ease of recall, the recognition heuristic is based only on the fact that a given object is recognized. The two heuristics could be combined. When only one object in a pair is recognized, then the recognition heuristic would be used for judgment. If both objects are recognized, then the speed of the recognition could be used (Hertwig et al., 2008).
The use of this heuristic can be seen in an experiment performed by Goldstein and Gigerenzer (1999), which focused on estimating which of two cities in a presented pair is more populated. People using the recognition heuristic would say that the city they recognize is more populated. The median proportion of judgments complying with the recognition heuristic was 93%.
Implications for rule learning
The recognition heuristic can manifest itself by preference for rules containing a recognized attribute name or value in the antecedent of the rule.
Analysts processing rule learning results are typically presented many rules, contributing to time pressure. This can further increase the impact of the recognition heuristic.
Empirical results reported by Michalkiewicz et al. (2018) indicate that intelligent people use the recognition heuristic more when it is successful and less when it is not. The work of Pohl et al. (2017) shows that people adapt their decision strategy with respect to the more general environment rather than the specific items they are faced with. Considering that the application of the recognition heuristic can in some situations lead to better results than the use of available knowledge, the recognition heuristic may not necessarily have overly negative impacts on the interpretation of rule learning results.
Under time pressure people assign a higher value to recognized objects than to unrecognized objects. This happens also in situations when recognition is a poor cue (Pachur and Hertwig, 2006). Changes to user interfaces that induce “slowing down” could thus help to address this bias. As to the alleviation of effects of recognition heuristic in situations where it is ecologically unsuitable, Pachur and Hertwig (2006) note that suspension of the heuristic requires additional time or direct knowledge of the “criterion variable”. In typical real-world machine learning tasks the data can include a high number of attributes that even subject-matter experts are not acquainted with in detail. When these recognized – but not understood – attributes are present in the rule model, even the experts are liable to the recognition heuristic. When information on the meaning of individual attributes and literals is made easily accessible, we conjecture that the application of the recognition heuristic can be suppressed.
5.13 Information Bias
Information bias refers to the tendency to seek more information to improve the perceived validity of a statement even if the additional information is not relevant or helpful. The typical manifestation of the information bias is evaluating questions as worth asking even when the answer cannot affect the hypothesis that will be accepted (Baron et al., 1988).
For example, Baron et al. (1988) asked subjects to assess to what degree a medical test is suitable for deciding which of three diseases to treat. The test detected a chemical, which was with a certain probability associated with each of the three diseases. These probabilities varied across the cases. Even though in some of the cases an outcome of the test would not change the most likely disease and thus the treatment, people tended to judge the test as worth doing. While information bias is primarily researched in the context of information acquisition (Nelson et al., 2010; Nelson, 2005), some scientists interpret this more generally as judging features with zero probability gain as useful, having potential to change one’s belief (Nelson, 2008, p. 158).
Implications for rule learning
Many rule learning algorithms allow the user to select the size of the generated model – in terms of the number of rules that will be presented, as well as by setting the maximum length of conditions of the generated rules. Either as part of the feature selection, or when defining constraints for the learning, the users decide which attributes are relevant. These can then appear among conditions of the discovered rules.
According to the information bias, people will be prone to set up the task so that they receive more information, resulting in a larger rule list with longer rules containing attributes with little information value.
It is unclear whether the information bias also applies to the case when the user is readily presented with more information, rather than given the possibility to request more information. Given the proximity of these two scenarios, we conjecture that information bias (or some related bias) will make people prefer more information to less, even if it is obviously not relevant.
According to the information bias, a rule containing additional (redundant) condition may be preferred to a rule not containing this condition.
While informing people about the diagnosticity of considered questions does not completely remove the information bias, it reduces it (Baron et al., 1988). To this end, communicating attribute importance can help guide the analyst in the task definition phase.
Although existing algorithms and systems already provide ways for determining the importance of individual rules, for example via values of confidence, support, and lift, the cues on the importance of individual conditions in rule antecedent are typically not provided. While feature importance is computed within many learning algorithms, it is often used only internally. Exposing this information to the user can help counter the information bias.
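One simple way to expose the importance of individual conditions is to report how much the confidence of a rule drops when a condition is left out. The following is only a minimal sketch under that assumption: the leave-one-out measure, the toy data, and all names (`covers`, `confidence`, `condition_importance`) are illustrative and not taken from any particular rule learner.

```python
from typing import Dict, List

Row = Dict[str, str]

def covers(antecedent: Dict[str, str], row: Row) -> bool:
    """True if the row satisfies every attribute=value condition."""
    return all(row.get(attr) == val for attr, val in antecedent.items())

def confidence(antecedent: Dict[str, str], target: str, label: str,
               data: List[Row]) -> float:
    """Share of covered rows whose target attribute equals the predicted label."""
    covered = [row for row in data if covers(antecedent, row)]
    if not covered:
        return 0.0
    return sum(row[target] == label for row in covered) / len(covered)

def condition_importance(antecedent: Dict[str, str], target: str, label: str,
                         data: List[Row]) -> Dict[str, float]:
    """Confidence lost when each condition is removed (leave-one-out)."""
    full = confidence(antecedent, target, label, data)
    importance = {}
    for attr in antecedent:
        reduced = {a: v for a, v in antecedent.items() if a != attr}
        importance[attr] = full - confidence(reduced, target, label, data)
    return importance

# Hypothetical toy data: "city" adds nothing once "smoker" is known.
data = [
    {"smoker": "yes", "city": "A", "risk": "high"},
    {"smoker": "yes", "city": "A", "risk": "high"},
    {"smoker": "yes", "city": "B", "risk": "high"},
    {"smoker": "yes", "city": "B", "risk": "high"},
    {"smoker": "no", "city": "A", "risk": "low"},
    {"smoker": "no", "city": "A", "risk": "low"},
    {"smoker": "no", "city": "B", "risk": "low"},
    {"smoker": "no", "city": "B", "risk": "high"},
]

# Rule: IF smoker=yes AND city=A THEN risk=high
importance = condition_importance({"smoker": "yes", "city": "A"}, "risk", "high", data)
```

In this toy setting, a condition with an importance of zero is a candidate for being shown as redundant (or de-emphasized) in the user interface.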
5.14 Ambiguity Aversion
Ambiguity aversion refers to the tendency to prefer known risks over unknown risks. This is often illustrated by the Ellsberg paradox (Ellsberg, 1961), which shows that humans tend to systematically prefer a bet with known probability of winning over a bet with not precisely known probability of winning, even if it means that their choice is systematically influenced by irrelevant factors.
As argued by Camerer and Weber (1992), ambiguity aversion is related to the information bias: the demand for information in cases when it has no effect on decision can be explained by the aversion to ambiguity — people dislike having missing information.
Implications for rule learning
The ambiguity aversion may have profound implications for rule learning. The typical data mining task will contain a number of attributes the analyst has no or very limited knowledge of. The ambiguity aversion will manifest itself in a preference for rules that do not contain ambiguous conditions.
An empirically proven way to reduce ambiguity aversion is accountability – “the expectation on the side of the decision maker of having to justify her decisions to somebody else” (Vieider, 2009). This debiasing technique is hypothesized to work through higher cognitive effort that is induced by accountability.
This can be applied in the rule learning context by requiring the analysts to provide justifications for why they evaluated a specific discovered rule as interesting. Such an explanation can be textual, but it can also have a structured form. To decrease demands on the analyst, the explanation may be required only if a conflict with existing knowledge has been automatically detected, for example, using the approach proposed by Rauch (2018).
Since the application of the ambiguity aversion can partly stem from the lack of knowledge of the conditions included in the rule, it is conceivable this bias would be alleviated if description of the meaning of the conditions is made easily accessible to the analyst, as demonstrated in e.g. (Kliegr et al., 2011).
5.15 Confusion of the Inverse
This effect corresponds to confusing the probability of cause and effect, or, formally, confusing the confidence of an implication $A \rightarrow B$ with that of its inverse $B \rightarrow A$, i.e., $\Pr(B \mid A)$ is confused with the inverse probability $\Pr(A \mid B)$. For example, Villejoubert and Mandel (2002) showed in an experiment that about half of the participants estimating the probability of membership in a class gave estimates that mostly corresponded to the inverse probability.
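A small worked example may help to show how far apart the two probabilities can be for a rule with antecedent A and consequent B. The counts below are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical co-occurrence counts for antecedent A and consequent B.
n_a = 200        # instances satisfying A
n_b = 50         # instances satisfying B
n_a_and_b = 40   # instances satisfying both A and B

conf_a_to_b = Fraction(n_a_and_b, n_a)  # Pr(B | A), confidence of A -> B
conf_b_to_a = Fraction(n_a_and_b, n_b)  # Pr(A | B), confidence of B -> A

print(f"Pr(B|A) = {float(conf_a_to_b):.2f}")  # prints "Pr(B|A) = 0.20"
print(f"Pr(A|B) = {float(conf_b_to_a):.2f}")  # prints "Pr(A|B) = 0.80"
```

With these counts, an analyst who reads the rule in the wrong direction would overestimate its confidence by a factor of four.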
Implications for rule learning
The confusion of the direction of an implication sign has significant consequences on the interpretation of a rule. Already Michalski (1983) noted that there are two different kinds of rules, discriminative and characteristic. Discriminative rules can quickly discriminate an object of one category from objects of other categories. A simple example is the rule
IF trunk THEN elephant
which states that an animal with a trunk is an elephant. This implication provides a simple but effective rule for recognizing elephants among all animals.
Characteristic rules, on the other hand, try to capture all properties that are common to the objects of the target class. A rule for characterizing elephants could be
IF elephant THEN heavy, large, grey, bigEars, tusks, trunk.
Note that here the implication sign is reversed: we list all properties that are implied by the target class, i.e., by an animal being an elephant. From the point of understandability, characteristic rules are often preferable to discriminative rules. For example, in a customer profiling application, we might prefer to not only list a few characteristics that discriminate one customer group from the other, but are interested in all characteristics of each customer group.
Characteristic rules are very much related to formal concept analysis (Wille, 1982; Ganter and Wille, 1999). Informally, a concept is defined by its intent (the description of the concept, i.e., the conditions of its defining rule) and its extent (the instances that are covered by these conditions). A formal concept is then a concept where the extension and the intension are Pareto-maximal, i.e., a concept where no conditions can be added without reducing the number of covered examples. In Michalski’s terminology, a formal concept is both discriminative and characteristic, i.e., a rule where the head is equivalent to the body.
The confusion of the inverse thus seems to imply that humans will not clearly distinguish between these types of rules, and, in particular, tend to interpret an implication as an equivalence. From this, we can infer that characteristic rules, which add all possible conditions even if they do not have additional discriminative power, may be preferable to short discriminative rules.
This confusion may manifest itself strongest in the area of association rule learning, where an attribute can be of interest to the analyst both in the antecedent and consequent of a rule.
Edgell et al. (2004) studied the effect of training analysts in probability theory and concluded that it is not effective in addressing the confusion of the inverse fallacy.
Werner et al. (2018, p. 195) point at a concern regarding the use of language liable to misinterpretation in statistical textbooks teaching fundamental concepts such as independence. The authors illustrate the misinterpretation on the statement “whenever Y has no effect on X”: “This statement is used to explain that two variables, X and Y, are independent and their joint distribution is simply the product of their margins. However, for many experts, the term ’effect’ might imply a causal relationship.” From this it follows that representations of rules should strive for unambiguous meaning of the wording of the implication construct. The specific recommendations provided by Díaz et al. (2010) for teaching probability can also be considered in the next generation of textbooks aimed at the data science audience.
5.16 Context and Tradeoff Contrast Effects
People evaluate objects in relation to other available objects, which may lead to various effects of context of presentation of a choice. For example, in one of the experiments described by Tversky and Simonson (1993), subjects were asked to choose between two microwave ovens (Panasonic priced 180 USD and Emerson priced 110 USD), both a third off the regular price. The number of subjects who chose Emerson was 57% and 43% chose Panasonic. Another group of subjects was presented the same problem with the following manipulation: A more expensive Panasonic valued at 200 USD (10% off the regular price) was added to the list of possible options. The newly added device was described to look as inferior to the other Panasonic, but not to the Emerson device. After this manipulation, only 13% chose the more expensive Panasonic, but the number of subjects choosing the less expensive Panasonic rose from 43% to 60%. That is, even though the additional option was dominated by the cheaper Panasonic device and it should have been therefore irrelevant to the relative preference of the other ovens, its addition changed the preference in favor of the better Panasonic device. The experiment thus shows that selection of one of the available alternatives, such as products or job candidates, can be manipulated by addition or deletion of alternatives that are otherwise irrelevant. Tversky and Simonson (1993) attribute the tradeoff effect to the fact that “people often do not have a global preference order and, as a result, they use the context to identify the most ’attractive’ option.”
It should be noted that according to Tversky and Simonson (1993) if people have well-articulated preferences, the background context has no effect on the decision.
Implications for rule learning
The effect could be illustrated on the inter-rule comparison level. In the base scenario, constrained rule learning yields only a single rule $R$ with a given confidence value. Due to the relatively low value of confidence, the user does not find the rule very plausible. By lowering the minimum confidence threshold, multiple other rules predicting the same target class are discovered and shown to the user. These other rules, inferior to $R$, would increase the plausibility of $R$ by the tradeoff contrast effect.
Marketing professionals sometimes introduce more expensive versions of the main product, which induces the tradeoff contrast. The presence of a more expensive alternative with little added value increases sales of the main product (Simonson and Tversky, 1992). Somewhat similarly, a rule learning algorithm can have on its output rules with very high confidence, sometimes even 1.0, but very low values of support. Removal of such rules can help to debias the analysts.
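The removal of high-confidence but low-support rules can be sketched as a simple post-processing filter. This is only an illustration: the `Rule` class, the example rules, and the threshold value are hypothetical, and a suitable minimum support depends on the dataset.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Rule:
    text: str
    support: float     # relative support in [0, 1]
    confidence: float  # in [0, 1]

def drop_low_support_rules(rules: List[Rule], min_support: float) -> List[Rule]:
    """Keep only rules whose support reaches the user-chosen threshold."""
    return [r for r in rules if r.support >= min_support]

rules = [
    Rule("IF a THEN c", support=0.200, confidence=0.70),
    Rule("IF a AND b THEN c", support=0.001, confidence=1.00),  # perfect but rare
    Rule("IF d THEN c", support=0.150, confidence=0.65),
]

# The rare rule with confidence 1.0 is withheld from the analyst.
shown = drop_low_support_rules(rules, min_support=0.01)
```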
The influence of context can in some cases improve communication (Simonson and Tversky, 1992, p. 293). An attempt at making contextual attributes explicit in the rule learning context was made by Gamberger and Lavrač (2003), who introduced supporting factors as a means for complementing the explanation delivered by conventional learned rules. Essentially, supporting factors are additional attributes that are not part of the learned rule, but nevertheless have very different distributions with respect to the classes of the application domain. In line with the results of Kononenko (1993), medical experts found that these supporting factors increase the plausibility of the found rules.
5.17 Negativity Bias
According to the negativity bias, negative evidence tends to have a greater effect than neutral or positive evidence of equal intensity (Rozin and Royzman, 2001).
For example, the experiments by Pratto and John (2005) investigated whether the valence of a word (desirable or undesirable trait) has effect on the time required to identify the color in which the word appears on the screen. The results showed that the subjects took longer to name the color of an undesirable word than for a desirable word. The authors argued that the response time was higher for undesirable words because undesirable traits get more attention. Information with negative valence is given more attention partly because people seek diagnostic information, and negative information is more diagnostic (Skowronski and Carlston, 1989). Some research suggests that negative information is better memorized and subsequently recognized (Robinson-Riegler and Winton, 1996; Ohira et al., 1998).
Implications for rule learning
Putting a higher weight on negative information may in some situations be a valid heuristic. What needs to be addressed are cases when the relevant piece of information is positive and a less relevant piece of information is negative (Huber, 2010; Tversky and Kahneman, 1981). It is therefore advisable that any such suspected cases are detected in the data preprocessing phase, and the corresponding attributes or values are replaced with more neutral-sounding alternatives.
5.18 Primacy Effect
Once people form an initial assessment of plausibility (favorability) of an option, its subsequent evaluations will reflect this initial disposition.
Bond et al. (2007) investigated to what extent changing the order in which information is presented to a potential buyer affects the propensity to buy. For example, in one of the experiments, if the positive information (product description) was presented first, the number of participants indicating they would buy the product was 48%. When the negative information (price) was presented first, this number decreased to 22%. Bond et al. (2007) argue that the effect is caused by distortion of interpretation of new information in the direction of the already held opinion. The information presented first not only disproportionately influences the final opinion, but also influences the interpretation of novel information.
Implications for rule learning
Following the primacy effect, the analyst will favor rules that are presented first in the rule model. The largest negative effects of this bias are likely to occur when such ordering is not observed, for example, when rules are presented in the order in which they were discovered by a breadth-first algorithm. In this case, mental contamination is another applicable bias related to the primacy effect (or order effects in general). This refers to the case when a presented hypothesis can influence subsequent decision making by its content, even if the subject is fully aware of the fact that the presented information is purely speculative (Fitzsimons and Shiv, 2001). Note that our application scenario differs from that of Fitzsimons and Shiv (2001) and some other related research in that cognitive psychology has mostly investigated the effect of asking a hypothetical question, while we are concerned with considering the plausibility of a presented hypothesis (an inductively learned rule). Fitzsimons and Shiv (2001) found that respondents are not able to prevent the contamination effects of the hypothetical questions and that the bias increases primarily when the hypothetical question is relevant. This bias is partly attributed to the application of expectations related to conversational maxims (Gigerenzer and Hoffrage, 1999).
Three types of debiasing techniques were examined by Mumma and Wilson (1995) in the context of clinical-like judgments. The bias inoculation intervention involves direct training on the applicable bias or biases, consisting of information on the bias, strategies for adjustment, as well as completing several practical assignments. The second technique was consider-the-opposite debiasing strategy, which sorts the information according to diagnosticity before it is reviewed. The third strategy evaluated was simply taking notes when reviewing each cue before the final judgment was made. Interestingly, bias inoculation, a representative of direct debiasing techniques, was found to be the least effective. Consider-the-opposite and taking notes were found to work equally well.
To this end, a possible debiasing strategy can be founded on presenting the most relevant rules first. Similarly, the conditions within the rules can be ordered by predictive power. Some rule learning algorithms, such as CBA (Liu et al., 1998), readily take advantage of the primacy effect, since they naturally create rule models that contain rules sorted by their strength. Other algorithms order rules so that more general rules (i.e., rules that cover more examples) are presented first. This typically also corresponds to the order in which rules are learned with the commonly used separate-and-conquer or covering strategies (Fürnkranz, 1999). Simply reordering the rules output by these algorithms may not work in situations when rules compose a rule list that is automatically processed for prediction purposes. (One technique that can positively influence comprehensibility of the rule list is prepending, i.e., adding a new rule to the beginning of the previously learned rules (Webb, 1994). The intuition behind this argument is that there are often simple rules that would cover many of the positive examples, but also cover a few negative examples that have to be excluded as exceptions to the rule. Placing the simple general rule near the end of the rule list allows us to handle exceptions with rules that are placed before the general rule and keep the general rule simple.) In order to take advantage of the note-taking debiasing strategy, the user interface can support the analyst in annotating the individual rules.
Lau and Coiera (2009) provide a reason for optimism concerning the debiasing effect stemming from the proposed changes to the user interface of machine learning tools. Their paper showed a debiasing effect of similar changes implemented in a user interface to an information retrieval system used by consumers to find health information. Three versions of the system were compared: a baseline “standard” search interface; an anchor debiasing interface, which asked the users to annotate the read documents as providing evidence for, against, or neutral to the proposition in question; and an order debiasing interface, which reordered the documents to neutralize the primacy bias by creating a “counteracting order bias”. This was done by randomly reshuffling a part of the documents. When participants used the baseline and anchor debiasing interfaces, the order effect was present. On the other hand, the use of the order debiasing interface eliminated the order effect (Lau and Coiera, 2009).
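The ordering idea can be sketched as a presentation-only view that leaves the underlying rule list untouched, so that a list used for prediction keeps its original order. The classes, the per-condition `predictive_power` score, and the choice of sorting rules by confidence and then support are assumptions made for this illustration; the text above does not prescribe a particular definition of strength.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Condition:
    text: str
    predictive_power: float  # hypothetical per-condition score

@dataclass(frozen=True)
class Rule:
    conditions: Tuple[Condition, ...]
    consequent: str
    confidence: float
    support: float

def presentation_view(rules: List[Rule]) -> List[str]:
    """Return display strings; the input list (used for prediction) is not modified."""
    # Assumed notion of strength: confidence first, support as tie-breaker.
    ordered = sorted(rules, key=lambda r: (r.confidence, r.support), reverse=True)
    lines = []
    for rule in ordered:
        # Within a rule, show the most predictive condition first.
        conds = sorted(rule.conditions, key=lambda c: c.predictive_power, reverse=True)
        body = " AND ".join(c.text for c in conds)
        lines.append(f"IF {body} THEN {rule.consequent}")
    return lines

rule_list = [
    Rule((Condition("x=1", 0.1), Condition("y=2", 0.6)), "c", confidence=0.70, support=0.30),
    Rule((Condition("z=3", 0.4),), "c", confidence=0.90, support=0.10),
]
view = presentation_view(rule_list)
```

Because `sorted` returns new lists, `rule_list` itself keeps the order produced by the learner.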
5.19 Weak Evidence Effect
According to the weak evidence effect, presenting weak evidence in favor of an outcome can actually decrease the probability that a person assigns to the outcome. For example, in an experiment in the area of forensic science reported by Martire et al. (2013), it was shown that participants presented with evidence weakly supporting guilt tended to “invert” the evidence, thereby counterintuitively reducing their belief in the guilt of the accused. Fernbach et al. (2011) argue that the effect occurs because people give undue weight to the weak evidence and fail to take into account alternative evidence that more strongly favors the hypothesis at hand.
Implications for rule learning
The weak evidence effect can be directly applied to rules: the evidence is represented by the rule antecedent; the consequent corresponds to the outcome. The analyst can intuitively interpret each of the conditions in the antecedent as a piece of evidence in favor of the outcome. Typical of many machine learning problems is the uneven contribution of individual attributes to the prediction. Let us assume that the analyst is aware of the prediction strength of the individual attributes. If the analyst is to choose from a rule containing only one strong condition (predictor) and another rule containing a strong predictor and a weak (weak enough to trigger this effect) predictor, according to the weak evidence effect the analyst should choose the shorter rule with one predictor.
Martire et al. (2014) performed an empirical study aimed at evaluating what mode of communication of the strength of evidence is most resilient to the weak evidence effect. The surveyed modes of expression were numerical, verbal, a table, and a visual scale. It should be noted that the study was performed in the specific field of assessing evidence by a juror in a trial and the verbal expressions followed standards proposed by the Association of Forensic Science Providers (Willis, 2010). (These provide guidelines on the translation of numerical likelihood ratios into verbal formats; for example, a low likelihood ratio is translated as “weak or limited” and a high one as “strong”.) The results clearly suggested that numerical expressions of evidence are most suitable for expressing uncertainty.
Likelihood ratios studied by Martire et al. (2014) are conceptually close to the lift metric, used to characterize association rules. While lift is still typically presented as a number in machine learning user interfaces, there has been research towards communicating rule learning results in natural language since at least 2005 (Strossa et al., 2005). With the recent resurgence of interest in interpretable models, the use of natural language has been taken up by commercial machine learning services, such as BigML, which allows predictions to be generated via spoken questions and answers using the Amazon Alexa voice service (https://bigml.com/tools/alexa-voice). Similarly, machine learning interfaces increasingly rely on visualizations. The research on debiasing of the weak evidence effect suggests that when communicating machine learning results using modern means, such as transformation to natural language or various graphical means, care must be taken when numerical information is communicated.
Martire et al. (2014) also observe a high level of miscommunication associated with low-strength verbal expressions. In these instances, it is “appropriate to question whether expert opinions in the form of verbal likelihood ratios should be offered at all” (Martire et al., 2014). Transposing this result to the machine learning context, we suggest considering the intentional omission of weak predictors from rules, either directly by the rule learner or as part of feature selection.
5.20 Unit Bias
The unit bias refers to the tendency to give each unit similar weight while ignoring or underweighing the size of the unit (Geier et al., 2006).
Geier et al. (2006) offered people various food items in two different sizes on different days and observed how this would affect consumption of the food. They found that people ate a larger amount of food when the size of a single unit of the food item was big than when it was small. A possible explanation is that people ate one unit of food at a time without taking into account how big it was. Because the food was not consumed in larger amounts on any single occasion, but was rather eaten intermittently, the behavior led to higher consumption when a unit of food was larger.
Implications for rule learning
Unit bias has so far been studied primarily for purposes quite different from machine learning. Nevertheless, as we argue in the following, it can be very relevant for the domain of rule learning.
From a technical perspective, the number of conditions in rules is not important. What matters is the actual discriminatory power of the individual conditions, which can vary substantially. However, following the application of unit bias, people can view conditions as units of similar importance, disregarding their sometimes vastly different discriminatory and predictive power.
One of the common ways in which regulators address unhealthy food consumption patterns related to varying sizes of packaging is the introduction of mandatory labelling of the size and calorie content. By analogy with clearly communicating the size of a food item, informing analysts about the discriminatory power of the individual conditions may alleviate unit bias. Such an indicator can be generated automatically, for example, by listing the number of instances in the entire dataset that meet the condition.
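Such a per-condition "label" can be sketched as follows. The function name, the data layout (a list of attribute-value dictionaries), and the toy data are assumptions made for the example.

```python
from typing import Dict, List

Row = Dict[str, str]

def condition_coverage(antecedent: Dict[str, str], data: List[Row]) -> Dict[str, int]:
    """Number of rows in the whole dataset that satisfy each single condition."""
    return {
        f"{attr}={val}": sum(row.get(attr) == val for row in data)
        for attr, val in antecedent.items()
    }

# Hypothetical toy dataset.
data = [
    {"smoker": "yes", "city": "A"},
    {"smoker": "yes", "city": "B"},
    {"smoker": "no", "city": "A"},
    {"smoker": "no", "city": "A"},
]

# Counts that could be displayed next to each condition of the rule
# IF smoker=yes AND city=A THEN ...
labels = condition_coverage({"smoker": "yes", "city": "A"}, data)
```

Showing these counts next to the conditions makes it visible that the two "units" of the antecedent cover different shares of the data.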
6 Recommendations for Rule Learning Algorithms and Software
This section provides a concise list of considerations that aims to raise awareness among machine learning practitioners regarding the availability of measures that could potentially suppress the effect of cognitive biases on the comprehension of rule-based models. We expect part of the list to be useful also for other symbolic machine learning models, such as decision trees. In our recommendations, we focus on systems that present the rule model to a human user, whom we refer to as the analyst. We consider two basic roles the analyst can have in the process: approval of the complete classification model (“interpretable classification task”), and selection of interesting rules (“nugget discovery”).
6.1 Representation of a rule
The interpretation of natural language expressions used to describe a rule can lead to systematic distortions.
Our review revealed the following recommendations applicable to individual rules:
Syntactic elements. There are several cognitive studies indicating that AND is often misunderstood (Hertwig et al., 2008), (Gigerenzer, 2001, p. 95-96). The results of our experiments (Fürnkranz et al., 2018) support the conclusion that AND needs to be presented unambiguously in the rule learning context. Research has shown that “and” ceases to be ambiguous when it is used to connect propositions rather than categories. Similarly, the communication of the implication construct IF THEN connecting antecedent and consequent should be made unambiguous.
Another important syntactic construct is negation (NOT). While processing of negation has not been included among the surveyed biases, our review of literature (cf. Section 7.4) suggests that its use is discouraged on the grounds that its processing requires more cognitive effort.
Conditions. Attribute-value pairs comprising conditions are typically either formed of words with meaningful semantics for the user, or of codes that are not directly meaningful. When conditions contain words with negative valence, these need to be reviewed carefully, since negative information is known to receive more attention and is associated with higher weight than positive information. A number of biases can be triggered or strengthened by the lack of understanding of attributes and their values appearing in rules. Providing easily accessible information on conditions in the rules, including their predictive power, can thus prove as an effective debiasing technique.
People have the tendency to put higher emphasis on information they are exposed to first. By ordering the conditions by strength, machine learning software can conform to human conversational maxims. The output could also visually delimit conditions in the rules based on their significance or predictive strength.
Interestingness measures. The values of interestingness measures should be communicated using numerical expressions. Alternate verbal expressions, with wordings such as “strong relationship” replacing specific numerical values, are discouraged because there is some evidence that such verbal expressions are prone to miscommunication.
Currently, rule interest measures are typically represented as probabilities (confidence) or ratios (lift), whereas results in cognitive science indicate that natural frequencies are better understood.
The tendency of humans to ignore base rates and sample sizes (which closely relate to rule support) is a well established fact in cognitive science. Results of our experiments on inductively learned rules also provide evidence for this conclusion (Fürnkranz et al., 2018). Our proposition is that this effect can be addressed by presenting confidence (reliability) intervals for confidence.
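The two presentation ideas above, natural frequencies and an interval around confidence, can be combined in a single display string. The sketch below uses the Wilson score interval as one common choice of interval for a proportion; this choice, the 95% level (z = 1.96), and the function names are assumptions of the example rather than something the reviewed studies prescribe.

```python
import math
from typing import Tuple

def wilson_interval(correct: int, covered: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a rule's confidence (z = 1.96 is roughly the 95% level)."""
    if covered <= 0:
        raise ValueError("rule must cover at least one instance")
    p = correct / covered
    denom = 1 + z * z / covered
    center = (p + z * z / (2 * covered)) / denom
    half = z * math.sqrt(p * (1 - p) / covered + z * z / (4 * covered * covered)) / denom
    return center - half, center + half

def describe(correct: int, covered: int) -> str:
    """Natural-frequency wording plus the interval, as one display string."""
    low, high = wilson_interval(correct, covered)
    return (f"{correct} out of {covered} covered instances "
            f"(confidence {correct / covered:.2f}, interval {low:.2f} to {high:.2f})")

# Two hypothetical rules with the same confidence but very different support.
small = wilson_interval(8, 10)
large = wilson_interval(800, 1000)
```

Both rules have a confidence of 0.8, but the interval for the rule covering only ten instances is much wider, which makes the role of sample size visible to the analyst.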
6.2 Rule models
In many cases, rules are not presented in isolation to the analyst, but instead within a collection of rules comprising a rule model. Here, we relate the results of our review to the following aspects of rule models:
Model size. Rule models often incorporate output that is considered as marginally relevant. This can take a form of (nearly) redundant rules or (nearly) redundant conditions in the rule. Our analysis shows that these redundancies can induce a number of biases. These can be addressed by utilizing various pruning techniques, or some of the recently proposed rule learning algorithms (e.g. (Letham et al., 2015; Lakkaraju et al., 2016)), which allow the user to set or influence size of the resulting model, and in case of (Lakkaraju et al., 2016) can optimize for diversity and non-overlap of discovered rules, directly countering the reiteration effect.
Another potentially effective approach to discarding some rules is to use domain knowledge or constraints set by the user to remove the strong (e.g. highly confident), yet “obvious” rules confirming common knowledge. (An example of such pre-existing knowledge might be that diastolic blood pressure (DBP) rises with body mass index (BMI) (Kliegr et al., 2011); rules confirming this relationship might be removed.) Removal of weak rules could help to address tradeoff contrast as well as the weak evidence effect.
Rule grouping. The rule learning literature has seen multiple attempts to develop methods for grouping similar rules (such as clustering). Our review suggests that presenting clusters of similar rules can help to reduce cognitive biases caused by reiteration.
Rule ordering. Algorithms that learn rule lists provide mandatory ordering of rules, while the rule order in rule-set learning algorithms is not important. In either case, the rule order as presented to the user will affect perception of the model due to conversational maxims and the primacy effect. It is recommended to sort the presented rules by strength. However, due to paucity of applicable research, it is unclear which particular definition of rule strength would lead to the best results in terms of bias mitigation.
6.3 User Engagement
Some results of our review suggest that increasing user interaction can help counter some biases. Some specific suggestions for machine learning user interfaces (UIs) follow:
Domain knowledge. Selectively presenting domain knowledge ”conflicting” with the considered rule can help to invoke the ’consider-the-opposite’ debiasing strategy. Other research has shown that the plausibility of a model depends on compliance to monotonicity constraints (Freitas, 2014). We thus suggest that UIs make background information on discovered rules easily accessible.
Eliciting rule annotation. Activating the deliberate “System 2” is one of the most widely applicable debiasing strategies. One way to achieve this is to require accountability, for example, through visual interfaces motivating users to annotate selected rules, which would induce the ’note taking’ debiasing strategy. Giving people additional time to consider the problem has in some cases been shown to be an effective debiasing strategy. This can be achieved by making the selection process (at least) two-stage, allowing the user to revise the selected rules.
Search for rules rather than scrolling. Repeated rules can affect users even if they are exposed to them only for a short moment, e.g. when scrolling through the rule list, via the mere exposure effect. User interfaces should thus offer alternatives to scrolling through the discovered rules, such as search.
6.4 Bias inoculation
In some studies, basic education about specific biases, such as brief tutorials, decreased the fallacy rate. In the literature, this approach has been called the bias inoculation debiasing strategy.
Education on specific biases. Several studies have shown that providing explicit guidance and education on formal logic, hypothesis testing, and critical assessment of information can reduce fallacy rates in some tasks. However, the effect of psychoeducational methods is still a subject of dispute (Lilienfeld et al., 2009), and thus cannot be recommended as a sole or sufficient measure.
7 Limitations and Future Work
Our goal was to examine whether cognitive biases can affect the interpretation of machine learning models and to propose possible remedies if they do. Since this field is untapped from the machine learning perspective, we tried to approach the problem holistically. Our work yielded a number of partial contributions, rather than a single profound result. We mapped applicable cognitive biases, identified prior works on their suppression, and proposed how these could be transferred to machine learning. All the shortcomings of human judgment pertaining to the interpretation of inductively learned rules that we have reviewed are based on empirical cognitive science research. For each cognitive bias, we provided a justification of how it would relate to machine learning. Due to the absence of applicable prior research at this intersection of cognitive science and machine learning, this justification is mostly based on the authors’ experience in machine learning. In the following, we outline some promising directions for future work.
7.1 Role of Domain Knowledge
It has long been recognized that external knowledge plays an important role in the rule learning process. Already Mitchell (1980) recognized at least two distinct roles external knowledge can play in machine learning: it can constrain the search for appropriate generalizations, and guide learning based on the intended use of the learned generalizations. Interaction with domain knowledge has played an important role in multiple stages of the machine learning process. For example, it can improve semi-supervised learning (Carlson et al., 2010), and in some applications it is vital to convert discovered rules back into domain knowledge (Fürnkranz et al., 2012, p. 288). Some results also confirm the common intuition that compliance with constraints valid in the given domain increases the plausibility of the learned models (Freitas, 2014). Our review shows that domain knowledge can be one of the important instruments in the toolbox aimed at debiasing the interpretation of discovered rules. To give a specific example, the presence or strength of the validity effect depends on the familiarity of the subject with the topic area from which the information originates (Boehm, 1994). Future work should focus on a systematic review of the role of domain knowledge in the activation or inhibition of cognitive phenomena applicable to the interpretability of rule learning results.
7.2 Individual differences
The presence of multiple cognitive biases and their strengths have been linked to specific personality traits. For example, overconfidence and the rate of conjunctive fallacy have been shown to be inversely related to numeracy (Winman et al., 2014). According to Juslin et al. (2011), the application of the averaging heuristic rather than the normative multiplication of probabilities seems to depend on the working memory capacity and/or high motivation.
Some research can even be interpreted as indicating that data analysts can be more susceptible to the myside bias than the general population. An experiment reported by Wolfe and Britt (2008) shows that subjects who defined good arguments as those that can be “proved by facts” (this stance, we assume, would also apply to many data analysts) were more prone to exhibiting the myside bias. (Footnote 9: This tendency is explained by Wolfe and Britt (2008) as follows: “For people with this belief, facts and support are treated uncritically. …More importantly, arguments and information that may support another side are not part of the schema and are also ignored.”) Stanovich et al. (2013) show that the incidence of myside bias is surprisingly not related to general intelligence. This suggests that even highly intelligent analysts can be affected. Albarracín and Mitchell (2004) propose that the susceptibility to the confirmation bias can depend on one’s personality traits. They also present a diagnostic tool called “defense confidence scale” that can identify individuals who are prone to confirmational strategies. Further research into personality traits of users of machine learning outputs, as well as into development of appropriate personality tests, would help to better target education focused on debiasing.
7.3 Incorporating Additional Biases
There are about 24 cognitive biases covered in Cognitive Illusions, the authoritative overview of cognitive biases by Pohl (2017), and as many as 51 different biases are covered by Evans et al. (2007). While doing the initial selection of cognitive biases to study, we tried to identify those most relevant for machine learning research matching our criteria. In the end, our review focused on a selection of 20 cognitive biases (effects, illusions). Future work might focus on expanding the review with additional relevant biases, such as labelling and overshadowing effects (Pohl, 2017).
7.4 Extending Scope Beyond Biases
There are a number of cognitive phenomena affecting the interpretability of rules that are not classified as cognitive biases. Remarkably, since 1960 there has been a consistent line of work by psychologists studying cognitive processes related to rule induction, which is centred around the so-called Wason’s 2-4-6 problem (Wason, 1960). Cognitive science research on rule induction in humans has so far not been noticed in the rule learning subfield of machine learning. (Footnote 10: Based on our analysis of cited reference search in Google Scholar for Wason (1960).) It was outside the scope of this review to analyse the significance of these results for rule learning. Nevertheless, we believe that such an investigation could bring interesting insights for the cognitively inspired design of rule learning algorithms.
Another promising direction for further work is research focused on the interpretation of negations (“not”). Experiments conducted by Jiang et al. (2014) show that the mental processes involved in processing negations slow down reasoning. Negation can also sometimes be ignored or forgotten (Deutsch et al., 2009), as it reduces the accuracy with which the information is remembered in the long term.
Most rule learning algorithms are capable of generating rules containing negated literals. For example, a healthy company can be represented as status = not(bankrupt). Our precautionary suggestion, based on our interpretation of results obtained in general studies performed in experimental psychology (Deutsch et al., 2009) and neurolinguistics (Jiang et al., 2014), is that artificial learning systems should refrain, wherever feasible, from the use of negation in the discovered rules that are to be presented to the user. Due to the adverse implications of the use of negation for cognitive load and remembrance, empirical research focused on the interpretability of negation in machine learning is urgently needed.
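One way a presentation layer could follow this suggestion is sketched below. The sketch assumes that literals are written as plain strings in the form used in the example above (`attribute = not(value)`) and that the set of possible values of each attribute is known. A negated literal is rewritten into a positive one only when exactly one alternative value remains; otherwise it is left unchanged. The function name, the string format, and the `healthy` value are hypothetical.

```python
import re
from typing import Dict, List

# Matches literals such as "status = not(bankrupt)" (format assumed for this sketch).
NEGATED = re.compile(r"^\s*(\w+)\s*=\s*not\((\w+)\)\s*$")


def to_positive(literal: str, domains: Dict[str, List[str]]) -> str:
    """Rewrite a negated literal as a positive one when the attribute is binary."""
    match = NEGATED.match(literal)
    if not match:
        return literal  # not a negated literal, nothing to do
    attribute, value = match.group(1), match.group(2)
    values = domains.get(attribute, [])
    others = [v for v in values if v != value]
    if value in values and len(others) == 1:
        return f"{attribute} = {others[0]}"
    return literal  # unknown or non-binary attribute: keep the original form


domains = {"status": ["bankrupt", "healthy"]}
print(to_positive("status = not(bankrupt)", domains))
```

For attributes with more than two values, a positive rewrite would require a disjunction of the remaining values, which may itself be harder to read; the sketch therefore leaves such literals untouched.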
To our knowledge, cognitive biases have not yet been discussed in relation to the interpretability of machine learning results. We thus initiated this review of research published in cognitive science with the intent of providing a psychological basis to further research in inductive rule learning algorithms, and the way their results are communicated. Our review identified twenty cognitive biases, heuristics, and effects that can give rise to systematic errors when inductively learned rules are interpreted.
For most biases and heuristics included in our review, psychologists have proposed “debiasing” measures. Application of prior empirical results obtained in cognitive science allowed us to propose several methods that could be effective in suppressing these cognitive phenomena when machine learning models are interpreted.
Overall, in our review, we processed only a fraction of potentially relevant psychological studies of cognitive biases, but we were unable to locate a single study focused on machine learning. Future research should thus focus on empirical evaluation of effects of cognitive biases in the machine learning domain.
TK was supported by long term institutional support of research activities. ŠB and TK were supported by grant IGA 33/2018 by Faculty of Informatics and Statistics, University of Economics, Prague. ŠB was supported by Internal Grant Agency of Faculty of Business Administration, University of Economics, Prague (grant no. IP300040).
- Kahneman and Tversky (1972) D. Kahneman, A. Tversky, Subjective probability: A judgment of representativeness, Cognitive Psychology 3 (1972) 430–454.
- Tversky and Simonson (1993) A. Tversky, I. Simonson, Context-dependent preference, Management Science 39 (1993) 1179–1189.
- Kunda (1999) Z. Kunda, Social Cognition: Making Sense of People, MIT press, 1999.
- Serfas (2011) S. Serfas, Cognitive biases in the capital investment context, in: Cognitive Biases in the Capital Investment Context, Springer, 2011, pp. 95–189.
- Kahneman and Tversky (1973) D. Kahneman, A. Tversky, On the psychology of prediction, Psychological Review 80 (1973) 237.
- Michalski (1969) R. S. Michalski, On the quasi-minimal solution of the general covering problem, in: Proceedings of the V International Symposium on Information Processing (FCIP 69)(Switching Circuits), Yugoslavia, Bled, 1969, pp. 125–128.
- Michalski (1983) R. S. Michalski, A theory and methodology of inductive learning, in: Machine Learning, Springer, 1983, pp. 83–134.
- Fürnkranz et al. (2018) J. Fürnkranz, T. Kliegr, H. Paulheim, On cognitive preferences and the interpretability of rule-based models, CoRR abs/1803.01316 (2018).
- Fürnkranz et al. (2012) J. Fürnkranz, D. Gamberger, N. Lavrač, Foundations of Rule Learning, Springer-Verlag, 2012.
- Slowinski et al. (2006) R. Slowinski, I. Brzezinska, Application of bayesian confirmation measures for mining rules from support-confidence pareto-optimal set, Proceedings of the 7th International Conference on Artificial Intelligence and Soft Computing (ICAISC 2006) (2006) 1018–1026.
- Agrawal et al. (1995) R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A. I. Verkamo, Fast discovery of association rules, in: U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press, 1995, pp. 307–328.
- Zhang and Zhang (2002) C. Zhang, S. Zhang, Association Rule Mining: Models and Algorithms, Springer-Verlag, 2002.
- Smith et al. (1992) E. E. Smith, C. Langston, R. E. Nisbett, The case for rules in reasoning, Cognitive Science 16 (1992) 1–40.
- Nisbett (1993) R. E. Nisbett, Rules for reasoning, Psychology Press, 1993.
- Pinker (2015) S. Pinker, Words and rules: The ingredients of language, Basic Books, 2015.
- Tversky and Kahneman (1974) A. Tversky, D. Kahneman, Judgment under uncertainty: Heuristics and biases, Science 185 (1974) 1124–1131.
- Griffin and Tversky (1992) D. Griffin, A. Tversky, The weighing of evidence and the determinants of confidence, Cognitive Psychology 24 (1992) 411–435.
- Keynes (1922) J. M. Keynes, A Treatise on Probability, Macmillan & Co, 1922.
- Camerer and Weber (1992) C. Camerer, M. Weber, Recent developments in modeling preferences: Uncertainty and ambiguity, Journal of Risk and Uncertainty 5 (1992) 325–370.
- Wason (1960) P. C. Wason, On the failure to eliminate hypotheses in a conceptual task, Quarterly Journal of Experimental Psychology 12 (1960) 129–140.
- Klayman and Ha (1987) J. Klayman, Y.-W. Ha, Confirmation, disconfirmation, and information in hypothesis testing, Psychological Review 94 (1987) 211.
- Wilke and Mata (2012) A. Wilke, R. Mata, Cognitive bias, in: V. Ramachandran (Ed.), Encyclopedia of Human Behavior, second ed., Academic Press, San Diego, 2012, pp. 531–535.
- Lau and Coiera (2009) A. Y. Lau, E. W. Coiera, Can cognitive biases during consumer health information searches be reduced to improve decision making?, Journal of the American Medical Informatics Association 16 (2009) 54–65.
- Croskerry et al. (2013) P. Croskerry, G. Singhal, S. Mamede, Cognitive debiasing 2: impediments to and strategies for change, BMJ Quality & Safety (2013) bmjqs–2012.
- Martire et al. (2014) K. Martire, R. Kemp, M. Sayle, B. Newell, On the interpretation of likelihood ratios in forensic science evidence: Presentation formats and the weak evidence effect, Forensic Science International 240 (2014) 61–68.
- Arkes (1991) H. R. Arkes, Costs and benefits of judgment errors: Implications for debiasing, Psychological Bulletin 110 (1991) 486.
- Larrick (2004) R. P. Larrick, Debiasing, Blackwell Handbook of Judgment and Decision Making (2004) 316–338.
- Lilienfeld et al. (2009) S. O. Lilienfeld, R. Ammirati, K. Landfield, Giving debiasing away: Can psychological research on correcting cognitive errors promote human welfare?, Perspectives on Psychological Science 4 (2009) 390–398.
- Shafir (2013) E. Shafir, The behavioral foundations of public policy, Princeton University Press, 2013.
- Pohl (2017) R. Pohl, Cognitive Illusions: A Handbook on Fallacies and Biases in Thinking, Judgement and Memory, Psychology Press, 2017. 2nd ed.
- Haselton and Nettle (2006) M. G. Haselton, D. Nettle, The paranoid optimist: An integrative evolutionary model of cognitive biases, Personality and Social Psychology Review 10 (2006) 47–66.
- Freitas (2014) A. A. Freitas, Comprehensible classification models: a position paper, ACM SIGKDD Explorations 15 (2014) 1–10.
- Stecher et al. (2016) J. Stecher, F. Janssen, J. Fürnkranz, Shorter rules are better, aren’t they?, in: Proceedings of the 19th International Conference on Discovery Science (DS-16), Bari, Italy, 2016, pp. 279–294.
- Gettys et al. (1978) C. F. Gettys, S. D. Fisher, T. Mehle, Hypothesis Generation and Plausibility Assessment, Technical Report, Decision Processes Laboratory, University of Oklahoma, Norman, 1978. Annual report TR 15-10-78 (AD A060786).
- Gettys et al. (1986) C. F. Gettys, T. Mehle, S. Fisher, Plausibility assessments in hypothesis generation, Organizational Behavior and Human Decision Processes 37 (1986) 14–33.
- Anderson and Fleming (2016) J. Anderson, D. Fleming, Analytical procedures decision aids for generating explanations: Current state of theoretical development and implications of their use, Journal of Accounting and Taxation 8 (2016) 51.
- Bibal and Frénay (2016) A. Bibal, B. Frénay, Interpretability of machine learning models and representations: an introduction, in: Proceedings of the 24th European Symposium on Artificial Neural Networks (ESANN), 2016, pp. 77–82.
- Martens et al. (2011) D. Martens, J. Vanthienen, W. Verbeke, B. Baesens, Performance of classification models from a user perspective, Decision Support Systems 51 (2011) 782–793.
- Azevedo and Jorge (2007) P. J. Azevedo, A. M. Jorge, Comparing rule measures for predictive association rules, in: Proceedings of the 18th European Conference on Machine Learning (ECML-07), Springer, Warsaw, Poland, 2007, pp. 510–517.
- Tversky and Kahneman (1971) A. Tversky, D. Kahneman, Belief in the law of small numbers, Psychological Bulletin 76 (1971) 105.
- Gigerenzer and Goldstein (1996) G. Gigerenzer, D. G. Goldstein, Reasoning the fast and frugal way: models of bounded rationality, Psychological Review 103 (1996) 650.
- Gigerenzer and Hoffrage (1995) G. Gigerenzer, U. Hoffrage, How to improve Bayesian reasoning without instruction: frequency formats, Psychological Review 102 (1995) 684.
- Fürnkranz and Flach (2005) J. Fürnkranz, P. A. Flach, Roc ‘n’rule learning—towards a better understanding of covering algorithms, Machine Learning 58 (2005) 39–77.
- Fleischmann et al. (2014) M. Fleischmann, M. Amirpur, A. Benlian, T. Hess, Cognitive biases in information systems research: A scientometric analysis, in: Proceedings of the 22nd European Conference on Information Systems (ECIS 2014), Tel Aviv, Israel, 2014.
- Tversky and Kahneman (1983) A. Tversky, D. Kahneman, Extensional versus intuitive reasoning: the conjunction fallacy in probability judgment, Psychological Review 90 (1983) 293.
- Bar-Hillel (1991) M. Bar-Hillel, Commentary on Wolford, Taylor, and Beck: The conjunction fallacy?, Memory & Cognition 19 (1991) 412–414.
- Hertwig et al. (2008) R. Hertwig, B. Benz, S. Krauss, The conjunction fallacy and the many meanings of and, Cognition 108 (2008) 740–753.
- Tentori and Crupi (2012) K. Tentori, V. Crupi, On the conjunction fallacy and the meaning of and, yet again: A reply to Hertwig, Benz, and Krauss (2008), Cognition 122 (2012) 123–134.
- Nilsson et al. (2009) H. Nilsson, A. Winman, P. Juslin, G. Hansson, Linda is not a bearded lady: Configural weighting and adding as the cause of extension errors, Journal of Experimental Psychology: General 138 (2009) 517.
- Bruza et al. (2015) P. D. Bruza, Z. Wang, J. R. Busemeyer, Quantum cognition: a new theoretical approach to psychology, Trends in Cognitive Sciences 19 (2015) 383–393.
- Tentori et al. (2013) K. Tentori, V. Crupi, S. Russo, On the determinants of the conjunction fallacy: Probability versus inductive confirmation, Journal of Experimental Psychology: General 142 (2013) 235.
- Fisk (2002) J. E. Fisk, Judgments under uncertainty: Representativeness or potential surprise?, British Journal of Psychology 93 (2002) 431–449.
- Charness et al. (2010) G. Charness, E. Karni, D. Levin, On the conjunction fallacy in probability judgment: New experimental evidence regarding Linda, Games and Economic Behavior 68 (2010) 551 – 556.
- Zizzo et al. (2000) D. J. Zizzo, S. Stolarz-Fantino, J. Wen, E. Fantino, A violation of the monotonicity axiom: Experimental evidence on the conjunction fallacy, Journal of Economic Behavior & Organization 41 (2000) 263–276.
- Stolarz-Fantino et al. (1996) S. Stolarz-Fantino, E. Fantino, J. Kulik, The conjunction fallacy: Differential incidence as a function of descriptive frames and educational context, Contemporary Educational Psychology 21 (1996) 208–218.
- Juslin et al. (2009) P. Juslin, H. Nilsson, A. Winman, Probability theory, not the very guide of life, Psychological Review 116 (2009) 856.
- Sides et al. (2002) A. Sides, D. Osherson, N. Bonini, R. Viale, On the reality of the conjunction fallacy, Memory & Cognition 30 (2002) 191–198.
- Mellers et al. (2001) B. Mellers, R. Hertwig, D. Kahneman, Do frequency representations eliminate conjunction effects? an exercise in adversarial collaboration, Psychological Science 12 (2001) 269–275.
- Piltaver et al. (2016) R. Piltaver, M. Lustrek, M. Gams, S. Martincic-Ipsic, What makes classification trees comprehensible?, Expert Systems with Applications 62 (2016) 333 – 346.
- Fantino et al. (1997) E. Fantino, J. Kulik, S. Stolarz-Fantino, W. Wright, The conjunction fallacy: A test of averaging hypotheses, Psychonomic Bulletin & Review 4 (1997) 96–101.
- Juslin et al. (2011) P. Juslin, H. Nilsson, A. Winman, M. Lindskog, Reducing cognitive biases in probabilistic reasoning by the use of logarithm formats, Cognition 120 (2011) 248–267.
- Bar-Hillel and Neter (1993) M. Bar-Hillel, E. Neter, How alike is it versus how likely is it: A disjunction fallacy in probability judgments, Journal of Personality and Social Psychology 65 (1993) 1119.
- Kachelmeier and Messier Jr (1990) S. J. Kachelmeier, W. F. Messier Jr, An investigation of the influence of a nonstatistical decision aid on auditor sample size decisions, Accounting Review (1990) 209–226.
- Bar-Hillel (1979) M. Bar-Hillel, The role of sample size in sample evaluation, Organizational Behavior and Human Performance 24 (1979) 245–257.
- Letham et al. (2015) B. Letham, C. Rudin, T. H. McCormick, D. Madigan, et al., Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model, The Annals of Applied Statistics 9 (2015) 1350–1371.
- De Laat (2017) P. B. De Laat, Algorithmic decision-making based on machine learning from big data: Can transparency restore accountability?, Philosophy & Technology (2017) 1–17.
- Evans (1989) J. S. B. Evans, Bias in human reasoning: Causes and consequences, Lawrence Erlbaum Associates, Inc, 1989.
- Nickerson (1998) R. S. Nickerson, Confirmation bias: A ubiquitous phenomenon in many guises, Review of General Psychology 2 (1998) 175.
- Trope et al. (1997) Y. Trope, B. Gervey, N. Liberman, Wishful thinking from a pragmatic hypothesis-testing perspective, The mythomanias: The nature of deception and self-deception (1997) 105–31.
- Pohl (2004) R. Pohl, Cognitive illusions: A handbook on fallacies and biases in thinking, judgement and memory, Psychology Press, 2004.
- Spengler et al. (1995) P. M. Spengler, D. C. Strohmer, D. N. Dixon, V. A. Shivy, A scientist-practitioner model of psychological assessment: Implications for training, practice and research, The Counseling Psychologist 23 (1995) 506–534.
- Parmley (2006) M. C. Parmley, The effects of the confirmation bias on diagnostic decision making, Ph.D. thesis, Drexel University, 2006.
- Vojíř et al. (2018) S. Vojíř, V. Zeman, J. Kuchař, T. Kliegr, Easyminer.eu: Web framework for interpretable machine learning based on rules and frequent itemsets, Knowledge-Based Systems (2018).
- Wolfe and Britt (2008) C. R. Wolfe, M. A. Britt, The locus of the myside bias in written argumentation, Thinking & Reasoning 14 (2008) 1–27.
- Barberia et al. (2013) I. Barberia, F. Blanco, C. P. Cubillas, H. Matute, Implementation and assessment of an intervention to debias adolescents against causal illusions, PLoS One 8 (2013) e71303.
- Tversky and Kahneman (1973) A. Tversky, D. Kahneman, Availability: A heuristic for judging frequency and probability, Cognitive Psychology 5 (1973) 207–232.
- Schwarz et al. (1991) N. Schwarz, H. Bless, F. Strack, G. Klumpp, H. Rittenauer-Schatka, A. Simons, Ease of retrieval as information: Another look at the availability heuristic, Journal of Personality and Social psychology 61 (1991) 195.
- Schwarz (2004) N. Schwarz, Metacognitive experiences in consumer judgment and decision making, Journal of Consumer Psychology 14 (2004) 332–348.
- Hertwig et al. (1997) R. Hertwig, G. Gigerenzer, U. Hoffrage, The reiteration effect in hindsight bias, Psychological Review 104 (1997) 194.
- Pachur et al. (2011) T. Pachur, P. M. Todd, G. Gigerenzer, L. Schooler, D. G. Goldstein, The recognition heuristic: A review of theory and tests, Frontiers in Psychology 2 (2011) 147.
- Hasher et al. (1977) L. Hasher, D. Goldstein, T. Toppino, Frequency and the conference of referential validity, Journal of Verbal Learning and Verbal Behavior 16 (1977) 107–112.
- Unkelbach and Rom (2017) C. Unkelbach, S. C. Rom, A referential theory of the repetition-induced truth effect, Cognition 160 (2017) 110–126.
- Dechêne et al. (2010) A. Dechêne, C. Stahl, J. Hansen, M. Wänke, The truth about the truth: A meta-analytic review of the truth effect, Personality and Social Psychology Review 14 (2010) 238–257.
- Schwarz et al. (2007) N. Schwarz, L. J. Sanna, I. Skurnik, C. Yoon, Metacognitive experiences and the intricacies of setting people straight: Implications for debiasing and public information campaigns, Advances in Experimental Social Psychology 39 (2007) 127–161.
- Fürnkranz (1997) J. Fürnkranz, Pruning algorithms for rule learning, Machine Learning 27 (1997) 139–172.
- Ordonez et al. (2006) C. Ordonez, N. Ezquerra, C. A. Santana, Constraining and summarizing association rules in medical data, Knowledge and Information Systems 9 (2006) 1–2.
- Berka (2018) P. Berka, Comprehensive concept description based on association rules: A meta-learning approach, Intelligent Data Analysis 22 (2018) 325–344.
- Hess and Hagen (2006) N. H. Hess, E. H. Hagen, Psychological adaptations for assessing gossip veracity, Human Nature 17 (2006) 337–354.
- Rauch (2018) J. Rauch, Expert deduction rules in data mining with association rules: a case study, Knowledge and Information Systems (2018).
- Lewandowsky et al. (2012) S. Lewandowsky, U. K. Ecker, C. M. Seifert, N. Schwarz, J. Cook, Misinformation and its correction: Continued influence and successful debiasing, Psychological Science in the Public Interest 13 (2012) 106–131.
- Ecker et al. (2017) U. K. Ecker, J. L. Hogan, S. Lewandowsky, Reminders and repetition of misinformation: Helping or hindering its retraction?, Journal of Applied Research in Memory and Cognition 6 (2017) 185–192.
- Bornstein (1989) R. F. Bornstein, Exposure and affect: overview and meta-analysis of research, 1968–1987, Psychological Bulletin 106 (1989) 265.
- Monahan et al. (2000) J. L. Monahan, S. T. Murphy, R. B. Zajonc, Subliminal mere exposure: Specific, general, and diffuse effects, Psychological Science 11 (2000) 462–466.
- Winkielman et al. (2003) P. Winkielman, N. Schwarz, T. Fazendeiro, R. Reber, et al., The hedonic marking of processing fluency: Implications for evaluative judgment, The psychology of evaluation: Affective processes in cognition and emotion (2003) 189–217.
- Montoya et al. (2017) R. M. Montoya, R. S. Horton, J. L. Vevea, M. Citkowicz, E. A. Lauber, A re-examination of the mere exposure effect: The influence of repeated exposure on recognition, familiarity, and liking, Psychological Bulletin 143 (2017) 459.
- Alcala-Fdez et al. (2011) J. Alcala-Fdez, R. Alcala, F. Herrera, A fuzzy association rule-based classification model for high-dimensional problems with genetic rule selection and lateral tuning, IEEE Transactions on Fuzzy Systems 19 (2011) 857–872.
- Zajonc (1968) R. B. Zajonc, Attitudinal effects of mere exposure, Journal of Personality and Social Psychology 9 (1968) 1.
- Becker and Rinck (2016) E. S. Becker, M. Rinck, Reversing the mere exposure effect in spider fearfuls: Preliminary evidence of sensitization, Biological Psychology 121 (2016) 153–159.
- Škrabal et al. (2012) R. Škrabal, M. Šimůnek, S. Vojíř, A. Hazucha, T. Marek, D. Chudán, T. Kliegr, Association rule mining following the web search paradigm, in: P. A. Flach, T. Bie, N. Cristianini (Eds.), Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD 2012), Springer Berlin Heidelberg, 2012, pp. 808–811.
- Fischoff (1981) B. Fischoff, Debiasing, Technical Report, Decision Research, Eugene, OR, 1981.
- Arkes et al. (1987) H. R. Arkes, C. Christensen, C. Lai, C. Blumer, Two methods of reducing overconfidence, Organizational Behavior and Human Decision Processes 39 (1987) 133–144.
- Fleisig (2011) D. Fleisig, Adding information may increase overconfidence in accuracy of knowledge retrieval, Psychological Reports 108 (2011) 379–392.
- Hall et al. (2007) C. C. Hall, L. Ariss, A. Todorov, The illusion of knowledge: When more information reduces accuracy and increases confidence, Organizational Behavior and Human Decision Processes 103 (2007) 277–290.
- Clemen and Lichtendahl (2002) R. T. Clemen, K. C. Lichtendahl, Debiasing expert overconfidence: A bayesian calibration model, in: Sixth International Conference on Probablistic Safety Assessment and Management (PSAM6), 2002.
- Kliegr et al. (2011) T. Kliegr, V. Svátek, M. Ralbovský, M. Šimůnek, SEWEBAR-CMS: Semantic analytical report authoring for data mining results, Journal of Intelligent Information Systems 37 (2011) 371–395.
- Hertwig et al. (2008) R. Hertwig, S. M. Herzog, L. J. Schooler, T. Reimer, Fluency heuristic: A model of how the mind exploits a by-product of information retrieval, Journal of Experimental Psychology: Learning, Memory, and Cognition 34 (2008) 1191.
- Goldstein and Gigerenzer (1999) D. G. Goldstein, G. Gigerenzer, The recognition heuristic: How ignorance makes us smart, in: Simple heuristics that make us smart, Oxford University Press, 1999, pp. 37–58.
- Michalkiewicz et al. (2018) M. Michalkiewicz, K. Arden, E. Erdfelder, Do smarter people employ better decision strategies? the influence of intelligence on adaptive use of the recognition heuristic, Journal of Behavioral Decision Making 31 (2018) 3–11.
- Pohl et al. (2017) R. F. Pohl, M. Michalkiewicz, E. Erdfelder, B. E. Hilbig, Use of the recognition heuristic depends on the domain’s recognition validity, not on the recognition validity of selected sets of objects, Memory & Cognition 45 (2017) 776–791.
- Pachur and Hertwig (2006) T. Pachur, R. Hertwig, On the psychology of the recognition heuristic: Retrieval primacy as a key determinant of its use, Journal of Experimental Psychology: Learning, Memory, and Cognition 32 (2006) 983.
- Baron et al. (1988) J. Baron, J. Beattie, J. C. Hershey, Heuristics and biases in diagnostic reasoning: II congruence, information, and certainty, Organizational Behavior and Human Decision Processes 42 (1988) 88–110.
- Nelson et al. (2010) J. D. Nelson, C. R. McKenzie, G. W. Cottrell, T. J. Sejnowski, Experience matters: Information acquisition optimizes probability gain, Psychological Science 21 (2010) 960–969.
- Nelson (2005) J. D. Nelson, Finding useful questions: On Bayesian diagnosticity, probability, impact, and information gain, Psychological Review 112 (2005) 979.
- Nelson (2008) J. D. Nelson, Towards a rational theory of human information acquisition, The probabilistic mind: Prospects for Bayesian cognitive science (2008) 143–163.
- Ellsberg (1961) D. Ellsberg, Risk, ambiguity, and the Savage axioms, The Quarterly Journal of Economics 75 (1961) 643–669.
- Vieider (2009) F. M. Vieider, The effect of accountability on loss aversion, Acta psychologica 132 (2009) 96–101.
- Villejoubert and Mandel (2002) G. Villejoubert, D. R. Mandel, The inverse fallacy: An account of deviations from Bayes’s theorem and the additivity principle, Memory & Cognition 30 (2002) 171–178.
- Wille (1982) R. Wille, Restructuring lattice theory: An approach based on hierarchies of concepts, in: I. Rival (Ed.), Ordered Sets, Reidel, Dordrecht-Boston, 1982, pp. 445–470.
- Ganter and Wille (1999) B. Ganter, R. Wille, Formal Concept Analysis – Mathematical Foundations, Springer, 1999.
- Edgell et al. (2004) S. E. Edgell, J. Harbison, W. P. Neace, I. D. Nahinsky, A. S. Lajoie, What is learned from experience in a probabilistic environment?, Journal of Behavioral Decision Making 17 (2004) 213–229.
- Werner et al. (2018) C. Werner, A. M. Hanea, O. Morales-Nápoles, Eliciting multivariate uncertainty from experts: Considerations and approaches along the expert judgement process, in: Elicitation, Springer, 2018, pp. 171–210.
- Díaz et al. (2010) C. Díaz, C. Batanero, J. M. Contreras, Teaching independence and conditional probability, Boletín de Estadística e Investigación Operativa 26 (2010) 149–162.
- Simonson and Tversky (1992) I. Simonson, A. Tversky, Choice in context: Tradeoff contrast and extremeness aversion, Journal of Marketing Research 29 (1992) 281.
- Gamberger and Lavrač (2003) D. Gamberger, N. Lavrač, Active subgroup mining: A case study in coronary heart disease risk group detection, Artificial Intelligence in Medicine 28 (2003) 27–57.
- Kononenko (1993) I. Kononenko, Inductive and Bayesian learning in medical diagnosis, Applied Artificial Intelligence 7 (1993) 317–337.
- Rozin and Royzman (2001) P. Rozin, E. B. Royzman, Negativity bias, negativity dominance, and contagion, Personality and Social Psychology Review 5 (2001) 296–320.
- Pratto and John (2005) F. Pratto, O. P. John, Automatic vigilance: The attention-grabbing power of negative social information, Social Cognition: Key Readings 250 (2005).
- Skowronski and Carlston (1989) J. J. Skowronski, D. E. Carlston, Negativity and extremity biases in impression formation: A review of explanations, Psychological Bulletin 105 (1989) 131.
- Robinson-Riegler and Winton (1996) G. L. Robinson-Riegler, W. M. Winton, The role of conscious recollection in recognition of affective material: Evidence for positive-negative asymmetry, The Journal of General Psychology 123 (1996) 93–104.
- Ohira et al. (1998) H. Ohira, W. M. Winton, M. Oyama, Effects of stimulus valence on recognition memory and endogenous eyeblinks: Further evidence for positive-negative asymmetry, Personality and Social Psychology Bulletin 24 (1998) 986–993.
- Fiske (1980) S. T. Fiske, Attention and weight in person perception: The impact of negative and extreme behavior, Journal of Personality and Social Psychology 38 (1980) 889.
- Huber (2010) M. Huber, From mindless to mindful decision making: Reflecting on prescriptive processes, Ph.D. thesis, University of Colorado at Boulder, 2010.
- Tversky and Kahneman (1981) A. Tversky, D. Kahneman, The framing of decisions and the psychology of choice, Science 211 (1981) 453–458.
- Bond et al. (2007) S. D. Bond, K. A. Carlson, M. G. Meloy, J. E. Russo, R. J. Tanner, Information distortion in the evaluation of a single option, Organizational Behavior and Human Decision Processes 102 (2007) 240–254.
- Fitzsimons and Shiv (2001) G. J. Fitzsimons, B. Shiv, Nonconscious and contaminative effects of hypothetical questions on subsequent decision making, Journal of Consumer Research 28 (2001) 224–238.
- Gigerenzer and Hoffrage (1999) G. Gigerenzer, U. Hoffrage, Overcoming difficulties in Bayesian reasoning: A reply to Lewis and Keren (1999) and Mellers and McGraw (1999), Psychological Review (1999) 425–430.
- Mumma and Wilson (1995) G. H. Mumma, S. B. Wilson, Procedural debiasing of primacy/anchoring effects in clinical-like judgments, Journal of Clinical Psychology 51 (1995) 841–853.
- Liu et al. (1998) B. Liu, W. Hsu, Y. Ma, Integrating classification and association rule mining, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, KDD’98, AAAI Press, 1998, pp. 80–86.
- Fürnkranz (1999) J. Fürnkranz, Separate-and-conquer rule learning, Artificial Intelligence Review 13 (1999) 3–54.
- Webb (1994) G. I. Webb, Recent progress in learning decision lists by prepending inferred rules, in: Proceedings of the 2nd Singapore International Conference on Intelligent Systems, 1994, pp. B280–B285.
- Martire et al. (2013) K. A. Martire, R. I. Kemp, I. Watkins, M. A. Sayle, B. R. Newell, The expression and interpretation of uncertain forensic science evidence: Verbal equivalence, evidence strength, and the weak evidence effect, Law and Human Behavior 37 (2013) 197.
- Fernbach et al. (2011) P. M. Fernbach, A. Darlow, S. A. Sloman, When good evidence goes bad: The weak evidence effect in judgment and decision-making, Cognition 119 (2011) 459–467.
- Willis (2010) S. Willis, Standards for the formulation of evaluative forensic science expert opinion association of forensic science providers, Science & Justice 50 (2010) 49.
- Strossa et al. (2005) P. Strossa, Z. Černý, J. Rauch, Reporting data mining results in a natural language, in: Foundations of Data Mining and Knowledge Discovery, Springer, 2005, pp. 347–361.
- Geier et al. (2006) A. B. Geier, P. Rozin, G. Doros, Unit bias: A new heuristic that helps explain the effect of portion size on food intake, Psychological Science 17 (2006) 521–525.
- Gigerenzer (2001) G. Gigerenzer, Content-blind norms, no norms, or good norms? A reply to Vranas, Cognition 81 (2001) 93–103.
- Lakkaraju et al. (2016) H. Lakkaraju, S. H. Bach, J. Leskovec, Interpretable decision sets: A joint framework for description and prediction, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, ACM, New York, NY, USA, 2016, pp. 1675–1684.
- Kliegr et al. (2011) T. Kliegr, S. Vojíř, J. Rauch, Background knowledge and PMML: first considerations, in: Proceedings of the 2011 workshop on Predictive markup language modeling, PMML ’11, ACM, New York, NY, USA, 2011, pp. 54–62.
- Mitchell (1980) T. M. Mitchell, The need for biases in learning generalizations, Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ. New Jersey, 1980.
- Carlson et al. (2010) A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, T. M. Mitchell, Toward an architecture for never-ending language learning, in: Proceedings of the 24th AAAI Conference on Artificial Intelligence, volume 5, Atlanta, Georgia, 2010, p. 3.
- Boehm (1994) L. E. Boehm, The validity effect: A search for mediating variables, Personality and Social Psychology Bulletin 20 (1994) 285–293.
- Winman et al. (2014) A. Winman, P. Juslin, M. Lindskog, H. Nilsson, N. Kerimi, The role of ANS acuity and numeracy for the calibration and the coherence of subjective probability judgments, Frontiers in Psychology 5 (2014) 851.
- Stanovich et al. (2013) K. E. Stanovich, R. F. West, M. E. Toplak, Myside bias, rational thinking, and intelligence, Current Directions in Psychological Science 22 (2013) 259–264.
- Albarracín and Mitchell (2004) D. Albarracín, A. L. Mitchell, The role of defensive confidence in preference for proattitudinal information: How believing that one is strong can sometimes be a defensive weakness, Personality and Social Psychology Bulletin 30 (2004) 1565–1584.
- Evans et al. (2007) J. S. B. Evans, et al., Hypothetical thinking: Dual processes in reasoning and judgement, volume 3, Psychology Press, 2007.
- Jiang et al. (2014) Z.-q. Jiang, W.-h. Li, Y. Liu, Y.-j. Luo, P. Luu, D. M. Tucker, When affective word valence meets linguistic polarity: Behavioral and ERP evidence, Journal of Neurolinguistics 28 (2014) 19–30.
- Deutsch et al. (2009) R. Deutsch, R. Kordts-Freudinger, B. Gawronski, F. Strack, Fast and fragile: A new look at the automaticity of negation processing, Experimental Psychology 56 (2009) 434.