The use of novel machine learning tools for artificial intelligence (AI) applied to clinical psychiatry datasets has consistently increased in recent years [Beam and Kohane2018], mostly due to the prevalence of algorithms that can ingest heterogeneous datasets while producing highly predictive models. Yet, while high predictability is indeed a desirable result, the healthcare community also requires that the abstractions generated by machine learning models be interpretable, so that experts can incorporate new machine learning insights into current clinical tools, or even better, so that experts can improve the performance of the abstraction by tuning the data-driven models. In this paper we take a practical approach towards solving this problem: we develop a new algorithm capable of mining association rules from wide categorical datasets, and we apply our mining method to build predictive and interpretable transdiagnostic screening tools for psychiatric disorders.
The artificial intelligence research community has been urged to develop interpretable machine learning methods, which can provide accessible and explicit explanations. The AI Now Institute at New York University lists as the top recommendation of its 2017 report that core government agencies, including those responsible for healthcare, “should no longer use black box AI and algorithmic systems” [Campolo et al.2017]. The Explainable Artificial Intelligence (XAI) program at DARPA has as one of its goals to “enable human users to understand, appropriately trust, and effectively manage the emerging generation of artificially intelligent partners” [Gunning2017]. But popular machine learning methods such as artificial neural networks [Hopfield1988, LeCun, Bengio, and Hinton2015] and ensemble models [Dietterich2000] are known for their elusive readout. For example, while artificial neural network applications exist for tumor detection in CT scans [Anthimopoulos et al.2016], it is virtually impossible for a person to understand the rationale behind such a mathematical abstraction.
Interpretability is often loosely defined as understanding not only what a model emitted, but also why it did so [Gilpin et al.2018]. In this context, straightforward linguistic explanations are frequently considered the most interpretable, compared to examining the coefficients of a linear model or evaluating the importance of perceptrons in artificial neural networks [Morcos et al.2018]. Recent efforts towards interpretable machine learning models in healthcare can be found in the literature, such as the development of a boosting method to create decision trees as the combination of single decision nodes [Valdes et al.2016]. Bayesian Rule Lists (BRL) [Rudin, Letham, and Madigan2013, Letham et al.2013, Letham et al.2015]
mixes the interpretability of sequenced logical rules for categorical datasets with the inference power of Bayesian statistics. Compared to decision trees, BRL rule lists take the form of a hierarchical series of if-then-else statements, where model emissions correspond to the successful association of a sample with a given rule. BRL results in models that are inspired by, and therefore similar to, standard human-built decision-making algorithms.
While BRL is by itself an interesting model to try on clinical psychiatry datasets, it relies on the existence of a set of rules from which the actual rule lists are built, similar to the approach taken by other associative classification methods [Liu, Hsu, and Ma1998, Yin and Han2003, Li, Han, and Pei2001]. Frequent pattern mining has been a standard tool to build such an initial set of rules, with methods like Apriori [Agrawal and Srikant1994] and FP-Growth [Han, Pei, and Yin2000] being commonly used to extract rules from categorical datasets. However, frequent pattern mining methods do not scale well for wide datasets, i.e., datasets where the total number of categorical features p is much larger than the number of samples n, commonly denoted p ≫ n. Most clinical healthcare datasets are wide and thus require new mining methods to enable the use of BRL in this research area.
In this paper we propose a new rule mining technique that is not based on the frequency with which certain categories simultaneously appear. Instead, we use Multiple Correspondence Analysis (MCA) [Greenacre1984, Greenacre and Blasius2006], a particular application of correspondence analysis to categorical datasets, to establish a similarity score between different associative rules. We show that our new MCA-miner method is significantly faster than commonly used frequent pattern mining methods, and that it scales well to wide datasets. Moreover, we show that MCA-miner performs as well as other miners when used together with BRL. Finally, we use MCA-miner and BRL to analyze a transdiagnostic dataset for psychiatric disorders, building both interpretable and accurate predictors to support clinician screening tasks.
The remainder of this paper is organized as follows. In Sec. 2 we lay out the problem of constructing a rule-based classifier and establish the notation we use throughout the paper. We illustrate the algorithmic structure of MCA-miner in Sec. 3, and discuss the mathematical property of pruning rules based on coordinate scoring. We compare the performance of our new method against standard benchmark datasets in Sec. 4, and we study new data-driven screening models for psychiatric disorders in Sec. 5.
2 Problem Description and Definitions
We begin by introducing relevant notation and definitions used throughout this paper. An attribute, denoted a, is a categorical property of each data sample, which can take a discrete and finite number of values. A literal is a Boolean statement checking if an attribute takes a given value, e.g., given an attribute a with categorical values {v1, v2} we can define the following literals: “a is v1” and “a is v2”. Given a collection of m attributes, a data sample is a list of m categorical values, one per attribute. A rule, denoted r, is a collection of literals, with length len(r), which is used to produce Boolean evaluations of data samples as follows: a rule evaluates to True whenever all of its literals are True, and evaluates to False otherwise.
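The definitions above can be made concrete with a minimal Python sketch (the attribute names and values below are hypothetical illustrations, not data from the paper): a literal checks one attribute, and a rule is the conjunction of its literals.

```python
# Minimal sketch of literals and rules over categorical data samples.
# A sample maps attribute names to categorical values; a literal checks
# one attribute; a rule (list of literals) is True iff every literal is.

def make_literal(attribute, value):
    """Boolean statement: 'attribute is value'."""
    return lambda sample: sample[attribute] == value

def evaluate_rule(rule, sample):
    """A rule evaluates to True whenever all of its literals are True."""
    return all(literal(sample) for literal in rule)

# Hypothetical sample and rule.
sample = {"smoker": "yes", "age_group": "40-60"}
rule = [make_literal("smoker", "yes"), make_literal("age_group", "40-60")]
```

For instance, `evaluate_rule(rule, sample)` is True for the sample above, and becomes False as soon as any single literal fails.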
In this paper we consider the problem of efficiently building rule lists for datasets with a large total number of categories among all attributes, a common situation among datasets related to healthcare or pharmacology. Given n data samples over m attributes, we represent a dataset as a matrix X with dimensions n × m, where x_{i,j} is the category assigned to the i-th sample for the j-th attribute. We also consider a categorical label for each data sample, collectively represented as a vector y with length n. We denote the number of label categories by L, where L ≥ 2. If L = 2 then we are in the presence of a standard binary classification problem. If, instead, L > 2, then we solve a multi-class classification problem, as shown in Sec. 5.
2.1 Bayesian Rule Lists
Bayesian Rule Lists (BRL) is a framework proposed by Rudin et al. [Rudin, Letham, and Madigan2013, Letham et al.2013, Letham et al.2015] to build lists of rules for data sample classification. An example of a BRL output trained on the commonly used Titanic survival data set [Hendricks2015], as shown in [Letham et al.2015], is included in Fig. 1.
Although BRL is a significant step forward in the development of explainable AI methods, searching over the configuration space of all possible rules containing all possible combinations of literals obtained from a given data set is simply infeasible. Letham et al. [Letham et al.2015] offer a good compromise solution to this problem, where first a set of rules is mined from a data set, and then BRL searches over the configuration space of combinations of the prescribed set of rules using a custom-built MCMC algorithm.
While efficient rule mining methods, such as Apriori and FP-Growth, are available in the literature (see Sec. 1), we have found that such methods fail to execute on datasets with a large total number of categories, due to either unacceptably long computation times or prohibitively high memory usage. We show an example of this situation in Sec. 5.
In this paper we build upon the method in [Letham et al.2015], developing two improvements. First, we propose a novel rule mining algorithm based on Multiple Correspondence Analysis that is both computationally and memory efficient, enabling us to apply BRL to datasets with a large total number of categories. Our MCA-based rule mining algorithm is explained in detail in Sec. 3.
Second, we parallelized the MCMC search method in BRL by executing individual Markov chains in separate CPU cores. Moreover, we periodically check the convergence of the multiple chains using the generalized Gelman & Rubin convergence criterion [Brooks and Gelman1998, Gelman and Rubin1992], stopping the execution once the criterion is met. As shown in Fig. 4, our implementation is significantly faster than the original single-core version, enabling the study of more datasets with longer rules or a larger number of features.
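The convergence check described above can be sketched as follows. This is a simplified scalar version of the Gelman & Rubin potential scale reduction factor (our implementation uses the generalized multivariate form of [Brooks and Gelman1998]); each row is one chain's scalar summary per MCMC sample:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for m parallel chains of
    equal length n, following Gelman & Rubin (1992). Values close to 1
    indicate the chains have converged to the same distribution."""
    chains = np.asarray(chains, dtype=float)   # shape (m, n)
    m, n = chains.shape
    means = chains.mean(axis=1)
    B = n * means.var(ddof=1)                  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    # Pooled posterior variance estimate.
    var_plus = (n - 1) / n * W + B / n
    return float(np.sqrt(var_plus / W))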
3 MCA-based Rule Mining
Multiple Correspondence Analysis (MCA) is a method that applies the power of Correspondence Analysis (CA) to categorical datasets. For the purpose of this paper it is important to note that MCA is the application of CA to the indicator matrix of all categories in the set of attributes, thus generating principal vectors that project each of those categories into a Euclidean space. We use these principal vectors to build a heuristic merit function over the set of all available rules given the categories in a dataset. Moreover, the structure of our merit function enables us to efficiently mine the best rules, as detailed below.
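A compact sketch of this projection, under simplifying assumptions (plain SVD-based correspondence analysis of the one-hot indicator matrix, with none of the miner's optimizations), is:

```python
import numpy as np

def mca_category_vectors(X):
    """Project every category of a categorical data matrix into Euclidean
    space via correspondence analysis of its one-hot indicator matrix.
    X: (n_samples, n_attributes) array of category labels.
    Returns a dict mapping (attribute_index, category) -> principal vector.
    A simplified illustration of MCA, not the paper's implementation."""
    n, m = X.shape
    # Build the indicator (one-hot) matrix of all categories.
    columns, keys = [], []
    for j in range(m):
        for cat in np.unique(X[:, j]):
            columns.append((X[:, j] == cat).astype(float))
            keys.append((j, cat))
    Z = np.column_stack(columns)
    # Correspondence analysis of Z: center by row/column masses, scale, SVD.
    P = Z / Z.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    _, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the categories (columns of Z).
    G = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return {k: G[i] for i, k in enumerate(keys)}
```

Categories that always co-occur receive identical principal vectors, which is exactly the geometric similarity the merit function exploits.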
3.1 Rule Score Calculation
First, we define the extended data matrix by concatenating X and y, denoted D = [X, y], with dimensions n × (m + 1). We then compute the MCA principal vectors for each category present in D. We call the MCA principal vectors associated with each categorical value categorical vectors, denoted w_v for each value v of an attribute in the dataset, and we call the MCA principal vectors associated with label categories label vectors, denoted z_c for each label category c.
Since each category can be mapped to a literal statement, as explained in Sec. 2, these principal vectors serve as a heuristic to evaluate the quality of a given literal to predict a label, as suggested in [Zhu et al.2010]. Therefore, we compute a score between each categorical vector w_v and each label vector z_c as the cosine of their angle:
score(v, c) = ⟨w_v, z_c⟩ / (‖w_v‖ ‖z_c‖).
Note that, in the context of random variables, this score is equivalent to the correlation between the two principal vectors [Loève1977].
We compute the score between a rule r with literals {l_1, …, l_k} and a label category c, denoted score(r, c), as the average of the scores between the literals in r and the same label category, i.e.:
score(r, c) = (1/k) Σ_{i=1}^{k} score(l_i, c).
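These two scoring steps can be sketched directly (a minimal illustration; the vectors passed in would be the MCA principal vectors described above):

```python
import numpy as np

def cosine_score(w, z):
    """Score of a literal: cosine of the angle between its categorical
    vector w and a label vector z."""
    return float(np.dot(w, z) / (np.linalg.norm(w) * np.linalg.norm(z)))

def rule_score(literal_vectors, z):
    """Score of a rule: the average of its literals' scores against the
    same label vector z."""
    return sum(cosine_score(w, z) for w in literal_vectors) / len(literal_vectors)
```

A literal whose vector is aligned with a label vector scores near 1, an orthogonal one scores near 0, and the rule score simply averages its literals.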
Finally, we search the configuration space of rules built using combinations of all available literals in a dataset, up to a maximum rule length, and identify those with the highest scores for each label category. These top rules are the output of our miner, and are passed to the BRL method as the set of rules from which rule lists will be built.
3.2 Rule Pruning
Since the number of rules generated by all combinations of all available literals up to a maximum length is excessively large even for modest rule lengths, our miner includes two conditions under which we efficiently eliminate rules from consideration.
First, similar to the approach in FP-Growth [Han, Pei, and Yin2000] and other popular miners, we eliminate rules whose support over each label category is smaller than a user-defined threshold λ_supp. Recall that the support of a rule r for label category c, denoted supp(r, c), is the fraction of data samples on which the rule evaluates to True among the total number of data samples associated with that label. Note that the support of every rule r′ containing the collection of literals in r satisfies supp(r′, c) ≤ supp(r, c). Hence, once a rule fails to pass our minimum support test, we stop considering all longer rules that contain all the literals in r.
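The anti-monotonicity of support, which justifies discarding the whole branch of extensions, is easy to see in a small sketch (attribute names are hypothetical; `samples` would be the subset of samples sharing one label):

```python
def support(rule, samples):
    """supp(r, c): fraction of the samples of one label category on which
    the rule evaluates to True. rule: list of (attribute, value) pairs;
    samples: list of dicts mapping attributes to values."""
    hits = sum(all(s[a] == v for a, v in rule) for s in samples)
    return hits / len(samples)
```

Adding a literal can only remove matching samples, so if `support(rule, samples)` already falls below λ_supp, every extension of `rule` does too and need not be evaluated.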
Second, we eliminate rules whose score is smaller than a user-defined threshold λ_score. Suppose that we want to build a new rule r′ by taking a rule r of length k and adding a literal l. In that case, given a label category c, the score of this rule must satisfy:
score(r′, c) = (k · score(r, c) + score(l, c)) / (k + 1).
Let μ_c be the largest score for label category c among all available literals; then an extension of r can only have a score greater than λ_score if:
score(r, c) ≥ ((k + 1) · λ_score − μ_c) / k. (4)
Finally, given the maximum number M of rules to be mined per label, we recompute λ_score as we iterate combining literals to build new rules. Indeed, we periodically sort the scores of our temporary list of candidate rules and set λ_score equal to the score of the M-th rule in the sorted list. As λ_score increases due to better candidate rules becoming available, the condition in eq. (4) becomes more restrictive, resulting in fewer rules being considered and therefore in faster overall mining.
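The score-bound test and the tightening threshold can be sketched together (a simplified illustration of the mechanism; the class and function names are ours, not the paper's):

```python
import heapq

def can_extend(rule_score_val, k, mu, lambda_score):
    """Bound from eq. (4): a length-k rule can only reach an extension
    score above lambda_score if its own score is at least
    ((k + 1) * lambda_score - mu) / k, where mu is the best literal score."""
    return rule_score_val >= ((k + 1) * lambda_score - mu) / k

class TopMThreshold:
    """Keep the M best rule scores seen so far; the current M-th best
    score becomes the ever-tightening pruning threshold lambda_score."""
    def __init__(self, M):
        self.M, self.heap = M, []  # min-heap of the M best scores
    def update(self, score):
        if len(self.heap) < self.M:
            heapq.heappush(self.heap, score)
        elif score > self.heap[0]:
            heapq.heapreplace(self.heap, score)
    @property
    def lambda_score(self):
        return self.heap[0] if len(self.heap) == self.M else float("-inf")
```

Each newly scored candidate calls `update`; as `lambda_score` rises, `can_extend` rejects more partial rules before any extension is generated.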
4 Benchmark Experiments
[Table 1: per-dataset benchmark comparison of FP-Growth + BRL against MCA-miner + BRL.]
Our MCA-miner method in Fig. 2, when used together with BRL, offers the power of rule list interpretability while maintaining the predictive capabilities of already established machine learning methods.
We benchmark the performance and computational efficiency of our MCA-miner using the “Titanic” dataset [Hendricks2015], as well as the following 5 datasets available in the UCI Machine Learning Repository [Dheeru and Karra Taniskidou2017]: “Adult,” “Autism Screening Adult,” “Breast Cancer Wisconsin (Diagnostic),” “Heart Disease,” and “HIV-1 protease cleavage,” which due to space constraints we designate as Adult, ASD, Cancer, Heart, and HIV, respectively. These datasets represent a wide variety of real-world experiments and observations, enabling us to fairly compare our improvements against the original BRL implementation using the FP-Growth miner.
All 6 benchmark datasets correspond to binary classification tasks. We conduct the experiments using the same setup in each of the benchmarks. First, we transform the dataset into a format that is compatible with our BRL implementation. Second, we quantize all continuous attributes into either 2 or 3 categories, while keeping the original categories of all other variables. Depending on the dataset and how its data was originally collected, we prioritized the existing taxonomy and expert domain knowledge to quantize the continuous variables, and simply generated a balanced quantization when no other information was available. Third, we train and test a model using 5-fold cross-validation, reporting the average accuracy and Area Under the ROC Curve (AUC) as model performance measurements.
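The balanced quantization used as the fallback can be sketched as quantile-based binning (a minimal illustration of the idea, not the exact preprocessing code):

```python
import numpy as np

def balanced_quantize(values, n_bins=3):
    """Quantize a continuous attribute into n_bins categories with roughly
    equal counts, used when no expert taxonomy is available. Returns
    integer category codes 0..n_bins-1."""
    values = np.asarray(list(values), dtype=float)
    # Cut points at the empirical quantiles -> balanced bins.
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, values, side="right")
```

For example, quantizing nine evenly spread values into 3 bins yields three categories of three samples each, which is exactly the "balanced" behavior described above.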
Table 1 presents the empirical result of comparing both implementations.
The notation in the table follows the definitions in Sec. 2.
To guarantee a fair comparison between both implementations, we fixed the maximum rule length and minimum support parameters to the same values for both methods; for MCA-miner we additionally fixed the score threshold and the maximum number of mined rules per label.
Our multi-core implementations of both MCA-miner and BRL were executed on 6 parallel processes, and only stopped once the Gelman & Rubin parameter [Brooks and Gelman1998] satisfied the convergence criterion.
We ran all the experiments using a single AWS EC2 c5.18xlarge instance with 72 cores.
It is clear from our experiments in Table 1 that our MCA-miner matches the performance of FP-Growth in each case, while significantly reducing the computation time required to mine rules and train a BRL model.
5 Transdiagnostic Screen for Mental Health
The Consortium for Neuropsychiatric Phenomics (CNP) [Poldrack et al.2016] is a research project aimed at understanding shared and distinct neurobiological characteristics among multiple diagnostically distinct patient populations. Four groups of subjects are included in the study: healthy controls (HC), Schizophrenia patients (SCHZ), Bipolar Disorder patients (BD), and Attention Deficit and Hyperactivity Disorder patients (ADHD). Our goal in analyzing the CNP dataset is to develop interpretable and effective screening tools to identify the diagnosis of these three psychiatric disorders in patients.
5.1 CNP Self-Reported Instruments Dataset
Among other data modalities, the CNP study includes each subject's responses to 578 individual questions, belonging to 13 self-report clinical questionnaires [Poldrack et al.2016]. The 578 questions generate a large total number of categories. The 13 questionnaires are the following (in alphabetical order):
Adult ADHD Self-Report Screener (ASRS),
Barratt Impulsiveness Scale (Barratt),
Chapman Perceptual Aberration Scale (ChapPer),
Chapman Physical Anhedonia Scale (ChapPhy),
Chapman Social Anhedonia Scale (ChapSoc),
Dickman Functional and Dysfunctional Impulsivity Inventory (Dickman),
Eysenck’s Impulsivity Inventory (Eysenck),
Golden & Meehl’s 7 MMPI Items Selected by Taxonomic Method (Golden),
Hopkins Symptom Check List (Hopkins),
Hypomanic Personality Scale (Hypomanic),
Multidimensional Personality Questionnaire – Control Subscale (MPQ),
Temperament and Character Inventory (TCI), and
Scale for Traits that Increase Risk for Bipolar II Disorder (BipolarII).
The details of these questionnaires are beyond the scope of this paper; due to space constraints we abbreviate the individual questions using the name in parentheses in the list above together with the question number. For example, Hopkins#57 denotes the 57-th question in the “Hopkins Symptom Check List” questionnaire.
Depending on the particular clinical questionnaire, each question results in either a binary answer (i.e., True or False) or an integer rating (e.g., from 1 to 5). We use each question as a categorical attribute, resulting in a range from 2 to 5 categories per attribute.
5.2 Performance Benchmark
Rather than pruning the number of attributes a priori to reduce the search space for both the rule miner and BRL, we applied our novel MCA-miner to identify the best rules over the complete search space of literal combinations. Note that this is a challenging problem for most machine learning algorithms, since this is a wide dataset with more features than samples. Indeed, just generating all rules with 3 literals from this dataset results in approximately 23 million rules. Fig. 3 compares the wall execution time of our MCA-miner against three popular associative mining methods: FP-Growth, Apriori, and Carpenter, all using the implementation in the PyFIM package [Borgelt2012]. As shown in Fig. 3, while the associative mining methods are reasonably efficient on datasets with few features, they are incapable of handling more than roughly 70 features from the CNP dataset, resulting in out-of-memory errors or impractically long executions even for large-scale compute-optimized AWS EC2 instances. In comparison, MCA-miner empirically exhibits a growth rate compatible with datasets much larger than CNP, as it runs many orders of magnitude faster than the associative mining methods. It is worth noting that while FP-Growth is reported as the fastest associative mining method in [Borgelt2012], its scaling behavior vs. the number of attributes is practically the same as Apriori's in our experiments.
In addition to the increased performance due to MCA-miner, we also improved the implementation of the BRL training MCMC algorithm by running parallel Markov chains simultaneously in different CPU cores, as explained in Sec. 2.1. Fig. 4 shows the BRL training time comparison, given the same rule set and both using 6 chains, between our multi-core implementation and the original single-core implementation reported in [Letham et al.2015]. Also, Fig. 5 shows that the convergence wall time of the multi-core implementation scales linearly with the number of Markov chains. While both implementations display a similar growth rate as the rule set size increases, our multi-core implementation is roughly 3 times faster in this experiment.
5.3 Interpretable Transdiagnostic Classifiers
In the interest of building the best possible transdiagnostic screening tool for the three types of psychiatric patients present in the CNP dataset, we build three different classifiers. First, we build a binary classifier to separate HC from the set of Patients, defined as the union of SCHZ, BD, and ADHD subjects. Second, we build a multi-class classifier to directly separate all four original categorical labels available in the dataset. Finally, we evaluate the performance of the multi-class classifier by repeating the binary classification task and comparing the results. In addition to using accuracy and AUC as performance metrics, as in Sec. 4, we also report Cohen's κ coefficient [Cohen1960] as another indication of the effect size of our classifiers. Cohen's κ is compatible with both binary and multi-class classifiers, and ranges from -1 (complete misclassification) to 1 (perfect classification), with 0 corresponding to a chance classifier. To avoid a biased precision calculation, we sub-sample the dataset to balance out the labels, keeping the same number of subjects for each of the four classes. Finally, we use 5-fold cross-validation to ensure the robustness of our training and testing methodology.
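Cohen's κ compares observed agreement against the agreement expected by chance from the label marginals, which is why it works unchanged for both the binary and the multi-class task. A minimal sketch (equivalent in spirit to library implementations such as scikit-learn's `cohen_kappa_score`):

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement.
    -1 = complete misclassification, 0 = chance, 1 = perfect agreement.
    Assumes the labels are not all identical (p_exp < 1)."""
    n = len(y_true)
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the marginal label frequencies.
    p_exp = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    return (p_obs - p_exp) / (1 - p_exp)
```

Perfect predictions give κ = 1, while a classifier that merely reproduces the label marginals gives κ = 0, regardless of the number of classes.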
[Table 2: binary classifier comparison; the MCA-miner + BRL row reports 0.79, 0.82, and 0.58 for accuracy, AUC, and Cohen's κ, respectively.]
Besides using our proposed MCA-miner together with BRL to build an interpretable rule list, we also benchmark its performance against other commonly used machine learning algorithms compatible with categorical data, which we applied using the Scikit-learn [Pedregosa et al.2011] implementations and default parameters. As shown in Table 2, our method is statistically as good as, if not better than, the other methods we compared against.
The rule list generated using MCA-miner and BRL is shown in Fig. 7. Also, a breakdown analysis of the number of subjects being classified per rule in the list is shown in Fig. 7. The detailed description of the questions in Fig. 7 is shown in Table 3. Note that most of the subjects are classified with a high probability in the top two rules, which is a very useful feature in situations where fast clinical screening is required.
Fig. 9 shows the output rule list after training a BRL model using all 4 labels in the CNP dataset, as explained above. Note that each rule in Fig. 9 emits the maximum likelihood label of the multinomial distribution generated by that rule in the BRL model, since this is the most useful output for practical clinical use. We evaluate the accuracy and Cohen's κ of our MCA-miner with BRL classifier using 5-fold cross-validation. Fig. 10 shows the average confusion matrix for the multi-class classifier over all 5 cross-validation testing cohorts. The actual questions referenced in the rule list in Fig. 9 are shown in detail in Table 3.
| Question | Statement | Answer range |
| --- | --- | --- |
| Barratt#12 | I am a careful thinker | 1 (rarely) to 4 (almost always) |
| BipolarII#1 | My mood often changes, from happiness to sadness, without my knowing why | Boolean |
| BipolarII#2 | I have frequent ups and downs in mood, with and without apparent cause | Boolean |
| ChapSoc#9 | I sometimes become deeply attached to people I spend a lot of time with | Boolean |
| ChapSoc#13 | My emotional responses seem very different from those of other people | Boolean |
| Dickman#22 | I don’t like to do things quickly, even when I am doing something that is not very difficult | Boolean |
| Dickman#28 | I often get into trouble because I don’t think before I act | Boolean |
| Dickman#29 | I have more curiosity than most people | Boolean |
| Eysenck#1 | Weakness in parts of your body | Boolean |
| Golden#1 | I have not lived the right kind of life | Boolean |
| Hopkins#39 | Heart pounding or racing | 0 (not at all) to 3 (extremely) |
| Hopkins#56 | Weakness in parts of your body | 0 (not at all) to 3 (extremely) |
| Hypomanic#1 | I consider myself to be an average kind of person | Boolean |
| Hypomanic#8 | There are often times when I am so restless that it is impossible for me to sit still | Boolean |
| TCI#231 | I usually stay away from social situations where I would have to meet strangers, even if I am assured that they will be friendly | Boolean |
The interpretability and transparency of the rule list in Fig. 9 enables us to obtain further insights regarding the population in the CNP dataset. Indeed, similar to the binary classifier, Fig. 9 shows the mapping of all CNP subjects using the 4-class rule list. While the accuracy of the rule list as a multi-class classifier is not perfect, it is worth noting how just 7 questions out of a total of 578 are enough to produce a relatively balanced output among the rules, while significantly separating the label categories.
Also note that even though each of the 13 questionnaires in the dataset has been thoroughly tested in the literature as a clinical instrument to detect and evaluate different traits and behaviors, the 7 questions picked by our rule list do not favor any questionnaire in particular. This is an indication that transdiagnostic classifiers are better obtained from different sources of data, and will likely improve their performance as other modalities, such as mobile digital inputs, are included in the dataset.
Binary classification using multi-class rule list
We further evaluate the performance of the multi-class classifier in Fig. 9 by using it as a binary classifier, i.e., we replace the ADHD, BD, and SCHZ labels with Patients. Using the same 5-fold cross-validated models obtained in the multi-class section above, we compute their performance as binary classifiers, obtaining accuracy, AUC, and Cohen's κ values on par with those in Table 2, showing that our method does not lose performance by adding more categorical labels.
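The label collapsing used in this evaluation is a one-line mapping (a sketch; the label strings are illustrative placeholders for the dataset's encoding):

```python
def to_binary(labels, healthy="HC", patient="Patient"):
    """Collapse the 4-class labels into the binary screening task: every
    patient diagnosis (ADHD, BD, SCHZ) maps onto a single 'Patient' label,
    while healthy controls keep their label."""
    return [y if y == healthy else patient for y in labels]
```

Applying `to_binary` to both the true labels and the multi-class predictions lets the same trained models be scored directly against the binary classifiers of Table 2.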
In this paper we propose a novel methodology to analyze categorical datasets with a large number of attributes, a situation prevalent in the clinical psychiatry community. Our contributions consist of a novel MCA-based rule mining method with excellent scaling properties with respect to the number of categorical attributes, and a new implementation of the BRL algorithm using multi-core parallel execution. We then study the CNP dataset for psychiatric disorders using our new methodology, resulting in rule-based interpretable classifiers capable of screening patients from self-reported questionnaire data. Our results show not only the viability of building interpretable models for state-of-the-art clinical psychiatry datasets, but also that these models can be scaled to larger datasets to understand the interactions and differences between these disorders.
- [Agrawal and Srikant1994] Agrawal, R., and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, 487–499.
- [Anthimopoulos et al.2016] Anthimopoulos, M.; Christodoulidis, S.; Ebner, L.; Christe, A.; and Mougiakakou, S. 2016. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Transactions on Medical Imaging 35(5):1207–1216.
- [Beam and Kohane2018] Beam, A. L., and Kohane, I. S. 2018. Big data and machine learning in health care. JAMA 319(13):1317–1318.
- [Borgelt2012] Borgelt, C. 2012. Frequent item set mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(6):437–456.
- [Brooks and Gelman1998] Brooks, S. P., and Gelman, A. 1998. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7(4):434–455.
- [Campolo et al.2017] Campolo, A.; Sanfilippo, M.; Whittaker, M.; and Crawford, K. 2017. AI Now 2017 report. AI Now Institute at New York University.
- [Cohen1960] Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1):37–46.
- [Dheeru and Karra Taniskidou2017] Dheeru, D., and Karra Taniskidou, E. 2017. UCI machine learning repository.
- [Dietterich2000] Dietterich, T. G. 2000. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning 40(2):139–157.
- [Gelman and Rubin1992] Gelman, A., and Rubin, D. B. 1992. Inference from iterative simulation using multiple sequences. Statistical Science 7(4):457–472.
- [Gilpin et al.2018] Gilpin, L. H.; Bau, D.; Yuan, B. Z.; Bajwa, A.; Specter, M.; and Kagal, L. 2018. Explaining explanations: An approach to evaluating interpretability of machine learning. ArXiv Preprints.
- [Greenacre and Blasius2006] Greenacre, M. J., and Blasius, J. 2006. Multiple correspondence analysis and related methods. Chapman & Hall/CRC.
- [Greenacre1984] Greenacre, M. J. 1984. Theory and Applications of Correspondence Analysis. Academic Press.
- [Gunning2017] Gunning, D. 2017. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA).
- [Han, Pei, and Yin2000] Han, J.; Pei, J.; and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 1–12.
- [Hendricks2015] Hendricks, P. 2015. titanic: Titanic Passenger Survival Data Set. R package version 0.1.0.
- [Hopfield1988] Hopfield, J. J. 1988. Artificial neural networks. IEEE Circuits and Devices Magazine 4(5):3–10.
- [LeCun, Bengio, and Hinton2015] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436.
- [Letham et al.2013] Letham, B.; Rudin, C.; McCormick, T. H.; and Madigan, D. 2013. An interpretable stroke prediction model using rules and bayesian analysis. In Proceedings of the 17th AAAI Conference on Late-Breaking Developments in the Field of Artificial Intelligence, 65–67.
- [Letham et al.2015] Letham, B.; Rudin, C.; McCormick, T. H.; and Madigan, D. 2015. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics 9(3):1350–1371.
- [Li, Han, and Pei2001] Li, W.; Han, J.; and Pei, J. 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proceedings of the 2001 IEEE International Conference on Data Mining, 369–376.
- [Liu, Hsu, and Ma1998] Liu, B.; Hsu, W.; and Ma, Y. 1998. Integrating classification and association rule mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 80–86.
- [Loève1977] Loève, M. 1977. Probability Theory I. Number 45 in Graduate Texts in Mathematics. Springer.
- [Morcos et al.2018] Morcos, A. S.; Barrett, D. G.; Rabinowitz, N. C.; and Botvinick, M. 2018. On the importance of single directions for generalization. ArXiv Preprints.
- [Pedregosa et al.2011] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
- [Poldrack et al.2016] Poldrack, R. A.; Congdon, E.; Triplett, W.; Gorgolewski, K. J.; Karlsgodt, K. H.; Mumford, J. A.; Sabb, F. W.; Freimer, N. B.; London, E. D.; Cannon, T. D.; and Bilder, R. M. 2016. A phenome-wide examination of neural and cognitive function. Scientific Data 3:160110.
- [Rudin, Letham, and Madigan2013] Rudin, C.; Letham, B.; and Madigan, D. 2013. Learning theory analysis for association rules and sequential event prediction. Journal of Machine Learning Research 14:3441–3492.
- [Valdes et al.2016] Valdes, G.; Luna, J. M.; Eaton, E.; Simone II, C. B.; Ungar, L. H.; and Solberg, T. D. 2016. MediBoost: a patient stratification tool for interpretable decision making in the era of precision medicine. Scientific Reports 6:37854.
- [Yin and Han2003] Yin, X., and Han, J. 2003. CPAR: Classification based on predictive association rules. In Proceedings of the 2003 SIAM International Conference on Data Mining, 331–335.
- [Zhu et al.2010] Zhu, Q.; Lin, L.; Shyu, M.-L.; and Chen, S.-C. 2010. Feature selection using correlation and reliability based scoring metric for video semantic detection. In Proceedings of the IEEE Fourth International Conference on Semantic Computing, 462–469.