Automated Supervised Feature Selection for Differentiated Patterns of Care

by Catherine Wanjiru, et al.
Carnegie Mellon University

An automated feature selection pipeline was developed using several state-of-the-art feature selection techniques to select optimal features for Differentiating Patterns of Care (DPOC). The pipeline included three types of feature selection techniques (Filters, Wrappers, and Embedded methods) to select the top K features. Five different datasets with binary dependent variables were used, and the top K optimal features were selected for each. The selected features were tested in the existing multi-dimensional subset scanning (MDSS) pipeline, where the most anomalous subpopulations, most anomalous subsets, propensity scores, and measures of effect were recorded to assess their performance. This performance was compared with the same four metrics obtained when using all covariates in the dataset in the MDSS pipeline. We found that, regardless of the feature selection technique used, the data distribution is a key consideration when choosing a technique.




1 Introduction

Healthcare is characterized by large differences in disease patterns, patient response to interventions, and cost of care across patient subpopulations Appleby et al. (2011); Krumholz (2013); Senn (2016). Understanding such non-random variations in care delivery and costs is complicated by the lack of appropriate approaches for analyzing complex interactions of factors and interventions captured in large-scale real-world evidence data. Conventional stratification and subgroup analysis approaches are highly manual, require domain expertise, and, more importantly, limit stakeholders to analyzing only a few variables, beyond which the analysis becomes computationally infeasible. Furthermore, these approaches lack a ‘data-driven knowledge discovery’ aspect, as investigators must suggest beforehand which variables they would like to use in their analyses. Additionally, even though supervised machine learning approaches can be used to investigate variability in care, these approaches are either subject to several modeling assumptions and limitations or lack adequate interpretability McFowland III et al. (2018).

Fortunately, state-of-the-art subset scanning techniques from the anomalous pattern detection literature McFowland III et al. (2018); Neill (2012); Zhang and Neill (2016); Somanchi and Neill (2017) can be leveraged to enable principled, scalable, and unsupervised discovery of specific segments of a patient subpopulation that are anomalous. For example, our research team has developed a suite of functionalities for discovering and analyzing variations of care through automatic stratification and subgroup analysis across any arbitrary combination of features and interventions captured in observational healthcare data Ogallo et al. (2021). These subset scanning methods focus on identifying anomalous subsets of records in a multidimensional array that differ from expected behavior. The functionalities can be used to discover and analyze non-obvious segments of patient populations most significantly impacted by disease outcomes, costs of care, and related interventions across different application areas. They address the limitations of conventional stratification and subgroup analysis approaches by enabling a fast and efficient search for the most anomalous subsets across exponentially many possible subsets of records in a dataset. Anomalousness is quantified using a scoring function, such as a log-likelihood ratio statistic, that is maximized over all subsets in a dataset to identify the subset with the highest score Neill (2012). The scoring function exploits a mathematical property called Linear-Time Subset Scanning (LTSS) that allows the search to be conducted in linear time rather than exponential time Neill (2012). Furthermore, to adjust for multiple hypothesis testing and estimate the statistical significance of identified anomalous subsets, randomization testing with parametric bootstrapping is used to estimate the empirical p-value of the subset McFowland et al. (2013).

Currently, however, subset scanning techniques primarily rely on manual, domain-expert-driven feature selection before anomalous pattern detection. This limits the utility of the techniques in large-scale datasets. Fortunately, several automated feature selection methods Miao and Niu (2016) already exist in the literature and can be leveraged to potentially improve the downstream efficiency and interpretability of anomalous pattern detection tasks. However, there is a dearth of literature on the application of these feature selection techniques in anomalous pattern detection, and the potential biases they introduce remain unknown.

The overarching goal of this study was to evaluate scalable supervised techniques for the automated selection of features for subsequent use in anomalous pattern detection. The specific objectives were twofold. First, we aimed to automatically identify the top-K candidate features to use for automated stratification and subgroup analysis given observational health data with an arbitrary number of features. Second, we aimed to compare the characteristics of the most anomalous subsets identified by subset scanning after applying different automated feature selection approaches. To this end, we developed a pipeline combining state-of-the-art supervised feature selection techniques with anomalous pattern detection and applied it to exemplar datasets. This paper reports our findings.

2 Related Work

This study combines feature selection with subset scanning techniques from the anomalous pattern detection literature. Feature selection is a dimensionality reduction technique that can be used to remove irrelevant, duplicate, or label-leaking features from a large set of features Kumar (2015). Feature selection can be either supervised or unsupervised. Supervised feature selection techniques include Filters, Wrappers, and Embedded techniques Miao and Niu (2016), while unsupervised techniques include methods such as Principal Component Analysis Kumar (2015). In this study, Filters, Wrappers, and Embedded techniques are applied. Filters are used to remove non-useful features before modeling. This can be viewed as a preprocessing step in which the relationships among features, as well as between features and the outcome variable, are evaluated to pre-select features independently of any modeling algorithm Molina et al. (2002). In this study, we use measures such as the Pearson correlation coefficient Benesty et al. (2009), the Variance Inflation Factor O’brien (2007), the Chi-Square test McHugh (2013), and Normalized Mutual Information Estévez et al. (2009) for pre-selecting features.

Wrapper methods use specific machine learning models to evaluate and select features Miao and Niu (2016). They measure the usefulness of features by learning a stepwise linear classifier or regressor using a specific variable selection method, such as forward selection or backward elimination, and proceeding recursively until a stopping rule is reached. In this study, we use ordinary least squares regression with backward elimination to select the optimal features. Finally, Embedded methods are feature selection techniques in which a feature importance ranking is a by-product of model fitting Molina et al. (2002); Miao and Niu (2016). In embedded techniques, the top K features can be selected from the ranking list returned by the model.

There are several subset scanning techniques in the anomalous pattern detection literature, including Bias Scan Zhang and Neill (2016), treatment effect subset scanning (TESS) McFowland III et al. (2018), and anomalous patterns of care (APC) Scan Somanchi and Neill (2017). Our study specifically builds upon Bias Scan Zhang and Neill (2016), an algorithm designed for the discovery of the subpopulation with the most divergence between the true outcomes and the predicted probabilities of a binary classifier. Bias Scan analyzes tabular data with discrete or discretized covariates, a binary outcome, and predictions generated by a binary classifier. It maximizes a Bernoulli likelihood ratio scoring statistic over the exponentially many subsets of records in the data to identify the subset of records with the strongest evidence that the expected odds of the outcome differ from the predicted odds. To do so in linear time, Bias Scan exploits a priority function McFowland III et al. (2018) that ranks the values of a given feature and then selects the highest-scoring subset as the subset consisting of the “top-k” priority values for some k. In this study, we use a simplified version of Bias Scan that replaces the predictions of a classifier with a simple mean of the outcome across all records in the data.

3 Methods

Fig. 1 illustrates our proposed pipeline for combining feature selection and differentiated pattern detection. The feature selection component takes as input processed data and user-specified stopping criteria (e.g. number of desired top K features) and uses different approaches to automatically select features from the supplied data. The differentiated pattern detection component scans over features selected by any of the different feature selection approaches to identify the highest-scoring (most anomalous) subsets. Next, we formulate the problem mathematically and provide detailed descriptions of each component of the pipeline.

Figure 1: Pipeline combining feature selection and differentiated pattern detection.

3.1 Problem Formulation

Let the dataset contain a number of samples, where each sample is characterised by a set of features and an outcome label. The proposed automated feature selection process is defined as a function that takes the dataset and the required number of features, K, as input and provides a reduced dataset described by the K selected features, where K does not exceed the total number of features and the selected features are a subset of the original feature set. The differentiated pattern detection process is then defined as a function that takes the selected features as input and provides a subset of samples that are anomalous. This anomalous subset is a subset of the dataset and is represented by the identified anomalous features, together with the value(s) of each such feature that make the subset anomalous. The divergence of the identified subgroups can be evaluated based on the anomalous score and the odds ratio.

In this work, several variants of the feature selection function are employed to select features that can be fed into the differentiated pattern detection function. These include Filter and Wrapper techniques and Embedded techniques, as described below.

3.2 Filters and Wrapper Techniques

The first approach in our feature selection pipeline involves using filters to pre-select features and then applying stepwise feature elimination via ordinary least squares regression, as illustrated in Fig. 1. A filter-based feature selection technique takes the set of input features and provides the required number of features by first applying statistical methods to encode pairwise feature relations, and then quantifying feature-outcome relations. Different statistical measures are applied depending on the type of features. To this end, the input features are categorised as either continuous or categorical. For continuous features, we use the Pearson correlation coefficient Benesty et al. (2009) to measure the strength of the linear association between a pair of continuous features. If the correlation is above a given threshold, we compute the correlation between each feature in the pair and the outcome and select the one with the higher correlation. We also use the Variance Inflation Factor O’brien (2007) to measure the degree of multi-collinearity between a continuous feature and the other continuous features. For categorical features, we use the Chi-Square test McHugh (2013) to measure the significance of the association between two features. Subsequently, we use Cramér’s V Akoglu (2018) to evaluate the strength of association between the two variables, and if it is above a given threshold, we use Mutual Information to select the feature that has more mutual dependence with the target variable. Consequently, we obtain the sets of continuous and categorical features that satisfy the thresholds related to the statistical significance and strength of the relationships between features.
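The continuous-feature part of the filter step can be illustrated with a minimal sketch. It covers only the Pearson correlation rule (the Variance Inflation Factor and the categorical-feature tests are omitted), and the function names, the toy data, and the 0.8 threshold are assumptions made for illustration, not values taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear association
    return cov / math.sqrt(var_x * var_y)


def correlation_filter(features, outcome, threshold=0.8):
    """Drop one feature from every highly correlated pair of features.

    features:  dict mapping feature name -> list of numeric values
    outcome:   list of numeric (e.g. 0/1) outcome values
    threshold: hypothetical cut-off on |r| (an assumption, not from the paper)
    """
    names = list(features)
    kept = list(features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a not in kept or b not in kept:
                continue  # one of the two was already dropped
            if abs(pearson(features[a], features[b])) > threshold:
                # Keep whichever feature is more correlated with the outcome.
                r_a = abs(pearson(features[a], outcome))
                r_b = abs(pearson(features[b], outcome))
                kept.remove(b if r_a >= r_b else a)
    return kept


# Toy data: mean_bp is almost a linear copy of sys_bp, temp_c is unrelated.
toy = {
    "sys_bp": [120, 130, 140, 150, 160, 170],
    "mean_bp": [90, 98, 105, 113, 120, 128],
    "temp_c": [36.5, 38.0, 36.9, 37.2, 39.1, 36.6],
}
y = [0, 0, 0, 1, 1, 1]
selected = correlation_filter(toy, y)
print(selected)
```

In this toy example only one of the two blood-pressure columns survives, while the weakly correlated temperature column is retained.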

Finally, we apply a wrapper method that uses backward elimination to obtain the required features from the filtered continuous and categorical feature sets. Specifically, the wrapper technique fits a linear model (e.g., Ordinary Least Squares) and recursively drops the least significant features until a stopping criterion is met (i.e., until only the required number of features remains).
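A minimal sketch of backward elimination with ordinary least squares is given below, assuming NumPy is available. It drops the feature with the smallest absolute t-statistic at each step, which is equivalent to dropping the feature with the largest p-value because the degrees of freedom are the same for all coefficients in a given fit. The function names and the synthetic data are illustrative assumptions; the study's own implementation and stopping thresholds are not described in enough detail to reproduce here.

```python
import numpy as np


def ols_t_statistics(X, y):
    """t-statistics of the slope coefficients of an OLS fit with an intercept."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n - design.shape[1]
    sigma2 = residuals @ residuals / dof
    cov = sigma2 * np.linalg.pinv(design.T @ design)
    std_err = np.sqrt(np.diag(cov))
    return (beta / std_err)[1:]  # skip the intercept


def backward_elimination(X, y, names, k):
    """Refit OLS repeatedly, dropping the least significant feature until k remain."""
    cols = list(range(X.shape[1]))
    while len(cols) > k:
        t = ols_t_statistics(X[:, cols], y)
        weakest = int(np.argmin(np.abs(t)))  # smallest |t| = largest p-value
        del cols[weakest]
    return [names[c] for c in cols]


# Synthetic data (an assumption for illustration): a binary outcome that
# depends only on x0 and x2; the other three columns are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
signal = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=200)
y_bin = (signal > 0).astype(float)
chosen = backward_elimination(X, y_bin, ["x0", "x1", "x2", "x3", "x4"], k=2)
print(chosen)
```

With a binary outcome, the OLS fit acts as a linear probability model; it is used here only to rank features, not to produce calibrated predictions.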

3.3 Embedded Techniques

The second approach of our feature selection pipeline (see Fig. 1) is an embedded technique. This approach employs tree-based classifiers, i.e., Catboost Dorogush et al. (2018) and XGBoost Chen and Guestrin (2016), which have the ability to identify duplicate features in the feature set and drop them without the need for filters. A user may choose in advance to use one of these classifiers or a combination of both. After the classifiers are trained on the dataset, feature importances are extracted from each classifier, resulting in one importance ranking from Catboost and one from XGBoost. The top K features can then be extracted by sorting either ranking in descending order. Where a combination of the two classifiers is employed, a committee vote is applied to merge the two rankings into a unified ranking. To this end, MinMax normalization is applied to each model-specific ranking, which rescales each model's importance values to the range [0, 1]. After normalisation, the unified ranking is obtained by averaging the two normalised rankings, and the top K features are extracted by sorting it in descending order.
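The committee vote can be sketched in a few lines, assuming each classifier exposes a per-feature importance score. The importance values and feature names below are made-up illustrative numbers, not outputs of the fitted Catboost or XGBoost models, and the function names are hypothetical.

```python
def min_max_normalise(scores):
    """Rescale a dict of importance scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}  # all scores identical
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


def committee_vote(importance_a, importance_b, k):
    """Average two normalised importance rankings and return the top k features."""
    norm_a = min_max_normalise(importance_a)
    norm_b = min_max_normalise(importance_b)
    combined = {name: (norm_a[name] + norm_b[name]) / 2 for name in norm_a}
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:k]


# Hypothetical importances on different scales for the same four features.
xgb_importance = {"spo2_mean": 0.30, "temp_mean": 0.25, "age": 0.05, "ethnicity": 0.10}
cat_importance = {"spo2_mean": 20.0, "temp_mean": 5.0, "age": 10.0, "ethnicity": 30.0}
top = committee_vote(xgb_importance, cat_importance, k=2)
print(top)
```

Normalising before averaging matters because the two libraries report importances on different scales; without it, the model with the larger raw values would dominate the unified ranking.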

3.4 Multi-Dimensional Subset Scanning (MDSS)

We employ Multi-Dimensional Subset Scanning (MDSS) Neill (2012); McFowland III et al. (2018) from the anomalous pattern detection literature to validate the above automated feature selection techniques for identifying differentiated patterns of care. Subset scanning can be posed as a search problem over possible subsets in a multi-dimensional array; it aims to identify differentiated (anomalous) subsets by searching across all possible combinations of the covariates obtained from the previous selection stages. Automatic stratification (AutoStrat) is a version of MDSS in which the deviation between the average outcome in the dataset, $p$, and the outcome of each sample, $y_i$, is evaluated by maximizing a Bernoulli likelihood ratio scoring statistic, $\Gamma(\cdot)$. The null hypothesis assumes that the odds of the outcome in each sample are similar to the expected odds, i.e., $H_0: \mathrm{odds}(y_i) = \frac{p}{1-p}$, while the alternative hypothesis assumes a constant multiplicative increase in the outcome odds for some given subgroup, $H_1: \mathrm{odds}(y_i) = q \frac{p}{1-p}$, where $q > 1$. The scoring function for a subset $S$, which contains $N_S$ samples, is computed as:

$$\Gamma(S) = \max_{q > 1} \left[ \log(q) \sum_{i \in S} y_i - N_S \log(1 - p + q p) \right]$$

Consequently, subsets in which the average of $y_i$ is greater than $p$ will have higher scores. Subset selection is iterated until it converges to a local maximum, and multiple random restarts are used to approach the global maximum. Once the differentiated subset of samples is identified using MDSS, an empirical p-value is computed (via randomisation testing) to evaluate the significance of the differentiation. Moreover, subset characterisation is conducted to provide an interpretation of the anomalous features and the anomalous subset.
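As a hedged illustration, the Bernoulli likelihood ratio score of a single candidate subset can be computed in closed form when the expected outcome rate is a constant, as in the simplified Bias Scan used here. Writing the expected rate as p and the multiplicative odds increase as q (symbol names chosen for this sketch), the maximising q is the ratio of the observed odds in the subset to the expected odds. The function name and return convention are assumptions.

```python
import math


def bernoulli_score(outcomes, expected_rate):
    """Score one subset: max over q > 1 of  Y*log(q) - N*log(1 - p + q*p).

    outcomes:      list of 0/1 outcomes for the records in the subset
    expected_rate: p, the expected outcome probability, with 0 < p < 1
    Returns (score, q), where q is the maximising odds multiplier.
    """
    n = len(outcomes)
    positives = sum(outcomes)
    if n == 0 or positives == 0:
        return 0.0, 1.0
    if positives == n:
        # q grows without bound; the score converges to -N * log(p).
        return -n * math.log(expected_rate), math.inf
    observed_odds = positives / (n - positives)
    expected_odds = expected_rate / (1 - expected_rate)
    q = observed_odds / expected_odds
    if q <= 1:
        return 0.0, 1.0  # no increase in odds, so the null fits best
    score = positives * math.log(q) - n * math.log(1 - expected_rate + q * expected_rate)
    return score, q


# Subset with 3 deaths among 4 patients when the population rate is 20%.
score, q = bernoulli_score([1, 1, 0, 1], expected_rate=0.2)
print(round(score, 4), round(q, 2))
```

A subset whose observed rate does not exceed the expected rate receives a score of zero, which matches the one-sided alternative hypothesis of increased odds.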

4 Experiments and Results

4.1 Dataset Used, Cohort, Covariates, and Outcomes

The dataset used in this project was the MIMIC-III (‘Medical Information Mart for Intensive Care’) dataset Johnson et al. (2016). MIMIC-III is a freely accessible critical care dataset recording vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and other data collected from over 46,000 Intensive Care Unit (ICU) patients. We selected a study cohort of adult patients (16 years or older) who were admitted to the ICU for the first time, where the length of stay was greater than 1 day, with no hospital readmissions, no surgical cases, and at least one chart event. The final cohort consisted of 18,761 patients. We constructed 42 covariates based on observations made in the first 24 hours of ICU admission. These included 15 numerical features, 22 binary features, and 5 nominal features. We defined the target outcome as a binary indicator variable equal to 1 for patients who died within 28 days of the onset of their ICU admission, and 0 otherwise.

4.2 Feature Selection Process

We used four different techniques in our feature selection pipeline: a Filter+Wrapper method, an XGBoost-based Embedded method, a Catboost-based Embedded method, and a committee vote approach that combined the rankings from XGBoost and Catboost. For each method, our first goal was to identify the top features in our analytic dataset. Our second goal was to investigate the optimal K value for use in differentiated pattern detection. To do so, we used the methods to identify the top K features, where K ∈ {5, 10, 15, 20, 25, 30}. Table 1 lists the top features selected by the different techniques in our pipeline. The F1 scores of the fitted XGBoost and Catboost models were 0.909 and 0.910, respectively. This indicates that both classifiers performed well. Seven features were similar among the top K=10 features selected by XGBoost and Catboost separately.

Wrapper: fluid electrolyte, liver disease, cardiac arrhythmias, coagulopathy, vent first hour, deficiency anemias, congestive heart failure, alcohol abuse, angus, hypertension
XGBoost: SpO2 Mean, TempC Mean, DiasBP Mean, RespRate Mean, EndoTrachFlag, diabetes complicated, SysBP Mean, chronic pulmonary, day name icu intime, PLATELET first
Catboost: day name icu intime, EndoTrachFlag, marital status, angus, ethnicity, SpO2 Mean, DiasBP Mean, liver disease, vent first hour, peripheral vascular disease
Committee Vote: SpO2 Mean, EndoTrachFlag, DiasBP Mean, day name icu intime, TempC Mean, marital status, RespRate Mean, angus, ethnicity, diabetes complicated
Table 1: Top 10 features selected by each technique in the feature selection pipeline

4.3 Differentiated Pattern Detection

We used the multidimensional subset scanning (MDSS) approach to identify the highest-scoring (most anomalous) subset when given the full set of 42 features in our analytic dataset and when given each of the top K ∈ {5, 10, 15, 20, 25, 30} features selected by the four feature selection techniques in our pipeline. A total of 25 scans were conducted and compared. Before each scan, all numerical features were discretized. For each scan, we defined the observed outcome as a binary indicator of mortality within 28 days of ICU admission, and the expected outcome as a simple mean of the observed outcome. Consequently, each scan maximized the MDSS scoring function over the values of the input features to identify the highest-scoring subset, i.e., the subset with the strongest evidence that the likelihood of mortality within 28 days is higher than expected in the sampled population. After each scan, we estimated the statistical significance of the highest-scoring subset by computing its empirical p-value via parametric bootstrapping.
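The empirical p-value computation can be sketched as follows. This is a deliberately simplified illustration that scans the values of a single categorical feature instead of the full multi-dimensional search, and it assumes the expected rate is the overall outcome mean. Replica outcomes are drawn under the null hypothesis, the same scan is rerun on each replica, and the p-value is the fraction of replica scores at least as large as the observed score. The function names, the number of replicas, and the toy data are assumptions.

```python
import math
import random
from collections import defaultdict


def score(positives, n, p):
    """Bernoulli likelihood ratio score, maximised over q > 1 in closed form."""
    if n == 0 or positives / n <= p:
        return 0.0
    if positives == n:
        return -n * math.log(p)
    q = (positives / (n - positives)) / (p / (1 - p))
    return positives * math.log(q) - n * math.log(1 - p + q * p)


def scan_one_feature(values, outcomes, p):
    """Best score over subsets of one feature's values.

    Values are ordered by observed outcome rate (their priority), so only
    the top-j prefixes of that ordering need to be scored.
    """
    counts = defaultdict(lambda: [0, 0])  # value -> [positives, records]
    for v, y in zip(values, outcomes):
        counts[v][0] += y
        counts[v][1] += 1
    ordered = sorted(counts.values(), key=lambda c: c[0] / c[1], reverse=True)
    best, positives, n = 0.0, 0, 0
    for c_pos, c_n in ordered:
        positives += c_pos
        n += c_n
        best = max(best, score(positives, n, p))
    return best


def empirical_p_value(values, outcomes, replicas=99, seed=0):
    """Parametric bootstrap: simulate outcomes under the null and rescan."""
    rng = random.Random(seed)
    p = sum(outcomes) / len(outcomes)
    observed = scan_one_feature(values, outcomes, p)
    exceed = 0
    for _ in range(replicas):
        simulated = [1 if rng.random() < p else 0 for _ in outcomes]
        if scan_one_feature(values, simulated, p) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (replicas + 1)


# Extreme toy data: every record with value "A" has the outcome, none with "B".
values = ["A"] * 100 + ["B"] * 100
outcomes = [1] * 100 + [0] * 100
observed_score, p_value = empirical_p_value(values, outcomes)
print(observed_score, p_value)
```

Because the maximum score is recomputed on every replica, the resulting p-value accounts for the many subsets that were implicitly tested during the scan.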

The results of each scan included the highest-scoring subset, the score of the highest-scoring subset, the empirical p-value of the score, the subpopulation of patients in the highest-scoring subset, and the ratio of the odds of the outcome in the anomalous subpopulation to the odds of the outcome in the complement subpopulation. We found that scanning over all the features in the analytic dataset generated the highest score and the highest odds ratio. Figure 2 illustrates this by comparing the odds ratio obtained when scanning over all features with the odds ratios obtained when scanning over the top features selected by the different techniques.

Figure 2: Odds Ratio comparison among feature Selection techniques and whole MIMIC-III dataset

Of note is that the confidence interval of the odds ratio when scanning over all features is higher than, and does not overlap with, any of the confidence intervals of the odds ratios when scanning over the top features selected by the different techniques. Interestingly, the confidence intervals of the odds ratios of the different techniques overlap considerably, implying that they are not truly different from each other. Assuming that scanning over all features discovers a truly anomalous subpopulation, these findings suggest that the supervised feature selection processes used in this study can introduce biases in anomalous pattern detection that are characterized by different compositions of the anomalous subsets and lower measures of effect. Consequently, it is important to find the optimal value of K that minimizes this bias, such that the scores and measures of effect obtained when using a feature selection technique approach those obtained when scanning over all features.
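For reference, the odds ratio between an anomalous subpopulation and its complement, together with a confidence interval, can be computed from a 2x2 table of counts. The sketch below uses the standard log-based (Woolf) interval; this choice is an assumption, since the interval method used for Figure 2 is not stated, and the counts are hypothetical rather than results from the study.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% confidence interval from a 2x2 table.

    a: deaths in the anomalous subset      b: survivors in the anomalous subset
    c: deaths in the complement            d: survivors in the complement
    All four counts must be positive for the log-based interval to be defined.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts chosen only to demonstrate the calculation.
estimate, low, high = odds_ratio_ci(a=40, b=60, c=100, d=800)
print(round(estimate, 2), round(low, 2), round(high, 2))
```

Two techniques whose intervals computed this way overlap substantially cannot be distinguished on the basis of their odds ratios alone.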

After running the MDSS experiments to find the optimal value of K, the results in Figure 3 suggested that it was possible to attain the same scores as with all the covariates in the dataset using fewer features (K = 25), saving the time and resources needed to run the MDSS scan. The feature selection technique is heuristic, and the optimal value of K depends on the dataset.

Figure 3: Optimal Value of K in the MIMIC-III dataset

5 Conclusions and Future Work

This work proposes a supervised automated feature selection pipeline to select the optimal K features for the multi-dimensional subset scanning project. From this work, it is observed that finding an optimal value of K for a dataset saves resources compared with scanning through the whole dataset, which may contain redundant covariates in the feature space.

The proposed supervised automated feature selection pipeline offers a way for a user to select features using one or more feature selection techniques. The various feature selection techniques can be considered optimal as the selected features overlap across the techniques.

Using the proposed supervised automated feature selection techniques helped us achieve the main objective of using fewer (K) features in the Multi-Dimensional Subset Scanning pipeline while achieving scores as good as those obtained using all N features of the dataset.

A limitation of the study is that multi-dimensional subset scanning is an unsupervised machine learning technique, whereas the feature selection pipeline is based on supervised techniques. Setting several thresholds in the filtering technique also introduces a limitation, given that these thresholds depend on the datasets used.

We propose future work to explore how unsupervised feature selection techniques perform in relation to our supervised approach. Also, the use of thresholds in our work introduces an extra step to get the appropriate threshold for each dataset. This process could also be automated. We also intend to evaluate the runtime performances of the supervised feature selection techniques and compare these with unsupervised feature selection approaches.


  • [1] H. Akoglu (2018) User’s guide to correlation coefficients. Turkish Journal of Emergency Medicine 18 (3), pp. 91–93. Cited by: §3.2.
  • [2] J. Appleby, V. Raleigh, F. Frosini, G. Bevan, H. Gao, and T. Lyscom (2011) Variations in health care. The good, the bad and the inexplicable. London: The King’s Fund. Cited by: §1.
  • [3] J. Benesty, J. Chen, Y. Huang, and I. Cohen (2009) Noise reduction in speech processing. Vol. 2, Springer Science & Business Media. Cited by: §2, §3.2.
  • [4] T. Chen and C. Guestrin (2016) Xgboost: a scalable tree boosting system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. Cited by: §3.3.
  • [5] A. V. Dorogush, V. Ershov, and A. Gulin (2018) CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363. Cited by: §3.3.
  • [6] P. A. Estévez, M. Tesmer, C. A. Perez, and J. M. Zurada (2009) Normalized mutual information feature selection. IEEE Transactions on Neural Networks 20 (2), pp. 189–201. Cited by: §2.
  • [7] A. Johnson, T. Pollard, L. Shen, L. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Celi, and R. Mark (2016-05) Data descriptor: MIMIC-III, a freely accessible critical care database. Scientific Data 3. Cited by: §4.1.
  • [8] H. M. Krumholz (2013) Variations in health care, patient preferences, and high-quality decision making. Journal of the American Medical Association 310 (2), pp. 151–152. Cited by: §1.
  • [9] A. Kumar (2015) A survey on feature selection algorithms. International Journal on Recent and Innovation Trends in Computing and Communication 3, pp. 1895–1899. Cited by: §2.
  • [10] E. McFowland, S. Speakman, and D. B. Neill (2013) Fast generalized subset scan for anomalous pattern detection. The Journal of Machine Learning Research 14 (1), pp. 1533–1561. Cited by: §1.
  • [11] E. McFowland III, S. Somanchi, and D. B. Neill (2018) Efficient discovery of heterogeneous treatment effects in randomized experiments via anomalous pattern detection. arXiv preprint arXiv:1803.09159. Cited by: §1, §1, §2, §3.4.
  • [12] M. L. McHugh (2013) The chi-square test of independence. Biochemia Medica 23 (2), pp. 143–149. Cited by: §2, §3.2.
  • [13] J. Miao and L. Niu (2016) A survey on feature selection. Procedia Computer Science 91, pp. 919–926. Cited by: §1, §2, §2.
  • [14] L.C. Molina, L. Belanche, and A. Nebot (2002) Feature selection algorithms: a survey and experimental evaluation. Proceedings of the IEEE International Conference on Data Mining. Cited by: §2, §2.
  • [15] D. B. Neill (2012) Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 (2), pp. 337–360. Cited by: §1, §3.4.
  • [16] R. M. O’brien (2007) A caution regarding rules of thumb for variance inflation factors. Quality & Quantity 41 (5), pp. 673–690. Cited by: §2, §3.2.
  • [17] W. Ogallo, G. A. Tadesse, S. Speakman, and A. Walcott-Bryant (2021) Detection of anomalous patterns associated with the impact of medications on 30-day hospital readmission rates in diabetes care. In AMIA Annual Symposium Proceedings, Vol. 2021, pp. 495. Cited by: §1.
  • [18] S. Senn (2016) Mastering variation: variance components and personalised medicine. Statistics in Medicine 35 (7), pp. 966–977. Cited by: §1.
  • [19] E. Somanchi and D. B. Neill (2017) Detecting anomalous patterns of care using health insurance claims. Presented at Conference on Information Systems and Technology. Cited by: §1, §2.
  • [20] Z. Zhang and D. B. Neill (2016) Identifying significant predictive bias in classifiers. arXiv preprint arXiv:1611.08292. Cited by: §1, §2.