Model-free feature selection to facilitate automatic discovery of divergent subgroups in tabular data

by Girmaw Abebe Tadesse, et al.

Data-centric AI emphasizes the need to clean and understand data in order to achieve trustworthy AI. Existing technologies, such as AutoML, make it easier to design and train models automatically, but a similar level of capability for extracting data-centric insights is lacking. Manual stratification of tabular data per feature (e.g., gender) does not scale to higher feature dimensions, which could be addressed using automatic discovery of divergent subgroups. Nonetheless, these automatic discovery techniques often search across potentially exponential combinations of features, a search that could be simplified by a preceding feature selection step. Existing feature selection techniques for tabular data often involve fitting a particular model in order to select important features. However, such model-based selection is prone to model bias and spurious correlations, in addition to requiring extra resources to design, fine-tune, and train a model. In this paper, we propose a model-free and sparsity-based automatic feature selection (SAFS) framework to facilitate automatic discovery of divergent subgroups. Different from filter-based selection techniques, we exploit the sparsity of objective measures among feature values to rank and select features. We validated SAFS across two publicly available datasets (MIMIC-III and Allstate Claims) and compared it with six existing feature selection methods. SAFS achieves a reduction of feature selection time by a factor of 81x and 104x, averaged across the existing methods in the MIMIC-III and Claims datasets respectively. SAFS-selected features are also shown to achieve competitive detection performance: e.g., the top 18.3% of features in the Claims dataset detected divergent samples similar to those detected using the whole feature set, with a Jaccard similarity of 0.95 but with a 16x reduction in detection time.





1 Introduction

AI research has been focused on building sophisticated models that aim to exploit large baseline datasets across different domains. Existing technologies, such as automated machine learning (AutoML) He et al. (2021), make building and training AI models easy while achieving competitive performance with the state-of-the-art models. However, progress in research and practices to extract data-centric perspectives has been relatively limited, and hence significant resources are still allocated to cleaning and analyzing data prior to feeding it to a model. Recent studies also show that baseline datasets, such as ImageNet Krizhevsky et al. (2012), contain considerable misannotations Northcutt et al. (2021). Data-centric AI is a growing field of research that aims to clean and evaluate data and extract insights that are crucial for AI researchers/practitioners, domain experts, and policy makers.

Stratification of data is a common technique to understand deviations across different values of a feature of interest. Examples include variations in COVID-19 cases across age groups Levin et al. (2020) or variations in child mortality across Sub-Saharan African countries Mejía-Guevara et al. (2019). However, such manual stratification does not scale up to encode interactions among a higher number of features. Furthermore, human-led exploration is limited: we prioritize some hypotheses while ignoring others, we stop exploring upon finding the first “significant” pattern in the data, and we tend to identify patterns in the data that are not actually there (i.e., Type-1 errors). To this end, automatic discovery techniques are necessary that 1) scale stratification to a higher number of features, 2) are less reliant on humans to pose the questions, as this transfers our bias, 3) prioritize detecting patterns with the most evidence, and 4) guard against false discoveries.
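As a toy illustration (hypothetical data, not from the paper), manual stratification by a single feature is a one-line group-by, but enumerating value combinations across many features quickly becomes intractable, which is what motivates automatic discovery:

```python
import pandas as pd

# Hypothetical records: outcome prevalence stratified by one feature.
df = pd.DataFrame({
    "age_group": ["<40", "<40", "40-60", "40-60", "60+", "60+"],
    "outcome":   [0,      0,     0,       1,       1,     1],
})

# Manual stratification: outcome rate per value of a single feature.
rates = df.groupby("age_group")["outcome"].mean()
print(rates)

# With many discretized features, the candidate subgroups are logical
# combinations of feature values, so their number grows exponentially
# with the feature dimension -- beyond what manual group-bys can cover.
```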

Existing divergent (also known as outlier or anomalous) subgroup detection techniques can be mainly categorized into reconstruction, classification, and probabilistic groups Ruff et al. (2021). The well-known principal component analysis and autoencoders are examples of reconstruction-based methods that first transform the data (e.g., to a latent space), so that anomalousness can be detected from failure to reconstruct the original data from the transformed data Hawkins et al. (2002). Classification-based approaches, particularly one-class classification, are often employed due to the lack of examples representing anomalous cases Khan and Madden (2014). Furthermore, traditional probabilistic models have also been used to identify anomalous samples via estimation of the normal data probability distribution, e.g., Gaussian mixture models Roberts and Tarassenko (1994) and Mahalanobis distance evaluation Laurikkala et al. (2000). Moreover, there are purely distance-based methods, such as k-nearest neighbors Gu et al. (2019), that require neither a prior training phase nor data transformations. Of note is that most existing methods infer anomalousness by exploiting individual sample characteristics rather than group-based characteristics. To this end, researchers have proposed techniques, such as pysubgroup Lemmerich and Becker (2018), slice-finder Chung et al. (2019), and multi-dimensional subset scanning (MDSS) Neill et al. (2013), that aim to identify subsets of anomalous samples by exploiting group-level characteristics. Application of divergent group detection is crucial across different domains, including healthcare Ogallo et al. (2021); Kim et al. (2021); Zhao et al. (2019), cybersecurity Xin et al. (2018), the insurance and finance sectors Zheng et al. (2018), and industrial monitoring Hundman et al. (2018). For example, in healthcare, deviations could be erratic data annotations, vulnerable groups, least-risk groups, or heterogeneous treatment effects.

However, most existing detection techniques use the whole input feature set as their search space, which includes exponentially growing combinations of feature values. For example, if there are D binary features, the number of possible combinations of feature values that may characterize a subgroup grows exponentially in D. In addition to the extended requirement of computational resources for large D, the identified subgroup might also be less interpretable when too many features are used to describe it. To this end, feature selection techniques could be employed to reduce the computational cost associated with detecting the subgroup, via a reduced search space, while maintaining the detection performance.

Existing feature selection techniques can be categorized as supervised and unsupervised based on their use of ground-truth labels in the data. Examples of supervised feature selection techniques include filter, wrapper, and embedded techniques Miao and Niu (2016); Molina et al. (2002); Wanjiru et al. (2021). In contrast, autoencoders and principal component analysis are examples of unsupervised techniques that reduce the feature dimension in a latent space Kumar (2015). Existing filter-based feature selection techniques Wanjiru et al. (2021); Molina et al. (2002) employ objective measures, such as mutual information, to encode the association between each feature and the outcome of interest. Wrapper methods use specific machine learning models to evaluate and select features Miao and Niu (2016). They measure the usefulness of features by learning a stepwise linear classifier or regressor using a recursive selection method, such as forward selection or backward elimination, until a stopping criterion is reached. On the other hand, embedded techniques utilize the output of model fitting (e.g., using XGBoost Lundberg et al. (2020) or CatBoost Wanjiru et al. (2021)), from which a feature importance ranking is extracted Molina et al. (2002); Miao and Niu (2016). Embedded techniques might require hyper-parameter fine-tuning of the model, as in Wanjiru et al. (2021), or training with a default tree-based classifier, as in Lundberg et al. (2020).

Generally, existing wrapper and embedded feature selection techniques mainly optimize over the aggregated prediction performance of a trained model Wanjiru et al. (2021); Molina et al. (2002); Miao and Niu (2016), which results in extra computational overhead (due to model fitting), and the feature selection output is sensitive to model hyper-parameters, class imbalance, and under-/over-fitting. Even though existing filter-based feature selection techniques do not require model training, they are limited in that they do not explore the variation of the objective measure across the different unique values of a particular feature.

In this paper, we propose a sparsity-based automatic feature selection framework (SAFS) which employs the normalized odds ratio as an objective measure to evaluate the association between a feature value and the target outcome. Similar to the state-of-the-art filter-based techniques, SAFS is model-free, as it does not require training a particular model. In addition, SAFS encodes deviations of associations among feature values with the target using a Gini-based sparsity evaluation metric Hurley and Rickard (2009). Generally, the contributions of this work are as follows: 1) a model-free feature selection technique tailored to encode systemic deviations among subsets in tabular data using a combination of association (i.e., normalized odds ratio) and sparsity (i.e., Gini index) measures that satisfy the corresponding requirements; 2) we validate the proposed framework using two publicly available datasets: the MIMIC-III (Medical Information Mart for Intensive Care) dataset Johnson et al. (2016) and the Allstate Claim Severity dataset Kaggle (2022); 3) we compare the proposed SAFS to multiple existing feature selection techniques, including a mutual-information-based filter Estévez et al. (2009), wrappers Miao and Niu (2016), and embedded techniques such as XGBoost Chen and Guestrin (2016), CatBoost Hancock and Khoshgoftaar (2020), Committee Wanjiru et al. (2021), and Shap Lundberg et al. (2020). The results show that the proposed SAFS outperforms the baselines in ranking the features with significantly reduced ranking time, i.e., 81x faster in MIMIC-III (with 41 features) and 104x faster in Claims (with 109 features), averaged across the existing methods. We employ multi-dimensional subset scanning Neill et al. (2013) to validate the detection of the subgroups. Results show that the top 20 features selected by SAFS from the Claim and MIMIC-III datasets achieved competitive detection performance compared with the whole feature sets, with Jaccard similarities of 0.95 and 0.93, respectively, in the identified divergent samples.

The organization of the paper is as follows. Section 2 describes the details of the proposed feature selection framework. Section 3 presents the datasets used for validation and existing techniques selected for comparison. Section 4 describes the results from the ranking of features across methods and the detection performance of divergent subgroups. Finally, Section 5 concludes the paper.

2 Proposed framework

The proposed framework is illustrated in Fig. 1, where the sparsity-based automatic feature selection (SAFS) precedes the divergent subgroup detection and characterization steps. SAFS exploits the sparsity of objective measures by first quantifying the association between each feature value and the outcome. Then a sparsity evaluation is applied to encode the deviations of the objective measures across the unique values of a feature, and the features are ranked as per their sparsity values. The top K features in the SAFS ranking are then fed into an existing multi-dimensional subset scanning framework that automatically stratifies the data and detects the divergent subgroup with the most extreme deviation from expectation. This group discovery process is followed by characterizing the subgroup's feature description, its size, and divergence metrics, such as the odds ratio, between the identified subgroup and the remaining input data. Each component of the framework is described below in detail.

Figure 1: Overview of the proposed framework (SAFS), in which features are selected prior to automated divergent subgroup detection using sparsity-based evaluation of their associations with the outcome.

2.1 Problem Formulation

Let D denote a dataset containing N samples, where each sample is characterized by a set of M discretized features F = {f_1, f_2, ..., f_M} and a binary outcome label y ∈ {0, 1}. Note that each feature f_m has V_m unique values, m = 1, ..., M. The proposed automated feature selection process is defined as a function g(·) that takes D as input and provides D_k represented by the top k features F_k ⊆ F, where k ≤ M. Then an existing subgroup discovery method (MDSS) Neill et al. (2013), h(·), takes D_k as input and identifies the anomalous subgroup S, represented by a set of anomalous features F_S ⊆ F_k. The overall anomalous feature subspace is described as logical (AND and OR) combinations of anomalous feature values, where each clause constrains one feature in F_S to a subset of its unique values. The detected divergent subgroup S contains the samples from D whose feature values satisfy this description. The divergence of the identified subgroup is evaluated using the anomalous score, followed by post-discovery metrics, such as the odds ratio, to quantify its divergence from the remaining subset of the input data.

2.2 Automated Feature Selection

The sparsity-based automatic feature selection (SAFS) component in Fig. 1 is tasked with selecting the top k features from a given M-dimensional feature space that are most useful for the subsequent anomalous subgroup discovery step. Unlike the existing wrapper and embedded techniques in state-of-the-art feature selection, SAFS is model-free and does not require training of a particular model. Similar to other filter-based techniques, SAFS employs a principled objective measure, but to evaluate the association between the outcome variable and each unique value of an input feature. To this end, SAFS uses Yule's Y-coefficient Yule (1912), which is a normalized odds ratio. We select Yule's Y-coefficient as it is proven to be an objective measure that satisfies the most fundamental and additional properties for a good measure Tan et al. (2004).

Given a feature f_m with V_m unique values, we stratify D per each feature value v, resulting in two subsets D_v and D_¬v, where D_v contains the samples with f_m = v and D_¬v contains the remaining samples. For example, consider a feature with three unique values a, b, and c in D. Then stratifying for v = a gives D_v containing all samples in D whose feature value is a, whereas D_¬v contains the remaining samples in D whose value is b or c. In order to compute Yule's Y coefficient, we generate a 2x2 contingency table from D_v and D_¬v as follows,

        Y=1    Y=0
D_v     n11    n10
D_¬v    n01    n00

where n11 is the number of samples in D_v with the binary outcome y = 1, and n10 is the number of samples in D_v with y = 0. Similarly, n01 is the number of samples in D_¬v with y = 1, and n00 is the number of samples in D_¬v with y = 0. Note that n11 + n10 is the size of D_v and n01 + n00 is the size of D_¬v. Yule's Y objective measure for the feature value v is computed as

Y_v = (√(n11 · n00) − √(n10 · n01)) / (√(n11 · n00) + √(n10 · n01)),

a monotone mapping of the odds ratio (n11 · n00)/(n10 · n01) onto [−1, 1].
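A minimal sketch of the computation (function and argument names are ours; the small eps guard for empty contingency cells is our assumption, not specified in the paper):

```python
import math

def yules_y(n11, n10, n01, n00, eps=1e-9):
    """Yule's Y (normalized odds ratio) from a 2x2 contingency table.

    n11: samples in D_v with y = 1    n10: samples in D_v with y = 0
    n01: samples in D_!v with y = 1   n00: samples in D_!v with y = 0
    eps guards against zero cells (our assumption).
    """
    sqrt_concord = math.sqrt((n11 + eps) * (n00 + eps))
    sqrt_discord = math.sqrt((n10 + eps) * (n01 + eps))
    return (sqrt_concord - sqrt_discord) / (sqrt_concord + sqrt_discord)

# No association (odds ratio 1) gives Y = 0; a strong positive
# association pushes Y toward +1.
print(yules_y(10, 10, 10, 10))   # 0.0
print(yules_y(90, 10, 10, 90))   # ~0.8
```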
Once Y_v is computed for each of the V_m unique values of feature f_m, we employ the Gini index Hurley and Rickard (2009) to evaluate the sparsity of the Yule's Y coefficients across the feature values. The Gini index is selected as it is the only sparsity measure that satisfies all six required properties described in Hurley and Rickard (2009). These six properties include Dalton's four laws (Robin Hood, Scaling, Rising Tide, and Cloning) in addition to Bill Gates and Babies Hurley and Rickard (2009). The Gini index is computed over the vector of objective measures c = [c_1, ..., c_{V_m}] sorted in ascending order, i.e., c_(1) ≤ c_(2) ≤ ... ≤ c_(V_m), where (1), ..., (V_m) are the indices after the sorting operation. The Gini index G(c) of c is then computed as

G(c) = 1 − 2 Σ_{k=1}^{V_m} (c_(k) / ||c||_1) ((V_m − k + 1/2) / V_m),

where ||c||_1 represents the L1 norm. The Gini index is computed for all M features, and the features are ranked in decreasing order, where the feature with the largest Gini index takes the top spot. A summary of the steps for sparsity-based feature selection is shown in Algorithm 1. The follow-up divergent subgroup detection step takes the top k features as described below.
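A sketch of the Gini sparsity computation under our assumptions (we take absolute values of the coefficients, since Yule's Y lies in [−1, 1] and the Hurley-Rickard index is defined over magnitudes; the paper does not spell this out):

```python
import numpy as np

def gini_index(values):
    """Gini sparsity index (Hurley & Rickard, 2009) of a 1-D vector.

    Sorts magnitudes in ascending order and returns
    1 - 2 * sum_k (c_(k)/||c||_1) * ((V - k + 1/2)/V).
    Returns 0 for a perfectly uniform vector; approaches 1 as all
    the energy concentrates in a single entry.
    """
    c = np.sort(np.abs(np.asarray(values, dtype=float)))  # ascending
    v = c.size
    l1 = c.sum()
    if l1 == 0.0:
        return 0.0  # all-zero vector: no sparsity signal
    k = np.arange(1, v + 1)
    return float(1.0 - 2.0 * np.sum((c / l1) * ((v - k + 0.5) / v)))

print(gini_index([1, 1, 1, 1]))   # 0.0  (uniform: no sparsity)
print(gini_index([0, 0, 0, 1]))   # 0.75 (one dominant value)
```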

input : Dataset D,
Set of features F = {f_1, ..., f_M},
Required number of features k.
output : Set of selected features F_k
1 G ← ZerosArray (M);
2 for f_m in F do
3       V_m ← IdentifyUniqueValues (f_m);
4       Y ← ZerosArray (|V_m|);
5       for v in V_m do
6             D_v, D_¬v ← Stratification (D, f_m, v);
7             Y[v] ← YuleY (D_v, D_¬v);
8       G[m] ← GiniSparsity (Y);
9 R ← SortDescendingIndices (G);
10 F_k ← TopK (F, R, k);
Algorithm 1: Pseudo-code for the proposed automated feature selection (SAFS) based on the Gini sparsity of Yule's Y coefficients as the objective measure between features and a binary outcome.
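Putting the pieces together, Algorithm 1 might be sketched as follows (a minimal illustration with our own function names, toy column names, and an eps guard for empty contingency cells; not the authors' implementation):

```python
import math
import numpy as np
import pandas as pd

def safs_rank(df, features, outcome, k, eps=1e-9):
    """Rank features by the Gini sparsity of Yule's Y coefficients
    between each feature value and a binary outcome; return top-k."""
    y = df[outcome].to_numpy()
    gini = {}
    for f in features:
        col = df[f].to_numpy()
        ys = []
        for v in np.unique(col):
            mask = col == v
            # 2x2 contingency table for (feature value v) vs outcome
            n11 = np.sum(mask & (y == 1)) + eps
            n10 = np.sum(mask & (y == 0)) + eps
            n01 = np.sum(~mask & (y == 1)) + eps
            n00 = np.sum(~mask & (y == 0)) + eps
            s_ad, s_bc = math.sqrt(n11 * n00), math.sqrt(n10 * n01)
            ys.append((s_ad - s_bc) / (s_ad + s_bc))  # Yule's Y
        # Gini sparsity (Hurley & Rickard) over the |Yule's Y| vector
        c = np.sort(np.abs(ys))
        n, l1 = c.size, c.sum()
        idx = np.arange(1, n + 1)
        gini[f] = 0.0 if l1 == 0 else \
            float(1.0 - 2.0 * np.sum((c / l1) * ((n - idx + 0.5) / n)))
    ranked = sorted(gini, key=gini.get, reverse=True)
    return ranked[:k]

# Toy data: 'f_sparse' has one value concentrated in positive outcomes,
# so its Yule's Y vector is sparse; 'f_multi' is uninformative.
df = pd.DataFrame({
    "y":        [1, 1, 1, 1, 0, 0, 0, 0],
    "f_multi":  [0, 1, 2, 3, 0, 1, 2, 3],
    "f_sparse": [0, 1, 2, 3, 0, 1, 2, 2],
})
print(safs_rank(df, ["f_multi", "f_sparse"], "y", k=1))  # ['f_sparse']
```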

2.3 Subgroup Discovery via Subset Scanning

We employ Multi-Dimensional Subset Scanning (MDSS) Neill et al. (2013) from the anomalous pattern detection literature in order to identify significantly divergent subsets of samples. MDSS can be posed as a search problem over possible subsets in a multi-dimensional array to identify the subsets with a systematic deviation between the observed outcomes and the expectation of the outcomes, the latter of which can be set differently for variants of the algorithm. In the simple automatic stratification setting, the expectation μ is the global outcome average in the data. The deviation between expectation and observation is evaluated by maximizing a Bernoulli likelihood ratio scoring statistic for a binary outcome. The null hypothesis assumes that the likelihood of the observed outcome in each subgroup S is similar to the expected, i.e., H_0: odds(S) = μ/(1 − μ), while the alternative hypothesis assumes a constant multiplicative increase in the odds of the observed outcome in the anomalous or extremely divergent subgroup S, i.e., H_1: odds(S) = q · μ/(1 − μ) with q ≠ 1 (q > 1 for an extremely over-observed subgroup, e.g., a high-risk population, and q < 1 for an extremely under-observed subgroup, e.g., a low-risk population). For the over-observed case, the anomalous scoring function for a subgroup S is formulated as the maximum log-likelihood ratio and computed as:

score(S) = max_{q>1} [ Σ_{i∈S} y_i log q − N_S log(1 − μ + qμ) ],

where N_S is the number of samples in S. Divergent subgroup identification is iterated until convergence to a local maximum, and the global maximum is subsequently optimized using multiple random restarts. The characterization of the identified anomalous subgroup S includes quantifying its anomalousness score, the analysis of the anomalous features and their values, the size of the subgroup, the odds ratio between S and the rest of the data with its Confidence Interval (CI), the significance quantified using an empirical p-value, and the time elapsed to identify S.
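For the over-observed case, the maximizing q has a closed form, so the score of a candidate subgroup can be sketched as follows (our variable names; a sketch of the Bernoulli scan statistic above, not the authors' MDSS code):

```python
import math

def bernoulli_score(b, n, mu):
    """Bernoulli likelihood-ratio score of a subgroup.

    b  : observed positive outcomes in the subgroup (sum of y_i)
    n  : subgroup size N_S
    mu : expected outcome probability (global mean under H0)
    Returns max_{q>1} [b*log(q) - n*log(1 - mu + q*mu)];
    0 when the subgroup is not over-observed.
    """
    if n == 0 or b <= n * mu:
        return 0.0                        # q* <= 1: no signal
    if b >= n:
        return n * math.log(1.0 / mu)     # q* -> infinity limit
    q = (b * (1.0 - mu)) / (mu * (n - b))  # closed-form maximizer
    return b * math.log(q) - n * math.log(1.0 - mu + q * mu)

# A subgroup with 8/10 positives against an expected rate of 0.2 scores
# highly; one matching the expectation scores zero.
print(bernoulli_score(8, 10, 0.2))   # ~8.32
print(bernoulli_score(2, 10, 0.2))   # 0.0
```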

3 Experimental Setup

3.1 Datasets

We employ two publicly available tabular datasets to validate the proposed feature selection framework and identify divergent subgroups in these datasets: the Medical Information Mart for Intensive Care (MIMIC-III) Johnson et al. (2016) and the Allstate Claim Severity dataset (Claim) Kaggle (2022). MIMIC-III is a freely accessible critical care dataset recording vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, and survival data. We selected a study cohort of adult patients (16 years or older) who were admitted to the ICU for the first time, whose length of stay was greater than a day, with no hospital readmissions, no surgical cases, and at least one chart event on day one. From the final cohort, we constructed 41 features (numerical and categorical) based on observations made in the first 24 hours of ICU admission; the numerical features were later discretized. We defined the target outcome as a binary indicator variable such that y = 1 for patients who died within 28 days of the onset of their ICU admission, and y = 0 otherwise.

The Claim dataset was released by Allstate, a US-based insurance company, as a Kaggle challenge Kaggle (2022) to predict the severity of claims. In our validation, we use the training claim examples with their 109 anonymized categorical features, and the numeric loss feature is used as the outcome of interest. We transform the outcome to a binary variable using the median loss as a threshold, i.e., loss values greater than or equal to the median loss are set to y = 1, and loss values less than the median are set to y = 0.

3.2 Existing methods selected for comparison

In order to compare our sparsity-based feature selection framework, we selected multiple existing state-of-the-art methods for selecting features from tabular data: Filter, Wrapper, and Embedded methods Miao and Niu (2016); Wanjiru et al. (2021); Molina et al. (2002).

Filter-based methods exploit the statistical characteristics of input data to select features independent of any modeling algorithms Wanjiru et al. (2021). In this study, we implemented a filter method based on mutual information gain  Molina et al. (2002); Vergara and Estévez (2014). Features are ranked in decreasing order of their mutual information, and top-ranked features are assumed to be more important than low-ranked features.

Wrapper methods Wanjiru et al. (2021); Miao and Niu (2016); Molina et al. (2002) measure the usefulness of features by learning a stepwise Ordinary Least Squares regression and dropping less significant features recursively until a stopping rule, such as reaching the required top K features, is met. In this study, we implemented a wrapper method using recursive feature elimination Guyon et al. (2002).

Embedded methods select features based on rankings from a model. We implemented this by employing two tree-based classifiers, i.e., CatBoost Dorogush et al. (2018) and XGBoost Chen and Guestrin (2016), and a Committee vote Wanjiru et al. (2021) from both. Unlike wrapper methods, embedded methods begin with the fitting of the tree-based model, followed by ranking features based on their importance. In the case of Committee-based selection, the importance scores from each of the two tree-based models are first normalized separately using min-max scaling; the average of these importance scores is then computed to rank the features. The three embedded methods above require calibration of the model. We also experimented with Shap Lundberg et al. (2020)-value based feature importance using the XGBoost classifier, but with the default setting without calibration (herein referred to as Fast-Shap).
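As a reference point for the filter baseline, a plug-in estimate of the mutual information between a discrete feature and the binary outcome can be sketched as follows (our own minimal implementation, not the one used in the paper's experiments):

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for discrete 1-D arrays."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability
            if p_xy == 0.0:
                continue
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)  # marginals
            mi += p_xy * np.log(p_xy / (p_x * p_y))
    return float(mi)

def filter_rank(columns, y, k):
    """Rank named feature columns by decreasing MI with the outcome."""
    scores = {name: mutual_information(col, y)
              for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

y = [0, 0, 1, 1]
cols = {"informative": [0, 0, 1, 1], "noise": [0, 1, 0, 1]}
print(filter_rank(cols, y, k=1))  # ['informative']
```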

3.3 Setup

We set up the subgroup discovery task as a form of automatic data stratification in the two datasets as follows. The subgroup in the MIMIC-III dataset refers to the subset of patients with the highest death risk compared to the average population in that dataset. The subgroup discovery task in the Claim dataset is formulated as identifying the subset of claims with the highest likelihood of severe claims (i.e., loss greater than or equal to the median) compared to the average across all claims.

For each dataset, we conducted subgroup discovery using the top K features selected by the different feature selection methods examined. Specifically, we experimented with top K ∈ {5, 10, 15, 20, 25, 30, 35, 40, 41} features for the MIMIC-III dataset, and top K ∈ {10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 109} features for the Claim dataset. Note that K = 41 and K = 109 represent using the entire original feature set in the MIMIC-III and Claim datasets, respectively.

To compare the different feature selection methods, we measured the computation time elapsed to rank the features as the first performance metric. Furthermore, we explored the similarity of the feature rankings across these methods using rank-biased overlap Webber et al. (2010). In addition, the output of the subgroup discovery algorithm using the top K features was evaluated to inspect the usefulness of the selected features for detecting a divergent subgroup. We also compared the time elapsed to identify the subgroup across different top K values to determine the computation time saved by using the selected top K features rather than the whole input feature set. Lastly, the Jaccard similarity is employed to evaluate the similarity between the anomalous samples detected using the selected top K features and those detected using the whole feature set. Note that all experiments were conducted on a MacBook Pro with macOS Big Sur v11.6, a 2.9 GHz Quad-Core Intel Core i7 processor, and 16 GB 2133 MHz LPDDR3 memory.
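The last check amounts to a Jaccard index over two sets of detected sample indices (a trivial sketch, with our naming; the empty-set convention is our assumption):

```python
def jaccard_similarity(ids_a, ids_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of detected
    sample indices; returns 1.0 when both sets are empty."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Samples flagged using top-K features vs. the whole feature set:
print(jaccard_similarity([1, 2, 3, 4], [2, 3, 4, 5]))  # 0.6
```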

4 Results and Discussion

Table 1 presents the time in seconds (s) elapsed by each feature selection method to rank the features in the two validation datasets. The results show that the proposed sparsity-based feature selection framework achieved the ranking with the shortest duration: 70.5 seconds in the Claim dataset and just 3.3 seconds in MIMIC-III, resulting in an average reduction of feature selection time by a factor of 104x (in Claim) and 81x (in MIMIC-III) compared to the existing methods. As expected, the wrapper method took the longest duration, as it involves recursive fitting of a model and backward elimination of features. The embedded methods also took considerable time, particularly on the larger dataset, due to the need for hyper-parameter tuning before selection. The committee vote approach took longer than either of the XGBoost- and CatBoost-based embedded methods, as it has to normalize and merge their separate rankings.

Method Claim MIMIC-III
WrapperMiao and Niu (2016); Wanjiru et al. (2021) 16444.3 777.4
CommitteeWanjiru et al. (2021) 13565.6 652.7
XGBoostChen and Guestrin (2016) 12306.1 124.7
CatBoostHancock and Khoshgoftaar (2020) 1259.5 21.4
FilterWanjiru et al. (2021) 494.4 14.4
Fast-ShapLundberg et al. (2020) 114.1 11.3
Proposed 70.5 3.3
Table 1: Comparison of the elapsed times (in seconds) for different feature selection methods to rank the given feature sets across the two validation datasets: Claim and MIMIC-III. The proposed method achieved the shortest elapsed time on both datasets.

The pairwise similarities of the rankings from different feature selection methods are illustrated in Fig. 2(a) for the MIMIC-III dataset and in Fig. 2(b) for the Claim dataset using rank-biased overlap scores. In both datasets, the ranking from the Committee vote achieved the highest similarity with the CatBoost and XGBoost rankings. This is expected, as the Committee vote is generated from those two rankings. The feature rankings from Fast-Shap relatively resemble those of the other embedded techniques, i.e., XGBoost and CatBoost. The overlap between Fast-Shap and XGBoost is lower than the overlap between CatBoost and XGBoost, even though the only difference between Fast-Shap and XGBoost is that the latter tunes the model parameters during fitting while Fast-Shap uses the default XGBoost parameters without tuning. The proposed SAFS method produces a ranking most similar to that of the Filter method, as both employ objective measures to quantify the association between features and outcome without training any model. The feature ranking from the Wrapper method is the most different from the other methods, suggesting that it is the least effective ranking method despite taking the longest duration to rank the features (see Table 1).

(a) MIMIC-III dataset
(b) Claim dataset
Figure 2: Rank-biased overlap scores quantifying the pairwise similarity of the feature rankings produced by different selection methods.
(a) MIMIC-III dataset
(b) Claim dataset
Figure 3: Comparison of the time taken to discover the divergent subgroup across different top K features. Note that K = 41 (MIMIC-III) and K = 109 (Claim) mean the whole feature set is used by the detection algorithm.

Figure 3 shows a steep increase in detection time (in seconds) as the number of features increases, thereby validating the value of feature selection prior to the subgroup detection step. While it is clear that subgroup discovery is more efficient using a lower number of features selected in a principled way, it is important to evaluate the consistency of the features identified to characterize the detected divergent subgroups. Ideally, similar descriptions of anomalous features should be achieved from scanning across different top K features. To this end, we employed Upset plots to visualize the similarity of the features in the anomalous subsets identified by scanning over the top K features selected by our proposed SAFS method, as shown in Fig. 4. In the MIMIC-III dataset (see Fig. 4(a)), two features, curr_service and urineoutput_cat, appear in the anomalous subset across all the different top K values. The feature psychoses also appears for most of the different K values. In the Claim dataset (Fig. 4(b)), three features (cat80, cat1, and cat94) are consistently detected as a subset of the anomalous features across most of the different K values. This validates the consistency of the identified group across different top K values and reaffirms the benefit of using fewer selected features, which achieve a similar subgroup with significantly reduced computation time.

(a) MIMIC-III dataset
(b) Claim dataset
Figure 4: Upset plots visualizing the similarity of the identified anomalous features that describe the detected divergent groups, across the two datasets (MIMIC-III and Allstate Claims), using different top K features selected by the proposed SAFS.

Table 2 presents a detailed comparison of the identified subgroups across different top features. The details include the number of features and the corresponding number of values describing the divergent subgroup, the number of samples in the subgroup, and percentage of the whole data in the subgroup. The table also shows the odds ratio (and its CI) of the outcome in the identified subgroup compared to the complement subset, and the significance of the divergence as measured by an empirical p-value. Table 2 shows the detected subgroup details for the Claim dataset. It is clear that by just using 20 features ( of the whole features), we identify a subgroup with ( of the whole data) characterized by the following logical combinations of features (and values):cat80 (’B’ or ’C’) and cat1 (’A’) and cat94 (’B’ or ’C’ or ’D’). The odds of experiencing severe claims is ( CI: ) higher in the identified subgroup than its complement. This identified subgroup is similar to the subgroup identified using the whole features that results a subgroup of samples ( of whole data) and odds ratio of ( CI: ). The samples identified by using and features have a Jaccard similarity index of . Table 2 shows the same trend of achieving similar divergent subgroup using fewer number of features in the MIMIC-III dataset. Specifically, using just features ( of whole features), we identified subgroup size of ( of whole data) with a ( CI: ) increase in odds of experiencing death within days of the onset of their ICU admission. This group is described with the following combination of features (and values): psychoses (’0’) and urineoutput_cat (’1’ or ’2’) and drug_abuse (’0’) and curr_service (’0’ or ’5’ or ’6’ or ’8’) and depression (’0’), insurance (’1’ or ’2’ or ’3’ or ’4’) and vent_firstday (’1’). This is similar to the group identified using the whole feature set (), which results in a divergent subgroup of ( of whole data) with odds ratio of ( CI: ), achieving a Jaccard similarity of . 
The results also show that the number of features and unique values needed to describe the identified subgroup grows unnecessarily with a larger number of top features. This validates the benefit of applying the feature selection framework before subgroup detection: it improves the interpretability of the identified subgroup and significantly reduces the computation time.
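To make such a subgroup description concrete, the reported logical rule for the Claims data can be applied as a simple filter and its odds ratio computed directly. The sketch below uses the column names from the paper (cat80, cat1, cat94), but the rows and the binary outcome column `severe` are made up for illustration:

```python
import pandas as pd

def subgroup_mask(df):
    """Filter implementing the subgroup rule reported for the Claims data:
    cat80 in {B, C} AND cat1 == A AND cat94 in {B, C, D}."""
    return (
        df["cat80"].isin(["B", "C"])
        & (df["cat1"] == "A")
        & df["cat94"].isin(["B", "C", "D"])
    )

def odds_ratio(outcome, mask):
    """Odds of the outcome inside the subgroup vs. its complement (a*d / b*c)."""
    a = int(((outcome == 1) & mask).sum())   # positives inside the subgroup
    b = int(((outcome == 0) & mask).sum())   # negatives inside the subgroup
    c = int(((outcome == 1) & ~mask).sum())  # positives in the complement
    d = int(((outcome == 0) & ~mask).sum())  # negatives in the complement
    return (a * d) / (b * c)

# Toy rows standing in for the Allstate Claims table.
df = pd.DataFrame({
    "cat80":  ["B", "C", "A", "B", "A", "C"],
    "cat1":   ["A", "A", "A", "B", "A", "A"],
    "cat94":  ["B", "D", "C", "B", "A", "C"],
    "severe": [1, 1, 1, 0, 0, 0],
})
mask = subgroup_mask(df)
print(mask.sum(), odds_ratio(df["severe"], mask))
```

On real data one would also report a confidence interval and empirical p-value, as in Table 2; this sketch only shows how the rule and the point estimate fit together.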


The proposed feature selection framework is shown to achieve competitive or better performance than existing feature selection techniques: it does not require fitting a particular model (unlike existing Wrapper and Embedded methods), and it exploits the variation of objective measures across the unique values of a feature (unlike existing Filter methods). However, its main limitation rests on its requirement to discretize numeric features into bins.
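As a concrete illustration of that discretization requirement, numeric columns must be binned into ordinal categories before SAFS can score their values. A minimal quantile-binning sketch follows; the column name and the choice of four bins are illustrative, not taken from the paper:

```python
import pandas as pd

# Quantile-based binning of a numeric column into ordinal categories,
# the kind of preprocessing SAFS needs before ranking features.
df = pd.DataFrame({"urineoutput": [50, 120, 300, 800, 1500, 2400, 3100, 4000]})
df["urineoutput_cat"] = pd.qcut(df["urineoutput"], q=4, labels=False)
print(df["urineoutput_cat"].tolist())  # each quartile becomes one ordinal bin
```

The number of bins is a tuning choice: too few bins blur genuinely divergent values together, while too many bins fragment the data and inflate the search space that subgroup discovery must scan.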

K #Feats (#Vals) Subset size % Jacc. Sim. Odds ratio (CI) p
10 1 (1) 45707 25 0.81 8.52 (8.32, 8.72) 0.0
20 3 (6) 44181 24 0.95 8.79 (8.58, 9.01) 0.0
30 3 (6) 44181 24 0.95 8.79 (8.58, 9.01) 0.0
40 3 (6) 44181 24 0.95 8.79 (8.58, 9.01) 0.0
50 3 (6) 44181 24 0.95 8.79 (8.58, 9.01) 0.0
60 3 (6) 44181 24 0.95 8.79 (8.58, 9.01) 0.0
70 4 (9) 43708 24 0.95 8.91 (8.7, 9.13) 0.0
80 5 (13) 42333 23 0.99 9.25 (9.02, 9.48) 0.0
90 5 (13) 42333 23 0.99 9.25 (9.02, 9.48) 0.0
100 5 (13) 42333 23 0.99 9.25 (9.02, 9.48) 0.0
109 6 (14) 42101 23 1.00 9.30 (9.07, 9.53) 0.0
(a) Claim dataset
K #Feats (#Vals) Subset size % Jacc. Sim. Odds ratio (CI) p
5 3 (8) 4577 23 0.32 3.60 (3.33, 3.89) 0.0
10 4 (9) 4386 22 0.34 3.65 (3.38, 3.95) 0.0
15 4 (9) 4386 22 0.34 3.65 (3.38, 3.95) 0.0
20 7 (14) 3183 16 0.93 3.86 (3.54, 4.2) 0.0
25 7 (11) 3078 16 1.00 3.95 (3.63, 4.31) 0.0
30 7 (11) 3078 16 1.00 3.95 (3.63, 4.31) 0.0
35 7 (11) 3078 16 1.00 3.95 (3.63, 4.31) 0.0
40 7 (11) 3078 16 1.00 3.95 (3.63, 4.31) 0.0
41 7 (11) 3078 16 1.00 3.95 (3.63, 4.31) 0.0
(b) MIMIC-III dataset
Table 2: Post-discovery analysis of the identified anomalous subgroups in the two datasets (Claim and MIMIC-III) for each number of top features (K) selected for scanning using the MDSS-based automatic stratification. The details include the number of anomalous features detected (#Feats) and the number of their unique values (#Vals), the number of samples in the identified subgroup (Subset size) and its percentage of the whole data (%). We also provide the Jaccard similarity (Jacc. Sim.) between the samples identified using the top K features and the whole feature set, the odds ratio of the outcome in the identified subset compared to the remaining data, its confidence interval (CI), and the empirical p-value (p). The results show the discovery of consistent anomalous subsets across top-K values, with a slight decrease in subset size as scanning with a higher number of features results in a more complex anomalous subgroup description.
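The Jaccard similarity reported above compares the set of samples flagged when scanning with the top-K features against the set flagged with the whole feature set. A minimal sketch on toy sample identifiers:

```python
def jaccard_similarity(a, b):
    """Jaccard index between two sets of detected sample identifiers:
    |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Toy IDs standing in for samples flagged with top-20 vs. all features.
top_k_samples = {1, 2, 3, 4}
all_feat_samples = {2, 3, 4, 5}
print(jaccard_similarity(top_k_samples, all_feat_samples))  # 3 shared of 5 -> 0.6
```

A value of 1.0 (as at K = 25 for MIMIC-III) means the reduced feature set recovers exactly the same samples as the full one.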

5 Conclusion and Future work

The model-centric approach has grown over the years to solve problems across different domains using sophisticated models trained on large datasets. While methods for data-centric insight extraction have not been given comparable attention, they possess greater potential to understand, clean, and valuate data, thereby enabling more efficient performance with fewer resources. Automatic divergent subgroup detection can be employed to answer different data-related questions, e.g., which subpopulations are at high risk of a particular disease? Such automated techniques often do not require prior assignment of a feature of interest, and they scale to high-dimensional feature inputs. However, detecting such subsets of the data requires searching across potentially exponential combinations of input features, a space that grows with the number of input features. To this end, we propose a sparsity-based automatic feature selection (SAFS) framework for divergent subgroup discovery that significantly reduces the search space and, consequently, the time required to complete the discovery, while improving the interpretability of the identified subgroups. SAFS employs the Yule's-Y coefficient as an objective measure of effect between each feature value and an outcome, and then encodes the variation across the values of a given feature using the Gini-index sparsity metric. Yule's-Y and Gini-index are chosen because they have been proven to satisfy the fundamental requirements of good objective measures and sparsity metrics, respectively. We validated our feature selection framework on two publicly available datasets: MIMIC-III (with 41 features) and Allstate Claims (with 109 features). We also compared SAFS with multiple existing feature selection techniques, including Filter, Wrapper, and Embedded methods.
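The ranking step described above (a Yule's-Y score per feature value, then Gini-index sparsity across each feature's scores) can be sketched as follows. This is a simplified reconstruction under our own reading of the method, not the authors' implementation; in particular, ranking features with sparser score distributions higher is an assumption:

```python
import numpy as np

def yules_y(value_mask, outcome):
    """Yule's-Y association between one feature value (boolean mask over
    samples) and a binary outcome, from the 2x2 contingency counts."""
    a = np.sum(value_mask & (outcome == 1))
    b = np.sum(value_mask & (outcome == 0))
    c = np.sum(~value_mask & (outcome == 1))
    d = np.sum(~value_mask & (outcome == 0))
    sad, sbc = np.sqrt(a * d), np.sqrt(b * c)
    return (sad - sbc) / (sad + sbc)

def gini_sparsity(scores):
    """Gini-index sparsity of a vector of per-value scores (Hurley & Rickard
    form): 0 for uniform score magnitudes, approaching 1 as they concentrate
    in a few values."""
    c = np.sort(np.abs(np.asarray(scores, dtype=float)))
    n, l1 = len(c), c.sum()
    if l1 == 0:
        return 0.0  # all scores zero: no variation to encode
    k = np.arange(1, n + 1)
    return 1 - 2 * np.sum((c / l1) * ((n - k + 0.5) / n))

def rank_features(df, outcome_col):
    """Rank feature columns by the Gini sparsity of their Yule's-Y scores
    (sparser first -- an assumption about the ranking direction)."""
    out = df[outcome_col].to_numpy()
    scores = {}
    for col in df.columns.drop(outcome_col):
        ys = [yules_y((df[col] == v).to_numpy(), out) for v in df[col].unique()]
        scores[col] = gini_sparsity(ys)
    return sorted(scores, key=scores.get, reverse=True)
```

Note the sketch ignores edge cases a production version must handle, such as feature values whose contingency table makes the Yule's-Y denominator zero.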
Results showed that the proposed framework completed the ranking of features in the shortest duration on both datasets, reducing feature selection time by average factors of 81x (MIMIC-III) and 104x (Claims) compared to the existing methods. Furthermore, MDSS-based subgroup detection was employed to automatically identify divergent subgroups in the two datasets: patients at high risk of death in MIMIC-III, and claims with high severity relative to the median loss in the Claims dataset. The detection results demonstrate the efficiency of our selection method, which discovers divergent groups similar to those found using the whole feature input while using only about a third of the original features on average (18.3% in Claims and 48.8% in MIMIC-III). Future work aims to improve the efficiency of SAFS and apply it across different use cases (e.g., heterogeneous treatment effects) and feature types (e.g., numeric features without discretization). Moreover, a similar approach could be employed to select nodes and layers in deep neural networks to detect adversarial/novel/outlier samples in latent spaces.

