Robust subgroup discovery

03/25/2021
by Hugo Manuel Proença, et al.

We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) are non-redundant. Many attempts have been made either to mine locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global perspective. First, we formulate a broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables. This novel model class allows us to formalize the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalized Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Notably, we show that our problem definition is equivalent to mining the top-1 subgroup with an information-theoretic quality measure plus a penalty for complexity. Second, as finding optimal subgroup lists is NP-hard, we propose RSD, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. This criterion is shown to be equivalent to a Bayesian one-sample proportions, multinomial, or t-test between the subgroup and dataset marginal target distributions, plus a multiple hypothesis testing penalty. We empirically show on 54 datasets that RSD outperforms previous subgroup set discovery methods in terms of quality and subgroup list size.
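
To make the idea of greedily building a subgroup list concrete, here is a minimal Python sketch for a single binary target. It is not the RSD algorithm or its MDL encoding: the quality measure (coverage times the KL divergence to the dataset marginal) and the fixed complexity penalty (log2 of the number of candidate descriptions) are simplifying assumptions chosen for illustration, and every name in it (`greedy_subgroup_list`, `kl_bernoulli`, `Subgroup`, the toy `rows`) is hypothetical.

```python
import math
from collections import namedtuple

# One entry of the ordered list: a single-condition description plus statistics.
Subgroup = namedtuple("Subgroup", "attribute value n p_pos gain")


def kl_bernoulli(p, q):
    """KL divergence in bits between Bernoulli(p) and Bernoulli(q)."""
    total = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:  # terms with zero probability contribute nothing
            total += a * math.log2(a / b)
    return total


def greedy_subgroup_list(rows, target, max_size=5):
    """Greedily build an ordered list of subgroups for a binary target.

    Assumed stand-in score: n * KL(subgroup || dataset marginal) - penalty.
    """
    q = sum(r[target] for r in rows) / len(rows)  # dataset marginal rate
    candidates = sorted(
        {(a, v) for r in rows for a, v in r.items() if a != target}, key=repr
    )
    # Assumed complexity penalty: bits needed to pick one candidate description.
    penalty = math.log2(len(candidates)) if candidates else 0.0

    remaining = list(rows)  # instances not yet covered by an earlier subgroup
    result = []
    while remaining and len(result) < max_size:
        best = None
        for a, v in candidates:
            cover = [r for r in remaining if r.get(a) == v]
            if not cover:
                continue
            p = sum(r[target] for r in cover) / len(cover)
            gain = len(cover) * kl_bernoulli(p, q) - penalty
            if best is None or gain > best.gain:
                best = Subgroup(a, v, len(cover), p, gain)
        if best is None or best.gain <= 0:
            break  # no candidate pays for its own complexity
        result.append(best)
        # Ordered semantics: later subgroups only see uncovered instances.
        remaining = [r for r in remaining if r.get(best.attribute) != best.value]
    return result


# Toy data (made up): smoking perfectly separates the target, age does not.
rows = (
    [{"smoker": "yes", "age": "old", "sick": 1}] * 3
    + [{"smoker": "yes", "age": "young", "sick": 1}]
    + [{"smoker": "no", "age": "old", "sick": 0}] * 2
    + [{"smoker": "no", "age": "young", "sick": 0}] * 2
)
result = greedy_subgroup_list(rows, target="sick")
for s in result:
    print(s)
```

On this toy data the penalty stops the search once no remaining description deviates enough from the marginal to justify another entry, which mirrors, in a much cruder form, the role the complexity term plays in the formulation described above.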

