Parameterized Complexity of Feature Selection for Categorical Data Clustering

by Sayan Bandyapadhyay, et al.

We develop new algorithmic methods with provable guarantees for feature selection in categorical data clustering. While feature selection is one of the most common approaches to dimensionality reduction in practice, most known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features such that there is a small-cost clustering on the selected features. More precisely, given integers ℓ (the number of irrelevant features) and k (the number of clusters), a budget B, and a set of n categorical data points (represented by m-dimensional vectors whose entries belong to a finite set of values Σ), we want to select m − ℓ relevant features such that the cost of an optimal k-clustering on these features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (ℓ₀-distances) between the selected features of the elements of the cluster and its center; the clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on the parameters k, B, and |Σ|. Our main result is an algorithm that solves the Feature Selection problem in time f(k, B, |Σ|) · m^{g(k, |Σ|)} · n² for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when |Σ| and k are constants. Our algorithm builds on a solution to a more general problem, Constrained Clustering with Outliers. We complement our algorithmic findings with complexity lower bounds.
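To make the objective concrete, here is a minimal sketch of the clustering cost from the model above: the cost of a cluster is the sum of Hamming distances between each member and its center, counted only over the selected features. The helper names (`hamming`, `clustering_cost`) and the toy data are hypothetical, not from the paper.

```python
def hamming(u, v, features):
    """Hamming distance between vectors u and v, restricted to the given feature indices."""
    return sum(1 for j in features if u[j] != v[j])

def clustering_cost(points, assignment, centers, selected):
    """Total clustering cost: for each point, the Hamming distance to its
    assigned cluster center, counted only on the selected (relevant) features."""
    return sum(hamming(p, centers[assignment[i]], selected)
               for i, p in enumerate(points))

# Toy instance with m = 3 features, k = 1 cluster (illustrative data).
points = [("a", "b", "c"), ("a", "b", "d"), ("x", "b", "c")]
centers = [("a", "b", "c")]
assignment = [0, 0, 0]

# All m features selected: cost 0 + 1 + 1 = 2.
print(clustering_cost(points, assignment, centers, [0, 1, 2]))  # 2
# Dropping feature 2 as irrelevant (ℓ = 1): cost 0 + 0 + 1 = 1.
print(clustering_cost(points, assignment, centers, [0, 1]))  # 1
```

Feature Selection then asks whether some choice of m − ℓ features admits a k-clustering whose cost, computed this way, is at most the budget B.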




