A system for exploring big data: an iterative k-means searchlight for outlier detection on open health data

04/05/2023
by A. Ravishankar Rao, et al.

The interactive exploration of large and evolving datasets is challenging because relationships between the underlying variables may not be fully understood. There may be hidden trends and patterns in the data that are worthy of further exploration and analysis. We present a system that methodically explores multiple combinations of variables using a searchlight technique and identifies outliers. An iterative k-means clustering algorithm is applied to features derived through the split-apply-combine paradigm used in the database literature. Outliers are identified as singleton or small clusters. This algorithm is swept across the dataset in a searchlight manner. The dimensions that contain outliers are then combined in pairs with other dimensions using a subset scan technique to gain further insight into the outliers. We illustrate this system by analyzing open health care data released by New York State, applying our iterative k-means searchlight followed by subset scanning. Several anomalous trends in the data are identified, including cost overruns at specific hospitals and increases in diagnoses such as suicides. These constitute novel findings in the literature, and are of potential use to regulatory agencies, policy makers and concerned citizens.
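The core step described above (split-apply-combine features, then k-means repeated while singleton or small clusters keep appearing) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hospital labels and charge values are invented, the function names are hypothetical, the feature is assumed to be a per-group mean, the farthest-point initialisation is chosen only to make the run deterministic, and the stopping rule (stop when no small cluster remains) is an assumption. The searchlight sweep and subset scan are not covered.

```python
from collections import Counter, defaultdict
from math import dist
from statistics import mean


def split_apply_combine(records):
    """Group (key, value) records by key and reduce each group to a feature vector."""
    groups = defaultdict(list)
    for key, value in records:              # split
        groups[key].append(value)
    # apply + combine: one 1-D feature vector (the group mean) per key
    return {key: (mean(values),) for key, values in groups.items()}


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:            # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(mean(axis) for axis in zip(*members))
    return labels


def iterative_kmeans_outliers(features, k=2, max_small=1, max_rounds=10):
    """Repeatedly cluster, flag members of clusters with <= max_small points, remove them."""
    remaining = dict(features)
    outliers = []
    for _ in range(max_rounds):
        if len(remaining) <= k:
            break
        keys = list(remaining)
        labels = kmeans([remaining[key] for key in keys], k)
        sizes = Counter(labels)
        flagged = [key for key, lab in zip(keys, labels) if sizes[lab] <= max_small]
        if not flagged:                     # assumed stopping rule: no small clusters left
            break
        outliers.extend(flagged)
        for key in flagged:
            del remaining[key]
    return outliers


# Hypothetical (hospital, charge) records; the values are invented for illustration.
records = [
    ("A", 90), ("A", 110), ("B", 100), ("B", 102), ("C", 101), ("C", 103),
    ("D", 105), ("D", 115), ("E", 110), ("E", 112), ("F", 111), ("F", 113),
    ("G", 900), ("G", 1100),
]
features = split_apply_combine(records)
outliers = iterative_kmeans_outliers(features, k=2)
print(outliers)  # prints ['G']
```

In this toy run the first round isolates hospital G in a singleton cluster; after G is removed, the second round splits the remaining six hospitals into two clusters of three, so nothing further is flagged. With a fixed k and few points, k-means can keep peeling off singletons, so the choice of `k`, `max_small`, and the stopping rule matters in practice.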

research
04/05/2023

PIKS: A Technique to Identify Actionable Trends for Policy-Makers Through Open Healthcare Data

With calls for increasing transparency, governments are releasing greate...
research
10/30/2017

Hiding in plain sight: insights about health-care trends gained through open health data

The open data movement constitutes an approach to achieving accountabili...
research
02/24/2017

Characterizing Classes of Potential Outliers through Traffic Data Set Data Signature 2D nMDS Projection

This paper presents a formal method for characterizing the potential out...
research
01/29/2012

A robust and sparse K-means clustering algorithm

In many situations where the interest lies in identifying clusters one m...
research
12/13/2022

AWT – Clustering Meteorological Time Series Using an Aggregated Wavelet Tree

Both clustering and outlier detection play an important role for meteoro...
research
05/24/2018

A Practical Algorithm for Distributed Clustering and Outlier Detection

We study the classic k-means/median clustering, which are fundamental pr...
research
12/21/2020

Data Combination for Problem-solving: A Case of an Open Data Exchange Platform

In recent years, rather than enclosing data within a single organization...
