Detection and Evaluation of Clusters within Sequential Data

10/04/2022
by Alexander Van Werde, et al.

Motivated by theoretical advancements in dimensionality reduction techniques, we use a recent model, called Block Markov Chains, to conduct a practical study of clustering in real-world sequential data. Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees and can be deployed in sparse data regimes. Despite these favorable theoretical properties, a thorough evaluation of these algorithms in realistic settings has been lacking. We address this issue and investigate the suitability of these clustering algorithms for exploratory data analysis of real-world sequential data. In particular, our sequential data is derived from human DNA, written text, animal movement data, and financial markets. To evaluate the determined clusters and the associated Block Markov Chain model, we further develop a set of evaluation tools. These tools include benchmarking, spectral noise analysis, and statistical model selection tools. An efficient implementation of the clustering algorithm and the new evaluation tools is made available together with this paper. Practical challenges associated with real-world data are encountered and discussed. Ultimately, we find that the Block Markov Chain model assumption, together with the tools developed here, can indeed produce meaningful insights in exploratory data analyses despite the complexity and sparsity of real-world data.
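As a rough illustration of the model named in the abstract, the sketch below simulates a Block Markov Chain and estimates its cluster-level transition matrix from the observed path. It assumes the usual definition: the states are partitioned into clusters, the chain moves from cluster a to cluster b with probability p[a][b], and the next state is drawn uniformly from cluster b. The function names (`simulate_bmc`, `cluster_transition_estimate`) and all parameters are hypothetical and are not taken from the authors' implementation; the estimator also assumes the cluster assignment is already known, whereas the paper's algorithms recover it from the data.

```python
import random


def simulate_bmc(cluster_of, p, length, seed=0):
    """Sample a path of `length` states from a Block Markov Chain.

    cluster_of[x] is the cluster index of state x (assumed known here).
    p[a][b] is the probability of moving from cluster a to cluster b.
    Within the target cluster, the next state is chosen uniformly.
    Every cluster must contain at least one state.
    """
    rng = random.Random(seed)
    n_clusters = len(p)
    # List the member states of each cluster once, up front.
    members = [
        [x for x, c in enumerate(cluster_of) if c == k]
        for k in range(n_clusters)
    ]
    state = rng.randrange(len(cluster_of))  # arbitrary uniform start
    path = [state]
    for _ in range(length - 1):
        a = cluster_of[state]
        b = rng.choices(range(n_clusters), weights=p[a])[0]
        state = rng.choice(members[b])
        path.append(state)
    return path


def cluster_transition_estimate(path, cluster_of, n_clusters):
    """Estimate p by counting transitions between clusters along the path."""
    counts = [[0] * n_clusters for _ in range(n_clusters)]
    for x, y in zip(path, path[1:]):
        counts[cluster_of[x]][cluster_of[y]] += 1
    estimate = []
    for row in counts:
        total = sum(row)
        estimate.append([c / total if total else 0.0 for c in row])
    return estimate


# Hypothetical example: 6 states in 2 clusters.
cluster_of = [0, 0, 0, 1, 1, 1]
p = [[0.9, 0.1], [0.2, 0.8]]
path = simulate_bmc(cluster_of, p, length=20000, seed=1)
p_hat = cluster_transition_estimate(path, cluster_of, n_clusters=2)
```

With a long enough path, `p_hat` should lie close to `p`. In the sparse regimes the abstract mentions, the path is short relative to the number of states, so the individual state-to-state counts are mostly zero and the clusters must be inferred rather than assumed.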


