Clustering Longitudinal Life-Course Sequences using Mixtures of Exponential-Distance Models

08/21/2019
by   Keefe Murphy, et al.

Sequence analysis is an increasingly popular approach for analysing life-courses represented by categorical sequences, i.e. the ordered collection of activities experienced by subjects over a given time period. Several criteria have been introduced in the literature to measure pairwise dissimilarities among sequences. Typically, dissimilarity matrices are employed as the input to heuristic clustering algorithms, with the aim of identifying the most relevant patterns in the data. Here, we propose a model-based clustering approach for categorical sequence data. The technique is applied to a popular survey data set containing information on the career trajectories, in terms of monthly labour market activities, of a cohort of Northern Irish youths tracked from the age of 16 to the age of 22. Specifically, we develop a family of methods for clustering sequence data directly based on mixtures of exponential-distance models, which we call MEDseq. The Hamming distance and weighted variants thereof are employed as the distance metrics. The existence of a closed-form expression for the normalising constant under these metrics facilitates the development of an ECM algorithm for model fitting. We allow the probability of component membership to depend on fixed covariates. The MEDseq models can also accommodate sampling weights, which are typically associated with life-course data. Including the weights and covariates in the clustering process in a holistic manner allows new insights to be gleaned from the Northern Irish data.
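
As an illustration of why the Hamming distance yields a closed-form normalising constant, the following minimal Python sketch evaluates a single-component exponential-distance density of the assumed form exp(-lam * d_H(s, theta)) / ((v - 1) * exp(-lam) + 1)^T, for sequences of length T over v categories, and checks the constant against brute-force enumeration. The function names, the single precision parameter `lam`, and the toy state labels are assumptions made for illustration only; this is not the MEDseq implementation, and it omits the weighted Hamming variants, covariates, and sampling weights mentioned in the abstract.

```python
import itertools
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def log_norm_const(lam, n_states, length):
    """Closed-form log normalising constant under the Hamming distance.

    Summing exp(-lam * d) over all n_states**length sequences gives
    ((n_states - 1) * exp(-lam) + 1) ** length, by the binomial theorem.
    """
    return length * math.log((n_states - 1) * math.exp(-lam) + 1.0)


def log_density(seq, central, lam, n_states):
    """Log-density of `seq` around the central sequence `central`."""
    return -lam * hamming(seq, central) - log_norm_const(lam, n_states, len(central))


def brute_force_log_norm_const(central, lam, states):
    """Same constant by explicit enumeration (feasible for tiny examples only)."""
    total = sum(
        math.exp(-lam * hamming(s, central))
        for s in itertools.product(states, repeat=len(central))
    )
    return math.log(total)


# Toy example: hypothetical state labels and a hypothetical central sequence.
states = ("E", "S", "U")
central = ("S", "S", "E", "E")
lam = 0.7

closed = log_norm_const(lam, len(states), len(central))
brute = brute_force_log_norm_const(central, lam, states)
```

In this toy setting `closed` and `brute` agree up to floating-point error, and the densities of all 3^4 possible sequences sum to one. Because the constant never requires enumeration, it stays cheap to evaluate for realistic sequence lengths, which is what makes iterative model fitting practical.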


