Biclustering random matrix partitions with an application to classification of forensic body fluids

by Chieh-Hsi Wu, et al.

Classification of unlabeled data is usually achieved by supervised learning from labeled samples. Although many sophisticated supervised machine learning methods can predict missing labels with a high level of accuracy, they often lack the transparency required in situations where it is important to provide interpretable results and meaningful measures of confidence. Body fluid classification of forensic casework data is a case in point. We develop a new Biclustering Dirichlet Process (BDP), with a three-level hierarchy of clustering, and a model-based approach to classification which adapts to block structure in the data matrix. As the class labels of some observations are missing, the number of rows in the data matrix for each class is unknown. The BDP handles this and extends existing biclustering methods by simultaneously biclustering multiple matrices, each having a random number of rows. We demonstrate our method by applying it to the motivating problem, which is the classification of body fluids based on mRNA profiles taken from crime scenes. The analyses of casework-like data show that our method is interpretable and produces well-calibrated posterior probabilities. Our model can be applied more generally to other types of data with a structure similar to that of the forensic data.




