MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data

11/19/2021
by Michal Gerasimiuk et al.

A major challenge in embedding or visualizing clinical patient data is the heterogeneity of variable types: continuous lab values, categorical diagnostic codes, and missing or incomplete data. In particular, in EHR data some variables are missing not at random (MNAR): they are deliberately not collected and are therefore themselves a source of information. For example, lab tests may be deemed necessary for some patients on the basis of a suspected diagnosis, but not for others. Here we present the MURAL forest – an unsupervised random forest for representing data with disparate variable types (e.g., categorical, continuous, MNAR). MURAL forests consist of a set of decision trees in which node-splitting variables are chosen at random, and each split is placed so that the marginal entropy of all other variables is minimized. This allows us to split on MNAR and discrete variables in a way that is consistent with the continuous variables. The end goal is to learn a MURAL embedding of patients from the average tree distances between them. These distances can be fed to a nonlinear dimensionality reduction method such as PHATE to derive visualizable embeddings. While such methods are ubiquitous for continuous-valued datasets (like single-cell RNA-sequencing data), they have not been used extensively on mixed-variable data. We showcase our method on one artificial and two clinical datasets, and show that it visualizes and classifies data more accurately than competing approaches. Finally, we show that MURAL can also be used to compare cohorts of patients via the recently proposed tree-sliced Wasserstein distances.
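The core idea in the abstract — pick a split variable at random, place the threshold so that the marginal entropy of the remaining variables is minimized, then read off depth-weighted tree distances averaged over the forest — can be sketched roughly as follows. This is not the authors' implementation: the histogram entropy estimator, the candidate-threshold sampling, the `2**-depth` pair weighting, and all function names are simplifying assumptions.

```python
import numpy as np

def marginal_entropy(X):
    """Sum of per-column histogram entropies: a crude stand-in for the
    marginal-entropy criterion described in the abstract."""
    total = 0.0
    for j in range(X.shape[1]):
        counts, _ = np.histogram(X[:, j], bins=8)
        p = counts[counts > 0] / counts.sum()
        total -= float(np.sum(p * np.log(p)))
    return total

def best_split(X, feature, rng, n_candidates=10):
    """Among random candidate thresholds on `feature`, keep the one whose two
    sides have the lowest size-weighted marginal entropy over the *other* variables."""
    others = [j for j in range(X.shape[1]) if j != feature]
    best_t, best_score = None, np.inf
    for t in rng.choice(X[:, feature], size=min(n_candidates, len(X)), replace=False):
        left, right = X[X[:, feature] <= t], X[X[:, feature] > t]
        if len(left) < 2 or len(right) < 2:
            continue
        score = (len(left) * marginal_entropy(left[:, others]) +
                 len(right) * marginal_entropy(right[:, others])) / len(X)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

def build_tree(X, idx, rng, depth=0, max_depth=4):
    """Recursively split a randomly chosen feature at the entropy-minimizing threshold."""
    if depth >= max_depth or len(idx) < 4:
        return {"leaf": idx}
    f = int(rng.integers(X.shape[1]))
    t = best_split(X[idx], f, rng)
    if t is None:
        return {"leaf": idx}
    mask = X[idx, f] <= t
    return {"left": build_tree(X, idx[mask], rng, depth + 1, max_depth),
            "right": build_tree(X, idx[~mask], rng, depth + 1, max_depth)}

def leaves(node):
    if "leaf" in node:
        return list(node["leaf"])
    return leaves(node["left"]) + leaves(node["right"])

def tree_distance(node, n, depth=0, D=None):
    """Pairs separated nearer the root are farther apart: each pair receives
    2**-depth at the (unique) node where it is split."""
    if D is None:
        D = np.zeros((n, n))
    if "leaf" in node:
        return D
    li, ri = leaves(node["left"]), leaves(node["right"])
    D[np.ix_(li, ri)] += 2.0 ** -depth
    D[np.ix_(ri, li)] += 2.0 ** -depth
    tree_distance(node["left"], n, depth + 1, D)
    tree_distance(node["right"], n, depth + 1, D)
    return D

# Forest distance: average the tree distances; D could then be handed to a
# distance-based embedding method such as PHATE or MDS.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))
trees = [build_tree(X, np.arange(len(X)), rng) for _ in range(5)]
D = sum(tree_distance(t, len(X)) for t in trees) / 5
```

In the paper's setting the split logic would also handle categorical and MNAR variables (e.g., routing "not collected" as its own branch); the sketch above covers only the continuous case.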


research · 06/04/2022
Missing data imputation for a multivariate outcome of mixed variable types
Data collected in clinical trials are often composed of multiple types o...

research · 11/30/2017
Who wins the Miss Contest for Imputation Methods? Our Vote for Miss BooPF
Missing data is an expected issue when large amounts of data is collecte...

research · 06/19/2015
CO2 Forest: Improved Random Forest by Continuous Optimization of Oblique Splits
We propose a novel algorithm for optimizing multivariate linear threshol...

research · 12/10/2015
Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance
Recursive partitioning approaches producing tree-like models are a long ...

research · 04/13/2023
Heterogeneous Oblique Double Random Forest
The decision tree ensembles use a single data feature at each node for s...

research · 11/14/2016
Splitting matters: how monotone transformation of predictor variables may improve the predictions of decision tree models
It is widely believed that the prediction accuracy of decision tree mode...

research · 07/22/2019
Evaluation of Embeddings of Laboratory Test Codes for Patients at a Cancer Center
Laboratory test results are an important and generally highly dimensiona...
