Surrogate-guided sampling designs for classification of rare outcomes from electronic medical records data

03/31/2019
by W. Katherine Tan, et al.

Scalable and accurate identification of specific clinical outcomes has been enabled by machine learning applied to electronic medical record (EMR) systems. Developing an automatic classifier requires a complete labeled data set, in which true clinical outcomes are obtained through manual review by human experts. For example, developing natural language processing algorithms requires abstracting clinical text data to obtain the outcome information needed to train models. However, if the outcome is rare, simple random sampling yields very few cases and too little information to develop accurate classifiers. Since large-scale detailed abstraction is often expensive, time-consuming, and infeasible, more efficient strategies are needed. For such resource-constrained settings, we propose a class of enrichment sampling designs, in which selection for abstraction is stratified by auxiliary variables related to the true outcome of interest. Stratified sampling on highly specific variables results in targeted samples that are more enriched with cases, which we show translates to increased model discrimination and better statistical learning performance. We provide mathematical details and simulation evidence that link sampling designs to the performance of the resulting prediction models. We discuss the impact of our proposed sampling on both model development and validation. Finally, we illustrate the proposed designs for outcome label collection and subsequent machine learning, using radiology report text data from the Lumbar Imaging with Reporting of Epidemiology (LIRE) study.
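The enrichment idea described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: the function names, the 2% prevalence, the surrogate sensitivity and specificity, the abstraction budget of 500, and the 50/50 allocation across surrogate strata are all assumptions chosen only for illustration. It compares how many true cases end up in the labeled sample under simple random sampling versus sampling stratified on a binary surrogate.

```python
import random


def simulate_population(n, prevalence, sensitivity, specificity, rng):
    """Return a list of (outcome, surrogate) pairs, both binary.

    The surrogate is a noisy, cheaply available proxy for the true outcome
    (hypothetical example: a keyword flag in a radiology report).
    """
    population = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        if y == 1:
            s = 1 if rng.random() < sensitivity else 0
        else:
            s = 0 if rng.random() < specificity else 1
        population.append((y, s))
    return population


def simple_random_sample(population, budget, rng):
    """Select `budget` records for abstraction, ignoring the surrogate."""
    return rng.sample(population, budget)


def surrogate_guided_sample(population, budget, frac_positive, rng):
    """Select `budget` records, stratified by the surrogate.

    `frac_positive` is the share of the budget spent on surrogate-positive
    records (an assumed design parameter).
    """
    positives = [unit for unit in population if unit[1] == 1]
    negatives = [unit for unit in population if unit[1] == 0]
    n_pos = min(len(positives), round(budget * frac_positive))
    n_neg = min(len(negatives), budget - n_pos)
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)


def case_count(sample):
    """Number of true cases (outcome == 1) among the sampled records."""
    return sum(y for y, _ in sample)


rng = random.Random(0)  # fixed seed so the comparison is reproducible
population = simulate_population(
    n=100_000, prevalence=0.02, sensitivity=0.6, specificity=0.98, rng=rng
)
srs = simple_random_sample(population, budget=500, rng=rng)
sgs = surrogate_guided_sample(population, budget=500, frac_positive=0.5, rng=rng)

print("cases under simple random sampling:", case_count(srs))
print("cases under surrogate-guided sampling:", case_count(sgs))
```

Under these assumed settings, roughly 38% of surrogate-positive records are true cases, compared with 2% in the population as a whole, so the stratified sample contains many more cases for the same abstraction budget. The enriched sample no longer reflects the population prevalence, so any prevalence-dependent quantity estimated from it has to account for the sampling design.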

