A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Asse

08/28/2020
by Ryan J. Urbanowicz, et al.

Machine learning (ML) offers a collection of powerful approaches for detecting and modeling associations, and is often applied to data with a large number of features and/or complex associations. Many tools now exist to facilitate custom ML analyses (e.g., scikit-learn). Interest is also growing in automated ML packages, which can make it easier for non-experts to apply ML and have the potential to improve model performance. ML permeates most subfields of biomedical research, with varying levels of rigor and correct usage. The tremendous opportunities offered by ML are frequently offset by the difficulty of assembling comprehensive analysis pipelines and by the ease with which ML can be misused. In this work we lay out and assemble a complete, rigorous ML analysis pipeline focused on binary classification (i.e., case/control prediction), and apply it to both simulated and real-world data. At a high level, this 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, d) model training with 9 established ML algorithms, each with hyperparameter optimization, and e) thorough evaluation, including appropriate metrics, statistical analyses, and novel visualizations. The pipeline organizes the many subtle complexities of ML pipeline assembly to illustrate best practices that avoid bias and ensure reproducibility. Additionally, this pipeline is the first to compare established ML algorithms to 'ExSTraCS', a rule-based ML algorithm with the unique capability of interpretably modeling heterogeneous patterns of association. While the pipeline is designed to be widely applicable, we apply it to an epidemiological investigation of established and newly identified risk factors for pancreatic cancer to evaluate how different sources of bias might be handled by ML algorithms.
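One of the bias-avoidance practices such a pipeline has to get right can be illustrated with a minimal sketch. The code below is not the authors' implementation and uses only the Python standard library rather than scikit-learn. It assumes that the evaluation uses stratified k-fold cross-validation, that any data transformation is fit on the training partition only, and that balanced accuracy is among the "appropriate metrics"; none of these specifics is stated in the abstract. All function names are hypothetical, and the nearest-centroid classifier is a toy stand-in, not one of the 9 algorithms the abstract refers to.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class out round-robin
    return folds


def fit_scaler(rows):
    """Learn per-feature mean and standard deviation from the given rows."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    stds = []
    for j, mean in enumerate(means):
        var = sum((row[j] - mean) ** 2 for row in rows) / n
        stds.append(var ** 0.5 or 1.0)  # guard against constant features
    return means, stds


def apply_scaler(rows, scaler):
    """Standardize rows with previously learned means and deviations."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def fit_centroids(rows, labels):
    """Toy stand-in classifier (an assumption): one mean vector per class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in grouped.items()
    }


def predict(centroids, rows):
    """Predict the class whose centroid is nearest in squared distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda c: sq_dist(row, centroids[c])) for row in rows]


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is not inflated by class imbalance."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def cross_validate(X, y, k=3, seed=0):
    """Score the toy model on each fold, fitting the scaler on training data only."""
    folds = stratified_folds(y, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        X_train = [X[j] for j in train_idx]
        y_train = [y[j] for j in train_idx]
        scaler = fit_scaler(X_train)  # never sees the held-out fold
        model = fit_centroids(apply_scaler(X_train, scaler), y_train)
        X_test = apply_scaler([X[j] for j in test_idx], scaler)
        y_test = [y[j] for j in test_idx]
        scores.append(balanced_accuracy(y_test, predict(model, X_test)))
    return scores


def make_toy_data(n_per_class=30, seed=1):
    """Simulated two-class data: one informative feature, one noise feature."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0), rng.gauss(0.0, 1.0)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(cross_validate(X, y, k=3))
```

The design point is that `fit_scaler` is called inside the cross-validation loop. Computing the means and deviations once on the full dataset would leak information from each held-out fold into training, which is one of the easy-to-miss sources of optimistic bias in ML analyses.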


