Three principles of data science: predictability, computability, and stability (PCS)

01/23/2019
by   Bin Yu, et al.

We propose the predictability, computability, and stability (PCS) framework to extract reproducible knowledge from data that can guide scientific hypothesis generation and experimental design. The PCS framework builds on key ideas in machine learning, using predictability as a reality check and evaluating computational considerations in data collection, data storage, and algorithm design. It augments predictability and computability (PC) with an overarching stability principle, which substantially expands traditional statistical uncertainty considerations. In particular, stability assesses how results vary with respect to choices (or perturbations) made across the data science life cycle, including problem formulation, pre-processing, modeling (data and algorithm perturbations), and exploratory data analysis (EDA) before and after modeling. Furthermore, we develop PCS inference to investigate the stability of data results and identify when models are consistent with relatively simple phenomena. We compare PCS inference with existing methods, such as selective inference, in high-dimensional sparse linear model simulations to demonstrate that our methods consistently outperform others in terms of ROC curves over a wide range of simulation settings. Finally, we propose PCS documentation based on R Markdown, IPython, or Jupyter Notebook, with publicly available, reproducible code and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo.
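To make the stability principle concrete, here is a minimal sketch of one kind of data perturbation. It is not the authors' PCS inference procedure: the bootstrap resampling, the correlation threshold used to select features, the synthetic data, and every function name below are assumptions chosen only for illustration. The sketch reports how often each feature is selected across perturbed copies of the data.

```python
import random


def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def select_features(rows, ys, threshold):
    """Hypothetical selection rule: keep features whose |correlation| with y
    reaches the threshold."""
    n_features = len(rows[0])
    return {
        j
        for j in range(n_features)
        if abs(correlation([row[j] for row in rows], ys)) >= threshold
    }


def selection_stability(rows, ys, threshold=0.3, n_perturbations=200, seed=0):
    """Fraction of bootstrap resamples in which each feature is selected."""
    rng = random.Random(seed)
    n = len(rows)
    counts = [0] * len(rows[0])
    for _ in range(n_perturbations):
        # Data perturbation: resample observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = select_features(
            [rows[i] for i in idx], [ys[i] for i in idx], threshold
        )
        for j in chosen:
            counts[j] += 1
    return [count / n_perturbations for count in counts]


# Synthetic data (an assumption): only feature 0 drives the response.
data_rng = random.Random(1)
rows = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
ys = [2.0 * row[0] + data_rng.gauss(0, 1) for row in rows]

stability = selection_stability(rows, ys)
for j, score in enumerate(stability):
    print(f"feature {j}: selected in {score:.0%} of perturbations")
```

A feature that is selected in nearly every resample is stable under this particular perturbation, while one that appears only occasionally is not. The same loop could be repeated over other choices named in the abstract, such as pre-processing steps or algorithm settings.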


