Modeling Data Analytic Iteration With Probabilistic Outcome Sets

09/15/2023
by Roger D. Peng, et al.

In 1977, John Tukey described how, in exploratory data analysis, data analysts use tools such as data visualizations to separate their expectations from what they observe. An underappreciated aspect of data analysis, and one that sets it apart from statistical theory, is that a data analyst must make decisions by comparing the observed data, or the output of a statistical tool, to what the analyst previously expected from the data. However, there is little formal guidance on how to make these data analytic decisions, because statistical theory generally omits any discussion of who is using the statistical methods. Here, we extend the basic idea of comparing an analyst's expectations to what is observed in a data visualization to more general analytic situations. We propose a model for the iterative process of data analysis based on the analyst's expectations, using what we refer to as expected and anomaly probabilistic outcome sets, and the concept of statistical information gain. Our model posits that the analyst's goal is to increase, through successive analytic iterations, the amount of information the analyst has relative to what the analyst already knows. We introduce two criteria, expected information gain and anomaly information gain, to provide guidance about analytic decision-making and ultimately to improve the practice of data analysis. Finally, we show how our framework can be used to characterize common situations in practical data analysis.


research
08/06/2021

Lumos: Increasing Awareness of Analytic Behavior during Visual Data Analysis

Visual data analysis tools provide people with the agency and flexibilit...
research
04/27/2021

Do We Expect More from Radiology AI than from Radiologists?

What we expect from radiology AI algorithms will shape the selection and...
research
07/03/2019

bayes4psy – an Open Source R Package for Bayesian Statistics in Psychology

Research in psychology generates interesting data sets and unique statis...
research
05/06/2022

Visual Data Analysis with Task-based Recommendations

General visualization recommendation systems typically make design decis...
research
11/19/2020

Categorical exploratory data analysis on goodness-of-fit issues

If the aphorism "All models are wrong"- George Box, continues to be true...
research
07/10/2020

Boba: Authoring and Visualizing Multiverse Analyses

Multiverse analysis is an approach to data analysis in which all "reason...
research
08/31/2021

DoGR: Disaggregated Gaussian Regression for Reproducible Analysis of Heterogeneous Data

Quantitative analysis of large-scale data is often complicated by the pr...
