1.1 Data visualization in hypothesis-driven analysis
“Nothing—not the careful logic of mathematics, not statistical models and theories, not the awesome arithmetic power of modern computers—nothing can substitute here for the flexibility of the informed human mind,” wrote Tukey and Wilk half a century ago (tukey1966data). Since then, research areas like information visualization and interactive analytics have become thriving subfields of computer science, motivated by an assumption that interactive visual interfaces for querying data enable humans to combine their domain knowledge with data summaries to produce insight. This has led to the development of interactive interfaces to help analysts more easily conduct ad hoc data exploration and analysis, from programmatic environments like computational notebooks, to modern business intelligence tools that create dashboards or trellis plots without the user needing to manually specify encodings, to visualization recommenders that serve up data summaries optimized for perception and exposure of patterns.
While notebooks created with RStudio, Jupyter, and similar packages are based in programming languages that offer analysts great flexibility in terms of graphical and statistical functions, interactive graphical user interfaces for data analysis provide a more constrained environment, enabling exploration of tabular data at the click of a button or dragging action of a variable. These tools vary in how much support they provide for different stages of analysis, perhaps implying different perspectives on what interactive analysis of data should be. Tableau and Power BI, for instance, are commonly adopted as visual analysis tools in applications like business intelligence, as well as reporting tools used to create displays like dashboards for decision making, but offer relatively little support for modeling and statistical testing. DataDesk and JMP, on the other hand, provide graphical tools like brushing and linking as well as a suite of modeling tools to support canonical statistical models like regressions and hypothesis tests.
If we look to research on state-of-the-art graphical user interface tools for exploratory and visual analysis, researchers often motivate their work in ways that imply that the value of the interface is to get out of the way of the data, so the human analyst can find the patterns or ‘insights’ they hold. Tools are intended to create a responsive environment where queries are met at the ‘speed of human thought’ (heer2012interactive), to enable more flexible inputs by which users can query and analyze data, and to efficiently summarize data despite the scalability problems that arise as datasets grow larger.
One possible presumption behind prioritizing data exposure in building these tools is that exploratory and confirmatory stages of an analysis workflow are easily distinguished. Some accounts of how knowledge is created during data analysis would seem to imply that so-called exploratory analysis is ‘model-free’ and consists of preparing and familiarizing oneself with data, searching for useful representations or transformations, and noting interesting observations. Confirmatory analysis, on the other hand, involves verifying that data support a hypothesis or generating new hypotheses (keim2008visual, pirolli2005sensemaking, sacha2014knowledge, thomascook). Statisticians and others have long warned that a failure to distinguish exploratory and confirmatory stages can lead to “naive empiricism run amok” (macdonald1983exploratory), referring to pseudo-scientific use of data to confirm existing beliefs or identify patterns that do not betray underlying regularities in the target phenomena. Inappropriate overlap between exploratory and confirmatory analysis has even been proposed as a contributing factor to failed attempts to replicate what were believed to be high quality experiments in psychology and other fields (the “replication crisis”). The dangers of too much overlap between EDA and CDA have more recently been the premise of work in computer science that pursues algorithms and interfaces for mitigating the symptoms of too much flexibility, such as by tracking and adjusting for visual comparisons that an analyst makes.
In practice, however, it can be difficult to draw a clear line between exploratory and confirmatory data analysis. Model-driven inference plays a role even in canonically exploratory activities; after all, what is surprising is defined by the implicit or explicit model of our expectations. With the help of our visual system, we engage in processes comparable to fitting implicit models to data when we examine visualizations for distribution and trend, and we judge fit when we notice outliers and other deviations from symmetries inherent in graphical forms like histograms or scatterplots (Figure 1a, b). We build up faceted displays like trellis plots to look for more complex effects and possible interactions in data (Figure 1c). Conversely, we use graphs to assess residual deviation from models we explicitly specify and fit on data (Figure 1d). In discussions of science reform more broadly, some have argued that it is misleading to point to a lack of separation between EDA and CDA as a key contributor to problems of false claims and non-replicability (devezer2020case, szollosi2019preregistration). Instead, they propose that these problems stem in large part from scientists’ failures to develop theories that strongly imply testable hypotheses, implying a different solution set for improving the validity of statistical inferences.
In this article we consider how assumptions about the analysis process and specifically the distinction between EDA and CDA may be reflected in interactive systems for exploratory visual analysis, and consequently impact the analyses that these tools support. We propose that designing software to strengthen, rather than separate, the links between purely exploratory and model-driven analysis can lead to better analysis. We argue that this bridging necessitates engaging with theories aimed at describing human statistical inference during graphical analysis. Without an underlying theoretical basis to ground how exploratory activities feed the development of theories and models, computer scientists and statisticians can easily end up designing software that encourages only vague theories about how data were generated and conflicts with real world analysis stakes and goals. While our article is conceptual in nature, our arguments are backed by a growing amount of empirical research in the areas of interactive and exploratory analysis and uncertainty visualization.
The article is organized as follows: We first consider the origins of interactive data analysis, and how they might have led to a fixation in system design on exposure, the “laying open of the data to display the unanticipated” (tukey1966data). We describe how examples of negative implications of data exposure in recent research can be linked to an idea of ‘rough CDA’ as a frequent activity in analysis. We describe how conceiving of exploratory analysis activities as driven by model checks in a Bayesian framework provides a generalizable framework for developing interactive analysis tools to support multiple proposed stages of exploratory data analysis. We note how this view overlaps as well as diverges from several other recent approaches to formalizing the role of statistical graphics in inference, including graphical inference as Bayesian cognition and statistical hypothesis testing. We discuss design implications of adopting a Bayesian model check formulation for interactive analysis software. We recommend ways in which interface features might better support analysts in specifying and testing implications of their implicit statistical models, from data diagnostics phases to rough confirmatory analysis. Finally, we discuss how attempting to fully automate a human-like analysis workflow might stimulate insights about how to improve interactive analysis interfaces.
2 Background: Exploratory and interactive data analysis
2.1 Tukey on exploratory data analysis
It may seem quite obvious that if you are doing data analysis, the interface you use should above all prioritize representation and easy access to the data. This way of thinking owes much of its motivation to the exploratory data analysis movement pioneered by John Tukey in the 1960s. tukey1962future popularized the idea of exploratory data analysis (EDA) as a natural complement to confirmatory data analysis (CDA), writing that “[t]he simple graph has brought more information to the data analyst’s mind than any other device. It specializes in providing indications of unexpected phenomena.”
Tukey’s proposal of EDA is memorable in part because he directly addressed a tension between the flexibility in thinking required to learn from one’s data through construction of graphics and transformations and the supposed guarantees of confirmatory approaches. For instance, tukey1966data wrote, “[f]ormal statistics has given almost no guidance to exposure; indeed, it is not clear how the informality and flexibility appropriate to the exploratory character of exposure can be fitted into any of the structures of formal statistics so far proposed” and criticized the formal inquiries into properties of confirmatory methods he saw around him as a means of “legitimizing variation by confining it by assumption to random sampling” and “restoring the appearance of security by emphasizing narrowly optimized techniques and claiming to make statements with ‘known’ probabilities of error.” Tukey’s classic text on EDA distinguishes it as a separate stage of analysis from CDA, and much of his work acknowledges a need to distinguish explicit confirmatory procedures to address implications of flexibility, stressing, for example, the importance of treating as provisional identified patterns that had not been tested on different data through procedures like cross validation (mosteller1977data).
It is not surprising, then, that scholars have continued to stress a division between EDA and CDA, describing EDA as “freewheeling search for structure” (buja2009) and repeating the analogy originally put forth by tukey1972exploratory of a detective developing hunches, while classically confirmatory activities like hypothesis testing can be likened to a jury deciding whether a defendant is guilty (behrens1997principles, behrens1996data, buja2009, wickham2010).
However, stressing a distinction between EDA and CDA can risk overlooking the strong sentiment in many of Tukey’s writings of how exploratory analysis and model fitting go hand in hand. Some of the graphics he espoused can be interpreted in terms of a model; “hanging rootograms,” for example, would be difficult to motivate without reference to a Poisson count model. He spoke of the value of the “iterative character of the relationship of exposing and summarizing” in exploratory analysis (tukey1966data). By attempting to fit models to data, one learns about what doesn’t fit, a process that has been called model diagnostics (buja2009), which enables that which didn’t fit to be “more effectively approached and structured because there has been some fit, even a poor one” (tukey1966data). tukey1969 acknowledged the contradiction of analysis philosophies he espoused: “If in advocating flexibility as necessary—and the jackknife as good—I am thought to be leaning one way, then in advocating explicit and careful attention to problems of multiplicity I am doubtless thought to be leaning in the other.” Hence even Tukey’s most direct statements about problems with CDA would seem to be best understood as intended more to “move the center of gravity away from an (over)emphasis on mathematical theory to a greater balance between methodology, theory, and applications” (friedman2002) than to call for an abandonment of confirmatory approaches in graphically aided analysis.
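To make the link between a hanging rootogram and a Poisson model concrete, the following minimal sketch computes the quantities such a display would show. It assumes the rate is estimated by the sample mean and uses the square root scale, which approximately stabilizes the variance of Poisson counts. The function names and the count data are hypothetical illustrations, not taken from Tukey's text.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Probability of observing count k under a Poisson(lam) model."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def hanging_rootogram(counts):
    """Return (k, sqrt_expected, sqrt_observed, gap) rows for a Poisson fit.

    Bars of height sqrt(observed) hang down from the fitted curve at
    sqrt(expected), so `gap` is where each bar's bottom sits relative to zero.
    """
    n = len(counts)
    lam = sum(counts) / n  # maximum likelihood estimate of the Poisson rate
    observed = Counter(counts)
    rows = []
    for k in range(max(counts) + 1):
        sqrt_exp = math.sqrt(n * poisson_pmf(k, lam))
        sqrt_obs = math.sqrt(observed.get(k, 0))
        rows.append((k, sqrt_exp, sqrt_obs, sqrt_exp - sqrt_obs))
    return rows


# Hypothetical count data with more zeros than a Poisson model would predict.
data = [0] * 30 + [1] * 10 + [2] * 10 + [3] * 10 + [4] * 5 + [5] * 5
rootogram = hanging_rootogram(data)
for k, sqrt_exp, sqrt_obs, gap in rootogram:
    print(f"k={k}  sqrt(expected)={sqrt_exp:5.2f}  "
          f"sqrt(observed)={sqrt_obs:5.2f}  gap={gap:+5.2f}")
```

A negative gap means a bar drops below the zero line (more observations than the model expects), and a positive gap means it stops short. In this sense, reading the graphic amounts to checking the fit of the model it was built around.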
2.2 Exploration versus confirmation in science reform
The rapidly growing science reform literature also involves debates over the role and proper ‘control’ of exploratory analysis in empirical scientific research. Many reformers have suggested that the lack of reproducibility of many high profile empirical results in psychology, known as the replication crisis, can be attributed to researchers failing to adequately separate exploratory from confirmatory stages of research. For example, wagenmakers2012 motivate an agenda for “purely confirmatory research” due to how exploratory analyses cause statistics to lose their guarantees. nosek2018 describe how overlooking the difference between EDA and CDA can “lead to overconfidence in post hoc explanations (postdictions) and inflate the likelihood of believing that there is evidence for a finding when there is not” (p. 2600, as cited in szollosi2019arrested). Methods like preregistration (nosek2019preregistration), in which a researcher declares their hypotheses and analysis plan in advance of data collection, have been popularized as a solution.
However, a growing body of work argues that attempts to strictly separate exploratory and confirmatory analysis are not well motivated logically or empirically. szollosi2019preregistration argue that preregistration does not directly solve the problem of poor diagnosticity of statistical tests when exploratory findings are confirmed, since these depend critically on how well statistical models map to underlying theories, nor is there good reason to believe that it will encourage researchers to reflect more deeply on their theories, methods, and analyses. Others argue that some problems associated with a lack of distinction between EDA and CDA, like hypothesizing after results are known (HARKing), are not well evidenced to contribute to a lack of replicability (rubin2017does) and can be helpful if done transparently (hollenbeck2017harking). devezer2020case point out how the reform literature provides no unambiguous definitions for confirmatory versus exploratory.
Providing some support for a notion that EDA is ‘model-free’, oberauer2019addressing argue that in discovery-oriented research, theories do not strongly imply testable hypotheses. Instead, theories define a search space for effects that would support them, where failure to find effects does not invalidate theory. The question is not how the theory is wrong when effects aren’t found, but why the data being assessed might not have been appropriate. Only in theory-testing research does the theory strongly imply a hypothesis, such that a lack of support for the hypothesis is evidence against the theory. devezer2020case describe how, in light of an alternative view that exploratory analysis often involves deliberate and systematic attempts at discovering generalizations (stebbins2001, as cited in devezer2020case), exploratory analysis can be thought of as analogous to mapping unknown spaces until one is “convinced that there is no element within the region being explored that remains undiscovered,” whether these be theoretical spaces, model spaces, or concerned with experimentation. They relate hypothesis generation that occurs during exploratory analysis to abduction proper, in which scientists consider all of their knowledge about a phenomenon with the aim of adding new insight or understanding, a process which is believed to be irreducible to formal statistical inference (blokpoel2018deep, van2020theory).
Our view on philosophies of exploratory data analysis for designing interactive interfaces agrees with recent discussions of science reform arguing the relationship between exploratory and confirmatory activities is not as simple as arguments for clear separation between the phases imply. Like recent philosophical work in science reform, we acknowledge that there may not exist a normative model to encompass the diversity of activities associated with exploratory analysis. We argue that attempts to formalize inference processes are nonetheless important for guiding interface design despite their imperfections. This is because formalizations establish testable implications to drive knowledge gain about how EDA occurs, while viewing EDA as an atheoretical approach can restrict analysts from identifying connections between their graphical inferences and the models that would allow them to formalize them. We also point out how GUI EDA analysis applications may often be used for intuitive probabilistic inference that is not followed by confirmatory analysis, motivating a need to better integrate support for activities associated with both EDA and CDA.
2.3 Innovations in graphical user interfaces for analysis
Modern interactive data analysis also owes much to developments in computer science, in the same way that earlier advances in statistical modeling by Laplace, Gauss, and so forth accompanied progress in mathematics. As Tukey began writing about exploratory data analysis, computer scientists such as Engelbart, Kay, Sutherland, and others made pioneering efforts in the development of software interfaces for “intelligence augmentation.” As promoted by engelbart1963, intelligence augmentation is associated with “increasing the capability of a man to approach a complex problem situation, to gain comprehension to suit his particular needs, and to derive solutions to problems.” Increased capability could come as greater efficiency (perhaps framed as “more rapid comprehension” or “speedier solutions”) as well as improved perception of possible solutions to problems that before seemed unsolvable. The broad framing of IA by these early pioneers outlined a vision for transforming interactions with computers in which graphical user interfaces for data analysis were a natural step.
Tukey’s and colleagues’ system PRIM-9 augmented the capabilities of a human by enabling perception of higher (2+) dimensional data (fisherkeller1988prim). A user could “dissect” multivariate data through point cloud rotation, use masking to select subregions of a space, and isolate particular subsamples. Because an analyst will rarely be able to specify the “optimal” projection, finding an appropriate one requires moving about in a multidimensional space, which PRIM-9 enabled through controlled continuous rotation (friedman2002). Tukey’s work on PRIM-9 led to further developments through projection pursuit, the incorporation of automation into interactive visualization by optimizing a projection index to detect interesting directions of study (friedman1974projection).
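The idea of optimizing a projection index can be sketched in a few lines. The example below is a toy illustration under stated assumptions: it projects two-dimensional data onto one dimension, scores each direction by absolute excess kurtosis (a simple stand-in for departure from Gaussian shape, not the index proposed in friedman1974projection), and searches directions on a grid rather than with a numerical optimizer. All names and data are hypothetical.

```python
import math
import random


def excess_kurtosis(values):
    """Sample excess kurtosis; close to 0 for Gaussian-looking data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


def projection_index(points, angle):
    """Score a 1-D projection by how far it departs from Gaussian shape."""
    ux, uy = math.cos(angle), math.sin(angle)
    projected = [x * ux + y * uy for x, y in points]
    return abs(excess_kurtosis(projected))


def pursue_projection(points, steps=180):
    """Grid-search directions over a half circle and return the best angle."""
    angles = [math.pi * i / steps for i in range(steps)]
    return max(angles, key=lambda a: projection_index(points, a))


# Hypothetical data: two clusters separated along x, Gaussian noise along y.
rng = random.Random(0)
points = [(rng.choice([-3.0, 3.0]) + rng.gauss(0, 0.5), rng.gauss(0, 2.0))
          for _ in range(500)]

best_angle = pursue_projection(points)
print(f"most interesting direction: {math.degrees(best_angle):.0f} degrees")
```

The search should favor directions near the x axis, where the two clusters separate, over the y axis, where the projection looks Gaussian. In higher dimensions the grid search would be replaced by an optimizer over unit vectors, and the choice of index determines which kinds of structure count as interesting.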
In the decades that followed, other statisticians made graphics contributions. Asimov (asimov1985grand) introduced the grand tour, which used animation to stitch together projections on high dimensional data for visual analysis in a seemingly continuous way. Projection pursuit guided tour combined both methods for better results when identifying low-dimensional structures in sparse high dimensional data (cook1995grand). becker1987brushing explored brushing as a way to interactively select data in a visualization one is analyzing, in order to see the same data in other linked views, such as when viewing a scatterplot matrix. XGobi (swayne1998xgobi), followed by GGobi (swayne2003ggobi), made these state-of-the-art dynamic statistical graphic methods available in a single environment. The scagnostics of wilkinson2005graph, wilkinson2006high explored a graph-theoretic set of measures for grouping bivariate scatterplots of high dimensional data, while his grammar of graphics provided a formal description of statistical graphics (wilkinson2012grammar).
Computer scientists also began to take more interest in the new interactive capabilities for data analysis afforded by more powerful computation. shneiderman1974computer, shneiderman1982future coined the term “direct manipulation” in the early 1980s to refer to systems in which objects of interest such as data points were continuously represented and could be acted on through physical manipulation or button presses. In contrast to the inflexible and hard to learn syntax of conventional query languages, direct manipulation was easy and produced immediately visible and reversible results (hutchins1985direct). One could call direct manipulation interfaces for data analysis an early step towards “democratizing data analysis,” as these tools reduced the amount of specialized knowledge required to interact with data; one no longer needed to memorize rigid syntax, for example.
The late 1980s saw the emergence of visualization as a subfield of computer science (mccormick1987visualization), focused on amplifying cognition through visual methods drawn from computer graphics, vision, signal processing, human computer interaction, and others, and addressing domain applications like medical imaging, planetary sciences, and molecular modeling. Information visualization, which is closer to our focus here, concerns visualizing abstract data for which spatial mappings can be chosen more arbitrarily (e.g., statistical graphics) and was distinguished in the 1990s (card1999readings), drawing cognitive scientists and psychologists, statisticians, and cartographers. While many early advances sought to enhance data analysis among experts, the last few decades of research in the field have seen a surge of interest in making visual data analysis accessible to more novice users. Today, widely used systems like Tableau Software employ innovations that grew out of visualization research, by encoding state-of-the-art knowledge on effective visualization (mackinlay1986automating) and reducing the efforts required to manually specify views through drag-and-drop interfaces like Tableau’s shelf model (tableau, stolte2002polaris) or button-driven chart type transformations (mackinlay2007), which interpret these user interactions as database queries.
More recently, recommender systems have become an active area of research in visualization (vartak2017towards). Recommenders aim to be even more hands-off than popular visualization tools like Tableau or Power BI by suggesting data wrangling operations (kandel2011wrangler), or by suggesting views to analysts based on perceptual properties (wongsuphasawat2015voyager, wongsuphasawat2017voyager), statistical analyses (demiralp2017foresight, key2012vizdeck, vartak2015s) or contextual or behavioral properties (bromley2014dive, lin2020dziban, key2012vizdeck, gotz2009behavior), requiring minimal to no input from the user after a dataset has been loaded. Other tools literally make analysis hands-off or at least “mouse off” by supporting new input modalities like natural language (gao2015datatone, setlur2016eviza, srinivasan2017orko) or touch (powerbi, vizable). Such forms of “behavior optimization” comprise the state of the art in interactive data analysis system design (rahman2020evaluating).
Research in visual analytics has evolved in tandem with that in interactive visualization, being differentiable mainly in its focus on integrating visualization-based and automated data analysis methods (keim2008visual), for example as in interfaces for developing and debugging state-of-the-art machine learning models like neural nets (see hohman2018visual for a review). Relevant to our interests are attempts to conceptualize the visual analytics process as a model of knowledge generation (andrienko2018viewing, keim2008visual, pirolli2005sensemaking, sacha2014knowledge, wang2009defining). Researchers have commented on modeling and uncertainty as implicit in exploratory and interactive analysis (andrienko2018viewing, sacha2014knowledge). Recent work uses set theory to define patterns as relationships between multiple data elements of varying types (distance, ordering), identified by investigating distributions of groups of elements and their relationships against other groups of elements and their relationships (andrienko2021theoretical), so as to characterize a pattern finding phase of exploratory analysis.
Similarly, the rise of ‘Big Data’ as a fascination and challenge faced by industry has also driven increased interest in interactive analytics in database research, referring to approaches for optimizing query results for real time analysis by a human. These applications bring their own challenges (andrienko2020big, fisher2012interactions), such as minimizing latency while retaining acceptable accuracy. User interfaces have not always been central to these efforts, but how to deliver visualizations and interactions in these paradigms is gaining interest (alabi2016pfunk, fekete2016progressive, kim2015rapid, moritz2017trust, park2016visualization).
3 Exploratory analysis or rough confirmatory analysis?
That exploratory visual analysis systems support a diverse range of activities, from data diagnostics, to characterizing distributions and relationships, to looking to support a hypothesis, is acknowledged in the literature on interactive visual analysis (see battle2019characterizing for a review). Descriptive accounts also tend to acknowledge that analysis often alternates between open-ended tasks (e.g., flipping through filters looking for something interesting to explore a space of theories or models, i.e., abduction proper) and more focused exploration (e.g., trying to formulate and validate a hypothesis). However, recent analogizing of exploratory visual analysis to a multiple comparisons problem by computer scientists (pu2018garden, zgraggen2018investigating, zhao2017controlling) emphasizes what tukey1972data referred to as “rough confirmatory analysis.”
tukey1972data characterized analysis as beginning with an initial exploratory phase in which the analyst doesn’t consider probability, followed by an intermediate probabilistic stage in which the analyst attempts to answer the question, “With what accuracy are the appearances already found to be believed?”, followed by confirmatory testing. In the intermediate, rough confirmatory stage, an analyst seeks a coarse set of possible answers to the question of how accurate the apparent patterns are. He described how the appearances could be “so poorly defined that they can be forgotten,” or marginal (such that “crude analysis might not suffice and a more careful analysis is called for”), or well determined such that “we may, but more often do not, have grounds for a more careful analysis.” He stressed multiplicity as a key issue in this second stage (tukey1972exploratory, tukey1969, tukey1972data), including “How many things might have been looked at? How many had a real chance to be looked at? How should the multiplicity decided upon, in answer to these questions, affect the resulting confidence sets and significance levels?”
3.1 Empirical critiques of ignoring inference in exploratory visual analysis
An emerging line of critically-themed research in interactive visualization and analysis attempts to problematize a model-agnostic approach to designing software for visual analysis, implying that users of exploratory visual analysis tools frequently engage in rough confirmatory analysis. Much of this work remains speculative, suggesting by way of examples how different types of cognitive biases may arise in interactive analysis (dimara2018task, wall2017warning). However, a growing number of empirical studies are being used to argue about potential threats to valid inference from flexibility or design decisions in exploratory visual analysis.
For example, one recent line of research argues that by enabling the user to query more and faster, modern interactive systems for data analysis are particularly likely to result in a multiple comparisons problem. When using standard approaches to null hypothesis significance testing (NHST), a multiple comparisons problem arises because NHST admits a certain percentage of false positives by definition. Hence the more tests one does, the more false positive conclusions one might expect to arrive at. An implication made in some recent interactive analysis research is that if visual comparisons are analogous to significance testing, where a p-value is used to judge whether an effect can be ruled unlikely to be due to chance, as some statisticians have proposed (buja1999inference, buja2009, wickham2010), then those developing interactive analysis systems should introduce measures to control their potential to produce false discoveries.
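The arithmetic behind the multiple comparisons problem can be illustrated with a short sketch. It assumes that every comparison is an independent test of a true null hypothesis at level alpha, and that p-values are uniform under the null (as for a continuous test statistic). These are simplifications that real analysis sessions would rarely satisfy exactly. The function names are hypothetical, and the Bonferroni correction appears only as one familiar example of a control measure, not as the method used in the cited work.

```python
import random


def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent null tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_discoveries(num_tests, alpha=0.05, sessions=10_000, seed=1):
    """Fraction of simulated sessions with one or more false positives.

    Under a true null hypothesis a p-value is uniform on [0, 1], so each
    comparison is modeled as a uniform draw that falls below alpha by chance.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(sessions):
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / sessions


for m in (1, 5, 20, 50):
    analytic = familywise_error_rate(m)
    simulated = simulate_false_discoveries(m)
    bonferroni = familywise_error_rate(m, alpha=0.05 / m)
    print(f"{m:>2} comparisons: analytic={analytic:.3f} "
          f"simulated={simulated:.3f} with Bonferroni={bonferroni:.3f}")
```

Under these assumptions the chance of at least one spurious finding grows quickly with the number of comparisons, while dividing alpha by the number of comparisons keeps it at or below the nominal level.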
zhao2017controlling described how most people in a sample they studied who were looking at histograms of Census data in an analysis task treated patterns they saw as if they were reliable (“significant”) and didn’t consider how the number of comparisons they did inflated their chance of finding something interesting. zgraggen2018investigating estimated the severity of the multiple comparisons problem among 28 moderately experienced analysts, who used an interactive visual analysis tool to identify any reliable observations or recommendations as they assessed data samples they generated from a known ground truth population. The authors tracked each analyst’s total number of visual comparisons using a combination of experimenter questioning and eye tracking, and used statistical tests against the ground truth to determine the accuracy of each type of observation they saw (e.g., a comparison between two groups, a statement about the shape of a distribution, etc.). This led to an estimate that over 60% of the analysts’ conclusions were spurious.
Using a similar prompt asking analysts to report generalizations that could be made from exploratory visualizations, nguyen2020exploring investigated how plotting defaults in interactive visualization and business intelligence tools like Tableau Software may affect novice analysts, who tend to be least likely to know when or how to change from a default setting in software (shah2006policy). In two online experiments, they showed participants data samples using either disaggregated views, mean aggregated views, or disaggregated views with an overlaid mark showing the mean and asked what they might conclude, if anything, about a population. They found that those who used disaggregated views were less than one-fifth as likely to talk about effects without mentioning how big they are (e.g., “There’s no difference in sales between campaigns,” “Visitors from the midwest bought more”). These participants also reported lower confidence values by an average of 6 points on a 100 point scale, and the number of conclusions they drew was more sensitive to whether they were looking at 50 records or 1000 records.
These recent empirical studies have taken issue with the ambiguity of the concept of an “insight,” which is commonly used to characterize conclusions drawn from an interactive analysis session (rahman2020evaluating). This term has been defined in various ways, with one common definition being a “complex, deep, qualitative, unexpected, and relevant revelation” (north2006toward). While insights are often framed as being closer to confirmatory processes (sacha2014knowledge, for instance, distinguish between exploratory “findings” and more formalized verification loops that involve “hypothesis” and “insight”), rarely are degrees of belief in an insight discussed or elicited. A recent empirical study on professional analysts’ naturalistic insight generation with visualization tools found that only a handful mentioned that identifying an insight involves considering how confident one can be in it, echoing the insensitivity to probability in analysis conclusions described by the aforementioned studies.
Several recent critiques in the interactive visualization literature point to the absence of attempts to elicit or formalize the role of prior knowledge in interactive analysis studies or systems (kim2019, koonchanok2021data). Though research in visual analytics implies that prior knowledge plays a role in what one considers a finding or insight (federico2017role, lammarsch2011towards, mccurdy2018), few attempts have been made to integrate prior knowledge into visual analysis beyond allowing analysts to link text notes to views. For example, mccurdy2018 conclude from an empirical study of visual analysis by global health experts that the experts often mentally adjust the data they see to account for known “implicit” error, but the authors imply that such knowledge cannot be integrated directly with data representations. Cognitive psychologists have studied how experts’ prior knowledge and reasoning strategies lead them to interact with visualizations differently than novices (see, e.g., hegarty2004diagrams, trafton2000turning), yet the findings of this literature have not necessarily influenced the design of exploratory analysis tools. One recent exception is a Wizard-of-Oz study by choi2019concept, which explores how well users of an exploratory visualization tool can articulate conceptual and model-based expectations they bring to data based on their prior knowledge, finding that they frequently used visualizations to validate their expectations.
The weakness of evaluation methods for interactive visualizations and analysis tools has been another point of critique. Researchers who evaluate interactive visual analysis tools often use lower time spent on a task, or lower time required to answer a question, as desirable criteria, along with reported satisfaction through qualitative user feedback (see rahman2020evaluating for a review). This mirrors the evaluation of interactive visualization more broadly, where reliance on accuracy in reading data and on response time has motivated a long-running workshop entitled “Beyond Time and Error” (beliv). These measures are common even when the goal of an interactive visualization is framed as supporting reasoning under uncertainty (see hullman2018pursuit for a review), suggesting that researchers may not know how to define measures that would better capture inference or decision quality. As an alternative to time and error, some recent work implies that coverage (for example, how much of a dataset is explored in the course of an interactive analysis session) is synonymous with effective exploratory analysis (van2013small, wall2017warning, wongsuphasawat2015voyager, wongsuphasawat2017voyager). As a general criterion of good analysis, coverage conflicts with the possibility that an analyst has prior knowledge and intentions about the data they’ve collected, or that they may proceed from purely exploratory analysis to visually driven rough confirmatory analysis.
These and other critiques imply that inference is an important goal in visual analysis. While this might appear obvious to many readers, the idea that, as tukey1990data described, phenomena—referring to potentially interesting things that we can describe in non numerical terms—are what we typically want to learn about when we deal with data, has not been emphasized as much as the idea of immediate support for pattern finding in visualization and interactive analysis research. Such critiques motivate our proposal that research on supporting exploratory visual analysis should embrace theories of graphical inference. In the following section we propose an alternative understanding of exploratory visual analysis as guided by model checks, and describe possible formalizations of this theory.
4 A Bayesian theory of inference for interactive analysis
The microcosm of activities that comprise interactive analysis—from data diagnostics to theory exploration to rough to proper confirmatory analysis—may help explain why design philosophies behind the development of interactive analysis interfaces are hard to identify and at times seem in conflict. While we acknowledge the diversity of activities that occur in data analysis, we think that research aimed at developing better interfaces for exploratory analysis would benefit from a more formal approach to defining the mechanism behind human graphical inference during interactive analysis.
We motivate the need for a theoretical model as follows. Even if some activities fall outside of the predictions of any specific model, without an underlying theoretical framework to guide the design of tools, we are hard pressed to identify where our expectations have been proven wrong and can easily end up with the sort of piecemeal and mostly conceptual theories that dominate much of the literature on interactive analysis. This lack of formalization makes it difficult to falsify or derive clear design implications from theoretical work. For example, some work suggests that visual analysis is a process of fitting intuitive models (andrienko2018viewing, choi2019concept) or sensemaking under different forms of uncertainty (sacha2014knowledge). However, ambiguity in the underlying assumptions about the structure of an intuitive model and how it may evolve given a sequence of analysis operations renders these conceptions untestable.
An important distinction is that the value of proposing formal theories of graphical inference does not depend on those theories or the goals they imply being perfectly accurate. As researchers we may never be able to define what it means for an EDA process to be “optimal” or to perfectly predict human graphical inference in a given situation. However, part of our goal in improving interactive analysis interfaces should be to propose and evaluate theories of human inference that seem to describe many instances and that, many would agree, would be beneficial in many instances if they did describe human inference.
What might a formal theory to describe how an analyst responds to data during interactive analysis look like? If, as the literature on EDA versus CDA seems to suggest, it is difficult in practice to distinguish the two beyond the fact that CDA is a ‘final’ step of confirming one’s inferences about real world phenomena, then the theory of statistical inference should provide a useful prescriptive grounding for such a formalization. We motivate an understanding of interactive visual analysis as a process of implicit model checking, then formalize this idea in a Bayesian statistical framework. We discuss this understanding in comparison to other proposed theories of visual analysis.
4.1 Implicit model checking in interactions with data
At a high level, if EDA is understood to be discovery of the unexpected as is generally assumed, then this is defined relative to the expected. We note two practical implications of this duality:
Any exploratory graph should be interpretable as a model check, a comparison to ‘the expected.’ This implies that when constructing such graphs we should be able to figure out what model is being used as a basis of comparison. Sometimes, as with a residual plot (Figure 1d), this comparison is obvious; in fact many exploratory graphics themselves get their meaning from implicit reference distributions, from histograms inviting comparisons to bell curves (Figure 4) to CDF plots inviting comparisons to the diagonal. Other times we can gain insight by carefully considering what sort of model is being implicitly checked by a graph. For example, a trellis plot showing distributions of hours of sleep for sleep tracker users who report female versus male as their sex and who have and have not previously used sleep trackers (Figure 1c) can be interpreted as a check of, or exploration of discrepancies from, a linear model that predicts hours of sleep using the three other plotted variables, combined through additive and interaction terms.
Exploratory analysis can be made more effective by comparing to more sophisticated models. EDA is often thought of as an alternative to model-based statistical analysis, but once we think of graphs as comparisons to models, it makes sense that the amount we’ve learned increases with the complexity of the model being compared to. Effective graphics create visual structures that enable model inspection by foregrounding comparisons of interest in ways that exploit the abilities of the human visual system (bertin), such as to detect deviations from symmetry. Graphics are iterated on during exploratory data analysis to refine the visual comparison and/or increase the complexity of the model, such as by adding additional variables to trellis plots, or calculating derived fields to isolate effects while still relying on position encodings.
There is a corresponding argument in classical hypothesis testing or confirmatory data analysis, that more is learned from rejection of a complex model than from rejection of a trivial null model such as a hypothesis that all effects are exactly zero. In some ways, EDA is like an omnibus test in that we are open to all sorts of violations of the model, but with the difference that in exploratory analysis we are interested not so much in rejection as in the particularities of the discrepancies between model and data: rather than tailoring tests to particular alternatives, we rely on human pattern-finding abilities to motivate the development of future hypotheses. For example, in examining a trellis plot like Figure 3, an analyst might first implicitly scan for any panels that seem to deviate from the others to check if there is a main effect of market. Without necessarily realizing it, they might conduct a sort of ‘mental cross validation,’ fitting a linear model to subsets of the data and then comparing to the left-over panel each time. As a result, they identify a different trend in the panel for West, and might make further graphical optimizations to assess this observation, such as adding trend lines or sorting panels by slope.
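The ‘mental cross validation’ described above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the data are simulated, the market names other than West are hypothetical, and a straight-line fit scored by held-out root mean squared error stands in for whatever implicit model and discrepancy measure an analyst actually uses.

```python
import numpy as np

def leave_one_panel_out(x, y, panel):
    """For each panel, fit a line to all other panels and score the held-out panel."""
    x, y, panel = np.asarray(x, dtype=float), np.asarray(y, dtype=float), np.asarray(panel)
    scores = {}
    for p in np.unique(panel):
        held_out = panel == p
        # Fit y = slope * x + intercept on every panel except the held-out one.
        slope, intercept = np.polyfit(x[~held_out], y[~held_out], deg=1)
        residuals = y[held_out] - (slope * x[held_out] + intercept)
        # Root mean squared discrepancy between the held-out panel and the fit.
        scores[str(p)] = float(np.sqrt(np.mean(residuals ** 2)))
    return scores

# Hypothetical data: four market panels, one of which (West) trends differently.
rng = np.random.default_rng(3)
markets = np.repeat(["Central", "East", "South", "West"], 30)
month = np.tile(np.arange(30), 4).astype(float)
true_slope = np.where(markets == "West", -1.0, 1.0)
sales = 10 + true_slope * month + rng.normal(0, 2, size=month.size)

scores = leave_one_panel_out(month, sales, markets)
flagged = max(scores, key=scores.get)  # panel least consistent with the others
```

Sorting panels by these scores loosely parallels the graphical refinement mentioned above, where an analyst sorts panels or adds trend lines to examine a suspected deviation.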
To summarize this view, rather than assuming that analysts using interactive analysis software look for patterns only in a non-probabilistic mode, we instead conceive of them as developing and updating ‘pseudo-statistical’ models that help them make inferences about real-world phenomena. In contrast to statistical models an analyst might explicitly specify and test, we call these models pseudo-statistical because while they may be approximated statistically, they may deviate from what is generally defined as rational inference. For instance, they may be mentally represented in ways that deviate from a proper statistical model (e.g., at times neglecting probability information so as to explore a space of possible theories or explanations for data), and they may not be updated as predicted by a standard (i.e., Bayesian) model of belief updating. By real-world phenomena, we mean a referent for an observation made in data analysis that exists outside the numbers or strings that comprise the dataset. This might be a measurement process, as in identifying errors in data collection, or a data generation process, as in trying to ascertain explanations of variability or skew. These phenomena might be evaluated in a past, present, or future tense. Below we make this proposal more concrete by formalizing it in a Bayesian framework, then talk about its implications for the design of software.
4.2 A Bayesian formulation of graphical inference as model checking
Our proposal above is informed by a formulation of graphical inference and exploratory analysis as analogous to a “model check”: a comparison of data to replicated data under a model, previously proposed by gelman2003, gelman2004exploratory. Put simply, the model checking formulation says that in viewing graphics, the user imagines data produced by a process that seems reasonable to them, and compares these imagined data to the observed data plotted in the graph.
More formally, assuming a parameter(s) of interest $\theta$, the model checking formulation expands the notion of a posterior distribution in Bayesian inference from $p(\theta \mid y)$ to $p(y^{rep}, \theta \mid y)$. In the formulation of gelman1996posterior, $y^{rep}$ is a replicated dataset with the same size and shape as the observed dataset $y$, but produced by a hypothesized model that accounts for what is known about $\theta$. All model checks, whether exploratory (driven by graphical comparisons) or confirmatory (driven by $p$-values), represent comparisons between $y$ and $y^{rep}$. Visualizations, whether real or imagined, can be thought of as visual test statistics ($T(y)$ and $T(y^{rep})$); in other words they play the role of summaries that capture the amount of signal in the data (observed or imagined).
To make this concrete, consider an analyst doing initial checks of distributions after loading data in a tool that offers interactive data transformation and visualization. They plot multiple quantitative variables of interest as histograms, and then inspect each to judge its distribution. The analyst might naturally assess the degree of symmetry in each, implicitly comparing what they see to imagined normally distributed data from censored or non-censored distributions centered on the location they perceive in the plotted data. They might naturally attend to data that deviate from their expectations of tail behavior for the implicit distributions, or alternate between comparisons to different implicit reference distributions (e.g., unimodal versus mixture) to judge distribution shape. For example, an analyst might notice that the observed distribution of 1000 sleep tracking app users deviates from their expectations of a Gaussian distribution with similar location and variance because fewer people sleep more hours per night than would be expected (Figure 4). The visual ‘test statistics’ that the analyst perceives might be subjected to a discrepancy function, producing something like an implicit $p$-value to be judged against the analyst’s internal criteria for when ‘enough’ evidence exists for a claim. Depending on the outcome, these model checks might be followed by the analyst seeking more information about the data collection process to determine the cause of perceived errors, by the use of statistical summaries or diagnostic tests of shape if the analyst plans to do confirmatory testing down the road, or simply by moving on to bivariate comparisons with more assurance that they understand outliers, skew, or other properties of the variables.
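As a rough sketch of what an explicit version of this implicit check could look like, the code below compares a skewness statistic on observed data with the same statistic on replicated datasets. Everything here is an assumption made for illustration: the data are simulated, sample skewness stands in for the perceived ‘visual test statistic,’ and the reference model is a normal distribution with a standard noninformative prior, which is only one of many models an analyst might have in mind.

```python
import numpy as np

def skewness(x):
    """Sample skewness, a stand-in for a visual test statistic of asymmetry."""
    centered = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

def posterior_predictive_check(y, n_rep=2000, seed=0):
    """Compare observed skewness with skewness of data replicated under a normal model."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n, ybar, s2 = y.size, y.mean(), y.var(ddof=1)
    # Posterior draws under the noninformative prior:
    # sigma^2 is scaled inverse chi-square, mu given sigma^2 is normal.
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_rep)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    # One replicated dataset per posterior draw, same size as the observed data.
    y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(n_rep, n))
    t_rep = np.array([skewness(row) for row in y_rep])
    t_obs = skewness(y)
    # Share of replications at least as asymmetric as the observed data.
    p_value = float(np.mean(np.abs(t_rep) >= abs(t_obs)))
    return float(t_obs), p_value

# Hypothetical data: 1000 nightly sleep durations with a short right tail.
rng = np.random.default_rng(1)
hours = 9.0 - rng.gamma(shape=2.0, scale=0.8, size=1000)
t_obs, p_value = posterior_predictive_check(hours)
```

A small tail proportion indicates that the normal reference model rarely reproduces the observed asymmetry, which is the kind of discrepancy the analyst in the example is described as noticing by eye.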
In a Bayesian framework, the hypothesized model that produces the data in these imagined histograms ($y^{rep}$) is the posterior predictive distribution. This distribution can be viewed as a transformation of the posterior distribution from parameter space (i.e., in terms of $\theta$) to data space (i.e., in terms of the underlying measurements). In a Bayesian statistical paradigm, the posterior distribution is produced by updating a prior distribution by applying Bayes’ rule to a distribution that captures the likelihood of different values of $\theta$. The posterior predictive distribution is then calculated by marginalizing the distribution of $y^{rep}$, which we can think of as a newly drawn version of $y$ given $\theta$, over the posterior distribution of $\theta$ given $y$. Reflecting on this machinery, the Bayesian model check analogy suggests that an analyst’s judgments about patterns in data are influenced by perceived properties of the observed data (likelihood), and by extension the set of visual encodings through which the observed data are perceived; by the prior knowledge of the analyst, which can play a regularizing role by shifting observed differences between groups closer to one another, or the implicit posterior predictive distribution away from the location or scale of distributions inferred from the observed data; and finally, undoubtedly, by the analyst’s statistical experience, since the space of possible models they perceive will depend on their knowledge.
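For reference, the quantities described in this paragraph can be written out in standard notation. The symbols are assumed here for illustration: $\theta$ for the parameters, $y$ for the observed data, $y^{rep}$ for replicated data, and $T$ for a test statistic or visual summary.

```latex
\begin{align}
  p(\theta \mid y) &\propto p(\theta)\, p(y \mid \theta)
    && \text{(Bayes' rule: prior times likelihood)} \\
  p(y^{rep} \mid y) &= \int p(y^{rep} \mid \theta)\, p(\theta \mid y)\, d\theta
    && \text{(posterior predictive distribution)} \\
  p_{B} &= \Pr\!\left( T(y^{rep}) \ge T(y) \mid y \right)
    && \text{(posterior predictive $p$-value)}
\end{align}
```

The last line is the usual tail-area summary of a model check; in the graphical setting the comparison of $T(y)$ with $T(y^{rep})$ is made visually rather than reduced to a single number.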
Consider again the trellis plot in Figure 1c. As in most visualizations, the plot is intended for estimating more than one parameter, so $\theta$ is a vector, which might include slopes and intercepts for each combination of sex and sleep tracker, as well as more specific comparisons across fitness levels or hours of sleep within or across particular views. The analyst might perceive slightly different relationships (e.g., different slope directions) between hours of sleep and fitness level for females versus males. In eyeballing the plot to estimate intercepts, they might consider prior knowledge they have about the average difference in hours of sleep between males and females informed by research on sleep trends (e.g., burgard2013gender). We might compare the analyst’s perceptions to the predictions of a maximal (i.e., including all interactions) Bayesian regression model that accounts for this prior knowledge and places weakly informative priors on other variables.
In a Bayesian statistical workflow, visualization is also used to reason about the appropriateness of the prior, and to compare its predictions to the observed data (gabry2019, gelman2020bayesian). For example, an analyst might examine the difference in the observed distributions of hours of sleep for female and male sex (Figure 5) against draws from a prior predictive based on the prior research (burgard2013gender).
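A minimal sketch of such a prior predictive comparison follows. All numbers are hypothetical placeholders rather than values from the cited research, and the normal prior on the difference in group means is an assumption made for illustration.

```python
import numpy as np

def prior_predictive_differences(n_per_group, prior_mean, prior_sd, noise_sd,
                                 n_draws=1000, seed=0):
    """Draw replicated differences in group means implied by the prior alone."""
    rng = np.random.default_rng(seed)
    # Each draw first samples a 'true' difference from the prior ...
    true_diff = rng.normal(prior_mean, prior_sd, size=n_draws)
    # ... then adds the sampling error of a difference between two group means.
    standard_error = noise_sd * np.sqrt(2.0 / n_per_group)
    return rng.normal(true_diff, standard_error)

# Hypothetical prior: a difference of 0.2 hours between groups, give or take 0.1.
draws = prior_predictive_differences(n_per_group=500, prior_mean=0.2,
                                     prior_sd=0.1, noise_sd=1.2)
observed_diff = 0.55  # hypothetical observed difference in mean hours of sleep
share_as_large = float(np.mean(draws >= observed_diff))
```

Plotting `draws` alongside the observed difference would show whether the new data look surprising relative to what the prior alone predicts.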
4.3 Relationship to other models of graphical inference
Graphical inference as Bayesian cognition
In cognitive science, Bayesian models of cognition (griffiths2008, griffiths2012) have gained traction for modeling various forms of human cognition, including object perception (kersten2003), causal reasoning (steyvers2003), and knowledge generalization (tenenbaum2006). These models assume individual cognition relies on Bayesian inference: an individual’s implicit beliefs about the world are captured by a prior; when exposed to new information they update their prior according to Bayes’ rule, arriving at posterior beliefs. Recent work applies Bayesian models of cognition to how people draw inferences when shown visualized data, either eliciting their prior beliefs about a parameter (e.g., karduni2020, kim2019, kim2020bayesian) or endowing priors, showing them new data, and then eliciting their posterior beliefs to compare to normative Bayesian posterior beliefs from one or more models reflecting different ways that Bayesian updating could occur.
Bayesian cognition has been applied to visualization-based inference in a normative sense, where a Bayesian model is used to define ‘good belief updating’ as a standard for comparing to or guiding people’s belief updates from data (e.g., to evaluate different representations of uncertainty in data (kim2019) or guide belief updates (kim2020bayesian)). It has also been used in a more descriptive sense, in which observations of people’s belief updates are analyzed to gain insight into how human inference deviates (karduni2020, kim2019), ideally approached using principled tools for model evaluation and model selection (tauber2017bayesian). Toward both normative and descriptive applications, the mathematical basis of Bayesian inference has been used to calculate measures of graphical inference like perceived sample size, the size of the equivalent random sample that a Bayesian would have needed to see to arrive at the posterior beliefs expressed by a user (kim2019). Toward more descriptive ends, a researcher might attempt to model sources of deviation from normative updating based on factors other than the statistical informativeness of the data. For example, hierarchical models in which hyperpriors describe the bias a person expects from a given information source can be used to reflect on the forms and strength of distrust in data as a reason for deviation in some settings. Integrating the predictions of perceptual models like implicit logarithmic perception (Gonzalez1999, Hollands2000, Stevens1957, zhang2012) can help researchers separate cognitive and perceptual factors.
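To illustrate the idea of a perceived sample size, the sketch below assumes that a person’s elicited beliefs about a proportion are summarized by a mean and standard deviation and matched to Beta distributions. The function names and the elicited values are hypothetical, and this conjugate Beta-Binomial calculation is one simple way to operationalize the idea, not necessarily the computation used in the cited work.

```python
def beta_params(mean, sd):
    """Method-of-moments Beta(a, b) parameters for a belief about a proportion."""
    pseudo_n = mean * (1 - mean) / sd ** 2 - 1  # a + b, the belief's pseudo-sample size
    if pseudo_n <= 0:
        raise ValueError("sd is too large for a Beta distribution with this mean")
    return mean * pseudo_n, (1 - mean) * pseudo_n

def perceived_sample_size(prior, posterior):
    """Observations a Bayesian would need to move from the prior to the posterior."""
    a0, b0 = beta_params(*prior)
    a1, b1 = beta_params(*posterior)
    # Under Beta-Binomial updating, each observation adds one to a + b.
    return (a1 + b1) - (a0 + b0)

# Hypothetical elicited beliefs as (mean, sd) before and after viewing a chart.
n_perceived = perceived_sample_size(prior=(0.5, 0.1), posterior=(0.6, 0.05))
```

Comparing `n_perceived` with the number of observations actually shown gives a rough sense of whether a person updated more or less than the data warranted.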
How is a Bayesian cognition framework as applied to interactive visualization related to the Bayesian model check framework described above? Both theoretical frameworks rely on the generalizability of a Bayesian modeling framework for describing human inference. Both can be used descriptively or normatively. In many ways, their normative versions are complementary when considering applications to interactive analysis and visualization: Bayesian cognition emphasizes trying to achieve more rational updating in the context of a predefined model, while the model check formulation enables us to study what implicit models analysts seem to use to reason about observed data in light of their prior knowledge. Improving people’s behavior with graphs via Bayesian cognition should lead to more sound implicit models and reference distributions from the standpoint of information accumulation. What does it look like to use these complementary theoretical frameworks to improve people’s behavior? Beyond the utility of these frameworks for studying and characterizing behavior, can they also be incorporated into systems to improve behavior on the fly?
In particular, the common tendency people show toward under-updating as sample size grows, and over-updating as it shrinks (as described by a model of non-belief in the law of large numbers (benjamin2016)), has testable implications for analysis software. Specifically, conservatism in belief updating deriving from a bias like non-belief in the law of large numbers would suggest showing more frequent, smaller-N samples. Applied to a progressive computation or approximate query processing setting in which analysts are shown visualizations of partial query results on very large data, developers of interfaces might think twice about design strategies that provide an initial partial result and then only alert the user to check results when the queries are finished. Applied more broadly to systems for exploratory visual analysis, conservatism may mean that visualizations and interactions that guide the user toward partitioning data into smaller subsets or viewing multiple related visualizations at once, like trellis plots and visualization recommenders, are better for ensuring that analysts’ inferences are appropriately sensitive to sample size than visualizations that encode many variables in a single view. These and other implications of Bayesian cognition and Bayesian model checking may lead to ideas for how to improve systems in ways that would be hard to predict without the theory.
The primary challenge to integrating Bayesian cognition into the design and evaluation of interactive analysis tools is eliciting prior and posterior beliefs: the method used to elicit priors influences the results (see o2006 for a review), and it can be hard to evaluate whether one has elicited the right prior for a person. Validation approaches that present draws from the prior predictive distribution should help somewhat here. The prior elicitation process may also shift any natural inference process, for example by causing a person to dwell more on their beliefs than they otherwise would. This possibility raises questions, on which researchers and developers might reflect, about the role prior knowledge should have in the sorts of analysis workflows that GUI systems aim to support.
Graphical inference as null hypothesis testing
Some statisticians have proposed an analogy between graphical statistical inference and null hypothesis significance testing (gelman2003, wickham2010); buja2009 argue that discovering some insight using a visualization is akin to rejecting at least one assumption made under a null hypothesis. This understanding has led to several types of graphical tools.
The Rorschach method involves producing an array of “null” plots, visualizing data drawn from a null distribution that represents samples from a data generating process where no pattern exists. The idea is that looking at such plots can calibrate the eyes for sampling variation as one examines data. The lineups approach also relies on null plots generated in the same way, but produces an array of $n$ plots, where one of the plots is of an observed dataset $y$ and the other $n-1$ plots are null plots. If an analyst can identify which of the plots shows the observed data, they are said to have performed a visual test equivalent to a hypothesis test with type 1 error rate of $1/n$. The lineup is demonstrated in Figure 6e, which hides a set of monthly jobs estimates among nine null plots drawn from a model with no growth. Both approaches require the analyst to specify the null generating mechanism.
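The construction of a lineup can be sketched in a few lines. The sketch assumes a permutation null (shuffling the response to break any association with the predictor), which is only one possible null generating mechanism and differs from the no-growth model used in the figure; the function name and data are hypothetical.

```python
import numpy as np

def make_lineup(x, y, n_plots=20, seed=0):
    """Return n_plots (x, y) panels: one observed, the rest permutation nulls."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    true_position = int(rng.integers(n_plots))  # where the observed data are hidden
    panels = []
    for i in range(n_plots):
        if i == true_position:
            panels.append((x, y))
        else:
            # Shuffling y keeps its marginal distribution but removes any trend in x.
            panels.append((x, rng.permutation(y)))
    return panels, true_position

# Hypothetical data: 24 monthly values with a linear trend plus noise.
rng = np.random.default_rng(4)
months = np.arange(24)
values = 0.5 * months + rng.normal(0, 3, size=months.size)
panels, true_position = make_lineup(months, values, n_plots=20)
false_positive_rate = 1 / len(panels)  # chance of picking the real plot by luck
```

Each panel would be drawn with identical axes and encodings, and the viewer asked to pick the one that looks different. A Rorschach display follows the same recipe but omits the observed panel.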
The Bayesian perspective on graphical comparisons as model checks subsumes treating visual comparisons as null hypothesis significance tests as a special case. The two frameworks align in many ways: both focus on the importance of judging deviation from some model assumptions, with the lineup and Rorschach representing general techniques for implementing graphical model checks.
Of course, some null mechanisms might be naive or obviously false (kale2019adaptation), and having to specify the null mechanism adds a degree of freedom, so expecting inferences from lineups to be equivalent to doing an exact statistical test is problematic. We think it is unfortunate that lineups have been so strongly associated with statistical hypothesis testing, which may imply to non-statisticians like computer scientists that the technique is less about understanding deviation than it is about checking whether some difference is equal to zero, which is a priori implausible in many real-world scenarios. In their study of the multiple comparisons problem in exploratory analysis, zgraggen2018investigating identified each analyst’s explicit hypotheses (those stated by the participant) and implicit hypotheses (those not reported, but identified later in interviews or using eye tracking) to estimate how many of the conclusions they drew were false positives. Since eye tracking remains unrealistic to embed in real interactive analysis tools, other research by the authors proposes heuristics based on session logs to detect visual comparisons made while someone interacts with a visualization system (zhao2017controlling). For example, not every visualization is treated as a hypothesis test, but every visualization with a filter condition is treated as a test of the null hypothesis that the filter condition makes no difference compared to the distribution of the whole dataset. Such approaches imply a one-to-one mapping between graphical comparisons and statistical tests that is not well supported by the graphical inference literature. A graphical comparison supports the evaluation of many different “tests” simultaneously, some of which might not be well understood even by the analyst until they are violated. For example, even a simple two-dimensional scatterplot can lead to a number of visual judgments of properties associated with scagnostics (wilkinson2005graph), like clumpiness, monotonicity, or skewness.
Recent research on lineups implies in various ways that the analogy between examining a lineup and doing an unbiased statistical test is not so simple, because outcomes can depend on factors such as the visual acuity of users and the design of the visualization (vanderplas2015spatial). Lineups have been applied, for example, to diagnose problems of fit with hierarchical models (loy2017model) and to identify the “graphical power” of different visualizations for supporting pattern finding (hofmann2012graphical), both use cases that are well aligned with the idea of graphical judgments as model checks more broadly. Studying how people look at lineups to better understand graphical statistical inference has become its own line of research (beecham2016map, chowdhury2014utilizing, majumder2013, vanderplas2015spatial, zhao2013mind), providing some evidence for our view above that a good attempt at formalizing graphical inference can lead to better understanding of human visual inference and where it deviates from expectations.
Lineups can also be viewed as complementary to the Bayesian cognition approach summarized above, in that lineups fuse concepts from perceptual psychology like target identification and visual search with concepts from statistical modeling, while Bayesian cognition fuses concepts from cognitive psychology and behavioral economics like belief updating and revision with statistical modeling.
5 Implications for designing interactive analysis software
tukey1986sunset described how the development of expert systems helps address a challenge statisticians face in trying to teach statistical data analysis to the many people who need to use it. One reason is that “[o]ne just cannot build an expert system without thinking through a strategy,” hence designing a useful system prompts reflection on what a good strategy is. Another benefit is that a good expert system can be a way of teaching, such that “[u]sing each well-planned system will then give continuing education-especially when the user repeatedly asks the system, ‘Why did it choose to do that?’ After a while, some users will be ready for—nay will demand—more education, which we should by then be ready to furnish.”
There is an analogy between our vision for exploratory analysis software that more tightly integrates support for model-driven and probabilistic inferences and Tukey’s observation that expert systems require reflection on strategies and pave the way for greater education on the part of their users. We suspect that the prioritization of pattern finding in current GUI tools for exploratory analysis is only partially intentional. Researchers may have gravitated toward optimizing for perception over cognition because it seems less ambiguous, leaving theories of human graphical inference underexplored. Researchers and developers of modern GUI systems consequently put analysts in an environment where reflecting on data generating processes can be relatively difficult, making it harder for users to recognize where the patterns they perceive are tenuous, or even when they are making predictions versus exploring a space of theories. Attempts to falsify one’s beliefs or check one’s assumptions against the data undoubtedly occur with modern GUI tools for exploratory analysis, but how to do so is left to the analyst to figure out.
While we doubt that all users of visual analysis software assume implicit distributions when they judge “signal” in graphics, or even that very experienced users always consider distributions during visual analysis, evidence of the use of superficial visual heuristics for estimating effect size or pattern “significance” is not hard to find in our own and others’ research (conti2005attacking, hofman2020visualizing, kale2020visual, nguyen2020exploring). The question worth considering for researchers and developers of interactive visual analysis tools is how software might encourage more robust inferences beyond making view creation for pattern finding easier. This question presents an opportunity for thinking differently about how GUI EDA can support analysts’ natural processes, but also a significant design challenge, since more integration with modeling may introduce the possibility of confusion or misuses like overfitting.
5.1 Design requirements and future directions
At a high level, if model-driven inference underlies exploratory analysis, then systems should be capable of representing data generating processes. There are several functional implications of this.
First, software should support and encourage the use of robust representations of uncertainty whenever inference may be a goal. Above we discuss plotting observations rather than aggregations by default as one simple way software design can prioritize variation and uncertainty. However, implementing non-parametric bootstrapping as the basis for plotted data would take this a step further, potentially helping calibrate analysts to consider uncertainty by default while avoiding the need to train analysts on how to think about confidence intervals. How to provide the analyst control while encouraging them to contend with variation and uncertainty whenever patterns are being taken as “findings” is a question to be tackled in interaction design, and may require some diversity of strategies to be built in to GUI tools.
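As one hedged illustration of bootstrapping as the basis for plotted data, the sketch below resamples a column with replacement and returns the distribution of its mean, which a tool could draw as an ensemble or interval instead of a single aggregate mark. The function name, data, and the choice of the mean as the plotted summary are assumptions for illustration.

```python
import numpy as np

def bootstrap_means(values, n_boot=1000, seed=0):
    """Non-parametric bootstrap: means of n_boot resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row of idx indexes one resample of the same size as the original data.
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    return values[idx].mean(axis=1)

# Hypothetical column: hours of sleep for 200 users.
rng = np.random.default_rng(2)
sample = rng.normal(7.0, 1.0, size=200)
boot = bootstrap_means(sample)
interval = np.percentile(boot, [2.5, 97.5])  # percentile interval for the mean
```

Rendering a handful of the resampled means as faint marks, rather than only the interval, is one way a tool might convey variation without requiring the analyst to interpret a confidence interval.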
Second, analysts should have the ability to specify and see predictions from models of the data generating process that they wish to consider against the data. This idea is not entirely novel. Many widely used graphical user interface tools for data analysis, like Tableau, MS Excel, or Data Desk, to name just a few, do provide modeling tools in the form of built-in statistical tests and regression features. In fact, closest to our proposal, Tableau currently supports visualizing reference lines, bands, and empirical distributions (tableaureferencelines) as well as forecasting for time series data (tableauforecasting). However, these tools are intended primarily for confirmatory use or prediction based on the observed data, with little recourse to, for instance, customize based on a prior prediction or easily compare different possible models of a data generating process.
Also related to our vision of supporting rough model checking are existing tools like the lineup and Rorschach (wickham2010), both of which have been proposed for comparing observed data to predictions of a null model. As currently implemented in R these tools require the analyst to specify the null model programmatically. What might it look like if these tools were implemented in GUI systems for exploratory analysis?
Partly the difference is in emphasis. Systems could provide users with access to predictions of null models through built-in recommendations based on chart and data types. When visualizations include various distinct subsets of data, such as in trellis plots, the analyst could interactively select the data of interest. On the other side of the spectrum, analysts could see posterior predictive distributions from a model fit to data, using either a weakly informative or elicited prior, whenever visualizations of observed data alone seem ambiguous. When there is a clear source of prior information, for example in business applications where similar analyses are conducted periodically on the latest sales or marketing data, seeing model predictions along with the observed data could help the analyst better perceive what, if anything, has been learned from the new information.
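As one concrete instance of a null model a system might recommend for a scatterplot, the sketch below assembles the panels of a lineup: the observed (x, y) pairs are hidden at a random position among panels in which y has been shuffled, which breaks any association between the two variables. The permutation null, the function name `make_lineup`, and the example values are assumptions chosen for illustration; a GUI tool would render the panels rather than return them.

```python
import random


def make_lineup(x, y, n_panels=20, seed=0):
    """Return (panels, true_position) for a lineup with a permutation null.

    One panel holds the observed pairs; every other panel pairs x with a
    shuffled copy of y, representing "no association between x and y".
    """
    rng = random.Random(seed)
    true_position = rng.randrange(n_panels)
    panels = []
    for index in range(n_panels):
        if index == true_position:
            panels.append(list(zip(x, y)))
        else:
            shuffled = list(y)
            rng.shuffle(shuffled)
            panels.append(list(zip(x, shuffled)))
    return panels, true_position


# Made-up data with an obvious trend (hypothetical values).
x_values = [1, 2, 3, 4, 5, 6, 7, 8]
y_values = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
panels, true_position = make_lineup(x_values, y_values)
```

If the analyst can pick out the panel at `true_position` without being told where it is, the observed pattern is unlikely to be an artifact of the null model.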
A key activity toward realizing this aim that has not been well explored by prior research is determining what a ‘grammar’ for model recommendations should look like. At a minimum, such a grammar should include common distributional families like Gaussian, Beta/Binomial, Poisson, etc.; common transformations like taking a log; and support for simple additive and multiplicative models. Ideally there is a connection between the visualization structure and the model. The development of these tools should involve collaboration between computer scientists with expertise in designing interactions and accompanying abstractions and statisticians who bring expertise in robust statistical methods, such as models that minimize critical assumptions and that can be fit using analytical solutions.
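To make the minimal ingredients listed above more tangible, here is a purely hypothetical sketch of how a model specification in such a grammar might be represented: a distributional family, a transformation of the response, and a set of additive terms. The class and field names are invented for illustration and do not correspond to any existing system; a multiplicative model is represented here as an additive model on the log scale.

```python
from dataclasses import dataclass
from enum import Enum


class Family(Enum):
    GAUSSIAN = "gaussian"
    BINOMIAL = "binomial"
    POISSON = "poisson"


class Transform(Enum):
    IDENTITY = "identity"
    LOG = "log"


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical model specification built from a small set of parts."""
    response: str
    predictors: tuple = ()
    family: Family = Family.GAUSSIAN
    transform: Transform = Transform.IDENTITY

    def formula(self):
        """Render the specification as a formula-style string."""
        if self.transform is Transform.LOG:
            left = f"log({self.response})"
        else:
            left = self.response
        right = " + ".join(self.predictors) if self.predictors else "1"
        return f"{left} ~ {right}"


# Additive model, and a multiplicative model expressed on the log scale.
additive = ModelSpec("sales", ("region", "month"))
multiplicative = ModelSpec("sales", ("region", "month"), transform=Transform.LOG)
```

A specification like `additive` could be generated from the fields an analyst has placed in a view and then edited, which is one way to connect the visualization structure to the model.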
One direction for future research is to work toward developing a computational engine for automated model specification via user interactions with variables, like dragging to shelves. Just as Tableau’s underlying table algebra (stolte2002polaris) creates algebraic expressions from user-selected data fields, where fields are operands operated on by combination functions like cross, nest, and concatenate to drive visual specifications, a model generating engine could automatically compute plausible regression models based on data characteristics, allow the analyst to access them as needed, and subsequently let the analyst customize the model specifications by providing prior beliefs or varying assumptions. An engine might include precomputation of null alternatives as well. For example, showing draws from a model with a non-informative or weakly informative prior fit to observed county data in a choropleth map depicting rates helps the analyst account for the impact of sample size in their inferences (correll2016surprise). Draws from a model that assumes all regions have the same rate, which might be set to the national average or some other baseline value that the analyst decides, may also be useful. By using draws to represent the expected amount of sampling error at the sample size in each region, such models go beyond the canonical null model that is naive to population density. For greater flexibility, users who wish to target the modeling toward only some data in a view could directly select a subset by interacting with the chart, then right click to further parameterize the automatically computed model. Feedback in terms of predictions from the model should be immediate and reversible, and the analyst should be able to control when and how they are shown (e.g., through animated draws (hullman2015hypothetical), static ensembles, or continuous representations, and either superimposed in the view or juxtaposed in separate views as in a trellis plot).
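The same-rate null model described above can be sketched in a few lines. Under the assumption that every region shares one baseline rate, each draw simulates a binomial count for every region and converts it to a rate, so that regions with few people vary much more from draw to draw than regions with many. The region names, population sizes, and baseline rate below are made up, and the helper names are hypothetical.

```python
import random


def draw_null_rates(populations, baseline_rate, n_draws=20, seed=0):
    """Simulate regional rates under a null where all regions share one rate.

    Each draw gives every region a Binomial(population, baseline_rate) count
    divided by its population.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        draw = {}
        for region, population in populations.items():
            cases = sum(rng.random() < baseline_rate for _ in range(population))
            draw[region] = cases / population
        draws.append(draw)
    return draws


def rate_range(draws, region):
    """Spread of one region's simulated rate across all draws."""
    rates = [draw[region] for draw in draws]
    return max(rates) - min(rates)


# Hypothetical regions; the baseline could be a national average.
populations = {"small_county": 50, "mid_county": 2000, "large_county": 20000}
null_draws = draw_null_rates(populations, baseline_rate=0.10)
```

Each element of `null_draws` could color one simulated map. Comparing `rate_range(null_draws, "small_county")` with the value for `"large_county"` shows how much apparent variation in a choropleth can arise from sample size alone.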
The Bayesian view of graphical examination as posterior predictive check motivates making it easier in tools for users to articulate prior distributions over parameters. Again in the interest of avoiding interrupting workflows to dive into code, the user could ideally ‘sketch’ a prior graphically, then draw from it and see these draws along with the observed data. Thought should be put into how to do this, of course, given that different elicitation interfaces can lead to different strategies for formulating priors and to different amounts of ‘noise’ introduced by the interface (kim2019, sarma2020prior).
Finally, on the flip side of the argument we make here, where confirmatory tools are currently integrated in GUI visualization tools, they should default to graphical model checks alongside or as a substitute for the typical tables of model results and fitness summaries. Animated hypothetical outcomes (hullman2015hypothetical) can be a useful tool here too.
5.2 Design challenges
As a theory, by implying that software should allow and even encourage users to make reference distributions explicit, the model check formulation opens up room for considerably more complexity in GUI interfaces for exploratory analysis. This includes the formidable challenge of developing a grammar of flexible yet robust model specifications and ensuring that models can be fit when users customize them. Naturally there will be kinks in this process, making it important to invest in exploring various ways to give feedback to the analyst during model specification and exploration.
One risk is that the additional cognitive load of interacting with reference distributions overwhelms some users, distracting them from paying as much attention to the data as they might have. A concerted user-centered design approach toward prototyping and evaluating the use of different interaction designs for modeling tools seems necessary along with the development of a grammar of model components. These efforts should work toward guiding design principles, some of which may be similar to those used for mixed initiative interface design (horvitz1999principles), like allowing direct invocation and termination and providing dialogue to resolve key uncertainties rather than making guesses that a user may not realize were made.
Another risk is that adding support for reference distributions introduces new ‘failure modes’ based on misunderstandings of how the features are meant to be used, or failures of the system designers to anticipate what details of data structure will be critical to infer or elicit. In particular, if we expect to encounter trade-offs between how easily implementable a model is and how appropriate it is for the sorts of real world scenarios analysts bring to GUI tools, then we risk leading analysts to overrely on inappropriate models. If analysts begin to respond to the tools more than the data, the link between the models they use and their intuitive theories is weakened, which might lead to analyses that are less responsive to the data. Overfitting is another concern. More built-in functions for generating model-driven expectations from existing data may, if not designed carefully, exacerbate this issue. This possibility motivates design features for separating some data for testing when analysts want to check any hypotheses they arrive at. More broadly, a challenge is how EDA tools can encourage the use of modeling for building understanding of data generating processes, accounting for uncertainty, and making predictions that can be tested on future data, rather than for manufacturing provable insights. All of these challenges are formidable, but without trying to push the horizon of what the average GUI tool offers in terms of support for model checking, it is difficult to say how limiting they will be. Again, iterative user-centered design in the development of these tools should help identify how and why failures can occur early in the design process.
At a higher level, we expect the potential complexity of supporting reference model comparisons in GUI analysis tools may seem unwelcome to many who have long accepted the model-free or ‘leave it all implicit’ philosophy in analysis system design. Certainly these activities should be approached cautiously; visual analytics software has been criticized for unwarranted complexity in the form of overloading an interface with available functions (hegarty2010visweek). However, rather than implying that complex modeling must accompany all graphical examinations, we think the model check theory is better seen as an opportunity for those developing systems to reflect more deeply on plotting or other strategies analysts may currently use to help them make model-driven judgments. This allows developers to target modeling tools to cases where the status quo of implicit model checking with graphics is most likely to lead to vague or erroneous understandings of the data on the part of the analyst, since here there is more to gain. In particular, more novice users of powerful graphical analysis tools might be most prone to failing to realize where their attempts to find patterns are tenuous or how they could benefit from thinking about the underlying process producing data.
6 Comparing human to automated statistics
Beyond the design implications of a Bayesian model check formulation, there are various testable expectations about human graphical inference that could lead to a better understanding of how people do exploratory visual analysis.
For example, how ubiquitous is conservatism in belief updating, as suggested by recent work in behavioral economics (e.g., benjamin2016) and visualization (e.g., kim2019)? What predicts the use of non-probabilistic pattern detection activities for theory or model exploration versus implicit graphical model checking in an analysis session? And how do changes to graphical representations (such as showing animated bootstrap replicates) affect visual analysis? Some of this work will naturally help researchers see what can’t be well identified about intuitive graphical inference. For example, to faithfully infer implicit reference distributions from experiments on human graphical inference could require more modeling structure of correlations than would be reasonable to assume in viewers’ conscious reasoning processes. But the act of trying to specify users’ processes statistically leaves us with a better understanding of what we can presume about exploratory analysis versus what remains subject to conjecture.
We think the research trend toward studying how data analysts use existing GUI systems (e.g., battle2019characterizing) is well motivated and much more could be done under the umbrella of deepening a theoretical foundation. For example, what do different classes of inference look like in expert use of visual analysis tools? How do analysts use interactive graphics when considering different causal hypotheses, and can we learn anything about human graphical causal inference by comparing to models of causal support from mathematical psychology (tenenbaum2001structure)?
Finally, beyond using theories to produce testable statements about how humans do analysis and to stimulate new design ideas, considering how we might remove humans from the analysis process entirely may also paradoxically help us find ways to improve interactive analysis. In other words, how could an Artificial Intelligence do statistics? We use this question as a thought exercise for further reflecting on the types of knowledge and strategies that come into play during interactive analysis.
In the old-fashioned view of Bayesian data analysis as inference-within-a-supermodel, it’s simple enough to imagine an AI replacing a person: it simply runs some equivalent to a probabilistic program to learn from the data and make predictions as necessary. But in a modern view of statistical practice—iterating the steps of model-building, inference-within-a-model, and model-checking—it’s not quite as clear how the AI works. By taking what currently seems vague and framing it computationally, we might discover useful regularities or patterns in human statistical workflows.
To fix ideas, we shall discuss Bayesian data analysis, which can be idealized by dividing it into the following three steps (gelman2013bda):
1. Setting up a full probability model: a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.
2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution, that is, the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.
3. Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.
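The three steps can be walked through on a deliberately small example. The sketch below assumes a sequence of binary outcomes modeled as independent draws with a common probability, a uniform Beta(1, 1) prior, and a check based on the number of switches between 0 and 1, a statistic that is sensitive to dependence the model does not include. The data, the prior, and the function names are illustrative assumptions rather than part of the workflow described in (gelman2013bda).

```python
import random


def count_switches(sequence):
    """Number of adjacent positions where the outcome changes."""
    return sum(a != b for a, b in zip(sequence, sequence[1:]))


def posterior_beta(prior_a, prior_b, sequence):
    """Step 2: the Beta prior updated by the counts of ones and zeros."""
    ones = sum(sequence)
    return prior_a + ones, prior_b + len(sequence) - ones


def replicated_switches(a, b, length, n_draws=2000, seed=0):
    """Step 3: switch counts in datasets simulated from the fitted model."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_draws):
        theta = rng.betavariate(a, b)  # draw a probability from the posterior
        replicate = [1 if rng.random() < theta else 0 for _ in range(length)]
        results.append(count_switches(replicate))
    return results


# Step 1: independent Bernoulli(theta) outcomes, theta ~ Beta(1, 1) (assumed).
y = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
a, b = posterior_beta(1, 1, y)
switches_rep = replicated_switches(a, b, len(y))
observed = count_switches(y)
tail_prob = sum(t <= observed for t in switches_rep) / len(switches_rep)
```

A small `tail_prob` indicates that the observed sequence switches less often than replicated sequences typically do, which would suggest expanding the model in step 1 to allow for dependence between successive outcomes.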
Currently, human involvement is needed in all three steps listed above, but in different amounts:
1. Setting up the model involves a mix of look-up and creativity. We typically pick from some conventional menu of models (linear regressions, generalized linear models, survival analysis, Gaussian processes, splines, trees, and so forth). Machine learning toolboxes and probabilistic programming languages such as Stan enable putting these pieces together in unlimited ways, with similar expressiveness to how we formulate paragraphs by putting together words and sentences. Right now, a lot of human effort is needed to set up models in real problems, but we could imagine an automatic process that constructs models from parts.
2. Inference given the model is the most nearly automated part of data analysis. Model-fitting programs still need a bit of hand-holding for anything but the simplest problems, but it seems reasonable to assume that the scope of the “self-driving inference program” will gradually increase. For example, for thirty years we have been able to automatically monitor the convergence of iterative simulations (gelman1992). With the no-U-turn sampler, a recursive algorithm builds a set of likely candidate points spanning a wide swath of the target distribution and stops when it starts to retrace its steps, thus avoiding the need to tune the number of steps in Hamiltonian Monte Carlo (hoffman2014).
3. The third step (identifying model misfit and, in response, figuring out how to improve the model) is likely the toughest part to automate. We often learn of model problems through open-ended exploratory data analysis, where we look at data to find unexpected patterns and compare inferences to our statistical experience and subject-matter knowledge. Indeed, a primary piece of advice we espouse to statisticians is to integrate that knowledge into statistical analysis, both in the form of formal prior distributions and in a willingness to carefully interrogate the implications of fitted models.
By considering how to fully automate all three steps, we can identify some ways to improve interactive software. The space of model parts we deem necessary to support step 1, for example, should directly guide the types of built-in options that interactive analysis tools offer an analyst to specify their implicit models. When it comes to step 2, inference within the model, we might try to build in automatic checks (for example, based on adaptive fake-data simulations) to flag problems with fitting a specified model when they appear. This could help us think about how users might note immediate problems with an implicit model as they examine graphics.
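A much simplified, non-adaptive version of a fake-data check is sketched below. Data are simulated repeatedly from known parameter values, a fitting procedure is run on each simulated dataset, and the share of intervals that cover the known value is compared with the nominal level. The fitting procedure here is only a normal-approximation interval for a mean, standing in for whatever model an analyst specified; the function names, the tolerance used for the flag, and the parameter values are all assumptions.

```python
import random
import statistics


def fit_mean_interval(data, z=1.96):
    """Stand-in 'fit': normal-approximation 95% interval for the mean."""
    mean = statistics.fmean(data)
    standard_error = statistics.stdev(data) / len(data) ** 0.5
    return mean - z * standard_error, mean + z * standard_error


def fake_data_coverage(true_mean, true_sd, n, n_sims=500, seed=0):
    """Share of simulated datasets whose interval covers the known mean."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_sims):
        fake = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = fit_mean_interval(fake)
        covered += low <= true_mean <= high
    return covered / n_sims


coverage = fake_data_coverage(true_mean=5.0, true_sd=2.0, n=50)
# Flag the fit if coverage is far from the nominal 95% (tolerance is assumed).
needs_attention = abs(coverage - 0.95) > 0.05
```

In an interactive tool, a flag like `needs_attention` could surface as a warning next to the model's predictions instead of requiring the analyst to run the simulation themselves.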
How would an AI do step 3? The AutoML approach to model evaluation typically involves choosing a preferred loss function to minimize, e.g., generalization error on held-out data, estimated using standard procedures like cross validation. But human model checking often combines model fitness measures with more qualitative assessments of how well model predictions align with domain knowledge. One approach closer to human model checking is to simulate the human in the loop by explicitly building a model-checking module that takes the fitted model, uses it to make all sorts of predictions, and then checks this against some database of subject-matter information such as a knowledge graph. This is one avenue for attempting to mimic the Aha process behind concepts like insight that drives scientific revolutions. Trying to construct this would undoubtedly require deeper inquiry into how humans check model fit, and might lead to ideas for building interactive systems, like making it easy for analysts to scan through many predictions from their models or transform them into different measures to ask “does this look right.”
All that is left, then, is the idea of a separate module that identifies problems with model fit based on comparisons of model inferences to data and prior information. It’s less clear how techniques from AI and ML research should be combined to do that; this may be the hardest part of the pipeline to remove humans from the loop. However, by attempting to combine existing technologies we are likely to learn more about how to think about humans doing model checks, which might also feed new interface optimizations.
Data visualization and exploratory data analysis can be seen as a form of model checking, with the goal of revealing the unexpected beyond what is already in a model of the world. We propose a research program that pursues a tighter integration between models, graphics, and data querying, motivated by a view of interactive analysis as a process of users comparing intuitive pseudo-statistical models to data via model checks. A Bayesian model check formulation of exploratory visual analysis makes clearer what types of interactive features would help in phases of analysis that resemble rough CDA, like built-in reference distributions, robust uncertainty representations, and features to encourage analysts to recognize the links between assumed models and priors and graphical structures.
Our proposal of Bayesian model checking as a way to unify thinking about different phases of analysis calls for more thoughtful integration of graphical inference techniques that researchers are already proposing into visual analysis systems. This includes innovations in uncertainty visualization like animated and static frequency-framed depictions of distributions and variations on graphical inference techniques like lineups and graphical elicitation of priors or predictions from users, to name a few. While it’s natural to expect that novel techniques will not necessarily immediately make their way into mainstream tools, we suspect that a relative lack of theory around what visual analysis should be stands in the way of graphical inference tools becoming a standard part of the visual analysis toolkit.
Beyond stimulating new ideas for designing interfaces and new ways of evaluating them, conceiving of interactive analysis as checks against pseudo-statistical mental models pushes us toward identifying testable implications of different formalizations of this process. Without a theoretical foundation, fields like interactive visualization and visual analytics can become trapped in a problem-solving orientation to system development. In such an orientation, researchers may be more likely to continually chase the next application area to design for than to consider ways the design of an interactive system might help users recognize their own goals and limitations. A good theoretical framework of modeling feeds a process in which we learn from the ways that people’s behavior deviates from model predictions and continually revise our aims as researchers and developers. This process is not completely absent in the status quo approach: current efficiency-oriented evaluations of interactive systems can also help researchers realize when their intuitions are wrong. The point is that it’s likely to be less direct and more error prone than if a more formal, normative model were available, similar to how it is inefficient to learn from only the yes/no answers of null hypothesis significance tests.
Our argument about the potential consequences of prioritizing pattern exposure in creating tools for interactive EDA should not be construed as saying that exposing raw data is generally bad in analysis contexts. In contexts like communicating statistical results, showing the data or properties of the raw data can be very useful for providing information about effect size, especially in light of many readers’ tendencies to overestimate effects (hofman2020visualizing). And, as system developers and researchers, we face many daunting challenges that existing pattern-focused tools address well. For example, making huge datasets interactive at all requires a number of database and visualization-based optimizations. Similarly, many of the interactive analysis innovations we’ve surveyed, such as recommendations based on graphical (e.g., wongsuphasawat2015voyager, wongsuphasawat2017voyager) or statistical features (e.g., lee2018case, vartak2015s), have an important role to play in reducing the many manual efforts required to do interactive analysis. However, we think the field of interactive data analysis could better achieve its goals of transforming how people interact with data if such innovations were guided by theories of inference. This is not to suggest that this task will be easy, as there is much still to learn about how to gently introduce modeling capabilities without interrupting an analyst’s flow, and about what users of different profiles do when given more advanced modeling tools and asked to specify their expectations.
Our argument above has focused primarily on analysis applications involving abstract data, where standard statistical graphics are the norm. In some other applications of interactive analysis and scientific visualization, users may have a harder time expressing their implicit models. For example, doctors might want to search for clinical features in large databases of medical imagery to help them in making diagnoses. When experts’ implicit models are based in recognizing visual-spatial signatures, it may be harder to elicit them, or at least require very different interfaces than we propose here. However, the fact that interactive interfaces are moving toward eliciting more input from domain experts like doctors to facilitate their work, even when their implicit models are hard to formally represent, suggests some parallels despite the different assumptions that can be made about the data (cai2019human).
Finally, a good model of intuitive graphical inference is likely to have implications for communicative use of interactive and static visualization as well. We have used the Bayesian model checking formulation to theorize about the role played by uncertainty communication in communicative visualization, for example (hullman2019). Visualizations are sometimes described as storytelling devices. The connection here is that stories can themselves be viewed as model checks or as explorations of anomalies, with the “twist” in a good story corresponding to a confounding of expectations (gelman2014stories). Putting these together suggests that designers and readers should consider visualizations with respect to the expectations or default narratives they overturn. A good model of inference can help us see the similarities between more than one pair of seemingly opposed activities.