1. Introduction
We consciously make decisions under uncertainty in everyday situations: dressing appropriately for the weather, leaving sufficiently early to catch the bus, planning for retirement, or interpreting medical test results (Kim et al., 2017). Most of us have never been formally taught to process statistical or probabilistic data (Sedlmeier and Gigerenzer, 2001) and have difficulty reasoning about such data (Koller and Friedman, 2009; Tversky and Kahneman, 1974; Díaz and Inmaculada, 2007). Inferring conditional probabilities in particular is a challenging cognitive process that requires a deep statistical understanding (Lambert, 2018) and can have dire consequences if done incorrectly (Gigerenzer, 2003).
Data-driven tools that make predictions about the future in support of human decisions are inherently uncertain. Yet, although they may have an internal representation of uncertainty, it is often hidden from the user (Fernandes et al., 2018). On the one hand, communicating uncertainty is challenging (Spiegelhalter et al., 2011) and can be more confusing than helpful (Greis et al., 2017). On the other hand, it has been shown that uncertainty displays can increase trust in a system (Kay et al., 2013), improve understanding of the data-generating process (Hullman et al., 2015; Joslyn and LeClerc, 2013; Kay et al., 2016a), help align users’ mental model with the computational model (Kim et al., 2017), explicitly reduce human biases (Tsai et al., 2011), and improve decision-making (Fernandes et al., 2018; Kay et al., 2016a).
Beyond day-to-day decisions, a large portion of scientific research is concerned with solving inverse problems of estimating unobservable parameters from noisy data and quantifying the uncertainty in those estimates. Some common data science practices have been criticised (Kay et al., 2016b) for being a unidirectional, statistician-controlled process: starting with a dataset that is publicly available or has been collected in advance, models are learned and compared, and a subset of evaluation data is presented via static numerical and visual representations. There is limited scope for users to probe whether a representation accurately captures domain knowledge, sensitivity to specific data points, uncertainty in predictions, or alternative explanations of the same observations. The ability to explore through interaction becomes increasingly important as models form the basis for human decision-making as well as our scientific understanding.
Bayesian probabilistic models have received significant attention as they capture uncertainty in the data, latent variables, and predictive posteriors, and domain knowledge can be incorporated through priors (Bayes and Price, 1763). Significant progress has been made in Monte Carlo-based inference (Spiegelhalter and Rice, 2009), variational methods, and deep learning-based probabilistic models such as Generative Adversarial Networks (Goodfellow et al., 2014) and invertible neural networks (Rezende and Mohamed, 2016). The software stack supporting this work has also matured with public libraries such as TensorFlow Probability, PyMC (Salvatier et al., 2016), Pyro, Stan and WebPPL. Due to their ability to provide calibrated estimates of uncertainty in the unobserved model parameters, Bayesian models are used widely in the scientific community to solve inverse problems, including estimating factors influencing wildfires (Silva et al., 2015) and the effect of non-medical interventions in epidemiology (Flaxman et al., ).
Interactive tools and visualisations for Bayesian models have the potential to let users explore otherwise opaque alternative explanations for observed data, similar to Explorable Multiverse Analysis (Dragicevic et al., 2019), and support cognitively difficult processes such as conditioning (Tsai et al., 2011; Taka et al., 2020; Micallef et al., 2012).
While novel uncertainty visualisations, Bayesian model visualisations, and interactive visualisations are being proposed continually, rigorous evaluations are lacking for some and are narrow in application focus or query type for others (Hullman et al., 2015; Kay et al., 2016a; Cole, 1989). Overall, the diversity in evaluation protocols employed across studies makes it difficult to compare results, replicate experiments (Micallef et al., 2012), and synthesise generalizable conclusions through meta-reviews (Hullman et al., 2019). Studies that do provide a quantitative user study evaluating Bayesian model visualisations, and visualisations of uncertainty more generally, differ in aspects such as the examined behavioral target, the subset of (Bayesian model) query types, the way user responses are elicited, and the chosen cost function applied to suboptimal responses, among other factors (Hullman et al., 2019). As a consequence, the field’s progress is difficult to gauge, and some previously reported effects are irreproducible (Micallef et al., 2012). A robust, standardised evaluation framework to quantify the effect of visualisation on Bayesian model interpretation (comprehension, predictability) and on decision-making (rationality) would provide several benefits: data and evaluation results could be aggregated across individual studies, building larger and more diverse samples than any individual study could feasibly collect; results obtained with different visualisation types could be more easily benchmarked, akin to benchmarking new machine learning algorithms on public datasets; and barriers for new researchers to enter the field would be reduced by providing clear, specific guidelines on how to evaluate new combinations of models and visualisations.
As a contribution to the ongoing methodological renaissance (Nelson et al., 2018) in human-subject research addressing the recent replication crisis, this paper aims to further the community discussion of evaluation methodology and
2. Related Work
2.1. Bayesian probabilistic models
We briefly introduce the notation for Bayesian statistics used throughout this paper. A Bayesian probabilistic model factorises the joint distribution $p(x, \theta)$ of observable random variables $x$ and unobserved (latent) random variables $\theta$ into a prior distribution $p(\theta)$ representing epistemological uncertainty and a likelihood $p(x \mid \theta)$ representing aleatory uncertainty (1). Usually, $x$ corresponds to data one can collect and $\theta$ represents the parameters of a statistical model. Bayes’ Theorem expresses how one should rationally update one’s posterior belief about latent variables in light of observations (2). The posterior predictive distribution defines what one should rationally believe about future events $x^{*}$ given past experience (3). The integrals in (2) and (3) are often intractable and are therefore either approximated using Markov Chain Monte Carlo or variational inference, or avoided in likelihood-free methods like GANs. These methods have in common that the distributions $p(\theta \mid x)$ and $p(x^{*} \mid x)$ can be represented by sets of samples. The prior, prior predictive, posterior, and posterior predictive distributions can be high-dimensional with complex covariance structure, and we currently lack tools for decision-makers and domain experts with potentially little statistical background to interact with them effectively.
(1) $p(x, \theta) = p(x \mid \theta)\, p(\theta)$
(2) $p(\theta \mid x) = \dfrac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta')\, p(\theta')\, \mathrm{d}\theta'}$
(3) $p(x^{*} \mid x) = \int p(x^{*} \mid \theta)\, p(\theta \mid x)\, \mathrm{d}\theta$
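As a minimal sketch of the update expressed by Bayes’ theorem (2) for a single binary latent variable, the following Python snippet computes a posterior from a base rate and a diagnostic test’s sensitivity and specificity. The function name and the numerical values are illustrative assumptions and are not taken from any study cited here.

```python
def posterior_positive(base_rate: float, sensitivity: float, specificity: float) -> float:
    """Posterior probability of the latent condition given a positive test."""
    # Likelihood of a positive result under each latent state.
    p_pos_given_condition = sensitivity
    p_pos_given_no_condition = 1.0 - specificity
    # Evidence: marginal probability of a positive result (denominator of Bayes' theorem).
    evidence = (base_rate * p_pos_given_condition
                + (1.0 - base_rate) * p_pos_given_no_condition)
    return base_rate * p_pos_given_condition / evidence


# Hypothetical values: 1% base rate, 80% sensitivity, 90.4% specificity.
p = posterior_positive(0.01, 0.80, 0.904)
print(f"{p:.3f}")  # prints 0.078
```

With these assumed values the posterior stays below 8% despite the positive result, because the low base rate dominates the update.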
2.2. Bayesian model visualisations
A widely studied problem in the literature on Bayesian decision-making is commonly known as the mammography problem (Eddy, 1982), in which participants are asked to estimate the probability of a hypothetical patient having breast cancer, given a positive mammography result and estimates for the base rate and the test’s specificity and sensitivity. In this example, the test result is an observed variable and the presence of cancer is a latent variable. Solving this problem requires participants to apply Bayes’ theorem (2). Eddy (1982) found that 95 out of 100 physicians estimate the wrong quantity and, as a result, greatly overestimate the probability of the presence of cancer. Cole (1989) observed improved performance where participants used visual aids, i.e. contingency tables, signal detection curves, detection bars or probability maps, when solving this problem. A large-scale online crowdsourced replication study (Micallef et al., 2012) comparing area-proportional Euler diagrams, glyph representations, and hybrid diagrams combining both was unable to confirm previously observed benefits of presenting a visual representation alongside text, only observing increased performance over the text-only condition when numerical information was removed from the associated text. While it is still an open research question how best to present visual information in this particular problem, it is only representative of a small subset of queries one can answer with probabilistic models. How accurately would participants solve a related problem in which base rates and test errors were uncertain, test outcomes were continuous or multidimensional, or the question was rephrased to ask which latent state (cancer or no cancer) is most likely given the test result, or how confident they are that no cancer was most likely? In this paper, we map the space of possible queries one can answer with probabilistic models and argue that more research is needed to explore how visualisations can support users across the whole query space.
Inspiration can be drawn from generic visualisations of uncertainty such as boxplots, density plots, Quantile Dot Plots (Kay et al., 2016a), and Hypothetical Outcome Plots (HOPs) (Hullman et al., 2015). It is unclear how well these approaches scale up to the large number of potentially high-dimensional variables commonly found in Bayesian models. One particular challenge is how to communicate complex correlation structure across variables when only marginal distributions are displayed. Taka et al. (2020) review available visualisation tools for Bayesian analysis. Animation may help by sequentially showing, for example, individual draws from the joint distribution (Hullman et al., 2015). Interactive primitives (Taka et al., 2020) could also help with exploratory data analysis. In this paper, we test the hypotheses that animation or interactivity could help with Bayesian decision-making in a user study that also illustrates the analysis workflow for the proposed evaluation protocol.
2.3. Evaluating uncertainty visualisations
In a recent review of evaluation protocols in uncertainty visualisation, Hullman et al. (2019) highlighted the diversity in design decisions made by different authors, and encouraged a less ad hoc approach to study design and more explicit reporting of design decisions. Specifically, they distinguish design decisions at six different levels, including the behavioral target (e.g., performance or quality of user experience), the expected effect (e.g., accuracy or satisfaction), the evaluation goal (e.g., comparison of different visualisations or understanding why a visualisation works well), measures of users’ behavior (e.g., a decision or self-reported satisfaction), elicitation (how is user behavior captured?), and analysis. With different decisions made at each of these levels affecting the conclusions drawn from individual studies, we argue that there is a need for the community to converge on a small subset of possible evaluation pathways to foster reproducibility, support benchmarking, and enable meta-studies. Breslav et al. (2014) proposed a software system that records users’ micro-interactions when interacting with uncertainty visualisations to help designers analyse user behavior. In this paper, we propose one evaluation pathway for evaluating Bayesian model visualisations with respect to users’ rationality in decision-making and users’ comprehension of the uncertain information represented by the model. We specifically follow recommendations in (Hullman et al., 2019) to use decision frameworks for realism and control by measuring the quality of decisions using rational choice theory, to calibrate the user by providing feedback with regard to the decision quality and accuracy of non-decision responses, and to support uncertain responses by designing the MultiBet input element, which allows participants to spread bets across multiple outcomes where committing to a single decision is difficult.
We aim to encourage reuse of this specific pathway by contributing a software framework that can be readily reused by the research community.
3. Evaluation Protocol
In presenting the evaluation protocol and subsequent user study we follow the taxonomy proposed in (Hullman et al., 2019). In this Section, we discuss behavioral targets, measures of user behavior and elicitation techniques as they are proposed to be fixed for evaluating user rationality and comprehension of Bayesian models and are implemented in the proposed software framework. The expected effects, evaluation goals, and analysis of a specific user study using this protocol are detailed in Section 5.
3.1. Query Space for Bayesian Models
Prior to specifying behavioral targets, it is useful to explore the space of possible questions users could answer with the help of a Bayesian model. In Section 2.1 we described Bayesian models as factorized representations of the prior and posterior joint distributions, inviting questions about both prior and posterior distributions in the presence of observed data. In both cases, Bayes’ theorem also provides a solution for conditioning on a subset of variables. Their factorization suggests another dimension in the space of possible questions, specifying whether a question relates to observable or latent variables. For each variable, the model holds information about multiple quantities: the identity (or index) of a specific variable, what values a variable could take on, and how confident the model is that a variable might take on a particular value.
Evaluations of Bayesian model visualisations can therefore explore a high-dimensional space of user questions, formed by the outer product of what the model can tell us:
in the presence or absence of data?
when incorporating additional information about a subset of variables?
about observable and latent variables?
about a variable’s identity, its value, or the model’s confidence in a variable’s value?
| Query | Quantity |
| --- | --- |
| What return do you expect at least from asset $a$ with confidence $c$? | value |
| Which observed return would make it most likely that asset $a$ generated it? | value |
| Which asset would you buy to maximize expected investment return? | id |
| Which asset is most likely to return at least $r$? | id |
| How confident are you that asset $a$ will return at least $r$? | conf. |
| How confident are you that an observed return of $r$ was most likely generated by asset $a$? | conf. |
Table 1: Example queries at different points along the Observability and Quantity dimensions of the probabilistic model query space. Each query corresponds to a numerical objective, which enables quantitative assessment of users’ responses.
To illustrate the diversity of possible model queries, we consider the problem of investing a fixed sum of money for a fixed duration into one of a discrete set of assets. Asset returns could be represented with a Bayesian model that specifies the realised return as an observable variable and asset classes as latent variables. Example queries representing different points in the query space along the Observability and Quantity dimensions are shown in Table 1.
We propose this mapping as a tool to support systematic design of user tasks for quantitative user studies and as a communication tool to support research reproducibility. A user study aiming to quantify the effectiveness of a universal visualisation tool for Bayesian models should include user tasks that uniformly cover the query space along each of the described dimensions. Different user groups such as domain experts, statistical model builders, and decision makers may only be interested in a smaller query subspace corresponding to their contribution to the Bayesian workflow (Betancourt, 2018), and evaluation of specialized visualisation tools may focus on user tasks that cover the most relevant subspace for their user group. This subspace could be determined, for example, through an exploratory study with participants from this group that systematically probes the relevance of queries along each of the query space dimensions. Similarly, a visualisation tool that supports a specialized application such as prior elicitation should be evaluated specifically on user tasks querying the probability and value of observable and latent variables prior to observing any data.
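The query space described in this section can be enumerated mechanically when designing user tasks. The following Python sketch builds the outer product of the four dimensions; the dimension names follow the text, whereas the level labels and the function name are illustrative assumptions.

```python
from itertools import product

# Dimension names follow the text; the level labels are illustrative assumptions.
QUERY_SPACE = {
    "data": ("prior", "posterior"),
    "conditioning": ("unconditioned", "conditioned"),
    "observability": ("observable", "latent"),
    "quantity": ("identity", "value", "confidence"),
}


def enumerate_queries(space=QUERY_SPACE):
    """Return one dict per cell of the outer product of all dimensions."""
    names = list(space)
    return [dict(zip(names, levels)) for levels in product(*space.values())]


cells = enumerate_queries()
print(len(cells))  # 2 * 2 * 2 * 3 = 24 cells
```

A study targeting a specific user group could filter this list down to the relevant subspace, for example keeping only the cells with "data" equal to "prior" when evaluating a prior elicitation tool.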
3.2. Behavioral Targets and Measures
In evaluating Bayesian model visualisations we seek to explore their effect on users’ performance, specifically on users’ ability to make optimal decisions and on users’ understanding of the model. We apply rational choice theory to evaluate decision making and thus refer to the optimality of users’ decisions as rationality and their level of understanding as model comprehension, respectively.
Rational choice theory assumes that user preferences can be expressed as scalar utility functions $U$ over all options $a$ of a choice set $A$. If $U$ is known, the value of a decision can be quantified as the expected utility under the model (4).
(4) $\mathrm{EU}(a) = \mathbb{E}_{p(x, \theta)}\left[ U(a; x, \theta) \right]$
In practice, the utility function is often intrinsic, possibly unique to each user, and not immediately accessible for quantitative evaluation. An extrinsic utility function could be provided as part of the user task in the form of hypothetical betting odds, for example. Participants’ incentives could be aligned with an extrinsic utility by asking participants explicitly to maximise utility or by tying reward mechanisms and a portion of participant compensation to expected utility.
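To make the notion of an extrinsic utility concrete, the following Python sketch estimates the expected utility of a bet allocation from Monte Carlo draws of the outcome. Treating utility as the payout under fixed betting odds is an assumption made for illustration, as are all names and numbers in the snippet.

```python
from collections import Counter


def expected_utility(bets, odds, outcome_samples):
    """Monte Carlo estimate of the expected payout of a bet allocation.

    bets: dict mapping option -> stake placed on that option.
    odds: dict mapping option -> payout per unit stake if that option wins.
    outcome_samples: list of winning options drawn from the model.
    """
    counts = Counter(outcome_samples)
    n = len(outcome_samples)
    # Each stake pays out with the estimated probability of its option winning.
    return sum(stake * odds[option] * counts[option] / n
               for option, stake in bets.items())


# Hypothetical model draws and betting odds.
samples = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
odds = {"A": 1.5, "B": 3.0, "C": 5.0}

all_on_a = expected_utility({"A": 1.0}, odds, samples)
all_on_c = expected_utility({"C": 1.0}, odds, samples)
print(all_on_a, all_on_c)
```

Under this assumed payout rule, a response can be scored by comparing its expected utility with that of the best available allocation.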
Model comprehension can be quantified by comparing users’ estimates of a modeled quantity to model estimates. Each numerical estimate provided by the user (of confidence, of a variable’s value, or of a variable’s identity) entails a probability distribution $q$. Its divergence from the probability distribution $p^{*}$ entailed by the optimal response can be measured using the Kullback-Leibler (KL) divergence (5), or a symmetric variant: $D_{\mathrm{KL}}(q \,\|\, p^{*}) + D_{\mathrm{KL}}(p^{*} \,\|\, q)$. The KL divergence is zero for an optimal response and increases with relative entropy. It has been used previously to evaluate model comprehension in the context of uncertainty visualisations, where users were asked to graphically predict the distributions of events (Hullman et al., 2018). Micallef et al. (2012) proposed the absolute bias as an objective function, which penalizes larger errors more gracefully, whereas Tsai et al. (2011) used a hard binary criterion.
(5) $D_{\mathrm{KL}}(q \,\|\, p^{*}) = \sum_{z} q(z) \log \dfrac{q(z)}{p^{*}(z)}$
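A minimal Python sketch of the comprehension objective for discrete distributions follows. The direction of the divergence (user response first, optimal response second), the small smoothing constant, and the function names are assumptions made for illustration.

```python
import math


def kl_divergence(q, p, eps=1e-12):
    """D_KL(q || p) for two discrete distributions of equal length.

    eps is an assumed smoothing constant that avoids division by zero
    where p assigns no mass; terms with q[i] == 0 contribute nothing.
    """
    return sum(qi * math.log((qi + eps) / (pi + eps))
               for qi, pi in zip(q, p) if qi > 0)


def symmetric_kl(q, p):
    """Symmetric variant: sum of the divergences in both directions."""
    return kl_divergence(q, p) + kl_divergence(p, q)


optimal = [0.6, 0.3, 0.1]   # hypothetical distribution entailed by the optimal response
response = [0.4, 0.4, 0.2]  # hypothetical distribution entailed by a user response

print(kl_divergence(response, optimal), symmetric_kl(response, optimal))
```

The divergence is zero when the response matches the optimal distribution and grows as the two distributions move apart.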
Equipped with objective functions that measure rationality as the similarity between user decisions and optimal decisions (4), and that measure comprehension as the divergence between distributions entailed by user responses and optimal responses (5), we can quantitatively evaluate the effect of model visualisations. By sampling user responses to tasks that cover the relevant query subspace across a range of models under one or more conditions we can quantify how close user responses are to the optimal response, and how significant differences of responses are, comparing user groups, visualisation types, and user behavior models.
3.3. Elicitation
Eliciting quantitative user responses to model queries requires associating an appropriate input widget with each query Quantity. Users are familiar with sliders to input continuous and ordinal values, making this type of input widget appropriate for queries related to user confidence or degree of belief, and potentially queries related to the value of a variable. The identity of an observed or a latent variable, such as the investment asset in Table 1, is neither continuous nor ordinal in the general case. Commonly used discrete choice input widgets include radio buttons, list selectors, and dropdown selectors. The visual response of radio buttons to user input bears an abstract similarity to placing a betting chip on specific outcomes, akin to playing the casino game of Roulette. Reinforcing this cognitive association through the choice of input widget could help communicate the effect of a user’s response, particularly in the context of tasks evaluating rationality.
User tasks for evaluating model comprehension can be readily formulated in such a way that the value of a variable and one’s confidence in that value are probed independently. Consider, for example, the questions “What return do you expect at least from asset $a$ with confidence $c$?” and “How confident are you that asset $a$ will return at least $r$?”. For rationality tasks, this is not the case, as the question of confidence relates to a decision, represented by one’s input choice, instead of a property of the model. This could be resolved, for example, by simultaneously asking users to input their decision through a radio button list and to indicate their confidence in their decision with a slider. A joint response of a decision and confidence level, however, cannot be evaluated quantitatively using expected utility, as it only specifies the probability (density) of the action distribution at a single point, leaving the distribution underspecified. Uncertain inputs for ordinal values (Greis et al., 2017) in comprehension tasks pose a similar issue for evaluation, despite potentially being more intuitive to use. We therefore propose a new input widget that allows users to spread their bets in discrete increments across the choice set.
The MultiBet input generalizes the radio button list to incremental bets on a set of $N$ options (see Figure 2). It consists of $N$ columns with $M$ radio buttons each. Its state $s = (s_1, \dots, s_N)$ with $\sum_i s_i \le M$ represents the number of metaphorical chips placed on each option. In each column $i$, the bottommost $s_i$ radio buttons are displayed as selected, akin to a stack of chips placed on top of the bottom edge of the widget. The next $M - \sum_j s_j$ radio buttons above in each column are displayed as deselected, representing the remaining unused budget of chips. All other radio buttons are displayed as disabled, indicating that a bet of more than $s_i + M - \sum_j s_j$ chips on option $i$ is invalid under state $s$. A click on a deselected radio button in column $i$ increases $s_i$ to the corresponding number of chips and disables additional radio buttons in all columns accordingly. Conversely, a click on a selected radio button in column $i$ decreases $s_i$ to the corresponding number of chips and enables additional radio buttons in all columns. Note that these interactions update $s_i$ by variable amounts, depending on the specific radio button clicked.
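The widget logic described above can be sketched as a small state machine. The Python class below is an illustration under stated assumptions and is not the framework’s React implementation: the class and method names are hypothetical, rows are counted from 1 at the bottom of a column, and a click on a selected button is assumed to remove the clicked chip together with all chips above it, so that a stack can be emptied.

```python
from dataclasses import dataclass, field


@dataclass
class MultiBet:
    n_options: int  # number of columns
    budget: int     # radio buttons per column, i.e. total number of chips
    chips: list = field(default_factory=list)

    def __post_init__(self):
        if not self.chips:
            self.chips = [0] * self.n_options

    def remaining(self) -> int:
        """Chips not yet placed on any option."""
        return self.budget - sum(self.chips)

    def button_state(self, column: int, row: int) -> str:
        """Display state of the button at `row` (1 = bottom) in `column`."""
        if row <= self.chips[column]:
            return "selected"
        if row <= self.chips[column] + self.remaining():
            return "deselected"
        return "disabled"

    def click(self, column: int, row: int) -> None:
        state = self.button_state(column, row)
        if state == "deselected":
            self.chips[column] = row      # raise the stack up to the clicked row
        elif state == "selected":
            self.chips[column] = row - 1  # assumption: drop the clicked chip and those above
        # Clicks on disabled buttons are ignored.


widget = MultiBet(n_options=3, budget=5)
widget.click(0, 3)   # place three chips on the first option
widget.click(1, 2)   # place the remaining two chips on the second option
print(widget.chips, widget.remaining())
```

Normalising the chip counts by the budget yields the discrete distribution over options entailed by the response.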
3.4. Individual Tasks and Sequencing
Each user study task consists of a standardised set of interface elements. The Query specifies the question a user is asked to answer by interpreting and potentially interacting with a Model Visualisation. Any prerequisites for understanding the query and interpreting the model’s visual representation are specified in a task Context. An Answer Input element, such as a slider or a MultiBet widget, elicits a quantitative user response. An Acknowledgement input element such as a standard button widget affords finalizing the user response. Optionally, a Feedback element communicates aspects associated with the optimality of the submitted response, such as a decision’s expected utility, back to the user.
A comprehensive user study following the proposed evaluation protocol comprises all tasks from the outer product of query types, visualisation types, example models, and levels of task difficulty. This can quickly lead to an unacceptably high completion time per participant: participants of the user study in Section 5 completed 24 tasks in 50 minutes, excluding briefing and debriefing. To minimise learning effects as a confounder in user responses, task ordering should be randomized across participants. Participants may perceive a heightened cognitive load due to frequent context switches, which may impact their responses if they feel under pressure or are not provided with the opportunity to take breaks when needed.
4. Software Framework
The evaluation protocol described in Section 3 was implemented as a web service and is freely available online (submitted as supplementary material for double-blind review). The server is implemented using the Python micro web framework Flask (https://flask.palletsprojects.com/en/1.1.x/). The client is implemented in JavaScript using React (https://reactjs.org) for interactivity, Redux (https://redux.js.org) for state management, and D3 (https://d3js.org) for model visualisations and user feedback. This architecture was chosen to enable remote studies, potentially run online, and to minimise restrictions on the choice of server operating system or client devices. The server provides a REST API with a unique endpoint for each user study. Study management, participant management, participant response logging, and task progress tracking are implemented using SQLite3 for persistence. Participants subscribe to a study on the client, upon which a random user ID is generated on the server and added to the study, together with a newly generated randomized task order. At a study- and user-specific endpoint, data for the first uncompleted task can be requested. The corresponding user response and other client-side logs are submitted at the same endpoint; they are written to the database, which enables the data for the next task to be requested.
Randomized task orderings for each user are generated from a JSON-formatted study template, which contains a nested TaskList whose elements are TaskLists, MergeLists or Tasks. TaskLists specify whether the order of list elements in the template should be preserved or randomized for each user. Nested TaskLists can express complex ordering requirements, including a fixed ordered introduction followed by randomized tasks followed by a fixed ordered debriefing, and randomized orders of fixed-order sequences. Tasks define the task-specific information for all interface elements identified in Section 3.4 (Context, Query, Model Visualisation, Answer Input, and Feedback), and metadata pertaining to the probabilistic model associated with the task. MergeList is a convenience type that specifies two lists of partial tasks from which the outer product is generated; this facilitates the specification of multiple repetitions of tasks where, for example, only the visualisation type or the associated probabilistic model differ across repetitions.
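The expansion of a nested study template into a per-user task order can be sketched as a short recursive function. The Python snippet below is an illustration only: the key names ("type", "children", "randomize", "left", "right", "spec") are hypothetical and do not reproduce the framework’s actual template schema.

```python
import random


def expand(node, rng):
    """Flatten a template node into an ordered list of task dicts."""
    kind = node["type"]
    if kind == "Task":
        return [dict(node["spec"])]
    if kind == "MergeList":
        # Outer product of two lists of partial task specifications.
        tasks = [{**a, **b} for a in node["left"] for b in node["right"]]
        if node.get("randomize", False):
            rng.shuffle(tasks)
        return tasks
    if kind == "TaskList":
        # Each child expands to a block; shuffling whole blocks keeps
        # fixed-order sub-sequences intact.
        blocks = [expand(child, rng) for child in node["children"]]
        if node.get("randomize", False):
            rng.shuffle(blocks)
        return [task for block in blocks for task in block]
    raise ValueError(f"unknown node type: {kind}")


# Hypothetical template: fixed introduction, randomized tasks, fixed debriefing.
template = {
    "type": "TaskList", "randomize": False, "children": [
        {"type": "Task", "spec": {"query": "introduction"}},
        {"type": "MergeList", "randomize": True,
         "left": [{"query": "value"}, {"query": "confidence"}],
         "right": [{"visualisation": "boxplot"}, {"visualisation": "hops"}]},
        {"type": "Task", "spec": {"query": "debriefing"}},
    ],
}

order = expand(template, random.Random(42))
print([(t["query"], t.get("visualisation")) for t in order])
```

Passing a separately seeded random generator per participant gives each user an independent but reproducible task order.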
The task specification sent by the server is used by the client to populate a template user interface (UI) layout (see Figure 1). The template specifies a two-column layout with task Context, Query, Answer Input, and Acknowledgement displayed on the left, and the Model Visualisation displayed on the right. With interface elements implemented as React components, different Answer Input types and Model Visualisation types are created dynamically to match the current task’s specification.
The prior or posterior joint distribution of observable and latent variables, represented as Monte Carlo samples, is transmitted separately as a binary blob. A Generator object exposes operations on this distribution, such as estimating sample statistics or the marginal (cumulative) density. Model Visualisations interface with the Generator and dynamically update their view using D3 to represent properties of the probabilistic model. A Boxplot component, for example, accesses the Generator’s estimates of the median, interquartile range and outlier samples for all its relevant variables and dynamically adapts its view to this data. One of the advantages of Bayesian models is the ability to answer what-if style queries expressed by conditioning the joint distribution on additional information. For example, expanding on the set of queries in Table 1, one can ask “Which asset is most likely to return at least $r$ if asset $a$ was to lose in value (to return less than a threshold $r_0$)?” by conditioning the predictive distribution on this event and inspecting the resulting conditional distribution. Depending on the specificity of the condition, there may only be few Monte Carlo samples in the support of the conditional distribution, providing a poor approximation of the density. Requesting additional Monte Carlo samples from the server would introduce significant latency, rendering interactive exploration of this class of queries impractical. Instead, we implement bootstrapping on the client by resampling a fixed number of MCMC samples with replacement from the support. This approach reduces bias in the local density estimate while keeping latency at a minimum and avoiding the need for closed-loop communication between the client and a server-side Bayesian inference engine.
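Conditioning on samples followed by bootstrapping can be sketched in a few lines. The Python snippet below is a simplified illustration rather than the framework’s client-side JavaScript: the function names, the toy two-asset joint distribution, and the threshold are all assumptions.

```python
import random


def condition(samples, predicate):
    """Keep only the joint samples that satisfy the condition."""
    return [s for s in samples if predicate(s)]


def bootstrap(support, n, rng):
    """Resample n draws with replacement from the conditioned samples."""
    if not support:
        raise ValueError("no samples satisfy the condition")
    return rng.choices(support, k=n)


rng = random.Random(0)

# Hypothetical joint samples of two correlated asset returns.
joint = []
for _ in range(5000):
    shared = rng.gauss(0.0, 1.0)
    joint.append({"a": shared + rng.gauss(0.0, 0.5),
                  "b": 0.5 * shared + rng.gauss(0.0, 0.5)})

# A narrow condition leaves only a small fraction of the samples.
support = condition(joint, lambda s: s["a"] < -2.0)
resampled = bootstrap(support, 1000, rng)
mean_b = sum(s["b"] for s in resampled) / len(resampled)
print(len(support), round(mean_b, 2))
```

The resampled set has a fixed size regardless of how restrictive the condition is, so downstream views such as boxplots can be updated without a round trip to the server.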
The Generator is also used to evaluate user responses with respect to expected utility or their KL divergence from the optimal response, to provide optional low-latency user feedback. An example of feedback provided to the user immediately after their response was acknowledged is shown in Figure 3. Here, the probability distribution entailed by the user response is visually contrasted with the same distribution under the optimal response as per the model, and a numerical Reward is given, where 10 corresponds to the optimal response. This is done to make it easy and consistent for the user to interpret feedback, and to incentivize users to optimize their response with respect to the evaluation objective.
All client-side state changes, induced by user interaction or by client-server communication, are implemented as Redux actions. A Redux logger middleware caches all actions locally together with a timestamp. Logs of actions associated with each task are sent together with the user response, the estimate of the objective function, and the reward as a JSON object when the user acknowledges their response, and stored server-side as a string in the SQLite3 database.
Implementing a new user study with this software framework involves providing a study template that specifies task metadata and task order constraints, and optionally implementing new React Components for Model Visualisations or Answer Inputs. The modular client-side implementation facilitates seamless integration of new UI components, developed by the authors or other researchers, in production web services. In making the software framework that implements the proposed quantitative evaluation protocol publicly available, we hope to reduce the barriers for researchers to evaluate novel visualisations with users, and facilitate research into human perception of uncertainty.
5. Case Study: Interactive Boxplots and HOPs
This Section reports on a case study that aims to quantitatively evaluate two research questions: do animated or interactive visualisations of uncertainty in Bayesian models improve rationality and comprehension over their static and non-interactive equivalents? We apply the proposed evaluation protocol to gather quantitative data that let us answer these questions with a user study for a specific set of static, animated, non-interactive and interactive visualisations.
Static boxplots, error bar charts and confidence intervals have been shown to be regularly misinterpreted (Belia et al., 2005), whereas animated Hypothetical Outcome Plots have shown promising results in detecting trends from noisy data (Hullman et al., 2015). Interactivity has been shown to help visualise high-dimensional data by revealing pertinent low-dimensional views through user-guided projection (Faith, 2007), align users’ cognitive models with computational models (Kim et al., 2017), and help answer queries requiring Bayesian inference (Tsai et al., 2011).
5.1. Apparatus
Participants interacted with the client-side web interface described in Section 4 and illustrated in Fig. 1. Here, we describe the Bayesian model participants were asked to query and the Model Visualisations chosen to compare the effects of animation and interactivity on user responses.
5.1.1. A Bayesian model of peak and off-peak queuing delays
Our objectives for choosing a Bayesian model for this study were that (i) it models a sufficiently complex probability distribution to be representative of a real-world problem, (ii) it can be inferred from available real or synthetically generated data, (iii) the semantics are easily explained in the briefing session of a user study, and (iv) the risk of participants applying previously acquired domain knowledge in answering questions is minimised. We chose to adopt a textbook model from (McElreath, 2016), representing the queuing delay across a range of cafes when ordering a coffee, distinguishing between peak and off-peak periods. The data-generating process was modelled hierarchically, assuming that the average peak wait times and the differences between average peak and off-peak wait times in each cafe are drawn from a global joint distribution as in Eqns. (6)-(14). A synthetic model with the same structure and synthetically generated latent parameters was used to simulate a total of 160 observations (16 cafes × 2 periods × 5 visits). Hamiltonian Monte Carlo was used to generate 20,000 samples from the joint posterior distribution.
(6)  
(7)  
(8)  
(9)  
(10)  
(11)  
(12)  
(13)  
(14) 
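As an illustration of the simulation step described above, the following is a minimal sketch of how 160 synthetic observations with this hierarchical structure could be generated. It is not the study's implementation: the function name `simulate_cafe_waits`, all numerical parameter values, the bivariate normal form of the global joint distribution, and the Gaussian observation noise are assumptions made for illustration only.

```python
import numpy as np


def simulate_cafe_waits(n_cafes=16, n_visits=5, seed=0):
    """Simulate waits for n_cafes x 2 periods x n_visits observations."""
    rng = np.random.default_rng(seed)

    # Hypothetical global parameters (assumptions, not the study's values).
    mean_peak, mean_diff = 3.5, -1.0        # minutes
    sd_peak, sd_diff, rho = 1.0, 0.5, -0.7  # spread and correlation across cafes
    sd_obs = 0.5                            # assumed Gaussian observation noise

    # Per-cafe (average peak wait, off-peak minus peak difference), drawn jointly.
    cov = np.array([[sd_peak**2, rho * sd_peak * sd_diff],
                    [rho * sd_peak * sd_diff, sd_diff**2]])
    cafe_params = rng.multivariate_normal([mean_peak, mean_diff], cov, size=n_cafes)

    # One entry per visit: cafe index and an off-peak indicator (0 = peak).
    cafe = np.repeat(np.arange(n_cafes), 2 * n_visits)
    off_peak = np.tile(np.repeat([0, 1], n_visits), n_cafes)

    # Expected wait for each visit, then a noisy observation of it.
    mu = cafe_params[cafe, 0] + cafe_params[cafe, 1] * off_peak
    wait = rng.normal(mu, sd_obs)
    return cafe, off_peak, wait


cafe, off_peak, wait = simulate_cafe_waits()
print(wait.shape)  # (160,)
```

The negative correlation chosen here means that cafes with long peak waits tend to show larger reductions off-peak, which is one simple way to give the joint distribution a non-trivial covariance structure.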
5.1.2. Model Visualisations
The joint distribution of queuing delays for a subset of all cafes was represented to participants with one of four different Model Visualisations: Boxplots, Hypothetical Outcome Plots (HOPs), and interactive variants of each. Static boxplots represent the marginal distributions of latent variables and posteriors via aggregate statistics (median, interquartile range, outliers, etc.). This static view hides covariance and masks distributional anomalies, and it has been shown that boxplots can be difficult to interpret for users with limited training in statistics (Belia et al., 2005). In contrast, Hypothetical Outcome Plots have been shown to be more readily interpretable, particularly with respect to trends across multiple variables (Hullman et al., 2015). HOPs represent the joint distribution of latent variables and posteriors via animation, by iteratively displaying a single joint sample of all parameters drawn at random. Iterative sampling allows the user to build a mental model of the joint distribution by visually integrating depicted samples over time (time-multiplexing). Sampling gives some access to the model’s covariance. However, the user has no means to guide or constrain exploration towards subspaces of interest.
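The contrast between the two static and animated representations can be sketched in a few lines. The example below is an illustrative assumption, not the framework's code: `boxplot_stats` and `hop_frames` are hypothetical helper names, and the two correlated parameters stand in for real posterior samples.

```python
import numpy as np


def boxplot_stats(samples):
    """Per-parameter quartiles: the marginal summary a static boxplot shows."""
    return np.percentile(samples, [25, 50, 75], axis=0)


def hop_frames(samples, n_frames, seed=0):
    """Yield one joint sample (a full row) per animation frame."""
    rng = np.random.default_rng(seed)
    for i in rng.integers(0, samples.shape[0], size=n_frames):
        yield samples[i]


# Hypothetical posterior samples: two strongly correlated parameters.
rng = np.random.default_rng(1)
a = rng.normal(4.0, 1.0, size=2000)
samples = np.column_stack([a, a + rng.normal(0.0, 0.1, size=2000)])

print(boxplot_stats(samples))      # marginals only; the correlation is invisible
for frame in hop_frames(samples, 3):
    print(frame)                   # each frame keeps the parameters paired
```

Because every frame is a complete row of the sample matrix, the two values move together from frame to frame, which is how the animation exposes covariance that the quartiles alone cannot show.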
In order to investigate whether interactivity can improve rationality or comprehension, we propose interactive variants of boxplots and HOPs that allow the user to focus on self-defined subspaces of the joint distribution. Interactive Boxplots allow users to explore a model by conditioning the joint distribution on one or more parameters. A mouse click conditions the model on a small range of values of the selected parameter around the click location and updates the boxplots of all other parameters according to their marginals under the conditional distribution (video hosted on Google Drive for double-blind review: https://drive.google.com/file/d/1VXoLeeqcGuDHMKbh5bNjGT05q25o6pVm/view). An example conditioning on one variable is shown in Fig. 4 (left). Additionally, the width of the selected parameter is proportional to the probability of the condition under the joint. Thereby the user can explore potential pairwise correlations and complex conditionals. We propose Ballistic Hypothetical Outcome Plots (BHOPs), which allow the user to guide exploration of the joint distribution by clicking on one parameter of interest to trigger a “pseudo-physical” sequence of draws (video hosted on Google Drive for double-blind review: https://drive.google.com/file/d/1HAZBo8bY2oOwHtXMuvlqWQ_Fww2LwnSI/view). The user can explore potential pairwise correlations and condition the joint distribution on a subset of values for one variable by “pinging” variables of interest (see Fig. 4 (right)).
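The conditioning step behind Interactive Boxplots can be approximated directly on posterior samples. The sketch below assumes a simple sample-filtering approach; the helper `condition_on`, the interval half-width, and the synthetic samples are hypothetical and are not taken from the framework.

```python
import numpy as np


def condition_on(samples, param, centre, half_width):
    """Keep joint samples whose `param` lies within centre +/- half_width.

    Returns the retained samples and the fraction retained, which estimates
    the probability of the condition under the joint distribution.
    """
    mask = np.abs(samples[:, param] - centre) <= half_width
    return samples[mask], mask.mean()


# Hypothetical posterior samples: parameter 1 depends on parameter 0.
rng = np.random.default_rng(0)
a = rng.normal(4.0, 1.0, size=5000)
samples = np.column_stack([a, 0.5 * a + rng.normal(0.0, 0.2, size=5000)])

# A "click" at 5.0 on parameter 0 conditions on a narrow range around it.
subset, prob = condition_on(samples, param=0, centre=5.0, half_width=0.25)
print(prob)                                        # could drive the box width
print(np.percentile(subset[:, 1], [25, 50, 75]))   # updated boxplot of parameter 1
```

Filtering samples in this way trades exactness for simplicity: narrow ranges leave few samples, so the updated quartiles become noisier as the condition becomes less probable.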
5.2. Tasks
We were interested in exploring the suitability of the proposed visualisations as a generic interface to Bayesian models, and thus evaluated their effect on user responses across a large portion of the Query Space (see Section 3.1). A set of 24 queries was generated, covering the outer product of two levels of Observability (observable, latent), three Quantities (value, confidence, and id), two levels of Conditioning (posterior, and posterior with additional side information), and two variations of each query type. One variation of a query with additional side information concerning the confidence in the latent ID of a variable provided the Context “A friend tells you she just waited for her sandwich exactly 6 minutes, but won’t tell you which sandwich shop she visited. A mobile app shows you a long-term distribution of wait times across nearby sandwich shops as in the plot. You also have the additional information (not represented in the plot) that it currently takes between 3.5 and 4.0 minutes to get served at Sandys.” and posed the Query “Considering this additional information, how confident are you that your friend got her sandwich from Mesys?”. The additional information from a trustworthy source should be used to further condition the model on queuing delays between 3.5 and 4.0 minutes at Sandys. While this is not possible with non-interactive boxplots, it can be done by visually filtering the samples shown with HOPs, or by interacting with the visual representation of the distribution of queuing delays at Sandys in Interactive Boxplots and Ballistic Hypothetical Outcome Plots. The full list of questions can be seen in the Supplementary Material (user_tasks.txt).
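The outer product that yields the 24 queries can be enumerated as follows. This is a small sketch; the tuple names and string labels are chosen here for illustration and do not come from the supplementary material.

```python
from itertools import product

# Dimensions of the query space as described in the text.
OBSERVABILITY = ("observable", "latent")
QUANTITY = ("value", "confidence", "id")
CONDITIONING = ("posterior", "posterior + side information")
VARIATION = (1, 2)

queries = list(product(OBSERVABILITY, QUANTITY, CONDITIONING, VARIATION))
print(len(queries))  # 24 = 2 x 3 x 2 x 2

# The sandwich-shop example corresponds to a query of this type:
example = ("latent", "confidence", "posterior + side information")
print([q for q in queries if q[:3] == example])
```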
5.3. Design
A factorial design was chosen to measure effects of the four Model Visualisations on all 24 tasks with 96 conditions in total. A pilot study established that only 48 conditions could feasibly be evaluated with each participant within one hour of participant time including introduction and debriefing. As a result, each variant of each query type was only paired with a random subset of two visualisation types per participant.
The interactive variants of boxplots and HOPs intentionally introduce only minimal changes to the visual appearance of their non-interactive counterparts. As a consequence, it is not immediately obvious to the user whether the specific visualisation they are presented with in each task of a randomized sequence affords interaction. We therefore separated all tasks in which interactive and non-interactive visualisations were presented, randomized the order of these groups, and the order of tasks within each group. Each group was further preceded by a message highlighting whether the subsequent set of tasks afforded interaction.
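One way to realise this design for a single participant is sketched below. The function `schedule`, the visualisation labels, and the use of Python's `random` module are assumptions for illustration; the study's actual randomisation procedure may differ.

```python
import random

# Visualisation name -> whether it affords interaction.
VISUALISATIONS = {
    "boxplot": False,
    "hops": False,
    "interactive_boxplot": True,
    "bhops": True,
}


def schedule(n_tasks=24, per_task=2, seed=0):
    """Return one participant's ordered list of (task, visualisation) trials."""
    rng = random.Random(seed)

    # Pair every task with a random subset of two visualisation types.
    trials = [(task, vis)
              for task in range(n_tasks)
              for vis in rng.sample(sorted(VISUALISATIONS), per_task)]

    # Separate interactive from non-interactive trials.
    groups = {True: [], False: []}
    for trial in trials:
        groups[VISUALISATIONS[trial[1]]].append(trial)

    # Randomise the order of the two groups and of trials within each group.
    order = [True, False]
    rng.shuffle(order)
    plan = []
    for interactive in order:
        rng.shuffle(groups[interactive])
        plan.extend(groups[interactive])
    return plan


plan = schedule()
print(len(plan))  # 48 trials instead of the full 96
```

Keeping the two groups contiguous means a participant crosses the boundary between interactive and non-interactive trials only once, which is where the announcing message would be shown.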
5.4. Procedure
The user study was conducted in the presence of a researcher, on a laptop that the researcher provided. Participants were given an information sheet upon arrival and asked to sign a consent form. They received a £10 Amazon voucher for taking part and were promised an additional £50 Amazon voucher if they obtained the highest cumulative reward among all participants. They were seated at a table. An introduction to the study, explaining the types of Queries, Model Visualisations and Answer Inputs, and describing the Task Feedback, was delivered through the client interface and gave participants the opportunity to interact with each of these UI elements prior to commencing the study. Participants were offered clarifications at the end of the introduction, and were reassured that they were responsible for advancing through the study and that they could take breaks as needed. They received task feedback after each task, and their cumulative reward was shown in the top right corner of the display at all times. They were informed when they advanced from the set of interactive visualisations to non-interactive ones or vice versa. After all tasks were completed, they were invited to provide qualitative feedback and given the opportunity to ask questions in a short debriefing discussion.
5.5. Participants
Participants were recruited via mailing lists and Yammer from former and present student populations, administrative, and academic staff across the University. The study was approved in advance by the University Ethics Committee. 22 volunteers took part in the user study, providing a total of user responses.
6. Analysis Workflow
This section illustrates the proposed analysis workflow on the data collected in the user study described in Section 5.
6.1. User Study Results
6.1.1. Was the task difficulty well calibrated?
The difficulty of study tasks should ideally be calibrated such that user responses deviate sufficiently from the optimal task response, confirming that a task was not too easy, and such that user responses are significantly different from a random response, confirming that a task was not too difficult. Note that only limited calibration can be performed a priori, as it requires user responses. The sample of rewards obtained by participants for each study task, stratified across Model Visualisations, is illustrated in Fig. 5. Almost all tasks are sufficiently challenging, as sample rewards deviate considerably from the optimal reward. Tasks 1, 11, and 12 appear potentially too easy in hindsight, as the vast majority of participant responses on these tasks were close to optimal. A random agent’s behavior was simulated computationally by sampling 1000 responses uniformly at random from the space of permissible responses for each task. The corresponding rewards were compared to participants’ rewards. A Mann-Whitney U test with Bonferroni correction for multiple comparisons rejected the null hypothesis simultaneously for all tasks. This result suggests that users in aggregate responded to no task at random. Note, however, that this comparison provides only a weak lower bound on task difficulty, as a random agent completely disregards the query and the visualised information of the model.
6.1.2. Does visual animation have an effect on rationality or comprehension?
We investigate the potential effect of visual animation on user responses as measured by task reward. We perform a within-subject analysis of differences in rewards obtained on the same task by each user under an animated and a non-animated condition. Specifically, we consider (B)HOPs as animated and (interactive) boxplots as not animated. Paired samples were stratified by task and aggregated separately across rationality questions and comprehension questions. The 50% central interval (CI) of differences in rewards observed in response to rationality questions and to comprehension questions overlaps with the null hypothesis; thus, no effect of animation on rewards was observed in this study. There was also no effect on the differences in time until an answer was acknowledged, for either rationality or comprehension questions.
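A paired comparison of this kind can be sketched as follows. The record layout, the helper names `paired_differences` and `central_interval`, and the use of empirical quartiles for the 50% central interval are assumptions; the sketch also omits the stratification by task and question type described above.

```python
import numpy as np

ANIMATED = {"hops", "bhops"}  # boxplot variants count as not animated


def paired_differences(records):
    """records: iterable of (user, task, visualisation, reward) tuples.

    Returns reward differences (animated minus not animated) for every
    (user, task) pair that was seen under both conditions.
    """
    by_pair = {}
    for user, task, vis, reward in records:
        by_pair.setdefault((user, task), {})[vis in ANIMATED] = reward
    return np.array([r[True] - r[False] for r in by_pair.values() if len(r) == 2])


def central_interval(diffs, mass=0.5):
    """Empirical central interval containing `mass` of the differences."""
    lo, hi = np.percentile(diffs, [50 * (1 - mass), 50 * (1 + mass)])
    return lo, hi


# Hypothetical responses for one task.
records = [
    ("u1", 1, "hops", 0.8), ("u1", 1, "boxplot", 0.6),
    ("u2", 1, "bhops", 0.4), ("u2", 1, "interactive_boxplot", 0.7),
    ("u3", 1, "hops", 0.5), ("u3", 1, "bhops", 0.9),  # no non-animated partner
]
diffs = paired_differences(records)
lo, hi = central_interval(diffs)
print(lo <= 0.0 <= hi)  # True when the interval overlaps the null of no difference
```

The same functions apply to the interactivity comparison by swapping the set of visualisations that defines the condition.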
6.1.3. Does interactivity have an effect on rationality or comprehension?
Analogously, we investigate the effect of interactivity by analysing paired samples of responses under an interactive and a non-interactive condition. We consider interactive boxplots and BHOPs as interactive, and boxplots and HOPs as non-interactive. The central 50% CI of differences in rewards observed in response to rationality questions and to comprehension questions also overlaps with the null hypothesis, indicating that interactivity likewise had no effect on rewards in this study. Similarly, there was no effect of interactivity on response time, for either rationality or comprehension questions.
6.2. Discussion
Jointly, these results highlight the need for quantitative evaluation to test whether knowledge gained in other scenarios where interactive and animated visualisations have been found to be useful readily applies to the interpretation of Bayesian models. By no means do the results reported here suggest otherwise. Instead, one might speculate that previously identified challenges with communicating uncertainty are exacerbated by the potentially high dimensionality and complex covariance structure represented by Bayesian models, that effects are limited to a smaller query subspace, or that effect sizes of subtle changes to existing visualisations of uncertainty are too small to be measured with lab-based user studies, as suggested by the results of Micallef et al. (2012). The absence of a measured effect may also be related to other confounding factors (Hullman et al., 2019) such as the users’ context, the framing of the task query, numerical information present in the task context, a mismatch between users’ reported subjective confidence and elicited statistical confidence, or participants’ employed heuristics to simplify judgements under uncertainty. More radically innovative interactive tools that communicate aspects of a Bayesian model’s structure might also be more supportive of rational decision making and model comprehension (Taka et al., 2020). Each of these hypotheses can be tested with the proposed evaluation protocol and the software framework presented in this paper by making systematic changes to the study design. In the context of assessing task difficulty we introduced computational modelling, with a very naive behavior model, as a tool for interpreting user responses. In future work we will investigate this approach further to explore, for example, how user responses compare to an agent whose decisions are based on restricted views of the joint distribution such as the global mean, disregarding all information on uncertainty, or unconditional marginal distributions.
7. Conclusion
While reasoning about uncertainty is challenging for people, and communicating uncertainty is difficult, systematic research in this area is becoming increasingly important as day-to-day decisions and scientific discoveries are driven by probabilistic models. Standardised evaluation protocols support meta-studies, reproducibility and measuring progress of the research community as a whole, but are currently lacking in the field of probabilistic model visualisation. With the proposed protocol and software framework, we hope to contribute to this community effort.
The evaluation protocol delineates the query space for Bayesian models and proposes one semantic mapping as a tool to define user study tasks. It draws the distinction between objectives for evaluating decision making and model comprehension, and formalizes each for quantitative evaluation. We identified a gap in available user input controls for communicating uncertainty in categorical decisions and designed the MultiBet in response. To further advance the standardisation effort, we developed a customisable software framework for user studies on probabilistic model visualisations that can be extended easily to include further input controls (Greis et al., 2017) and visualisation types (Taka et al., 2020). We reported on a user study to illustrate the process of conducting a quantitative user study with this framework, and to test the research hypotheses that animation or interaction has an effect on rationality and comprehension, which were motivated by findings in the literature. The quantitative evaluation results did not support either of these hypotheses, highlighting the need for scrutinizing design hypotheses with quantitative user studies.
Acknowledgements.
This work was supported by EPSRC Project EP/R018634/1.
References
 LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S. communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S. Philosophical Transactions of the Royal Society of London 53, pp. 370–418. External Links: Document, Link, https://royalsocietypublishing.org/doi/pdf/10.1098/rstl.1763.0053 Cited by: §1.
 Researchers misunderstand confidence intervals and standard error bars. Psychological Methods 10 (4), pp. 389–396. External Links: Document, Link Cited by: §5.1.2, §5.
 Towards a principled bayesian workflow. Note: Available online at: https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html#6_discussion [Accessed May 5, 2020]. Cited by: §3.1.
 Mimic: visual analytics of online microinteractions. In Proceedings of the 2014 International Working Conference on Advanced Visual Interfaces, AVI ’14, New York, NY, USA, pp. 245–252. External Links: ISBN 9781450327756, Link, Document Cited by: §2.3.
 Understanding bayesian reasoning via graphical displays. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’89, New York, NY, USA, pp. 381–386. External Links: ISBN 0897913019, Link, Document Cited by: §1, §2.2.
 Assessing students’ difficulties with conditional probability and Bayesian reasoning. International Electronic Journal of Mathematics Education 2 (3), pp. 128–148. External Links: Link Cited by: §1.
 Increasing the transparency of research papers with explorable multiverse analyses. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, New York, NY, USA. External Links: ISBN 9781450359702, Link, Document Cited by: §1.
 Probabilistic reasoning in clinical medicine: problems and opportunities. In Judgment under Uncertainty: Heuristics and Biases, D. Kahneman, P. Slovic, and A. Tversky (Eds.), pp. 249–267. External Links: Document Cited by: §2.2.
 Targeted projection pursuit for interactive exploration of high dimensional data sets. In 2007 11th International Conference Information Visualization (IV ’07), pp. 286–292. External Links: Document, ISSN 15506037 Cited by: §5.
 Uncertainty displays using quantile dotplots or cdfs improve transit decision-making. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–12. External Links: ISBN 9781450356206, Link Cited by: §1.
 Report 13: estimating the number of infections and the impact of non-pharmaceutical interventions on COVID-19 in 11 European countries. Imperial College London (30-03-2020). External Links: Document, Link Cited by: §1.
 Reckoning with risk: learning to live with uncertainty. Penguin. External Links: ISBN 0140297863 Cited by: §1.
 Generative adversarial nets. Advances in neural information processing systems 27. Cited by: §1.
 Designing for uncertainty in HCI: when does uncertainty help?. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, CHI EA ’17, New York, NY, USA, pp. 593–600. External Links: ISBN 9781450346566, Link, Document Cited by: §1.
 Input controls for entering uncertain data: probability distribution sliders. Proc. ACM Hum.-Comput. Interact. 1 (EICS). External Links: Link, Document Cited by: §3.3, §7.
 Imagining replications: graphical prediction & discrete visualizations improve recall & estimation of effect uncertainty. IEEE Transactions on Visualization and Computer Graphics 24 (1), pp. 446–456. External Links: Document, ISSN 21609306 Cited by: §3.2.
 In pursuit of error: a survey of uncertainty visualization evaluation. IEEE Transactions on Visualization and Computer Graphics 25 (1), pp. 903–913. External Links: Document Cited by: §1, §2.3, §3, §6.2.
 Hypothetical outcome plots outperform error bars and violin plots for inferences about reliability of variable ordering. PLOS ONE 10 (11). External Links: Link Cited by: §1, §1, §2.2, §5.1.2, §5.
 Decisions with uncertainty: the glass half full. Current Directions in Psychological Science 22 (4), pp. 308–315. External Links: Document, Link, https://doi.org/10.1177/0963721413481473 Cited by: §1.
 When(ish) is my bus? user-centered visualizations of uncertainty in everyday, mobile predictive systems. In ACM Human Factors in Computing Systems (CHI), External Links: Link Cited by: §1, §1, §2.2.
 There’s no such thing as gaining a pound: reconsidering the bathroom scale user interface. UbiComp ’13, New York, NY, USA, pp. 401–410. External Links: ISBN 9781450317702, Link, Document Cited by: §1.
 Researcher-centered design of statistics: why bayesian statistics better fit the culture and incentives of hci. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, New York, NY, USA, pp. 4521–4532. External Links: ISBN 9781450333627, Link, Document Cited by: §1.
 Explaining the Gap: Visualizing One’s Predictions Improves Recall and Comprehension of Data. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI ’17, New York, NY, USA, pp. 1375–1386. External Links: ISBN 9781450346559, Link, Document Cited by: §1, §1, §5.
 Probabilistic graphical models: principles and techniques  adaptive computation and machine learning. The MIT Press. External Links: ISBN 0262013193 Cited by: §1.
 Chapter 12: Leaving Conjugates Behind: Markov Chain Monte Carlo. In A Student’s Guide to Bayesian Statistics, J. Seaman (Ed.), pp. 264–289. Cited by: §1.
 Statistical rethinking: a bayesian course with examples in r and stan. Chapman and Hall/CRC Press. External Links: ISBN 1482253445 Cited by: §5.1.1.
 Assessing the effect of visualizations on bayesian reasoning through crowdsourcing. IEEE Transactions on Visualization and Computer Graphics 18 (12), pp. 2536–2545. Cited by: §1, §1, §2.2, §3.2, §6.2.
 Psychology’s renaissance. Annual Review of Psychology 69 (1), pp. 511–534. Note: PMID: 29068778 External Links: Document, Link, https://doi.org/10.1146/annurevpsych122216011836 Cited by: §1.
 Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 1–9. Cited by: §1.
 Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2, pp. e55. Cited by: §1.
 Teaching Bayesian reasoning in less than two hours. Journal of experimental psychology. General 130 (3), pp. 380–400. External Links: Document, Link Cited by: §1.
 A bayesian modelling of wildfires in portugal. In Dynamics, Games and Science, J. Bourguignon, R. Jeltsch, A. A. Pinto, and M. Viana (Eds.), Cham, pp. 723–733. External Links: ISBN 9783319161181 Cited by: §1.
 Visualizing uncertainty about the future. Science 333 (6048), pp. 1393–1400. External Links: Document, ISSN 00368075, Link Cited by: §1.
 Bayesian statistics. Scholarpedia 4 (8), pp. 5230. External Links: Document Cited by: §1.
 Increasing interpretability of bayesian probabilistic programming models through interactive representations. Frontiers in Computer Science 2, pp. 52. External Links: Link, Document, ISSN 26249898 Cited by: §1, §2.2, §6.2, §7.
 Interactive visualizations to improve bayesian reasoning. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 55, pp. 385–389. External Links: Document Cited by: §1, §1, §3.2, §5.
 Judgment under Uncertainty: Heuristics and Biases. Science 185 (4157), pp. 1124–1131. External Links: Document, ISSN 00368075, Link, https://science.sciencemag.org/content/185/4157/1124.full.pdf Cited by: §1.
Appendix
We also performed a between-subject analysis of responses, exploring in more detail differences in accuracy at the task level. Using a bootstrap with 1000 samples, we estimated the mean and standard deviation of accuracy per task and per condition, and estimated the effect size as the difference in sample means scaled by the square root of the average standard deviation within each pair of conditions.
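The effect-size computation can be sketched as below. The helper names and example accuracies are hypothetical, and averaging per-resample means and standard deviations is one assumed reading of the bootstrap step. The scaling follows the wording above literally (square root of the average standard deviation); a conventional standardised effect size would instead divide by the square root of the average variance.

```python
import numpy as np


def bootstrap_mean_sd(values, n_boot=1000, seed=0):
    """Average the mean and standard deviation over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    resamples = rng.choice(values, size=(n_boot, values.size), replace=True)
    return resamples.mean(axis=1).mean(), resamples.std(axis=1, ddof=1).mean()


def effect_size(acc_a, acc_b, n_boot=1000, seed=0):
    """Difference in bootstrap means, scaled as described in the text."""
    mean_a, sd_a = bootstrap_mean_sd(acc_a, n_boot, seed)
    mean_b, sd_b = bootstrap_mean_sd(acc_b, n_boot, seed + 1)
    return (mean_a - mean_b) / np.sqrt((sd_a + sd_b) / 2)


# Hypothetical per-participant accuracies on one task under two conditions.
acc_a = [0.90, 0.80, 0.85, 0.95]
acc_b = [0.50, 0.60, 0.55, 0.40]
print(effect_size(acc_a, acc_b))
```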