Talking datasets: Understanding data sensemaking behaviours

by Laura Koesten, et al.
University of Southampton

The sharing and reuse of data are seen as critical to solving the most complex problems of today. Despite this potential, relatively little is known about a key step in data reuse: people's behaviours involved in data-centric sensemaking. We aim to address this gap by presenting a mixed-methods study combining in-depth interviews, a think-aloud task and a screen recording analysis with 31 researchers as they summarised and interacted with both familiar and unfamiliar data. We use our findings to identify and detail common activity patterns and necessary data attributes across three clusters of sensemaking activities: inspecting data, engaging with content, and placing data within broader contexts. We conclude by proposing design recommendations for tools and documentation practices which can be used to facilitate sensemaking and subsequent data reuse.




1. Introduction

Climate change. Poverty. Global hunger. All have been dubbed wicked problems (peters2017so) that have the best chance of being tackled by bringing together and using cross-disciplinary data in new ways (DBLP:books/ms/4paradigm2009/HeyTT09; CODATA). Although data reuse is increasingly encouraged (EUpublications), reusing data involves a host of challenges (e.g. wilkinson2016fair), particularly when people must bring together and make sense of data from multiple domains.

Even within their own disciplinary domain, understanding and making sense of data is a difficult and time-intensive process for researchers and data professionals (kern2015there; DBLP:conf/chi/MullerLWPTLDE19). Contributing to this difficulty is the fact that data do not speak for themselves, but require supporting structures – both social and technical – to convey the meaning necessary for reuse (borgman2015big). The effort and costs involved in sensemaking can potentially be reduced through the development of computerized tools and systems (as in DBLP:conf/chi/RussellSPC93). Designing such tools is contingent upon first understanding and describing the relatively understudied behaviours involved in data-centric sensemaking.

Here, we identify and detail patterns of activities involved in data exploration and sensemaking. In the context of this work, data can be thought of as collections of related observations organised and formatted for a particular purpose, reflecting the variety of concepts different actors have of data (see (borgman2015big)).

In order to identify sensemaking patterns, we employ a multi-faceted study design, using a qualitative approach to study data reuse and drawing on verbal summarizations as a method of uncovering sensemaking processes. We build on the following ideas: (1) the act of summarizing is a form of sensemaking, (2) verbal summarization represents unique cognitive processes and (3) it is possible to uncover common patterns in sensemaking activities when people describe data that they are familiar with and data that were unknown to them. We used these ideas to develop the following research questions:

  • RQ1: What are common patterns in sensemaking activities, both for known and for unknown data, in the initial phases of data-centric sensemaking?

  • RQ2: How do patterns of data-centric sensemaking afford potential data reuse?

To explore these questions, we created a study design combining in-depth interviews, a think-aloud task and a screen recording analysis with researchers while they interacted with and verbally summarised an example of their own research data and a dataset that was new to them. We present the results from this study, and use our findings to identify activity patterns and data attributes which are important across three clusters of sensemaking activities: inspecting the data, engaging with the data content more deeply and placing data within broader contexts. Finally, we discuss how these findings can be used to develop design recommendations for tools and documentation practices to facilitate sensemaking and subsequent data reuse.

The key contributions of this work are threefold: (i) the sensemaking patterns which we identify, (ii) the user needs for data reuse which we highlight, (iii) the design recommendations which we propose for data reuse.

2. Background

Sensemaking has been studied across a range of disciplines, including psychology (e.g. DBLP:journals/expert/KleinMH06), decision making (e.g. klein1993decision; malakis2013sensemaking), organizational behaviour (e.g. maitlis2014sensemaking for a review), information seeking (dervin1997given; DBLP:journals/ijhci/MarchioniniW07), and human computer interaction (HCI) (e.g. DBLP:conf/chi/RussellSPC93). In this work, we focus on sensemaking as discussed in information science and HCI. In these domains, sensemaking is defined as the process of constructing meaning from information (blandford2010interacting), and is recognised as being an iterative process that involves linking different pieces of information into a single conceptual representation (hearst2009search; DBLP:conf/hvei/Russell03).

Models of information seeking behaviour often present sensemaking as a key component. Traditional models detail the specific steps involved in sensemaking during information seeking as a sequential, yet evolving, process (e.g. DBLP:journals/iwc/SutcliffeE98; hearst2009search; kuhlthau2004seeking). While traditional models tend to be static, many of their authors emphasise that people’s behaviour is complex and changes when being presented with new information. More recent, dynamic models acknowledge a variety of influencing factors in finding and making sense of information, e.g. skills, knowledge, preconceptions, culture or motivation (e.g. kelly2009methods; klein2007data). Other work examines the cognitive mechanisms involved, framing sensemaking as a series of different information processing components taking data as input and producing conceptual changes as an output (doi:10.1086/594540; zhang2019cognitive).

While sensemaking of textual information has been well-explored, there is a gap in research that aims to understand the strategies involved in making sense of data. Compounding this is the fact that the very definition of “data”, particularly “research data”, has itself been the subject of much debate. An increasingly common conceptualisation of research data is that proposed by Borgman: data are representations of observations, objects, or other entities that are used as evidence for the purposes of research or scholarship (borgman2015big). This definition does not distinguish between data formats or between qualitative and quantitative data, recognizing that what serves as data in one situation for one individual may not act as data in another situation for another individual (see also Pasquetto2017). Similarly, in their data frame theory, klein2007data emphasise how the perspective (or frame) of the data consumer shapes the data in terms of how they are perceived, interpreted and even acquired. Through engaging with data, preexisting frames either change or get reinforced, which can be seen as an aspect of sensemaking.

Work in information science focuses on identifying the contextual information needed by researchers to evaluate data for reuse in a variety of fields, ranging from zooarchaeology to social sciences and earthquake engineering (e.g. faniel2017practices; DBLP:journals/jasis/KimY17; DBLP:journals/cscw/FanielJ10). More recently, Faniel and colleagues have created a typology of the contextual information needed to support data evaluation across three disciplinary domains, finding that information about data production, data repositories, and data usage is key in making decisions about reusing data (faniel2019context). Details about how researchers understand and work with data that they do not create themselves are also found within the context of ethnographic studies of broader data practices, such as data sharing (zimmerman_new_2008; wallis2013if) or data discovery (gregory2019understanding).

Engagement and sensemaking with data is also determined by the purpose of the engagement activity, usually connected to a task, which can range in specificity. The importance of quality metrics and uncertainty attached to data is task dependent (DBLP:conf/chi/KoestenKTS17; Boukhelifa:2017:DWC:3025453.3025738). While there are a variety of task classifications in information seeking literature, to this date there is no established taxonomy for data-centric work tasks, which might reflect rapidly changing work practices with data.

Studies in HCI tend to focus on quantitative data, addressing the role that visualization plays in identifying patterns in data (furnas2005making; DBLP:journals/tvcg/KangS12) or proposing tools such as a visual analytics system tailored for particular groups of data analysts (stasko2008jigsaw) or agile display mechanisms for users accessing government statistics (marchionini2005information). Other relevant work examines exploratory data analysis (EDA) strategies, where new data are explored with a series of procedures, until a high-level story emerges. Common EDA techniques include performing rough statistical checks and analyses (e.g. calculating descriptive statistics) or looking for general trends or outliers in the data (DBLP:journals/jais/BakerJB09; marchionini2006exploratory). Most EDA techniques are graphical in nature and are undertaken to help assess the quality of the data.
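To make the kind of rough statistical check described above concrete, the following is a minimal sketch in Python. It is an illustration only and not part of the study: the function name `rough_checks`, the example column and the 1.5 × IQR rule for flagging outlier candidates are all assumptions chosen for brevity.

```python
import statistics

def rough_checks(values):
    """Return descriptive statistics and IQR-based outlier candidates.

    `values` may contain None for missing observations.
    """
    present = sorted(v for v in values if v is not None)
    # Quartiles using the default ("exclusive") method of statistics.quantiles.
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": median,
        "min": present[0],
        "max": present[-1],
        # Values outside the 1.5 * IQR fences are candidates, not confirmed errors.
        "outliers": [v for v in present if v < low or v > high],
    }

# Hypothetical column of counts with one missing and one extreme value.
deaths = [0, 2, 1, None, 3, 2, 0, 1, 250, 2]
summary = rough_checks(deaths)
print(summary)
```

A check like this only points to values worth a closer look; whether an extreme value is an error or a genuine observation still has to be judged in context.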

The first phase of getting to know data, which can involve exploratory data analysis techniques, has been shown to involve a high level of cognitive effort (DBLP:conf/chi/KoestenKTS17; DBLP:journals/jasis/ZhangS14). We aim to more deeply examine data-centric sensemaking, with a focus on the initial account of understanding and interacting with data, in order to identify common patterns of sensemaking activities. To do this, we build on work using summarization as a way of exploring cognitive processes (hidi1986producing).

Summarisation tasks, as studied in psychology, are described as involving three distinct cognitive activities: selection of which aspects should be included in the summary; condensation of source material to higher-level ideas, or more specific lower-level concepts; and transformation by integrating and combining ideas from the source (hidi1986producing). As comprehension is viewed as a prerequisite for summarisation (kintsch1978toward), text summarisation tasks have been used to assess recall and language abilities (kintsch1978toward). We build on these ideas and use summarization as a way of exploring the cognitive processes involved in comprehending data as an information source.

In a recent study, koesten2019everything had participants produce written dataset summaries in order to better understand selection criteria for datasets. While these summaries provide insights into the conceptualisation of datasets, written summaries do not always capture the complex verbal sensemaking that precedes their creation (mayernik2011metadata). Verbal, or spoken, summarizations often reflect deeper, more spontaneous cognitive processes (DBLP:journals/jasis/CrestaniD06), but they have yet to be used to understand data sensemaking behaviours.

Here, we make use of verbal data summarizations to explore how researchers make sense of both familiar and unfamiliar data. We aim to bring together multiple perspectives, capturing the initial phases of sensemaking of our participants, but also investigating what they believe is necessary for someone to know about their data in order for them to be understood. Our findings reveal patterns in sensemaking activities, but they also reveal details about information structures needed for data reuse. These findings are therefore relevant for tool designers, as well as other stakeholders ranging from data producers and consumers to data repositories and publishers.

3. Methodology

We drew on our past work in textual data summarization (koesten2019everything) and the reuse of research data (gregory2019understanding) to create a semi-structured interview design examining how people verbally summarize and make sense of both familiar and unfamiliar data.

3.1. Interview design

All participants were asked to bring data that they had used or were familiar with to share during the interview. We did not specify what we meant by “data”, leaving the decision about what constitutes “their data” up to participants themselves. The majority of participants (n=27) chose to bring data which they had created. Most of the data brought by participants were spreadsheets; other data included textual data (e.g. interview transcripts), images, videos or other artifacts.

We also prepared a dataset to share with participants, a modified version of a spreadsheet from a popular news source in the United Kingdom, the Guardian Data Blog, which was used in a previous study (koesten2019everything) (see Figure 1; the entire spreadsheet is available on a GitHub repository associated with this work). This dataset was chosen as it met specific selection criteria: it included numerical and textual data, missing values, inconsistencies in formatting and some ambiguous data. At the same time, the data were understandable and not specific to a particular domain.

Figure 1. Excerpt of the ”given dataset” describing the global occurrence and mortality rate of swine flu
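The selection criteria above (missing values, inconsistent formatting, ambiguous entries) can be illustrated with a small profiling sketch. The excerpt below is invented for illustration and is not the given dataset; the column names, the values and the helper `profile_column` are all hypothetical.

```python
import csv
import io
import re

# Hypothetical excerpt; column names and values are invented for illustration.
RAW = """country,cases,deaths
Austria,"1,234",5
Bulgaria,766*,
Chile,12257,140
Denmark,n/a,0
"""

PLAIN_INT = re.compile(r"^\d+$")

def profile_column(rows, column):
    """Count missing, cleanly numeric, and irregularly formatted cells."""
    report = {"missing": 0, "numeric": 0, "irregular": []}
    for row in rows:
        cell = (row[column] or "").strip()
        if cell == "":
            report["missing"] += 1
        elif PLAIN_INT.match(cell):
            report["numeric"] += 1
        else:
            # Thousands separators, footnote marks or placeholders such as "n/a".
            report["irregular"].append(cell)
    return report

rows = list(csv.DictReader(io.StringIO(RAW)))
cases_report = profile_column(rows, "cases")
deaths_report = profile_column(rows, "deaths")
print(cases_report)
print(deaths_report)
```

In this sketch a placeholder such as "n/a" is reported as irregular rather than missing, because deciding what a placeholder means is itself part of making sense of the data.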

The interview protocol (available on the GitHub repository) consisted of two primary sections: questions about the participants’ data, and questions about the spreadsheet we had prepared, referred to here as the given data. Interviews lasted minutes and were held using the web-conferencing application Zoom. Audio-recordings were created for all interviews; screen-recordings capturing participants’ online interactions with the data were obtained for 26 interviews. (Five recordings were not created due to technical problems.)

Both sections of the interview protocol began with the verbal summarization task. The task was for participants to provide a general description of the data to a potential data reuser who is trying to decide whether to use the data for a particular task, but is unable to see the data. We formulated the task in this fashion in order to elicit rich descriptions of the data.

After completing the summarization task in the first section of the interview, further questions (see interview protocol) were asked regarding participants’ data, data creation and documentation practices, and data reuse. In the second section of the interview, we shared our dataset with participants and asked them to perform the same verbal summarization task for the given data. We also asked them to discuss and describe specific areas of the dataset and posed follow-up questions about data reuse, data sharing and data search.

We conducted two pilot interviews with researchers at the end of October 2018. This allowed us to determine the time needed to conduct the interviews and to fine-tune the interview questions and the set-up of the summarization task. The remaining interviews were held between November 2018 and January 2019 and were transcribed by a professional transcription firm.

3.2. Recruitment

Our primary sample was drawn from a pool of individuals, past respondents to a large scale survey study conducted by (DBLP:journals/corr/abs-1909-00464), who had published at least one article indexed in Elsevier’s Scopus literature database in the last three years.

We sent a total of recruitment emails in November-December 2018 in two batches and received positive responses. From those responding, we selected participants who represented a range of disciplines and nationalities and who were proficient in English. We recruited an additional four participants via convenience and purposive sampling, for a total of 31 participants. Participants for our pilot interviews were identified using purposive sampling.

3.3. Participants

Participants ranged from age to age, with the majority being between and years old (Median ). Participants reported different countries of residence worldwide, with a skew towards the Netherlands () and the UK/USA (). out of the countries are in Europe; of our participants live in European countries.

Although participants work in multiple countries, the majority were fluent in English; minor problems with language or internet connectivity were experienced in two of the interviews. Over half of the participants (n=18) worked at a university or college at the time of the study, with six working in research institutions. Participants’ disciplinary domains and roles are described in Table 1. All participants have previously published research papers. The majority of our participants were experienced with quantitative research; others categorised themselves as predominantly qualitative researchers or used both quantitative and qualitative methods.

P | Domain | Role
1 | Biological sciences | Project manager
2 | Life sciences, Paleontology | Project acquisition manager
3 | Biblical studies, Information Technology | Researcher
4 | Musicology, Humanities | Project leader, project manager
5 | Geophysics | Data curator
6 | Physics, Chemistry | Post doctoral associate
7 | Analytical Chemistry | Researcher
8 | Material Science and Engineering | Professor emeritus, researcher
9 | Social Science (Social Care, Social Work) | Senior research fellow
10 | Social sciences, Computer science | Director of Research Services
11 | Social justice, Socioeconomic Justice | Professor
12 | Geology, Earth Sciences | Research scientist
13 | Earth Sciences | PhD student
14 | Fluid Mechanics | Researcher
15 | Molecular Biology | Researcher
16 | Tourism, Social Psychology | Senior lecturer
17 | Mathematical education | Assistant professor
18 | Telecommunications, Computer science | Associate professor
19 | Biological anthropology | Postdoctoral research fellow
20 | Medicine, Biomedicine | Researcher and teacher
21 | Agriculture, Food science | PhD Student
22 | Medicine | Surgeon, PhD student
23 | Entomology (Biological Sciences) | Researcher, curator
24 | Environmental sciences, agriculture | Lecturer
25 | Biostatistics, Epidemiology | Associate professor, biostatistician
26 | Material Science | Researcher
27 | Psychology | Researcher, PhD student
28 | Veterinarian, Obstetric Clinician | Assistant professor
29 | Information science, Medicine | Associate director
30 | Environmental sciences | Researcher
31 | Medicine, Mental Health | Head of research group in a hospital
Table 1. Description of participants (P) with their disciplinary domains and professional roles

3.4. Analysis

3.4.1. Developing and testing the codebook

We used the qualitative data analysis program NVivo to thematically analyze the interview transcripts. The coding strategy was developed through a multi-step process of independent parallel coding (thomas2006general). Two authors independently analyzed a sample of seven interview transcripts and developed an initial codebook with supporting examples, employing a combination of deductive and inductive thematic analysis (robson2016real). Codes developed through deductive analysis were oriented on the different sections of the interview protocol (e.g. the codes ”given data” and ”participants’ data”) and on existing literature in data summarization (koesten2019everything) and data reuse (faniel2019context; faniel2017practices). Within these high-level themes, the authors iteratively developed codes based on a general inductive approach (thomas2006general) through sequential readings of the transcripts.

The independently-developed codebooks and supporting quotations were compared for similarities and differences. Two of the authors combined and modified the individual codebooks to create a single unified codebook, which was then used to re-code the sample transcripts independently. The re-coded sample transcripts were then compared in order to test the validity and comprehensiveness of the codes. To further enhance the reliability of the coding scheme, two senior researchers checked and discussed the unified codebook for a sample of the data.

Based on this analysis, we made further modifications, resulting in a nested coding tree consisting of three primary codes with a total of 30 child codes (see Figure 2 for the most used codes). Nine of these child codes also had multiple child codes of their own. We consolidated these codes through axial coding (straus1990basics), drawing out those links which allowed us to answer our research questions. The themes identified through axial coding are used to structure the Findings section and form the basis of the synthesis presented in Figure 7.

Figure 2. Primary codes

3.4.2. Screen recording analysis

We analyzed the 26 captured screen recordings to identify common interactions with the given dataset. We examined participants’ actions from the time when they were first shown the dataset to the end of the general description task. Two of the authors independently viewed a sample of these screen recordings and made a list of common interactions. The authors then discussed these lists and agreed on a final list of interactions to use in the video analysis. We used this list to identify the first occurrence of each possible action; we did not consider the duration of each action in our analysis. Via the screen recordings, we could document how much of the spreadsheet participants could visibly access on their screens without scrolling. This allowed us to control for larger screens.
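The first-occurrence coding described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the procedure used in the study: the participant labels, the action names and the function `first_occurrences` are hypothetical, and the logs are invented.

```python
from collections import Counter

# Hypothetical observation logs: one chronologically ordered list of coded
# actions per participant (labels and action names are invented).
logs = {
    "A": ["scroll_down", "click_header", "scroll_down", "scroll_right"],
    "B": ["click_header", "scroll_right", "click_cell", "scroll_right"],
    "C": ["scroll_down", "scroll_right", "click_cell"],
}

def first_occurrences(actions):
    """Map each action to the 1-based rank at which it first appears.

    Repeats and durations are ignored; only the order of first use is kept.
    """
    ranks = {}
    for action in actions:
        if action not in ranks:
            ranks[action] = len(ranks) + 1
    return ranks

ranks_by_participant = {p: first_occurrences(a) for p, a in logs.items()}

# Number of participants who performed each action at least once.
totals = Counter(
    action for ranks in ranks_by_participant.values() for action in ranks
)

# The action each participant performed first (dicts keep insertion order).
opening_actions = Counter(
    next(iter(ranks)) for ranks in ranks_by_participant.values()
)
print(totals)
print(opening_actions)
```

Keeping only the rank of first use discards how long or how often an action was performed, which matches the stated decision not to consider duration.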

3.4.3. Quantitative analysis

All plots were created using the statistical analysis program R. We used the color palette “viridis”, as it has been shown to be more accessible than other comparable color schemes.

3.4.4. Ethics

The study was approved by the University of Southampton’s Ethical Advisory Committee under ERGO Number 45874. Informed written consent was given by the participants prior to the interview.

4. Findings

We present our results along three dimensions which we identified via axial coding to be related to sensemaking activities: inspecting the data, engaging with the content and placing data in broader contexts. We pay special attention to both activities, which we define as the physical and cognitive actions undertaken by participants, and data attributes, or characteristics of the data with which participants interacted. We examine the findings in light of data reuse and synthesize them in the Discussion section to provide an overview of the activity patterns we uncovered.

4.1. Inspecting

When participants were first shown our dataset, we asked them to provide a general description after taking a few minutes to explore the data silently. Here, we examine both the order in which participants discussed attributes of the data (see Figures 3 and 4) and their actions in the spreadsheet during these verbal summarisations (Figure 5).

4.1.1. Order of verbal summarisation

We observed two approaches when completing the verbal summarisation task: participants took either a linear or an interwoven approach. In linear summaries of the given data (n=23), participants addressed the data attributes identified in Figure 3 (e.g. time, location, format) one-by-one before proceeding to the next attribute. In the interwoven summaries, participants interspersed descriptions of individual attributes with analyses and comments (n=8).

Figure 3. The order in which participants discussed certain attributes in the given dataset

Figure 3 shows that the attributes headers (n=64), quality/uncertainty (n=42), topic/title (n=33), and analysis/dependencies (n=30) were most frequently mentioned in descriptions of the given dataset. The majority of participants mentioned the overall topic or title as one of the first two attributes (n=24); roughly half of participants mentioned the format or shape of the data (e.g. the number of columns, rows or observations) either first or second (n=15). The discussions of other attributes were likely influenced by the structure of the dataset itself. Location information was a prominent part of the dataset: the data were ordered by country, and the four columns containing geographic information were positioned on both the left and right sides of the spreadsheet. The data included only minimal temporal information. The majority of general descriptions mentioned location (n=22) toward either the beginning or end of the description, while temporal information was mentioned in just under half of the general descriptions (n=13).
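The two kinds of counts reported above, total mentions of an attribute and the number of participants who mention it first or second, can be told apart with a short sketch. The coded summaries below are invented, and the attribute labels and the helper `early_mentioners` are hypothetical; the sketch also assumes that mention totals count every mention rather than every participant, which the text does not state explicitly.

```python
from collections import Counter

# Hypothetical coded summaries: attribute codes assigned to each participant's
# description, in the order in which they were spoken.
summaries = {
    "A": ["topic/title", "format", "headers", "location", "headers", "quality/uncertainty"],
    "B": ["format", "topic/title", "headers", "time", "quality/uncertainty"],
    "C": ["headers", "location", "topic/title", "headers"],
}

# Total mentions per attribute; one participant can contribute several.
mentions = Counter(code for codes in summaries.values() for code in codes)

def early_mentioners(attribute, within=2):
    """Participants whose first `within` distinct attributes include `attribute`."""
    early = []
    for participant, codes in summaries.items():
        distinct = list(dict.fromkeys(codes))  # de-duplicate, keep spoken order
        if attribute in distinct[:within]:
            early.append(participant)
    return early

print(mentions)
print(early_mentioners("topic/title"))
```

Separating the two tallies makes it possible for an attribute to have more mentions than there are participants while still being raised early by only some of them.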

In the linear general descriptions, time and location were discussed or questioned at a general level.

This communication shows us the deaths from swine flu in the countries around the world, Afghanistan, Albania, Columbia, Bolivia. (P31)
The one thing that is not apparent immediately from the data is the time span. (P19)

Participants taking an interwoven approach to summarization engaged in more initial analysis, repeatedly seeking relationships and dependencies between the spreadsheet columns or expressing uncertainty about meaning or the quality of the dataset.

I don’t see any date or year, for purposes of comparison then it’s a bit problematic, I can for example only do comparison charts for those with an asterisk for Austria and Bulgaria, for example, because they all have the data from 2009 but for number of deaths recorded in that country, then this data is useful, infection rate per population. (P5)

We observed similar attributes within the general descriptions of participants’ own data, but participants also mentioned additional attributes, i.e. details of their own field of research, methodology and details of the particular study, data availability and access restrictions, and the existence of additional information or documents needed to describe and understand their data (Figure 4).

Figure 4. The order in which participants discussed certain attributes in their own data.

Most of the general descriptions of participants’ data followed a linear pattern (n=24). This could be because participants were not working to understand their own data, but were rather aiming to make their data understood. They were also to some extent better prepared for the requirements of the task, having already had experience working with and discussing their data. In interwoven summaries of their data (n=7), participants mixed descriptions of study methodologies with descriptions of headers, variables and data format; some relied on methodological descriptions to communicate the general topic of their data.

These are experiments from a 50 metre long indoor set up that we have, where we ran gas and oil through the pipeline, through a 60 metre long pipeline, and we measured the average values - so pressure drop and build-up. And we did that for different gas and liquid velocities, and they also changed the type of oil, so we did this with one oil with a quite low viscosity and one with oil with a quite high viscosity. (P14)

4.1.2. Actions in the spreadsheet

The actions captured in the screen recordings of the verbal summarization task for the given data support the attributes identified in Figures 3 and 4. Figure 5 shows the total number of actions observed, as well as the frequency of their order of occurrence. Scrolling right (n=24) was the most frequently observed action, followed by scrolling down (n=23). Participants also clicked on or indicated column headers and specific values. Clicking on both headers (n=18) and particular cells (n=17) occurred more often than other forms of indicating these areas of the spreadsheet, i.e. hovering over or circling them.

Figure 5. Total number of actions and order of actions observed in screen recordings of the given dataset. Size of circle represents number of participants engaging in activity. Figure is arranged according to which activity most frequently occurred first. Color represents purpose of action.

Analysing the order of these actions shows that the majority of participants began by determining the length, breadth and general topic of the given dataset. Nine participants first scrolled down, while eight clicked on headers and seven initially scrolled to the far right of the spreadsheet. Once participants established the general shape of the data, more analysis-related actions were observed, most noticeably examining specific cell values by clicking or indicating and moving back and forth between different columns. One example of left and right scrolling was switching between the different types of geospatial columns which were not located in close proximity to one another.

Some participants prepared for analysis by reformatting the spreadsheet (n=10), i.e. by adjusting the column width or freezing columns or rows. (For three recordings, the width of one of the columns was too narrow to allow reading one of the header names.) Analysis features of the spreadsheet were only used in four instances, for actions such as sorting, filtering, or performing calculations. This reflects the nature of the think-aloud task and the time limitations of our study.

We began examining screen recordings for participants’ data after the general description task. These screen recordings provide a different type of insight, revealing actions that participants took to ensure that the interviewer adequately understood their data. The actions that we observed ranged in complexity. Participants with spreadsheet data often clicked on each column header again, as they provided more detail about each column. Others demonstrated how they would analyse the data, showing unique functions of their analysis software or creating sample spreadsheets and plots.

4.2. Engaging with the content

Participants engaged with data content in more depth as they worked to explain and understand them. This stage of deeper interrogation sometimes began during the scanning phase; it occurred both when interacting with the given dataset and with the participants’ own data. Examples of similarities and differences in the content and process of interacting with both data types are presented in Table 2.

Encodings

Their data:
(P23): Because of the way it’s set up, while it may appear on screen in words, it’s actually all in zero, one, two, three, four up to nine. But you can present it yourself in words, and that’s really helpful if you’re scoring, because you can actually click from one cell to the next, and down the base it will actually tell you what that character is in that cell and whether you just code a zero or a one.
(P9): Age band and then I grouped the age bands into adult and older people, and that was one of the issues of each of our journeys, a different way of categorising people, so I ended up with a very broad age range really.

Given data:
(P28): I don’t know if it could be interesting to be banded in categories like, I don’t know, continents… it depends.
(P19): It looks like they started to code for if there are no deaths, then it’s coded as a zero, but there are some instances where there are missing data.

Acronyms and abbreviations

Their data:
(P22): That is a classic abbreviation in the field of hepatic surgery. AFP is alpha feto protein. It is a marker. It’s very well known by everybody…the AFP score is a criterion for liver transplantation.
(P28): So if there is strange code that people cannot understand, I make a legend. Normally the colleagues I’m working with, we use our terms, so I tend to use the most user terms like LD; like SEM is typical for…everyone in my field.

Given data:
(P20): If I would make this assumption, I would say this is like geographical location of the countries, but I have no idea what is ‘Long’ and ‘Lat’. In my work, I have never encountered these kind of acronyms, so it’s currently hard for me to assume what would this mean in the context of swine flu.
(P7): I’m not sure what ‘long’ means. I wonder if it’s not something to do with longevity. On the other hand, no, it’s got negative numbers. I can’t make sense of this.

Identifying “strange things”

Their data:
(P7): Let’s say from previous experiments and or runs, you know that repeating the experiment, you would get within an error of say 5% or 2%, whatever the case maybe. So obviously these three [indicating error bars] are huge, and it would mean that you will have to repeat. So either something is wrong in your system, or you get something wrong during the sample preparation, or the system’s not stable, or something else is going on. Or that you’re just not planning enough repetitions to get to the true value, so I think it is an important measure to determine if you’ve got reliable data.

Given data:
(P20): If I would not go into those cases, like with these discrepancies, I would just assume that this column indicates only the optimised data about whether they’re aware or they’re not, that [deaths are] due to swine flu in these countries.
(P14): That is simply a column saying if there are any deaths at all or not for a certain country related to the swine flu. I see there is a formula here, just simply checking if Column L is larger than zero. So exactly using this information…so then that means there is something wrong with the formula or I completely misunderstood what Column L is.

Table 2. Exemplary quotes illustrating participants’ interactions with both their own and the given data.

In this phase, participants identified patterns and trends (e.g. via simple analyses or discovering relationships between columns) and discussed encodings expressed within the data. They also explored uncertainties attached to the data and the data’s overall integrity. We point out two particular instances observed in this level of engagement with the data: understanding strange things and collaborative sensemaking.

4.2.1. Pattern identification, encodings and analysis tools

When discussing their data, participants demonstrated how they seek patterns and relationships by creating plots, switching between layers on geospatial images, and developing scales and formulas. Participants also expressed a desire to create plots to visualize our data to identify trends and sought anchor variables as they investigated individual columns and described sample rows. Participants further drew attention to columns with limited value ranges in their descriptions, e.g. columns with binary variables or those with only a few categorical variables; fewer participants analyzed the range of values in columns with continuous variables.

Participants ”encode” their own data in ways that help them more easily identify trends and generate findings by, e.g., converting categorical variables to numerical values and vice versa. These encodings are often determined by the specifications of the analysis tools and software which participants use, such as SPSS, R, or domain-specific programs; these programs also influence how participants structure their data, at times increasing the data’s machine readability.

I use this data to create variables in SPSS. The one I’m looking at now has still got all the labels as words; I thought it would be easier to look at as a spreadsheet. There’s another process I went through to translate the words into numbers. For SPSS, you really need numbers in the value labels. That was a whole process, to go through of coding the written, the categories, but just adapting those into numbers that I use. (P9)

[we are] working in R and our supervisor wrote a package which can easily work on it, but the main aspect is that you have to have grouping variables and independent variables which are the sensor signs. Then you have to separate the data to these different types, so the grouping variables and the independent variables because the PC and the IDA in the R can work in this structure. (P21)
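
The translation between word labels and numeric codes that participants describe can be sketched as follows. This is a minimal illustration under stated assumptions, not a reconstruction of any participant's workflow: the codebook `AGE_BAND_CODES`, its labels and its code values are hypothetical, chosen only to echo the kind of grouping mentioned in the interviews.

```python
# Hypothetical codebook: labels and codes are illustrative, not taken from the study.
AGE_BAND_CODES = {"adult": 0, "older": 1}


def encode(labels, codebook):
    """Translate word labels into numeric codes; unknown labels raise KeyError."""
    return [codebook[label] for label in labels]


def decode(codes, codebook):
    """Translate numeric codes back into word labels using the same codebook."""
    reverse = {code: label for label, code in codebook.items()}
    return [reverse[code] for code in codes]


labels = ["adult", "older", "adult"]
codes = encode(labels, AGE_BAND_CODES)
print(codes)                          # [0, 1, 0]
print(decode(codes, AGE_BAND_CODES))  # ['adult', 'older', 'adult']
```

Keeping the mapping in one explicit structure means that the same object can be shared alongside the data, so that a later reader can recover the words behind the numbers.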

Other forms of encoding included developing broad categories or groupings to describe and analyze data, such as differentiating between raw data and derived data or numerical and non-numerical data. Participants also created groups of certain columns according to their semantic meaning; demographic variables were mentioned together, as were descriptive attributes for the same instance, e.g. ”columns with sources” or ”socioeconomic measures”. These types of encodings were observed when participants discussed both their own and the given data. When working with our data, participants also searched for how null values were encoded and represented (see Table 2).

While the majority of participants reported using spreadsheets or Microsoft Excel at some point in their data workflows, very few participants actually made use of the built-in analysis tools in our spreadsheet at any time during the interview. This could be due to time limitations during the interviews or to the fact that participants were not familiar with the Google Sheets environment which we used to present our data. It could also be a result of the fact that some participants do not use spreadsheets to analyze data directly, but rather reported using them for other purposes, such as recording and organizing data or cleaning and preparing data for analysis. Spreadsheets are also used by participants to specifically enable sharing data in a way that is easily accessible or compatible with a variety of analysis programs, facilitating data reuse.

4.2.2. Expressing uncertainty, seeking quality and understanding strange things

Both when discussing their own data as well as when engaging with our data, participants expressed concerns about potential misinterpretations, focusing on questions that could arise due to misunderstandings about how data were cleaned and processed. For both quantitative and qualitative data, participants viewed the encodings and categories that they had constructed as major risk points for correctly interpreting their data. The encodings that facilitated their own use of the data (as a data producer) may not be helpful or be explained well enough to enable appropriate data reuse by potential consumers of their data.

Although we’ve tried really hard, because we’ve put in a coding frame and how we manipulate all the data, I’m sure that there are things in there which we haven’t recorded in terms of, well, what exactly does this mean? I hope we’ve covered it all but I’m sure we haven’t. (P10)

They also questioned and critiqued the meaning of our data, highlighting the lack of contextual information about how the dataset was created and the use of unexplained abbreviations in the dataset. When discussing their own data, however, participants often referred to unexplained acronyms or abbreviations common in their own disciplinary domains (see Table 2: Acronyms and abbreviations).

Participants combined their interpretations about the meaning of our dataset along with analyses of its completeness and how missing values are reported to make quality determinations. They also used missing values as checkpoints to identify relationships between columns and to identify potential errors or anomalies in the data.

So the data is fairly complete with really limited missing values, so the quality of data looks good. (P29)

It’s got some blanks, which I presume means no data has been given. Although that’s interesting…there’s some missing data which shouldn’t be missing. Because Armenia, for number seven say, it reports three deaths and yet the swine flu deaths is blank, so that’s a bit of an anomaly, and there are quite a few blanks actually. (P9)
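
The kind of cross-column check that P9 performs by eye, noticing a blank in one column where a related column reports a value, can be sketched as below. The function name, column names and sample rows are hypothetical and do not reproduce the dataset used in the study.

```python
def find_blank_mismatches(rows, filled_col, blank_col):
    """Return rows where `filled_col` holds a positive number but `blank_col` is empty."""
    flagged = []
    for row in rows:
        filled = row.get(filled_col)
        blank = row.get(blank_col)
        has_value = filled not in (None, "") and float(filled) > 0
        if has_value and blank in (None, ""):
            flagged.append(row)
    return flagged


# Made-up rows for illustration only.
rows = [
    {"country": "A", "deaths": "3", "has_deaths": ""},
    {"country": "B", "deaths": "0", "has_deaths": "0"},
    {"country": "C", "deaths": "", "has_deaths": ""},
]
flagged = find_blank_mismatches(rows, "deaths", "has_deaths")
print([r["country"] for r in flagged])  # ['A']
```

A check like this only surfaces candidates for closer inspection; whether a flagged blank is an error or a meaningful absence still depends on how the data were created.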

Participants looked for other unexpected values (e.g. outliers) or inconsistencies in formatting or standard ways of reporting to assess the precision and accuracy of the given dataset. Wrestling with these strange things often served as the entry point to a deeper engagement and understanding of the data, allowing participants to question their assumptions and initial understandings (see Table 2).

Now that sounds quite high for the Falklands. I wouldn’t have thought the population was all that great…and yet it’s only one confirmed case. Okay [laughs]. So yes…one might need to actually examine that a little bit more carefully, because the population of the Falklands doesn’t reach a million, so therefore you end up with this huge number of deaths per million population [laughs], but only one case and one death. (P23)
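
The arithmetic behind P23's observation can be made explicit: a rate per million inhabitants is the raw count divided by the population and multiplied by one million, so a single event in a very small population yields a large rate. The sketch below uses hypothetical population figures, not values from the dataset.

```python
def per_million(count, population):
    """Scale a raw count to a rate per one million inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 1_000_000


# Hypothetical populations, chosen only to show the effect of scale.
small = per_million(1, 3_000)        # one death in a population of 3,000
large = per_million(1, 60_000_000)   # one death in a population of 60 million

print(round(small, 1))  # 333.3
print(round(large, 4))  # 0.0167
```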

Some of them have decimals, like a lot of decimals, and some don’t have any decimals. So I don’t know whether that means that those are supposed to be measured more precisely…or that there is an inconsistency of using the amount of decimals per cell. (P1)
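
The inconsistency that P1 describes could be surfaced by tallying how many decimal places each cell uses. The following is a minimal sketch under the assumption that cells are available as text exactly as stored, since converting to floating-point numbers would discard trailing zeros; it does not handle scientific notation or locale-specific decimal separators. The sample cells are made up.

```python
from collections import Counter


def decimal_places(cell):
    """Count digits after the decimal point in a cell's text representation."""
    text = cell.strip()
    if "." not in text:
        return 0
    return len(text.split(".", 1)[1])


def precision_profile(cells):
    """Tally how many non-empty cells use each number of decimal places."""
    return Counter(decimal_places(c) for c in cells if c.strip())


cells = ["12", "3.5", "0.12345", "7", ""]
profile = precision_profile(cells)
print(sorted(profile.items()))  # [(0, 2), (1, 1), (5, 1)]
```

A profile with several distinct precisions does not say which reading is correct; it only makes the question P1 raises visible for the whole column at once.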

Encountering the unexpected in their own data is a critical and normal part of participants’ research processes (see Table 2). While anomalies can be indicative of possible mistakes or points for improvement in the study design, they can also reflect unexpected external changes to the study environment, e.g. people withdrawing from a study, or new technologies that have been adopted over the course of long-running studies. Participants repeatedly emphasized the need to communicate information about these changes or potential sources of error to possible data reusers.

4.2.3. Sensemaking through ”collaborations”

Working with team members is key to making decisions about study design and analysis, i.e. deciding which data are important to record and analyse, how to develop scales, clarifying study details and making sense of mistakes or unexpected values in the data.

I know roughly what it consists of, but I didn’t know precisely, and I had to go back to the person who generated it and say ”What does column D mean? And where is the location of the thermocouple whose temperature is measured in column E?”. (P8)

We have a table with…almost 30 columns with variables that were collected, including the names of the people who went into the field and collected each of the samples. So we are keeping track of who’s responsible for each of the samples, then if we find any error, any mistake, then we can contact those people. (P24)

During the interview, participants also collaborated with the interviewer to ensure that the interviewer correctly understood their data. Often, important details crucial to understanding the data emerged only when both the interviewer and the participant could see and interact with the data together. We saw this, e.g., in the case of learning about the importance of temporal information in coral reef imaging data or highlighting a key variable (inflammation) in a study about bipolar individuals. For some, it was nearly impossible to explain their data without being able to indicate specific areas of an image or demonstrate how error analysis was conducted.

4.3. Placing

Figure 6. Placing data in contexts

As they engaged more deeply with data, participants placed data into existing contexts, practices and knowledges; this process of ”placing” occurred at different scales (Figure 6). Data were placed within their immediate contexts of creation, e.g., when participants detailed study designs, experimental setups or the conditions surrounding data collection, including broader temporal or geographic details.

And it describes, or rather it comprises the results of a laboratory experiment lasting about an hour in which the experimenter,[..], is inducing the crystallisation of a salt in a porous rock sample. And as the crystallisation proceeds, two things happen. One is that the heat is evolved and so the temperature changes and the rock sample slightly increases in temperature, and we measure that temperature at three different positions. And in addition there is a very slight expansion of the rock which we detect from the output of a very sensitive mechanical gauge. And then these four measurements, the three temperatures and the mechanical strain are measured at intervals of one second over a period of a few hours and so the dataset consists of the set of numbers. (P8)

These types of contextual details have the potential to impact the meaning which the values themselves carry.

Error bars depend a lot on the experimental conditions and on the condition of the material. So, for example, if it was used on powder samples then the error bars would be bigger than the ones that were obtained on single crystal data. (P6)

At a broader level, participants conceptually placed data within the norms of their disciplinary domains, referencing discipline-specific methodologies and limitations, ways of analysing and verifying data or common data formats. They also recognised that broader social contexts can influence the sensemaking process.

…let’s say it’s our country, not everyone has Internet access. So I wouldn’t necessarily be able to access these sources; I couldn’t go on the Internet and look up numbers and stuff like that. But maybe it’s possible, and I think that’s possible for most, is to make a phone call to some kind of authority and see if they can either help me directly with information or refer me to some source where I can go look. (P7)

Finally, participants attempted to place data within the world, gauging how representative data are of a particular phenomenon. These judgements reflect assumptions about how much the data reflect reality, as data themselves are usually samples which are hardly ever complete, unbiased or without conflict or ambiguity.

It’s a pretty large sample size, again, 1,260. We have equal numbers of males and females. We have three ethnicities: Caucasian, African American and Hispanic, equal numbers of each of those. So it’s a well-balanced data set and, because of that, if you were to be interested in how these different cultural values vary or not based on ethnicity, it would be an excellent dataset. (P19)

One simple example we observed in our dataset was a contention about the representativeness of the countries, which showed a range of interpretations and was expressed in a variety of ways. Participants questioned both the representativeness of the list of countries and also whether the data themselves were representative for the entirety of each country.

P2: It’s listing the countries for which data are available, not sure if this is truly all countries we know of…

P8: It includes essentially every country in the world

P29: global data

P30: I would like to know whether it’s complete…it says 212 rows representing countries, whether I have data from all countries or only from 25% or something because then it’s not really representative.

P7: If it was the whole country that was affected or not, affecting the northern part, the western, eastern, southern parts

P24: Was it sampled and then estimated for the whole country? Or is it the exact number of deaths that were got from hospitals and health agencies, for example? So is it a census or is it an estimate?

During placing activities, participants commonly reported the need to know the original purpose for which the data was created. Descriptions of their own data’s original purpose were often complex, as they were intermingled with descriptions of the field of research. Participants floundered in their attempts to place our data, in part because the original study objective was unknown.

Unlike with ours, I didn’t see anything about the objective. I don’t know what you were looking for. All I see is data and basically I have to try to make sense of it. (P21)

Although important across all dimensions of sensemaking, disciplinary and data expertise were key to placing data. Most participants felt that it was easier to describe their data than to summarise and try to understand our data.

My data are much more easier, for sure, because I knew what I was talking about. I didn’t have to go through, to understand, which was the quality of the data; I didn’t have to understand what it was, the kind of information that this data was giving to me. If I have to go to a database that I’ve never seen normally and also that is not in my field, it is absolutely much more difficult. (P28)

Some participants believed that only experts from within their same discipline could reuse their data meaningfully, citing the specificity of their data or the need to analyze the data with specific programs. Others stated that appropriate reuse would require a deep understanding of evolving domain research practices; many had difficulty imagining alternative uses for their data by individuals outside of their area of research.

Yes, I don’t think you can use it for anything else really, it’s a very specialised and specific field that we work in. (P14)

I don’t think it would be used for a radically different purpose, but I could imagine somebody taking the data and reanalysing it in relation to a different model of the underlying process, for example. Or confirming the interpretation that we’ve placed upon the data using our own model…But they would be people who’d be very close to the topic. (P8)

A few participants believed that the use of common data structures, terminologies and methodologies within their domains made it possible for their data to ”speak for themselves” to others with similar expertise.

I probably wouldn’t have to describe it [the data]. Probably they would just get it. (P1)

These are just numbers in columns. So I would have to explain maybe more how I obtained this data, but the meaning is the same….For the general public it would be much more difficult to explain the details. (P26)

4.3.1. Structures needed for sensemaking and reuse

We observed procedural reasons why data do not speak for themselves, but require additional structures to convey meaning. For example, participants did not always include column names in their data, in order to make them more machine readable, or they divided datasets into various sub-sheets to ease processing. We also observed that additional information structures, i.e. documents and codebooks, are needed to support reuse for data consumers regardless of domain, as well as to support future (re)use by the original data producer.

Ten years makes a big difference in my memory, too. So, even at the time when I was working in it, I didn’t have to refer to the code book, I knew it all by heart. I would have to go back and look at the code book now myself, and that’s why it’s important to keep the notes on what you’re doing with the variables and keep a copy of the survey that was used, the research instrument, those sorts of things. (P11)
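
One way to picture a code book of the kind P11 describes is as a small structured record per variable, which can then be compared against the column headers of a dataset. The sketch below is an assumption made for illustration: the class `VariableEntry`, its fields and the helper `undocumented_columns` are hypothetical and were not described by participants.

```python
from dataclasses import dataclass, field


@dataclass
class VariableEntry:
    """One code book entry; all fields are illustrative assumptions."""
    name: str
    description: str
    unit: str = ""
    codes: dict = field(default_factory=dict)  # e.g. {0: "no", 1: "yes"}
    missing_marker: str = ""                   # how missing values are written


def undocumented_columns(headers, codebook):
    """Return the headers that have no matching entry in the code book."""
    documented = {entry.name for entry in codebook}
    return [h for h in headers if h not in documented]


codebook = [
    VariableEntry("country", "Name of the reporting country"),
    VariableEntry("deaths", "Number of reported deaths", unit="count"),
]
headers = ["country", "lat", "long", "deaths"]
print(undocumented_columns(headers, codebook))  # ['lat', 'long']
```

Listing the headers that lack an entry points to the abbreviations that a reader outside the field is most likely to stumble over.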

Participants described a large variety of documentation and knowledge transfer practices surrounding datasets (Table 3). These practices and the formats used to provide additional information are shaped by journal restrictions, metadata schemas and repository requirements, and by the perceived usability of the information structures themselves. Sometimes this additional information is separate from the data; other times it is embedded within the data, i.e. in the case of annotations or descriptions of codes within a spreadsheet. Different data consumers may require different information structures for the same data.

If they’re using a different program, I can direct them to a character set, which you can get from this matrix, but the publication of that character set is quite separate but available online. (P23)

So if you start with the README here, then we can take several directions. So, you can delve into the features, what they mean, and you can delve into the feature documentation. You can delve into ways to query it, and do that for yourself, and then you go to all kinds of programming documentation. And then, here I also pointed to tutorials, [..]. And you can read some papers about it and they’re also cited…We also have a Slack community with 120 people, and if they have really hard questions, we invite them to Slack, and they are being answered by either me or people who know more about it. (P3)

Supplementary files (corresponding spreadsheets, text documents, README files)
Resource description document (including, e.g., explanations of columns)
Documentation of the code
Emails & set with questions and answers
Figures, visualisations
Code book / sheet; can contain personal data
Presentations / slides
Technical reports
Audio folder
Slack channel
Annotations & interpretations (also on various levels of the data, e.g. on image layers)
Questionnaires / surveys (variables often created in order of the questions)
Table 3. Information structures supporting sensemaking

The study also revealed attributes which should be present in information structures to avoid losing meaning and to enable data reuse. We present these attributes in Tables 4 and 5 according to two perspectives which emerged in the interviews: the data consumer’s distance to the data (Table 4) and the methodological approach of the original study (Table 5). We define “distance to the data” in terms of a data consumer’s familiarity and expertise with particular data. Someone “close to the data” will have more knowledge of the data and how they were created; someone more distant from the data will not have this knowledge. In Table 5 we focus on two broad approaches to data collection: quantitative and qualitative methodologies.

Close to the data:

  • More granular information about conditions, assumptions, errors, trends, possible questions the data can answer, variable types, analysis / programming details, sample creation details, study objective

  • Benefits / problems of data

  • Previous work that this data builds on, relation to standards in discipline, out-of-discipline abbreviation

  • Less granular information about field of research, common abbreviations, data format / structure

Far from the data:

  • More granular information about research explanation, explanation of all abbreviations / acronyms, how ratios / errors / columns derived

  • Less technical language

  • Research explanation

  • Tailor to field of interest of the data consumer

  • More general data presentation

No difference in distance to the data:

  • Supplemental materials

  • Study objective and expected outcome

  • Data collection details

  • Sample details

  • Potential use of data

  • Calculation of ratios / standard deviation

  • Usage restrictions, confidentiality concerns

  • Explanations of codes, categories, scores

Table 4. PERSPECTIVE: Information needs related to distance to the data. Someone “close to the data” will have more knowledge of the data and how they were created. Someone “far from the data” will not have this knowledge.
Quantitative:

  • Everything about the experimental setup (time period, instrument settings, location, etc.); where the setup differs from real world settings

  • Who did which work

  • Are measurements individual measurements or multiple measurements of the same thing that were aggregated

  • Which section of the object was measured, on how much material a measurement was made

  • Standard error, precision of measurements

  • Factors influencing uncertainty (seasonal differences, external events, etc.)

  • Standard units of measurement in a field

  • Instruments - specifications, reliability, how calibrated, how instruments work and how they create the data output, software format used to capture / analyze data

  • Number of repetitions of experiment

  • Expected outcomes of method

Qualitative:

  • Everything about study set up (e.g. time period, total number of interviews, age of participants, online or in person)

  • Who did the research / the analysis

  • Survey questions and coding strategy (e.g. predefined answers or free text)

  • If participants were required to answer particular questions

  • How sample was chosen, created, scope and characteristics of sample

  • How categories were chosen, how scores were created, variables of focus

  • Social context / setting of study

Table 5. PERSPECTIVE: Methodological narrative – not exclusive to either approach

We asked participants if they would describe their data differently to a colleague with similar expertise, i.e. someone close to the data. Rather than needing less information about the data due to prior knowledge, many participants believed that individuals with similar expertise need more granular information about data creation conditions, prior work which the data builds on, and the potential uses of individual variables. Some participants said that they would not describe their data differently to someone close to the data, emphasising instead common attributes that would be important, regardless of a data consumer’s distance to the data (Table 4).

So if I’m talking to somebody who is data agnostic or who has not worked in a data science field, my description would be limited to the basic variables, the fields that are of interest to the person…If I’m talking to a data science person or a data scientist who’s going to use the data, my description would be more granular. My description would be more helping the person understand the benefits as well as the problems associated with the data. (P29)

I would maybe shorten up some things and focus on some others. For instance, I would expect that everyone I’m working with expects to code BMI in kilograms and to have birth weight in grams because it’s a standard unit for those things in Danish health research…I would tell them more about the study design, because often people I work with are epidemiologists. So there one of the main things would be, where do these 2,000 women come from? Is it data from Denmark or from somewhere else? Is this from last year or from 30 years ago? Things like that, so more complex information so that they can decide if it’s relevant for their interests. (P25)

Different methodological approaches also elicited particular details, although these details were not mutually exclusive of each other. For quantitative data, participants reported needing extensive information about an experimental setup, including how experimental designs differed from the real world environment.

Well I would perhaps mention the size of the pipe diameter. That is something that they’re often interested in, because in real pipelines, the pipe diameter is perhaps 12 inch and more, quite large, while in typical labs, you don’t have this possibility (P14).

Key findings from the qualitative perspective include the choice of categories, questions of representativeness and details of the study set-up that influence the data, such as whether participants are required to answer a survey question. Social context also influences how study participants communicate, e.g. in the case of interview participants in conflict areas who may not feel safe enough to respond truthfully to questions.

5. Discussion

We bring together different perspectives in this study, drawing together participants’ descriptions of familiar and unfamiliar data and our observations of how participants engaged with these data. We now synthesise our findings, identifying different patterns of activities and their related data attributes involved in data-centric sensemaking. The sensemaking efforts which we observed can be synthesized into a core set of clusters of activity: inspecting the data, engaging with the data content more deeply and placing data within broader contexts (Figure 7). In this synthesis, we define:

  • Activity patterns as the actions, both physical and cognitive, which people undertake when making sense of data

  • Data attributes as characteristics of the data which people interact with as they perform a set of activities

  • Clusters as the patterns of activities, with their related attributes, which tend to occur together

We also examine the relation of these three clusters of sensemaking activities to information structures needed for reuse and discuss three emergent themes in the context of this synthesis.

Figure 7. Patterns of activities and attributes in data-centric sensemaking

C1 contains activities and attributes that provide participants with a broad overview of the data, such as understanding the data’s general topic, title, structure and format. In the given data, we observed that most participants scanned the spreadsheet first vertically to look at the number of rows and to get an idea of missing values and then horizontally to look at the headers.
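
The overview that participants built up by scanning vertically and horizontally (number of rows, missing values, headers) can be approximated programmatically. The following is a minimal sketch using Python's standard `csv` module; the function name `overview` and the sample table are hypothetical, and the sketch assumes a comma-separated file with a single header row.

```python
import csv
import io


def overview(csv_text):
    """Summarise a CSV table: row count, header names, blank cells per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    headers = reader.fieldnames or []
    missing = {
        h: sum(1 for r in rows if not (r[h] or "").strip())
        for h in headers
    }
    return {"n_rows": len(rows), "headers": headers, "missing": missing}


# Made-up table for illustration only.
sample = "country,cases,deaths\nA,10,1\nB,5,\nC,,\n"
summary = overview(sample)
print(summary["n_rows"])   # 3
print(summary["headers"])  # ['country', 'cases', 'deaths']
print(summary["missing"])  # {'country': 0, 'cases': 1, 'deaths': 2}
```

Such a summary covers only the attributes in C1; it says nothing about what the headers mean or why values are missing, which are the questions taken up in C2 and C3.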

C2 represents a deeper engagement with the content of the data, including activities such as establishing relationships between columns, performing simple analyses, picking out examples of particular values, conducting quality assessments and trying to understand the uncertainty attached to the data, by questioning, e.g. the meaning of missing values or abbreviations and acronyms.

In C3, we observed participants placing data in relation to the world and different contexts. They worked to understand how the data were related to study designs, to disciplinary norms as well as to temporal and geographic considerations to understand the representativeness of the content. They questioned, e.g., the level of detail (granularity) presented in the data as well as the data’s original purpose.

Our findings show that level one (C1) of Figure 7 was mostly done alone; levels two and three (C2, C3) were often carried out in collaboration. When participants described their own as well as our data, critical details emerged only after the initial description, when both the interviewer and the participant could see and interact with the data together. These conversations moved away from objective descriptions towards describing the complexity of qualitative judgements behind the (quantitative) variables, as well as to rich descriptions of factors influencing the origination of data.

When discussing their data, participants made use of the information structures identified in Table 3 and their related qualities (Tables 4 and 5) across all dimensions as they worked to make their data understood. Many also referenced the lack of contextual information (e.g. purpose, collection methods) in the given data as being a stumbling block to understanding.

The importance of needing additional contextual information to support data reuse has been previously noted (see, e.g. (borgman2015big)). Frameworks describing contextual information for digital collections have been proposed (DBLP:journals/ijdc/BakerY09; Chin:2004:CSC:1031607.1031677; DBLP:journals/jd/Lee11), but do not make the connection to sensemaking that we draw here. Work in data preservation highlights the contextual information needed about data in different disciplines, e.g. (DBLP:conf/jcdl/FanielKKBY13; DBLP:conf/asist/FanielKY12), and, more recently, across disciplines (faniel2019context; DBLP:journals/corr/abs-1909-00464); translating these findings into interaction guidance and subsequently into tools supporting reuse remains difficult.

Part of this difficulty is due to the dynamic and context-specific nature of working with data (DBLP:conf/chi/KrossG19; DBLP:conf/chi/MullerLWPTLDE19). The in-depth descriptions of study set-ups, purposes of data collection and domain-specific knowledge brought by our participants underscore this problem. As a way to address this challenge, Figure 7 can be viewed in the context of work using design patterns in areas such as software engineering (gamma1995design), user interface design (granlund2001pattern), or ontology design (gangemi2009ontology).

This approach identifies high-level patterns as a way to provide repeatable solutions to recurring design problems. This creates possibilities for a level of formalisation that enables the development of flexible designs and tools. Our results are in line with (Boukhelifa:2017:DWC:3025453.3025738; DBLP:conf/chi/KoestenKTS19; DBLP:journals/computer/MarchioniniHZE05), who see flexibility as being key to supporting real-world data workflows. Figure 7 therefore represents a patterns-based approach to conceptualising the processes involved in the initial stages of data-centric sensemaking.

To further clarify Figure 7 and to illustrate how our findings could spur design efforts, we discuss three specific themes that emerged in our research at the level of each identified cluster, before presenting concrete design recommendations in the Conclusion.

5.1. Understanding shape

When inspecting a dataset for the first time, participants either discussed the data in a linear fashion, addressing each attribute individually before moving to the next, or they took a more interwoven approach, mixing descriptions of dataset attributes with analyses and questions, which also overlaps with activities in the second cluster of Figure 7.

As they engaged in these activities, participants aimed to arrive at an overview, to create a high-level representation of the entire dataset in their head while engaging with it (see also (DBLP:conf/chi/KoestenKTS17)). We observed different levels of focus in this process. Participants alternated between zooming out to describe the data at the level of the entire spreadsheet, e.g. the number of observations or format of the data, and zooming in to look at specific cell values or individual parts of the data. Participants adopting a more interwoven approach tended to engage in the process of zooming in and out more often than those using a linear approach.

This desire to understand the data as a whole has parallels in the information science literature, where the need to understand an entire information collection at a high level has been mentioned (rieh2016towards). Discussing the visual aspects of sensemaking, DBLP:conf/hvei/Russell03 also mentions the need to understand what is in a whole collection. white2009exploratory recommend allowing users to filter, sort and explore different views of the data on demand for complex search tasks.

In our study, the information is distributed among the cells of the dataset, the structure and organization of the data, as well as any related information structures.

5.2. ”Strange things” as an entrypoint, not an obstacle

Participants repeatedly encountered and dealt with ”strange things” in both data sources, i.e. outliers, errors, missing data, and inconsistencies in formatting. As they wrestled with the unexpected in the data, they engaged in the patterns identified in Cluster 2, such as expressing uncertainty, seeking relationships or performing analyses.

Whereas (zhang2019cognitive) describe dealing with conflicts as a barrier to sensemaking, our findings suggest that conflict is a useful and accelerating moment in the exploration of data. The concept that real data is usually messy and complex was internalised by our participants. Participants were neither surprised nor alienated by conflicting data; in contrast, errors and uncertainties were expected and participants applied different analytical strategies to overcome them, a finding also in line with recent literature (DBLP:conf/chi/KoestenKTS17; Boukhelifa:2017:DWC:3025453.3025738).

Participants repeatedly emphasized the need to communicate information about sources of error and possible uncertainties to potential data consumers, although they used a variety of communication methods to do so, some of which are detailed in Table 3. Preferred methods for communicating information about strange things in the data were sometimes chosen in an arbitrary or convenience-based manner. Some of this information was embedded within the data themselves, leading to potential problems in machine readability. Other information was not linked to the data in a sustainable way, making it unsuitable for long-term preservation of meaning.

5.3. Perspectives in placing

Participants placed data and their representativeness in a range of broader contexts (the world, disciplinary norms, methodological contexts of creation). While we present these placing activities separately in Cluster 3, they can in fact be closely related. We saw this particularly in how participants placed data in terms of a study’s methods and their own disciplinary expertise.

Details about data creation are often implicit within a domain’s epistemic norms (leonelli2016data). Even with the best documentation, this complicates cross-disciplinary data reuse. A data consumer from another domain may not have the experience necessary to understand or evaluate the appropriateness of a particular methodological approach. Additionally, our participants’ concept of the details needed for reuse encompassed much more than just a step-by-step process of how a study was conducted. Rather, for both quantitative and qualitative data (see Table 5), participants needed details about the entire narrative surrounding data creation, i.e. why a certain method was chosen or the unique, local aspects about a study’s set-up and their attached constraints.

We also found that the granularity of these narratives is related to a potential data consumer’s expertise or distance to the data, with experts needing more detailed information about study descriptions. Table 4 also shows common attributes, aside from methodology, that are important in facilitating understanding, independent of a data consumer’s expertise with data, i.e. needing information about study objectives, usage restrictions, and explanations of categories and acronyms.

6. Study Limitations

Although they were working in a wide range of disciplinary domains and research-related roles, our sample population consisted of a particular type of professional: researchers who have published an article indexed in the Scopus database. Scopus comes with a skew towards certain research fields; for instance, the Arts & Humanities are not as well-represented. Scopus has an extensive review process for the journals which it selects for inclusion (elsevier2017scopus), and there are roughly an equivalent number of journals from the broadly defined fields of social sciences, life sciences, health sciences and physical sciences (elsevier2017scopus). While the limitations of Scopus could lead to a potential bias in our sample, the selection criteria we applied also ensured that the sample population met our study requirement of speaking with different types of researchers with data experience.

As the study was conducted with researchers, our findings may not be directly applicable to other individuals. Focusing on researchers met the goals of our study, particularly our aim of examining sensemaking in light of data sharing and reuse. However, we believe the general sensemaking patterns emerging from this study are to some extent transferable between different skill sets; simply the execution of how these goals are achieved might look different for people with a higher or lower level of data literacy. Nonetheless, the study would need to be repeated with different populations in order to apply our findings more broadly.

Our participants work in a variety of countries; English was not the native language for all participants. To account for this, we carefully selected our sample from those responding to our recruitment messages to ensure that participants had a high degree of English fluency. While we see the global spread and disciplinary diversity of our sample as a strength of the study, we also recognize that data, research, and sensemaking practices are influenced by social, legal, and economic contexts unique to both country and disciplinary domains.

The sensemaking patterns which we identify could also be limited by the data themselves. Different data may have surfaced different data attributes. By including participants’ data, as well as the given dataset in the study, we attempted to balance this potential bias. A final potential limitation is that describing their data first might have primed participants for performing the verbal summarization task with the given data, influencing the way that they performed the second task. Given that any data description will be based on a participant’s prior experience, we believe this is a natural side effect of these types of studies.

7. Conclusions and Recommendations

In this study, we investigated common patterns in sensemaking activities in initial encounters with data, particularly in light of potential data reuse. We identified three clusters of activities involved in initial data-centric sensemaking (inspecting, engaging with content, and placing in context), and detailed the observed activities and data attributes relevant in these clusters. This approach provides an avenue to bring focus to design efforts, narrowing down the myriad of technologically feasible solutions to those supporting the sensemaking needs of users / data consumers. We conclude by presenting concrete recommendations for how the patterns we identified can be applied to design efforts. This illustrates the large space for future work necessary to apply insights about data-centric sensemaking and reuse to the workflows of researchers and practitioners in order to better support user-centered data practices.

7.1. Recommendations for C1: Understanding shape

Understanding the shape of a dataset, a theme drawn from C1, can be supported through interface design and functionalities in a number of ways. Our results show that data needs to be understood as a whole, on the level of the entire dataset. This suggests summarization methods of a textual, visual or statistical nature that provide a zoomed-out view of the data (e.g. DBLP:journals/corr/abs-1805-03677). At the same time, participants also engaged with subsets of the data, particularly individual columns; these patterns could be supported through zooming in via column level summaries, including interactive plots and visualizations at the column level (e.g. this idea is partially realised by Kaggle in their dataset previews). Future research could look at different ways of expressing a column-based notion of provenance, such as where the data in a column comes from, how it was created or from where it was derived.

Similarly, certain types of information structures attached to the dataset facilitate particular sensemaking patterns over others. A README file with a summary of the dataset’s size and format may provide the information necessary for a zoomed out inspection of the data; an interactive map of the area where a specimen was collected may be more suitable to a zoomed in approach, as well as enabling the activities described in Cluster 2.
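As a minimal illustration of how the zoomed-out and zoomed-in views recommended for C1 might be combined, the following Python sketch computes a dataset-level overview alongside per-column summaries. It is a sketch under stated assumptions, not a tool used in our study: the function names, the list of missing-value markers and the example rows are all hypothetical.

```python
from collections import Counter

# Assumption: these markers denote missing values; real datasets vary.
MISSING = {"", "NA", "N/A", "null", None}


def is_number(value):
    """Return True if the value can be read as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def summarise_column(values):
    """Zoomed-in view: a small summary for a single column."""
    present = [v for v in values if v not in MISSING]
    numeric = [float(v) for v in present if is_number(v)]
    all_numeric = bool(present) and len(numeric) == len(present)
    summary = {
        "n_missing": len(values) - len(present),
        "n_distinct": len(set(present)),
        "kind": "numeric" if all_numeric else "text",
    }
    if all_numeric:
        summary["min"], summary["max"] = min(numeric), max(numeric)
    else:
        summary["most_common"] = Counter(present).most_common(3)
    return summary


def summarise_dataset(rows):
    """Zoomed-out view (size) plus one summary per column.

    `rows` is a list of dicts, e.g. as produced by csv.DictReader.
    """
    columns = list(rows[0].keys()) if rows else []
    return {
        "n_rows": len(rows),
        "n_columns": len(columns),
        "columns": {c: summarise_column([r.get(c) for r in rows]) for c in columns},
    }


# Hypothetical example data.
rows = [
    {"site": "A", "temp_c": "12.5"},
    {"site": "B", "temp_c": "NA"},
    {"site": "A", "temp_c": "14.0"},
]
overview = summarise_dataset(rows)
```

An interface could present `n_rows` and `n_columns` first and let a data consumer expand individual entries of `columns` on demand, mirroring the zooming in and out that participants engaged in.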

7.2. Recommendations for C2: ”Strange things” as an entrypoint

Our findings suggest that errors can be seen as an entry point to sensemaking (see C2), as flags to investigate further. This provides an interesting direction to explore for sensemaking functionalities in tools. Rather than flattening out data by making it cleaner, tools could instead flag and highlight strange things to make users more aware of their presence. Column summaries, as mentioned in C1, could include explanations of abbreviations and missing values, metrics or links to other information structures necessary for understanding the column’s content. Datasets should link basic concepts (used in the data or in the documentation) to explanatory resources, as is common practice in code documentation or on ”the web” (Wikipedia), to provide context. Documentation about the narrative surrounding these strange things should also be more standardised and linked directly to these flags in a sustainable way.
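One possible reading of flagging rather than cleaning is sketched below in Python. The function leaves the column untouched and returns a list of flags that an interface could highlight and link to documentation. The missing-value markers, the majority rule for deciding that a column is numeric, and the use of the common 1.5 × IQR rule for outliers are assumptions made for illustration; they are not derived from our data.

```python
import statistics

# Assumption: these markers denote missing values; real datasets vary.
MISSING = {"", "NA", "N/A", "null", None}


def to_float(value):
    """Parse a value as float, returning None if that is not possible."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def flag_column(values):
    """Return sorted (row_index, reason) flags; the data are not modified."""
    flags = []
    parsed = {}
    for i, v in enumerate(values):
        if v in MISSING:
            flags.append((i, "missing"))
        else:
            parsed[i] = to_float(v)

    numeric = {i: x for i, x in parsed.items() if x is not None}

    # Assumption: if most present values are numeric, the column is meant
    # to be numeric and the remaining entries are formatting inconsistencies.
    if len(numeric) * 2 > len(parsed):
        flags += [(i, "inconsistent format") for i, x in parsed.items() if x is None]

        # Outliers via the 1.5 * IQR rule, one common heuristic among many.
        if len(numeric) >= 4:
            q1, _, q3 = statistics.quantiles(list(numeric.values()), n=4)
            spread = 1.5 * (q3 - q1)
            low, high = q1 - spread, q3 + spread
            flags += [(i, "outlier") for i, x in numeric.items() if x < low or x > high]

    return sorted(flags)


# Hypothetical example column.
values = ["10", "11", "12", "11", "10", "12", "95", "NA", "n/a?", "11"]
flags = flag_column(values)
```

Each flag keeps the row index, so a note from the data producer explaining, say, why a value is extreme could be attached to the flag itself rather than embedded in the data.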

Other sensemaking patterns identified in C2 can be supported by customised interactive visualisations. Displaying the entire data, as described in C1, but highlighting relationships between columns or entities could allow users to more easily pick up relationships between columns. Tools could also display trends and patterns extracted from the dataset and allow users to select those attributes of the data that are of interest. Following this idea, data producers could identify anchor variables, those which they consider most important in their dataset; this could further aid sensemaking activities by focusing summarisation efforts.
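To make the idea of anchor variables more concrete, the following Python sketch ranks the numeric columns of a dataset by the strength of their linear relationship with a producer-declared anchor column. Pearson correlation is used here only as one simple, assumed measure of a relationship; the function names and example columns are hypothetical, and a real tool would likely offer richer measures and visualisations.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Returns None when the correlation is undefined (fewer than two
    observations, or one of the columns is constant).
    """
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def relate_to_anchor(columns, anchor):
    """Rank the other columns by |correlation| with the anchor column."""
    results = []
    for name, values in columns.items():
        if name == anchor:
            continue
        r = pearson(columns[anchor], values)
        if r is not None:
            results.append((name, r))
    return sorted(results, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical example: the producer declares "depth_m" as the anchor variable.
columns = {
    "depth_m": [1, 2, 3, 4],
    "temp_c": [20, 18, 16, 14],
    "ph": [7.0, 7.2, 6.9, 7.1],
    "site_code": [5, 5, 5, 5],
}
ranked = relate_to_anchor(columns, "depth_m")
```

The ranked list could determine which column pairs an interface highlights first, focusing summarisation on the relationships the data producer considers most important.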

7.3. Recommendations for C3: Perspectives in placing

Our findings highlight the need for flexible designs to support placing activities across the three identified levels of placing: the world, disciplinary norms and the study set-up (C3). Rather than designing for a specific type of user, tools should be designed to embrace different levels of expertise, allowing a potential data consumer to drill down to the desired level of detail. Semantic technologies could also be used to link to standardized definitions of disciplinary acronyms or terms. Geographic information could be linked to a map or country registry to allow judgements of representativeness; a similar approach could be taken for certain disciplinary standards and study set-ups, such as standard experiment conditions, expected result ranges or commonly used confidence levels.

Our findings across all dimensions also surface the collaborative nature of data-centric sensemaking and the omnipresent role of information structures throughout the sensemaking and reuse process. It has been suggested that the production and consumption of academic writing can be conceptualized as a form of dialogue (e.g. (doi:10.1177/1474022211398106)); the broader practice of reusing data can itself be seen as a form of collaboration or conversation between data producers and consumers. The data producer must communicate the many (often collaborative) decisions which influence the creation of a dataset (DBLP:conf/chi/KoestenKTS19; DBLP:journals/tvcg/MahyarT14) to potential data reusers. A combination of focused documentation practices integrating different media types, together with interaction flows tailored to the sensemaking practices of both data producers and consumers, could further facilitate the conversation implicit in reusing data.

Sensemaking allows individuals to create rational accounts of the world which enable action (maitlis2005social); data-centric sensemaking enables the enactment of data reuse. Understanding sensemaking needs and exploring designs to support those needs, as we do here, therefore plays a vital role in realizing the potential of reusing data.