1 A Data-Centric View of Missingness
A data-centric view categorizes missingness in three data-related groups: data composition, data relationship and data usage. In this section, we first introduce our notion of data composition and data relationship and then discuss the data-centric view of missingness.
1.1 Data Composition and Data Relationship
The composition of data can include three major components [chen1988entity, codd1970relational]: entity, attribute and value. An entity is a data item, which is a basic unit encoding a piece of information. An attribute is a specification that describes an entity, and an entity can have multiple attributes. A value reveals how an entity performs on an attribute. Thus, a dataset can be considered as a collection of entities, described by one or multiple attributes with specified values. For example, in an image dataset, each picture is a data entity. Attributes that describe each picture include width and height, with values such as 1080px and 960px. These three components have also been used for database management, specifically as a relational model [codd1970relational], which organizes data in tables. Each row is an entity, each column corresponds to an attribute, and each cell includes a value.
The relationship between data entities can be described using the concept of an entity set. An entity set refers to a set of unique data entities that share the same attribute(s) (e.g., a collection of photos or a list of locations). For two entity sets $A$ and $B$, a relationship between them, $R$, is a subset of their Cartesian product $A \times B$. When $R$ is not empty, we say $A$ is related to $B$. Otherwise, we say that $A$ is independent from $B$. The relationship can be determined by using data values of selected data attributes. Moreover, for different usage scenarios, the relationship can be determined differently. For example, in cyber security, $R$ may be defined as communication between computers and web URLs; while in bioinformatics, $R$ may be determined based on expressed genes under conditions.
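The definition of a relationship as a subset of a Cartesian product can be made concrete with a minimal sketch. The entity names, the communication log, and the helper functions `build_relationship` and `is_related` are hypothetical and serve only as an illustration of the cyber security example.

```python
from itertools import product

# Hypothetical entity sets: computers and web URLs (illustrative names).
computers = {"pc-1", "pc-2"}
urls = {"example.org", "example.net"}

# Hypothetical communication log: (computer, url) pairs that were observed.
observed_traffic = [("pc-1", "example.org"), ("pc-1", "example.net")]


def build_relationship(set_a, set_b, pairs):
    """Keep only the pairs that lie in the Cartesian product of the two sets."""
    cartesian = set(product(set_a, set_b))
    return {pair for pair in pairs if pair in cartesian}


def is_related(relationship):
    """Two entity sets are related when their relationship is non-empty."""
    return len(relationship) > 0


relationship = build_relationship(computers, urls, observed_traffic)
print(is_related(relationship))

# Pairs of the Cartesian product that are absent from the relationship.
absent_pairs = set(product(computers, urls)) - relationship
print(sorted(absent_pairs))
```

Under these assumptions, the two entity sets are related because at least one pair exists, while every pair involving `pc-2` is absent from the relationship.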
1.2 Missingness in Data Composition
Based on the data composition discussed in Section 1.1, there are three possible types of missingness, including: 1) missing data entities, 2) missing data attributes, and 3) missing data values. Figure 1 gives an example of them.
Missing data entities highlights the absence of data entities, which is also called missing observations [santos2019generating]. It often comes from errors in the process of data collection [allison2001missing]. For example, in fitness tracking devices, some sensor data might not be successfully recorded due to network connection failures.
Missing data attributes reveals the incompleteness of data attributes. This may come from a careless design of the data collection mechanism [pigott2001review]. For example, when designing a logging system of a cloud application, some user behaviors could be overlooked [sun2016designing]; when creating a survey, researchers might fail to include all the relevant questions. Due to such careless designs, even if a data collection process runs successfully, certain data attributes can still be missing.
Missing data values (often called missing data) are the loss of data values. Compared to the other two types, this one has drawn the most attention and been heavily studied [song2018s, fernstad2019definitionofmissingness, fernstad2014visualanalyticsofmissingdata]. Specifically, when considering the distribution of data values, missing data values can be further categorized into the following three groups [rubin1976]:
Missing completely at random (MCAR): the missingness is assumed to have no underlying mechanism and therefore exhibits no relationship with either the observed data or the missing data.
Missing at random (MAR): the missingness is assumed to depend on observed values, but not on the missing values themselves.
Missing not at random (MNAR): the missingness depends on the missing values themselves, so it cannot be explained by the observed data alone.
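The difference between the three groups can be illustrated by simulating how values are dropped. The following is a minimal sketch under the common reading of this taxonomy, in which the groups differ in what the drop probability depends on. The records, the cutoffs, and the function names are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical records: age is always observed, income may go missing.
records = [
    {"age": rng.randint(20, 70), "income": rng.randint(20, 120)}
    for _ in range(1000)
]


def mcar_mask(rows, p=0.2):
    """MCAR: every income is dropped with the same probability p."""
    return [rng.random() < p for _ in rows]


def mar_mask(rows, p=0.5, age_cutoff=50):
    """MAR: the drop probability depends only on the observed age."""
    return [row["age"] > age_cutoff and rng.random() < p for row in rows]


def mnar_mask(rows, p=0.5, income_cutoff=90):
    """MNAR: the drop probability depends on the income value itself."""
    return [row["income"] > income_cutoff and rng.random() < p for row in rows]


def apply_mask(rows, mask):
    """Return copies of the rows with masked incomes replaced by None."""
    return [dict(row, income=None) if m else dict(row) for row, m in zip(rows, mask)]


mcar = mcar_mask(records)
mar = mar_mask(records)
mnar = mnar_mask(records)
print(sum(mcar), sum(mar), sum(mnar))  # number of dropped incomes per mechanism
```

In the MNAR case, the information needed to explain the missingness (a high income) is exactly what is lost, which is why this case is the hardest to diagnose from the remaining data.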
Besides truly absent values, missing data values also involve a special case, called disguised missing data [pearson2006problem], in which the value is present but not accurate. For example, a user fills out a questionnaire and, for privacy reasons, leaves a default value, such as January 1st for a birthday. In this case, the data value is not actually absent but is a selected default value that may not reflect the truth. Thus, in the case of disguised missing data, even though data values are present, the true information that data collectors need remains missing and unclear.
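One simple way to surface candidates for such default values is to look for a single value that is heavily over-represented. The sketch below is a heuristic for illustration only, not the method of the cited work. The function name `suspicious_defaults`, the 20% threshold, and the birthday data are assumptions.

```python
from collections import Counter


def suspicious_defaults(values, min_share=0.2):
    """Return each value whose share of all entries is at least min_share."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total >= min_share}


# Hypothetical birthdays as (month, day); January 1st is over-represented.
birthdays = [(1, 1)] * 30 + [(m, d) for m in range(2, 9) for d in range(1, 11)]
print(suspicious_defaults(birthdays))
```

A flagged value is only a hint: a frequent value may also be genuine, so the result still needs to be checked against how the data was collected.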
1.3 Missingness in Data Relationship
Missing data relationships refers to the absence of relations among data entities. As is shown in Figure 2, from a graph perspective, it highlights the lack of links among nodes in a graph. This means that for a given set of data entities, some connections between data entities are not present. An absence of relationships among data entities may either result from errors in a data collection process or be a reflection of algorithmic results of data relationship discovery. For example, there is no link between two bank accounts due to the loss of an intelligence report; or the connection between two persons cannot be computationally identified based on word cooccurrence.
Missing data relationships can be formalized as the problem of missing links in graphs. Similar to how imputation techniques infer missing values, missing links can be computationally identified based on existing links [zhao2019missbin, zhao2020understanding]. A common goal of such techniques is to find potentially useful missing links (e.g., a link serving as a bridge that connects two communities in a social network), and then to verify and fix them by adding back the lost ones [zhao2020understanding].
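As an illustration of identifying missing links from existing ones, the sketch below scores unlinked node pairs with the common-neighbors heuristic. This is a generic link prediction baseline chosen for brevity, not the technique of the cited works, and the small graph is hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(edges):
    """Score every unlinked node pair by its number of shared neighbors."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v not in neighbors[u]:  # only consider pairs without a link
            shared = len(neighbors[u] & neighbors[v])
            if shared > 0:
                scores[(u, v)] = shared
    return scores


# Hypothetical social network: a and d share two neighbors but are not linked.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
candidates = common_neighbor_scores(edges)
print(sorted(candidates.items(), key=lambda item: -item[1]))
```

Pairs with higher scores are stronger candidates for a missing link, but they remain hypotheses that need to be verified before any link is added back.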
1.4 Missingness in Data Usage
The use of data in sensemaking activities involves two key types of selection: 1) data selection and 2) analytical method selection. The former refers to which parts of a given dataset will be selected for analysis. The latter means which analytical methods will be picked and applied to the selected data. Missingness in data usage can happen in both activities due to uncertainties and selection biases [heckman1990varieties, wall2019toward].
Missingness in data selection means that only part of a dataset is selected for analysis. For example, to find similar cars, 4 out of 100 attributes are selected and the rest remain unused; or instead of using all data entities, a dataset is sampled and then analyzed. When selected attributes or samples of data entities are not representative (e.g., stratified sampling [trost1986statistically]), missingness exists in data selections.
Missingness in analytical method selection means that only part of the applicable analysis methods is selected. For real-world problems, it is not easy, or sometimes even impossible, to identify a complete set of analytical methods; this missingness highlights that the performed analyses are not sufficient. For example, to explore similar cars, just one centroid-based clustering method (i.e., k-means clustering) is used, but other clustering techniques that might reveal additional insights are not applied.
2 A Human-Centric View of Missingness
A human-centric view categorizes missingness at three levels: 1) observed missingness, 2) inferred missingness and 3) ignored missingness. They reveal how the data-centric missingness discussed in Section 1 is perceived by people.
2.1 Observed Missingness
Observed missingness means that users can directly perceive missingness. It indicates that the visibility of missingness is high, and users can easily notice it. For example, a user quickly realizes that a data value is missing after she sees an empty cell in a table; or by checking and following ribbons in a parallel sets visualization [kosara2006parallel], a user finds that there is no connection between two categorical data entities [convertino2019method].
Because the visibility of missingness is affected by the way that data is represented, observed missingness relies on the visual context, in which data is encoded by certain visualizations. Different visual encodings can impact how easily users can observe missingness. For example, as is shown in Figure 4, it is easier for users to see missing links by looking at a matrix than by checking the same data displayed as lists of node pairs. Thus, for observed missingness, users can verify their perceived missingness by referring to the given visual context (e.g., pointing to an empty cell).
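The contrast between the two representations can be sketched in text form. The snippet below converts a hypothetical edge list into an adjacency matrix and renders absent links as dots, so that they occupy visible positions instead of simply not appearing in a list. The node names and the helper functions are assumptions.

```python
def adjacency_matrix(nodes, edges):
    """Build a symmetric 0/1 matrix; a 0 means no link was recorded."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return matrix


def render(nodes, matrix):
    """Render the matrix as text, drawing links as '#' and absent links as '.'."""
    header = "  " + " ".join(nodes)
    rows = [
        node + " " + " ".join("#" if cell else "." for cell in row)
        for node, row in zip(nodes, matrix)
    ]
    return "\n".join([header] + rows)


# Hypothetical graph given as a list of node pairs.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
matrix = adjacency_matrix(nodes, edges)
print(render(nodes, matrix))
```

In the rendered matrix, the pair a and d is shown as an explicit dot, whereas in the edge list it is simply not mentioned. Note that the diagonal is also drawn as dots in this sketch, since self-links are not recorded.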
2.2 Inferred Missingness
Inferred missingness refers to cases in which the visibility of missingness is low, or missingness is even invisible, so users cannot directly observe it. However, via an investigation with given data, users can infer the possible existence of missingness. For example, by reading the following four intelligence reports that are modified based on the Sign of the Crescent dataset [hughes2003discovery]:
“Report on 04/24/2003. Phone calls on 22 April, 2003 made from 703-659-2317 to the numbers: 804-759-6302 and 804-774-8920. A translation of this message reads: ‘I will be in my office on April 30 at 9:00AM. Try to be on time’.
Report on 01/11/2003. Abdul Ramazi is the owner of the Select Gourmet Foods shop in Springfield Mall, Springfield, VA., with a phone number 703-659-2317.
Report on 03/18/2003. A check with mobile phone providers shows that a Sprint cell phone 804-774-8920 is registered in the name Mukhtar Galab.
Report on 04/14/2003. The contact given by Faysal Goba was: 1631 Capitol Ave., Richmond VA; phone number: 804-759-6302. From an interrogation of a cooperative detainee in Guantanamo. Detainee says he trained daily with a man named Faysal Goba at an Al Qaeda explosives training facility in the Sudan in 1994.”
One may infer that the three persons, Abdul Ramazi, Mukhtar Galab and Faysal Goba, may be colluding in suspicious activities. However, this connection is missing from the given reports, as it was not explicitly reported. Compared to observed missingness, inferred missingness may not be easily verified. Thus, observed missingness is more confirmative, while inferred missingness is more hypothetical.
2.3 Ignored Missingness
Ignored missingness indicates neither observation nor awareness of missingness; that is, the presence of possible missingness is not considered. It may appear for two reasons. First, the visibility of missingness is too low to raise user awareness. For example, in Figure 4, a user may never realize that missing edges exist after looking at lists of edges. Second, due to some biases or the impact of cognitive capture (or tunneling) [simons1999gorillas], users turn a blind eye to possible missingness. For example, to explore possible treatments for a disease, all effort has been put into the group of people who have been infected by the disease, while the uninfected group never gets any attention.
While ignored missingness cannot be completely avoided in the analysis, it can, if identified, bring critical value to sensemaking activities [pirolli2005sensemaking]. Based on this, we consider ignored missingness to be similar to the concept of white space (sometimes also called opportunity space) discussed in the business domain [johnson2010seizing]. The white space suggests new leads for possible growth of a business. For example, the customers of a credit card product fall into two primary age groups, 25-35 and 50-70. The gap between 35 and 50 reveals a white space. It indicates that the current product does not seem attractive to the age group 35-50, given the lack of users of that age group. Thus, this white space implies an opportunity to design a different credit card product with new reward features to compete in the market of the missing age group. While a white space can bring useful value to a business, it is usually difficult to discover and may easily slip one’s attention [johnson2010seizing] (e.g., the missing age group catches no attention at all).
In summary, observed missingness takes the least amount of user effort to perceive possible missingness; while for ignored missingness, users are not aware of the existence of missingness in the whole sensemaking process. For inferred missingness, users can realize missingness but it takes more effort.
3 Handling Missingness: The Role of Visualization
Based on the data-centric and human-centric perspectives of missingness discussed above, in this section, we discuss four possible roles of visualizations for supporting missingness handling. The first and second roles highlight supporting the detection of data-centric missingness. The other two roles aim to improve user awareness of data-related missingness. In summary, visualizations can help uncover data-centric missingness and improve its expressiveness, so it becomes more visible and accessible to users.
3.1 Bridging Existing Data and Missing Data
Visualizations play a key role in bridging the gap between existing data and missing data. If we consider existing data as a visible land and missing data as an invisible world, a usable bridge connecting them is critical to enable users to explore and walk into the invisible part from the visible one. This is because users need existing data as a landing point before digging into the data-centric missingness. However, to enable the analytical transition from the existing data to the missing part, users need the support of necessary information hints or leads, which can be provided by visualizations [chi2001using].
To establish such a bridge, a commonly used strategy is space-filling, which reveals missingness as an empty space (e.g., an empty space in a bar chart [song2018s]), a gap (e.g., broken lines in a line chart [song2018s]), or a different-looking space (e.g., a matrix with differently colored cells [zhao2019missbin, fernstad2019definitionofmissingness]). The focus of such visualization techniques is on the existing data. As the present data is mapped to certain visual encodings, possible missingness becomes visible. By looking at the visually salient space, users can become aware of data-centric missingness. Thus, the visualized existing data serves as a critical visual context that enables users to identify data-related missingness.
3.2 Supporting the Analysis of Analytic Provenance
Visualizations can serve as a usable solution to understand and audit analytic provenance [ragan2015characterizing, li2020crowdtrace], which is helpful to address missingness in data usage. It is challenging for users to keep track of the process of their analyses. In a sensemaking process, some parts of data may not receive enough attention, and users may unintentionally miss one or several possibly applicable methods. To help avoid such data usage related missingness, visualizations can be used to support tracking analytic provenance and further analyzing it.
To help identify missingness in data usage, two key aspects need to be considered: 1) the selected, investigated, derived and newly generated data, and 2) the method or process applied to such data. They, respectively, correspond to the provenance of data and process [simmhan2006performance]. Visualizing them offers a way of analyzing analytic provenance. By checking such visualizations, users may notice missingness in data usage and further realize the limitation of their analyses.
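These two aspects can be sketched as a minimal provenance log that records, for each analysis step, the method applied and the attributes it used. The class `ProvenanceLog`, the car attributes, and the list of candidate methods are hypothetical. In particular, the candidate list is an assumption supplied by the analyst and is itself unlikely to be complete.

```python
class ProvenanceLog:
    """Record which attributes and methods an analysis has touched."""

    def __init__(self, all_attributes, candidate_methods):
        self.all_attributes = set(all_attributes)
        self.candidate_methods = set(candidate_methods)
        self.steps = []

    def record(self, method, attributes):
        """Log one analysis step: the method applied and the attributes used."""
        self.steps.append((method, tuple(attributes)))

    def unused_attributes(self):
        """Attributes of the dataset that no recorded step has used."""
        used = {a for _, attrs in self.steps for a in attrs}
        return self.all_attributes - used

    def unused_methods(self):
        """Candidate methods that no recorded step has applied."""
        used = {method for method, _ in self.steps}
        return self.candidate_methods - used


log = ProvenanceLog(
    all_attributes=["price", "mpg", "weight", "horsepower", "year"],
    candidate_methods=["k-means", "hierarchical", "DBSCAN"],
)
log.record("k-means", ["price", "mpg"])
print(sorted(log.unused_attributes()))  # ['horsepower', 'weight', 'year']
print(sorted(log.unused_methods()))     # ['DBSCAN', 'hierarchical']
```

The two reports correspond to the provenance of data and of process, respectively, and could serve as the input for a visualization of what an analysis has not yet covered.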
3.3 Improving Awareness: from Ignoring to Observing
From a perception-oriented perspective, a key role of visualization is to prevent users from falling into the trap of ignoring data-centric missingness. The presence of missingness can become more visible to users with visualizations than without them, so users are more likely to be aware of missingness. This implies that using visualizations can improve the expressiveness of data-centric missingness. The higher such expressiveness goes, the easier it is for users to observe possible missingness. Thus, using visualizations to handle data-centric missingness attempts to move users from ignoring missingness to being able to observe it.
Proper visual encodings can direct user attention to data-centric missingness (e.g., missing data values) [song2018s], which may otherwise be ignored by users. The data-centric missingness is often unknown to users at the initial analysis stage, unless they are informed. Thus, a sensemaking process with data-centric missingness is exploratory in nature, and the original analysis goal may not consider missingness at all. However, by referring to visualizations used in a sensemaking process, users may realize the existence of missingness, which could happen at an “aha” moment [mai2004aha]. This matches both the spontaneous insight [chang2009defining] of visual analytics and one of the key characteristics of visualization insight: being unexpected [north2006toward].
3.4 Scaffolding Missingness Inference
Visualizations offer a usable means to scaffold missingness inference. Different from the other two perceptual levels of missingness (i.e., observing and ignoring missingness), inferring missingness requires more user effort, as possible missingness is not directly revealed but can be inferred with enough cognitive effort. This can be supported by using visualizations. In this case, instead of merely encoding missing parts of data or existing parts for the purpose of indicating the “hole” in data, visualizations may focus on displaying either the connections across different parts of data or the provenance of a sensemaking process. These help users to infer the possible existence of data-centric missingness.
Since inference is a reasoning process, instead of a static stage, to scaffold missingness inference, multiple types of visualizations may be used and fusing information across them can be helpful. For example, by examining connections in a social network graph, checking related organizations, and reading relevant reports, users may infer that two suspects colluded on some threats, which has never been reported in a given dataset. Thus, visualizations can play an important role in scaffolding missingness inference, but this role places more complex demands on the design of visualizations.
4 Discussion and Conclusion
While handling missingness remains a challenging problem in sensemaking, as an initial exploration, we present a framework that helps to systematically view and categorize missingness in visual analytics. It highlights considering missingness from two key perspectives: data-centric and human-centric. The former regards missingness in three data-related categories: data composition, data relationship and data usage. The latter focuses on the human-perceived missingness at three levels: observed missingness, inferred missingness and ignored missingness.
Based on the framework, we discuss four possible roles of visualizations for helping to handle missingness in a sensemaking process. While this framework lays a preliminary theoretical foundation that aims to systematically consider missingness in visual analytics, to handle missingness in practice, there are three research themes that are worthy of future studies: 1) missingness detection, 2) missingness visualization and 3) missingness insight.
4.1 Detecting Missingness
Missingness detection lays the foundation for effective data analysis. Detecting missingness is not as simple as it looks. Unless there is a clear detection goal or some evidence that reveals something is missing (e.g., a data value), detecting missingness is fundamentally an attempt to address an unknown unknown problem [matta2018making]. This brings a deeper question: how can we help users know which types of data-centric missingness (e.g., missing data attributes, missing data relationships, or missing data selections in the usage) exist? This is essential as it sets the detection goal. If users are not clear about this, it is hard for them to further explore and work on detection methods. Also, a sensemaking process can have multiple types of data-centric missingness. For example, a vulnerable system with an interrupted network connection and an incautiously designed logging mechanism can lead to missing both data values and attributes. For such cases, detecting missingness is even more challenging. The framework presented in this work may help to clarify the detection goals.
4.2 Visualizing Missingness
The design of missingness-oriented visualizations remains an under-explored direction. Prior work has investigated visual encodings for missing values [song2018s, fernstad2019definitionofmissingness, fernstad2014visualanalyticsofmissingdata] and missing links [zhao2019missbin]. However, the design space of visualizing missingness can be broader, especially considering that there are different categories of data-centric missingness and they may need different visual encodings. As studied in [song2018s], even for the same type of missingness, different visual encodings can be designed, which further impacts user-perceived data quality. How to formalize the design space of missingness visualizations still needs further exploration. Furthermore, regarding the evaluation of missingness visualizations, can we measure the expressiveness of visual encodings for data-centric missingness and, if so, how? Such a measure would enable comparing different designs for visualizing missingness, which can support design decisions. The perceptual perspective discussed in our framework may help to derive usable measures.
4.3 Discovering Insights from Missingness
Studying possible insights that users gain from missingness in sensemaking is a highly sought-after research challenge. Missingness can be considered as a type of “data” [song2018s] from which users can gain useful insights. This turns missingness from being considered dirty [kim2003taxonomy] to being usable. For example, in an intelligence analysis, a missing link between two key suspects may drive the subsequent analysis towards an investigation of any possible connections between them. While this is a simple example, it shows that missingness can be used in a sensemaking process. The insights derived from missingness may depend on the application domain, and different types of data-centric missingness may bring different insights. Moreover, insights discovered from missingness via visual analytics, if possible, may enlarge the set of characteristics of visualization insight [north2006toward]. This may further broaden our understanding of evaluating visualizations by considering the value of missingness.
In summary, we present a framework that provides a systematic view of missingness in visual analytics. We hope this work can draw attention to future studies on visual sensemaking with missingness.