Slope-Dependent Rendering of Parallel Coordinates to Reduce Density Distortion and Ghost Clusters
Parallel coordinates are a popular technique to visualize multi-dimensional data. However, they face a significant problem influencing the perception and interpretation of patterns. The distance between two parallel lines differs based on their slope. Vertical lines are rendered longer and closer to each other than horizontal lines. This problem is inherent in the technique and has two main consequences: (1) clusters which have a steep slope between two axes are visually more prominent than horizontal clusters. (2) Noise and clutter can be perceived as clusters, as a few parallel vertical lines visually emerge as a ghost cluster. Our paper makes two contributions: First, we formalize the problem and show its impact. Second, we present a novel technique to reduce the effects by rendering the polylines of the parallel coordinates based on their slope: horizontal lines are rendered with the default width, lines with a steep slope with a thinner line. Our technique avoids density distortions of clusters, can be computed in linear time, and can be added on top of most parallel coordinate variations. To demonstrate the usefulness, we show examples and compare them to the classical rendering.
Plenty of approaches try to reduce clutter and highlight patterns in PCP generally. However, to the best of our knowledge, a formalization of the pattern distortion based on the polyline slope is missing, and none of the existing approaches specifically target this limitation.
The basic premise for the use of sampling (and filtering) techniques is that with less data, the degree of clutter and overplotting decreases, while the general structures, typically represented by many data records, remain in the PCP [DBLP:conf/eurographics/HeinrichW13]. The taxonomy by Ellis & Dix [ellis07] provides a categorization of clutter reduction methods, including sampling, filtering, and clustering, as well as visual techniques such as point size and opacity. Sampling often removes relevant data records or dimensions, so the sample represents the dataset in its entirety less faithfully. Our technique reduces clutter by counterbalancing the distortion artifact inherent to PCPs. It can be applied to a sampled or filtered subset of the data if the dataset exceeds the size visualizable in a PCP. Depending on the data, our technique increases the amount of data displayable in a given PCP by deemphasizing diagonal polyline segments.
Another approach to minimize clutter in PCPs is to reorder the dimension axes or reduce the number of displayed dimensions. For example, Pargnostics by Dasgupta and Kosara [dasgupta10] describes a set of quality metrics for PCPs which can be minimized or maximized (e.g., the number of line crossings and parallelism). The authors also suggest flipping axes to reduce the number of line crossings or diagonal clusters. The survey by Behrisch et al. [DBLP:journals/cgf/BehrischBKSEFSD18] discusses a large number of quality metrics as objective functions for axes reordering. Axes reordering, dimension reduction, and axes flipping can reduce ghost clusters by favoring horizontal structures. Depending on the data, however, ghost clusters cannot be avoided entirely this way. Axes reordering is highly dependent on the data and analysis task. It is an orthogonal concept to our approach and can be combined with it.
Clusters and other patterns can also be highlighted by density-distributed rendering. The general idea is to render PCPs as density distributions rather than individual polylines. Johansson et al. [johansson05] measure the density based on the number of overlapping polylines per pixel. This notion of density serves as input to a transfer function that allows highlighting areas according to their local density. Heinrich & Weiskopf [heinrich09] apply the concept of continuous scatterplots [bachthaler08] to PCPs to derive a density model and thus interpolate the data. The resulting rendering is specifically useful for cluster identification. The work by Palmas et al. [palmas14] provides a different approach, which bundles edges according to class membership. The resulting bundles are rendered as polygonal strips. Density- and cluster-based rendering may hide the underlying individual records and often require class labels to achieve a useful coloring or edge-bundling. While these approaches reduce clutter, they do not avoid the density distortion of clusters.
A common technique is to modify the polylines of PCPs, specifically the overall line width, opacity, color, and shape. One example is the edge-bundling approach by Heinrich et al. [heinrich11], which bundles polylines according to class membership and thus reshapes the line. The work by Zhou et al. [zhou09], called line splatting, is most closely related to ours. Line splatting iteratively adjusts the opacity of lines based on the local neighborhood. Users can interactively change the degree of polyline and segment splatting. In contrast to Zhou et al. [zhou09], our work tries to mitigate the visual distortions intrinsic to PCPs, such as the perceived density of clusters and the effect of ghost clusters.
We formalize the line geometry of parallel coordinates and describe their effects on density distortions and ghost clusters.
In standard PCPs, a polyline segment has a constant line width $w$, also called thickness or stroke width. As depicted in Fig. 1, the slope of a segment is defined by the angle $\alpha$ between the horizon and the segment. $\Delta x$ denotes the space between the dimension axes and $\Delta y$ indicates the difference of data values. In contrast to $w$, the line height $h$ is slope-dependent: $h = w / \cos\alpha$, with $\alpha = \arctan(\Delta y / \Delta x)$. The area of a segment is defined as $A = h \cdot \Delta x$ and the length is defined as $l = \Delta x / \cos\alpha$.
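The geometry above can be checked numerically. The following is a minimal sketch, assuming the symbols $w$, $\alpha$, $\Delta x$, $\Delta y$, $h$, $A$, and $l$ as defined in this paragraph; the function name `segment_geometry` and the sample values are illustrative only and not part of the original work.

```python
import math

def segment_geometry(dx, dy, w):
    """Return (alpha, height, length, area) for one straight PCP segment.

    dx: horizontal spacing between two adjacent axes (must be > 0)
    dy: difference of the two data values in screen units
    w:  stroke width measured perpendicular to the segment
    """
    alpha = math.atan2(abs(dy), dx)   # angle against the horizontal
    height = w / math.cos(alpha)      # vertical extent of the stroke
    length = dx / math.cos(alpha)     # same as hypot(dx, dy)
    area = height * dx                # parallelogram area, same as w * length
    return alpha, height, length, area

# Illustrative values: the steeper the segment, the larger h, l, and A.
for dy in (0, 100, 400):
    alpha, h, l, a = segment_geometry(dx=100, dy=dy, w=2)
    print(f"dy={dy:3d}  alpha={math.degrees(alpha):5.1f} deg  "
          f"h={h:5.2f}  l={l:6.1f}  A={a:6.1f}")
```

With a constant stroke width, the area grows in proportion to the segment length, which is the source of the extra visual weight of steep segments.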
In parallel coordinates, horizontal clusters correspond to a set of data points with a strong positive correlation in a subset of values across dimensions. Visually, these clusters have roughly horizontal cluster boundaries and only small line slopes. An example is depicted in dimensions 1–3 of the teaser figure (a). In contrast, diagonal clusters correspond to data points with similar values within, but a strong variation between dimensions. Visually, these clusters have steep cluster boundaries and high line slopes. The last three dimensions in the teaser figure (a) present examples. Horizontal and diagonal clusters are not defined in a precise way, and there is a smooth transition between them. Visual cluster density refers to the share of colored pixels within a line cluster. A large number of densely packed colored pixels induces a dense cluster and vice versa. The following effects characterize the emerging distortions influencing visual cluster density.
Increase of Line Length and Area. Line length $l$, line height $h$, and line surface area $A$ depend on the angle $\alpha$: all three grow in proportion to $1/\cos\alpha$ and rise sharply as $\alpha$ approaches $90^\circ$, as shown in the accompanying figure. This dependency affects the perception of clusters. Large line slopes imply larger surface areas (= more pixels, lower data-to-ink ratio [tufte86]) and therefore a more prominent line. The emphasis translates from lines to clusters, so that diagonal clusters are more noticeable than horizontal clusters. This effect is depicted in the top of Fig. 2.
Decrease of Line Distance. Large line slopes in diagonal clusters reduce the space between lines and increase the perceived density of the cluster, as lines may overlap and the background vanishes. The orthogonal distance $d_\perp$ between two parallel lines depends on the angle $\alpha$, with $d_\perp = d \cdot \cos\alpha$, where $d$ is the distance of the intersections of both lines with a dimension axis. This effect creates the perception that the lines are cohesive, as shown in Fig. 2.
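To make the line distance effect concrete, the sketch below computes how much background remains visible between two parallel strokes. It assumes that the orthogonal distance is the axis distance multiplied by the cosine of the slope angle, and that both strokes share the same width; the function name `visible_gap` and the sample values are hypothetical.

```python
import math

def visible_gap(d, alpha_deg, w):
    """Background gap left between two parallel strokes.

    d:         vertical distance of the two lines on a dimension axis
    alpha_deg: slope angle of both lines in degrees
    w:         stroke width measured perpendicular to the lines
    """
    orthogonal = d * math.cos(math.radians(alpha_deg))  # center-to-center distance
    return max(0.0, orthogonal - w)                     # 0.0 means the strokes touch or overlap

# Illustrative values: the gap shrinks with the angle until the strokes merge.
for alpha_deg in (0, 45, 60, 80):
    gap = visible_gap(d=6, alpha_deg=alpha_deg, w=2)
    print(f"alpha={alpha_deg:2d} deg  gap={gap:.2f}")
```

For these sample values, the two strokes leave no visible background at 80 degrees although they are as far apart on the axis as in the horizontal case.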
The Gestalt law of proximity [koffka2013principles, ware2012information] indicates that the density of lines translates to a perception of cohesiveness and thereby enables users to recognize clusters in PCPs. Classical PCPs put undue emphasis on diagonal clusters, which is facilitated by the increase of line lengths and decrease of line distances. This runs counter to the data-ink ratio principle coined by Tufte [tufte86], which describes the proportion of ink devoted to the actual data relative to the total amount of ink. Thus, it adds unnecessary distortion: diagonal clusters are emphasized more than horizontal clusters. Classical PCPs, therefore, induce a systematically inaccurate perception of clusters, when the observer would expect that the visualization is inherently neutral in this respect. We can see the effect in the teaser figure (a), where diagonal and horizontal clusters receive a significantly different emphasis.
The rendering effects caused by the different slopes of the polyline segments can also produce artificial patterns in parallel coordinates plots. Fig. 3 (a–c) shows three PCPs with uniformly distributed random data points, i.e., there is no structure in the data. One can easily see that a zig-zag pattern, alternating between high and low values, is visually present. The corresponding polylines seem to be parallel and close together, forming two clusters. With an increasing number of data points, the “clusters” are perceptually stronger. In Fig. 3 (d), we mark one apparent cluster and highlight its polylines across the different dimensions. One can see that the data is indeed randomly distributed and does not form a cluster across the dimensions. We define these visible, but non-existing patterns as ghost clusters. Ghost clusters are not only a problem of datasets with clutter or noise. They can also be present in structured datasets and influence the interpretation of the data.
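A rough way to see why unstructured data can still yield visually prominent steep structures is to compare how many segments are steep with how much ink they receive. The sketch below is only an illustration under stated assumptions: two adjacent axes, uniformly random values, a constant stroke width (so ink is proportional to segment length), and a hypothetical 60 degree threshold for "steep". It does not reproduce the figures of the paper.

```python
import math
import random

def ink_shares(n_records, axis_height, dx, steep_deg=60.0, seed=0):
    """Compare the share of steep segments with their share of ink.

    Draws uniformly random values on two adjacent axes and returns
    (count_share, ink_share) for segments steeper than steep_deg.
    With a constant stroke width, ink is proportional to segment length.
    """
    rng = random.Random(seed)
    total_ink = 0.0
    steep_ink = 0.0
    steep_count = 0
    for _ in range(n_records):
        dy = rng.uniform(0, axis_height) - rng.uniform(0, axis_height)
        length = math.hypot(dx, dy)
        total_ink += length
        if math.degrees(math.atan2(abs(dy), dx)) > steep_deg:
            steep_count += 1
            steep_ink += length
    return steep_count / n_records, steep_ink / total_ink

count_share, ink_share = ink_shares(n_records=2000, axis_height=400, dx=100)
print(f"steep segments: {count_share:.1%} of lines, {ink_share:.1%} of ink")
```

Because every steep segment is longer than every flat one, the ink share of the steep group always exceeds its share of the line count, even though the data contains no structure.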
[Figure panels: Regular | Adjusted | Over-Adjusted]
To overcome the distortion of cluster densities and potential ghost clusters, we propose to render the polyline segments based on their angle $\alpha$. The general idea is to render horizontal lines with the default width and diagonal lines with a thinner line. As a result, we increase the space between vertical lines and decrease the surface area, i.e., the number of pixels to draw a line. In the ideal case, all line segments should end up with the same area and the same distance between the segments. To achieve the same area for all line segments, the width of the polyline segments needs to be scaled based on their length $l$. As the line length is dependent on $\alpha$, the desired width $w(\alpha)$ also needs to depend on $\alpha$. We interpret all lines as parallelograms with an equal and constant area $A$ and thus equal and constant side length $h$, which is independent of $\alpha$ (Fig. 1). The height of this parallelogram corresponds to the desired $\alpha$-dependent width $w(\alpha)$, leading to $\cos\alpha = w(\alpha) / h$. This results in the angle-dependent line width
$$w(\alpha) = h \cdot \cos\alpha .$$
The angle-dependent width can be generalized, allowing us to weaken or strengthen the adjustment of the line width:
$$w_s(\alpha) = h \cdot \cos^{s}\alpha ,$$
where $s \geq 0$ determines the adjustment strength. Our approach applies to pixel- and vector-based rendering techniques.
$s = 0$ corresponds to classical PCP rendering, where all lines have the same width. $s = 1$ corresponds to rendering with equal line heights, resulting in the same surface area for all polylines. However, it does not fully correct the decreased line distances. Thus, we allow $s > 1$ as over-adjustment to further compensate overplotting of lines with strong slopes. In particular, the parameter $s$ can be freely adapted to the degree of clutter and the properties of the dataset. We want to highlight that our slope-dependent rendering can fully overcome the problem of different line surface area ($s = 1$), but the issue of varying distance between polylines can only be reduced with $s > 1$. Based on these geometric properties, we recommend $s = 1$ for truthful representation. However, many properties of a PCP and dataset influence the quality of the rendering (see Sec. 3.2), therefore an over-adjustment ($s > 1$) may be necessary. Our tests with various synthetic and real-world datasets showed that $s$ has a practical upper bound for most applications.
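The width adjustment can be sketched in a few lines. The code below assumes that the generalized width takes the form of the default width multiplied by $\cos^{s}\alpha$, with an adjustment strength $s$ where $s = 0$ gives the classical rendering and $s = 1$ gives equal areas. The function names and sample values are hypothetical and this is not the authors' implementation.

```python
import math

def slope_dependent_width(dx, dy, base_width, s=1.0):
    """Stroke width for one segment, assuming base_width * cos(alpha)**s."""
    cos_alpha = dx / math.hypot(dx, dy)  # cos(alpha) without computing the angle
    return base_width * cos_alpha ** s

def polyline_widths(ys, dx, base_width, s=1.0):
    """Widths for all segments of one record.

    ys holds the screen y-position of the record on each axis; the axes
    are assumed to be evenly spaced by dx. One pass over the segments,
    so the cost is linear in the number of segments.
    """
    return [slope_dependent_width(dx, y1 - y0, base_width, s)
            for y0, y1 in zip(ys, ys[1:])]

# Illustrative record crossing four axes that are 100 units apart.
ys = [50, 60, 360, 20]
for s in (0.0, 1.0, 2.0):
    widths = polyline_widths(ys, dx=100, base_width=2.0, s=s)
    print(s, [round(w, 2) for w in widths])
```

With $s = 1$, the product of width and segment length equals `base_width * dx` for every segment, which is the equal-area property described above. Larger values of $s$ thin steep segments further.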
In Fig. 4, we apply our technique to a synthetic dataset and uniform random noise. We achieve a balanced emphasis of horizontal and diagonal clusters for an adjustment strength of $s = 1$ and an over-emphasis of horizontal lines for $s > 1$. Ghost clusters are also reduced for $s = 1$ because their density is corrected. However, the effect of smaller line distance cannot be avoided, and ghost clusters are still visible. We can compensate for the line distance effect by over-adjusting the line area effect ($s > 1$), nearly eliminating the ghost clusters, but introducing an over-emphasis of horizontal lines.
The following parallel coordinates parameters influence the impact of ghost clusters and the distortion of cluster densities and should be taken into account when applying the slope-dependent rendering.
PCP Size, Axis Height and Spacing. The overall size of a PCP has a direct impact on the axis height and the spacing between the axes. Axis height and axis spacing $\Delta x$ determine the range of the angle $\alpha$: long axes and tight spacing, as caused by high dimensionality, increase the angles, which distorts cluster densities and increases the likelihood of ghost clusters.
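The dependency of the angle range on the plot layout can be illustrated as follows. This is a minimal sketch assuming evenly spaced axes and a steepest segment that runs from the bottom of one axis to the top of the next; the function names and the plot dimensions are hypothetical.

```python
import math

def axis_spacing(plot_width, n_dimensions):
    """Even spacing when n_dimensions axes share plot_width (needs n_dimensions >= 2)."""
    return plot_width / (n_dimensions - 1)

def max_segment_angle_deg(axis_height, spacing):
    """Steepest possible segment: bottom of one axis to the top of the next."""
    return math.degrees(math.atan2(axis_height, spacing))

# Illustrative 900 x 400 plot: more dimensions mean tighter spacing and steeper lines.
for n in (4, 8, 16):
    dx = axis_spacing(plot_width=900, n_dimensions=n)
    print(f"{n:2d} axes  spacing={dx:6.1f}  max angle={max_segment_angle_deg(400, dx):.1f} deg")
```

Adding dimensions to a plot of fixed size pushes the maximum angle toward 90 degrees, where the distortions described above are strongest.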
Default Line Width. Manipulating the constant line-height influences the detail and the clarity of the PCP. Thick lines increase the problem of overplotting, in particular for diagonal lines and clusters. Thin lines are more distinguishable and therefore produce more salient visualizations. The result of the slope-dependent rendering depends on the default line width, typically determined by the user. The default width directly influences the area covered by each line segment. It is advisable to consider a manual adaptation of the constant line-height before applying a slope-dependent rendering.
Data Volume. The number of data records influences the visual representation of a PCP and is strongly related to its size and the default line width. A high data volume visualized with a small PCP and/or a thick line width increases the problem of overplotting, but also the distortion of cluster densities and ghost clusters. For example, Fig. 3 shows how the dataset size increases the perception of ghost clusters. Therefore, these properties should be optimized for a given dataset before applying the slope-dependent rendering.
Line Color and Transparency. When no transparency is used, then the color of the polylines does not affect PCPs and therefore also not our approach. Transparency can be used to avoid clutter and overplotting but introduces another artifact, which negatively influences the perception of patterns. Crossing lines introduce a darker color, which may be interpreted as a cluster. Combined with the slope-dependent rendering, new ghost clusters may occur, while other patterns may vanish: Adjusting the transparency of lines based on their slopes, as opposed to the line width, is not useful.
To test the effectiveness of our slope-dependent rendering, we implemented a tool which is available on our website (see http://subspace.dbvis.de/pcp-adjustment for the tool and https://github.com/davidpomerenke/slope for code and data). Users can upload their data, or try out various synthetic and real-world datasets, comparing the results of classical and slope-based rendering. During our testing with the implementation, we found that our slope-dependent line adjustment technique performs well on various datasets, reduces ghost clusters, and counterbalances distortions. We also tested the impact of our approach with other patterns, such as positive and negative correlations (Fig. 4). While positive correlations are not affected even with a large value of the adjustment strength $s$, the slope-dependent rendering influences the diagonal lines of negative correlation. We found that negative correlations also remain visible. However, the lines representing data points at the ends of the dimension ranges are drawn with a small line width, making the visibility of this pattern susceptible to large values of $s$.
Our approach can be combined with other techniques, such as axes reordering and dimension reduction, as they do not manipulate the polylines of a PCP. It can also be combined with polyline modifications like edge-bundling. However, the line width should then be calculated relative to the line length rather than the slope. As described above, various PCP properties generally influence the visual distortion and ghost clusters in PCPs. To achieve optimal results, these parameters should be optimized before the slope-dependent rendering is applied, with a focus on reducing overplotting and the average angles of polylines.
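One possible reading of a length-based width for bundled or otherwise curved segments is sketched below. It assumes the equal-area idea carries over, so that the width is the default width multiplied by the ratio of axis spacing to arc length, raised to an adjustment strength `s`. For a straight segment this ratio equals the cosine of the slope angle, so the two definitions agree. The function names and the sampled curves are hypothetical.

```python
import math

def arc_length(points):
    """Length of a curve given as a list of (x, y) sample points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def length_based_width(points, base_width, s=1.0):
    """Width for a possibly curved segment between two axes.

    Assumes base_width * (dx / arc_length)**s, where dx is the horizontal
    distance between the first and the last sample point.
    """
    dx = abs(points[-1][0] - points[0][0])
    return base_width * (dx / arc_length(points)) ** s

straight = [(0, 0), (100, 100)]                  # plain 45 degree segment
bent = [(0, 0), (50, 0), (50, 100), (100, 100)]  # crude stand-in for a bundled curve
print(round(length_based_width(straight, 2.0), 3))
print(round(length_based_width(bent, 2.0), 3))
```

The bent path connects the same endpoints but is longer, so it receives a thinner stroke than the straight segment, keeping the drawn area comparable.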
A careful selection of the adjustment strength parameter $s$ is necessary. The usefulness of a particular $s$ depends on many general PCP properties, as well as data characteristics such as the number of data records and dimensions. Therefore, $s$ cannot be determined fully automatically based on a fixed parameter. However, we envision an algorithm which measures the density distribution, overlapping, and distortion and automatically selects an appropriate $s$ to achieve a reliable representation of the data. We want to address this algorithm as part of future work. Furthermore, we want to evaluate the usefulness of our approach, in particular in comparison to other methods, by conducting a quantitative user study.
We formalize two general problems of parallel coordinates: the density of clusters is often distorted, and non-existing ghost clusters emerge. As a solution, we propose a novel rendering technique for the polyline segments: the line width is adjusted according to the angle of each line segment. Our method can be computed in linear time, depends on a single parameter, and can be combined with many existing parallel coordinates variations.