Measuring and Explaining the Inter-Cluster Reliability of Multidimensional Projections

by   Hyeon Jeon, et al.
Seoul National University

We propose Steadiness and Cohesiveness, two novel metrics to measure the inter-cluster reliability of multidimensional projection (MDP), specifically how well the inter-cluster structures are preserved between the original high-dimensional space and the low-dimensional projection space. Measuring inter-cluster reliability is crucial as it directly affects how well inter-cluster tasks (e.g., identifying cluster relationships in the original space from a projected view) can be conducted; however, despite the importance of inter-cluster tasks, we found that previous metrics, such as Trustworthiness and Continuity, fail to measure inter-cluster reliability. Our metrics consider two aspects of the inter-cluster reliability: Steadiness measures the extent to which clusters in the projected space form clusters in the original space, and Cohesiveness measures the opposite. They extract random clusters with arbitrary shapes and positions in one space and evaluate how much the clusters are stretched or dispersed in the other space. Furthermore, our metrics can quantify pointwise distortions, allowing for the visualization of inter-cluster reliability in a projection, which we call a reliability map. Through quantitative experiments, we verify that our metrics precisely capture the distortions that harm inter-cluster reliability while previous metrics have difficulty capturing the distortions. A case study also demonstrates that our metrics and the reliability map 1) support users in selecting the proper projection techniques or hyperparameters and 2) prevent misinterpretation while performing inter-cluster tasks, thus allowing an adequate identification of inter-cluster structure.



1 Background and Related Work

1.1 MDP Distortions

Many MDP techniques, such as t-SNE [maaten2008visualizing], Isomap [tenenbaum2000global], and UMAP [mcinnes2018umap], have been proposed to understand and visualize high-dimensional data; however, every MDP produces distortions because information loss is inevitable when dimensionality is reduced. (Following previous research [nonato2018multidimensional, etemadpour2014perception, etemadpour2015user], this paper denotes both linear and nonlinear embeddings of multidimensional data as MDP.)

1.1.1 Distortion Types

In his seminal work [aupetit2007visualizing], Michaël Aupetit defined two types of MDP distortion: stretching and compression. Stretching occurs when pairwise distances in the projected space are expanded compared to the original pairwise distances, and compression does the opposite. Afterward, Missing Neighbors and False Neighbors [lespinats2007dd, lespinats2011checkviz] distortion types were introduced to interpret stretching and compression in the context of neighborhood preservation. Let $f: X \to Y$ be a smooth mapping where $X \subseteq \mathbb{R}^n$ and $Y \subseteq \mathbb{R}^m$ for some $m < n$. Each data point $p_i$ has its high-dimensional coordinate, $x_i \in X$, and the corresponding low-dimensional coordinate $y_i = f(x_i) \in Y$. For any point $p_i$, its neighbors in the projected and original spaces are denoted as $N_Y(p_i)$ and $N_X(p_i)$, respectively. Missing Neighbors are then defined as $N_X(p_i) \setminus N_Y(p_i)$. Similarly, False Neighbors are defined as $N_Y(p_i) \setminus N_X(p_i)$ (Figure 1a). However, our literature review (subsection 2.1) indicated that measuring Missing and False Neighbors distortion cannot reflect how well inter-cluster tasks can be performed, and thus cannot correctly evaluate inter-cluster reliability.

To alleviate the mismatch between the inter-cluster reliability and point-point distortions, Martins et al. [martins2014visual] defined distortion types relevant to the cluster-point relationship: Missing Members and False Members with regard to a group of data points. For a group of similar points $G$ (e.g., within the same category of a dataset or clustered by a clustering algorithm) in the original space, $G'$ is used to denote the “projected group” that corresponds to $G$. Here, False Members are the points in $G' \setminus G$, and Missing Members are the points in $G \setminus G'$ (Figure 1b). However, the literature review also revealed that this generalization was insufficient to reflect the degree to which users can perform inter-cluster tasks precisely. We further generalize the distortion types by proposing new inter-cluster distortion types that directly harm inter-cluster reliability.
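The neighbor-level definitions can be made concrete with a minimal sketch. It assumes Euclidean $k$-nearest neighbors as the neighborhood in both spaces; the function names `knn_sets` and `missing_and_false_neighbors` and the toy data are hypothetical and only serve as illustration.

```python
import numpy as np


def knn_sets(coords, k):
    """Index set of the k nearest neighbors (Euclidean) of every point."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    return [set(row.tolist()) for row in nearest]


def missing_and_false_neighbors(X, Y, k):
    """Per-point Missing Neighbors (lost in Y) and False Neighbors (gained in Y)."""
    n_x, n_y = knn_sets(X, k), knn_sets(Y, k)
    missing = [a - b for a, b in zip(n_x, n_y)]  # in original, not in projection
    false = [b - a for a, b in zip(n_x, n_y)]    # in projection, not in original
    return missing, false


# Two vertical pairs that are far apart, projected onto the vertical axis only:
# the projection collapses the horizontal gap between the pairs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
Y = X[:, 1:2]
missing, false = missing_and_false_neighbors(X, Y, k=1)
print(missing[0], false[0])
```

In this toy projection, point 0 loses its true neighbor (point 1) and gains point 2, which was far away in the original space.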

1.1.2 Distortion Metrics

According to a survey conducted by Nonato and Aupetit [nonato2018multidimensional], most distortion metrics aim to measure point-point distortion. Among them, a few metrics evaluate how much Missing and False Neighbors distortion has occurred. For instance, Trustworthiness and Continuity (T&C) [venna2006local] locally measure how Missing and False Neighbors distort the ranks of each point’s neighbors. Mean Relative Rank Errors (MRREs) [lee2007nonlinear] are similar to T&C; however, they consider not only the rank variance of the Missing and False Neighbors but also that of True Neighbors, the points that are judged as neighbors in both spaces. Local Continuity Meta-Criteria (LCMC) [chen2009local] is another variant of T&C; it only considers True Neighbors. Still, measuring point-point distortion cannot adequately capture inter-cluster reliability, as doing so requires quantifying the relationships between clusters.

Motta et al. [motta2015graph] proposed graph-based group validation, which is the only metric we could find that measures cluster-point distortion. The metric first extracts clusters from both the original and projected spaces using graph-based clustering. The metric then calculates each cluster’s structural persistence in the opposite space by measuring how much Missing and False Members distorted the cluster. Given that the metric examines each cluster independently, it is inappropriate for measuring inter-cluster reliability, which concerns multiple clusters at once.

Measuring the distortion of predefined clusters with a clustering quality metric has been widely used to evaluate MDP. For example, Joia et al. [joia2011local] and Fadel et al. [fadel2015loch] used the silhouette coefficient [rousseeuw1987silhouettes] to quantify cluster preservation in MDP. However, one limitation is that the inter-cluster structures of real-world datasets are usually unknown. Graph-based group validation also suffers from the same problem, as it performs clustering once for all data and uses the result as predefined clusters. By contrast, our metrics consider the complex inter-cluster structure by examining repeatedly extracted random clusters, and thus quantify inter-cluster reliability more accurately.
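For reference, the rank-based idea behind T&C can be sketched as follows. This is a minimal sketch of the standard Trustworthiness and Continuity formulas of Venna and Kaski, assuming Euclidean distances in both spaces and $k < n/2$; the helper names and toy data are hypothetical.

```python
import numpy as np


def rank_matrix(coords):
    """ranks[i, j] = rank of j among i's neighbors (1 = nearest, self = 0)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, -1.0)  # force each point to sort first in its own row
    order = np.argsort(dist, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(len(coords))[:, None]
    ranks[rows, order] = np.arange(len(coords))[None, :]
    return ranks


def trustworthiness(X, Y, k):
    """Penalize False Neighbors by how far away they rank in the original space."""
    n = len(X)
    r_x, r_y = rank_matrix(X), rank_matrix(Y)
    intruders = (r_y >= 1) & (r_y <= k) & (r_x > k)
    penalty = (r_x[intruders] - k).sum()
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty


def continuity(X, Y, k):
    """Penalize Missing Neighbors: the same formula with the two spaces swapped."""
    return trustworthiness(Y, X, k)


X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
Y = X[:, 1:2]  # projection that collapses the horizontal gap
print(trustworthiness(X, Y, k=1), continuity(X, Y, k=1))
```

Both scores depend only on each point's neighbor ranks, which is why they say little about how whole clusters relate to each other.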

1.1.3 Distortion Visualizations

To overcome the inherent limitation of metrics that describe only the overall distortions with one or two representative numerical values, complementary visualizations have been proposed [nonato2018multidimensional]. The visualizations aim to reveal the distortion information hidden behind those one or two summary values, thus helping users identify trustworthy areas of the projection or detect distortion patterns. Distortion visualizations commonly highlight regions with local point-point distortions by decomposing the projection area into cells, where each cell corresponds to a data point and encodes that point’s distortion. The decomposition is usually done using a heatmap [seifert2010stress], Voronoi diagram [aupetit2007visualizing, lespinats2011checkviz, heulot2012proxiviz], or 2D point-cloud [martins2014visual, martins2015explaining]. By contrast, MING [colange2019interpreting] explains False and Missing Neighbors by visualizing the shared amount of the nearest neighbor graphs constructed in the original and projected spaces. In this work, we quantified pointwise distortions by aggregating the inter-cluster distortion of the clusters and visualized them.

1.2 Inter-Cluster Reliability

As many MDP techniques intentionally focus on local neighbors, they have trouble reflecting the original high-dimensional space’s global inter-cluster structure. For example, Barnes-Hut t-SNE [van2014accelerating] and LargeVis [tang2016visualizing] concentrate on local structures by interpreting data based on $k$-Nearest Neighbor ($k$NN) graphs. Using a $k$NN graph with a small $k$ also allows them to reduce computation. However, as $k$NN graphs with small $k$ only maintain the relations between a point and its local neighbors, they can only reflect limited local structures [fu2019atsne]. Recently proposed MDP techniques have tried to preserve both the local and global inter-cluster structures. For example, Narayan et al. [narayan2020density] introduced den-SNE and densMAP, which modify t-SNE and UMAP, respectively, to better preserve clusters’ density. Another common strategy is to first construct a global skeletal layout using representative points (i.e., landmarks) and formulate local structure around each landmark [fu2019atsne, joia2011local, paulovich2010two, fadel2015loch, pezzotti2016hierarchical]. However, even for these approaches, completely retaining the original space’s inter-cluster structure during the projection is inherently impossible. Therefore, it is vital to measure the extent to which these techniques preserve inter-cluster reliability for a proper evaluation and analysis. Previous studies have attempted to explain the inter-cluster reliability of MDP through visualizations. For instance, the Compadre system [cutura2020comparing] enables an inter-cluster structure analysis based on matrix visualization, and ClustVis [metsalu2015clustvis] does so with a heatmap. Visual analytics systems [chatzimparmpas2020t, liu2019latent] with similar goals have also been proposed.
Unlike these previous works, which utilized separate visual idioms to show the distortion, we adopted a strategy of visualizing the distortion within the projection [lespinats2011checkviz, martins2014visual] to explain inter-cluster reliability. Therefore, users can directly identify where and how the inter-cluster distortions occurred in the projection.

2 Design Considerations for Steadiness and Cohesiveness

In this section, we first survey inter-cluster tasks, which are essential in data analysis using MDP [martins2014visual, sedlmair2013empirical], through a literature review. Based on the survey, we then establish the design considerations that our metrics (Steadiness and Cohesiveness) should satisfy to adequately measure how precisely inter-cluster tasks can be performed in MDP.

2.1 Inter-Cluster Tasks Analysis

To identify the importance of inter-cluster structure preservation and to elicit the design considerations for our metrics, we inspected previous papers that addressed tasks related to clusters. We first investigated 31 papers introduced in a systematic review conducted by Sacha et al. [sacha2016visual], which surveyed how analysts interact with MDP. We also investigated 155 articles citing Sacha et al. using Google Scholar to expand the search space. As a result, we identified 26 papers concerning inter-cluster tasks: tasks that investigate the inter-cluster structure of original data through its 2D projections. Following the task taxonomy for MDP proposed by Etemadpour et al. [etemadpour2014perception, etemadpour2015user], we classified the tasks into three categories in terms of inter-cluster distortions. We then organized them into individual tasks, as listed in the following:

  • T1: Identify separate clusters in the original space by exploring clusters in the projected space. Recognize the separation between clusters [choo2010ivisclassifier, endert2011observation, wang2017perception] or distinguish a cluster from the others [poco2011framework, nam2007clustersculptor].

  • T2: Seek the relationships between clusters of the original space based on those of the projected space. (1) Investigate the hierarchical or inclusion relation between clusters (i.e., check whether clusters can again be divided into smaller parts with higher density, which we call “subclusters”) [liu2014distortion, xia2017ldsscanner]. (2) Estimate the clusters’ similarities based on their distances in the projected space [nam2007clustersculptor, wenskovitch2020respect].

  • T3: Compare clusters in the original space based on their features in the projected space. Estimate and compare the clusters’ original sizes or densities based on their sizes or densities in the projected space [chatzimparmpas2020t, amabili2017visualizing].

The tasks were verified through semi-structured interviews with four machine learning (ML) engineers (E1-E4) with more than three years of experience. Three engineers confirmed that they perform the tasks in practice on real-world problems. Only E1 said that he does not perform the tasks, because he usually works with data with well-distributed vector representations processed by a deep neural network, where no inter-cluster structure exists. Previous surveys of high-dimensional data analysis tasks based on MDPs further confirm our task analysis results, as those works show similar results to ours despite using different methodologies. T1 is covered by Brehmer et al.’s task taxonomy based on interviews with 10 data analysts [brehmer2014visualizing], and T2 and T3 are covered by the taxonomy of cluster separations in MDPs discussed by Sedlmair et al. [sedlmair2012taxonomy].

Our survey indicated that point-point and cluster-point distortion metrics cannot correctly quantify how well inter-cluster tasks can be performed. Point-point distortion metrics focus on each point’s neighborhood instead of the inter-cluster structure. Therefore, the metrics can only measure the potential accuracy of relation-seeking tasks relevant to point-point relations, such as finding the $k$NN of a given point [etemadpour2015user]; they cannot measure the extent to which inter-cluster tasks can be performed accurately, as those tasks focus on the cluster level. Cluster-point distortion metrics can estimate the potential accuracy of T3, as the size and density of each cluster are related to the cluster itself. More precisely, if an MDP generates outliers for a cluster, the cluster’s size is reduced (if the density is maintained), or its density is reduced (if the size is maintained). Both distortions directly affect the comparison task. By contrast, cluster-point distortion metrics still fail for T1 and T2. As the metrics consider each cluster independently, they can only work for cluster identification tasks related to a single cluster (e.g., distinguishing the outliers of the cluster [etemadpour2015user]) or relation-seeking tasks about a single cluster (e.g., finding the closest points of a given cluster [etemadpour2015user]); however, they cannot provide the information required to support T1 and T2, which consider multiple clusters at once.

2.2 Design Considerations

Based on the task analysis, we formulated three design considerations (C1, C2, C3) that Steadiness and Cohesiveness should satisfy to adequately quantify how accurately the three inter-cluster tasks can be performed and thus precisely measure inter-cluster reliability.

  • C1: Capture the inter-cluster structure in detail. The inter-cluster structure in MDP is complex and intertwined [xia2017ldsscanner], and often has no ground truth. Furthermore, each cluster’s characteristics (e.g., shape, density, or size) vary widely [harel2001clustering]. Therefore, to quantify how precisely users can identify clusters (T1) or seek relationships between them (T2), we should thoroughly consider the inter-cluster structure in detail.

  • C2: Consider stretching and compression individually. The distances between clusters may be affected by two aspects of geometric distortions: stretching and compression [aupetit2007visualizing]. If stretching occurs, users can misunderstand nearby clusters as distinct clusters. The opposite can happen if compression occurs (i.e., nearby groups can be identified as a single cluster). Furthermore, clusters’ size and density can be overestimated due to stretching or underestimated due to compression. As the two aspects of distortion result in different types of misperceptions about the clusters’ size and density (T3) or their distance (T2-2), we should consider both aspects individually.

  • C3: Measure how accurately the clusters identified in the projection reflect their original density and size. Users can have misconceptions when comparing clusters (T3) if the projected clusters’ size and density do not reflect those in the original space. To correctly quantify how much such misunderstandings can happen, we need to measure how accurately the clusters in the projection reflect their original density and size.

3 Steadiness and Cohesiveness

We propose Steadiness and Cohesiveness to measure inter-cluster reliability by evaluating inter-cluster distortion, satisfying our three design considerations. Steadiness measures inter-cluster reliability in the projected space (e.g., separated clusters in the original high-dimensional space are still separated in the projected space), while Cohesiveness does the same for the original space (e.g., each cluster in the original space is not dispersed in the projected space).

3.1 Defining Inter-Cluster Distortion Types

To design Steadiness and Cohesiveness, we first defined two inter-cluster distortion types, False Groups and Missing Groups, by generalizing False and Missing Neighbors to the cluster level. False Groups distortion denotes the cases in which a low-dimensional group in a single cluster (red dashed circle in Figure 1d) consists of separated groups in the original space (blue dotted circles in Figure 1d), and Missing Groups distortion occurs when the original group (red dashed circle in Figure 1c) misses its subgroups (green dotted circles in Figure 1c) and therefore is divided into multiple separated subgroups in the projected space. Steadiness and Cohesiveness evaluate how well projections avoid False and Missing Groups, respectively (C2).

3.2 Computing Steadiness and Cohesiveness

We compute inter-cluster reliability through the following procedure: (Step 1) constructing dissimilarity matrices, (Step 2) iteratively computing partial distortions, and (Step 3) aggregating partial distortions into Steadiness and Cohesiveness. Based on the definitions of the two measures, Steadiness increases as clusters extracted from the projected space stay more consistently together in the original space. In contrast, Cohesiveness increases when clusters in the original space are maintained more consistently in the projected space. Each step is designed to satisfy all the design considerations (subsection 2.2). First, we split the workflow to handle Steadiness and Cohesiveness independently after Step 1 (C2). Step 2 exploits randomness to cover the complex inter-cluster structures (C1) and inherently quantifies how well the original density and size of clusters are retained (C3). The workflow requires four functions as hyperparameters:

  • Distance function for points, dist

    • Input: two points $p$ and $q$

    • Output: the distance (or dissimilarity) between $p$ and $q$

  • Distance function for clusters, dist_cluster

    • Input: two clusters $C_i$ and $C_j$

    • Output: the distance (or dissimilarity) between $C_i$ and $C_j$

  • Cluster extraction function, extract_cluster

    • Input: a seed point $p$

    • Output: a cluster in the projected space (for Steadiness) or the original space (for Cohesiveness) centered on $p$

  • Clustering function, clustering

    • Input: a set of points $P$

    • Output: a clustering result of the input points where the clustering takes place in the original space (for Steadiness) or the projected space (for Cohesiveness)

Two distance functions are used to compute the amount of inconsistency, while the other two functions are used for the iterative computation of partial distortions. These functions are explained in detail in subsection 3.3.

3.2.1 Step 1: Constructing Dissimilarity Matrices

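We begin the measurement by constructing dissimilarity matrices $M^{\mathrm{compress}}$ and $M^{\mathrm{stretch}}$. We first construct distance matrices $D^X$ and $D^Y$ satisfying $D^X_{ij} = \mathrm{dist}(x_i, x_j)$ and $D^Y_{ij} = \mathrm{dist}(y_i, y_j)$, where $x_i$ and $y_i$ denote the original and projected coordinates of input data point $p_i$, respectively. For dist, we used Shared-Nearest Neighbor (SNN) based dissimilarity [ertoz2003finding] as a default (subsection 3.3). $D^X$ and $D^Y$ are then normalized by dividing all elements by their max elements $D^X_{\max}$ and $D^Y_{\max}$. The raw dissimilarity matrix $M$ is obtained by subtracting $D^Y$ from $D^X$. The positive elements in $M$ denote that the distance between the corresponding point pair is compressed, and the negative elements denote that the distance is stretched. We then construct $M^{\mathrm{compress}}$ and $M^{\mathrm{stretch}}$, where $M^{\mathrm{compress}}_{ij} = M_{ij}$ if $M_{ij} > 0$, else $0$, and $M^{\mathrm{stretch}}_{ij} = -M_{ij}$ if $M_{ij} < 0$, else $0$.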
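Step 1 can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: Euclidean distance stands in for the SNN-based default dist, the raw matrix is taken as original minus projected (the direction implied by positive entries meaning compression), and stretching is stored as a non-negative magnitude. All function names are hypothetical.

```python
import numpy as np


def pairwise_euclidean(coords):
    """Stand-in for `dist`: full Euclidean distance matrix."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def dissimilarity_matrices(X, Y, dist=pairwise_euclidean):
    """Return (compress, stretch) matrices built from normalized distances."""
    d_x, d_y = dist(X), dist(Y)
    d_x = d_x / d_x.max()  # normalize each matrix by its max element
    d_y = d_y / d_y.max()
    raw = d_x - d_y        # > 0: pair got closer in the projection (compressed)
    compress = np.where(raw > 0, raw, 0.0)
    stretch = np.where(raw < 0, -raw, 0.0)  # magnitude of stretching (assumption)
    return compress, stretch


# Three points on a line whose middle point moves during "projection".
X = np.array([[0.0], [1.0], [3.0]])
Y = np.array([[0.0], [2.0], [3.0]])
compress, stretch = dissimilarity_matrices(X, Y)
print(compress)
print(stretch)
```

In the toy data, the pair (0, 1) is stretched and the pair (1, 2) is compressed, so each pair contributes to exactly one of the two matrices.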

3.2.2 Step 2: Iteratively Computing Partial Distortions

The next step is to iteratively compute partial distortions by randomly extracting clusters from one space and evaluating their dispersion in the opposite space. In this section, we describe how to compute partial distortions in a single iteration.

Extracting random clusters    For each iteration, we first select a seed point randomly in the projected space (Steadiness) or the original space (Cohesiveness). Then, the extract_cluster function takes the random seed point as input and extracts a cluster centered on the point as output. The random selection of the seed point leads to the extraction of clusters from diverse locations, and therefore it is possible to cover the entire inter-cluster structure after sufficient iterations (e.g., 200 iterations for data consisting of 10,000 points) (C1). By default, we use the SNN similarity (subsection 3.3) for the extract_cluster function to gather points near the seed point.

Revealing the cluster’s dispersion in the opposite space    Next, we reveal how the randomly extracted cluster is dispersed in the opposite space. To do this, the clustering function takes the points of the extracted cluster generated by extract_cluster as input. Afterward, the clustering function clusters the input points in the opposite space and returns the set of separated clusters as output. Hierarchical DBSCAN (HDBSCAN) [campello2013density, mcinnes2017hdbscan] utilizing an SNN-based distance function is used as the default clustering function (subsection 3.3). This step also allows the metrics to measure how well the clusters reflect their original density and size (C3). If a cluster’s original outliers are merged into a single cluster during MDP (False Groups distortion), either the cluster’s size or density will be increased. This situation can be captured while checking the projected cluster’s dispersion in the original space. For the opposite case (Cohesiveness), if an original space’s cluster loses some of its points during MDP, either its size or density in the projected space will be reduced. Revealing Missing Groups distortion captures this issue (subsection 6.1).

Computing distortions between dispersed groups    In this step, we take the separated clusters returned by clustering as input and generate a distortion $m_{ij}$ and its weight $w_{ij}$ for each pair of clusters ($C_i$, $C_j$), based on the point-stretching and point-compression metrics proposed by Michaël Aupetit [aupetit2007visualizing]. We generalized point-stretching and point-compression to cluster-stretching (Steadiness) and cluster-compression (Cohesiveness) by replacing the distance between points with the distance between clusters. For each cluster pair ($C_i$, $C_j$), we compute their distances $\delta_Y(C_i, C_j)$ and $\delta_X(C_i, C_j)$ in the projected space and the original space, respectively, utilizing dist_cluster. The default dist_cluster is designed by expanding the SNN-based distance function for points (subsection 3.3). Then, we check whether the distance is compressed or stretched and subsequently compute the distortion as follows:


The weight of a pair of clusters ($C_i$, $C_j$) is determined as $w_{ij} = |C_i| \cdot |C_j|$. The weights penalize the distortion of larger clusters more than smaller ones; thus, we can deal with an inter-cluster structure consisting of clusters of various sizes (C1).

3.2.3 Step 3: Aggregating Partial Distortions

This step aggregates the iteratively computed partial distortions into Steadiness and Cohesiveness. The iterative partial distortion measurement generates a set of distortions and their corresponding weights. We denote the sets as follows:

  • $\{(m^{s}_i, w^{s}_i) \mid i = 1, \dots, n_s\}$, where $n_s$ denotes the number of total cluster pairs generated throughout the entire partial distortion measurement of Steadiness.

  • $\{(m^{c}_i, w^{c}_i) \mid i = 1, \dots, n_c\}$, where $n_c$ denotes the number of total cluster pairs generated throughout the entire partial distortion measurement of Cohesiveness.

We then calculate the final scores as follows:

  • $\mathrm{Steadiness} = 1 - \frac{\sum_{i=1}^{n_s} w^{s}_i m^{s}_i}{\sum_{i=1}^{n_s} w^{s}_i}$.

  • $\mathrm{Cohesiveness} = 1 - \frac{\sum_{i=1}^{n_c} w^{c}_i m^{c}_i}{\sum_{i=1}^{n_c} w^{c}_i}$.

The final scores lie in the range of [0, 1]. The weighted average is subtracted from 1 to assign lower scores to lower-quality projections.
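The aggregation amounts to one minus a weighted average of the partial distortions, which can be sketched as below. The function name `aggregate_score` is hypothetical, the distortions are assumed to already lie in [0, 1], and returning 1.0 when no cluster pair was generated is an assumption of this sketch rather than something the text specifies.

```python
def aggregate_score(distortions, weights):
    """Return 1 - (weighted average of partial distortions)."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 1.0  # assumption: no measured pair means no measured distortion
    weighted_sum = sum(m * w for m, w in zip(distortions, weights))
    return 1.0 - weighted_sum / total_weight


# Two cluster pairs: an undistorted small pair and a half-distorted larger pair.
score = aggregate_score([0.0, 0.5], [1, 3])
print(score)
```

The same function would be applied once to the Steadiness pairs and once to the Cohesiveness pairs, since the two sets are kept separate after Step 1.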

3.3 Designing Hyperparameter Functions

3.3.1 Parameterizing Hyperparameter Functions

The workflow of computing Steadiness and Cohesiveness requires four hyperparameter functions: dist, dist_cluster, clustering, and extract_cluster. We parameterized these functions because both the definition of distance and the definition of clusters vary depending on the analysis goals. There are various ways to define the distance between two data points (e.g., Euclidean distance, geodesic distance, cosine similarity). The definition of clusters also varies, and thus many different clustering algorithms (e.g., K-Means [duda1973pattern], Density-based clustering [ester1996density], Mean shift [comaniciu2002mean]) exist. Therefore, it is unreasonable to use a fixed definition for both. This is in line with the fact that, as there are various ways to define similarity between each point and its local neighbors, there are diverse local metrics that utilize different similarity definitions. However, parameterization could reduce the metrics’ interpretability. Thus, we designed default hyperparameter functions that align with our design considerations to allow users to easily understand and use our metrics.

3.3.2 Default Hyperparameter Functions

To design the default functions, we first set the definitions of distance and cluster. We defined distance as the dissimilarity of the points based on the Shared-Nearest Neighbors (SNN) [ertoz2003finding] similarity, which assigns a high similarity to point pairs sharing more $k$NNs. SNN-based dissimilarity was selected because Steadiness and Cohesiveness should reflect the inter-cluster structure of the original high-dimensional space. Although it is common to use a $k$NN graph to reflect a high-dimensional space [van2014accelerating, tang2016visualizing], $k$NN’s ability to describe the structure of data decreases as dimensionality grows [beyer1999nearest, hinneburg2000nearest]. SNN-based dissimilarity tackles this issue as the similarity of two points is robustly confirmed by their shared neighbors, thus better representing the structure of high-dimensional spaces compared to $k$NN [ertoz2002new, liu2018shared]. We also defined a cluster as a contiguous data region, or manifold with an arbitrary shape, where the density of the region is higher than its surroundings. This definition is followed by density-based clustering algorithms. We used this definition because the metrics should capture the complex and intertwined inter-cluster structure consisting of clusters of various sizes and shapes (C1), and therefore should be able to define clusters more flexibly. We designed the default hyperparameter functions to satisfy both the definitions and the original design considerations (subsection 2.2).

Distance function for points, dist    As mentioned, the distance function is based on SNN similarity. Let us first denote the $k$-nearest neighbors of a point $p$ as $p_1, p_2, \dots, p_k$, in order. The SNN similarity between two points $p$ and $q$ is defined as $\mathrm{sim}(p, q) = \sum_{(m, n) \in N_{p,q}} (k + 1 - m)(k + 1 - n)$, where $N_{p,q}$ is the set of pairs $(m, n)$ satisfying $p_m = q_n$. $\mathrm{sim}(p, q)$ increases when more $k$-nearest neighbors with high ranks overlap. We subsequently normalized all SNN similarity values by dividing them by the max SNN similarity max_sim of the dataset. Finally, we defined the distance function dist as $\mathrm{dist}(p, q) = 1 / (\mathrm{sim}(p, q) + \alpha)$. We used the reciprocal transformation [tan2016introduction] to further penalize low similarity, where $\alpha$ controls the amount of penalization. A fixed value of $\alpha$ is used as the default.

Distance function for clusters, dist_cluster    As for dist_cluster, we first defined the similarity between clusters and converted it to their distance. We used average linkage [murtagh2012algorithms], as it is more robust to outliers than competitors (e.g., single linkage), thus defining the similarity of two clusters $C_i$ and $C_j$ as $\mathrm{sim}(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{p \in C_i} \sum_{q \in C_j} \mathrm{sim}(p, q)$, where $p$ and $q$ denote the points in $C_i$ and $C_j$, respectively. We then defined the distance between $C_i$ and $C_j$ as $\mathrm{dist\_cluster}(C_i, C_j) = 1 / (\mathrm{sim}(C_i, C_j) + \alpha)$.

Clustering function, clustering    As our definition of cluster is the one used in conventional density-based clustering, designing clustering required a single decision: selecting the proper density-based clustering algorithm. We selected HDBSCAN, which is a state-of-the-art density-based clustering algorithm. As HDBSCAN can handle clusters with various shapes and densities and is robust to noise (outliers) [mcinnes2017accelerated], exploiting it helps to reveal the dispersion of clusters regardless of the clusters’ characteristics (e.g., shape, size, or density). Therefore, it helps the metrics deal with complex inter-cluster structures (C1). HDBSCAN also tackles the curse of dimensionality [vijendra2011efficient], which suits our metrics that need to consider the higher-dimensional space. To align clustering with our dissimilarity definition, our HDBSCAN utilized dist for the distance calculation.

Cluster extraction function, extract_cluster    The design of extract_cluster mainly follows a density-based clustering process, aligned to clustering; it uses a random seed point as the sole core point and assigns nearby points, which are treated as non-core points, successively to form a cluster. In detail, the function traverses the seed point $p$’s $k$-nearest neighbors and includes each neighbor point $q$ as a cluster member with a probability of $\mathrm{sim}(p, q)$ / max_sim. When a neighbor point is determined to be a cluster member, it goes into a queue so that its neighbors can also be traversed later. Adding neighbors stochastically makes extracted clusters form a dense structure instead of spanning the entire $k$NN graph. To diversify the size of the extracted clusters, we limited the number of traversals starting from the seed point and allowed repeated visits. Combined with the random starting seed point, this strategy enriches the range that our metrics cover, thus helping the metrics deal with complex inter-cluster structures (C1). The strategy fundamentally relies on the fact that randomness can help analyze a complex, uncertain system [tempo2012randomized]. We fixed the number of traversals to 40% of the total number of data points for our evaluations (section 5, 6).
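The SNN-based distance and the stochastic cluster extraction can be sketched together as follows. This is an illustration under stated assumptions, not the authors' implementation: the rank-weighted form of the SNN similarity, the acceptance probability sim / max_sim, the way traversals are counted, and the value of `alpha` passed in the usage lines are all assumptions of this sketch, and every function name is hypothetical.

```python
import random
from collections import deque

import numpy as np


def knn_indices(coords, k):
    """Row i holds the indices of i's k nearest neighbors, nearest first."""
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]


def snn_similarity(knn):
    """Rank-weighted SNN similarity: shared neighbors with high ranks count more."""
    n, k = knn.shape
    rank_weight = np.zeros((n, n))  # k + 1 - rank for neighbors, 0 otherwise
    rows = np.arange(n)[:, None]
    rank_weight[rows, knn] = np.arange(k, 0, -1)[None, :]
    sim = rank_weight @ rank_weight.T  # sums weight products over shared neighbors
    np.fill_diagonal(sim, 0.0)
    return sim


def snn_distance(sim, alpha):
    """Reciprocal transformation of the max-normalized similarity."""
    max_sim = sim.max()
    normalized = sim / max_sim if max_sim > 0 else sim
    return 1.0 / (normalized + alpha)


def extract_cluster(seed, knn, sim, max_visits, rng):
    """Grow a cluster from `seed` by stochastic breadth-first traversal of the kNN graph."""
    max_sim = sim.max()
    members = {seed}
    if max_sim <= 0:
        return members
    queue = deque([seed])
    visits = 0
    while queue and visits < max_visits:
        p = queue.popleft()
        for q in knn[p]:
            if visits >= max_visits:
                break
            visits += 1
            q = int(q)
            # Accept q with probability proportional to its similarity to p;
            # repeated visits are allowed, so q may be enqueued more than once.
            if rng.random() < sim[p, q] / max_sim:
                members.add(q)
                queue.append(q)
    return members


# Two well-separated groups of three points on a line.
coords = np.array([[0.0], [1.0], [2.5], [10.0], [11.0], [12.5]])
knn = knn_indices(coords, k=2)
sim = snn_similarity(knn)
dist = snn_distance(sim, alpha=0.1)  # alpha chosen for illustration only
cluster = extract_cluster(0, knn, sim, max_visits=20, rng=random.Random(0))
print(sim[0, 2], sim[0, 3], sorted(cluster))
```

Because points in different groups share no neighbors, their similarity is zero and their distance takes the largest possible value, while the extracted cluster cannot leave the seed's group.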

Figure 2: The datasets and their projections used in Experiments A-D. A, B) The synthetic projections of the dataset with six 100-dimensional spheres, consisting of six circles (A) and 12 circles (B) with an increasing amount of overlap. C) The MNIST dataset and its projections, created by replacing a certain proportion of the t-SNE projection with random points with increasing replacement (repl.) ratios. D) The UMAP projections of data randomly sampled from the RGB color cube dataset with an increasing nearest neighbors hyperparameter value.

3.4 Visualizing Steadiness and Cohesiveness

To overcome the limitation that metrics summarize the overall distortion in only one or two numeric values, we developed a complementary visualization: the reliability map (Figures 4, 5, and 6). The reliability map reveals how and where inter-cluster distortion occurred by showing Steadiness and Cohesiveness at each point, embedding these pointwise distortions within the projection. The pointwise distortion is obtained by aggregating the partial distortions computed throughout the iterative process (subsubsection 3.2.2). Recall that the iterative computation yields, for each pair of clusters, a distortion value (of either type) and a weight. For every such pair, we register each point of one cluster to each point of the other with the corresponding distortion strength, and do the same in the opposite direction. Duplicate registrations of a point are merged by averaging their distortion strengths. We then compute each point’s approximate local distortion by summing up its registered distortion strengths. The reliability map visualizes these pointwise distortions through edge-based distortion encoding. We constructed a kNN graph in the projection and made each edge of the graph depict the sum of the pointwise distortions of its two endpoints. If the points within a narrow region have high distortion, the edges between the points will be intertwined in the region (e.g., red dotted contours in Figures 5 and 6); they will be recognized as clusters with distinguishable inter-cluster distortion. However, using a large k might generate visual clutter; we empirically found that a k between 8 and 10 is an adequate choice for both expressing inter-cluster distortion and avoiding visual clutter. Martins et al.’s point cloud distortion visualization [martins2014visual] is similar to ours, but it computes the distortion value at each pixel instead of encoding it on edges. 
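The registration and aggregation steps can be illustrated with a short Python sketch. It assumes that each cluster pair arrives as a hypothetical `(cluster_a, cluster_b, strength)` record, where `strength` is the already-computed distortion strength for that pair; how that strength is derived from distortion and weight is left to the caller.

```python
from collections import defaultdict


def pointwise_distortion(pair_records):
    """Aggregate per-pair distortion strengths into per-point values.

    pair_records: iterable of (cluster_a, cluster_b, strength) tuples, where
    the clusters are collections of point ids (assumed record layout).
    """
    registered = defaultdict(lambda: defaultdict(list))
    for cluster_a, cluster_b, strength in pair_records:
        for p in cluster_a:
            for q in cluster_b:
                registered[p][q].append(strength)  # register q to p
                registered[q][p].append(strength)  # and p to q
    local = {}
    for p, partners in registered.items():
        # Average duplicate registrations, then sum over registered points.
        local[p] = sum(sum(v) / len(v) for v in partners.values())
    return local


def edge_values(edges, local):
    """Edge-based encoding: each edge shows the sum of its endpoints' values."""
    return {(p, q): local.get(p, 0.0) + local.get(q, 0.0) for p, q in edges}
```

In this sketch a point that was never registered contributes zero to its edges, so regions untouched by any distorted cluster pair stay visually neutral.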
To express the False Groups and Missing Groups distortion types simultaneously, we used CheckViz’s two-dimensional color scale [lespinats2011checkviz] (lower right corner of Figure 5). Following the color scheme of CheckViz, we assigned purple to the edges with False Groups distortion and green to the edges with Missing Groups distortion. Moreover, edges with no distortion are represented as white, while black edges indicate that both distortion types occurred together. We also implemented a cluster selection interaction (e.g., lower right box in Figure 4) to allow users to identify Missing Groups distortion more precisely. After users select a cluster by drawing a lasso with the mouse, the reliability map collects the points registered to the selected points. Subsequently, the edges connected to the collected points are highlighted in red. Each highlighted edge’s opacity encodes the sum of the distortion strengths of its incident points toward the selected cluster (i.e., how much the distance between its incident points and the selected cluster is stretched).
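A two-dimensional color scale of this kind can be approximated by bilinear interpolation between the four corner colors. The sketch below is only an illustration in Python (the reliability map itself is implemented in JavaScript); the RGB corner values and the assumption that both distortion values are normalized to [0, 1] are hypothetical choices, not taken from CheckViz.

```python
# Hypothetical corner colors of the 2D scale (RGB, 0-255).
WHITE = (255, 255, 255)   # no distortion
PURPLE = (128, 0, 128)    # False Groups only
GREEN = (0, 128, 0)       # Missing Groups only
BLACK = (0, 0, 0)         # both distortion types


def edge_color(false_groups, missing_groups):
    """Map two distortion values, each assumed in [0, 1], to an RGB color."""
    f = min(max(false_groups, 0.0), 1.0)
    m = min(max(missing_groups, 0.0), 1.0)
    corners = (
        ((1 - f) * (1 - m), WHITE),
        (f * (1 - m), PURPLE),
        ((1 - f) * m, GREEN),
        (f * m, BLACK),
    )
    # Bilinear blend of the four corners, channel by channel.
    return tuple(round(sum(w * c[ch] for w, c in corners)) for ch in range(3))
```

The four weights always sum to one, so every edge color lies inside the square spanned by white, purple, green, and black.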

4 Implementation

Steadiness and Cohesiveness are written in Python with an interface that lets users or programmers easily implement and use user-defined hyperparameter functions. This is to facilitate the later development and verification of possible alternatives to Steadiness and Cohesiveness. The partial distortion computation is parallelized with CUDA GPGPU [nickolls2008scalable] supported by Numba [lam2015numba]. We implemented the reliability map in JavaScript using D3.js [bostock2011d3]. The source code of both the metrics and the map is publicly available.

5 Quantitative Evaluations and Discussions

We evaluated how well Steadiness and Cohesiveness quantify inter-cluster reliability by comparing them with existing local distortion metrics. We verified that our metrics capture inter-cluster reliability well, while previous local metrics miss some cases even when the distortions are apparent. The reliability map confirmed that our metrics accurately captured where and how the inter-cluster distortion occurred. Moreover, we evaluated our metrics’ robustness by testing simpler hyperparameter functions (subsection 3.3). As baseline metrics, we chose T&C and MRREs (subsubsection 1.1.2), the two representative local metrics that measure nearest neighbors preservation. We chose the two for comparison because 1) they were designed to measure Missing and False Neighbors, the point-wise versions of Missing and False Groups, and 2) nearest-neighbor preservation has previously been used as the core criterion for evaluating MDP techniques [pezzotti2016hierarchical, van2014accelerating, fu2019atsne, lee2011shift, Moor19Topological]. For MRREs, in this section we use “MRRE [Missing]” for the one that quantifies Missing Neighbors, and “MRRE [False]” for the other that quantifies False Neighbors.

5.1 Sensitivity Analysis

We conducted four experiments to check whether Steadiness and Cohesiveness can sensitively measure inter-cluster reliability. We designed the first two experiments (A, B) to evaluate our metrics’ ability to quantify inter-cluster distortion using projections with synthetically generated False Groups (Experiment A) or Missing Groups (Experiment B) distortions, respectively. The next two experiments (C, D) investigated whether our metrics can properly assess differences in the overall inter-cluster reliability of projections.

5.1.1 Experimental Design

Experiment A: Identifying False Groups    The goal of the first experiment was to evaluate whether and how Steadiness and previous local metrics (Continuity, MRRE [False]) identify False Groups. We first generated high-dimensional data consisting of six 100-dimensional spheres whose centers were equidistant from the origin. Each sphere consisted of 500 points. We then set the initial 2D projection of the dataset as six circles around the origin (the first projection on the first row of Figure 2). Note that this projection is the most faithful view of the original data as we made each circle correspond to one high-dimensional sphere. To simulate False Groups distortion, we then distorted this ground-truth projection by overlapping the circles in pairs (the first row of Figure 2). For each pair of circles, we adjusted the degree of overlap by varying, in fixed steps, the angle that the two circle centers form at the origin. For each projection, we measured Steadiness and Cohesiveness (500 iterations), T&C, and MRREs. For soundness, we used different parameter values and took the mean of their results as the final score.
Experiment B: Identifying Missing Groups    To evaluate the ability of Cohesiveness and previous local metrics (Trustworthiness, MRRE [Missing]) to measure Missing Groups distortion, we used the same high-dimensional dataset as Experiment A, but this time, we synthesized the initial projection consisting of 12 equally distant circles, each consisting of 250 points. We made a pair of nearby circles correspond to a single sphere in the original space (the second row of Figure 2). We then overlapped each pair of circles by decreasing the same angle in fixed steps (the second row of Figure 2). Note that unlike Experiment A, the initial projection is the least faithful projection but becomes more faithful as the circles in each pair overlap more. We used the same metrics setting as Experiment A. 
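The synthetic projections of Experiments A and B can be imitated with a small generator that places circle centers in pairs around the origin and controls the overlap through the angle between the two centers of a pair. This is a rough sketch: the ring radius, the circle radius, and the function names are hypothetical, and uniform sampling inside each disc is an assumption about how the circles were filled.

```python
import math
import random


def paired_circle_centers(pair_angle_deg, n_pairs=3, ring_radius=10.0):
    """Centers of 2 * n_pairs circles, all at distance ring_radius from the origin.

    pair_angle_deg is the angle at the origin between the two centers of a
    pair; 0 makes the two circles of each pair coincide.
    """
    centers = []
    half = math.radians(pair_angle_deg) / 2
    for i in range(n_pairs):
        mid = 2 * math.pi * i / n_pairs  # direction bisecting pair i
        for a in (mid - half, mid + half):
            centers.append((ring_radius * math.cos(a), ring_radius * math.sin(a)))
    return centers


def sample_disc(center, radius, n, rng):
    """Draw n points uniformly from a disc (assumed fill for each circle)."""
    points = []
    for _ in range(n):
        r = radius * math.sqrt(rng.random())
        t = 2 * math.pi * rng.random()
        points.append((center[0] + r * math.cos(t), center[1] + r * math.sin(t)))
    return points
```

For instance, `[sample_disc(c, 2.0, 500, random.Random(0)) for c in paired_circle_centers(60.0)]` yields six evenly spaced circles of 500 points each, and shrinking the angle toward 0 makes the circles of each pair overlap progressively.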
Experiment C: Capturing quality degradation    To test our metrics’ ability to capture the quality degradation of a projection, we computed our metrics and previous metrics for projections with different levels of quality degradation. We first created a 2D t-SNE projection of the MNIST dataset [lecun1998mnist] (the first projection on the third row of Figure 2). We then replaced a certain proportion of the projected points with random points. We varied the replacement rate from 0 to 100% with an interval of 5% (the third row of Figure 2). The inter-cluster reliability of the projections necessarily degrades as the replacement rate increases. We checked whether the metrics can capture such quality degradation. We used the same metrics setting as Experiment A.
Experiment D: Identifying the effect of projection hyperparameters    The final experiment was conducted to evaluate the capability of our metrics to capture the inter-cluster reliability differences caused by the hyperparameter choices of an MDP technique. This experiment was inspired by an analysis from the UMAP paper [mcinnes2018umap] where the authors assessed the impact of a hyperparameter, the number of nearest neighbors, on the projection quality. Lower values drive UMAP to focus on local structures, while higher values make the projection preserve the global structures rather than the local details. In the original analysis, the authors qualitatively analyzed how the number of nearest neighbors affects the UMAP projection of randomly sampled 3-dimensional RGB cube data. The authors concluded that since randomly sampled data have no manifold structure, larger values generate more appropriate projections than lower values. Lower values instead treat the noise from random sampling as fine-scale local manifold structures, generating an unreliable interpretation of the structure [mcinnes2018umap]. We tested whether our metrics and previous metrics can quantitatively reproduce this conclusion. 
We first constructed a dataset of 4000 points randomly sampled from a 3-dimensional RGB cube. UMAP projections of the dataset with different numbers of nearest neighbors (one range of values sampled at an interval of 1 and another at an interval of 10) were then generated (the fourth row of Figure 2) and tested with the same metrics setting as Experiment A. We set another UMAP hyperparameter, min_dist, to 0.0 because higher min_dist values tune projections to lose the local structure, reducing the difference between the projections generated with higher and lower numbers of nearest neighbors. Setting it to 0.0 prevents this effect from influencing the experiment.
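As an illustration of the data preparation in Experiments C and D, the following sketch replaces a given proportion of projected points with random points and samples points from a unit RGB cube. Drawing the replacement points uniformly from the bounding box of the projection is an assumption (the text only says "random points"), and both function names are hypothetical.

```python
import random


def replace_with_random(points, ratio, rng):
    """Return a copy of 2D `points` with a fraction `ratio` replaced.

    Replacement points are drawn uniformly from the bounding box of the
    original projection (assumed; the sampling region is not specified).
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    out = list(points)
    n_replace = round(ratio * len(points))
    for i in rng.sample(range(len(points)), n_replace):
        out[i] = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
    return out


def sample_rgb_cube(n, rng):
    """Sample n points uniformly from the unit RGB cube [0, 1)^3."""
    return [(rng.random(), rng.random(), rng.random()) for _ in range(n)]
```

Sweeping `ratio` from 0.0 to 1.0 in steps of 0.05 reproduces the 5% schedule of Experiment C, and `sample_rgb_cube(4000, random.Random(0))` gives a dataset of the size used in Experiment D.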

Figure 3: The result of quantitative experiments A-D; the scores measured by our metrics (Steadiness and Cohesiveness) and baseline local distortion metrics (MRREs, T&C).

5.1.2 Results

Experiment A    As we decreased the angle between each circle pair (i.e., increased the amount of false overlap), both Steadiness and Cohesiveness decreased. The baseline local metrics (Trustworthiness, Continuity, MRRE [Missing], and MRRE [False]) also decreased, but their slopes were significantly gentler than those of our metrics (Figure 3A).
Experiment B    As we decreased the angle between each circle pair (i.e., made the projections more faithful), Cohesiveness drastically increased around the angle at which the circles in each pair start to overlap. The other measures (Steadiness, Trustworthiness, Continuity, MRRE [Missing], and MRRE [False]) did not change significantly (Figure 3B).
Experiment C    As the replacement rate increased, Steadiness, Cohesiveness, Trustworthiness, Continuity, MRRE [Missing], and MRRE [False] all decreased (Figure 3C).
Experiment D    As we increased the number of nearest neighbors, both Steadiness and Cohesiveness increased, while Trustworthiness and MRRE [Missing] decreased. Continuity and MRRE [False] increased, though their slopes were significantly gentler than that of Steadiness. All baseline local metrics saturated early near the maximum score (Figure 3D).
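The slopes referred to in these results relate a metric score to the varied parameter (angle, replacement rate, or number of nearest neighbors). Assuming a simple linear regression, which is the most direct reading of "slope of the regression line", the slope can be computed as follows; the function name is hypothetical and the significance test is omitted.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A steeper slope in absolute value means that the score reacts more strongly to the same change in the parameter, which is the sense in which one metric is said to be more sensitive than another.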

Figure 4: The reliability map and CheckViz visualizing the distortion of the projections from Experiment A (red dotted square in Figure 2) and Experiment B (blue dashed square in Figure 2). Unlike CheckViz, where no interesting pattern was shown, the reliability map demonstrated where and how False Groups distortion occurred (Exp. A). Even Missing Groups distortions were identified by the cluster selection interaction (Exp. B).
Figure 5: The UMAP, Isomap, and LLE projections of the MNIST test dataset and the reliability maps that visualize each projection’s inter-cluster distortion. Overall Steadiness (St) and Cohesiveness (Co) scores are depicted under the name of each technique. For each MDP technique, the left pane shows the class identity, and the right pane shows the reliability map. The color encoding of the inter-cluster distortion used in the reliability map is in the lower right corner. The projections and the corresponding reliability maps of t-SNE and PCA are in Appendix B.
Figure 6: The reliability maps that visualize the inter-cluster distortion of t-SNE projections made for the Fashion-MNIST test dataset. Steadiness and Cohesiveness scores are depicted under the perplexity value of each projection.

5.1.3 Discussion

The result of Experiment A suggests that our metrics can identify a loss of inter-cluster reliability caused by False Groups distortion, as Steadiness decreased when the overlap of circle pairs increased. Cohesiveness also decreased, which means that not only False Groups but also Missing Groups distortion occurred. This is because, for two points in the same circle, although their Euclidean distance is maintained while the circle overlaps with another circle, their SNN similarity decreases as more points intervene between them. Continuity and MRRE [False] also captured the decrease in inter-cluster reliability due to False Groups distortion, but more slowly than our metrics. For Experiment B, the result confirms that Cohesiveness correctly identifies Missing Groups distortion, as the metric increased with the increasing overlap of circles, which reduces Missing Groups distortion. Moreover, in Experiment B, the amount of Missing Groups distortion was captured only by Cohesiveness, which shows that our metrics can pinpoint the particular inter-cluster distortion type. In contrast, both Trustworthiness and MRRE [Missing] failed to capture this apparent Missing Groups distortion. The reliability map further confirms the results of Experiments A and B, as it showed that Steadiness and Cohesiveness accurately identified the places where False Groups and Missing Groups distortion occurred (Figure 4). The reliability map located the False Groups distortion of Experiment A by highlighting the overlapped area in purple. For Experiment B, it identified the Missing Groups relationship of the two separated circles in a pair through the cluster selection interaction: selecting a portion of one circle showed that the other circle was actually close to it, which matches the ground truth. In contrast, CheckViz, which visualized the False and Missing Neighbors distortion of each point computed by T&C, did not show any pattern. 
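The observation that SNN similarity drops when other points intervene, even though the Euclidean distance between two points is unchanged, can be checked with a toy computation. The sketch below uses a simplified, unweighted shared-nearest-neighbor count as the similarity; this is an assumption made for brevity and may differ from the SNN definition used by the metrics.

```python
import math


def knn_lists(points, k):
    """Brute-force k-nearest-neighbor index lists for a small point set."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        result.append(others[:k])
    return result


def snn_similarity(i, j, knn):
    """Simplified SNN similarity: number of shared nearest neighbors."""
    return len(set(knn[i]) & set(knn[j]))


# Two points at distance 1 that share two nearby neighbors.
before = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.1), (0.5, -0.1)]
# The same points after another group intrudes around each of them.
after = before + [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1),
                  (1.1, 0.0), (0.9, 0.0), (1.0, 0.1)]
sim_before = snn_similarity(0, 1, knn_lists(before, k=3))
sim_after = snn_similarity(0, 1, knn_lists(after, k=3))
```

Here the distance between points 0 and 1 stays at 1 in both configurations, yet `sim_after` is lower than `sim_before` because the intruding points take over both neighbor lists.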
In Experiments C and D, both Steadiness and Cohesiveness captured the decrease (Experiment C) and the increase (Experiment D) in inter-cluster reliability. Moreover, Experiment D showed that our metrics can also be used to quantify the effect of a hyperparameter, as they reproduced the result of human observers’ qualitative analysis [mcinnes2018umap]. In contrast, local metrics barely captured the clear increase in inter-cluster reliability in Experiment D. Overall, the experiments showed that our metrics can properly measure inter-cluster reliability, whereas local metrics failed in some cases even when the inter-cluster distortion was apparent.


Table 1: The result of Experiment D conducted with diverse hyperparameter function settings (subsection 5.2). Rows correspond to the clustering function and columns to the distance measurement (SNN-based or Euclidean), each reported for Steadiness (St) and Cohesiveness (Co). Each cell depicts the slope of the regression line that represents the relation between the nearest neighbors value and the score. Every regression analysis result was statistically significant.


5.2 Robustness Analysis

We also investigated the robustness of Steadiness and Cohesiveness against hyperparameters by repeating Experiment D with different hyperparameter functions. We tested simpler hyperparameter functions because hyperparameter functions can considerably change the behavior of our metrics. To this end, we tested simpler clustering algorithms, X-Means [pelleg2000x] and K-Means [duda1973pattern] (with several numbers of clusters), instead of the default HDBSCAN algorithm for clustering. We also tested the Euclidean distance as dist instead of the default SNN-based distance. When using the Euclidean distance for the distance measurement between points, we also defined the distance between two clusters, dist_cluster, as the Euclidean distance between their centroids, instead of the default definition based on SNN similarity, to align with dist. For extract_cluster, we treated every traversed point as a cluster member instead of using a probability that favors points with high SNN similarity. As a result (Table 1), Steadiness and Cohesiveness with simpler clustering hyperparameter functions both increased as the nearest neighbors value increased, which confirms their ability to properly quantify inter-cluster reliability. This result shows that our metrics’ capability is not bound mainly by the selection of clustering but instead originates more from the power of randomness to analyze complex structures [tempo2012randomized]. Interestingly, one of the K-Means settings showed the results most similar to those of the default HDBSCAN hyperparameter for both Steadiness and Cohesiveness. This is because, when clustering the extracted clusters in the opposite space, an inter-cluster structure composed of arbitrary shapes and sizes can be better represented by a fine-grained K-Means clustering result than by a coarse-grained one. However, Cohesiveness failed for all cases that used the Euclidean distance as dist. 
This result shows that, when designing hyperparameter functions for Steadiness and Cohesiveness, users should carefully consider the definition of the distance between two points.
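The Euclidean variant of dist_cluster used in this analysis, the distance between cluster centroids, is simple enough to state directly. The sketch below is a plain-Python illustration; the function names are hypothetical and do not reflect the interface of the released code.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def dist_cluster_euclidean(cluster_a, cluster_b):
    """Distance between two clusters as the distance between their centroids."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))
```

A centroid summarizes each cluster by a single point, so this distance ignores cluster shape and spread, which is one plausible reason a purely Euclidean setting behaves differently from the SNN-based default.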

6 Case Studies

We report two case studies that we conducted with two ML engineers (E2, E3). During the studies, we demonstrated to the engineers how Steadiness, Cohesiveness, and the reliability map work, and they explored with us the original inter-cluster structure of the MNIST and Fashion-MNIST [xiao2017fashion] test datasets, both of which lie in a 784-dimensional space and consist of 10 classes. The case studies showed that our metrics and the reliability map support users in 1) selecting adequate projection techniques or hyperparameter settings that match the dataset and 2) avoiding misinterpretations that could occur while conducting inter-cluster tasks (subsection 2.1). The ML engineers agreed that such support is helpful in interpreting the inter-cluster structure of high-dimensional data.

6.1 MNIST Exploration with Diverse MDP Techniques

To explore the inter-cluster structure of MNIST, we projected it with t-SNE, UMAP, PCA [pearson1901liii], Isomap [tenenbaum2000global], and LLE. We measured the Steadiness and Cohesiveness (500 iterations) of each projection and visualized the result with the reliability map (Figure 5). We first discovered that visualizing Steadiness and Cohesiveness can prevent users from misidentifying a cluster separation of the original space (T1). For instance, in the Isomap projection, we found that the region with high False Groups distortion consists of categories #4 and #7 (red dotted circle in Figure 5). A similar region was also observed in the PCA projection (Appendix B). LLE also has a cluster with high False Groups distortion composed of categories #3, #6, #8, and #9. Without checking False Groups distortion, one could wrongly interpret such a region as belonging to the same cluster; visualizing the distortion with the reliability map helped us avoid this misperception. Visualizing False Groups distortion also allowed additional reasoning beyond a mere quantitative score comparison when choosing the proper projection technique. We found that the False Groups distortions that occurred in Isomap, PCA (overlap of categories #4 and #7), and LLE (overlap of categories #3, #6, #8, and #9) did not occur in t-SNE or UMAP. This finding explains why the Steadiness of t-SNE and UMAP is higher than that of the other projections, advocating the use of t-SNE and UMAP in exploring the inter-cluster structure of the MNIST dataset. Still, as the t-SNE and UMAP projections also suffered from Missing Groups distortion, we critically interpreted that the clusters in these projections actually stay closer to each other than they look. E3 noted that this interpretation matches the ground truth that digits in MNIST stay much closer and more mixed in the original space than in their representations. 
Moreover, we found that by using the cluster selection interaction, users can accurately estimate and compare cluster sizes and shapes (T3). When we selected a local area in LLE (blue dashed ellipse in Figure 5), the reliability map highlighted a much larger region around the selected region (black long-dashed contour in Figure 5). This means that the original cluster containing the selected points was much larger than it appears in the projection and lost a portion of its points as dispersed outliers (i.e., Missing Groups distortion occurred). We identified this problem through the cluster selection interaction and avoided the misinterpretation.

6.2 Fashion-MNIST Exploration with t-SNE

In the second case study, we explored the Fashion-MNIST dataset using t-SNE projections with varying hyperparameters. We measured our metrics (500 iterations) on t-SNE projections generated with different perplexity values and visualized the result with the reliability map (Figure 6). Note that using a high perplexity makes t-SNE focus more on preserving global structures [wattenberg2016how]. As a result, we found that our metrics and the reliability map can help in selecting adequate hyperparameter settings. For example, a low-perplexity projection had both False and Missing Groups distortion distributed uniformly across the entire projection space. This finding, which aligns with the low scores of the projection, showed that low perplexity values are not sufficient to capture the global inter-cluster structure, which matches the known behavior of t-SNE. This result justifies selecting a higher perplexity to investigate Fashion-MNIST. The fact that the two higher-perplexity projections earned higher scores for both Steadiness and Cohesiveness than the two lower-perplexity projections strengthens this interpretation. Thus, we subsequently analyzed the two higher-scoring projections and discovered that our metrics prevent users from making wrong interpretations of the relations between clusters (T2). We first noticed that one of the two projections has more compact clusters than the other, where clusters are slightly more dispersed and closer to each other. As the more compact projection achieved a relatively high Steadiness score, we could conclude that each compact cluster also exists as a cluster in the original space. However, as the more dispersed projection achieved higher Cohesiveness, we could not trust the separation of the clusters depicted in the more compact projection (T2-1). According to Cohesiveness, the distances between the clusters in the original space are better depicted in the more dispersed projection. 
Therefore, it is more reliable to interpret the original inter-cluster structure as a set of subclusters that constitute one large cluster rather than as a set of separated clusters (T2-2). E2 paid particular attention to this result. She pointed out that it is common to perceive that projections with well-divided clusters (e.g., the more compact projection) better reflect the inter-cluster structure, but this result shows that such a common perception could lead to a misinterpretation of the inter-cluster structure.

7 Conclusion and Future Work

Although it is important to investigate inter-cluster distortion in many MDP tasks, there were previously no metrics that directly measure such distortions. In this work, we first surveyed user tasks related to identifying the inter-cluster structures of MDP and elicited design considerations for metrics that evaluate inter-cluster reliability. Next, we presented Steadiness and Cohesiveness to evaluate inter-cluster reliability by measuring the False Groups and Missing Groups distortions and presented the reliability map to visualize the metrics. Through quantitative evaluations, we validated that our metrics adequately measure inter-cluster reliability. The qualitative case studies showed that our metrics can also help users select proper projection techniques or hyperparameter settings and perform inter-cluster tasks with fewer misperceptions, assisting them in interpreting the original space’s inter-cluster structure. As future work, we plan to enhance the scalability of our algorithm. The algorithm currently computes the iterative partial distortion measurement sequentially. As each iteration works independently, we plan to accelerate the algorithm by leveraging multiprocessing. We also plan to improve our metrics to consider the hierarchical aspect of inter-cluster structures and reduce the number of hyperparameters. Another interesting research direction would be to investigate how Steadiness, Cohesiveness, and their visualizations affect users’ perception of the original data, which would provide an in-depth understanding of MDP as an expansion of our case studies.

Thanks to Yoo-Min Jung and Aeri Cho for their valuable feedback. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2019R1A2C208906213).