1 Background and Related Work
1.1 MDP Distortions
Many MDP techniques, such as t-SNE [maaten2008visualizing], Isomap [tenenbaum2000global], and UMAP [mcinnes2018umap], have been proposed to understand and visualize high-dimensional data (following previous research [nonato2018multidimensional, etemadpour2014perception, etemadpour2015user], this paper denotes both linear and nonlinear embeddings of multidimensional data as MDP); however, every MDP produces distortions because information loss is inevitable when dimensionality is reduced.
1.1.1 Distortion Types
In his seminal work [aupetit2007visualizing], Michaël Aupetit defined two types of MDP distortion: stretching and compression. Stretching occurs when pairwise distances in the projected space are expanded compared to the original pairwise distances, and compression is the opposite. Afterward, the Missing Neighbors and False Neighbors [lespinats2007dd, lespinats2011checkviz] distortion types were introduced to interpret stretching and compression in the context of neighborhood preservation. Let $f: X \to Y$ be a smooth mapping, where $X \subseteq \mathbb{R}^D$ and $Y \subseteq \mathbb{R}^d$ for some $d < D$. Each data point $p$ has a high-dimensional coordinate $x_p \in X$ and a corresponding low-dimensional coordinate $y_p = f(x_p) \in Y$. For any point $p$, its neighbor sets in the projected and original spaces are denoted as $N_Y(p)$ and $N_X(p)$, respectively. Missing Neighbors are then defined as $N_X(p) \setminus N_Y(p)$. Similarly, False Neighbors are defined as $N_Y(p) \setminus N_X(p)$ (Figure 1a). However, our literature review (subsection 2.1) indicated that measuring Missing and False Neighbors distortion cannot reflect how well inter-cluster tasks can be performed, and thus cannot correctly evaluate inter-cluster reliability. To alleviate the mismatch between inter-cluster reliability and point-point distortions, Martins et al. [martins2014visual] defined distortion types relevant to the cluster-point relationship: Missing Members and False Members, defined with regard to a group of data points. For a group $G$ of similar points (e.g., within the same category of a dataset or clustered by a clustering algorithm) in the original space, $G'$ is used to denote the "projected group" that corresponds to $G$. Here, False Members are the points in $G' \setminus G$, and Missing Members are those in $G \setminus G'$ (Figure 1b). However, the literature review also revealed that this generalization is insufficient to reflect the degree to which users can perform inter-cluster tasks precisely. We further generalize these distortion types by proposing new inter-cluster distortion types that directly harm inter-cluster reliability.
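As an illustration of these definitions, the sketch below computes per-point Missing and False Neighbors from Euclidean $k$NN sets. This is a minimal brute-force example under our own assumptions (Euclidean neighborhoods, hypothetical function names), not an implementation from the cited works:

```python
import numpy as np

def knn_indices(points, k):
    """Index sets of each point's k nearest neighbors (Euclidean, brute force)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    return [set(np.argsort(row)[:k]) for row in d]

def missing_and_false_neighbors(high, low, k):
    """Per-point Missing Neighbors (N_X \\ N_Y, lost in projection) and
    False Neighbors (N_Y \\ N_X, spurious in projection)."""
    nn_high, nn_low = knn_indices(high, k), knn_indices(low, k)
    missing = [h - l for h, l in zip(nn_high, nn_low)]
    false = [l - h for h, l in zip(nn_high, nn_low)]
    return missing, false
```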
1.1.2 Distortion Metrics
According to a survey conducted by Nonato and Aupetit [nonato2018multidimensional], most distortion metrics aim to measure point-point distortion. Among them, a few metrics evaluate how much Missing and False Neighbors distortion has occurred. For instance, Trustworthiness and Continuity (T&C) [venna2006local] locally measure how Missing and False Neighbors distort the ranks of each point's neighbors. Mean Relative Rank Errors (MRREs) [lee2007nonlinear] are similar to T&C; however, they consider not only the rank variance of the Missing and False Neighbors but also that of True Neighbors (the points that are judged as neighbors in both spaces). Local Continuity Meta-Criteria (LCMC) [chen2009local] is another variant of T&C; it considers only True Neighbors. Still, measuring point-point distortion cannot adequately measure inter-cluster reliability, since doing so requires quantifying the relationships between clusters. Motta et al. [motta2015graph] proposed graph-based group validation, the only metric measuring cluster-point distortion that we could find as relevant work. The metric first extracts clusters from both the original and projected spaces using graph-based clustering. It then calculates each cluster's structural persistence in the opposite space by measuring how much Missing and False Members distorted the cluster. Given that the metric examines each cluster independently, it is inappropriate for measuring inter-cluster reliability, which refers to multiple clusters at once. Measuring the distortion of predefined clusters with a clustering quality metric has also been widely used to evaluate MDP. For example, Joia et al. [joia2011local] and Fadel et al. [fadel2015loch] used the silhouette coefficient [rousseeuw1987silhouettes] to quantify cluster preservation in MDP. However, one limitation is that the inter-cluster structures of real-world datasets are usually unknown. Graph-based group validation also suffers from the same problem, as it performs clustering once for all data and uses the result as predefined clusters. By contrast, our metrics consider the complex inter-cluster structure by examining repeatedly extracted random clusters, and thus quantify inter-cluster reliability much more accurately.
1.1.3 Distortion Visualizations
To overcome the inherent limitation of metrics, which describe only the overall distortions with one or two representative numerical values, complementary visualizations have been proposed [nonato2018multidimensional]. These visualizations aim to reveal the distortion information submerged in such summary values, thus helping users identify trustworthy areas of the projection or detect distortion patterns. Distortion visualizations commonly highlight regions with local point-point distortions by decomposing the projection area into cells, where each cell corresponds to one or more data points and encodes the corresponding points' distortions. The decomposition is usually done using a heatmap [seifert2010stress], a Voronoi diagram [aupetit2007visualizing, lespinats2011checkviz, heulot2012proxiviz], or a 2D point cloud [martins2014visual, martins2015explaining]. By contrast, MING [colange2019interpreting] explains False and Missing Neighbors by visualizing the overlap of the nearest-neighbor graphs constructed in the original and projected spaces. In this work, we quantify point-wise distortions by aggregating the inter-cluster distortions of clusters and visualize them.
1.2 Inter-Cluster Reliability
As many MDP techniques intentionally focus on local neighbors, they have trouble reflecting the global inter-cluster structure of the original high-dimensional space. For example, Barnes-Hut t-SNE [van2014accelerating] and LargeVis [tang2016visualizing] concentrate on local structures by interpreting data based on $k$-Nearest Neighbor ($k$NN) graphs. Using $k$NN graphs with a small $k$ also allows them to reduce computation. However, as $k$NN graphs with small $k$ only maintain the relations between each point and its local neighbors, they can reflect only limited local structures [fu2019atsne]. Recently proposed MDP techniques have tried to preserve both the local and the global inter-cluster structures. For example, Narayan et al. [narayan2020density] introduced den-SNE and densMAP, which modify t-SNE and UMAP, respectively, to better preserve clusters' density. Another common strategy is to first construct a global skeletal layout using representative points (i.e., landmarks) and then formulate the local structure around each landmark [fu2019atsne, joia2011local, paulovich2010two, fadel2015loch, pezzotti2016hierarchical]. However, even for these approaches, completely retaining the original space's inter-cluster structure during the projection is inherently impossible. Therefore, it is vital to measure the extent to which these techniques preserve inter-cluster reliability for a proper evaluation and analysis. Previous studies have attempted to explain the inter-cluster reliability of MDP through visualizations. For instance, the Compadre system [cutura2020comparing] enables an inter-cluster structure analysis based on matrix visualization, and ClustVis [metsalu2015clustvis] does so with a heatmap. Visual analytics systems [chatzimparmpas2020t, liu2019latent] with similar goals have also been proposed.
Unlike these previous works, which utilized separate visual idioms to show the distortion, we adopt a strategy of visualizing the distortion within the projection [lespinats2011checkviz, martins2014visual] to explain inter-cluster reliability. Users can therefore directly identify where and how inter-cluster distortions occurred in the projection.
2 Design Considerations for Steadiness and Cohesiveness
In this section, we first survey inter-cluster tasks, which are essential in data analysis using MDP [martins2014visual, sedlmair2013empirical], through a literature review. Based on the survey, we then establish the design considerations that our metrics (Steadiness and Cohesiveness) should satisfy to adequately measure the extent to which inter-cluster tasks can be performed accurately with MDP.
2.1 Inter-Cluster Task Analysis
To identify the importance of inter-cluster structure preservation and to elicit the design considerations for our metrics, we inspected previous papers that addressed tasks related to clusters. We first investigated the 31 papers introduced in a systematic review conducted by Sacha et al. [sacha2016visual], which surveyed how analysts interact with MDP. To expand the search space, we also investigated 155 articles citing Sacha et al. using Google Scholar. As a result, we identified 26 papers concerning inter-cluster tasks: tasks that investigate the inter-cluster structure of original data through its 2D projections. Based on the task taxonomy for MDP proposed by Etemadpour et al. [etemadpour2014perception, etemadpour2015user], we classified the tasks into three categories in terms of inter-cluster distortions. We then organized them into individual tasks, as listed in the following:

T1. Identify separate clusters in the original space by exploring clusters in the projected space: recognize the separation between clusters [choo2010ivisclassifier, endert2011observation, wang2017perception] or distinguish a cluster from the others [poco2011framework, nam2007clustersculptor].

T2. Seek the relationships between clusters of the original space based on those of the projected space: (1) investigate the hierarchical or inclusion relations between clusters (i.e., check whether clusters can be divided again into smaller parts with higher density, which we call "subclusters") [liu2014distortion, xia2017ldsscanner]; (2) estimate the clusters' similarities based on their distances in the projected space [nam2007clustersculptor, wenskovitch2020respect].

T3. Compare clusters in the original space based on their features in the projected space: estimate and compare the clusters' original sizes or densities based on their sizes or densities in the projected space [chatzimparmpas2020t, amabili2017visualizing].
The tasks were verified through semi-structured interviews with four machine learning (ML) engineers (E1-E4) with more than three years of experience. Three of the engineers confirmed that they perform the tasks in practice for real-world problems. Only E1 said that he does not perform the tasks, because he usually works with data with well-distributed vector representations processed by a deep neural network, where no inter-cluster structure exists. Previous surveys of high-dimensional data analysis tasks based on MDP further confirm our task analysis results, as those works show similar results to ours despite using different methodologies. T1 is covered by Brehmer et al.'s task taxonomy based on interviews with 10 data analysts [brehmer2014visualizing], and T2 and T3 are covered by the taxonomy of cluster separation in MDP discussed by Sedlmair et al. [sedlmair2012taxonomy]. Our survey indicated that point-point and cluster-point distortion metrics cannot correctly quantify how well inter-cluster tasks can be performed. Point-point distortion metrics focus on each point's neighborhood instead of the inter-cluster structure. Therefore, these metrics can only measure the potential accuracy of relation-seeking tasks relevant to point-point relations, such as finding the $k$NN of a given point [etemadpour2015user]; they cannot measure the extent to which inter-cluster tasks can be performed accurately, as those tasks focus on the cluster level. Cluster-point distortion metrics can estimate the potential accuracy of T3, as the size and density of each cluster are related to the cluster itself. More precisely, if an MDP generates outliers for a cluster, the cluster's size is reduced (if the density is maintained) or its density is reduced (if the size is maintained). Both distortions directly affect the comparison task. By contrast, cluster-point distortion metrics still fail for T1 and T2. As these metrics consider each cluster independently, they can only work for cluster identification tasks related to a single cluster (e.g., distinguishing the outliers of a cluster [etemadpour2015user]) or relation-seeking tasks about a single cluster (e.g., finding the closest points of a given cluster [etemadpour2015user]); however, they cannot provide the information required to support T1 and T2, which consider multiple clusters at once.
2.2 Design Considerations
Based on the task analysis, we formulated three design considerations (C1, C2, C3) that Steadiness and Cohesiveness should satisfy to adequately quantify how accurately the three inter-cluster tasks can be performed, and thus to precisely measure inter-cluster reliability.


C1. Capture the inter-cluster structure in detail. The inter-cluster structure in MDP is complex and intertwined [xia2017ldsscanner], and often has no ground truth. Furthermore, each cluster's characteristics (e.g., shape, density, or size) vary widely [harel2001clustering]. Therefore, to quantify how precisely users can identify clusters (T1) or seek relationships between them (T2), we should thoroughly consider the inter-cluster structure in detail.

C2. Consider stretching and compression individually. The distances between clusters may be affected by two aspects of geometric distortion: stretching and compression [aupetit2007visualizing]. If stretching occurs, users can misunderstand nearby clusters as distinct clusters. The opposite can happen if compression occurs (i.e., nearby groups can be identified as a single cluster). Furthermore, clusters' size and density can be overestimated due to stretching or underestimated due to compression. As the two aspects of distortion result in different types of misperception about the clusters' size and density (T3) or their distance (T2-2), we should consider both aspects individually.

C3. Measure how accurately the clusters identified in the projection reflect their original density and size. Users can develop misconceptions when comparing clusters (T3) if the projected clusters' size and density do not reflect those in the original space. To correctly quantify how much such misunderstanding can happen, we need to measure how accurately the clusters in the projection reflect their original density and size.
3 Steadiness and Cohesiveness
We propose Steadiness and Cohesiveness to measure inter-cluster reliability by evaluating inter-cluster distortion while satisfying our three design considerations. Steadiness measures inter-cluster reliability in the projected space (e.g., clusters separated in the original high-dimensional space remain separated in the projected space), while Cohesiveness does the same for the original space (e.g., each cluster in the original space is not dispersed in the projected space).
3.1 Defining Inter-Cluster Distortion Types
To design Steadiness and Cohesiveness, we first defined two inter-cluster distortion types, False Groups and Missing Groups, by generalizing False and Missing Neighbors to the cluster level. False Groups distortion denotes cases in which a low-dimensional group within a single cluster (red dashed circle in Figure 1d) consists of separated groups in the original space (blue dotted circles in Figure 1d). Missing Groups distortion occurs when an original group (red dashed circle in Figure 1c) misses its subgroups (green dotted circles in Figure 1c) and is therefore divided into multiple separated subgroups in the projected space. Steadiness and Cohesiveness evaluate how well projections avoid False and Missing Groups, respectively (C2).
3.2 Computing Steadiness and Cohesiveness
We compute inter-cluster reliability through the following procedure: (Step 1) constructing dissimilarity matrices; (Step 2) iteratively computing partial distortions; (Step 3) aggregating the partial distortions into Steadiness and Cohesiveness. Based on the definitions of the two measures, Steadiness increases as clusters extracted from the projected space stay together more consistently in the original space. In contrast, Cohesiveness increases when clusters in the original space are maintained more consistently in the projected space. Each step is designed to satisfy all the design considerations (subsection 2.2). First, we split the workflow to handle Steadiness and Cohesiveness independently after Step 1 (C2). Step 2 exploits randomness to cover complex inter-cluster structures (C1) and inherently quantifies how well the original density and size of clusters are retained (C3). The workflow requires four functions as hyperparameters:

[noitemsep]

- Distance function for points, dist
  - Input: two points $p$ and $q$
  - Output: the distance (or dissimilarity) between $p$ and $q$
- Distance function for clusters, dist_cluster
  - Input: two clusters $C_i$ and $C_j$
  - Output: the distance (or dissimilarity) between $C_i$ and $C_j$
- Cluster extraction function, extract_cluster
  - Input: a seed point $s$
  - Output: a cluster in the projected space (for Steadiness) or the original space (for Cohesiveness) centered on $s$
- Clustering function, clustering
  - Input: a set of points
  - Output: a clustering of the input points, where the clustering takes place in the original space (for Steadiness) or the projected space (for Cohesiveness)
Two distance functions are used to compute the amount of inconsistency, while the other two functions are used for the iterative computation of partial distortions. These functions are explained in detail in subsection 3.3.
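To make the roles of the two distance functions concrete, here is a toy stand-in that uses Euclidean point distance and average linkage instead of the SNN-based defaults described later; the coords array and everything besides the names dist and dist_cluster are hypothetical:

```python
import numpy as np
from itertools import product

# Hypothetical 2D dataset; points are referenced by index, clusters are index sets.
coords = np.array([[0., 0.], [1., 0.], [10., 0.], [11., 0.]])

def dist(i, j):
    """Distance between two points (Euclidean stand-in for the SNN default)."""
    return float(np.linalg.norm(coords[i] - coords[j]))

def dist_cluster(ci, cj):
    """Distance between two clusters: average linkage over all point pairs."""
    return sum(dist(i, j) for i, j in product(ci, cj)) / (len(ci) * len(cj))
```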
3.2.1 Step 1: Constructing Dissimilarity Matrices
We begin the measurement by constructing dissimilarity matrices $D^{compress}$ and $D^{stretch}$. We first construct distance matrices $D^X$ and $D^Y$ satisfying $D^X_{ij} = \text{dist}(x_i, x_j)$ and $D^Y_{ij} = \text{dist}(y_i, y_j)$, where $x_i$ and $y_i$ denote the original and projected coordinates of input data point $i$, respectively. For dist, we used Shared-Nearest Neighbor (SNN)-based dissimilarity [ertoz2003finding] as a default (subsection 3.3). $D^X$ and $D^Y$ are then normalized by dividing all elements by their maximum elements. The raw dissimilarity matrix $D$ is obtained by subtracting the normalized $D^Y$ from the normalized $D^X$. A positive element of $D$ denotes that the distance between the corresponding point pair is compressed, and a negative element denotes that the distance is stretched. We then construct $D^{compress}$ and $D^{stretch}$, where $D^{compress}_{ij} = D_{ij}$ if $D_{ij} > 0$ and $0$ otherwise, and $D^{stretch}_{ij} = -D_{ij}$ if $D_{ij} < 0$ and $0$ otherwise.
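Step 1 can be sketched as follows, assuming the pairwise distance matrices have already been computed with dist; the function name and the simple max-normalization mirror the description above:

```python
import numpy as np

def compress_stretch_matrices(d_high, d_low):
    """Split normalized distance differences into compression and stretching
    parts (Step 1), given pairwise distance matrices in both spaces."""
    h = d_high / d_high.max()  # normalize by the maximum element
    l = d_low / d_low.max()
    raw = h - l                # > 0: pair compressed, < 0: pair stretched
    compress = np.where(raw > 0, raw, 0.0)
    stretch = np.where(raw < 0, -raw, 0.0)
    return compress, stretch
```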
3.2.2 Step 2: Iteratively Computing Partial Distortions
The next step is to iteratively compute partial distortions by randomly extracting clusters from one space and evaluating their dispersion in the opposite space. In this section, we describe how partial distortions are computed in a single iteration. Extracting random clusters. For each iteration, we first select a random seed point in the projected space (Steadiness) or the original space (Cohesiveness). The extract_cluster function then takes the random seed point as input and extracts a cluster centered on the point. The random selection of the seed point leads to the extraction of clusters from diverse locations, making it possible to cover the entire inter-cluster structure after sufficient iterations (e.g., 200 iterations for data consisting of 10,000 points) (C1). By default, we use the SNN similarity (subsection 3.3) in the extract_cluster function to gather points near the seed point. Revealing the cluster's dispersion in the opposite space. Next, we reveal how the randomly extracted cluster is dispersed in the opposite space. To do this, the clustering function takes the points of the cluster generated by extract_cluster as input, clusters them in the opposite space, and returns the set of separated clusters as output. Hierarchical DBSCAN (HDBSCAN) [campello2013density, mcinnes2017hdbscan] utilizing an SNN-based distance function is used as the default clustering function (subsection 3.3). This step also allows the metrics to measure how well the clusters reflect their original density and size (C3). If a cluster's original outliers are merged into a single cluster during MDP (False Groups distortion), either the cluster's size or its density will increase. This situation can be captured while checking the projected cluster's dispersion in the original space.
For the opposite case (Cohesiveness), if a cluster in the original space loses some of its points during MDP, either its size or its density in the projected space will be reduced. Revealing Missing Groups distortion captures this issue (subsection 6.1). Computing distortions between dispersed groups. In this step, we take the set of separated clusters as input and generate a distortion value and its weight for each pair of clusters $(C_i, C_j)$, based on the point-stretching and point-compression metrics proposed by Michaël Aupetit [aupetit2007visualizing]. We generalized point-stretching and point-compression to cluster-stretching (Steadiness) and cluster-compression (Cohesiveness) by substituting the distance between clusters for the distance between points. For each cluster pair $(C_i, C_j)$, we compute their distances $d_Y(C_i, C_j)$ and $d_X(C_i, C_j)$ in the projected space and the original space, respectively, utilizing dist_cluster. The default dist_cluster is designed by expanding the SNN-based distance function for points (subsection 3.3). We then check whether the distance is compressed or stretched and compute the distortion accordingly. The weight of a pair is determined by the sizes of the two clusters: the weights penalize the distortion of larger clusters more than that of smaller ones; thus, we can deal with inter-cluster structures consisting of clusters of various sizes (C1).
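A single Step-2 iteration can be sketched as below. Note that this is an illustrative reading only: the relative distance difference used as the distortion and the product of group sizes used as the weight are our simplifying assumptions, not the paper's exact formulas, and all names are hypothetical:

```python
import random
from itertools import combinations

def one_iteration(points, extract_cluster, clustering, d_low, d_high):
    """One Step-2 iteration (Steadiness direction): grow a cluster around a
    random seed in the projected space, re-cluster its points in the original
    space, and score every pair of the resulting groups.

    The distortion (relative distance difference) and the weight (product of
    group sizes) below are assumed forms for illustration.
    """
    seed = random.choice(points)
    cluster = extract_cluster(seed)   # cluster in the projected space
    groups = clustering(cluster)      # its dispersion in the original space
    results = []
    for ci, cj in combinations(groups, 2):
        dl, dh = d_low(ci, cj), d_high(ci, cj)
        distortion = abs(dh - dl) / max(dh, dl)  # assumed form
        weight = len(ci) * len(cj)               # assumed form
        results.append((distortion, weight))
    return results
```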
3.2.3 Step 3: Aggregating Partial Distortions
This step aggregates the iteratively computed partial distortions into Steadiness and Cohesiveness. The iterative partial distortion measurement generates a set of distortions and their corresponding weights. Let us denote the sets as follows:

$\{(\delta^S_i, w^S_i) \mid 1 \le i \le N_S\}$, where $N_S$ denotes the number of total cluster pairs generated throughout the entire partial distortion measurement of Steadiness;

$\{(\delta^C_i, w^C_i) \mid 1 \le i \le N_C\}$, where $N_C$ denotes the number of total cluster pairs generated throughout the entire partial distortion measurement of Cohesiveness.

We then calculate the final scores as follows:

Steadiness $= 1 - \sum_{i=1}^{N_S} \delta^S_i w^S_i \,/\, \sum_{i=1}^{N_S} w^S_i$;

Cohesiveness $= 1 - \sum_{i=1}^{N_C} \delta^C_i w^C_i \,/\, \sum_{i=1}^{N_C} w^C_i$.

The final scores lie in the range $[0, 1]$. The weighted average is subtracted from 1 to assign lower scores to lower-quality projections.
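The Step-3 aggregation reduces to one minus the weighted average of the collected partial distortions, which can be sketched as:

```python
def aggregate(distortions_and_weights):
    """Step 3: final score = 1 - weighted average of partial distortions,
    so a perfect projection scores 1 and heavy distortion approaches 0."""
    total_weight = sum(w for _, w in distortions_and_weights)
    weighted_sum = sum(d * w for d, w in distortions_and_weights)
    return 1.0 - weighted_sum / total_weight
```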
3.3 Designing Hyperparameter Functions
3.3.1 Parameterizing Hyperparameter Functions
The workflow of computing Steadiness and Cohesiveness requires four hyperparameter functions: dist, dist_cluster, clustering, and extract_cluster
. We parameterized these functions because both the definition of distance and the definition of clusters vary depending on the analysis goals. There are various ways to define the distance between two data points (e.g., Euclidean distance, geodesic distance, cosine similarity). The definition of clusters also varies, and thus many different clustering algorithms (e.g., K-Means [duda1973pattern], density-based clustering [ester1996density], Mean Shift [comaniciu2002mean]) exist. Therefore, it is unreasonable to use a fixed definition for either. This is in line with the fact that, as there are various ways to define the similarity between each point and its local neighbors, there are diverse local metrics that utilize different similarity definitions. However, parameterization could reduce the metrics' interpretability. Thus, we designed default hyperparameter functions that align with our design considerations to allow users to easily understand and use our metrics.
3.3.2 Default Hyperparameter Functions
To design the default functions, we first set the definitions of distance and cluster. We defined distance as the dissimilarity of points based on the Shared-Nearest Neighbor (SNN) [ertoz2003finding] similarity, which assigns a high similarity to point pairs sharing more nearest neighbors. SNN-based dissimilarity was selected because Steadiness and Cohesiveness should reflect the inter-cluster structure of the original high-dimensional space. Although it is common to use a $k$NN graph to represent a high-dimensional space [van2014accelerating, tang2016visualizing], $k$NN's ability to describe the structure of data decreases as dimensionality grows [beyer1999nearest, hinneburg2000nearest]. SNN-based dissimilarity tackles this issue, as the similarity of two points is robustly confirmed by their shared neighbors, thus better representing the structure of high-dimensional spaces compared to $k$NN [ertoz2002new, liu2018shared]. We also defined a cluster as a contiguous data region, or a manifold with an arbitrary shape, whose density is higher than that of its surroundings. This definition follows that of density-based clustering algorithms. We used this definition because the metrics should capture the complex and intertwined inter-cluster structure consisting of clusters of various sizes and shapes (C1), and therefore should be able to define clusters flexibly. We designed the default hyperparameter procedures to satisfy both these definitions and the original design considerations (subsection 2.2). Distance function for points, dist. As mentioned, the distance function is based on SNN similarity. Let us first denote the $k$ nearest neighbors of a point $p$ as $p_1, \dots, p_k$, in order. The SNN similarity between two points $p$ and $q$ is defined as $\text{SNN}(p, q) = \sum_{(i, j) \in A} (k + 1 - i)(k + 1 - j)$, where $A$ is the set of rank pairs $(i, j)$ satisfying $p_i = q_j$. $\text{SNN}(p, q)$ increases when more nearest neighbors with high ranks overlap. We consecutively normalized all SNN similarity values by dividing them by the maximum SNN similarity max_sim of the dataset. Finally, we defined the distance function dist by converting the normalized similarity into a dissimilarity.
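A minimal sketch of a rank-weighted SNN similarity in the spirit of the formulation above (0-based ranks in code, so the weight $(k+1-i)(k+1-j)$ over 1-based ranks becomes $(k-i)(k-j)$; the helper name is ours):

```python
def snn_similarity(nn_a, nn_b):
    """Rank-weighted shared-nearest-neighbor similarity between two points,
    given their ordered kNN lists: each shared neighbor contributes more
    when it is highly ranked in both lists."""
    k = len(nn_a)
    rank_b = {p: j for j, p in enumerate(nn_b)}  # neighbor -> 0-based rank
    score = 0
    for i, p in enumerate(nn_a):
        if p in rank_b:
            score += (k - i) * (k - rank_b[p])   # 0-based rank weighting
    return score
```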
We applied a reciprocal transformation [tan2016introduction] to the normalized similarity to further penalize low similarity, where a parameter $\alpha$ controls the amount of penalization; a fixed value of $\alpha$ is used as the default. Distance function for clusters, dist_cluster. For dist_cluster, we first defined the similarity between clusters and converted it to a distance. We used average linkage [murtagh2012algorithms], as it is robust to outliers compared to competitors (e.g., single linkage); we thus defined the similarity of two clusters $C_i$ and $C_j$ as $\text{sim}(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{p \in C_i} \sum_{q \in C_j} \text{SNN}(p, q)$, where $|C_i|$ and $|C_j|$ denote the number of points in $C_i$ and $C_j$. We then defined the distance between $C_i$ and $C_j$ by applying the same reciprocal transformation used in dist. Clustering function, clustering. As our definition of cluster is the one used in conventional density-based clustering, designing clustering required a single decision: selecting a proper density-based clustering algorithm. We selected HDBSCAN, a state-of-the-art density-based clustering algorithm. As HDBSCAN can handle clusters of various shapes and densities and is robust to noise (outliers) [mcinnes2017accelerated], exploiting it helps reveal the dispersion of clusters regardless of the clusters' characteristics (e.g., shape, size, or density). Therefore, it helps the metrics deal with complex inter-cluster structures (C1). HDBSCAN also tackles the curse of dimensionality [vijendra2011efficient], which suits our metrics' need to consider the higher-dimensional space. To align clustering with our dissimilarity definition, our HDBSCAN utilizes dist for the distance calculation. Cluster extraction function, extract_cluster. The design of extract_cluster mainly follows a density-based clustering process, aligned with clustering; it uses a random seed point as the sole core point and successively assigns nearby points, which are treated as non-core points, to form a cluster. In detail, the function traverses the seed point's nearest neighbors and includes each neighbor point as a cluster member with a probability proportional to its SNN similarity to the current point (normalized by max_sim). When a neighbor point is determined to be a cluster member, it enters a queue so that its own neighbors can also be traversed later. Adding neighbors stochastically makes the extracted clusters form a dense structure instead of spanning the entire $k$NN graph. To diversify the size of the extracted clusters, we limited the number of traversals starting from the seed point and allowed repeated visits. Combined with the random starting seed point, this strategy enriches the range that our metrics cover, thus helping the metrics deal with complex inter-cluster structures (C1). The strategy fundamentally relies on the fact that randomness can help analyze a complex, uncertain system [tempo2012randomized]. We fixed the number of traversals to 40% of the total number of data points for our evaluations (sections 5, 6).
3.4 Visualizing Steadiness and Cohesiveness
To overcome the limitation that metrics describe the overall distortion in only one or two numeric values, we developed a complementary visualization: the reliability map (Figures 4, 5, 6). The reliability map reveals how and where inter-cluster distortion occurred by showing Steadiness and Cohesiveness at each point. The point-wise distortion is obtained by aggregating the partial distortions computed throughout the iterative process (subsubsection 3.2.2), and the map shows these point-wise distortions embedded within the projection. Recall that the iterative computation results in a set of distortions and weights between pairs of clusters $(C_i, C_j)$. For every pair, we register each point of $C_i$ to each point of $C_j$ with the corresponding distortion strength, and do the same in the opposite direction. Duplicated registrations of a point are merged by averaging their distortion strengths. We compute each point's approximated local distortion by summing up the registered distortion strengths. The reliability map visualizes these point-wise distortions through edge-based distortion encoding. We constructed a $k$NN graph in the projection and made each edge of the graph depict the sum of its two incident points' distortions. If the points within a narrow region have high distortion, the edges between the points will be intertwined in the region (e.g., red dotted contours in Figures 5, 6); they will be recognized as clusters with distinguishable inter-cluster distortion. However, using a large $k$ might generate visual clutter; we empirically found that a $k$ between 8 and 10 is an adequate choice for both expressing inter-cluster distortion and avoiding visual clutter. Martins et al.'s point cloud distortion visualization [martins2014visual] is similar to ours, but it computes the distortion value at each pixel instead of encoding it on edges.
To express the False Groups and Missing Groups distortion types simultaneously, we used CheckViz's two-dimensional color scale [lespinats2011checkviz] (lower right corner of Figure 5). Following the color scheme of CheckViz, we assigned purple to edges with False Groups distortion and green to edges with Missing Groups distortion. Edges with no distortion are represented as white, while black edges indicate that both distortion types occurred together. We also implemented a cluster selection interaction (e.g., lower right box in Figure 4) to allow users to identify Missing Groups distortion more precisely. After users select a cluster $S$ by making a lasso with the mouse, the reliability map constructs $R = \bigcup_{p \in S} r(p)$, where $r(p)$ denotes the set of registered points of a point $p$. Subsequently, the edges connected to the points in $R$ are highlighted in red. Each highlighted edge's opacity encodes the sum of the distortion strengths of its incident points toward $S$ (i.e., how much the distances between its incident points and $S$ are stretched).
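The registration scheme behind the reliability map can be sketched as follows, with each record holding two point sets and their pair's distortion strength. This is a simplified reading of the description above; the data layout and names are ours:

```python
from collections import defaultdict

def pointwise_distortion(pair_records):
    """Approximate each point's local distortion: register every cluster-pair
    distortion strength to the member points of both clusters (in both
    directions), average duplicate registrations per (point, partner) pair,
    and sum the averaged strengths per point."""
    registered = defaultdict(list)  # (point, partner) -> list of strengths
    for ci, cj, strength in pair_records:
        for p in ci:
            for q in cj:
                registered[(p, q)].append(strength)
                registered[(q, p)].append(strength)
    per_point = defaultdict(float)
    for (p, _), strengths in registered.items():
        per_point[p] += sum(strengths) / len(strengths)  # average duplicates
    return dict(per_point)
```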
4 Implementation
Steadiness and Cohesiveness are written in Python with an interface that lets users and programmers easily implement and use user-defined hyperparameter functions. This is to facilitate the later development and verification of possible alternatives to Steadiness and Cohesiveness. The partial distortion computation is parallelized with CUDA GPGPU [nickolls2008scalable] supported by Numba [lam2015numba]. We implemented the reliability map in JavaScript using D3.js [bostock2011d3]. The source code of the metrics and the map is available at github.com/hjn/steadinesscohesiveness and github.com/hjn/sncreliabilitymap, respectively.
5 Quantitative Evaluations and Discussions
We evaluated how well Steadiness and Cohesiveness quantify inter-cluster reliability by comparing them with existing local distortion metrics. We verified that our metrics capture inter-cluster reliability well, while previous local metrics miss some cases even in the presence of apparent distortions. The reliability map further confirmed that our metrics accurately captured where and how the inter-cluster distortion occurred. Moreover, we evaluated our metrics' robustness by testing simpler hyperparameter functions (subsection 3.3). As baseline metrics, we chose T&C and MRREs (subsubsection 1.1.2), the two representative local metrics that measure nearest-neighbor preservation. We chose these two for comparison because 1) they were designed to measure Missing and False Neighbors, the point-wise counterparts of Missing and False Groups, and 2) nearest-neighbor preservation has previously been used as the core criterion for evaluating MDP techniques [pezzotti2016hierarchical, van2014accelerating, fu2019atsne, lee2011shift, Moor19Topological]. For MRREs, in this section we use "MRRE [Missing]" for the variant that quantifies Missing Neighbors and "MRRE [False]" for the variant that quantifies False Neighbors.
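For reference, the T&C baselines can be computed directly from neighborhood ranks. The following is a plain-NumPy sketch of Venna and Kaski's formulation, not the implementation used in our experiments.

```python
import numpy as np

def _ranks(D):
    """ranks[i, j] = 0-based position of j in i's sorted distances."""
    order = np.argsort(D, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(D.shape[0])[:, None]
    ranks[rows, order] = np.arange(D.shape[1])[None, :]
    return ranks

def trustworthiness_continuity(X, Y, k):
    """Trustworthiness penalizes False Neighbors (points in the k-NN of
    the projection Y but not of the original space X); Continuity
    penalizes Missing Neighbors (the opposite). Requires k < n / 2."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n = len(X)
    DX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    DY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    np.fill_diagonal(DX, np.inf)
    np.fill_diagonal(DY, np.inf)
    rX, rY = _ranks(DX), _ranks(DY)
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    t = c = 0.0
    for i in range(n):
        nnX = set(np.argsort(DX[i])[:k].tolist())
        nnY = set(np.argsort(DY[i])[:k].tolist())
        t += sum(rX[i, j] - k + 1 for j in nnY - nnX)  # false neighbors
        c += sum(rY[i, j] - k + 1 for j in nnX - nnY)  # missing neighbors
    return 1.0 - norm * t, 1.0 - norm * c
```

An identical embedding yields (1.0, 1.0); any neighborhood violation lowers the corresponding score.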
5.1 Sensitivity Analysis
We conducted four experiments to check whether Steadiness and Cohesiveness can sensitively measure inter-cluster reliability. We designed the first two experiments (A, B) to evaluate our metrics' ability to quantify inter-cluster distortion using projections with synthetically generated False Groups (Experiment A) or Missing Groups (Experiment B) distortions. The next two experiments (C, D) investigated whether our metrics can properly assess overall differences in the inter-cluster reliability of projections.
5.1.1 Experimental Design
Experiment A: Identifying False Groups. The goal of the first experiment was to evaluate whether and how Steadiness and the previous local metrics (Continuity, MRRE [False]) identify False Groups. We first generated high-dimensional data consisting of six 100-dimensional spheres whose centers were equidistant from the origin, each consisting of 500 points. We then set the initial 2D projection of the dataset as six circles around the origin (the first projection in the first row of Figure 2). Note that this projection is the most faithful view of the original data, as we made each circle correspond to one high-dimensional sphere. To simulate False Groups distortion, we then distorted this ground-truth projection by overlapping the circles in pairs (the first row of Figure 2). For each pair of circles centered at c1 and c2, respectively, we adjusted the degree of overlap by gradually decreasing the angle between c1 and c2 around the origin. For each projection, we measured Steadiness and Cohesiveness (500 iterations), T&C, and MRREs. For soundness, we ran the baseline metrics with several different numbers of nearest neighbors and used the mean of the results as the final score. Experiment B: Identifying Missing Groups. To evaluate the ability of Cohesiveness and the previous local metrics (Trustworthiness, MRRE [Missing]) to measure Missing Groups distortion, we used the same high-dimensional dataset as in Experiment A; this time, however, we synthesized the initial projection as 12 equally distant circles, each consisting of 250 points. We made each pair of nearby circles correspond to a single sphere in the original space (the second row of Figure 2). We then overlapped each pair of circles by gradually decreasing the angle between them (the second row of Figure 2). Note that unlike Experiment A, the initial projection is the least faithful projection and becomes more faithful as the circles in each pair overlap more. We used the same metric settings as in Experiment A.
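The synthetic setup of Experiments A and B can be sketched as below. The sphere radius, the center distance, and the axis-aligned placement of centers are illustrative assumptions where the text does not fix exact values.

```python
import numpy as np

def sphere_points(center, radius, n, rng):
    """Sample n points uniformly on a hypersphere surface."""
    v = rng.normal(size=(n, len(center)))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return np.asarray(center, float) + radius * v

def high_dim_spheres(n_spheres=6, dim=100, n_points=500,
                     center_dist=10.0, radius=1.0, seed=0):
    """100-dimensional spheres whose centers are equidistant from the
    origin (here placed on distinct coordinate axes)."""
    rng = np.random.default_rng(seed)
    data = []
    for i in range(n_spheres):
        center = np.zeros(dim)
        center[i] = center_dist
        data.append(sphere_points(center, radius, n_points, rng))
    return np.vstack(data)

def circle_projection(n_circles=6, n_points=500, center_dist=10.0,
                      radius=1.0, shift=0.0):
    """Circles around the origin; `shift` rotates every odd circle
    toward its even partner to simulate increasing overlap."""
    pts = []
    for i in range(n_circles):
        theta = 2.0 * np.pi * i / n_circles - (shift if i % 2 else 0.0)
        c = center_dist * np.array([np.cos(theta), np.sin(theta)])
        phi = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
        pts.append(c + radius * np.column_stack([np.cos(phi), np.sin(phi)]))
    return np.vstack(pts)
```

Sweeping `shift` from 0 toward the inter-circle angle reproduces the gradual overlap of the circle pairs.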
Experiment C: Capturing quality degradation. To test our metrics' ability to capture the quality degradation of a projection, we computed our metrics and the previous metrics on projections with different levels of degradation. We first created a 2D t-SNE projection of the MNIST dataset [lecun1998mnist] (the first projection in the third row of Figure 2). We then replaced a certain proportion of the projected points with random points, varying the replacement rate from 0 to 100% with an interval of 5% (the third row of Figure 2). The inter-cluster reliability of the projections certainly degrades as the replacement rate increases; we checked whether the metrics capture this degradation. We used the same metric settings as in Experiment A. Experiment D: Identifying the effect of projection hyperparameters. The final experiment evaluated the capability of our metrics to capture inter-cluster reliability differences caused by the hyperparameter choices of an MDP technique. This experiment was inspired by an analysis in the UMAP paper [mcinnes2018umap] in which the authors assessed the impact of a hyperparameter, the number of nearest neighbors (n_neighbors), on projection quality. Lower values drive UMAP toward preserving local structure, while higher values make the projection preserve global structure rather than local details. In the original analysis, the authors qualitatively analyzed how the hyperparameter affects the UMAP projection of randomly sampled 3-dimensional RGB cube data. They concluded that since randomly sampled data has no manifold structure, larger values generate more appropriate projections than lower values; lower values instead treat the noise from random sampling as fine-scale local manifold structure, generating an unreliable interpretation of the structure [mcinnes2018umap]. We tested whether our metrics and the previous metrics can quantitatively reproduce this conclusion.
We first constructed a dataset of 4,000 points randomly sampled from a 3-dimensional RGB cube. UMAP projections of the dataset with different n_neighbors values were then generated (the fourth row of Figure 2) and tested with the same metric settings as in Experiment A. We set another hyperparameter of UMAP, min_dist, to 0.0 because higher min_dist values tune projections to lose local structure, reducing the difference between the projections generated with higher and lower n_neighbors values; setting it to 0.0 prevents this effect from affecting the experiment.
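The sweep of Experiment D can be organized as below. The projector is kept injectable so the sketch runs without umap-learn; with it installed, `project` would be e.g. `lambda X, k: umap.UMAP(n_neighbors=k, min_dist=0.0).fit_transform(X)` (an assumed caller-side choice, not part of our pipeline).

```python
import numpy as np

def rgb_cube_sample(n=4000, seed=0):
    """Random sample from a 3-dimensional RGB cube, coordinates in [0, 1]."""
    return np.random.default_rng(seed).uniform(size=(n, 3))

def sweep_n_neighbors(data, project, n_neighbors_values):
    """Generate one projection per n_neighbors value.

    `project(data, n_neighbors)` is any callable returning a 2-D
    embedding; keeping it injectable also makes the sweep testable
    without an MDP library installed.
    """
    return {k: project(data, k) for k in n_neighbors_values}
```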
5.1.2 Results
Experiment A: As we decreased the angle between each circle pair (i.e., increased the amount of false overlap), both Steadiness and Cohesiveness decreased. The baseline local metrics (Trustworthiness, Continuity, MRRE [Missing], and MRRE [False]) also decreased, but their slopes were statistically gentler than those of our metrics (Figure 3A). Experiment B: As we decreased the angle between each circle pair (i.e., made the projections more faithful), Cohesiveness drastically increased around the point where the circles in each pair start to overlap. The other measures (Steadiness, Trustworthiness, Continuity, MRRE [Missing], and MRRE [False]) did not change significantly (Figure 3B). Experiment C: As the replacement rate increased, Steadiness, Cohesiveness, Trustworthiness, Continuity, MRRE [Missing], and MRRE [False] all decreased (Figure 3C). Experiment D: As we increased n_neighbors, both Steadiness and Cohesiveness increased, while Trustworthiness and MRRE [Missing] decreased. Continuity and MRRE [False] increased, though their slopes were statistically gentler than that of Steadiness. All baseline local metrics saturated near the maximum score early (Figure 3D).
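The slope comparisons above come from simple linear regressions of each metric's scores against the varied parameter. A minimal sketch follows; significance testing (e.g., via `scipy.stats.linregress`) is omitted.

```python
import numpy as np

def regression_slope(x, scores):
    """Least-squares slope and R^2 of metric scores against the varied
    parameter (angle, replacement rate, or n_neighbors)."""
    x = np.asarray(x, float)
    y = np.asarray(scores, float)
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - float(np.sum(residual ** 2)) / ss_tot if ss_tot else 1.0
    return float(slope), r2
```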
5.1.3 Discussion
The result of Experiment A suggests that our metrics can identify the loss of inter-cluster reliability caused by False Groups distortion, as Steadiness decreased when the overlap of circle pairs increased. Cohesiveness also decreased, which means that not only False Groups but also Missing Groups distortion occurred. This is because, for two points p and q in a circle, the SNN similarity between them decreases as more points intervene between them while the circle overlaps with another circle, even though their Euclidean distance is maintained. Continuity and MRRE [False] also captured the decrease in inter-cluster reliability due to False Groups distortion, but more slowly than our metrics. For Experiment B, the result confirms that Cohesiveness correctly identifies Missing Groups distortion, as the metric increased along with the increasing overlap of the circles, which reduces Missing Groups distortion. Moreover, in Experiment B, the amount of Missing Groups distortion was captured only by Cohesiveness, which shows that our metrics can pinpoint a particular inter-cluster distortion type. In contrast, both Trustworthiness and MRRE [Missing] failed to capture this apparent Missing Groups distortion. The reliability map further confirms the results of Experiments A and B, as it showed that Steadiness and Cohesiveness accurately identified where False Groups and Missing Groups distortion occurred (Figure 4). The reliability map located the False Groups distortion of Experiment A by highlighting the overlapped area in purple. For Experiment B, it identified the Missing Groups relationship between the two separated circles in a pair through the cluster selection interaction: selecting a portion of one circle revealed that the other circle was actually close to it, which matches the ground truth. In contrast, CheckViz, which visualizes the False and Missing Neighbors distortion of each point as computed by T&C, did not show any pattern.
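The SNN effect described above is easy to see in code. The following is a textbook shared-nearest-neighbor count, a simplification of the SNN-based similarity our metrics use.

```python
def snn_similarity(a, b, knn):
    """Shared-nearest-neighbor similarity: the number of neighbors the
    two points have in common, given `knn[p]` = p's k nearest neighbors.
    Intervening points push a and b out of each other's neighbor lists,
    shrinking the overlap even if their Euclidean distance is unchanged."""
    return len(set(knn[a]) & set(knn[b]))
```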
In Experiments C and D, both Steadiness and Cohesiveness captured the decrease (Experiment C) and the increase (Experiment D) in inter-cluster reliability. Moreover, Experiment D showed that our metrics can also quantify the effect of a hyperparameter, reproducing the result of human observers' qualitative analysis [mcinnes2018umap]. In contrast, the local metrics barely captured the evident increase of inter-cluster reliability in Experiment D. Overall, the experiments demonstrated that our metrics properly measure inter-cluster reliability, whereas the local metrics failed in some cases even with apparent inter-cluster distortion.


Table 1: Slopes of the regression between the n_neighbors value and the scores of Steadiness (St) and Cohesiveness (Co), for each combination of clustering hyperparameter function (HDBSCAN, X-Means, and K-Means with varying numbers of clusters) and distance measurement (SNN-based, Euclidean). Every regression analysis result was statistically significant.
5.2 Robustness Analysis
We also investigated the robustness of Steadiness and Cohesiveness against their hyperparameters by repeating Experiment D with different hyperparameter functions. Since hyperparameter functions can considerably change the behavior of our metrics, we tested Steadiness and Cohesiveness with simpler ones. For clustering, we tested simpler algorithms, X-Means [pelleg2000x] and K-Means [duda1973pattern] with varying numbers of clusters, instead of the default HDBSCAN. We also tested the Euclidean distance as dist instead of the default SNN-based distance. When using the Euclidean distance between points, we additionally defined the distance between two clusters, dist_cluster, as the Euclidean distance between their centroids instead of the default SNN-similarity-based definition, to align with dist. For extract_cluster, we treated every traversed point as a full cluster member instead of weighting points with high SNN similarity by probability. As a result (Table 1), Steadiness and Cohesiveness with simpler clustering hyperparameter functions both increased as the n_neighbors value increased, confirming their ability to properly quantify inter-cluster reliability. This result shows that our metrics' capability is not bound mainly to the choice of clustering but instead originates more from the power of randomness in analyzing complex structures [tempo2012randomized]. Interestingly, K-Means with the largest number of clusters showed the results most similar to those of the default HDBSCAN setting for both Steadiness and Cohesiveness. This is because, when clustering the extracted clusters in the opposite space, an inter-cluster structure composed of arbitrary shapes and sizes is better represented by a fine-grained K-Means clustering result than by a coarse-grained one. However, Cohesiveness failed in all cases that used the Euclidean distance as dist.
This result shows that, when designing hyperparameter functions for Steadiness and Cohesiveness, users should carefully consider the definition of the distance between two points.
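The swapped-in hyperparameter functions of this analysis can be sketched as follows; the signatures are illustrative, not the released API.

```python
import numpy as np

def euclidean_dist(p, q):
    """dist: plain Euclidean point distance (replacing the SNN-based one)."""
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))

def centroid_cluster_dist(cluster_a, cluster_b):
    """dist_cluster aligned with the Euclidean dist: the distance
    between the two clusters' centroids."""
    ca = np.asarray(cluster_a, float).mean(axis=0)
    cb = np.asarray(cluster_b, float).mean(axis=0)
    return float(np.linalg.norm(ca - cb))

def extract_cluster_all(traversed_points):
    """Simplified extract_cluster: every traversed point becomes a full
    member (weight 1.0) instead of an SNN-similarity-based probability."""
    return {p: 1.0 for p in traversed_points}
```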
6 Case Studies
We report two case studies conducted with two ML engineers (E2, E3). During the studies, we demonstrated how Steadiness, Cohesiveness, and the reliability map work, and the engineers explored with us the original inter-cluster structure of the MNIST and Fashion-MNIST [xiao2017fashion] test datasets, both of which live in a 784-dimensional space and consist of 10 classes. The case studies showed that our metrics and the reliability map support users in 1) selecting projection techniques or hyperparameter settings that match the dataset and 2) preventing misinterpretations that could otherwise occur while conducting inter-cluster tasks (subsection 2.1). The ML engineers agreed that such support is helpful in interpreting the inter-cluster structure of high-dimensional data.
6.1 MNIST Exploration with Diverse MDP Techniques
To explore the inter-cluster structure of MNIST, we projected it with t-SNE, UMAP, PCA [pearson1901liii], Isomap [tenenbaum2000global], and LLE. We measured the Steadiness and Cohesiveness (500 iterations) of each projection and visualized the results with the reliability map (Figure 5). We first found that visualizing Steadiness and Cohesiveness can prevent users from misidentifying a cluster separation in the original space (T1). For instance, in the Isomap projection, we found a region with high False Groups distortion consisting of categories #4 and #7 (red dotted circle in Figure 5). A similar region was also observed in the PCA projection (Appendix B). LLE also has a cluster with high False Groups distortion composed of categories #3, #6, #8, and #9. Without checking for False Groups distortion, one could wrongly conclude that such a region belongs to a single cluster; visualizing the distortion with the reliability map helped us avoid this misperception. Visualizing False Groups distortion also enabled reasoning beyond a mere quantitative score comparison when choosing a projection technique. We found that the False Groups distortions that occurred in Isomap and PCA (overlap of categories #4 and #7) and in LLE (overlap of categories #3, #6, #8, and #9) did not occur in t-SNE or UMAP. This finding explains why the Steadiness of t-SNE and UMAP is higher than that of the other projections, advocating the use of t-SNE and UMAP for exploring the inter-cluster structure of the MNIST dataset. Still, as the t-SNE and UMAP projections also suffered from Missing Groups distortion, we critically interpreted that the clusters in these projections actually lie closer to each other than they look. E3 noted that this interpretation matches the ground truth that digits in MNIST lie much closer together and more mixed in the original space than in their projected representations.
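Scoring several techniques side by side only requires wrapping each projector behind a common interface. A sketch with a minimal NumPy PCA follows (t-SNE, UMAP, Isomap, and LLE would come from scikit-learn or umap-learn); `evaluate` stands in for a metric such as Steadiness and is an assumption of the sketch.

```python
import numpy as np

def pca_2d(X):
    """Project onto the top two principal components via SVD."""
    Xc = np.asarray(X, float)
    Xc = Xc - Xc.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def compare_projections(X, projectors, evaluate):
    """Score every technique with the same metric callable, where
    `evaluate(X, Y)` returns a float for projection Y of data X."""
    return {name: evaluate(X, proj(X)) for name, proj in projectors.items()}
```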
Moreover, we found that the cluster selection interaction allows users to more accurately estimate and compare cluster sizes and shapes (T3). When we selected a local area in LLE (blue dashed ellipse in Figure 5), the reliability map highlighted a much larger region around it (black long-dashed contour in Figure 5). This means that the original cluster containing the selected points was much larger than it appears in the projection and lost a portion of its points as dispersed outliers (i.e., Missing Groups distortion occurred). We identified this problem through the cluster selection interaction and thus avoided the misinterpretation.
6.2 Fashion-MNIST Exploration with t-SNE
In the second case study, we explored the Fashion-MNIST dataset using t-SNE projections with varying hyperparameters. We measured our metrics (500 iterations) on t-SNE projections generated with different perplexity values and visualized the results with the reliability map (Figure 6). Note that a higher perplexity value makes t-SNE focus more on preserving global structure [wattenberg2016how]. We found that our metrics and the reliability map can help in selecting adequate hyperparameter settings. For example, the projection with the lowest perplexity had both False and Missing Groups distortion distributed uniformly across the entire projection space. This finding, which aligns with that projection's low scores, showed that low perplexity values are insufficient to capture the global inter-cluster structure, matching t-SNE's actual behavior and justifying the selection of a higher perplexity value for investigating Fashion-MNIST. The fact that the projections with the two highest perplexity values earned higher scores for both Steadiness and Cohesiveness than the others strengthens this interpretation. We therefore analyzed these two projections further and discovered that our metrics prevent users from misinterpreting the relations between clusters (T2). We first noticed that one projection has more compact clusters, while in the other the clusters are slightly more dispersed and closer to each other. As the former projection achieved a relatively high Steadiness score, we could conclude that each compact cluster also exists as a cluster in the original space. However, as the latter projection earned a higher Cohesiveness score, we could not trust the separation of the clusters depicted in the former projection (T21). According to Cohesiveness, the distances between the clusters in the original space are better depicted in the latter projection.
Therefore, it is more reliable to interpret the original inter-cluster structure as a set of subclusters constituting one large cluster rather than as a set of separated clusters (T22). E2 paid particular attention to this result. She pointed out that it is common to assume that projections with well-divided clusters better reflect the inter-cluster structure, but this result shows that such a perception can lead to a misinterpretation of the inter-cluster structure.
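The selection rationale above can be operationalized as a small sweep. With scikit-learn, `project` would be e.g. `lambda X, p: TSNE(perplexity=p).fit_transform(X)` (an assumed caller-side choice); picking the best setting by the two metrics is sketched as one simple rule among several possible ones.

```python
def perplexity_sweep(X, project, evaluate, perplexities):
    """Project with each perplexity and score the result, where
    `evaluate(X, Y)` returns a (steadiness, cohesiveness) pair."""
    return {p: evaluate(X, project(X, p)) for p in perplexities}

def pick_perplexity(scores):
    """Choose the perplexity with the best mean of the two metrics;
    one simple way to turn the scores into a hyperparameter choice."""
    return max(scores, key=lambda p: sum(scores[p]) / len(scores[p]))
```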
7 Conclusion and Future Work
Although investigating inter-cluster distortion is important in many MDP tasks, no previous metrics directly measured such distortion. In this work, we first surveyed user tasks related to identifying the inter-cluster structure in MDP and elicited design considerations for metrics that evaluate inter-cluster reliability. We then presented Steadiness and Cohesiveness, which evaluate inter-cluster reliability by measuring False Groups and Missing Groups distortions, and the reliability map, which visualizes the metrics. Through quantitative evaluations, we validated that our metrics adequately measure inter-cluster reliability. The qualitative case studies showed that our metrics can also help users select proper projection techniques or hyperparameter settings and perform inter-cluster tasks with fewer misperceptions, assisting them in interpreting the original space's inter-cluster structure. As future work, we plan to enhance the scalability of our algorithm. The algorithm currently computes the iterative partial distortion measurements sequentially; since each iteration works independently, we plan to accelerate the algorithm by leveraging multiprocessing. We also plan to improve our metrics to consider the hierarchical aspect of inter-cluster structures and to reduce the number of hyperparameters. Another interesting research direction is to investigate how Steadiness, Cohesiveness, and their visualizations affect users' perception of the original data, which would provide an in-depth understanding of MDP as an expansion of our case studies.