 # Capacity Preserving Mapping for High-dimensional Data Visualization

We provide a rigorous mathematical treatment of the crowding issue that arises in data visualization when high dimensional data sets are projected down to low dimensions. By properly adjusting the capacity of high dimensional balls, our method makes just enough room to prepare for the embedding. A key component of the proposed method is an estimation of the correlation dimension at various scales, which reflects the variation of the data density. The proposed capacity adjustment applies to any distance (Euclidean, geodesic, diffusion) and can potentially be used in many existing methods to mitigate the crowding during dimension reduction. We demonstrate the effectiveness of the new method using synthetic and real datasets.


## 1 Introduction

Visualizing high dimensional data via low dimensional projections is particularly useful in helping data analysts understand their datasets, detect underlying data patterns and create various hypotheses about the data. To achieve this goal, we project the high dimensional dataset down to $\mathbb{R}^s$ with $s = 2$ or $3$ and visualize the low dimensional embedding via a single scatter plot. Designing a mapping that yields a reliable visualization is the focus of this paper. A good design should take into account the special needs of data visualization: 1) as most human beings can only digest visual information in at most three dimensions (or four dimensions if video sequences are included), the map is required to project data of any dimensionality down to 2 or 3; 2) as revealing the data pattern is the major objective, the map should be able to preserve the geometrical structure of the data.

Most existing dimensionality reduction techniques (e.g., MDS , LLE , Isomap ) are only designed to reduce the data to its intrinsic dimension, which is usually higher than three. While data visualization methods such as SNE , t-SNE , PHATE  and UMAP  can reduce data of any dimension to two or three, geometric relations such as cluster radii and relative distances between clusters are usually lost. We hereby formulate the main mathematical question in data visualization: how to map datasets with a wide range of dimensionality to 2 or 3 dimensions while minimizing the geometric distortion?

The geometric distortion we care about in this paper is the relative closeness/similarity between points. It is considered successfully preserved if close points remain close and far away points remain far away. The main obstacle in preserving the geometric relation is the so-called crowding issue. Simply put, the crowding issue arises from the fact that a higher dimensional body typically has a larger capacity than a lower dimensional one, hence reducing the dimensionality causes points to be crowded.

We propose a way to adjust the capacity of the high dimensional body before the dimension reduction. The method named Capacity Preserving Mapping (CPM) is essentially a class of methods that generalize many existing methods by redefining the distance they use. Compared to the popular methods such as tSNE and UMAP, our method can better preserve geometrical structures of the dataset and does not presume existences of clusters.

Our contribution is twofold: 1. we propose a way to define a dimension-aware distance to treat the crowding issue, which greatly assists high dimensional data visualization by better preserving geometrical structures of the dataset. 2. we propose a method to compute the multi-scale intrinsic dimensionality of a dataset that is necessary for the definition of the new distance in 1.

We note that although previous methods (SNE, t-SNE, PHATE, UMAP) also treat the crowding problem to some extent, they do so in a less rigorous manner.

## 2 Related work

Data visualization is an important task in data mining and is closely related to dimensionality reduction, graph learning, and data clustering. A number of early works studied the so-called table data visualization problem (see the review article ) that visualizes the attributes of data in a table using 2D or 3D plots. This is made possible either by using multiple pixels (or attributes) in the low dimension to represent one data point in the high dimension, or by using multiple plots from different angles to build up the high dimensional image. The drawbacks of these methods are that the visualization is not directly meaningful, needs human effort to understand and interpret, and the relation between data points (such as whether clusters exist or how close they are) is not immediately apparent.

In a separate line of research, various dimension reduction methods  can be used to visualize high dimensional data in one scatter plot. The methods can be categorized as linear ones (PCA, MDS , ICA …) and nonlinear ones (LLE , non-metric MDS , Isomap , Laplacian Eigenmap , Diffusion map , etc). As mentioned earlier, these methods can only reliably map the data down to its intrinsic dimension, while visualization requires the target dimension to be less than four. Hence most good visualization results are observed on artificial datasets with small intrinsic dimensions (e.g., swiss roll (2D), Helix (1D), Twin peaks (2D)).

The class of methods that are most relevant to ours includes SNE , tSNE , UMAP  and PHATE . They are directly designed for data visualization and able to reduce the dimension from any to two or three. However, in all these methods, a rigorous treatment of the crowding issue is missing. The purpose of this paper is to fill in this gap by proposing a distance correcting step that can work with any metric. We achieve this goal by a careful computation of the correlation dimension at various scales.

## 3 Motivation - the crowding phenomenon

As stated in , when high dimensional data is mapped to low dimensions, there is a tendency for non-overlapping groups to overlap. Theoretically, this is due to the difference in norm concentration between high and low dimensions: a ball in higher dimensions has a volume that grows faster with the radius, $\mathrm{vol}(B^n(r)) \propto r^n$, where $B^n(r)$ is the ball in $\mathbb{R}^n$ with radius $r$. As a result, the high dimensional points are more likely to be distributed near the surface. When mapped to low dimensions, since there is less room near the surface, the points are pushed towards the center.
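The volume growth $\mathrm{vol}(B^n(r)) \propto r^n$ makes this concentration easy to quantify. The following sketch (our own illustration, not code from the paper) computes the fraction of a unit ball's volume lying within 0.1 of its surface:

```python
# Fraction of the unit n-ball's volume in the thin shell [0.9, 1].
# Since vol(B^n(r)) is proportional to r^n, the shell fraction is 1 - 0.9^n.

def shell_fraction(n, inner=0.9):
    """Fraction of the unit n-ball's volume lying outside radius `inner`."""
    return 1.0 - inner ** n

for n in (2, 5, 20):
    print(n, round(shell_fraction(n), 3))
```

Already at $n = 20$, about 88% of the volume sits in the outermost 10% of the radius, which is why high dimensional samples concentrate near the boundary.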

Let us visualize the crowding phenomenon observed during dimension reduction via Multidimensional Scaling (MDS). Assume a high dimensional ball contains two classes of points: Class 1 lies inside the ball and Class 2 lies in a spherical shell right outside Class 1. When the ball is two dimensional, a direct visualization (Figure 1(a)) shows the correct relation between the two classes. When the ball is five dimensional and is mapped down to 2D via MDS, its geometrical structure is distorted (Figure 1(b)) and we see a severe crowding phenomenon: the second class is pushed towards the center. Finding a way to correct this type of distortion is our main objective.
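This experiment is easy to reproduce in a few lines. The sketch below is our own minimal version (classical MDS implemented via the double-centred Gram matrix; the radii and sample sizes are our assumptions, not the paper's exact setup); it shows the shell's points invading the ball's radial range after the 5D-to-2D embedding:

```python
import numpy as np

# Crowding demo: a 5-D ball (Class 1) plus a shell right outside it (Class 2),
# embedded to 2-D with classical MDS (eigendecomposition of -0.5 * J D^2 J).

rng = np.random.default_rng(4)

def sample_shell(num, n, r_lo, r_hi):
    """Points with uniform direction and radius drawn uniformly in [r_lo, r_hi]."""
    g = rng.standard_normal((num, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return g * rng.uniform(r_lo, r_hi, num)[:, None]

def classical_mds(X, dim=2):
    D2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ D2 @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]           # top eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

inner = sample_shell(200, 5, 0.0, 1.0)           # Class 1: inside the ball
outer = sample_shell(200, 5, 1.0, 1.2)           # Class 2: shell outside it
Y = classical_mds(np.vstack([inner, outer]))
# After MDS, the shell's embedded radii overlap the ball's radial range:
print(np.linalg.norm(Y[200:], axis=1).min(), np.linalg.norm(Y[:200], axis=1).max())
```

In the embedding, some Class 2 points end up closer to the centre than Class 1 points, which is exactly the crowding pattern in Figure 1(b).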

### 3.1 Notation

Let $f$ and $g$ be two increasing functions of $r$. We use $f \asymp g$ to denote that there exist constants $c_1, c_2 > 0$ such that $c_1 g(r) \le f(r) \le c_2 g(r)$ for all $r$. $f \lesssim g$ means that $f$ grows more slowly than $g$ as $r$ increases, and $f \gtrsim g$ means the opposite. For a random variable $z$ and its probability density function $f$, we use $P_{z \sim f}(E)$ to denote the probability of the event $E$ under $f$.

## 4 The methodology

Let us first define some mathematical terminology that helps to describe the crowding issue. Assume each data point is independently drawn from an underlying manifold $M$ according to a continuous probability distribution $f$.

###### Definition 4.1 (Relative capacity)

Let $M$ be a compact manifold and $f$ be a probability distribution defined on $M$. For any point $p \in M$, we define the relative capacity of a neighbourhood of $p$ with radius $r$ as the probability that a randomly drawn point according to $f$ falls inside this neighbourhood,

$$C(p,r;M,f,D) := P_{z\sim f(M)}\big(D(z,p) \le r\big) = E_{z\sim f(M)}\, I\big(r - D(z,p)\big) \tag{1}$$

where $D$ is a map that defines the pairwise distances on $M$ and $I$ is the step function used to model the neighbourhood of radius $r$ centred on $p$ (i.e., $I(x) = 1$ if $x \ge 0$ and $I(x) = 0$ otherwise). The relative capacity of the manifold is the expectation of $C(p,r;M,f,D)$ with respect to all $p$,

$$C(r;M,f,D) := E_{p\sim f(M)}\, C(p,r;M,f,D).$$

For simplicity, we write $C(p,r)$ and $C(r)$ in short for $C(p,r;M,f,D)$ and $C(r;M,f,D)$, respectively.

Intuitively, the “capacity” of a neighbourhood reflects the amount of data the neighbourhood can hold. The “relative capacity” is the normalized capacity, and hence a probability measure. If many samples are generated, $C(p,r)$ reflects the expected fraction of points falling in the neighbourhood. If $r$ is fixed, $C(p,r)$ as a function of $p$ reflects how the data density varies across different locations on the manifold. If $p$ is fixed, $C(p,r)$ as a function of $r$ reflects how fast the capacity grows as the neighbourhood expands. The relative capacity $C(r)$ for the manifold is the percentage of pairs with a distance smaller than $r$. Since, by definition, $C(p,r)$ and $C(r)$ are cumulative distribution functions, their derivatives with respect to $r$ are probability density functions.

###### Definition 4.2 (Relative density)

The relative density functions are defined as the derivatives of the relative capacities

$$\rho(p,r) := \frac{\partial C(p,r)}{\partial r}, \qquad \rho(r) := \frac{dC(r)}{dr}.$$

We also have the reversed relations $C(p,r) = \int_0^r \rho(p,s)\,ds$ and $C(r) = \int_0^r \rho(s)\,ds$.

Example: Suppose $M = B^n(0,1)$ is the unit ball in $\mathbb{R}^n$, $f$ is the uniform distribution on $M$, and $D$ is the Euclidean distance. Then $C(0,r) = r^n$ for any $r \in [0,1]$, and $\rho(0,r) = n r^{n-1}$. Taking the expectation with respect to $p$, one can verify that the capacity of the manifold is $C(r) \asymp r^n$ and the density is about $n r^{n-1}$, where $\asymp$ was defined in Sect. 3.1.
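A quick Monte Carlo check of this example (our own sketch; the sample size is an arbitrary choice) confirms that, measured from the centre $p = 0$, the empirical capacity of the uniform unit ball tracks $r^n$:

```python
import numpy as np

# Empirically verify C(0, r) = r^n for the uniform unit n-ball with the
# Euclidean distance, by sampling points and counting those within radius r.

rng = np.random.default_rng(0)

def sample_ball(num, n):
    """Uniform samples in the unit n-ball: Gaussian direction, radius ~ U^(1/n)."""
    g = rng.standard_normal((num, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(num) ** (1.0 / n)
    return g * radii[:, None]

n, num = 3, 200_000
z = sample_ball(num, n)
for r in (0.3, 0.5, 0.8):
    c_hat = np.mean(np.linalg.norm(z, axis=1) <= r)   # empirical C(0, r)
    print(r, c_hat, r ** n)                           # should be close
```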

In what follows, we make the following two assumptions on the data.
Assumption 1: The original data $X_1,\dots,X_N$ are drawn independently from some underlying manifold $M$ according to a continuous distribution $f$. The embedded data $Y_i = P(X_i)$, $i = 1,\dots,N$, are independent realizations of the embedded manifold $S = P(M)$, according to the induced distribution $f_I$ of $f$ under the mapping $P$.
Assumption 2: The relative densities $\rho(r;M,f,\|\cdot\|_2)$ and $\rho(r;S,f_I,\|\cdot\|_2)$ of $M$ and $S$ can be fitted with the models

$$\rho(r;M,f,\|\cdot\|_2) = c_m\, n_m(r)\, r^{n_m(r)-1}, \qquad \rho(r;S,f_I,\|\cdot\|_2) = c_s\, n_s(r)\, r^{n_s(r)-1}$$

where $c_m$ and $c_s$ are absolute constants, and $n_m(r)$, $n_s(r)$ are both slowly varying dimension functions of $r$.

###### Remark 4.3

$n_m$ and $n_s$ are closely related to the so-called correlation dimension  defined as $\nu = \lim_{r \to 0} \frac{\log C(r)}{\log r}$. We can see that $n(r)$ coincides with $\nu$ at $r = 0$, i.e., $n(0) = \nu$. Therefore, $n(r)$ can be viewed as an extension of the correlation dimension to positive scales and be called a dimension function.

###### Remark 4.4

Assumption 2 is mild, as it allows the growth rate of the relative capacity to vary with the scale.

###### Remark 4.5

The assumption that $n_m$, $n_s$ are slowly varying functions of $r$ ensures that they are almost constant on small intervals and therefore can be estimated by counting points in these intervals (see Sect. 5).

###### Remark 4.6

The slowly varying assumption can be mathematically written as $n'(r) \approx 0$, which implies that on a small interval $[r, r+dr]$, $n_m$ and $n_s$ are close to constant; thus we can integrate the densities to get the increments of the capacities:

$$C(r+dr;M,f,\|\cdot\|_2) - C(r;M,f,\|\cdot\|_2) = c_m (r+dr)^{n_m(r)} - c_m r^{n_m(r)}$$
$$C(r+dr;S,f_I,\|\cdot\|_2) - C(r;S,f_I,\|\cdot\|_2) = c_s (r+dr)^{n_s(r)} - c_s r^{n_s(r)}.$$

### 4.1 A new distance that preserves the relative capacity

Let us explain the crowding issue with the relative capacity, using the unit ball endowed with the uniform distribution. Consider the ball as a union of concentric shells with infinitesimal thickness and increasing radii. If these shells are to be mapped to the low dimensional space without changing their inclusion order, then the relative density as a function of $r$, which reflects how many points lie on each shell, should not change after the embedding; otherwise points will shuffle around. More explicitly, if after embedding we have

$$\rho(0,r;M,f,\hat{D}) \gtrsim \rho(0,r;S,f_I,\|\cdot\|_2)$$

where $\gtrsim$ (defined in Sect. 3.1) means that the left hand side grows faster with $r$ than the right hand side, then $M$ has a larger capacity than $S$ and the crowding phenomenon will be observed. On the contrary, if

$$\rho(0,r;M,f,\hat{D}) \lesssim \rho(0,r;S,f_I,\|\cdot\|_2),$$

then $S$ has a larger capacity than $M$. Points and clusters will be pulled apart to a certain degree, but this is usually less harmful than the crowding.

The relative density of the high dimensional manifold obeys $\rho(0,r) = n r^{n-1}$ due to Example 1. If we equip the embedded manifold with the same distribution (i.e., the uniform distribution) and the same distance (i.e., the Euclidean distance), then the relative density on $S$ grows like $s r^{s-1}$ with $s < n$. Since it is not growing fast enough, the crowding phenomenon will occur.

To solve this problem, we need to either 1) endow non-uniform probability distributions on the low dimensional manifold to allow points with a larger norm to be selected with a higher probability, or 2) modify the distance in the definition of the relative capacity in (1) to allow the shells with larger radii to be mapped to larger shells in the low dimension. We go with the latter option in this paper and leave the former as a future research direction.

Explicitly, we want to design a new distance $\hat{D}$ such that the relative densities of the original manifold and the embedded one approximately match,

$$\rho(r;M,f,\hat{D}) \approx \rho(r;S,f_I,\|\cdot\|_2)$$

for all $r$. Since $\rho$ is the derivative of $C$, under the slowly varying assumption (Assumption 2), the above implies

$$C(r;M,f,\hat{D}) \approx C(r;S,f_I,\|\cdot\|_2).$$
###### Theorem 4.7

Let $M$ be the high dimensional data manifold and $S$ be the embedded one. Let $P: M \to S$ be the dimension reduction mapping. Assume the pairwise distance in the embedded space is some function of that in the original space, i.e., $\|P(x) - P(z)\|_2 = g(\|x - z\|_2)$ for any $x, z \in M$, where the scaling function $g$ is unknown. Under Assumption 1 and Assumption 2, we can define a new “distance” between any pair of points $x$ and $z$ as

$$\hat{D}(\|x-z\|_2) := \|x-z\|_2^{\,n_m(\|x-z\|_2)\,/\,n_s(\|P(x)-P(z)\|_2)}, \qquad x, z \in \mathbb{R}^n. \tag{2}$$

This distance allows a match in the relative capacity between the original and the embedded data,

$$C(r;M,f,\hat{D}) = C(r;S,f_I,\|\cdot\|_2),$$

where $n_m$ and $n_s$ are the same as defined in Assumption 2.

###### Remark 4.8

One can verify that $\hat{D}$ no longer defines a norm. In particular, it does not preserve the monotonicity of distances. However, the proof of Theorem 4.7 indicates that this definition of $\hat{D}$ is both necessary and sufficient for a match of the relative capacity. Therefore we propose to bring back the monotonicity with a minimal change to $\hat{D}$. To that end, we sort the pairwise distances in the high dimensional space in increasing order and redefine $\hat{D}$ inductively, from the smallest to the largest. For the smallest pairwise distance, say $\sigma_1$, simply set $\tilde{D}(\sigma_1) = \hat{D}(\sigma_1)$. For the second smallest distance, say $\sigma_2$, we need to define $\tilde{D}(\sigma_2)$ as close to $\hat{D}(\sigma_2)$ as possible while maintaining the monotonicity $\tilde{D}(\sigma_2) \ge \tilde{D}(\sigma_1)$. Thus we set $\tilde{D}(\sigma_2) = \max\{\hat{D}(\sigma_2), \tilde{D}(\sigma_1)\}$. Assuming we have defined $\tilde{D}$ with monotonicity up to $\sigma_k$, setting $\tilde{D}(\sigma_{k+1})$ as follows guarantees $\tilde{D}$ to be monotonic up to $\sigma_{k+1}$:

$$\tilde{D}(\sigma_{k+1}) = \max\{\hat{D}(\sigma_{k+1}),\, \tilde{D}(\sigma_k)\}. \tag{3}$$

Intuitively, the maximum in the above equation makes $\tilde{D}$ assign more room in the low dimensional space than necessary. As we discussed earlier, this is much less harmful to the visualization than the crowding issue.
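The inductive definition (3) amounts to a running maximum over the pairs sorted by their original distance. A minimal sketch (the input arrays are hypothetical stand-ins for the actual pairwise distances and their CAD values):

```python
import numpy as np

# Monotonicity repair of the capacity adjusted distance, equation (3):
# process pairs in increasing order of the original distance and take
# a running maximum of the hat_D values.

def modified_cad(sigma, d_hat):
    """sigma: original pairwise distances; d_hat: hat_D evaluated at sigma.
    Returns tilde_D, which is monotone non-decreasing in sigma."""
    order = np.argsort(sigma)
    d_tilde = np.empty_like(d_hat)
    d_tilde[order] = np.maximum.accumulate(d_hat[order])
    return d_tilde

sigma = np.array([0.1, 0.4, 0.2, 0.3])
d_hat = np.array([0.05, 0.5, 0.3, 0.2])
print(modified_cad(sigma, d_hat))   # the pair with sigma=0.3 is lifted to 0.3
```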

###### Definition 4.9

We call the new “distance” defined in (2) the Capacity Adjusted Distance (CAD), and the modified version defined in (3) the modified Capacity Adjusted Distance (modified CAD).

###### Remark 4.10

Theorem 4.7 currently assumes that the high dimensional manifold is equipped with the Euclidean distance, and provides a way to modify the Euclidean distance to allow a match of the capacity. If the distance equipped on $M$ is instead the geodesic distance or the diffusion distance, we can also use Theorem 4.7 to modify it to allow a match of the capacity, by simply replacing the quantity $\|x - z\|_2$ in (2) by the geodesic or the diffusion distance. We make it an option to the user in our algorithm (Algorithm 1).

The definition of $\hat{D}$ requires the knowledge of the dimension functions $n_m$ and $n_s$ at various scales. In the next section, we introduce a dimension estimation method that allows an estimate of $n_m$ from the data. However, we cannot estimate $n_s$ directly since the embedding is not yet known. To solve this problem, we make the assumption that the induced measure on the low dimensional manifold is uniform, and in light of Example 1, this means $n_s(r) \equiv s$. Once the low dimensional embedding is obtained, one may use it to update $n_s$ to make it more accurate (note that we do not update $n_s$ in this paper).

### 4.2 Dimensionality reduction with the modified Capacity Adjusted Distance

Before introducing ways of estimating $n_m$, let us first discuss the dimensionality reduction procedure based on the new distance $\tilde{D}$. As usual, we search for the low dimensional vectors $Y_1,\dots,Y_N$ that best preserve $\tilde{D}(\|X_i - X_j\|)$ for all pairs of points. In CPM, we measure the dissimilarity using the Kullback–Leibler divergence. The low dimensional embedding is the minimizer of the following optimization problem

$$\{\hat{Y}_i\}_{i=1}^N = \operatorname*{arg\,min}_{Y_i,\, i=1,\dots,N}\ \sum_{i,j} p_{i,j} \log\frac{p_{i,j}}{q_{i,j}} \tag{4}$$

where

$$p_{i,j} = \frac{\big(\epsilon + \tilde{D}^2(\|X_i - X_j\|)\big)^{-1}}{\sum_{k,l,\,k\ne l}\big(\epsilon + \tilde{D}^2(\|X_k - X_l\|)\big)^{-1}}, \qquad q_{i,j} = \frac{\big(1 + \|Y_i - Y_j\|^2\big)^{-1}}{\sum_{k,l,\,k\ne l}\big(1 + \|Y_k - Y_l\|^2\big)^{-1}}, \tag{5}$$

where $\tilde{D}$ is the (modified capacity adjusted) distance measure in the high dimensional space, and $\epsilon$ is some small constant used to avoid taking the reciprocal of 0. Besides the KL divergence, one can use other (dis)similarity measures and obtain different formulations of the optimization problem, which we leave as future work. Putting the $p_{i,j}$ defined in (5) into a matrix $P$, we can think of $P$ as the normalized network probability matrix of a random network built from the pairwise distance matrix. Explicitly, let the data points correspond to the nodes of a random graph, and construct the random edges as follows: $X_i$ is connected with $X_j$ by an edge with probability proportional to $(\epsilon + \tilde{D}^2(\|X_i - X_j\|))^{-1}$, where the hidden constant is universal (i.e., the denominator in (5)). That is to say, closer points are more likely to be connected by an edge in this random graph. The optimization (4) hence tries to match, in KL divergence, the network probabilities of the original and the embedded graphs. The small positive constant $\epsilon$ in the expression of $p_{i,j}$ and the 1 in the expression of $q_{i,j}$ are used to avoid dividing by 0.
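The construction of $P$ can be sketched as follows (assuming the modified CAD values are already available as a symmetric matrix `d_tilde`, a hypothetical input); note the single global normalization rather than t-SNE's row-wise one:

```python
import numpy as np

# Build the network probability matrix P of equation (5): each off-diagonal
# entry (eps + tilde_D^2)^{-1} is normalized by one global constant.

def network_probabilities(d_tilde, eps=1e-12):
    w = 1.0 / (eps + d_tilde ** 2)
    np.fill_diagonal(w, 0.0)          # exclude i == j pairs
    return w / w.sum()                # universal (not row-wise) normalization

d_tilde = np.array([[0.0, 1.0, 2.0],
                    [1.0, 0.0, 1.0],
                    [2.0, 1.0, 0.0]])
P = network_probabilities(d_tilde)
print(P.sum())                        # all entries sum to 1
```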

Although the formulation (4) looks similar to that of t-SNE (see (6) below), their performances are completely different.

$$\text{t-SNE:}\quad \{\hat{Y}_i\}_{i=1}^N = \operatorname*{arg\,min}_{Y_i,\, i=1,\dots,N}\ \sum_{i,j} p_{i|j} \log\frac{p_{i|j}}{q_{i|j}} = \operatorname*{arg\,min}_{Y_i,\, i=1,\dots,N}\ \sum_{i,j} -p_{i|j} \log q_{i|j} \tag{6}$$

where

$$p_{i|j} = \frac{e^{-\|X_i - X_j\|_2^2/2\sigma^2}}{\sum_{k,\,k\ne j} e^{-\|X_k - X_j\|_2^2/2\sigma^2}}, \qquad q_{i|j} = \frac{\big(1 + \|Y_i - Y_j\|_2^2\big)^{-1}}{\sum_{k,\,k\ne j}\big(1 + \|Y_k - Y_j\|_2^2\big)^{-1}}. \tag{7}$$

Compared to t-SNE, our formulation (4) has the following merits.

1. Less over-stretching: t-SNE mitigates the crowding and promotes the formation of clusters by matching Gaussian distributions with t-distributions. Intuitively, this makes close points closer and far away points further away, but the degrees of squeezing and stretching (i.e., how much closer and how much further) are not precisely characterized and justified. As a result, t-SNE may produce fake clusters due to over-stretching. In contrast, our method computes and performs the amount of stretching that is necessary to avoid the crowding. In other words, the stretching in our method is much milder. This can also be seen mathematically from the definition of $p_{i,j}$ in (5) and the optimization (4), where we fit the $t$-distribution with another heavy tailed distribution (given the way the modified capacity adjusted distance is defined), whereas t-SNE fits the $t$-distribution with the light tailed Gaussian distribution, which creates more stretching. Consequently, although our method may also tear a manifold up, it happens much less often.

2. No tuning parameter: the performance of t-SNE heavily depends on the choice of the bandwidth parameter $\sigma$, whereas our method has no key tuning parameter (the small positive constant $\epsilon$ used to avoid dividing by 0 does not affect the results much as long as it is sufficiently small).

3. Preserving more geometry: the probability in t-SNE is a conditional probability $p_{i|j}$, which causes a loss of geometric information for the following reason. In the optimization (6), $p_{i|j}$ is the weight of $\log q_{i|j}$, which contains the variables $Y_i$ and $Y_j$. The total weight put on the logarithms containing $Y_j$ for a given $j$ is then the sum $\sum_i p_{i|j} + \sum_i p_{j|i}$, which is the same for every $j$ (here we used $\sum_i p_{i|j} = 1$, due to the definition of the conditional probability, and $\sum_i p_{j|i} = \sum_i p_{i|j}$, due to the symmetrisation in t-SNE of the conditional probability matrix, which makes $p_{i|j}$ and $p_{j|i}$ equal). Since the total weight on $Y_j$ is the same for every $j$, all points are given equal importance in the optimization. As a result, the variability among points or clusters is lost. For instance, clusters of different sizes become of similar sizes in the embedding, and clusters at different inter-cluster distances end up at similar distances after embedding (we refer the readers to https://distill.pub/2016/misread-tsne/ for more such examples). In contrast, the probabilities in our formulation are not conditional probabilities. Mathematically, instead of normalizing the probability matrix row by row as in t-SNE, we normalize all entries by one constant. This allows variabilities to exist among rows, which carry the correct geometric information. It is worth noting that this universal normalization cannot be used in t-SNE, because the fast decay of the Gaussian tail often causes the sums of some rows to be much smaller than others. Because these sums are the weights on the $Y_j$'s in the optimization, those with too small a weight cannot be correctly computed. In contrast, our formulation allows for both row-wise and universal normalizations; the row-wise normalization achieves greater stability at the expense of losing more geometric information.

### 4.3 Comparisons to other methods

Besides t-SNE, our method is also related to the non-metric MDS (NMDS)  and the multi-scale SNE . Similar to our method, the non-metric MDS also aims at preserving the pairwise dissimilarities as closely as possible. It approaches this goal by minimizing the mismatch between the embedded distances and monotonically rescaled original distances. Let $d_{i,j}$ be the high dimensional distances; the NMDS embedding is obtained by solving the optimization problem

$$\min_{\{Y_i\}_{i=1}^N,\, f \in \mathcal{F}}\ \sum_{i,j} \big|\, \|Y_i - Y_j\|_2 - f(d_{i,j}) \,\big| \tag{8}$$

where $\mathcal{F}$ is the set of positive monotonically increasing functions. The scaling function $f$ plays the role of mitigating the crowding. Indeed, the scaled distance $f(d_{i,j})$ essentially corresponds to our capacity adjusted distance. From this perspective, our formulation provides an explicit way to compute the scaling, which avoids the trouble of solving for it in an optimization. Compared to the proposed method, the NMDS in its original form (8) fails to render the correct small-scale information of the data (see Figure 3). While replacing the absolute value in (8) with a KL-divergence type of dissimilarity measure can help preserve the small-scale structure, the resulting optimization is difficult to solve due to the presence of the function variable $f$.

Multi-scale SNE  is similar to our approach in the sense that both methods assume a scale-varying dimension of the data manifold. However, since it inherits the structure of SNE, the third drawback of t-SNE mentioned in Section 4.2 applies to it as well.

## 5 Dimension estimation at various scales

We propose a way to estimate the dimension function $n(r)$ from the data. To the best of our knowledge, no existing method calculates the multi-scale correlation dimension for all scales. The multi-scale dimension estimation method in  calculates the average dimension from scale 0 to scale $r$ instead of the instantaneous dimension at scale $r$. Recall that in Assumption 2, $n(r)$ is defined as the growth rate of the relative density

$$\rho(r) \equiv \rho(r;M,f,\|\cdot\|_2) = c\, n(r)\, r^{n(r)-1} \tag{9}$$

where $c$ is some unknown absolute constant. Therefore, it is possible to find $n(r)$ by fitting the slope of $\log\rho$ against $\log r$:

$$n(r) \approx \frac{\log\rho(r+\delta r) - \log\rho(r)}{\log(r+\delta r) - \log(r)}. \tag{10}$$

However, this estimate is highly unstable, especially when the data is insufficient, and may even output negative dimensions.

Therefore, we seek a way to estimate $n(r)$ from $C(r)$, which can be more stably approximated from the data by counting the number of points falling inside neighbourhoods of radius $r$,

$$C(r) \leftarrow \hat{C}(r) = \frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j=1,\, j\ne i}^N \chi\{D(X_i, X_j) \le r\}. \tag{11}$$
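The estimator (11) is a plain pair-counting statistic. A minimal sketch (with the Euclidean distance standing in for $D$, and arbitrary toy data):

```python
import numpy as np

# Empirical relative capacity hat_C(r) of equation (11): the fraction of
# ordered pairs (i, j), i != j, whose pairwise distance is at most r.

def empirical_capacity(X, r):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    mask = ~np.eye(n, dtype=bool)     # drop the i == j pairs
    return np.mean(d[mask] <= r)

rng = np.random.default_rng(1)
X = rng.random((200, 2))              # 200 toy points in the unit square
print(empirical_capacity(X, 0.1), empirical_capacity(X, 0.5))
```

Being a cumulative count, $\hat{C}(r)$ is monotone in $r$, which is what makes it a more stable starting point than the direct density slope (10).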

At scale zero, as mentioned in Remark 4.3, $n(0)$ corresponds to the well-known correlation dimension, and can be estimated by counting points in balls of various radii and then solving for $n$ from $C(r) = c r^n$ via a linear fitting procedure between $\log \hat{C}(r)$ and $\log r$ [3, 5, 12]. However, at non-zero scales, $n(r)$ can only be estimated by counting points in a shell that is thin enough that one can assume $n$ to be constant on it. Unlike the counting number of a ball, the counting number of a shell, denoted $C_S(r)$, does not obey the simple relation $C(r) = c r^n$, but the following relation obtained in Remark 4.6:

$$C_S(r) \equiv C(r+\Delta r) - C(r) \approx c(r+\Delta r)^{n(r)} - c\, r^{n(r)}. \tag{12}$$

Assumption 3: For some fixed $r_0$ and $\Delta r_0$, $n(r)$ is a constant on the interval $[r_0, r_0 + \Delta r_0]$.

For illustration purposes, we introduce our method under Assumption 3, but the proposed method works for slowly varying dimensions as well.

Under Assumption 3, (12) becomes an equality,

$$C_S(r) \equiv C(r+\Delta r) - C(r) = c(r+\Delta r)^{n} - c\, r^{n} \quad \text{for } [r, r+\Delta r] \subseteq [r_0, r_0 + \Delta r_0] \tag{13}$$

where $n$ is the constant dimension on the small interval. The counting number of the shell on the left hand side of (13) can be estimated by its empirical version $\hat{C}(r+\Delta r) - \hat{C}(r)$, where $\hat{C}$ was defined in (11). Now the problem boils down to solving for $n$ from (13), where $\Delta r$ is some small constant chosen in advance and the left hand side has been estimated from the data (note that the constant $c$ is still unknown). As mentioned above, this new problem is easier to solve when $r = 0$ than when $r > 0$, because when $r = 0$ we have $C(0) = 0$, which reduces (13) to

$$C(\Delta r) = c(\Delta r)^{n}.$$

Then $n$ equals the slope of $\log C(\Delta r)$ with respect to $\log \Delta r$, which can be computed numerically via a finite difference. However, this approach does not generalize to $r > 0$, as the presence of the second term on the right hand side of Equation (13) prevents taking logarithms from simplifying the equation. Without simplification, solving for $n$ from (13) is difficult, as $c$ is unknown and the equation is highly non-linear. We propose a way to reduce the nonlinearity to a degree that allows a meaningful solution to be found by fixed-point iterations. The key observation (14), to be proven in Theorem 5.2, states that, with the help of the data, the term $c r^n$ can be approximated by a linear function of $1/n$,

$$c\, r^{n} \approx \psi(r+\Delta r, r)\left(\frac{1}{n} - \Phi(r+\Delta r, r)\right) =: p(n, \Delta r), \tag{14}$$

where

$$\Phi(r', r) = \frac{1}{\hat{C}(r') - \hat{C}(r)} \int_r^{r'} \frac{\hat{C}(s) - \hat{C}(r)}{s}\, ds, \qquad \psi(r', r) = \frac{\hat{C}(r') - \hat{C}(r)}{\log(r') - \log(r)},$$

are both empirical quantities that can be directly computed from the data. Here $\approx$ means asymptotically equal as $N \to \infty$, where $N$ is the number of samples. As a result, $p(n, \Delta r)$ is a linear function in $1/n$. Intuitively, this low order expression can approximate the exponential quantity because the empirical quantities $\Phi$ and $\psi$ have taken into account all the high order information. Plugging (14) into (13), we obtain

$$\hat{C}(r+\Delta r) - \hat{C}(r) + p(n(r), \Delta r) \approx c(r+\Delta r)^{n} \tag{15}$$

or equivalently,

$$n \approx \frac{\partial_{\Delta r} \log\big[\hat{C}(r+\Delta r) - \hat{C}(r) + p(n, \Delta r)\big]}{\partial_{\Delta r} \log(r+\Delta r)}\Bigg|_{\Delta r = 0}, \tag{16}$$

where $\partial_{\Delta r}$ denotes the derivative with respect to $\Delta r$. Let $q(n, \Delta r)$ be the finite difference approximation of the right hand side:

$$q(n, \Delta r) := \frac{\log\big[\hat{C}(r+\Delta r) - \hat{C}(r) + p(n, \Delta r)\big] - \log\big[\hat{C}(r+\Delta r/2) - \hat{C}(r) + p(n, \Delta r/2)\big]}{\log(r+\Delta r) - \log(r+\Delta r/2)}.$$

The following Proposition states that the resulting fixed-point equation

$$n = q(n, \Delta r) \tag{17}$$

is close to quadratic in $n$.

###### Proposition 5.1

When $\Delta r$ is much smaller than $r$, the equation defined in (17) is approximately linear in $n$.

Therefore, we initialize $n$ using the solution to the linear approximation and solve (17) via the fixed point iterations $n_{k+1} = q(n_k, \Delta r)$.
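On a toy problem where the true capacity is known, the fixed-point iteration can be checked directly. The sketch below is our own illustration: it substitutes the exact capacity $C(r) = r^3$ of the uniform unit 3-ball for the empirical $\hat{C}$, picks $r$ and $\Delta r$ arbitrarily, and uses a crude initializer instead of the linear-approximation solution; the iteration should recover $n = 3$:

```python
import numpy as np

# Fixed-point iteration n_{k+1} = q(n_k, dr) for equation (17), with the
# linearized term p(n, dr) of (14) built from Phi and psi.

n_true, r, dr = 3.0, 0.5, 0.05
C = lambda s: s ** n_true                     # stand-in for hat_C

def p(n, d):
    """psi * (1/n - Phi), with Phi computed by trapezoid quadrature."""
    s = np.linspace(r, r + d, 1001)
    g = (C(s) - C(r)) / s
    integral = np.sum((g[:-1] + g[1:]) / 2) * (s[1] - s[0])
    phi = integral / (C(r + d) - C(r))
    psi = (C(r + d) - C(r)) / (np.log(r + d) - np.log(r))
    return psi * (1.0 / n - phi)

def q(n, d):
    """Finite-difference log-slope of (16) with p(n, d) plugged in."""
    num = (np.log(C(r + d) - C(r) + p(n, d))
           - np.log(C(r + d / 2) - C(r) + p(n, d / 2)))
    return num / (np.log(r + d) - np.log(r + d / 2))

n = 1.0                                       # crude initializer
for _ in range(50):
    n = q(n, dr)
print(round(n, 2))                            # close to the true dimension 3
```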

###### Theorem 5.2

Assume $\rho(r)$ and $C(r)$ are the $\rho(r;M,f,\|\cdot\|_2)$ and $C(r;M,f,\|\cdot\|_2)$ defined in Assumption 2, and $n(r) \equiv n$ is a constant on the small interval $[r, r+\Delta r]$. Let $p$ be the same as defined in (14); then for any $\delta r \in (0, \Delta r]$, we have

$$c\, r^{n} = \lim_{N \to \infty} p(n, \delta r),$$

where $N$ is the number of samples.

Note that the multi-scale correlation dimension $n(r)$ is the object of study here, as it is directly linked to the relative capacity. Our method is not meant to estimate the actual intrinsic dimension of the manifold $M$.

The Assumption 3 used in Theorem 5.2 can be relaxed from a constant dimension to a slowly varying one (Assumption 2).

###### Theorem 5.3

Assume $\rho(r)$ and $C(r)$ are the $\rho(r;M,f,\|\cdot\|_2)$ and $C(r;M,f,\|\cdot\|_2)$ defined in Assumption 2, and $n(r)$ is slowly varying on the small interval $[r, r+\Delta r]$, i.e., $n'(r) \approx 0$. Let $p$ be the same as defined in (14); then

$$c\, r^{n(r)} = \lim_{\Delta r \to 0}\,\lim_{N \to \infty} p(n(r), \Delta r),$$

where $N$ is the number of data points. In addition, $n(r) = \lim_{\Delta r \to 0} q(n(r), \Delta r)$.

## 6 Numerical simulation

To evaluate the performance of the proposed Capacity Preserving Mapping (CPM) method, we first compare it with the landmark methods non-metric MDS, Isomap, and t-SNE on three datasets: 1) the motivating example introduced in Sect. 3, 2) the augmented Swiss roll to be defined shortly, and 3) the MNIST dataset.

The motivating example: in Sect. 3, we observed the crowding phenomenon when using MDS to reduce the dimension of two non-overlapping objects from 5 to 2, with one object being a ball and the other being a shell lying right outside the ball. Figure 2 shows that the same crowding problem arises in t-SNE and Isomap. In contrast, CPM and non-metric MDS are able to mitigate the crowding and reveal the correct relation between the classes. To further compare CPM and NMDS, we plot the Shepard diagrams, which show the goodness-of-fit after the dimension reduction. Figure 3 displays the Shepard diagrams of NMDS, CPM and t-SNE (perplexity 30; other perplexity values produce similar or worse results) when they map 5 dimensional (bottom row) and 20 dimensional (top row) Gaussian point clouds (each containing 1000 points sampled i.i.d. from an isotropic Gaussian of the corresponding dimension) down to 2D. We see that t-SNE is only good at preserving small scale distances. NMDS is good at preserving medium and large distances, but is worse than t-SNE at small scales (as discussed in Sect. 4.3). CPM has a similar performance to t-SNE at small scales and a similar performance to NMDS at medium and large scales, and its advantage becomes more visible as the dimension gets higher.

Figure 3: Shepard diagrams of t-SNE, NMDS and CPM applied to Gaussian point clouds.

Augmented Swiss roll: the Swiss roll is a popular synthetic test dataset for evaluating a method’s ability to preserve geometric structure. However, the fact that its intrinsic dimension is only 2 makes it less representative. In order to test the ability of our algorithm to map data to below its intrinsic dimension, we construct an augmented Swiss roll dataset, where the first coordinates are exactly those of the original Swiss roll and the remaining coordinates are filled with i.i.d. Gaussian entries, which raises the intrinsic dimension. Here we map the data to 2D. Explicitly, we set

$$x_1(t) = (t+1)\cos(t), \qquad x_2(t) = (t+1)\sin(t), \qquad x_j(t) = g_j(t), \quad j = 3,\dots,6$$

where $t$ takes values in a finite grid of an interval, and the $g_j(t)$ are Gaussian variables, independent across the various values of $j$ and $t$.
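The construction can be sketched as follows (the parameter range, grid size, and noise scale below are our assumptions for illustration; the paper's exact values are not reproduced here):

```python
import numpy as np

# Augmented Swiss roll: two roll coordinates (t+1)cos(t), (t+1)sin(t)
# plus i.i.d. Gaussian padding coordinates x_3, ..., x_6.

rng = np.random.default_rng(5)

def augmented_swiss_roll(num=1000, extra_dims=4, noise=0.1):
    t = np.linspace(0.0, 3 * np.pi, num)        # assumed parameter grid
    roll = np.column_stack([(t + 1) * np.cos(t), (t + 1) * np.sin(t)])
    pad = noise * rng.standard_normal((num, extra_dims))
    return np.hstack([roll, pad])

X = augmented_swiss_roll()
print(X.shape)                                   # 1000 points in 6 dimensions
```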

We apply to this dataset Isomap, NMDS with the geodesic distance, and CPM with the geodesic distance, all with the same number of neighbours (10). From the second row of Figure 2, we see that in terms of unfolding the manifold, CPM did the best job among all. The result of t-SNE (perplexity 30) is also included for comparison.

MNIST dataset: We repeat the previous experiment on the MNIST dataset to test the method’s ability to preserve and reveal clusters. The MNIST dataset contains 60000 training images of handwritten digits. We apply t-SNE, Isomap, NMDS with the geodesic distance and CPM with the geodesic distance to a subset of 6000 randomly selected images from the training set. Figure 2 (last row) shows that CPM separates the clusters as well as t-SNE, while revealing some new structures. For example, it shows that the cluster of the digit 1 has the smallest variance, which is consistent with the intuition that the handwritten digit 1 has the least variation among different writing styles. To confirm this observation, we computed the variance of each cluster based on the original data and obtained Table 1. We can see that the cluster of the digit 1 indeed has a much smaller variance than all other clusters. The clusters with the second and third smallest variances, digit 7 and digit 9, also appear smaller than others in the visualization. Another piece of information conveyed by the CPM visualization is that the cluster of digit 1 is close to many other clusters, which is aligned with the intuition that the digit 1 looks similar to 2, 3, 7, and maybe 9.

### 6.1 Coil 20

The Coil 20 dataset contains images of 20 objects captured from different angles as they rotate. Previous methods aim only at separating the images into 20 clusters. While we also care about the formation of clusters, we are more interested in examining how well the geometric features are correctly displayed. In particular, we check the following properties of the embedding:

1. objects with large variances during rotations should correspond to clusters with large sizes in the visualization;

2. objects similar to each other should correspond to clusters close to each other in the visualization;

3. if an object is symmetric with respect to its center, then the corresponding data should form a trajectory similar to a folded circle in the visualization;

4. if an object is nearly isometric (looks similar from all angles), then its corresponding cluster should have a small size.

We now evaluate the performance of CPM based on these four criteria.

1. Table 2 summarizes the variance of each object as it rotates. The variance $V_i$ for the $i$-th object is computed using the formula

$$V_i=\frac{1}{n_i}\sum_{j=1}^{n_i}\left\|X_j-\bar{X}_i\right\|_2^2$$

where $n_i$ is the number of points in the $i$-th cluster, $X_j$, $j=1,\dots,n_i$, are the vectorized images of the $i$-th object, and $\bar{X}_i$ is their mean. Table 6.1 and Figure 4 together confirm that objects with large variances during rotation correspond to clusters with large sizes in the visualization.
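The per-cluster variance computation above can be sketched in a few lines of numpy (the function name `cluster_variances` and the toy data below are illustrative, not part of the paper):

```python
import numpy as np

def cluster_variances(X, labels):
    """V_i = (1/n_i) * sum_j ||X_j - Xbar_i||_2^2 for each cluster i,
    where X holds the vectorized images row-wise."""
    V = {}
    for c in np.unique(labels):
        Xc = X[labels == c]                 # the n_i points of cluster c
        centered = Xc - Xc.mean(axis=0)     # subtract the cluster mean
        V[c] = float((centered ** 2).sum(axis=1).mean())
    return V

# Toy check: a tight cluster and a spread-out one.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (100, 5)),
               rng.normal(0.0, 1.0, (100, 5))])
labels = np.repeat([0, 1], 100)
V = cluster_variances(X, labels)
assert V[0] < 0.1 < V[1]   # the tighter cluster has the smaller variance
```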

2. To characterize the preservation of the inter-cluster distances of the 20 classes/objects in COIL-20, we propose a distance index defined as follows. For a given cluster, say Cluster $i$, we compute its distances to the other 19 clusters (the distance between two clusters is defined as the average of the pairwise distances between points of one cluster and points of the other), rank the 19 distances and find the $p$-th percentile for some given $p$. The clusters within the $p$-th percentile are said to be close to Cluster $i$; the others are said to be far from Cluster $i$. After doing this for each $i$, we build a proximity matrix $C$: if Cluster $j$ is close to Cluster $i$, then $C_{i,j}=1$, otherwise $C_{i,j}=0$. We compute the proximity matrix of the original data (obtaining $C^{orig}$) and that of the embedded data (obtaining $C^{new}$), and then compute the dissimilarity between $C^{orig}$ and $C^{new}$:

$$E(p)=\frac{\sum_{i,j}\mathbf{1}\{C^{orig}_{i,j}=1,\,C^{new}_{i,j}=0\}}{\sum_{i,j}\mathbf{1}\{C^{orig}_{i,j}=1\}}$$

That is to say, we compute the percentage of close clusters that are no longer close after the embedding. Such a dissimilarity can be computed for any predefined percentile $p$. Letting $p$ range from 0% to 50%, we obtain the percentile-versus-dissimilarity plot in Figure 5 for three embedding methods: CPM with the Euclidean distance, t-SNE (perplexity 30) and MDS. Clearly, MDS is good at preserving large distances, our method is better at preserving smaller distances, and t-SNE is not good at preserving inter-cluster distances.
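A minimal numpy sketch of this distance index (the helper names are illustrative; `D_orig` and `D_new` are assumed to be the 20-by-20 averaged inter-cluster distance matrices described above):

```python
import numpy as np

def proximity_matrix(D, p):
    """C[i, j] = 1 iff cluster j is within the p-th percentile of
    cluster i's distances to the other clusters."""
    K = D.shape[0]
    C = np.zeros((K, K), dtype=int)
    for i in range(K):
        thresh = np.percentile(np.delete(D[i], i), p)
        for j in range(K):
            if j != i and D[i, j] <= thresh:
                C[i, j] = 1
    return C

def dissimilarity(D_orig, D_new, p):
    """E(p): fraction of close cluster pairs lost after the embedding."""
    C_orig = proximity_matrix(D_orig, p)
    C_new = proximity_matrix(D_new, p)
    lost = np.sum((C_orig == 1) & (C_new == 0))
    return lost / C_orig.sum()

# A perfectly preserved embedding loses no close pairs.
rng = np.random.default_rng(1)
A = rng.random((20, 20))
D = (A + A.T) / 2
np.fill_diagonal(D, 0.0)
assert dissimilarity(D, D, 25) == 0.0
```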

3. Table 5 shows that the CPM embedding indeed produces folded circles for symmetric objects.

4. Table 6.1 shows that the CPM embedding indeed produces small clusters for the isotropic objects. In addition, Table 6 also suggests that the trajectories of these objects are (unfolded) circles.

Table 3: Objects with large variances

| Label | 19 | 2 | 5 | 13 |
| --- | --- | --- | --- | --- |
| Variance | 43.18 | 69.92 | 42.55 | 31.72 |

Table 4: Objects with small variances

| Label | 17 | 15 | 12 | 16 |
| --- | --- | --- | --- | --- |
| Variance | 2.97 | 2.03 | 3.45 | 1.90 |

## 7 Conclusion and future directions

In this paper, we asked how to rigorously characterize and treat the intrinsic crowding issue in data visualization. After giving a mathematical definition of the capacity, we discussed two possible directions to mitigate the crowding: altering the density or altering the distance. We chose the latter for simplicity, but it would be interesting to explore the former as well.

After defining the Capacity Adjusted Distance, we proposed to find the low dimensional embedding by matching the dissimilarity measured by the KL divergence. There are many other dissimilarity measures in the literature; the performance of the Capacity Adjusted Distance combined with these other measures is yet to be explored.

The definition of the Capacity Adjusted Distance requires knowledge of the dimension functions, which cannot be estimated directly from the data since the low dimensional embedding is not readily known. We made the assumption that the induced measure on the low dimensional manifold is uniform. A better approach is to initialize the estimate under this assumption and update it once the low dimensional embedding is obtained; these two steps (updating the embedding and updating the dimension functions) can then be performed iteratively.

In the numerical experiments, we tested CPM only with the Euclidean and geodesic distances. It would be interesting to see how it works with other distances (e.g., the diffusion distance).

From the discussion in Sect. 4.2, the preservation of geometry essentially corresponds to allowing different row sums of the probability matrix. Our method achieves this via the universal normalization, but there are other ways to achieve it. That is to say, although t-SNE does not admit the universal normalization, there might be other remedies that bring back the geometric information.

## 8 References

 M. Belkin and P. Niyogi. Laplacian Eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, volume 14, pages 585–591, Cambridge, MA, USA, 2002. The MIT Press.

 F. Camastra and A. Vinciarelli. Estimating the intrinsic dimension of data with a fractal-based method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(10):1404–1407, 2002.

 K. Falconer. Fractal Geometry: Mathematical Foundations and Applications. John Wiley and Sons, 3rd edition, 2014.

 M.C. Ferreira de Oliveira and H. Levkowitz. From visual data exploration to visual data mining: A survey. IEEE Transactions on Visualization and Computer Graphics, 9(3):378–394, 2003.

 P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. Physica D: Nonlinear Phenomena, 9(1-2):189–208, 1983.

 G.E. Hinton and S.T. Roweis. Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems, volume 15, pages 833–840, Cambridge, MA, USA, 2002. The MIT Press.

 A. Hyvarinen and E. Oja. Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4-5):411–430, 2000.

 S. Lafon and A.B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning and data set parametrization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1393–1403, 2006.

 L. McInnes and J. Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Arxiv e-prints, February 2018.

 K.R. Moon et al. PHATE: A dimensionality reduction method for visualizing trajectory structures in high-dimensional biological data. bioRxiv, March 2017. http://dx.doi.org/10.1101/120378

 S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.

 N. Tatti, T. Mielikäinen, A. Gionis, and H. Mannila. What is the dimension of your binary data? In Proceedings of the 6th International Conference on Data Mining (ICDM), pages 603–612, December 2006.

 J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

 W.S. Torgerson. Multidimensional scaling I: Theory and method. Psychometrika, 17:401–419, 1952.

 L. van der Maaten, E. Postma, and J. van den Herik. Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10:66–71, 2009.

 L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

 J.A. Lee, D.H. Peluffo-Ordóñez, and M. Verleysen. Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure. Neurocomputing, 169:246–261, 2015.

 G.B. Rabinowitz. An introduction to nonmetric multidimensional scaling. American Journal of Political Science, pages 343–390, 1975.

## 9 Appendix

### 9.1 Proof of Theorem 4.7

#### Proof:

The assumption that the embedded pairwise distances can be written as a function of the pairwise distances in the original space allows us to define the new distance as a function $\hat D$ of $\|x-z\|_2$ only. We will show that if $\hat D$ is defined as in the theorem, then the capacity is preserved. By the definition of the relative capacity, for any $\tilde r$ and any small $d\tilde r$ (i.e., the interval $[\tilde r,\tilde r+d\tilde r]$ is small), the densities can be treated as constants on this interval, and

$$\begin{aligned}
C(\tilde r+d\tilde r;M,f,\hat D)-C(\tilde r;M,f,\hat D) &= P_{x,z\sim f(M)}\big(\tilde r\le \hat D(\|x-z\|_2)\le \tilde r+d\tilde r\big)\\
&= p_{x,z\sim f(M)}\big(\hat D(\|x-z\|_2)=\tilde r\big)\,d\tilde r\\
&= \rho(\tilde r;M,f,\hat D)\,d\tilde r
\end{aligned}$$

where $p$ denotes the probability density function with respect to the random variables $x$ and $z$. The first and last equalities used the definitions of the capacity $C$ and the density $\rho$, respectively. Hence the relative capacities match if the relative densities match. We will show that the relative densities match by showing

$$\rho(\tilde r;M,f,\hat D)\,d\tilde r=c\,n_s\,\tilde r^{\,n_s-1}\,d\tilde r, \tag{18}$$

because the right hand side is exactly the relative density for the low dimensional embedding by Assumption 2. To compute the left hand side, we define $r$ through $\tilde r=\hat D(r)$; then $\hat D(\|x-z\|_2)=\tilde r$ is equivalent to $\|x-z\|_2=r$. Hence

$$\rho(\tilde r;M,f,\hat D)\,d\tilde r=p_{x,z\sim f(M)}(\hat D=\tilde r)\,d\tilde r=p_{x,z\sim f(M)}(\|x-z\|_2=r)\,dr \tag{19}$$

This last term can be easily calculated using Assumption 2,

$$p_{x,z\sim f(M)}(\|x-z\|_2=r)=\rho(r;M,f,\|\cdot\|_2)=c\,n_m\,r^{\,n_m-1}$$

Inserting this into (19) we obtain

$$\rho(\tilde r;M,f,\hat D)=p_{x,z\sim f(M)}(\|x-z\|_2=r)\cdot\frac{dr}{d\tilde r}=c\,n_m\,r^{\,n_m-1}\cdot\frac{dr}{d\tilde r}=c\,n_m\,r^{\,n_m-1}\cdot\frac{n_s}{n_m}\,\tilde r^{\,\frac{n_s}{n_m}-1}=c\,n_s\,\tilde r^{\,n_s-1}$$

Hence we proved (18).
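The capacities manipulated in these proofs can be estimated from a finite sample via the empirical counting number introduced in Section 9.2. A minimal numpy sketch (the function name is illustrative, and the Euclidean distance stands in for a general distance $\hat D$):

```python
import numpy as np

def empirical_capacity(X, r):
    """Empirical counting number C_hat(r): the fraction of ordered pairs
    (i, j), i != j, whose Euclidean distance is at most r."""
    N = len(X)
    count = sum(np.linalg.norm(X[i] - X[j]) <= r
                for i in range(N) for j in range(N) if i != j)
    return count / (N * (N - 1))

# Sanity checks on three collinear points at 0, 1, 2.
X = np.array([[0.0], [1.0], [2.0]])
assert empirical_capacity(X, 0.5) == 0.0     # no pair within distance 0.5
assert empirical_capacity(X, 1.0) == 4 / 6   # (0,1),(1,0),(1,2),(2,1)
assert empirical_capacity(X, 2.0) == 1.0     # all pairs within distance 2
```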

### 9.2 Proof of Theorem 5.2

#### Proof:

It is immediate to verify that the empirical counting number defined by

$$\hat C(r)=\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N}\chi\big\{\hat D(X_i,X_j)\le r\big\}$$

is a consistent estimate of $C(r)$, i.e.,