1 Introduction and Motivation
tSNE (Maaten and Hinton, 2008) has been a successful and popular dimension reduction method for visualisation. It aims to preserve, in the transformed low-dimensional space, the similarities observed in the given high-dimensional space, by minimising a KL divergence. The original SNE (Hinton and Roweis, 2003) employs a Gaussian kernel to measure similarity in both the high- and the low-dimensional spaces. tSNE replaces the Gaussian kernel with the distance-based similarity $(1+d_{ij}^2)^{-1}$ (where $d_{ij}$ is the distance between instances $i$ and $j$) in the low-dimensional space, while retaining the Gaussian kernel for the high-dimensional space. The distance-based similarity has a heavy-tailed distribution that alleviates issues related to far points and optimisation in SNE (Maaten and Hinton, 2008).
Because the Gaussian kernel is independent of the data distribution, tSNE must fine-tune the bandwidth of the Gaussian kernel centred at each point in the given dataset in order to adjust the similarity locally. In other words, tSNE must determine $n$ bandwidths for a dataset of $n$ points. This is accomplished by using a heuristic search with a single global parameter called perplexity, such that the Shannon entropy is fixed for all probability distributions at all points, thereby adapting each bandwidth to the local density of the dataset.
As the perplexity can be interpreted as a smooth measure of the effective number of neighbours (Maaten and Hinton, 2008), the method can be interpreted as using a user-specified number of nearest neighbours (aka kNN) in order to determine the $n$ bandwidths (see the discussion section for more on this point).
Whilst there is only one user-specified parameter, this does not hide the fact that $n$ (internal) parameters need to be determined. This becomes the first obstacle in dealing with large datasets. In addition, the relationship between bandwidth and local density is undefined in the current formulation. This is despite the fact that a few papers (Hinton and Roweis, 2003; Maaten and Hinton, 2008; Pezzotti et al., 2016) have mentioned that the local neighbourhood size or bandwidth for each data point depends on its local density. No clear relationship between the bandwidth and local density has been established thus far.
The contributions of this paper are:

Identifying a fundamental issue for both SNE and tSNE: the similarity of two points in high-dimensional space is not defined. This underlies two key issues in tSNE that were unexplored until now, i.e., (a) the reference probability is set based on entropy, which has an undefined relation with local density; and (b) the use of a data-independent kernel leads to the need to determine $n$ bandwidths for a dataset of $n$ points;

Establishing a principle for setting the reference probability in high-dimensional space in tSNE via a data-dependent kernel which has a well-defined kernel characteristic that is linked directly to local density;

Proposing a generic solution based on the principle, which simply replaces the data-independent kernel with a data-dependent kernel, leaving the rest of the procedure unchanged. The use of a data-dependent kernel resolves the two issues: (a) it sets the reference probability to be inversely proportional to local density; and (b) it determines one global parameter only instead of $n$ local parameters. This addresses the fundamental issue as well as its two ensuing issues;

Analysing the advantages and disadvantages of the proposed solution.
Two net effects of using a datadependent kernel are that it enables tSNE to deal with:

high-dimensional datasets more effectively, especially sparse ones.

large datasets, because it removes the reliance on a number of parameters equal to the number of data points, which arises when a data-independent kernel is used.
In a recent development, a data-dependent kernel called Isolation Kernel (Ting et al., 2018; Qin et al., 2019) has been shown to adapt its similarity to local density, i.e., two points in the sparse region are more similar than two points of equal inter-point distance in the dense region. We investigate the use of Isolation Kernel in tSNE, and examine its impact.
2 Impact of the issues in tSNE
The fundamental issue of tSNE, together with its two ensuing issues, can produce misleading mappings which do not reflect the structure of the given dataset in high dimension. Two examples are given as follows:

Misleading (dis)association between clusters of different subspaces. The first row in Table 1 highlights the impact when the Gaussian kernel is used: tSNE is unable to identify the joint component of the (first) three clusters in different subspaces, which share the same mean but nothing else.^{1} In contrast, the same tSNE algorithm employing the proposed Isolation Kernel (instead of the Gaussian kernel) produces a mapping which depicts the scenario in high dimension well: the three clusters are well separated and yet they share some common points, as shown in the second row in Table 1.
^{1} The synthetic 50-dimensional dataset contains 5 subspace clusters. Each cluster has 250 points, sampled from a 10-dimensional Gaussian distribution with the other 40 irrelevant attributes having zero values; but these attributes are relevant to the other four Gaussian distributions. In other words, no clusters share a single relevant attribute. In addition, all clusters have significantly different variances (the variance of the 5th cluster is 625 times larger than that of the 1st cluster). The first three clusters share the same mean; but the last two have different means. The five clusters have distributions: , , , and in each dimension.
Table 1: Visualisation results of tSNE using Gaussian kernel (panels (a)-(c)) and Isolation Kernel (panels (d)-(f)) on a 50-dimensional dataset with 5 subspace clusters.
A highly concentrated cluster is depicted as having some other structure. Table 2 compares the visualisation results of the two kernels on a dataset having a highly concentrated cluster with 250 noise points in subspaces.^{2} Using the Gaussian kernel, points belonging to the concentrated cluster are mapped to a ring. In this case, using a large perplexity value may help, as shown in Figure (c) in Table 2. In contrast, Isolation Kernel always produces mappings which do not modify the structure of the concentrated cluster, independent of the parameter used, as shown in Figures (d)-(f) in Table 2.
^{2} The 250-attribute dataset contains 550 points located at the origin and 250 noise points. Each noise point has 100 randomly selected attributes with value 1, and the rest of its attributes have value 0.
Table 2: Visualisation results of tSNE with Gaussian kernel (panels (a)-(c)) and Isolation Kernel (panels (d)-(f)) on a 250-dimensional dataset with one cluster (indicated as blue points) and noise points (indicated as red points).
The rest of the paper is organised as follows. Related work is described in Section 3. The current tSNE and its fundamental issue are provided in Section 4. We present the proposed change in tSNE using Isolation Kernel and a principle of setting the reference probability in Section 5 and Section 6, respectively. The empirical evaluation is given in Section 7, followed by discussion and conclusions in the last two sections.
3 Related work
SNE (Hinton and Roweis, 2003) and its variations have been widely applied in dimensionality reduction and visualisation. In addition to tSNE (Maaten and Hinton, 2008), which is one of the most famous visualisation methods, many other variations have been proposed to improve SNE in different aspects.
There are some improvements based on revised Gaussian kernel functions in order to get better similarity measurements. Cook et al. (2007) propose a symmetrised SNE; Yang et al. (2009) enable tSNE to accommodate various heavy-tailed embedding similarity functions; and Van Der Maaten and Weinberger (2012) propose an algorithm based on similarity triplets of the form “A is more similar to B than to C” to model the local structure of the data more effectively.
Based on SNE and the concept of information retrieval, NeRV (Venna et al., 2010) uses a cost function to trade off between precision and recall of “making true similarities visible and avoiding false similarities”, when projecting data into 2-dimensional space for visualising similarity relationships. Unlike SNE, which relies on a single Kullback-Leibler divergence, NeRV uses a weighted mixture of two dual Kullback-Leibler divergences in neighbourhood retrieval. Furthermore, JSE (Lee et al., 2013) enables tSNE to use a different mixture of Kullback-Leibler divergences, a kind of generalised Jensen-Shannon divergence, to improve the embedding result.
To reduce the runtime of tSNE, Van Der Maaten (2014) explores tree-based indexing schemes and uses the Barnes-Hut approximation to reduce the time complexity to $O(n\log n)$. This gives a trade-off between speed and mapping quality. To further reduce the time complexity to $O(n)$, Linderman et al. (2019) utilise a fast Fourier transform to dramatically reduce the time of computing the gradient during each iteration. The method uses vantage-point trees and approximate nearest neighbours in dissimilarity calculation with rigorous bounds on the approximation error.
Some works focus on analysing the heuristic methods for solving the non-convex optimisation problems of the embedding (Linderman and Steinerberger, 2017; Shaham and Steinerberger, 2017). Recently, Arora et al. (2018) theoretically analyse this optimisation and provide a framework to make clusterable data visually identifiable in the 2-dimensional embedding space. These works are not related to similarity measurements and are therefore not directly relevant to the work reported here.
None of the above methods questions the suitability of the Gaussian kernel in SNE or tSNE. We argue that the issues mentioned in Sections 1 and 2 have their root cause in the Gaussian kernel. Since the aim is to have a data-dependent kernel, these issues can be easily overcome by using a recently introduced data-dependent kernel called Isolation Kernel, instead of spending effort in making the data-independent Gaussian kernel data-dependent. We describe the current tSNE and the proposed change in tSNE in the next two sections.
4 Current tSNE
We describe the pertinent details of tSNE here.
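Given a dataset $D=\{x_1,\dots,x_n\}$ in $\mathbb{R}^d$, the similarity between $x_i$ and $x_j$ is measured using a Gaussian kernel as follows:
$$K(x_i,x_j)=\exp\left(-\frac{\|x_i-x_j\|^2}{2\sigma_i^2}\right)$$
tSNE computes the conditional probability $p_{j|i}$ that $x_i$ would pick $x_j$ as its neighbour as follows:
$$p_{j|i}=\frac{K(x_i,x_j)}{\sum_{k\neq i}K(x_i,x_k)}$$
The probability $p_{ij}$, a symmetric version of $p_{j|i}$, is computed as:
$$p_{ij}=\frac{p_{j|i}+p_{i|j}}{2n}$$
Given a fixed perplexity value, tSNE performs a binary search for the best value of $\sigma_i$ such that the perplexity of the conditional distribution $P_i$ over neighbours of $x_i$ equals that value, where the perplexity is defined as
$$Perp(P_i)=2^{H(P_i)} \tag{1}$$
where $H(P_i)$ is the Shannon entropy:
$$H(P_i)=-\sum_{j}p_{j|i}\log_2 p_{j|i} \tag{2}$$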
The perplexity is a smooth measure of the effective number of neighbours, similar to the number of nearest neighbours $k$ used in kNN methods (Hinton and Roweis, 2003). Thus, the bandwidth $\sigma_i$ is adapted to the density of the data, i.e., it becomes smaller for denser data since the $k$-nearest neighbourhood is smaller. In addition, Maaten and Hinton (2008) point out that there is a monotonically increasing relationship between perplexity and the bandwidth $\sigma_i$.
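The bandwidth search described above can be sketched for a single point as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the log-scale bisection, the search interval and the tolerance are all assumptions made for the example.

```python
import math

def conditional_probs(sq_dists, sigma):
    """p_{j|i} for one point, given its squared distances to all other points."""
    # Subtracting the smallest distance avoids underflow; it cancels on normalising.
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 ** H(P_i), with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def search_sigma(sq_dists, target, tol=1e-5, max_iter=200):
    """Bisection (on a log scale) for the bandwidth whose perplexity is `target`."""
    lo, hi = 1e-10, 1e10          # assumed search interval
    sigma = math.sqrt(lo * hi)
    for _ in range(max_iter):
        sigma = math.sqrt(lo * hi)
        perp = perplexity(conditional_probs(sq_dists, sigma))
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma            # too many effective neighbours: narrow the kernel
        else:
            lo = sigma            # too few effective neighbours: widen the kernel
    return sigma
```

Because the perplexity increases monotonically with the bandwidth, the bisection converges; running it once per point yields the $n$ bandwidths, and a point in a denser neighbourhood (smaller distances) receives a smaller bandwidth for the same target perplexity.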
Notice that the similarity between two points in highdimensional space is not and cannot be defined based on the above formulation.
The aim of tSNE is to map $D\subset\mathbb{R}^d$ to $D'\subset\mathbb{R}^{d'}$ where $d'\ll d$ such that the similarities between points are preserved as much as possible from the high-dimensional space to the low-dimensional space. As tSNE is meant to be a visualisation tool, $d'=2$ usually.
The similarity between $x'_i$ and $x'_j$ in the low-dimensional space is measured as:
$$s(x'_i,x'_j)=\left(1+\|x'_i-x'_j\|^2\right)^{-1}$$
and the corresponding probability is defined as:
$$q_{ij}=\frac{s(x'_i,x'_j)}{\sum_{k\neq\ell}s(x'_k,x'_\ell)}$$
The distance-based similarity $s$ is used because it has a heavy-tailed distribution, i.e., it approaches an inverse square law for large pairwise distances. This means mapped points which are far apart have $q_{ij}$ which are almost invariant to changes in the scale of the low-dimensional space (Maaten and Hinton, 2008).
Note that the probability $p_{ii}$ is set to 0, and so is $q_{ii}$.
The location of each point $x'_i$ is determined by minimising a cost function based on the (non-symmetric) Kullback-Leibler divergence of the distribution $Q$ from the distribution $P$:
$$KL(P\|Q)=\sum_{i\neq j}p_{ij}\log\frac{p_{ij}}{q_{ij}}$$
The use of the Gaussian kernel sharpens the cost function in retaining the local structure of the data when mapping from the highdimensional space to the lowdimensional space.
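As an illustration of the low-dimensional side of the procedure, the following sketch computes the heavy-tailed probabilities and the KL cost for a given embedding. It assumes NumPy; the function names and the small constant `eps` are illustrative choices, and no gradient step is shown.

```python
import numpy as np

def low_dim_probs(Y):
    """q_ij from the heavy-tailed similarity 1 / (1 + squared distance)."""
    diff = Y[:, None, :] - Y[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    s = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(s, 0.0)            # q_ii is set to 0
    return s / s.sum()

def kl_cost(P, Q, eps=1e-12):
    """KL(P || Q) over all pairs with p_ij > 0 (the diagonal is excluded)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))
```

Here `Y` is an n x 2 array of mapped points and `P` is any n x n matrix of reference probabilities with a zero diagonal that sums to one; the cost is zero only when the two distributions coincide.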
The procedure of tSNE is provided in Algorithm 1.
Because the Gaussian kernel $K$ is data-independent, its use necessitates determining $n$ local bandwidths for $n$ points in order to adapt to the local structure of the data; and this search^{3} is the key component that determines the success or failure of tSNE. A gradient descent search has been used successfully to perform the search for $n$ parameters for small datasets (Maaten and Hinton, 2008). For large datasets, however, the need for this parameter search poses a real limitation in terms of finding appropriate settings for the large number of parameters and the computational expense required.
^{3} ‘A binary search for the value of $\sigma_i$ that makes the entropy of the distribution over neighbours equal to $\log k$, where $k$ is the effective number of local neighbours or “perplexity”.’ (Hinton and Roweis, 2003) Another view is: adjust all $n$ bandwidths $\sigma_i$ such that all distributions $P_i$ have the same entropy, i.e., the base-2 logarithm of the given perplexity.
While determining $n$ local bandwidths is an issue, there is a more fundamental issue, which is presented in the next section.
4.1 A fundamental issue in both SNE and tSNE
A fundamental issue in SNE and tSNE is that the ‘learned’ similarity of any two points in highdimensional space is not and cannot be defined.
Both SNE and tSNE aim to make a data-independent kernel data-dependent by finding a local bandwidth of the Gaussian kernel for every point in a dataset. This fundamental issue is hidden for one key reason, i.e., the ‘learned’ similarity of two points in high-dimensional space does not need to be computed. This is because the probability $p_{ij}$, which is a proxy to the ‘learned’ similarity between $x_i$ and $x_j$, is resigned to a heuristic that summarises the asymmetric probabilities $p_{j|i}$ and $p_{i|j}$.
As a result, it is not possible to explain how the similarity depends on the data distribution, i.e., the data dependency relationship cannot be established succinctly. This is not just a conceptual issue but also a practical one: it is unclear how the similarity of two points in high-dimensional space can be computed after all local bandwidths of the Gaussian kernel have been determined.
Note that $p_{ij}$ does not reflect the resultant data-dependent similarity because the data dependency characteristic of $p_{ij}$ cannot be ascertained.
This is troubling because the aim is purportedly based on the similarity which is represented by the conditional probability, i.e., “find a low-dimensional data representation that minimises the mismatch between $p_{j|i}$ and $q_{j|i}$” (Maaten and Hinton, 2008). Yet the ‘learned’ similarity between two points in high-dimensional space is not defined and cannot be computed.
This fundamental issue underlies the two key issues in the procedure, described in Section 1: (a) setting the reference probability has no clear basis without a well-defined similarity, despite the use of entropy; and (b) the need to set $n$ bandwidths for a dataset of $n$ points.
We show in the next section that a well-defined data-dependent similarity called Isolation Kernel addresses the fundamental issue caused by the use of the Gaussian kernel, and thereby resolves the two key issues.
5 The proposed change: tSNE with Isolation Kernel
Since tSNE needs a datadependent kernel, we propose to use a recent datadependent kernel called Isolation Kernel (Ting et al., 2018; Qin et al., 2019) to replace the data independent Gaussian kernel in tSNE.
Isolation Kernel is a perfect match for the task because a datadependent kernel, by definition, adapts to local distribution. The kernel replacement is conducted in the component in the highdimensional space only, leaving the other components of the tSNE procedure unchanged.
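Let $D=\{x_1,\dots,x_n\}$, $x_i\in\mathbb{R}^d$, be a dataset sampled from an unknown probability density function $x_i\sim F$. Moreover, let $\mathbb{H}_\psi(D)$ denote the set of all partitionings $H$ that are admissible under the dataset $D$, where each $H$ covers the entire space of $\mathbb{R}^d$; and each of the $\psi$ isolating partitions $\theta\in H$ isolates one data point from the rest of the points in a random subset $\mathcal{D}\subset D$, and $|\mathcal{D}|=\psi$.
For any two points $x,y\in\mathbb{R}^d$, Isolation Kernel of $x$ and $y$ wrt $D$ is defined to be the expectation taken over the probability distribution on all partitionings $H\in\mathbb{H}_\psi(D)$ that both $x$ and $y$ fall into the same isolating partition $\theta\in H$:
$$K_\psi(x,y\,|\,D)=\mathbb{E}_{\mathbb{H}_\psi(D)}\left[\mathbb{1}(x,y\in\theta\;|\;\theta\in H)\right] \tag{3}$$
where $\mathbb{1}(\cdot)$ is an indicator function.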
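In practice, Isolation Kernel $K_\psi$ is constructed using a finite number of partitionings $H_i$, $i=1,\dots,t$, where each $H_i$ is created using $\mathcal{D}_i\subset D$:
$$K_\psi(x,y\,|\,D)=\frac{1}{t}\sum_{i=1}^{t}\mathbb{1}(x,y\in\theta\;|\;\theta\in H_i)=\frac{1}{t}\sum_{i=1}^{t}\sum_{\theta\in H_i}\mathbb{1}(x\in\theta)\,\mathbb{1}(y\in\theta) \tag{4}$$
where $\theta$ is a shorthand for $\theta[z]$. $\psi$ is the sharpness parameter and the only parameter of the Isolation Kernel.
As Equation (4) is quadratic, $K_\psi$ is a valid kernel.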
The larger $\psi$ is, the sharper the kernel distribution. It plays a role similar to that of $\sigma$ in the Gaussian kernel, i.e., the smaller $\sigma$ is, the narrower the kernel distribution. The key difference is that Isolation Kernel adapts to the local density distribution, whereas the Gaussian kernel is independent of the data distribution.
The proposed tSNE simply replaces $K$ with $K_\psi$ in defining $p_{j|i}$, i.e.,
$$p_{j|i}=\frac{K_\psi(x_i,x_j)}{\sum_{k\neq i}K_\psi(x_i,x_k)}$$
The rest of the procedure of tSNE remains unchanged.
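Since only the kernel changes, the reference probabilities can be obtained from any precomputed similarity matrix. The sketch below assumes NumPy and uses the standard tSNE symmetrisation $(p_{j|i}+p_{i|j})/(2n)$; the function name and the guard against all-zero rows are illustrative assumptions.

```python
import numpy as np

def joint_probs_from_kernel(K):
    """Turn an n x n similarity matrix into the symmetric p_ij used by tSNE."""
    K = np.array(K, dtype=float)            # work on a copy
    np.fill_diagonal(K, 0.0)                # a point never picks itself
    row_sums = K.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0         # guard: a row with no similar points
    P_cond = K / row_sums                   # p_{j|i}
    n = K.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)  # p_ij
```

Passing a Gaussian kernel matrix (with per-point bandwidths) or an Isolation Kernel matrix to this function gives the corresponding reference probabilities; no bandwidth search is involved in the latter case.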
The procedure of tSNE with Isolation Kernel is provided in Algorithm 2.
Note that the only difference between the two algorithms is the first two lines.
6 A principle of setting the reference probability
Here we provide a principled approach through Isolation Kernel, which has the following well-defined characteristic: two points in a sparse region are more similar than two points of equal inter-point distance in a dense region (Ting et al., 2018).
Using a specific implementation of Isolation Kernel (see Appendix), Qin et al. (2019) have provided the following Lemma (see its proof in their paper):
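Lemma 1 (Qin et al., 2019). $\forall x_i,x_j\in\mathcal{X}_S$ (sparse region) and $\forall x_k,x_\ell\in\mathcal{X}_T$ (dense region) such that $\forall y\in\mathcal{X}_S, z\in\mathcal{X}_T$, $\rho(y)<\rho(z)$, the nearest neighbour-induced Isolation Kernel $K_\psi$ has the characteristic that $\|x_i-x_j\|=\|x_k-x_\ell\|$ implies
$$K_\psi(x_i,x_j)>K_\psi(x_k,x_\ell) \tag{5}$$
where $\|x-y\|$ is the distance between $x$ and $y$; and $\rho(x)$ denotes the density at point $x$.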
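Let $p_{j|i}$ be the probability that $x_i$ would pick $x_j$ as its neighbour.
We provide two corollaries from Lemma 1 as follows.
Corollary 1. $\forall x_i,x_j\in\mathcal{X}_S$ and $\forall x_k,x_\ell\in\mathcal{X}_T$ such that $\|x_i-x_j\|=\|x_k-x_\ell\|$, $x_i$ is more likely to pick $x_j$ as a neighbour than $x_k$ is to pick $x_\ell$ as a neighbour, i.e., $p_{j|i}>p_{\ell|k}$.
This is because $x_k$ in the dense region is more likely to pick a point closer than $x_\ell$ as its neighbour, in comparison with $x_i$ picking $x_j$ as a neighbour in the sparse region, given that $\|x_i-x_j\|=\|x_k-x_\ell\|$.
Corollary 2. $p_{j|i}\propto 1/\bar{\rho}(\mathcal{X})$, where $\mathcal{X}$ is a region in $\mathbb{R}^d$; and $\bar{\rho}(\cdot)$ is an average density of a region.
Using a data-dependent kernel with a well-defined characteristic as specified in Lemma 1, we can establish that the probability that $x_i$ would pick $x_j$, $p_{j|i}$, is inversely proportional to the density of the local region.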
This becomes the basis in setting a reference probability in highdimensional space.
It is interesting to note that the statement is also true when $K$ is a Gaussian kernel in tSNE. However, neither of the statements in Corollaries 1 and 2 can be established. This is despite the fact that tSNE does intend to adjust the bandwidth of the Gaussian kernel locally. This is because the Gaussian kernel with local bandwidths, determined based on the entropy criterion, does not have a well-defined kernel characteristic that relates to local density.
Summary
In a nutshell, the fundamental issue of SNE and tSNE is: the ‘learned’ similarity of two points in high-dimensional space is not defined, despite their use of a Gaussian kernel having local bandwidth $\sigma_i$ centred at point $x_i$. In fact, it is unclear how the similarity of two points can be computed after all local bandwidths of the Gaussian kernel have been determined. In addition, setting the reference probability has no clear basis without a well-defined similarity, despite the use of entropy.
The use of Isolation Kernel in tSNE brings about two key benefits: (a) improved visualisation quality; and (b) reduced runtime of step 1 in the tSNE algorithm. The first benefit is a direct result of better data dependency. The second arises because Isolation Kernel yields an algorithm with a truly single global parameter; this eliminates the need to tune $n$ bandwidths (internally). For a large dataset, it is infeasible to estimate the large number of bandwidths with an appropriate degree of accuracy. We verify these two benefits in an empirical evaluation, reported in the next section.
7 Empirical Evaluation
We present the evaluation measures used in the first subsection. The empirical evaluation results and the runtime comparison are provided in the next two subsections.
7.1 Evaluation measures
We used a qualitative assessment $R_{NX}(k)$ to evaluate the preservation of $k$-ary neighbourhoods (Lee and Verleysen, 2009; Lee et al., 2013, 2015), defined as follows:
$$R_{NX}(k)=\frac{(n-1)\,Q_{NX}(k)-k}{n-1-k} \tag{6}$$
where
$$Q_{NX}(k)=\frac{1}{nk}\sum_{i=1}^{n}\left|N_k(x_i)\cap N_k(x'_i)\right|$$
and $N_k(x)$ is the set of $k$ nearest neighbours of $x$; and $x'_i$ is the corresponding low-dimensional (LD) point of the high-dimensional (HD) point $x_i$.
$R_{NX}(k)$ measures the $k$-ary neighbourhood agreement between the HD and corresponding LD spaces. $R_{NX}(k)\in[0,1]$; and the higher the score, the better the neighbourhoods are preserved in LD space. In our experiments, we recorded the assessment with $k\in[1,n-2]$ and produced the curve, i.e., $k$ vs $R_{NX}(k)$.
To aggregate the performance over different $k$-ary neighbourhoods, we calculate the area under the $R_{NX}(k)$ curve in the log plot (Lee et al., 2013) as:
$$AUC=\frac{\sum_{k=1}^{n-2}R_{NX}(k)/k}{\sum_{k=1}^{n-2}1/k} \tag{7}$$
AUC assesses the average quality weighted by $k$, i.e., errors in large neighbourhoods with large $k$ contribute less than those with small $k$ to the average quality.
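The neighbourhood-preservation measure and its log-scale AUC can be sketched as below. This follows the standard definitions of Lee et al. and assumes NumPy; the function names are illustrative, and the brute-force set intersection is only meant for small datasets, not for the dataset sizes used in the experiments.

```python
import numpy as np

def rank_neighbours(Z):
    """For each row of Z, the indices of all other points sorted by distance."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, -1.0)                  # force each point to rank itself first
    return np.argsort(d2, axis=1)[:, 1:]        # then drop that self entry

def r_nx_curve(X_hd, X_ld):
    """R_NX(k) for k = 1 .. n-2, comparing HD and LD neighbourhoods."""
    n = X_hd.shape[0]
    nn_hd, nn_ld = rank_neighbours(X_hd), rank_neighbours(X_ld)
    ks = np.arange(1, n - 1)
    r = np.empty(len(ks))
    for idx, k in enumerate(ks):
        overlap = sum(len(set(nn_hd[i, :k]) & set(nn_ld[i, :k])) for i in range(n))
        q = overlap / (n * k)                   # Q_NX(k)
        r[idx] = ((n - 1) * q - k) / (n - 1 - k)
    return ks, r

def auc_log_k(ks, r):
    """Area under the R_NX curve when k is plotted on a logarithmic axis."""
    return float(np.sum(r / ks) / np.sum(1.0 / ks))
```

An embedding that preserves every neighbourhood exactly gives a curve equal to 1 for all $k$ and hence an AUC of 1; small-$k$ errors lower the AUC more than large-$k$ errors because each term is divided by $k$.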
In addition, the purpose of many methods of dimension reduction is to identify HD clusters in the LD space such as in a 2dimensional scatter plot. Since all the datasets we used for evaluation have ground truth (labels), we can use measures for clustering validation to evaluate whether all clusters can be correctly identified after they are projected into the LD space. Here we select two popular indices of cluster validation, i.e., DaviesBouldin (DB) index (Davies and Bouldin, 1979) and CalinskiHarabasz (CH) index (Caliński and Harabasz, 1974). Their details are given as follows.
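Let $x$ be an instance in a cluster $C_i$ which has $n_i$ instances with the centre as $c_i$. The Davies-Bouldin (DB) index can be obtained as
$$DB=\frac{1}{N_C}\sum_{i=1}^{N_C}\max_{j\neq i}\frac{\frac{1}{n_i}\sum_{x\in C_i}\|x-c_i\|+\frac{1}{n_j}\sum_{x\in C_j}\|x-c_j\|}{\|c_i-c_j\|} \tag{8}$$
where $N_C$ is the number of clusters in the dataset.
The Calinski-Harabasz (CH) index is calculated as
$$CH=\frac{\sum_{i=1}^{N_C}n_i\|c_i-c\|^2/(N_C-1)}{\sum_{i=1}^{N_C}\sum_{x\in C_i}\|x-c_i\|^2/(n-N_C)} \tag{9}$$
where $c$ is the centre of the dataset.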
Both measures take the similarity of points within a cluster and the similarity between clusters into consideration, but in different ways. These measures assign the best score to the algorithm that produces clusters with low intracluster distances and high intercluster distances. Note that the higher the CH score, the better the cluster distribution; while the lower the DB score, the better the cluster distribution.
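For concreteness, the two indices can be computed as in the following sketch, which uses their standard textbook definitions with Euclidean distance and assumes NumPy. The function names are illustrative, and the code is not the Matlab implementation used in the experiments.

```python
import numpy as np

def cluster_stats(X, labels):
    """Cluster ids, cluster centres and cluster sizes."""
    ids = np.unique(labels)
    centres = np.array([X[labels == c].mean(axis=0) for c in ids])
    sizes = np.array([(labels == c).sum() for c in ids])
    return ids, centres, sizes

def davies_bouldin(X, labels):
    """Lower is better: compact clusters that are far from each other."""
    ids, centres, _ = cluster_stats(X, labels)
    # Mean distance of each cluster's points to its own centre.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centres[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    worst = []
    for i in range(len(ids)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centres[i] - centres[j])
            for j in range(len(ids)) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))

def calinski_harabasz(X, labels):
    """Higher is better: between-cluster over within-cluster dispersion."""
    ids, centres, sizes = cluster_stats(X, labels)
    n, k = X.shape[0], len(ids)
    between = np.sum(sizes * np.sum((centres - X.mean(axis=0)) ** 2, axis=1))
    within = sum(np.sum((X[labels == c] - centres[i]) ** 2) for i, c in enumerate(ids))
    return float((between / (k - 1)) / (within / (n - k)))
```

Both functions take the projected points and the ground-truth labels; moving two clusters further apart lowers the DB score and raises the CH score.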
7.2 Evaluation results
We used 20 real-world datasets with different data sizes and dimensions to evaluate the use of Isolation Kernel and Gaussian Kernel in tSNE.^{4} All algorithms used in the experiments were implemented in Matlab 2018b and were run on a machine with eight cores (Intel Core i7-7820X 3.60GHz) and 32GB memory. All datasets were normalised using min-max normalisation so that each attribute is in [0,1] before the experiments began. The same normalisation was used on the projected datasets before calculating CH and DB scores. We report the best performance of each algorithm with a systematic parameter search with the range shown in Table 3.^{5}
^{4} COIL20, HumanActivity and Isolet are from Li et al. (2016); News20.binary and Rcv1 are from Chang and Lin (2011); and all other real-world datasets are from the UCI Machine Learning Repository (Dua and Graff, 2017).
^{5} The original tSNE paper says that “the performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50” (Maaten and Hinton, 2008).
Table 3: Parameters and their search ranges.
Kernel  Parameter with search range
Gaussian Kernel  ;
Isolation Kernel  ;
Table 4: Evaluation results of tSNE using Gaussian Kernel (GK) and Isolation Kernel (IK).
Dataset  #Points  #Attr  AUC (GK)  AUC (IK)  DB (GK)  DB (IK)  CH (GK)  CH (IK)
Wine  178  13  0.65  0.67  0.52  0.43  625  853 
Dermatology  358  34  0.68  0.684  0.47  0.40  3679  4532 
ForestType  523  27  0.70  0.71  0.91  0.89  467  478 
WDBC  569  30  0.64  0.67  0.70  0.58  821  1167 
ILPD  579  9  0.67  0.69  4.30  3.71  21  28 
Control  600  60  0.69  0.70  0.67  0.68  3847  6816 
Pima  768  8  0.70  0.71  3.74  2.98  44  72 
Parkinson  1040  26  0.70  0.74  8.40  6.35  13  22 
Biodeg  1055  41  0.74  0.77  2.04  2.12  154  146 
Mice  1080  83  0.79  0.82  0.32  0.18  8326  39085 
Messidor  1151  19  0.71  0.74  8.72  6.79  14  22 
Hill  1212  100  0.69  0.73  16.71  15.10  4  4.5 
COIL20  1440  1024  0.75  0.79  1.66  2.67  2352  3730 
HumanActivity  1492  561  0.78  0.79  2.87  2.86  1225  1631 
Isolet  1560  617  0.79  0.81  1.83  1.41  1746  2812 
Segment  2310  19  0.68  0.72  1.39  2.02  4052  5079 
Spam  4601  57  0.67  0.70  1.46  1.33  1626  1874 
News20.binary  9998  1355191  0.27  0.23  3.24  1.92  661  2320 
Rcv1  10121  47236  0.68  0.66  1.80  1.43  2421  4221 
Pendig  10992  16  0.69  0.693  1.14  1.10  6944  6777 
Average  0.68  0.70  3.15  2.75  1952  4084 
Table 4 shows the results of the two kernels used in tSNE. Isolation Kernel performs better on 18 of 20 datasets in terms of AUC, which means that Isolation Kernel enables tSNE to preserve the local neighbourhoods much better than the Gaussian kernel. With regard to the cluster quality, Isolation Kernel performs better than the Gaussian kernel on 17 out of 20 datasets in terms of both DB and CH. Notice that when the Gaussian kernel is better, the performance gaps are usually small in all three measures. Overall, Isolation Kernel is better than the Gaussian kernel on 15 out of 20 datasets in all three measures, and there is no dataset on which the reverse is true.
It is worth mentioning that the extremely high-dimensional datasets News20 and Rcv1 are very sparse: nearly 99% of the attribute values are zero. Although Isolation Kernel performed slightly worse in terms of AUC on these two datasets, it significantly improved the cluster structure obtained by the Gaussian kernel in terms of both DB and CH.
We compare the visualisation results of News20 and Rcv1, i.e., the two datasets having the highest numbers of attributes, in Table 5 and Table 6, respectively. It is interesting to note that tSNE using Isolation Kernel with a small $\psi$ produces better visualisation results, with more separable clusters, than tSNE using the Gaussian kernel with high perplexity.
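Table 5: Visualisation results of tSNE on News20 using Gaussian Kernel (panels (a)-(c)) and Isolation Kernel (panels (d)-(f)).
Table 6: Visualisation results of tSNE on Rcv1 using Gaussian Kernel (panels (a)-(c)) and Isolation Kernel (panels (d)-(f)).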
Table 5(f) shows an interesting visualisation result which deserves further investigation. We selected the centre point in LD space as a reference point, and computed the average number of non-zero HD attributes of all points inside/outside of an LD ball centred at this reference point. Figure 1 shows the results. It is interesting to note that, of all points within (and also outside) the LD ball, the average number of non-zero HD attributes generally increases as the radius of the LD ball increases. tSNE using Isolation Kernel produces a structure where points having a low number of non-zero attributes are clustered at the centre, and points away from the centre have an increasingly higher number of non-zero attributes.^{6}
^{6} The attribute values in News20 are real values, unlike those in the synthetic dataset used in Table 2.
7.3 Runtime comparison
The computational complexities of the two kernels used in tSNE are shown in Table 7. Generally, all these kernels have quadratic time and space complexities. However, the Gaussian kernel in the original tSNE needs a large number of iterations to search for the optimal local bandwidth for each point.
Figure 2 presents two runtime comparisons of tSNE using the two kernels on a synthetic dataset. Figure 2(a) shows that the Gaussian kernel is much slower than Isolation Kernel in similarity calculations. This is mainly due to the search required to tune $n$ bandwidths in step 1 of the algorithm. Figure 2(b) shows the runtimes of the mapping process in step 4 of Algorithms 1 and 2, which is the same for both algorithms; it is not surprising that the runtimes are about the same in this step, regardless of the kernel employed.
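Table 7: Time and space complexities of the two kernels used in tSNE and of the tSNE mapping step.
Component  Time complexity  Space complexity
Gaussian Kernel  
Isolation Kernel  
tSNE Mapping  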
8 Discussion
The proposed idea can be applied to variants of stochastic neighbour embedding, e.g., NeRV (Venna et al., 2010) and JSE (Lee et al., 2013) since they employ the same algorithm procedure as tSNE. The only difference is the use of variants of the cost function, i.e., type 1 or type 2 mixture of KL divergences.
Recall that the first step of tSNE may be interpreted as using kNN to determine the $n$ bandwidths of the Gaussian kernel. There are existing kNN-based data-dependent kernels which adapt to local density, i.e.,

kNN kernel (Marin et al., 2018).
The kNN kernel is a binary function defined as:
$$K_{kNN}(x,y)=\mathbb{1}(y\in N_k(x)) \tag{10}$$
where $N_k(x)$ is the set of $k$ nearest neighbours of $x$.

Adaptive Gaussian Kernel (Zelnik-Manor and Perona, 2005).
The distance of the $k$-th NN has been used to set the bandwidth of the Gaussian kernel to make it adaptive to local density. This was proposed in spectral clustering as an adaptive means to adjust the similarity to perform dimensionality reduction before clustering.
The Adaptive Gaussian Kernel is defined as:
$$K_{AG}(x,y)=\exp\left(-\frac{\|x-y\|^2}{\sigma_x\sigma_y}\right) \tag{11}$$
where $\sigma_x$ is the distance between $x$ and $x$’s $k$-th nearest neighbour.
However, replacing the Gaussian kernel in tSNE with either of these kernels produces poor outcomes. For example, on the Segment and Spam datasets, the adaptive Gaussian kernel produced AUC scores of 0.35 and 0.22, respectively; and the kNN kernel yielded AUC scores of 0.38 and 0.28, respectively. These are significantly poorer than those produced using the Gaussian Kernel or Isolation Kernel shown in Table 4. We postulate that this is because a global $k$ is unable to make these kernels sufficiently adaptive to local distribution.
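The two kNN-based kernels discussed above can be sketched as follows, assuming NumPy and Euclidean distance. The adaptive Gaussian form follows the self-tuning definition of Zelnik-Manor and Perona; the function names are illustrative, and the sketch assumes no duplicate points (otherwise a bandwidth of zero could occur).

```python
import numpy as np

def pairwise_dists(X):
    """Euclidean distance between every pair of rows of X."""
    return np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

def knn_kernel(X, k):
    """K[i, j] = 1 if x_j is among the k nearest neighbours of x_i, else 0."""
    d = pairwise_dists(X)
    np.fill_diagonal(d, np.inf)                 # exclude the point itself
    nn = np.argsort(d, axis=1)[:, :k]
    K = np.zeros_like(d)
    np.put_along_axis(K, nn, 1.0, axis=1)
    return K

def adaptive_gaussian_kernel(X, k):
    """exp(-d_ij^2 / (sigma_i * sigma_j)), with sigma_i the k-th NN distance of x_i."""
    d = pairwise_dists(X)
    sigma = np.sort(d, axis=1)[:, k]            # column 0 is the point itself
    return np.exp(-d ** 2 / np.outer(sigma, sigma))
```

Note that the kNN kernel is not symmetric in general, whereas the adaptive Gaussian kernel is; both depend on the data only through the single global $k$.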
9 Conclusions
This paper identifies a fundamental issue in all algorithms which are direct descendants of SNE, i.e., the ‘learned’ similarity between any two points in high-dimensional space is not defined and cannot be computed. The root cause of this issue is the use of a data-independent kernel to produce a data-dependent kernel implicitly.
Like many problems, once the root cause of the fundamental issue is identified, the solution is simple. In the case of tSNE, we show that the fundamental issue can be addressed by simply replacing the Gaussian kernel with a datadependent kernel called Isolation Kernel which has a welldefined characteristic.
We show that this significantly improves the quality of the final visualisation output of tSNE; and it removes one obstacle that prevents tSNE from processing large datasets.
Appendix: The nearest neighbour implementation of Isolation Kernel
We use an existing nearest neighbour method to implement Isolation Kernel (Qin et al., 2019). It produces each model $H$ (a Voronoi diagram) which consists of $\psi$ isolating partitions $\theta$, given a subsample $\mathcal{D}$ of $\psi$ points. Each isolating partition or Voronoi cell $\theta\in H$ isolates one data point from the rest of the points in the subsample. The point which determines a cell is called the cell centre. The Voronoi cell centred at $z\in\mathcal{D}$ is given as:
$$\theta[z]=\left\{x\in\mathbb{R}^d\;\middle|\;z=\arg\min_{z'\in\mathcal{D}}\ell(x,z')\right\}$$
where $\ell(x,z')$ is a distance function and we use the Euclidean distance in this paper.
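The nearest-neighbour construction above can be sketched in a few lines, assuming NumPy. The function name, the default values of `psi` and `t`, and the dense n x n output are illustrative assumptions; this is a sketch of the Voronoi-based estimate in Equation (4), not the Matlab code used in the experiments.

```python
import numpy as np

def isolation_kernel(X, psi=8, t=200, seed=0):
    """Estimate the Isolation Kernel matrix with t random Voronoi partitionings."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(t):
        # A subsample of psi points serves as the cell centres of one partitioning.
        centres = X[rng.choice(n, size=psi, replace=False)]
        # Squared Euclidean distance from every point to every cell centre.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        cell = d2.argmin(axis=1)                # Voronoi cell each point falls into
        K += cell[:, None] == cell[None, :]     # 1 if both points share a cell
    return K / t
```

Each entry is the fraction of partitionings in which two points share a cell, so it lies in [0, 1] and equals 1 on the diagonal. In a dense region the sampled centres are closer together, so cells are smaller and two points at a fixed distance share a cell less often than they would in a sparse region.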
Figure 3 compares the contours of Isolation Kernel on two different data distributions. It shows that Isolation Kernel is adaptive to the local density. Under uniform data distribution in Figure 3(a), Isolation Kernel’s contour is symmetric with respect to the reference point at (0.5, 0.5). In Figure 3(b), however, the contour shows that, for points having equal inter-point distance from the reference point at (0.5, 0.5), points in the sparse region are more similar to the reference point than points in the dense region.
While this implementation of Isolation Kernel produces a contour similar to that of an exponential kernel under uniform density distribution, different implementations have different contours. For example, using axis-parallel partitionings to implement Isolation Kernel produces a contour (with the diamond shape) which is more akin to that of the Laplacian kernel under uniform density distribution (Ting et al., 2018). Of course, both the exponential and Laplacian kernels, like the Gaussian kernel, are data-independent.
References
 Arora et al. (2018) Sanjeev Arora, Wei Hu, and Pravesh K Kothari. An analysis of the t-SNE algorithm for data visualization. arXiv preprint arXiv:1803.01768, 2018.
 Caliński and Harabasz (1974) Tadeusz Caliński and Jerzy Harabasz. A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1):1–27, 1974.
 Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
 Cook et al. (2007) James Cook, Ilya Sutskever, Andriy Mnih, and Geoffrey Hinton. Visualizing similarity data with a mixture of maps. In Artificial Intelligence and Statistics, pages 67–74, 2007.
 Davies and Bouldin (1979) David L Davies and Donald W Bouldin. A cluster separation measure. IEEE transactions on pattern analysis and machine intelligence, (2):224–227, 1979.
 Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Hinton and Roweis (2003) Geoffrey E Hinton and Sam T Roweis. Stochastic neighbor embedding. In Advances in neural information processing systems, pages 857–864, 2003.
 Lee and Verleysen (2009) John A Lee and Michel Verleysen. Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72(7–9):1431–1443, 2009.
 Lee et al. (2013) John A Lee, Emilie Renard, Guillaume Bernard, Pierre Dupont, and Michel Verleysen. Type 1 and 2 mixtures of kullback–leibler divergences as cost functions in dimensionality reduction based on similarity preservation. Neurocomputing, 112:92–108, 2013.
 Lee et al. (2015) John A Lee, Diego H Peluffo-Ordóñez, and Michel Verleysen. Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure. Neurocomputing, 169:246–261, 2015.
 Li et al. (2016) Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Trevino Robert, Jiliang Tang, and Huan Liu. Feature selection: A data perspective. arXiv:1601.07996, 2016.
 Linderman and Steinerberger (2017) George C Linderman and Stefan Steinerberger. Clustering with t-SNE, provably. arXiv preprint arXiv:1706.02582, 2017.
 Linderman et al. (2019) George C Linderman, Manas Rachh, Jeremy G Hoskins, Stefan Steinerberger, and Yuval Kluger. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature methods, 16(3):243, 2019.
 Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
 Marin et al. (2018) D. Marin, M. Tang, I. Ben Ayed, and Y. Y. Boykov. Kernel clustering: density biases and solutions. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2018. ISSN 01628828. doi: 10.1109/TPAMI.2017.2780166.
 Pezzotti et al. (2016) Nicola Pezzotti, Boudewijn PF Lelieveldt, Laurens van der Maaten, Thomas Höllt, Elmar Eisemann, and Anna Vilanova. Approximated and user steerable tsne for progressive visual analytics. IEEE transactions on visualization and computer graphics, 23(7):1739–1752, 2016.
 Qin et al. (2019) Xiaoyu Qin, Kai Ming Ting, Ye Zhu, and CS Vincent Lee. Nearest-neighbour-induced isolation similarity and its impact on density-based clustering. In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
 Shaham and Steinerberger (2017) Uri Shaham and Stefan Steinerberger. Stochastic neighbor embedding separates well-separated clusters. arXiv preprint arXiv:1702.02670, 2017.
 Ting et al. (2018) Kai Ming Ting, Yue Zhu, and Zhi-Hua Zhou. Isolation kernel and its effect on SVM. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2329–2337. ACM, 2018.
 Van Der Maaten (2014) Laurens Van Der Maaten. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.
 Van Der Maaten and Weinberger (2012) Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In 2012 IEEE International Workshop on Machine Learning for Signal Processing, pages 1–6. IEEE, 2012.
 Venna et al. (2010) Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, and Samuel Kaski. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11(Feb):451–490, 2010.
 Yang et al. (2009) Zhirong Yang, Irwin King, Zenglin Xu, and Erkki Oja. Heavy-tailed symmetric stochastic neighbor embedding. In Advances in neural information processing systems, pages 2169–2177, 2009.
 Zelnik-Manor and Perona (2005) Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In Advances in neural information processing systems, pages 1601–1608, 2005.