Improving Stochastic Neighbour Embedding fundamentally with a well-defined data-dependent kernel

06/24/2019 · Ye Zhu et al., Federation University Australia

We identify a fundamental issue in the popular Stochastic Neighbour Embedding (SNE and t-SNE), i.e., the "learned" similarity of any two points in high-dimensional space is not defined and cannot be computed. It underlines two previously unexplored issues in the algorithm which have undermined the quality of its final visualisation output and its ability to process large datasets. The issues are: (a) the reference probability in high-dimensional space is set based on entropy, which has an undefined relation with local density; and (b) the use of a data-independent kernel leads to the need to determine n bandwidths for a dataset of n points. This paper establishes a principle for setting the reference probability via a data-dependent kernel which has a well-defined kernel characteristic that is linked directly to local density. A solution based on a recent data-dependent kernel called Isolation Kernel addresses the fundamental issue as well as its two ensuing issues. As a result, it significantly improves the quality of the final visualisation output and removes one obstacle that prevents t-SNE from processing large datasets. The solution is extremely simple: replace the existing data-independent kernel with Isolation Kernel, leaving the rest of the t-SNE procedure unchanged.

1 Introduction and Motivation

t-SNE (Maaten and Hinton, 2008) has been a successful and popular dimension reduction method for visualisation. It aims to make the similarities in the transformed low-dimensional space match those in the given high-dimensional space, based on KL divergence. The original SNE (Hinton and Roweis, 2003) employs a Gaussian kernel to measure similarity in both the high- and the low-dimensional spaces. t-SNE replaces the Gaussian kernel with the distance-based similarity $(1 + d_{ij}^2)^{-1}$ (where $d_{ij}$ is the distance between instances $i$ and $j$) in the low-dimensional space, while retaining the Gaussian kernel for the high-dimensional space. The distance-based similarity has a heavy-tailed distribution that alleviates issues related to far points and optimisation in SNE (Maaten and Hinton, 2008).

Because the Gaussian kernel is independent of the data distribution, t-SNE has to fine-tune the bandwidth of the Gaussian kernel centred at each point in the given dataset in order to adjust the similarity locally. In other words, t-SNE must determine $n$ bandwidths for a dataset of $n$ points. This is accomplished by a heuristic search with a single global parameter called perplexity, such that the Shannon entropy is fixed for the probability distributions at all points, adapting each bandwidth to the local density of the dataset.

As the perplexity can be interpreted as a smooth measure of the effective number of neighbours (Maaten and Hinton, 2008), the method can be viewed as using a user-specified number of nearest neighbours (aka kNN) to determine the $n$ bandwidths (see the discussion section for more on this point).

Whilst there is only one user-specified parameter, this does not hide the fact that $n$ (internal) parameters need to be determined. This becomes the first obstacle in dealing with large datasets. In addition, the relationship between bandwidth and local density is undefined in the current formulation. This is despite the fact that a few papers (Hinton and Roweis, 2003; Maaten and Hinton, 2008; Pezzotti et al., 2016) have mentioned that the local neighbourhood size or bandwidth for each data point depends on its local density. No clear relationship between the bandwidth and local density has been established thus far.

The contributions of this paper are:

  1. Identifying a fundamental issue for both SNE and t-SNE: the similarity of two points in high-dimensional space is not defined. This underlines two key issues in t-SNE that were unexplored until now, i.e., (a) the reference probability is set based on entropy, which has an undefined relation with local density; and (b) the use of a data-independent kernel leads to the need to determine $n$ bandwidths for a dataset of $n$ points;

  2. Establishing a principle for setting the reference probability in high-dimensional space in t-SNE via a data-dependent kernel which has a well-defined kernel characteristic that is linked directly to local density;

  3. Proposing a generic solution based on the principle: simply replace the data-independent kernel with a data-dependent kernel, leaving the rest of the procedure unchanged. The use of a data-dependent kernel resolves the two issues: (a) it sets the reference probability to be inversely proportional to the local density; and (b) it requires determining one global parameter only, instead of $n$ local parameters. This addresses the fundamental issue as well as its two ensuing issues;

  4. Analysing the advantages and disadvantages of the proposed solution.

Two net effects of using a data-dependent kernel are that it enables t-SNE to deal with:

  1. high-dimensional datasets more effectively, especially sparse datasets; and

  2. large datasets, because it removes the reliance on $n$ parameters (one per data point) which arises when a data-independent kernel is used.

In a recent development, a data-dependent kernel called Isolation Kernel (Ting et al., 2018; Qin et al., 2019) has been shown to adapt its similarity to local density, i.e., two points in a sparse region are more similar than two points of equal inter-point distance in a dense region. We investigate the use of Isolation Kernel in t-SNE, and examine its impact.

2 Impact of the issues in t-SNE

The fundamental issue of t-SNE, together with its two ensuing issues, can produce misleading mappings which do not reflect the structure of the given dataset in high dimensions. Two examples are given as follows:

  1. Misleading (dis)association between clusters of different subspaces. The first row in Table 1 highlights the impact when the Gaussian kernel is used: t-SNE is unable to identify the joint component of the (first) three clusters in different subspaces, which share the same mean but nothing else. [Footnote 1: The synthetic 50-dimensional dataset contains 5 subspace clusters. Each cluster has 250 points sampled from a 10-dimensional Gaussian distribution, with the other 40 irrelevant attributes having zero values; these 40 attributes are the relevant attributes of the other four clusters. In other words, no two clusters share a single relevant attribute. In addition, all clusters have significantly different variances (the variance of the 5th cluster is 625 times larger than that of the 1st cluster), and each cluster has a different Gaussian distribution in each of its relevant dimensions. The first three clusters share the same mean, but the last two have different means.]

    In contrast, the same t-SNE algorithm employing the proposed Isolation Kernel (instead of the Gaussian kernel) produces a mapping which depicts the high-dimensional scenario well: the three clusters are well separated and yet they share some common points, as shown in the second row of Table 1.

    Table 1: Visualisation results of t-SNE using Gaussian Kernel (top row, panels (a)-(c)) and Isolation Kernel (bottom row, panels (d)-(f)) on a 50-dimensional dataset with 5 subspace clusters.
  2. A highly concentrated cluster is depicted as having some other structure. Table 2 compares the visualisation results of the two kernels on a dataset having a highly concentrated cluster together with 250 noise points in subspaces. [Footnote 2: The 250-attribute dataset contains 550 points located at the origin and 250 noise points. Each noise point has 100 randomly selected attributes with value 1, and the rest of its attributes are 0.]

    Using the Gaussian kernel, points belonging to the concentrated cluster are mapped into a ring. In this case, using a large perplexity may help, as shown in Figure (c) in Table 2.

    In contrast, Isolation Kernel always produces mappings which do not modify the structure of the concentrated cluster, independent of the parameter used, as shown in Figures (d)-(f) in Table 2.

    Table 2: Visualisation results of t-SNE with Gaussian Kernel (top row, panels (a)-(c)) and Isolation Kernel (bottom row, panels (d)-(f)) on a 250-dimensional dataset with one cluster (indicated as blue points) and noise points (indicated as red points).

The rest of the paper is organised as follows. Related work is described in Section 3. The current t-SNE and its fundamental issue are provided in Section 4. We present the proposed change in t-SNE using Isolation Kernel and a principle of setting the reference probability in Section 5 and Section 6, respectively. The empirical evaluation is given in Section 7, followed by discussion and conclusions in the last two sections.

3 Related work

SNE (Hinton and Roweis, 2003) and its variations have been widely applied in dimensionality reduction and visualisation. In addition to t-SNE (Maaten and Hinton, 2008), which is one of the most famous visualisation methods, many other variations have been proposed to improve SNE in different aspects.

Some improvements are based on revised kernel functions in order to obtain better similarity measurements. Cook et al. (2007) propose a symmetrised SNE; Yang et al. (2009) enable t-SNE to accommodate various heavy-tailed embedding similarity functions; and Van Der Maaten and Weinberger (2012) propose an algorithm based on similarity triplets of the form "A is more similar to B than to C" to model the local structure of the data more effectively.

Based on SNE and the concept of information retrieval, NeRV (Venna et al., 2010) uses a cost function to trade off between the precision and recall of "making true similarities visible and avoiding false similarities" when projecting data into 2-dimensional space for visualising similarity relationships. Unlike SNE, which relies on a single Kullback-Leibler divergence, NeRV uses a weighted mixture of two dual Kullback-Leibler divergences in neighbourhood retrieval. Furthermore, JSE (Lee et al., 2013) enables t-SNE to use a different mixture of Kullback-Leibler divergences, a kind of generalised Jensen-Shannon divergence, to improve the embedding result.

To reduce the runtime of t-SNE, Van Der Maaten (2014) explores tree-based indexing schemes and uses the Barnes-Hut approximation to reduce the time complexity to $O(n \log n)$. This gives a trade-off between speed and mapping quality. To further reduce the time complexity to $O(n)$, Linderman et al. (2019) utilise a fast Fourier transform to dramatically reduce the time of computing the gradient during each iteration. The method uses vantage-point trees and approximate nearest neighbours in the dissimilarity calculation, with rigorous bounds on the approximation error.

Some works focus on analysing the heuristic methods used to solve the non-convex optimisation problem for the embedding (Linderman and Steinerberger, 2017; Shaham and Steinerberger, 2017). Recently, Arora et al. (2018) analysed this optimisation theoretically and provided a framework that makes clusterable data visually identifiable in the 2-dimensional embedding space. These works do not concern similarity measurements and are therefore not directly relevant to the work reported here.

None of the above methods questions the suitability of the Gaussian kernel in SNE or t-SNE. We argue that the issues mentioned in Sections 1 and 2 have their root cause in the Gaussian kernel. Since the aim is to have a data-dependent kernel, these issues can be overcome by using a recently introduced data-dependent kernel called Isolation Kernel, instead of spending effort making the data-independent Gaussian kernel data-dependent. We describe the current t-SNE and the proposed change to t-SNE in the next two sections.

4 Current t-SNE

We describe the pertinent details of t-SNE here.

Given a dataset $D = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$, the similarity between $x_i$ and $x_j$ is measured using a Gaussian kernel as follows:

$$K_\sigma(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\right).$$

t-SNE computes the conditional probability $p_{j|i}$ that $x_i$ would pick $x_j$ as its neighbour as follows:

$$p_{j|i} = \frac{K_\sigma(x_i, x_j)}{\sum_{k \neq i} K_\sigma(x_i, x_k)} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}.$$

The probability $p_{ij}$, a symmetric version of $p_{j|i}$, is computed as:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}.$$

Given a fixed perplexity value $u$, t-SNE performs a binary search for the best value of $\sigma_i$ such that the perplexity of $p_{\cdot|i}$, defined as

$$Perp(p_{\cdot|i}) = 2^{H(p_{\cdot|i})}, \tag{1}$$

equals $u$, where $H(p_{\cdot|i})$ is the Shannon entropy:

$$H(p_{\cdot|i}) = -\sum_{j} p_{j|i} \log_2 p_{j|i}. \tag{2}$$

The perplexity is a smooth measure of the effective number of neighbours, similar to the number of nearest neighbours $k$ used in kNN methods (Hinton and Roweis, 2003). Thus, $\sigma_i$ is adapted to the density of the data, i.e., it becomes smaller for denser data since the $k$-nearest neighbourhood is smaller. In addition, Maaten and Hinton (2008) point out that there is a monotonically increasing relationship between the perplexity and the bandwidth $\sigma_i$.
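To make the entropy-matching step concrete, below is a minimal Python sketch (our illustration, not the authors' Matlab code) of the per-point binary search for $\sigma_i$ given a target perplexity; the function names, tolerances, and the toy data are illustrative only.

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """p_{j|i} for one point i, given squared distances to all other points."""
    logits = -sq_dists_i / (2.0 * sigma ** 2)
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sigma_for_perplexity(sq_dists_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search sigma_i so that 2^H(p_{.|i}) matches the target perplexity."""
    lo, hi, sigma = 1e-20, 1e20, 1.0
    for _ in range(max_iter):
        p = conditional_probs(sq_dists_i, sigma)
        entropy = -np.sum(p * np.log2(p + 1e-12))      # Shannon entropy, Eq. (2)
        if abs(2.0 ** entropy - target_perplexity) < tol:
            break
        if 2.0 ** entropy > target_perplexity:         # too flat: shrink sigma
            hi = sigma
            sigma = (lo + sigma) / 2.0
        else:                                          # too sharp: grow sigma
            lo = sigma
            sigma = sigma * 2.0 if hi == 1e20 else (sigma + hi) / 2.0
    return sigma

# One search per point: n bandwidths for a dataset of n points.
X = np.random.rand(100, 10)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
sigmas = [sigma_for_perplexity(np.delete(D2[i], i), 30.0) for i in range(len(X))]
```

This per-point search is precisely the $n$-bandwidth determination that the proposed replacement removes.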

Notice that the similarity between two points in high-dimensional space is not and cannot be defined based on the above formulation.

The aim of t-SNE is to map $x_i \in \mathbb{R}^d$ to $y_i \in \mathbb{R}^{d'}$, where $d' \ll d$, such that the similarities between points are preserved as much as possible from the high-dimensional space to the low-dimensional space. As t-SNE is meant to be a visualisation tool, $d' = 2$ or $3$ usually.

The similarity between $y_i$ and $y_j$ in the low-dimensional space is measured as $(1 + \|y_i - y_j\|^2)^{-1}$, and the corresponding probability is defined as:

$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}.$$

The distance-based similarity is used because it has a heavy-tailed distribution, i.e., it approaches an inverse square law for large pairwise distances. This means that mapped points which are far apart have $q_{ij}$ which are almost invariant to changes in the scale of the low-dimensional space (Maaten and Hinton, 2008).

Note that the probabilities $p_{ii}$ and $q_{ii}$ are set to zero.

The location of each point $y_i$ is determined by minimising a cost function based on the (non-symmetric) Kullback-Leibler divergence of the distribution $Q$ from the distribution $P$:

$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$

The use of the Gaussian kernel sharpens the cost function in retaining the local structure of the data when mapping from the high-dimensional space to the low-dimensional space.
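For reference, here is a minimal sketch (ours, not the authors' implementation) of the heavy-tailed low-dimensional similarities, the resulting $q_{ij}$, and the KL cost being minimised; the random $P$ below merely stands in for the output of the high-dimensional kernel.

```python
import numpy as np

def q_matrix(Y):
    """q_ij from the heavy-tailed similarity (1 + ||y_i - y_j||^2)^-1, with q_ii = 0."""
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    w = 1.0 / (1.0 + sq)
    np.fill_diagonal(w, 0.0)
    return w / w.sum()

def kl_cost(P, Q, eps=1e-12):
    """C = KL(P || Q) = sum_ij p_ij * log(p_ij / q_ij)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))))

# Toy usage: a random symmetric P and a random 2-D embedding Y.
rng = np.random.default_rng(0)
P = rng.random((50, 50))
np.fill_diagonal(P, 0.0)
P = P + P.T
P = P / P.sum()
Y = rng.normal(size=(50, 2))
print(kl_cost(P, q_matrix(Y)))
```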

The procedure of t-SNE is provided in Algorithm 1.

Input: Dataset $D = \{x_1, \dots, x_n\}$; perplexity $u$
1: Determine $\sigma_i$ for every $x_i$ based on $u$
2: Compute $p_{j|i}$ based on the Gaussian kernel
3: Set $p_{ij} = (p_{j|i} + p_{i|j}) / (2n)$
4: Compute the low-dimensional $y_i$ and $q_{ij}$ which minimise the KL divergence
5: Output the low-dimensional data representation $\{y_1, \dots, y_n\}$
Algorithm 1: t-SNE

Because $K_\sigma$ is data independent, its use necessitates determining $n$ local bandwidths for $n$ points in order to adapt to the local structure of the data; and this search [Footnote 3: 'A binary search for the value of $\sigma_i$ that makes the entropy of the distribution over neighbours equal to $\log k$, where $k$ is the effective number of local neighbours or "perplexity"' (Hinton and Roweis, 2003). Another view is: adjust all bandwidths $\sigma_i$ such that all $p_{\cdot|i}$ have the same entropy $H(p_{\cdot|i}) = \log_2 u$.] is the key component that determines the success or failure of t-SNE. A gradient descent search has been used successfully to perform the search for the $n$ parameters on small datasets (Maaten and Hinton, 2008). For large datasets, however, the need for an $n$-parameter search poses a real limitation in terms of finding appropriate settings for the large number of parameters and the computational expense required.

While determining the local bandwidths is an issue, there is a more fundamental issue, which is presented in the next section.

4.1 A fundamental issue in both SNE and t-SNE

A fundamental issue in SNE and t-SNE is that the ‘learned’ similarity of any two points in high-dimensional space is not and cannot be defined.

Both SNE and t-SNE aim to make a data-independent kernel data-dependent by finding a local bandwidth of the Gaussian kernel for every point in a dataset. This fundamental issue is hidden for one key reason: the 'learned' similarity of two points in high-dimensional space never needs to be computed. This is because the probability $p_{ij}$, which is a proxy for the 'learned' similarity between $x_i$ and $x_j$, is resigned to a heuristic that averages the asymmetric conditional probabilities $p_{j|i}$ and $p_{i|j}$.

As a result, it is not possible to explain how the similarity depends on the data distribution, i.e., the data-dependency relationship cannot be established succinctly. This is not just a conceptual issue but also a practical one: it is unclear how the similarity of two points in high-dimensional space can be computed, even after all the local bandwidths of the Gaussian kernel have been determined.

Note that $p_{ij}$ does not reflect the resultant data-dependent similarity because the data-dependency characteristic of $p_{ij}$ cannot be ascertained.

This is troubling because the aim is purportedly based on the similarity which is represented by the conditional probability, i.e., to "find a low-dimensional data representation that minimises the mismatch between $p_{ij}$ and $q_{ij}$" (Maaten and Hinton, 2008). Yet the 'learned' similarity between two points in high-dimensional space is not defined and cannot be computed.

This fundamental issue underlines the two key issues in the procedure described in Section 1: (a) setting the reference probability has no clear basis without a well-defined similarity, despite the use of entropy; and (b) the need to set $n$ bandwidths for a dataset of $n$ points.

We show in the next section that using a well-defined data-dependent similarity called Isolation Kernel addresses the fundamental issue caused by the use of the Gaussian kernel, and thereby resolves the two key issues.

5 The proposed change: t-SNE with Isolation Kernel

Since t-SNE needs a data-dependent kernel, we propose to use a recent data-dependent kernel called Isolation Kernel (Ting et al., 2018; Qin et al., 2019) to replace the data independent Gaussian kernel in t-SNE.

Isolation Kernel is a perfect match for the task because a data-dependent kernel, by definition, adapts to the local distribution. The kernel replacement is made to the similarity in the high-dimensional space only, leaving the other components of the t-SNE procedure unchanged.

The pertinent details of Isolation Kernel (Ting et al., 2018; Qin et al., 2019) are provided below.

Let $D = \{x_1, \dots, x_n\}$ be a dataset sampled from an unknown probability density function $F$. Moreover, let $\mathbb{H}_\psi(D)$ denote the set of all partitionings $H$ that are admissible under the dataset $D$, where each $H$ covers the entire space of $\mathbb{R}^d$; and each of the $\psi$ isolating partitions $\theta \in H$ isolates one data point from the rest of the points in a random subset $\mathcal{D} \subset D$, with $|\mathcal{D}| = \psi$.

For any two points $x, y \in \mathbb{R}^d$, the Isolation Kernel of $x$ and $y$ wrt $D$ is defined to be the expectation, taken over the probability distribution on all partitionings $H \in \mathbb{H}_\psi(D)$, that both $x$ and $y$ fall into the same isolating partition $\theta \in H$:

$$K_\psi(x, y \mid D) = \mathbb{E}_{\mathbb{H}_\psi(D)}\big[\mathbb{1}(x, y \in \theta \mid \theta \in H)\big], \tag{3}$$

where $\mathbb{1}(\cdot)$ is an indicator function.

In practice, Isolation Kernel is constructed using a finite number of partitionings $H_i$, $i = 1, \dots, t$, where each $H_i$ is created using $\mathcal{D}_i \subset D$:

$$K_\psi(x, y) = \frac{1}{t} \sum_{i=1}^{t} \mathbb{1}(x, y \in \theta \mid \theta \in H_i), \tag{4}$$

where $K_\psi(x, y)$ is a shorthand for $K_\psi(x, y \mid D)$. $\psi$ is the sharpness parameter and the only parameter of Isolation Kernel.

As Equation (4) is quadratic, $K_\psi$ is a valid kernel.

The larger $\psi$ is, the sharper the kernel distribution. It plays a role similar to $\sigma$ in the Gaussian kernel, i.e., the smaller $\sigma$ is, the narrower the kernel distribution. The key difference is that Isolation Kernel adapts to the local density distribution, whereas the Gaussian kernel is independent of the data distribution.

The proposed t-SNE simply replaces $K_\sigma$ with $K_\psi$ in defining $p_{j|i}$, i.e.,

$$p_{j|i} = \frac{K_\psi(x_i, x_j)}{\sum_{k \neq i} K_\psi(x_i, x_k)}.$$

The rest of the procedure of t-SNE remains unchanged.
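As an illustration of this replacement, below is a short sketch under our own assumptions (it assumes an $n \times n$ Isolation Kernel similarity matrix has already been computed; a nearest-neighbour construction of that matrix is sketched in the Appendix):

```python
import numpy as np

def p_conditional_from_kernel(K):
    """p_{j|i} proportional to K(x_i, x_j), with p_{i|i} = 0.
    Unlike the Gaussian case, no per-point bandwidth search is required."""
    K = np.asarray(K, dtype=float).copy()
    np.fill_diagonal(K, 0.0)
    row_sums = K.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0            # guard for pathological rows
    return K / row_sums

def p_joint(P_cond):
    """Symmetrised p_ij = (p_{j|i} + p_{i|j}) / (2n), exactly as in standard t-SNE."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)
```

Only the construction of $p_{j|i}$ changes; the symmetrisation into $p_{ij}$ and the subsequent KL minimisation are exactly as in Algorithm 1.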

The procedure of t-SNE with Isolation Kernel is provided in Algorithm 2.

Note that the only difference between the two algorithms is the first two lines.

Input: Dataset $D = \{x_1, \dots, x_n\}$; sharpness parameter $\psi$ of Isolation Kernel
1: Build a space partitioning model (see Appendix for details) for Isolation Kernel
2: Compute $p_{j|i}$ based on Isolation Kernel
3: Set $p_{ij} = (p_{j|i} + p_{i|j}) / (2n)$
4: Compute the low-dimensional $y_i$ and $q_{ij}$ which minimise the KL divergence
5: Output the low-dimensional data representation $\{y_1, \dots, y_n\}$
Algorithm 2: IK-t-SNE

6 A principle of setting the reference probability

Here we provide a principled approach, through Isolation Kernel, which has the following well-defined characteristic: two points in a sparse region are more similar than two points of equal inter-point distance in a dense region (Ting et al., 2018).

Using a specific implementation of Isolation Kernel (see Appendix), Qin et al. (2019) have provided the following Lemma (see its proof in their paper):

Lemma 1 (Qin et al., 2019). For all $x, y$ in a sparse region and all $x', y'$ in a dense region, such that $\rho(x), \rho(y) < \rho(x'), \rho(y')$, the nearest-neighbour-induced Isolation Kernel has the characteristic that $\ell(x, y) = \ell(x', y')$ implies

$$K_\psi(x, y) > K_\psi(x', y'), \tag{5}$$

where $\ell(x, y)$ is the distance between $x$ and $y$; and $\rho(z)$ denotes the density at point $z$.

Let $P_x(y) \propto K(x, y)$ be the probability that $x$ would pick $y$ as its neighbour.

We provide two corollaries from Lemma 1 as follows.

Corollary 1. For all $x, y$ (sparse region) and all $x', y'$ (dense region) such that $\ell(x, y) = \ell(x', y')$, $x$ is more likely to pick $y$ as a neighbour than $x'$ is to pick $y'$ as a neighbour, i.e., $P_x(y) > P_{x'}(y')$.

This is because $x'$ in the dense region is more likely to pick a point closer than $y'$ as its neighbour, in comparison with $x$ picking $y$ as a neighbour in the sparse region, given that $\ell(x, y) = \ell(x', y')$.

Corollary 2. $P_x(y) \propto 1 / \rho(R_{xy})$, where $R_{xy}$ is a region in $\mathbb{R}^d$ covering $x$ and $y$; and $\rho(R)$ is the average density of a region $R$.

Using a data-dependent kernel with the well-defined characteristic specified in Lemma 1, we can establish that the probability that $x$ would pick $y$ as its neighbour, $P_x(y)$, is inversely proportional to the density of the local region.

This becomes the basis in setting a reference probability in high-dimensional space.

It is interesting to note that the statement $P_x(y) \propto K(x, y)$ also holds when $K$ is the Gaussian kernel in t-SNE. However, none of the statements in Corollaries 1 and 2 can be established. This is despite the fact that t-SNE does intend to adjust the bandwidth of the Gaussian kernel locally: the Gaussian kernel with local bandwidths, determined based on the entropy criterion, does not have a well-defined kernel characteristic that relates to local density.

Summary

In a nutshell, the fundamental issue of SNE and t-SNE is that the 'learned' similarity of two points in high-dimensional space is not defined, despite their use of a Gaussian kernel with a local bandwidth $\sigma_i$ centred at each point $x_i$. In fact, it is unclear how the similarity of two points can be computed, even after all the local bandwidths of the Gaussian kernel have been determined. In addition, setting the reference probability has no clear basis without a well-defined similarity, despite the use of entropy.

The use of Isolation Kernel in t-SNE brings about two key benefits: (a) improved visualisation quality; and (b) reduced runtime of step 1 of the t-SNE algorithm. The first benefit is a direct result of better data dependency. The second is because Isolation Kernel produces a truly single-global-parameter algorithm; this eliminates the need to tune $n$ bandwidths (internally). For a large dataset, it is infeasible to estimate such a large number of bandwidths with an appropriate degree of accuracy. We verify these two benefits in the empirical evaluation reported in the next section.

7 Empirical Evaluation

We present the evaluation measures used in the first subsection. The empirical evaluation results and the runtime comparison are provided in the next two subsections.

7.1 Evaluation measures

We used a qualitative assessment to evaluate the preservation of $K$-ary neighbourhoods (Lee and Verleysen, 2009; Lee et al., 2013, 2015), defined as follows:

$$Q_{NX}(K) = \frac{1}{KN} \sum_{i=1}^{N} \left| kNN_K(x_i) \cap kNN_K(y_i) \right|, \tag{6}$$

where $kNN_K(x_i)$ is the set of the $K$ nearest neighbours of $x_i$ in the high-dimensional (HD) space, $kNN_K(y_i)$ is the set of the $K$ nearest neighbours of $y_i$ in the low-dimensional (LD) space, and $y_i$ is the corresponding LD point of the HD point $x_i$.

$Q_{NX}(K)$ measures the $K$-ary neighbourhood agreement between the HD and corresponding LD spaces. $Q_{NX}(K) \in [0, 1]$; the higher the score, the better the neighbourhoods are preserved in the LD space. In our experiments, we recorded the assessment over a range of neighbourhood sizes and produced the curve of $Q_{NX}(K)$ vs $K$.

To aggregate the performance over different $K$-ary neighbourhoods, we calculate the area under the $Q_{NX}(K)$ curve in the log plot (Lee et al., 2013) as:

$$AUC_{\ln K}(Q_{NX}) = \frac{\sum_{K} Q_{NX}(K)/K}{\sum_{K} 1/K}. \tag{7}$$

The AUC assesses the average quality weighted by $K$, i.e., errors in large neighbourhoods (large $K$) contribute less to the average quality than errors in small neighbourhoods (small $K$).
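The following is a minimal sketch (ours) of $Q_{NX}(K)$ and the AUC above, assuming the standard definitions in Lee et al. (2013); scikit-learn's NearestNeighbors is used only for neighbour retrieval, and the choice of k_max is illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def qnx_curve(X_hd, Y_ld, k_max=100):
    """Q_NX(K) = (1/(K*N)) * sum_i |kNN_K(x_i) ∩ kNN_K(y_i)| for K = 1..k_max."""
    n = X_hd.shape[0]
    k_max = min(k_max, n - 1)
    # Retrieve k_max+1 neighbours, then drop column 0 (the query point itself).
    nn_hd = NearestNeighbors(n_neighbors=k_max + 1).fit(X_hd)
    idx_hd = nn_hd.kneighbors(X_hd, return_distance=False)[:, 1:]
    nn_ld = NearestNeighbors(n_neighbors=k_max + 1).fit(Y_ld)
    idx_ld = nn_ld.kneighbors(Y_ld, return_distance=False)[:, 1:]
    q = np.empty(k_max)
    for K in range(1, k_max + 1):
        shared = [len(set(idx_hd[i, :K]) & set(idx_ld[i, :K])) for i in range(n)]
        q[K - 1] = sum(shared) / (K * n)
    return q

def auc_log_k(q):
    """Area under the Q_NX(K) curve in the log plot: sum_K Q_NX(K)/K over sum_K 1/K."""
    K = np.arange(1, len(q) + 1)
    return float(np.sum(q / K) / np.sum(1.0 / K))
```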

In addition, the purpose of many methods of dimension reduction is to identify HD clusters in the LD space such as in a 2-dimensional scatter plot. Since all the datasets we used for evaluation have ground truth (labels), we can use measures for clustering validation to evaluate whether all clusters can be correctly identified after they are projected into the LD space. Here we select two popular indices of cluster validation, i.e., Davies-Bouldin (DB) index (Davies and Bouldin, 1979) and Calinski-Harabasz (CH) index (Caliński and Harabasz, 1974). Their details are given as follows.

Let $x$ be an instance in a cluster $C_i$, which has $n_i$ instances and centre $c_i$. The Davies-Bouldin (DB) index is obtained as

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\bar{d}_i + \bar{d}_j}{d(c_i, c_j)}, \qquad \bar{d}_i = \frac{1}{n_i} \sum_{x \in C_i} d(x, c_i), \tag{8}$$

where $d(\cdot, \cdot)$ is a distance function and $k$ is the number of clusters in the dataset.

The Calinski-Harabasz (CH) index is calculated as

$$CH = \frac{\sum_{i=1}^{k} n_i \, d(c_i, c)^2 / (k - 1)}{\sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i)^2 / (n - k)}, \tag{9}$$

where $c$ is the centre of the whole dataset.

Both measures take the similarity of points within a cluster and the similarity between clusters into consideration, but in different ways. These measures assign the best score to the algorithm that produces clusters with low intra-cluster distances and high inter-cluster distances. Note that the higher the CH score, the better the cluster distribution; while the lower the DB score, the better the cluster distribution.
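Both indices are available in scikit-learn, so a minimal usage sketch is shown below (ours; the dataset, perplexity, and normalisation choices are illustrative rather than the paper's exact protocol).

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score
from sklearn.preprocessing import MinMaxScaler

X, labels = load_iris(return_X_y=True)
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
Y = MinMaxScaler().fit_transform(Y)   # rescale the embedding to [0, 1] per attribute

# Lower DB and higher CH indicate compact, well-separated clusters.
print("DB:", davies_bouldin_score(Y, labels))
print("CH:", calinski_harabasz_score(Y, labels))
```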

7.2 Evaluation results

We used 20 real-world datasets with different data sizes and dimensions to evaluate the use of Isolation Kernel and Gaussian Kernel in t-SNE. [Footnote 4: COIL20, HumanActivity and Isolet are from Li et al. (2016); News20.binary and Rcv1 are from Chang and Lin (2011); and all other real-world datasets are from the UCI Machine Learning Repository (Dua and Graff, 2017).] All algorithms used in the experiments were implemented in Matlab 2018b and were run on a machine with eight cores (Intel Core i7-7820X 3.60GHz) and 32GB memory. All datasets were normalised using min-max normalisation so that each attribute lies in [0,1] before the experiments began. The same normalisation was applied to the projected datasets before calculating the CH and DB scores. We report the best performance of each algorithm found via a systematic parameter search over the ranges shown in Table 3. [Footnote 5: The original t-SNE paper says that "the performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50" (Maaten and Hinton, 2008).]

Kernel            Parameter with search range
Gaussian Kernel   perplexity
Isolation Kernel  sharpness parameter $\psi$
Table 3: Parameters and their search ranges for each kernel function.
Dataset  #Points  #Attr  AUC(GK)  AUC(IK)  DB(GK)  DB(IK)  CH(GK)  CH(IK)
Wine 178 13 0.65 0.67 0.52 0.43 625 853
Dermatology 358 34 0.68 0.684 0.47 0.40 3679 4532
ForestType 523 27 0.70 0.71 0.91 0.89 467 478
WDBC 569 30 0.64 0.67 0.70 0.58 821 1167
ILPD 579 9 0.67 0.69 4.30 3.71 21 28
Control 600 60 0.69 0.70 0.67 0.68 3847 6816
Pima 768 8 0.70 0.71 3.74 2.98 44 72
Parkinson 1040 26 0.70 0.74 8.40 6.35 13 22
Biodeg 1055 41 0.74 0.77 2.04 2.12 154 146
Mice 1080 83 0.79 0.82 0.32 0.18 8326 39085
Messidor 1151 19 0.71 0.74 8.72 6.79 14 22
Hill 1212 100 0.69 0.73 16.71 15.10 4 4.5
COIL20 1440 1024 0.75 0.79 1.66 2.67 2352 3730
HumanActivity 1492 561 0.78 0.79 2.87 2.86 1225 1631
Isolet 1560 617 0.79 0.81 1.83 1.41 1746 2812
Segment 2310 19 0.68 0.72 1.39 2.02 4052 5079
Spam 4601 57 0.67 0.70 1.46 1.33 1626 1874
News20.binary 9998 1355191 0.27 0.23 3.24 1.92 661 2320
Rcv1 10121 47236 0.68 0.66 1.80 1.43 2421 4221
Pendig 10992 16 0.69 0.693 1.14 1.10 6944 6777
Average 0.68 0.70 3.15 2.75 1952 4084
Table 4: Evaluation results on real-world datasets. For each dataset, the best performer, GK (Gaussian Kernel) or IK (Isolation Kernel) w.r.t. each evaluation measure is boldfaced.

Table 4 shows the results of the two kernels used in t-SNE. Isolation Kernel performs better on 18 of the 20 datasets in terms of AUC, which means that Isolation Kernel enables t-SNE to preserve the local neighbourhoods much better than the Gaussian kernel. With regard to cluster quality, Isolation Kernel performs better than the Gaussian kernel on 17 out of 20 datasets in terms of both DB and CH. Notice that when the Gaussian kernel is better, the performance gaps are usually small in all three measures. Overall, Isolation Kernel is better than Gaussian Kernel on 15 out of 20 datasets in all three measures; there are no datasets on which the reverse is true.

It is worth mentioning that the extremely high-dimensional datasets News20 and Rcv1 are very sparse: nearly 99% of the attribute values are zero. Although Isolation Kernel performed slightly worse in terms of AUC on these two datasets, it significantly improved the cluster structure obtained by the Gaussian kernel in terms of both DB and CH.

We compare the visualisation results on News20 and Rcv1, i.e., the two datasets having the highest numbers of attributes, in Table 5 and Table 6, respectively. It is interesting to note that t-SNE using Isolation Kernel with a small $\psi$ produces better visualisation results, with more separable clusters, than t-SNE using the Gaussian kernel with a high perplexity.

Table 5: Visualisation results of t-SNE on News20 using Gaussian Kernel (panels (a)-(c)) and Isolation Kernel (panels (d)-(f)).

Table 6: Visualisation results of t-SNE on Rcv1 using Gaussian Kernel (panels (a)-(c)) and Isolation Kernel (panels (d)-(f)).
Figure 1: (a) IK-t-SNE visualisation of the News20 dataset; (b) average number of non-zero HD attributes of points inside/outside a ball in the LD space, centred at the red point in panel (a).

Table 5(f) shows an interesting visualisation result which deserves further investigation. We selected the centre point in the LD space as a reference point, and computed the average number of non-zero HD attributes of all points inside/outside the LD ball centred at that reference point. Figure 1 shows the results. It is interesting to note that, for the points within (and also outside) the LD ball, the average number of non-zero HD attributes generally increases as the radius of the LD ball increases. t-SNE using Isolation Kernel produces a structure where points having a low number of non-zero attributes are clustered at the centre, and points away from the centre have an increasingly higher number of non-zero attributes. [Footnote 6: The attribute values in News20 are real values, which are different from the synthetic dataset used in Table 2.]
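A sketch (ours) of the diagnostic behind Figure 1: count the non-zero HD attributes of the points that fall inside and outside a growing LD ball around a chosen reference point; the function name and radii are illustrative.

```python
import numpy as np

def nonzero_profile(X_hd, Y_ld, centre_idx, radii):
    """Average number of non-zero HD attributes of points inside/outside an LD ball
    centred at Y_ld[centre_idx], for each radius in `radii`."""
    nnz = (X_hd != 0).sum(axis=1)                     # non-zero attributes per point
    dist = np.linalg.norm(Y_ld - Y_ld[centre_idx], axis=1)
    inside, outside = [], []
    for r in radii:
        mask = dist <= r
        inside.append(nnz[mask].mean() if mask.any() else np.nan)
        outside.append(nnz[~mask].mean() if (~mask).any() else np.nan)
    return np.array(inside), np.array(outside)
```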

7.3 Runtime comparison

The computational complexities of the two kernels used in t-SNE are shown in Table 7. Generally, both kernels have quadratic time and space complexities. However, the Gaussian kernel in the original t-SNE needs a large number of iterations to search for the optimal local bandwidth of each point.

Figure 2 presents two runtime comparisons of t-SNE with the two kernels on a synthetic dataset. Figure 2(a) shows that the Gaussian kernel is much slower than Isolation Kernel in the similarity calculation. This is mainly due to the search required to tune the $n$ bandwidths in step 1 of the algorithm. Figure 2(b) shows the runtimes of the mapping process in step 4 of Algorithms 1 and 2, which is the same for both algorithms; it is not surprising that the runtimes are about the same in this step, regardless of the kernel employed.

                   Time complexity   Space complexity
Gaussian Kernel    $O(e_1 n^2)$      $O(n^2)$
Isolation Kernel   $O(n^2)$          $O(n^2)$
t-SNE Mapping      $O(e_2 n^2)$      $O(n^2)$
Table 7: Time and space complexities of computing the similarity using different kernels in t-SNE, and of the t-SNE mapping. $e_1$ is the number of iterations used for the bandwidth search of the Gaussian kernel; and $e_2$ is the number of iterations of the t-SNE mapping, which is the same regardless of the kernel employed.
Figure 2: Runtime comparison of Gaussian Kernel and Isolation Kernel used in t-SNE on a 2-dimensional synthetic dataset. (a) Runtime for step 1; (b) mapping time.

8 Discussion

The proposed idea can be applied to variants of stochastic neighbour embedding, e.g., NeRV (Venna et al., 2010) and JSE (Lee et al., 2013) since they employ the same algorithm procedure as t-SNE. The only difference is the use of variants of the cost function, i.e., type 1 or type 2 mixture of KL divergences.

Recall that the first step of t-SNE may be interpreted as using kNN to determine the bandwidths of Gaussian Kernel. There are existing kNN based data-dependent kernels which adapt to local density, i.e.,

  1. kNN kernel (Marin et al., 2018).

    The kNN kernel is a binary function defined as:

    $$K_k(x_i, x_j) = \begin{cases} 1 & \text{if } x_j \in kNN(x_i), \\ 0 & \text{otherwise,} \end{cases} \tag{10}$$

    where $kNN(x_i)$ is the set of the $k$ nearest neighbours of $x_i$.

  2. Adaptive Gaussian Kernel (Zelnik-Manor and Perona, 2005).

    The distance of the $k$-th nearest neighbour has been used to set the bandwidth of the Gaussian kernel to make it adaptive to local density. This was proposed in spectral clustering as an adaptive means to adjust the similarity when performing dimensionality reduction before clustering.

    The Adaptive Gaussian Kernel is defined as:

    $$K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{\sigma_i \sigma_j}\right), \tag{11}$$

    where $\sigma_i$ is the distance between $x_i$ and $x_i$'s $k$-th nearest neighbour. (Minimal sketches of both kernels follow this list.)
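The following is a minimal sketch (ours) of the two baseline kernels just described, with scikit-learn used only for the neighbour queries; the parameter defaults are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_kernel(X, k=10):
    """Binary kNN kernel of Eq. (10): 1 iff x_j is among the k nearest neighbours of x_i."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]   # drop the point itself
    n = X.shape[0]
    K = np.zeros((n, n))
    K[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return K

def adaptive_gaussian_kernel(X, k=7):
    """Self-tuning kernel of Eq. (11): exp(-||x_i - x_j||^2 / (sigma_i * sigma_j)),
    where sigma_i is the distance from x_i to its k-th nearest neighbour."""
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    sigma = dist[:, -1]                       # k-th NN distance (column 0 is the point itself)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (np.outer(sigma, sigma) + 1e-12))
```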

However, replacing the Gaussian kernel in t-SNE with either of these kernels produces poor outcomes. For example, on the Segment and Spam datasets, the Adaptive Gaussian Kernel produced AUC scores of 0.35 and 0.22, respectively; and the kNN kernel yielded AUC scores of 0.38 and 0.28, respectively. These are significantly poorer than the scores produced using the Gaussian Kernel or Isolation Kernel shown in Table 4. We postulate that this is because a global $k$ is unable to make these kernels sufficiently adaptive to the local distribution.

9 Conclusions

This paper identifies a fundamental issue in all algorithms which are direct descendants of SNE, i.e., the 'learned' similarity between any two points in high-dimensional space is not defined and cannot be computed. The root cause of this issue is the use of a data-independent kernel to produce a data-dependent kernel implicitly.

Like many problems, once the root cause of the fundamental issue is identified, the solution is simple. In the case of t-SNE, we show that the fundamental issue can be addressed by simply replacing the Gaussian kernel with a data-dependent kernel called Isolation Kernel which has a well-defined characteristic.

We show that this significantly improves the quality of the final visualisation output of t-SNE; and it removes one obstacle that prevents t-SNE from processing large datasets.

Appendix: The nearest neighbour implementation of Isolation Kernel

We use an existing nearest-neighbour method to implement Isolation Kernel (Qin et al., 2019). It produces each model $H_i$ (a Voronoi diagram), which consists of $\psi$ isolating partitions, given a subsample $\mathcal{D}_i$ of $\psi$ points. Each isolating partition or Voronoi cell isolates one data point from the rest of the points in the subsample. The point which determines a cell is called the cell centre. The Voronoi cell centred at $z \in \mathcal{D}_i$ is given as:

$$\theta_z = \{x \in \mathbb{R}^d \mid \ell(x, z) \le \ell(x, z') \ \ \forall z' \in \mathcal{D}_i\},$$

where $\ell$ is a distance function; we use the Euclidean distance in this paper.

Figure 3 compares the contours of Isolation Kernel on two different data distributions. It shows that Isolation Kernel adapts to the local density. Under the uniform data distribution in Figure 3(a), Isolation Kernel's contour is symmetric with respect to the reference point at (0.5, 0.5). In Figure 3(b), however, the contour shows that, for points having equal inter-point distance from the reference point at (0.5, 0.5), points in the sparse region are more similar to the reference point than points in the dense region.
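Below is a minimal sketch (ours, following the description above) of the nearest-neighbour (Voronoi) construction of Isolation Kernel; the defaults $\psi = 16$ and $t = 200$ and the toy data are illustrative only.

```python
import numpy as np

def isolation_kernel_nn(X, psi=16, t=200, seed=None):
    """Nearest-neighbour (Voronoi) Isolation Kernel: for each of t partitionings,
    sample psi points as cell centres and assign every point to its nearest centre;
    the similarity of a pair is the fraction of partitionings where they share a cell."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(t):
        centres = X[rng.choice(n, size=psi, replace=False)]
        # Euclidean distance of every point to every sampled cell centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        cell = d.argmin(axis=1)                    # Voronoi cell membership
        K += cell[:, None] == cell[None, :]        # both in the same cell -> count 1
    return K / t

# Points in the sparse cluster end up more similar, at equal inter-point distance,
# than points in the dense cluster, reflecting the characteristic in Lemma 1.
X = np.vstack([np.random.normal(0.0, 0.1, (200, 2)),   # dense cluster
               np.random.normal(3.0, 1.0, (200, 2))])  # sparse cluster
K = isolation_kernel_nn(X, psi=16, t=100, seed=0)
```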

While this implementation of Isolation Kernel produces a contour similar to that of an exponential kernel under a uniform density distribution, different implementations have different contours. For example, using axis-parallel partitionings to implement Isolation Kernel produces a contour (with a diamond shape) which is more akin to that of the Laplacian kernel under a uniform density distribution (Ting et al., 2018). Of course, both the exponential and Laplacian kernels, like the Gaussian kernel, are data independent.

Figure 3: Contours of Isolation Kernel with reference to the point (0.5, 0.5) on 2-dimensional datasets: (a) uniform density distribution; (b) Parkinson dataset (12th vs 21st attributes).

References

  • Arora et al. (2018) Sanjeev Arora, Wei Hu, and Pravesh K Kothari. An analysis of the t-SNE algorithm for data visualization. arXiv preprint arXiv:1803.01768, 2018.
  • Caliński and Harabasz (1974) Tadeusz Caliński and Jerzy Harabasz. A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1):1–27, 1974.
  • Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • Cook et al. (2007) James Cook, Ilya Sutskever, Andriy Mnih, and Geoffrey Hinton. Visualizing similarity data with a mixture of maps. In Artificial Intelligence and Statistics, pages 67–74, 2007.
  • Davies and Bouldin (1979) David L Davies and Donald W Bouldin. A cluster separation measure. IEEE transactions on pattern analysis and machine intelligence, (2):224–227, 1979.
  • Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Hinton and Roweis (2003) Geoffrey E Hinton and Sam T Roweis. Stochastic neighbor embedding. In Advances in neural information processing systems, pages 857–864, 2003.
  • Lee and Verleysen (2009) John A Lee and Michel Verleysen. Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72(7-9):1431–1443, 2009.
  • Lee et al. (2013) John A Lee, Emilie Renard, Guillaume Bernard, Pierre Dupont, and Michel Verleysen. Type 1 and 2 mixtures of kullback–leibler divergences as cost functions in dimensionality reduction based on similarity preservation. Neurocomputing, 112:92–108, 2013.
  • Lee et al. (2015) John A Lee, Diego H Peluffo-Ordóñez, and Michel Verleysen. Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure. Neurocomputing, 169:246–261, 2015.
  • Li et al. (2016) Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Trevino Robert, Jiliang Tang, and Huan Liu. Feature selection: A data perspective. arXiv:1601.07996, 2016.
  • Linderman and Steinerberger (2017) George C Linderman and Stefan Steinerberger. Clustering with t-SNE, provably. arXiv preprint arXiv:1706.02582, 2017.
  • Linderman et al. (2019) George C Linderman, Manas Rachh, Jeremy G Hoskins, Stefan Steinerberger, and Yuval Kluger. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods, 16(3):243, 2019.
  • Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • Marin et al. (2018) D. Marin, M. Tang, I. Ben Ayed, and Y. Y. Boykov. Kernel clustering: density biases and solutions. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2018. ISSN 0162-8828. doi: 10.1109/TPAMI.2017.2780166.
  • Pezzotti et al. (2016) Nicola Pezzotti, Boudewijn PF Lelieveldt, Laurens van der Maaten, Thomas Höllt, Elmar Eisemann, and Anna Vilanova. Approximated and user steerable tsne for progressive visual analytics. IEEE transactions on visualization and computer graphics, 23(7):1739–1752, 2016.
  • Qin et al. (2019) Xiaoyu Qin, Kai Ming Ting, Ye Zhu, and CS Vincent Lee. Nearest-neighbour-induced isolation similarity and its impact on density-based clustering. In Thirty-third AAAI Conference on Artificial Intelligence, 2019.
  • Shaham and Steinerberger (2017) Uri Shaham and Stefan Steinerberger. Stochastic neighbor embedding separates well-separated clusters. arXiv preprint arXiv:1702.02670, 2017.
  • Ting et al. (2018) Kai Ming Ting, Yue Zhu, and Zhi-Hua Zhou. Isolation kernel and its effect on SVM. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2329–2337. ACM, 2018.
  • Van Der Maaten (2014) Laurens Van Der Maaten. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.
  • Van Der Maaten and Weinberger (2012) Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In 2012 IEEE International Workshop on Machine Learning for Signal Processing, pages 1–6. IEEE, 2012.
  • Venna et al. (2010) Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, and Samuel Kaski. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11(Feb):451–490, 2010.
  • Yang et al. (2009) Zhirong Yang, Irwin King, Zenglin Xu, and Erkki Oja. Heavy-tailed symmetric stochastic neighbor embedding. In Advances in neural information processing systems, pages 2169–2177, 2009.
  • Zelnik-Manor and Perona (2005) Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In Advances in neural information processing systems, pages 1601–1608, 2005.