2-D Embedding of Large and High-dimensional Data with Minimal Memory and Computational Time Requirements

02/04/2019, by Witold Dzwinel et al. (AGH, Dalhousie University)

In the era of big data, interactive visualization of large datasets consisting of M ~ 10^5+ high-dimensional feature vectors of length N (N ~ 10^3+) is an indispensable tool for exploratory data analysis. The state-of-the-art methods of data embedding (DE) of N-D data into a 2-D (3-D) visually perceptible space (e.g., those based on the t-SNE concept) are too demanding computationally to be efficiently employed for interactive data analytics of large and high-dimensional datasets. Herein we present a simple method, ivhd (interactive visualization of high-dimensional data tool), which radically outperforms modern data-embedding algorithms in both computational and memory loads, while retaining high quality of N-D data embedding in 2-D (3-D). We show that the DE problem is equivalent to the visualization of a nearest neighbor (nn) graph, where only the indices of a few nearest neighbors of each data sample have to be known, and a binary distance between data samples -- 0 to the nearest and 1 to the other samples -- is defined. These improvements reduce the time complexity and memory load from O(M log M) to O(M), and ensure a minimal proportionality coefficient of the O(M) term as well. We demonstrate the high efficiency, quality and robustness of ivhd on popular benchmark datasets such as MNIST, 20NG, NORB and RCV1.

1 Introduction

In the age of data science, interactive visualization of large high-dimensional datasets is an essential tool in exploratory data analysis and knowledge extraction. It allows for better insight into the data topology and its interactive exploration by direct manipulation of the whole dataset or of a fragment of it. For example, one can remove irrelevant data samples, identify outliers, or zoom in on a particular fragment of the data. By changing visualization modes (e.g., the form of the cost function), the multi-scale structure of the data can be investigated. Summarizing, interactive visualization allows for:

  1. formulation and instant verification of a number of hypotheses;

  2. precise matching of data mining tools to the properties of data investigated;

  3. optimization of their parameters.

As the famous Visual Analytics Mantra by Keim et al. [17] says: ‘analyze first, show the important, zoom, filter and analyze further, details on demand’. Herein, we focus on the application of data embedding (DE) methods [34] to interactive visualization of large (M ~ 10^5+) high-dimensional datasets.

Data embedding (or, shortly, embedding) is defined as a transformation Y -> X of an N-dimensional (N-D) dataset Y = {y_i}, i = 1, ..., M, into its n-dimensional ‘shadow’ X = {x_i}, where n << N and M is the number of samples. The mapping should be understood as a lossy compression of the N-dimensional dataset into an n-dimensional space. The mapping is realized by minimization of a cost function E(Y, X), where E is a measure of topological dissimilarity between Y and X. Particularly, E should be sensitive to the multiscale cluster structure of Y. Due to the complexity of the low-dimensional manifold (immersed in the N-D feature space) on which the data samples Y are located, perfect embedding of Y in n dimensions, with zero compression error, is possible only for trivial cases. In fact, the embedding task is a simplification of a more generic problem, in which a given structure from a source space is mapped onto a corresponding structure in a target space, assuming the smallest reconstruction error.

In the context of N-dimensional data embedding in a visually perceptible Euclidean space, we assume here that n = 2, having in mind that 3-D data visualization can be realized in a similar way. As shown in a large body of papers (see, e.g., [34, 32, 12]), data visualization (DV) of large high-dimensional datasets (large M and N) that is both sufficiently precise in reconstructing the N-D data topology and computationally affordable is an algorithmic challenge. To preserve the topological properties of Y, both the classical MDS (multidimensional scaling) methods and the state-of-the-art clones of the stochastic neighbor embedding (SNE) concept [14] require computing and storing all O(M^2) distances between data samples, both for the source dataset Y and for its 2-D embedding X. This makes the most precise state-of-the-art visualization algorithms, such as t-SNE [21] and its clones, suffer from O(M^2) time and memory complexity. The time complexity can be decreased to O(M log M) by using approximate versions of t-SNE, such as bh-SNE [31] and its other variants and simplifications [35, 15, 26, 28, 20]. It is worth mentioning here that, in fact, the time complexity of DE is greater and depends also on the algorithm used for the cost function minimization, i.e., on the number of learning cycles required to reach the global minimum of the cost function. In the rest of this paper – as in all works on DE – by time complexity we mean the complexity of a single learning cycle performed on the full dataset. Summing up, because interactive and visual data exploration imposes very strict time performance regimes, while the state-of-the-art DE algorithms involve sophisticated data structures and a high memory load, they can be too slow for visual exploration of data consisting of M ~ 10^5+ samples.

In this paper we propose a novel and very fast method for visualization of high-dimensional data - the interactive visualization of high-dimensional data technique (ivhd) - with linear time and memory complexity. It outperforms the state-of-the-art DE algorithms in computational speed while retaining high quality of data embedding. To this end, we make the following contributions:

  1. We demonstrate that only a small fraction of the distances (proximities) between N-D data samples is required – just a few nearest neighbor and random distances per sample – to obtain a high quality 2-D embedding.

  2. Moreover, we show that these distances can be binary, i.e., equal to 0 for the nearest and 1 for the random neighbors of each sample y_i. Then, instead of using floating point distances, one needs to find only the indices of a few (nn) nearest neighbors of each data sample y_i. (Here, instead of the index k usually denoting the number of nearest neighbors, we use nn to be consistent with the rest of the notation in this paper.) Consequently, the memory load shrinks from O(M^2) floating point numbers (two full distance arrays D and d) to barely O(nn·M) integers, where nn << M (a back-of-the-envelope comparison is sketched below, after this list).

  3. This way, we also reduce the overall time complexity of the DE method to O(M).

  4. We clearly show that data embedding (DE) is equivalent to the visualization of the nn-graph built for the input data.
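To make the memory argument from point 2 concrete, the following back-of-the-envelope sketch compares the storage of two full distance matrices with the storage of the nn-graph indices. It is an illustration only; 4-byte floats and integers, an MNIST-sized M and a small nn are assumed here and are not prescribed by the method itself.

```cpp
// Back-of-the-envelope memory comparison: full distance matrices vs. nn-graph indices.
// Illustrative sketch only; 4-byte floats/ints and the dataset size are assumptions.
#include <cstdio>

int main() {
    const double M  = 70000.0;  // number of samples (MNIST-sized)
    const double nn = 2.0;      // nearest-neighbor indices stored per sample

    // Two full distance arrays (source D and target d), upper triangles only.
    const double full_floats = 2.0 * M * (M - 1.0) / 2.0;
    // ivhd input: only nn neighbor indices per sample.
    const double graph_ints  = M * nn;

    std::printf("full matrices: %.1f GB\n", full_floats * 4.0 / 1e9);
    std::printf("nn-graph     : %.1f MB\n", graph_ints  * 4.0 / 1e6);
    std::printf("ratio        : %.0fx\n", full_floats / graph_ints);
    return 0;
}
```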

Summarizing, the principal contribution of this paper is an essential improvement of the time/memory complexity of data embedding (with only a minor deterioration of embedding quality), which allows for interactive data manipulation. The method preserves very well the coarse grained cluster structure of the original N-D data and enables interactive visualization of very large datasets on a laptop. Consequently, our improvements open a new perspective on visual exploration of much bigger data on more powerful computer systems.

In the following section we present the details of our approach, while in the rest of the paper we discuss its advantages and disadvantages in the context of both the state-of-the-art DE methods and interactive visualization of truly large datasets. In support of our concept, we present the results of 2-D embeddings of popular benchmark datasets such as small NORB, MNIST, twenty newsgroups (20NG) and RCV1. We summarize our findings in the Conclusions section.

2 IVHD data embedding

Our method represents a radical simplification of the force-directed implementation of multidimensional scaling (MDS) mapping [6, 23, 24, 25]. Because it contrasts with the modern SNE concept exploited in the state-of-the-art DE papers [14, 21, 31, 35, 15, 26, 28], we first briefly present these two approaches.

2.1 Background

We assume that each N-D feature vector y_i is represented in 2-D Euclidean space by a corresponding point x_i, i = 1, ..., M. We define two distance matrices: D = [D_ij] and d = [d_ij], in the source and target spaces, respectively. The value of D_ij is a measure of dissimilarity (proximity) between feature vectors y_i and y_j, while in 2-D Euclidean space d_ij = ||x_i - x_j||. In general, the source space does not have to be a Cartesian one, and it can be defined solely by a proximity matrix between any two data objects of a (possibly) different data representation than the vector one (e.g., shapes, graphs, etc.).

The classical multidimensional scaling (MDS), which performs the mapping Y -> X, consists in minimization of the cost (stress) function

E(\|D\|, \|d\|) = \sum_{i<j} w_{ij} \left| D_{ij}^{a} - d_{ij}^{a} \right|^{b}    (1)

which represents the error between the dissimilarities D_ij and the corresponding distances d_ij, where w_ij are weights and a, b are parameters. Herein, we assume that a = 1 and b = 2. However, there are many other forms of this stress function, defined, e.g., in [34, 32, 12]; one can easily adapt our approach to a particular choice of the (a, b) parameters. Anyway, finding the global minimum of the multidimensional and multimodal cost function (1) with respect to X is not a trivial problem. To this end, we use the force-directed approach presented in [6, 23, 24, 25].
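For concreteness, the sketch below evaluates the stress of Eq. (1) for a given 2-D configuration, with unit weights and the (a, b) = (1, 2) choice adopted above; it is an illustration of the formula, not the authors' implementation.

```cpp
// Illustrative evaluation of the MDS stress (Eq. 1) with w_ij = 1, a = 1, b = 2.
// D holds the source dissimilarities D_ij; X holds the current 2-D embedding.
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Point2 = std::array<double, 2>;

double mdsStress(const std::vector<std::vector<double>>& D,
                 const std::vector<Point2>& X) {
    const std::size_t M = X.size();
    double E = 0.0;
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = i + 1; j < M; ++j) {
            const double dx = X[i][0] - X[j][0];
            const double dy = X[i][1] - X[j][1];
            const double dij = std::sqrt(dx * dx + dy * dy);  // target distance d_ij
            const double r = D[i][j] - dij;                   // residual D_ij - d_ij
            E += r * r;                                       // unit weight, squared error
        }
    }
    return E;
}
```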

We assume that the set of points X = {x_i} is treated as an ensemble of M interacting particles. The particles evolve in 2-D space with discrete time scaled by the timestep Δt, according to the Newtonian equations of motion discretized by using the leapfrog numerical scheme. Here we use their simplified form [9]:

v_i(t + \Delta t) = \lambda\, v_i(t) + \Delta t\, f_i(t)    (2)
f_i = \sum_{j \neq i} w_{ij} \left( D_{ij} - d_{ij} \right) \frac{x_i - x_j}{d_{ij}}    (3)
x_i(t + \Delta t) = x_i(t) + \Delta t\, v_i(t + \Delta t)    (4)

which resembles the well known momentum minimization method; the force f_i in Eq. (3) is proportional to the negative gradient of the stress (1) with respect to x_i. The value of λ represents friction, and it is equal to 1 in the absence of energy dissipation. Meanwhile, the weights and exponents of the stress function (1) parameterize the interparticle forces. The proper balance of the friction λ and the timestep Δt is crucial for the convergence speed of the particle system to a stable and good (close to the global one) minimum of the stress function (1). We demonstrated in [6, 23, 24, 25] that this formulation of the classical MDS algorithm produces acceptable embeddings for low-dimensional (small N) and rather small (small M) datasets in sublinear time complexity, i.e., similar to the widely used stochastic gradient descent (SGD) algorithm and its clones employed, e.g., in the original t-SNE implementation [21]. The principal problems with MDS for high-dimensional data are both the ‘curse of dimensionality’ effect, which produces bad quality embeddings, and the high computational and storage complexity for large M.
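A minimal sketch of one update step of the particle system in the spirit of Eqs. (2)-(4) follows. The forces are assumed to have been computed beforehand (e.g., as in Eq. (3) or Eq. (10)); lambda is the friction coefficient and dt the timestep. This is only an illustration of the scheme, not the authors' implementation.

```cpp
// One momentum-like (leapfrog-style) update step for the particle ensemble, Eqs. (2)-(4).
// Forces f_i are assumed precomputed; lambda is friction, dt the timestep.
#include <array>
#include <cstddef>
#include <vector>

using Vec2 = std::array<double, 2>;

void integrateStep(std::vector<Vec2>& x,        // particle (sample) positions in 2-D
                   std::vector<Vec2>& v,        // particle velocities
                   const std::vector<Vec2>& f,  // forces acting on the particles
                   double lambda, double dt) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        v[i][0] = lambda * v[i][0] + dt * f[i][0];  // Eq. (2): damped velocity update
        v[i][1] = lambda * v[i][1] + dt * f[i][1];
        x[i][0] += dt * v[i][0];                    // Eq. (4): position update
        x[i][1] += dt * v[i][1];
    }
}
```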

As shown in [2], under a broad set of conditions, as dimensionality increases, the distance to the nearest feature vector approaches the distance to the furthest one for as few as 10-15 dimensions. Consequently, this low distance contrast causes MDS to fail completely in mapping high-dimensional data to 2-D space, producing a meaningless sphere of 2-D points for most high-dimensional datasets. As shown in [14, 21], this adverse effect can be radically mitigated by changing the definition of the proximity measure and of the cost function. The group of methods based on stochastic neighbor embedding, like the most popular t-SNE [21], defines the similarity of two samples i and j (in the source and target spaces) in terms of the probabilities (p_ij and q_ij, respectively) that i would pick j as its neighbor and vice versa. Then the cost function is defined as the K-L divergence:

C = KL(P \| Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}    (5)

where, for the t-SNE algorithm, p_{j|i} is approximated by a Gaussian with variance σ_i centered at y_i, while q_ij is defined by the heavy-tailed Cauchy distribution in the target space, to avoid both the crowding and the optimization problems of the older SNE method. The respective probabilities are defined as follows:

p_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2 / 2\sigma_i^2\right)}, \qquad q_{ij} = \frac{\left(1 + \lVert x_i - x_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert x_k - x_l \rVert^2\right)^{-1}}    (6)

As demonstrated in [21], t-SNE outperforms other data reduction techniques, producing very precise embeddings for all kinds of datasets, including high-dimensional ones. However, unlike in the classical MDS presented above, in t-SNE (and its clones) we cannot use the force-directed approach for minimization, due to the partition function in the denominators of Eqs. (6). The gradient descent (GD) methods (such as stochastic GD or asynchronous stochastic GD) more often get stuck in local minima, which makes the quality of the embedding very sensitive to the choice of parameters [28]. Moreover, similarly to MDS, at least O(M^2) distances in D, and the same number of corresponding distances in d, have to be calculated. While D is computed only once at the beginning of the simulation, the matrix d has to be recomputed in every timestep. Moreover, the two distance matrices have to be kept in operational memory. This results in prohibitively high O(M^2) time complexity (with respect to a single training cycle of the minimization method applied) and O(M^2) storage. This excludes the two methods as acceptable candidates for the embedding engine of interactive visualization of large datasets. Moreover, in terms of parallel computations on multicore CPUs, the maximal number of samples which can be embedded scales, at most, with the square root of the number of CPU (or GPU) threads involved in the computations [24, 25]. Anyway, the quadratic memory load remains the key problem and puts an upper limit on M. The application of the Barnes-Hut approximation in both the MDS and t-SNE (bh-SNE) algorithms [31] can reduce their computational complexity to O(M log M), but at the cost of increased algorithmic complexity and, consequently, decreased parallelization efficiency. Despite the reduction of the number of distance computations in bh-SNE, the method still requires the full distance table with O(M^2) floating point numbers. Do we really need so many distances to embed N-D data in 2-D Euclidean space?
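To make the role of the partition function explicit, the sketch below evaluates the Cauchy similarities q_ij of Eq. (6) and the K-L cost of Eq. (5) for a given 2-D configuration, in a naive O(M^2) fashion. The p_ij matrix is assumed to be precomputed from the Gaussian kernel in the source space; the code is illustrative and not taken from any of the discussed implementations.

```cpp
// Naive O(M^2) evaluation of the t-SNE similarities q_ij (Eq. 6, Cauchy kernel)
// and the K-L cost (Eq. 5). p_ij is assumed precomputed and normalized.
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Point2 = std::array<double, 2>;

double klCost(const std::vector<std::vector<double>>& p,  // source similarities p_ij
              const std::vector<Point2>& X) {             // current 2-D embedding
    const std::size_t M = X.size();
    std::vector<std::vector<double>> q(M, std::vector<double>(M, 0.0));
    double Z = 0.0;                                        // global partition function
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < M; ++j) {
            if (i == j) continue;
            const double dx = X[i][0] - X[j][0], dy = X[i][1] - X[j][1];
            q[i][j] = 1.0 / (1.0 + dx * dx + dy * dy);     // unnormalized Cauchy kernel
            Z += q[i][j];
        }
    double C = 0.0;
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < M; ++j)
            if (i != j && p[i][j] > 0.0)
                C += p[i][j] * std::log(p[i][j] / (q[i][j] / Z));  // KL(P || Q)
    return C;
}
```

The global normalization by Z couples all pairs of points, which is why a simple pairwise force-directed minimization cannot be applied directly to this cost.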

2.2 Key concept

The linear time and memory complexity of data embedding can be achieved by using only a limited number of distances from D and corresponding distances from d (as in [6, 23, 24, 25, 16]). In the context of the so-called theory of structural rigidity [30], not all distances between the samples of an n-D point configuration are needed to retain the rigidity of its shape in n-D space. The term ‘rigidity’ can be understood as the property of a structure made of rods (distances) and joints (data vectors) that it does not bend or flex under an applied force. Therefore, to ensure the rigidity of Y (and of its 2-D embedding X), only a fraction of the distances (rods) from D (and d) is required. What is the minimal number of distances for which both the original and the target datasets remain rigid?

As shown in [30, 3], a minimally rigid structure, which consists of M joints (vectors) in n-dimensional space, requires at least

K_{\min} = n M - \frac{n(n+1)}{2}    (7)

rigid rods (distances). It means that, in the source space, a number of distances of the order of N·M can ensure lossless reconstruction of the data structure in N-D. Particularly, in 2-D the structural rigidity can be preserved with 2M - 3 distances. However, Eq. (7) defines only the lower bound of the number of distances required to retain structural rigidity. Therefore, herein we try to answer the following questions:

  1. What is the minimal number of distances in D that should be known to obtain a reasonable reconstruction of Y in X while, simultaneously, preserving the structural rigidity of X? Is it closer to N·M or rather to 2·M?

  2. Which distances should be retained?

In general, D is not necessarily a Euclidean distance matrix. It may represent the proximity of samples in an abstract space. Particularly, the samples can occupy a complicated n-D manifold for which n << N (see Figure 1). Then one can assume that only the distances of each sample to its nearest neighbors are Euclidean. As shown in [29, 9], the nearest neighbor graph (nn-graph), in which vertices represent data samples while edges represent the connections to their nearest neighbors, can be treated as an approximation of this low-dimensional manifold immersed in the high-dimensional feature space (see Figure 1). In the CDA and Isomap [34, 29] DE algorithms, the proximities of more distant graph vertices (samples) are calculated as the lengths of the shortest paths between them in the respective nn-graph. However, though in this way we can more precisely approximate the real distances on the low-dimensional manifold, the full distance matrix D has to be calculated by the very time consuming Dijkstra’s (or Floyd-Warshall [11]) algorithm and, what is even more demanding, it has to be stored in operational memory. In fact, the calculation of the full D is not necessary, and the problem of data embedding can be replaced by the congruent problem of nearest neighbor graph (nn-graph) visualization.

The methodology of graph visualization (GV) and that of DE share many common concepts. For example, the metaphor of a particle system and the force-directed method, used for cost function minimization, were introduced independently in GV and DE [6, 13]. Summing up, as shown in Figure 1, the first step of our DE method (ivhd) consists in the construction of the nn-graph, which approximates the n-dimensional non-Cartesian manifold immersed in the N-D feature space. Then we can use the fast procedure of graph visualization in 2-D space presented in our recent work [9]. In the rest of this paper we demonstrate that the locations of the graph vertices in 2-D space represent a good quality embedding.

Figure 1: The idea of data embedding by means of the nearest neighbor graph.

Let us assume that G(V, E) is the nearest neighbor nn-graph constructed for the high-dimensional dataset Y. The data samples y_i correspond to the graph vertices v_i ∈ V, while E is the set of edges connecting each v_i with its nn nearest neighbors. Because we have assumed (see Figure 1) that this graph is an approximation of the n-dimensional manifold, we presume that its topology preserving embedding in 2-D Euclidean space will produce results similar to those of DE algorithms. We adopt the definition formulated in [27]: ‘topology is preserved if a connectivity algorithm, such as k-nearest neighbors, can easily recover the edges of the input graph from only the coordinates of the nodes after embedding’. It means that, unlike in DE algorithms, which tend to retain the ordering of the neighbors, the ordering of the nearest neighbors (usually a few of them) is irrelevant in this case. In fact, the requirement of preserving the order of a small number of nearest neighbors of each y_i, which are subject to uncertainty and measurement errors, is secondary in terms of large data visualization. The crucial problem for formulating ad hoc hypotheses, and for deciding about the use of particular machine learning tools in further data analytics, is revealing the data topology, i.e., its multi-scale cluster structure.

Because we are not interested in the ordering of the nearest neighbors of each y_i, to get a good approximation of the data manifold and for the sake of simplicity, the distances between connected vertices of the nn-graph should all be the same and as close to 0 as possible. Conversely, the unconnected vertices of G have to be more distant, to increase the contrast between the two types of distances. It is obvious that nn-graphs (for small nn) are largely unconnected and not rigid. Therefore, the graph has to be augmented by additional edges. Here, we assume that for each y_i (i = 1, ..., M), apart from the edges to its nn nearest neighbors, rn additional randomly selected neighbors are connected. The value of rn should ensure the rigidity of this augmented graph. Thus, O_nn(i) and O_rn(i) are the sets of indices of the nearest (connected) neighbors and of the (unconnected) random neighbors of a feature vector y_i in Y, respectively. We also assume that O_nn(i) and O_rn(i) are disjoint. However, this assumption is superfluous, because the probability of picking a nearest neighbor as a random neighbor is negligibly low for large M. In fact, the random neighbors are very distant from y_i. Therefore, having in mind the effect of the ‘curse of dimensionality’, we can radically simplify the data similarity to a binary one, {‘nearest’ (0), ‘random’ (1)}, i.e.:

D_{ij} = \begin{cases} 0, & j \in O_{nn}(i) \\ 1, & j \in O_{rn}(i) \end{cases}    (8)

This way, instead of the matrix D, we have as input data only the lists of the nn nearest neighbors of each y_i (the nn-graph edges). In the rest of this paper, we refer to the respective distances from Eq. (8) shortly as the nn and rn distances (i.e., equal to 0 and 1, respectively).

Summarizing, we can presume that the identification of nn nearest neighbors and rn random neighbors for each y_i, where nn + rn is small, is sufficient to assure the rigidity of the augmented graph. This is rather obvious, because it is well known [4] that any graph can be embedded, preserving its topology (in the context of the definition presented above), in at most 3-D Euclidean space, while planar graphs can be embedded in 2-D. In most cases, nn-graphs for small nn and rn are planar, though for larger nn and rn the probability of non-planarity rapidly increases [33]. Thus, the total number of connections required to make the augmented nn-graph rigid is of the order of M. In fact, we need to store only the indices of the nn nearest neighbors of every y_i, while the indices of the random neighbors can be generated ad hoc during the embedding process (see the sketch below).
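The sketch below illustrates this data structure under the above assumptions: only the M × nn table of nearest-neighbor indices is kept, and the rn random neighbors of a sample are drawn on the fly when its forces are evaluated. It is an illustration only; the structure and variable names are ours, not the authors'.

```cpp
// Augmented nn-graph used as the only input of the embedding: stored nn indices
// plus rn random neighbors drawn on demand. Illustrative sketch, not the authors' code.
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct AugmentedGraph {
    std::vector<std::vector<std::uint32_t>> nearest;  // nn nearest-neighbor indices per sample
    int rn = 1;                                       // random neighbors drawn per sample
};

// Draw rn random neighbors of sample i (binary distance 1). Collisions with the true
// nearest neighbors are not checked: their probability is negligible for large M.
std::vector<std::uint32_t> drawRandomNeighbors(const AugmentedGraph& g,
                                               std::uint32_t i, std::mt19937& rng) {
    std::uniform_int_distribution<std::uint32_t> pick(
        0, static_cast<std::uint32_t>(g.nearest.size()) - 1);
    std::vector<std::uint32_t> r;
    while (r.size() < static_cast<std::size_t>(g.rn)) {
        const std::uint32_t j = pick(rng);
        if (j != i) r.push_back(j);
    }
    return r;
}
```

Only the nearest table is persisted, which keeps the memory footprint at M·nn integers.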

Additionally, we presume that by imposing a high contrast on these two types of distances, we will be able to preserve in X the multi-scale cluster structure of Y by employing the MDS mapping only on those binary distances. This way we can decrease the computational complexity of data embedding to O(M) (with a small proportionality constant) and its memory load to only O(nn·M) integers.

To obtain a 2-D embedding X of Y, we minimize the following stress function:

E(\|d\|) = \sum_{i=1}^{M} \left[ \sum_{j \in O_{nn}(i)} d_{ij}^{2} + c \sum_{j \in O_{rn}(i)} \left( 1 - d_{ij} \right)^{2} \right]    (9)

where c is a scaling factor which balances the contribution of the random neighbors against that of the nearest neighbors. Consequently, the interparticle force from Eq. (3) used in the minimization procedure simplifies to:

f_i = -\sum_{j \in O_{nn}(i)} \left( x_i - x_j \right) + c \sum_{j \in O_{rn}(i)} \frac{1 - d_{ij}}{d_{ij}} \left( x_i - x_j \right)    (10)
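The following sketch computes this simplified force for one particle, using the stored nearest-neighbor indices and the random neighbors drawn as above. Reading Eq. (10) as the (negative) gradient of Eq. (9) and the symbol c are our assumptions; the code is illustrative, not the authors' implementation.

```cpp
// Simplified interparticle force of Eq. (10) acting on particle i: nearest neighbors
// (binary distance 0) attract; random neighbors (distance 1) are driven toward unit
// separation, scaled by the contrast factor c. Illustrative sketch only.
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec2 = std::array<double, 2>;

Vec2 ivhdForce(std::size_t i,
               const std::vector<Vec2>& x,               // current 2-D positions
               const std::vector<std::uint32_t>& nnIdx,  // nearest neighbors of i
               const std::vector<std::uint32_t>& rnIdx,  // random neighbors of i
               double c) {
    Vec2 f = {0.0, 0.0};
    for (const auto j : nnIdx) {             // D_ij = 0: pure attraction, -(x_i - x_j)
        f[0] -= x[i][0] - x[j][0];
        f[1] -= x[i][1] - x[j][1];
    }
    for (const auto j : rnIdx) {             // D_ij = 1: pull/push toward unit distance
        const double dx = x[i][0] - x[j][0], dy = x[i][1] - x[j][1];
        const double d  = std::sqrt(dx * dx + dy * dy) + 1e-12;  // avoid division by zero
        const double s  = c * (1.0 - d) / d;
        f[0] += s * dx;
        f[1] += s * dy;
    }
    return f;
}
```

In one learning cycle this force is evaluated for every sample, so the cost per cycle is O((nn + rn)·M), i.e., linear in M.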

To explain our concept in terms of data embedding, let us follow a toy example.

We assume that Y consists of the samples located at the vertices of two identical, but translated and mutually rotated, regular 18-dimensional hypertetrahedrons. In fact, the data create an unconnected graph whose vertices are joined by the hypertetrahedron edges. By employing classical MDS with the cost function (1) and the dataset defined by the full distance table, we obtained the embedding shown in Figure 2a. One can observe that the result is rather unsatisfactory. Though it demonstrates data separability well, the local structure of Y remains very unclear. Let us now assume that, for every y_i, we know only the distances to a few random neighbors. As one can expect, the final embedding in Figure 2b shows that the data separation is still visible, but the result lacks any fine grained structure. As shown in Figure 2c, this flaw can be partially eliminated by taking into account also the distances to a few nearest neighbors of y_i (in this particular case, small values of nn and rn were used). Additionally, we reduced the strong bias caused by the long-range interactions between samples and their random neighbors by decreasing ten times the scaling factor of the respective ‘forces’ (c in Eq. (10)). Unlike in Figure 2b, in Figure 2c we observe a much better reconstruction of the local structure, but at the cost of worse data separation: the embeddings of the two regular hypertetrahedrons overlap each other.

Figure 2: The results of 2-D embedding of two identical regular hypertetrahedrons for (a) original MDS setting and (b-d) highly reduced number of distances in the cost function (1).

Let us now assume that all the distances between the samples and their random neighbors are equal to 1. To increase the contrast between the nearest and the random neighbors of y_i, we set the distances to the nearest neighbors to 0.5. Because now we use binary distances {0.5, 1}, we do not need to store the floating point matrix D, but only remember the indices (integers) of the nearest neighbors of each y_i (here 70 integers in total). As shown in Figure 2d, the 2-D embedding clearly reconstructs the local and global structure of the nn-graph representing the original 18-dimensional data. Because M and N were small in this example, the value of 0.5 was sufficient to properly contrast the nn and rn distances. However, as shown above (see Eq. (8)), for larger M and N the nn distance should be equal to 0 to reduce the ‘curse of dimensionality’ effect. It is worth mentioning here that the neighborhood relation is not symmetrical; thus, the effective number of constraints imposed on some vertices can be smaller than expected. Consequently, this may produce non-rigid and deformed (2-D) embeddings. Moreover, by assuming rn = 0, one can get meaningless 2-D embeddings even though nn is much greater than 2. Summarizing, we come to the following conjectures.

  1. The number of distances from D that is sufficient to obtain a 2-D embedding of the original N-D dataset Y, preserving both local and global properties of Y, can be surprisingly small and close to the rigidity lower bound of about 2·M.

  2. For each y_i, its vicinity, consisting of a few nearest neighbors, should be preserved, while the global cluster structure is controlled by a few (often just one) randomly selected neighbors.

  3. The distances between y_i and its nearest and random neighbors, and the respective forces in the optimization procedure, should be properly contrasted. For high-dimensional data, the binary distance can be used. This can radically reduce the memory load from O(M^2) floating point numbers (two distance arrays D and d) to O(nn·M) integers, where nn << M.

In the following section we demonstrate that these simple improvements can make the classical MDS competitive, in terms of both efficiency and embedding quality, with the state-of-the-art DE algorithms, particularly for the interactive visualization of large datasets.

3 Implementation and tests

On the basis of the DE algorithm described above, we have developed an integrated tool for interactive visualization of high-dimensional data (ivhd). The code was written in C++, while the GUI was created using the OpenGL™ graphics system integrated with the standard Qt GUI interface. All computations and visualizations were performed on a notebook computer with an Intel® Core™ i5 7200U 2.5-2.71 GHz CPU, 16 GB of memory, and an Nvidia GeForce 940MX 2 GB graphics board.

The DE initialization, which computes the nearest neighbors for the source dataset, was written in CUDA™ and uses two very fast GPU implementations of the brute-force and the approximate (based on the LargeVis idea [28]) algorithms for nearest neighbor search [18]. This search is performed only once, at the beginning of the simulation, and, as shown in Table 1, it requires a relatively long computational time for the MNIST dataset for some parameter settings. Note that the GPU implementation of the brute-force algorithm is faster than the LargeVis approximate algorithm for larger N and nn (see [18] for details). The CPU version of these algorithms (MPI, two cores) is about 10 times slower. The nearest neighbors are cached on a fast SSD disk for further use. To evaluate the ivhd embedding concept, we performed experiments on a few high-dimensional and multi-class datasets. The tests were carried out for the four high-dimensional datasets specified in Table 2.
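For reference, a plain single-threaded CPU version of the brute-force neighbor search is sketched below; the actual initialization uses the multi-GPU CUDA implementations of [18], so this sketch only illustrates the idea. Euclidean distance is assumed (the cosine distance used for the text datasets would be handled analogously).

```cpp
// Single-threaded brute-force nn search (illustration only; the paper uses
// GPU implementations [18]). Returns, for each sample, the indices of its
// nn nearest neighbors under the Euclidean distance.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

std::vector<std::vector<std::uint32_t>>
bruteForceKnn(const std::vector<std::vector<float>>& Y, int nn) {
    const std::size_t M = Y.size();
    std::vector<std::vector<std::uint32_t>> knn(M);
    std::vector<std::pair<float, std::uint32_t>> dist(M);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < M; ++j) {
            float s = 0.0f;                       // squared Euclidean distance y_i - y_j
            for (std::size_t k = 0; k < Y[i].size(); ++k) {
                const float diff = Y[i][k] - Y[j][k];
                s += diff * diff;
            }
            dist[j] = {s, static_cast<std::uint32_t>(j)};
        }
        dist[i].first = std::numeric_limits<float>::max();  // exclude the sample itself
        std::partial_sort(dist.begin(), dist.begin() + nn, dist.end());
        for (int k = 0; k < nn; ++k) knn[i].push_back(dist[k].second);
    }
    return knn;
}
```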

Data set Brute-force Approx.
MNIST (, , ) 1m 27s 1m 35s
MNIST (, , ) 1m 17s 39s
MNIST (, , ) 20m 16s 2m 49s
MNIST (, , ) 20m 29s 32m 06s
Table 1: Selected timings for initialization and nearest neighbors search.
Name | N | M | Classes | Short description
MNIST | 784 | 70 000 | 10 | Well balanced set of gray-scale handwritten digit images (http://yann.lecun.com/exdb/mnist).
NORB | 2048 | 43 600 | 5 | Small NORB dataset (NYU Object Recognition Benchmark) contains stereo image pairs of 50 uniform-colored toys under 18 azimuths, 9 elevations, and 6 lighting conditions (http://www.cs.nyu.edu/ylclab/data/norbv1.0).
20NG | 2000 | 18 759 | 20 | A balanced collection of documents from 20 various newsgroups. Each sample is a text document represented by a bag of words (BOW) feature vector (http://qwone.com/jason/20Newsgroups).
RCV1 | 30 | 804 409 | 103 | Strongly imbalanced text corpus known as RCV1. We transformed the original dataset from https://old.datahub.io/dataset/rcv1-v2-lyrl2004 to N = 30 by using PCA.
Table 2: The list of datasets referenced in this paper (N: dimensionality, M: number of samples).

Before computing the 2-D embeddings, we transformed 20NG and RCV1 to a 30-dimensional space by using the PCA transformation. We have checked that this procedure does not visibly change the results of the embeddings, but it substantially decreases the computational time needed in the initialization phase for finding the nearest (or near) neighbors. We retained the high dimensionality for MNIST (N = 784) and NORB (N = 2048). We used the cosine distance for NORB, RCV1 and 20NG, and the Euclidean one only for MNIST. The results obtained by ivhd were compared with the results generated by two state-of-the-art methods: t-SNE (its faster bh-SNE version [21]) and its improved and recently published clone LargeVis [28]. We used the serial codes published on the web (http://alexanderfabisch.github.io/t-sne-in-scikit-learn.html and https://github.com/lferry007/LargeVis, respectively).

Because of the extremely high computational load required for computing precision/recall coefficients, to compare data separability and class purity we define the following leave-one-out coefficient:

cf = \frac{1}{M} \sum_{i=1}^{M} \frac{nn_c(i)}{k_{nn}}    (11)

where nn_c(i) is the number of those among the k_nn nearest neighbors of sample i which belong to the same class as i. The value of cf is computed for an arbitrarily defined value of k_nn, dependent on the number of points in the classes (here we keep k_nn well below the size of the smallest class, which consists of about 1000 samples). The value of cf approaches 1 for well separated and pure classes, while for randomly scattered points it drops to the fraction expected by chance (roughly the inverse of the number of classes for balanced data).
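A direct way to compute cf is sketched below: for each sample we count how many of its k_nn nearest neighbors (in the source data or in the embedding, whichever is being assessed) share its class label, and average the resulting fractions. The function and variable names are ours.

```cpp
// Leave-one-out class-fidelity coefficient cf of Eq. (11). The k_nn nearest-neighbor
// lists are assumed to be precomputed for the space being evaluated (source or 2-D).
#include <cstddef>
#include <cstdint>
#include <vector>

double cfCoefficient(const std::vector<std::vector<std::uint32_t>>& knn,  // k_nn neighbors per sample
                     const std::vector<int>& label) {                     // class labels
    double sum = 0.0;
    for (std::size_t i = 0; i < knn.size(); ++i) {
        int same = 0;
        for (const auto j : knn[i])
            if (label[j] == label[i]) ++same;            // nn_c(i): same-class neighbors
        sum += static_cast<double>(same) / knn[i].size();
    }
    return sum / knn.size();                             // cf in [0, 1]
}
```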

Figure 3: The embeddings of the MNIST dataset for two different collections of the ivhd parameters.

In Figure 3 we demonstrate 2-D embeddings of the MNIST dataset for two sets of parameters. In Figure 3a one can see a typical effect of the ‘curse of dimensionality’: due to the lack of sufficient contrast between the distances, the classes are mixed up. Consequently, we show in Figure 3b that, by assuming binary distances between the data vectors, the configuration of classes becomes very similar to the existing visualizations of MNIST (see, e.g., L. van der Maaten’s web page https://lvdmaaten.github.io).

Figure 4: The comparison of MNIST embeddings for three different algorithms (bh-SNE, LargeVis and ivhd). For ivhd we used binary distances with fixed parameter settings. The average values of the cf coefficient are as follows: a) …, b) …, c) ….

In Figure 4 we present the comparison of our method with the bh-SNE and LargeVis algorithms. Unfortunately, we were not able to generate ten separate MNIST classes by using these competitive methods. For both of them, cluster 6 divides into two separate clusters (a similar result can be found at http://alexanderfabisch.github.io/t-sne-in-scikit-learn.html). One can also find proper 10-class embeddings on the Internet (see https://lvdmaaten.github.io/tsne), but these are mainly visualizations of the original dataset after PCA transformation to 30 dimensions. It seems that this is a problem of choosing a proper parameter set. Moreover, the stochastic gradient descent method used for minimization of the K-L cost function (Eq. (5)) can get stuck in a local minimum, so the minimization procedure should be repeated a few times, starting from various initial configurations.

On the other hand, comparing visually the quality of the embeddings, one can see that ivhd produces the most blurred solution (it also has a distinctly lower cf value; see Figures 3 and 4). However, the coarse grained embeddings are very similar. For example, the groups of similar clusters {3, 5, 8} and {4, 7, 9} can be observed both in Figure 4b and in Figure 4c. Though these groups cannot be found in Figure 4a, many pictures of MNIST embeddings generated by the t-SNE algorithm, which can be found on the Internet, confirm this observation. Anyway, ivhd is unbeatable in terms of computational time. The generation of the MNIST mapping takes from 12 to 40 seconds (the difference depends on the number of nearest and random neighbors used for the computations and on the values of the parameters of the minimization procedure), while in the case of LargeVis and bh-SNE a single run, without repetitions for finding a better solution, takes 10 and 20 minutes, respectively. The approximate timings are given without the computational time needed for the initialization of the nearest neighbors (see Table 1). However, the high efficiency of ivhd is obtained at the sacrifice of the quality of reconstruction of the fine grained cluster structure and local neighborhood. Similarly to SNE [14], ivhd suffers from the crowding problem: most of the samples are located in the centers of the clusters in the 2-D embeddings.

The better local precision of the t-SNE and LargeVis visualizations can be clearly seen in Figure 5. Although ivhd better separates the classes, the sub-clusters representing the same toys in various conditions are blurred. Meanwhile, in Figures 5a and 5b one can discern the streaks of points representing the same toys. By letting the forces from the nearest neighbors dominate over those from the random neighbors in ivhd, this local structure of NORB also becomes sharper, however at the cost of an even greater crowding effect. The computational time spent by ivhd on the NORB embedding, 30-60 s, is about two times greater than that for MNIST, while with LargeVis and t-SNE it grows to about 15 and 20 minutes, respectively.

Figure 5: The comparison of NORB embeddings for three different algorithms (bh-SNE, LargeVis and ivhd). For ivhd we used fixed settings of its parameters. The average values of the cf coefficient are as follows: a) …, b) …, c) ….
Figure 6: The comparison of 20NG dataset embeddings for the three algorithms (bh-SNE, LargeVis and ivhd). For ivhd we used fixed settings of its parameters. The average values of the cf coefficient are as follows: a) …, b) …, c) ….

For a larger number of overlapping classes and a relatively small number of data points – as in the twenty newsgroups (20NG) dataset – the embeddings obtained by the three algorithms look very messy, with relatively low cf coefficients (see Figure 6). In Figure 6, the ten most visible clusters are marked with circles. As demonstrated on the web page https://lvdmaaten.github.io/tsne, much better cluster separation (but of low class purity) can be obtained by using a different text representation than BOW.

Figure 7: The results of ivhd embedding of the 20NG dataset: a) all clusters, b) the ten best separable clusters, c) all clusters for the doc2vec representation. The average values of the cf coefficient are as follows: a) …, b) …, c) ….
Figure 8: The results of RCV1 2-dimensional embedding by using ivhd: a) all classes visualized (103 in total), while only the largest 7 are visible; b) after removal of the two largest classes (ECAT: 224 149 samples; C151: 185 582), the 5 classes (E212: 150 163; M131: 135 566; M143: 43 920; C12: 30 505; GDIS: 23 345) are separated better. We used fixed settings of the ivhd parameters. The average values of the cf coefficient are as follows: a) …, b) ….

In Figure 7 we compare the ivhd embeddings of the 20NG dataset for the BOW and doc2vec [19] representations. For the latter, as shown in Figure 7c, more clusters than those marked in Figure 7a can be visually recognized. This is also reflected in the values of the cf coefficient. The average computational time required for the 20NG embedding is about 5 s, which is more than an order of magnitude lower than the times obtained for the competitive algorithms.

Unlike in the previous examples, the RCV1 dataset, consisting of about 0.8 million samples from scores of partly overlapping classes, is highly imbalanced. Two separate classes, ECAT (Economics) and C151 (Accounts/Earnings), represent 64% of the whole dataset. This fact is mainly responsible for the high cf value (see Figure 8), despite the rather poor separation of the other categories. A similarly high value of cf is obtained when taking into account the 2-D embedding of only these two categories. However, by excluding them from the dataset, we obtain a very good embedding (see Figure 8b) of the remaining 6 classes, with a high cf value. We were not able to obtain t-SNE and LargeVis embeddings in a reasonable time on the laptop with the technical parameters specified at the beginning of this section. Meanwhile, ivhd needs about 7-8 minutes to produce the result shown in Figure 8a, and about 5-6 minutes for that in Figure 8b.

The comparison of the plots in Figure 9, presenting the cf values with increasing k_nn for the original N-dimensional datasets (dashed lines) and for their 2-D mappings (solid lines), confirms the high quality of ivhd embeddings in terms of class separation and purity. The cf values for the embeddings are more stable with k_nn than those for the original N-dimensional datasets. Only the 20NG dataset, when all 20 classes are considered, gives a rather poor embedding for the BOW tf-idf representation (as shown in Figure 7, the doc2vec text representation gives better results). This is mainly due to the high overlap between the newsgroup categories. Much better results are obtained for the 20NG embeddings of the selected 10 classes (see Figure 9b).

Figure 9: The plots showing the value of the cf coefficient with increasing k_nn for various datasets, both for the source and for the target (embedded) data.

4 Related work

Recently, a large body of work has been devoted to data embedding techniques in the context of visual and exploratory analysis of high-dimensional and large datasets. In general, they can be roughly divided into two groups: the classical DE algorithms (such as MDS, CCA, LLE, CDA, Isomap, etc. [34]) and those based on the concept of stochastic neighbor embedding (SNE) [14]. The SNE idea can be treated as a breakthrough and milestone concept in the visual analysis of large datasets.

There exist many clones of SNE in which the authors:

  • diversify the definition of the neighborhood in the target and source spaces (t-SNE [21]),

  • develop more precise cost functions based on more complex divergence schemes (ws-SNE [35]),

  • combine the algorithms in hierarchical structures (hierarchical-SNE [26]),

  • increase their computational efficiency (bh-SNE [31], qSNE [15], LargeVis [28], Fit-SNE [20], CE [1], triplet embedding [22]).

All of these SNE based methods produce fantastic and very accurate 2-D embeddings (see, e.g., lvdmaaten.github.io/tsne/). SNE based algorithms allow for off-line reconstruction of large sets of high-dimensional data in 2-D (3-D) spaces with a precision not previously available. However, from the perspective of both big data analytics and interactive visualization of large data, they are still too demanding computationally, due to their high memory load and time complexity. As shown in [15, 26, 28, 20, 22, 7], the efficiency of SNE algorithms can be increased, e.g., by employing more sophisticated data structures and by using approximate nearest neighbor search methods, as in the bh-SNE and qSNE methods, respectively. However, these t-SNE clones suffer from at least O(M log M) time complexity. The efficiency of embedding can be further increased by easier identification of clusters [20] and by decreasing the number of calculated distances, which decreases the time complexity to O(M). As shown in [1], by using a special type of data sampling, called compressive embedding acceleration (CE), a relatively small subset of samples appears to be sufficient to diffuse the information to all data points. However, in this situation, the increase in efficiency comes at the expense of a serious deterioration of embedding quality [1].

Just such an approach is represented by the LargeVis DE algorithm [28], an approximated version of t-SNE [31] which exploits both the DE/GV duality and data sampling. First, the LargeVis method constructs an approximate nn-graph for the data prior to its visualization (embedding). According to [28], this approximated procedure should increase the efficiency of nn-graph construction while guaranteeing its high accuracy. However, as shown in our tests [18], the approximate nearest neighbor method developed in [28] is accurate only provided that the length of the nn lists is at least of the same order as the data dimensionality. Otherwise, the accuracy drops rapidly (e.g., for 100-D data the accuracy drops from about 95% for long nn lists to about 5% for short ones). It means that for truly high-dimensional data the nn lists should be very long, which considerably deteriorates the embedding quality and decreases the computational efficiency. This is the main reason why the LargeVis results are very unstable and need very careful parameter tuning. From the point of view of interactive visualization of large and high-dimensional data, the timings reported in [28, 20] are also not encouraging.

In our position paper [7], published earlier than [28], we reported preliminary results of a linear-time MDS-based DE method, in which we postulate using only a few nearest and random neighbors. Similarly to LargeVis, to mitigate the ‘curse of dimensionality’ effect, we assumed that all nearest neighbors of each data sample should be kept close to it, with the corresponding distances set to 0. The distances to a few random neighbors remain the original ones. As shown in [7], this approximation appeared to be extremely fast, giving very precise 2-D embeddings. In [8] we demonstrated that this method is very robust and resistant to imprecise calculation of the nearest neighbors. Simultaneously, based on the idea from [8], we developed the graph visualization method ivga, focused on interactive exploration of complex (social) networks [9]. This method appeared to be extremely efficient and amazingly accurate [10]. It allows for reconstructing the global and local topology of big complex networks with a few million edges on a regular laptop. We also developed a new graph visualization technique based on the so-called B-matrices [5], which, merged with ivga, allows for much deeper exploration of graph properties. In the current paper we integrate our DE and GV approaches to develop a new, computationally very efficient embedding method allowing for interactive visualization of large datasets on a laptop computer.

5 Conclusions

In this paper we present a method for 2-D embedding of large and high-dimensional data with minimal memory and computational time requirements. Its main concept consists in arbitrarily increasing the distance contrast between data vectors in the source dataset Y and in the 2-D visualization of the nn-graph constructed for Y. We have shown that the substitution of the huge and computationally demanding distance matrix D with the nearest neighbor nn-graph data structure (with a small nn), together with the assumption of binary distances between data vectors, allows for a radical decrease of the time and memory complexity of the data embedding problem from O(M^2) to O(M), with a small value of the proportionality coefficient. We have also paid attention to the theory of structural rigidity, which could be useful for further theoretical investigations, e.g., for finding minimal rigid and uniquely realizable graphs required for data embedding.

We have demonstrated that ivhd outperforms the state-of-the-art DE algorithms in terms of computational time by more than one order of magnitude on standard DE benchmark data. Data embeddings obtained by ivhd are also very effective in reconstructing data separation in high-dimensional and large datasets. This is the principal requirement for knowledge mining, because it is precisely the multi-scale clusters of data that represent the basic ‘granules of knowledge’. Due to the simplicity of the ivhd algorithm – it is in fact a clone of the classical MDS algorithm – its efficiency can be further increased by implementing parallel versions in GPU and MIC environments, as we have already done with the classical MDS method (see [23, 24, 25]).

It is worth mentioning that ivhd yields less impressive results than the recently published SNE clones in reconstructing the local neighborhood. The ‘crowding problem’, similar to that in SNE, causes most of the data points in the 2-D embeddings to gather in the centers of the clusters. Nevertheless, the accurate reconstruction of the nearest neighbors is rather a secondary requirement in interactive visualization of big data. The exact nn lists (or their approximations) are known prior to data embedding, and they can be presented visually at any time, e.g., as the data points' spines in X. Moreover, the exact lists of the nearest neighbors are not reliable in the context of both measurement errors and the ‘curse of dimensionality’ principle. Besides, if a more accurate reconstruction of data locality were required, ivhd could be used to generate the initial configuration for the much slower SNE based DE algorithms. Unlike its SNE competitors, ivhd has only three free parameters which should be fitted to the data: the numbers of nearest and random neighbors (nn, rn) and the force scaling factor c. Because the method is fast, they can easily be matched interactively, although a ‘universal’ set of these parameters worked very well for most of the data (and networks) visualized by the authors of this paper. Summing up, we highly recommend ivhd as a good candidate for an efficient data embedding engine in interactive visualization of truly big data. The recent Win32 version of our software is available here: https://goo.gl/6HzEd3.

References

  • [1] Amid, E., and Warmuth, M. K. Transformation invariant and outlier revealing dimensionality reduction using triplet embedding.
  • [2] Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. When is “nearest neighbor” meaningful? In International conference on database theory (1999), Springer, pp. 217–235.
  • [3] Borcea, C., and Streinu, I. The number of embeddings of minimally rigid graphs. Discrete & Computational Geometry 31, 2 (2004), 287–303.
  • [4] Cohen, R. F., Eades, P., Lin, T., and Ruskey, F. Three-dimensional graph drawing. Algorithmica 17, 2 (1997), 199–208.
  • [5] Czech, W., Mielczarek, W., and Dzwinel, W. Distributed computing of distance-based graph invariants for analysis and visualization of complex networks. Concurrency and Computation: Practice and Experience 29, 9 (2017).
  • [6] Dzwinel, W. Virtual particles and search for global minimum. Future Generation Computer Systems 12, 5 (1997), 371–389.
  • [7] Dzwinel, W., and Wcisło, R. Very fast interactive visualization of large sets of high-dimensional data. Procedia Computer Science 51 (2015), 572–581.
  • [8] Dzwinel, W., and Wcisło, R. ivhd: A robust linear-time and memory efficient method for visual exploratory data analysis. In International Conference on Machine Learning and Data Mining in Pattern Recognition (2017), Springer, pp. 345–360.
  • [9] Dzwinel, W., Wcisło, R., and Czech, W. ivga: A fast force-directed method for interactive visualization of complex networks. Journal of Computational Science 21 (2017), 448–459.
  • [10] Dzwinel, W., Wcisło, R., and Strzoda, M. ivga: Visualization of the network of historical events. In International Conference on Internet of Things and Machine Learning (2017).
  • [11] Floyd, R. W. Algorithm 97: shortest path. Communications of the ACM 5, 6 (1962), 345.
  • [12] France, S. L., and Carroll, J. D. Two-way multidimensional scaling: A review. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 41, 5 (2011), 644–661.
  • [13] Fruchterman, T. M., and Reingold, E. M. Graph drawing by force-directed placement. Software: Practice and experience 21, 11 (1991), 1129–1164.
  • [14] Hinton, G. E., and Roweis, S. T. Stochastic neighbor embedding. In Advances in neural information processing systems (2003), pp. 857–864.
  • [15] Ingram, S., and Munzner, T. Dimensionality reduction for documents with nearest neighbor queries. Neurocomputing 150 (2015), 557–569.
  • [16] Ingram, S., Munzner, T., and Olano, M. Glimmer: Multilevel mds on the gpu. IEEE Transactions on Visualization and Computer Graphics 15, 2 (2009), 249–261.
  • [17] Keim, D. A., Mansmann, F., and Thomas, J. Visual analytics: how much visualization and how much analytics? ACM SIGKDD Explorations Newsletter 11, 2 (2010), 5–8.
  • [18] Kłusek, A., and Dzwinel, W. Multi-gpu k-nearest neighbor search in the context of data embedding.
  • [19] Le, Q., and Mikolov, T. Distributed representations of sentences and documents. In International Conference on Machine Learning (2014), pp. 1188–1196.
  • [20] Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S., and Kluger, Y. Efficient algorithms for t-distributed stochastic neighborhood embedding. arXiv preprint arXiv:1712.09005 (2017).
  • [21] Maaten, L. v. d., and Hinton, G. Visualizing data using t-sne. Journal of machine learning research 9, Nov (2008), 2579–2605.
  • [22] Paratte, J., Perraudin, N., and Vandergheynst, P. Compressive embedding and visualization using graphs. arXiv preprint arXiv:1702.05815 (2017).
  • [23] Pawliczek, P., and Dzwinel, W. Interactive data mining by using multidimensional scaling. Procedia Computer Science 18 (2013), 40–49.
  • [24] Pawliczek, P., Dzwinel, W., and Yuen, D. A. Visual exploration of data by using multidimensional scaling on multicore cpu, gpu, and mpi cluster. Concurrency and Computation: Practice and Experience 26, 3 (2014), 662–682.
  • [25] Pawliczek, P., Dzwinel, W., and Yuen, D. A. Visual exploration of data with multithread mic computer architectures. In International Conference on Artificial Intelligence and Soft Computing (2015), Springer, pp. 25–35.
  • [26] Pezzotti, N., Höllt, T., Lelieveldt, B., Eisemann, E., and Vilanova, A. Hierarchical stochastic neighbor embedding. In Computer Graphics Forum (2016), vol. 35, Wiley Online Library, pp. 21–30.
  • [27] Shaw, B., and Jebara, T. Structure preserving embedding. In Proceedings of the 26th Annual International Conference on Machine Learning (2009), ACM, pp. 937–944.
  • [28] Tang, J., Liu, J., Zhang, M., and Mei, Q. Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web (2016), International World Wide Web Conferences Steering Committee, pp. 287–297.
  • [29] Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. science 290, 5500 (2000), 2319–2323.
  • [30] Thorpe, M. F., and Duxbury, P. M. Rigidity theory and applications. Springer Science & Business Media, 1999.
  • [31] Van Der Maaten, L. Accelerating t-sne using tree-based algorithms. Journal of machine learning research 15, 1 (2014), 3221–3245.
  • [32] Van Der Maaten, L., Postma, E., and Van den Herik, J. Dimensionality reduction: a comparative. J Mach Learn Res 10 (2009), 66–71.
  • [33] Wilkinson, L., Anand, A., and Grossman, R. Graph-theoretic scagnostics.
  • [34] Yang, L. Data embedding techniques and applications. In ACM Proceedings of the 2nd international workshop on Computer vision meets databases (2005), pp. 29–33.
  • [35] Yang, Z., Peltonen, J., and Kaski, S. Optimization equivalence of divergences improves neighbor embedding. In International Conference on Machine Learning (2014), pp. 460–468.