1 Introduction
In the age of data science, interactive visualization of large high-dimensional datasets is an essential tool in exploratory data analysis and knowledge extraction. It provides better insight into data topology and allows for interactive exploration by direct manipulation of the whole dataset or of a fragment of it. For example, one can remove irrelevant data samples, identify outliers, or zoom in on a particular fragment of the data. By changing visualization modes (e.g., the form of the cost function), the multiscale structure of the data can be investigated. Summarizing, interactive visualization allows for:

formulation and instant verification of a number of hypotheses;

precise matching of data mining tools to the properties of data investigated;

optimization of their parameters.
As the famous Visual Analytics Mantra by Keim et al. [17] says: 'analyze first, show the important, zoom, filter and analyze further, details on demand'. Herein, we focus on the application of data embedding (DE) methods [34] to interactive visualization of large high-dimensional datasets.
Data embedding (or, shortly, embedding) is defined as a transformation B: Y → X of an N-dimensional (N-D) dataset Y = {y_i}, i = 1, ..., M, into its n-dimensional 'shadow' X = {x_i}, where n ≪ N and M is the number of samples. The mapping B should be understood as a lossy compression of the N-dimensional dataset in n-dimensional space. The mapping is realized by minimization of a cost function E(‖Y − X‖), where ‖·‖ is a measure of topological dissimilarity between Y and X. Particularly, E should be sensitive to the multiscale cluster structure of Y. Due to the complexity of the low-dimensional manifold – immersed in the N-D feature space – on which the data samples Y are located, a perfect embedding of Y in n dimensions, with 0 compression error, is possible only for trivial cases. In fact, the embedding task is a simplification of a more generic problem, where B is a mapping of a given structure from a source space into a corresponding structure in a target space, assuming the smallest reconstruction error.
In the context of N-dimensional data embedding in a visually perceptible Euclidean space, we assume here that n = 2, having in mind that 3-D data visualization can be realized in a similar way. As shown in a large body of papers (see, e.g., [34, 32, 12]), DV of large high-dimensional datasets that is both sufficiently precise in reconstructing the N-D data topology and computationally affordable is an algorithmic challenge. To preserve the topological properties of Y, both the classical MDS (multidimensional scaling) methods and the state-of-the-art clones of the stochastic neighbor embedding (SNE) concept [14] require computing and storing all O(M²) distances between data samples, both for the source dataset Y and for its 2-D embedding X. This makes the most precise state-of-the-art visualization algorithms, such as t-SNE [21] and its clones, suffer from O(M²) time and memory complexity. The time complexity can be decreased to O(M log M) by using approximate versions of t-SNE, such as bh-SNE [31] and its other variants and simplifications [35, 15, 26, 28, 20]. It is worth mentioning here that, in fact, the time complexity of DE is greater, and depends also on the algorithm used for the cost function minimization, i.e., on the number of learning cycles required to reach the global minimum of the cost function. In the rest of this paper – as in all works on DE – by time complexity we mean the cost of a single learning cycle performed on the full dataset. Summing up, as interactive and visual data exploration imposes very strict time performance regimes, the sophisticated data structures and high memory load of the state-of-the-art DE algorithms can make them too slow for visual exploration of datasets consisting of millions of samples.
In this paper we propose a novel and very fast method for visualization of high-dimensional data – the interactive visualization of high-dimensional data (ivhd) technique – with linear time and memory complexity. It outperforms the state-of-the-art DV algorithms in computational speed while retaining high quality of data embedding. To this end, we made the following contributions:

We demonstrate that only a small fraction of the distances (proximities) between N-D data samples is required – just a few nearest-neighbor and random-neighbor distances per sample – to obtain a high quality 2-D embedding.

Moreover, we show that these distances can be binary, i.e., equal to 0 for the nearest and 1 for the random neighbors of each y_i. Then, instead of using floating point distances, one needs to find only the indices of a few (nn) nearest neighbors for each data sample y_i (footnote 1: here, instead of the index k, which usually denotes the number of nearest neighbors, we use nn to be consistent with the rest of the notation used in this paper). Consequently, the memory requirement shrinks from 2·M² floating point numbers (two full distance arrays D and d) to barely nn·M integers, where nn ≪ M.

In this way, we have also reduced the overall time complexity of the DE method to O(M).

We clearly show that data embedding (DE) is equivalent to visualization of the nearest neighbor graph (nn-graph) built for the input data.
Summarizing, the principal contribution of this paper is an essential improvement of the time/memory complexity of data embedding (with a minor deterioration of embedding quality), which allows for interactive data manipulation. The method preserves very well the coarse-grained cluster structure of the original N-D data and enables interactive visualization of large datasets on a laptop. Consequently, our improvements open a new perspective on visual exploration of much bigger data on more powerful computer systems.
In the following section we present the details of our approach, while in the rest of the paper we discuss its advantages and disadvantages in the context of both the state-of-the-art DE methods and interactive visualization of truly large datasets. In support of our concept, we present the results of 2-D embeddings of popular benchmark datasets such as small NORB, MNIST, twenty newsgroups (20NG) and RCV1. We summarize our findings in the Conclusions section.
2 IVHD data embedding
Our method is a radical simplification of the force-directed implementation of the multidimensional scaling (MDS) mapping [6, 23, 24, 25]. Because it contrasts with the modern SNE concept exploited in the state-of-the-art DE papers [14, 21, 31, 35, 15, 26, 28], we first briefly present these two approaches.
2.1 Background
We assume that each N-D feature vector y_i is represented in 2-D Euclidean space by a corresponding point x_i, i = 1, ..., M. We define two distance matrices: D = [D_ij] and d = [d_ij], in the source and target spaces, respectively. The value of D_ij is a measure of dissimilarity (proximity) between the feature vectors y_i and y_j, while in the 2-D Euclidean space d_ij = ‖x_i − x_j‖. In general, the source space need not be a Cartesian one, and can be solely defined by a proximity matrix between any two data objects of a (possibly) different data representation than the vector one (e.g., shapes, graphs, etc.).
The classical multidimensional scaling (MDS), which performs the mapping B: Y → X, consists in minimization of the cost (stress) function

E(‖D‖, ‖d‖) = Σ_{i<j} w_ij · (D_ij^k − d_ij^k)^m,   (1)

which represents the error between the dissimilarities D_ij and the corresponding distances d_ij, where w_ij are weights and k, m are parameters. Herein, we assume that w_ij = 1, k = 1 and m = 2. However, there are many other forms of this stress function, which are defined, e.g., in [34, 32, 12]. One can easily adapt our approach to a particular (k, m) parameter choice. Anyway, finding the global minimum of this multidimensional and multimodal cost function (1) with respect to X is not a trivial problem. To this end, we use the force-directed approach presented in [6, 23, 24, 25].
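For concreteness, the stress function (1) with w_ij = 1, k = 1 and m = 2 can be sketched in a few lines (a NumPy illustration only, not the paper's C++ implementation):

```python
import numpy as np

def mds_stress(D, X):
    """Stress E = sum_{i<j} (D_ij - d_ij)^2, i.e. Eq. (1) with w_ij = 1, k = 1,
    m = 2, for a source dissimilarity matrix D (M x M) and embedding X (M x 2)."""
    diff = X[:, None, :] - X[None, :, :]      # pairwise displacement vectors
    d = np.sqrt((diff ** 2).sum(-1))          # target-space distances d_ij
    iu = np.triu_indices(len(X), k=1)         # count each pair once (i < j)
    return ((D[iu] - d[iu]) ** 2).sum()

# Three collinear points embedded exactly give zero stress:
D = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., 1., 0.]])
X = np.array([[0., 0.], [1., 0.], [2., 0.]])
print(mds_stress(D, X))  # 0.0
```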
We assume that the set of points X is treated as an ensemble of M interacting particles. The particles evolve in 2-D space with discrete time t, scaled by the timestep Δt, according to the Newtonian equations of motion discretized using the leapfrog numerical scheme. Here we use their simplified form [9]:

v_i^{t+1} = λ · (v_i^t + Δt · f_i^t),   (2)

f_i^t = Σ_j (D_ij − d_ij) · (x_i^t − x_j^t) / d_ij,   (3)

x_i^{t+1} = x_i^t + Δt · v_i^{t+1},   (4)
which resembles the well-known momentum minimization method. The value of λ represents friction, and it is equal to 1 in the absence of energy dissipation. Meanwhile, λ and Δt together parameterize the strength of the interparticle forces f_i. The proper balance of λ and Δt is crucial for the speed of convergence of the particle system to a stable and good (close to the global one) minimum of the stress function (1). We demonstrated in [6, 23, 24, 25] that this formulation of the classical MDS algorithm produces acceptable embeddings for low-dimensional and rather small datasets in sub-linear time complexity, i.e., similar to the widely used stochastic gradient descent (SGD) algorithms and their clones employed, e.g., in the original t-SNE implementation [21]. The principal problems with MDS for high-dimensional data are both the 'curse of dimensionality' effect, which produces poor quality embeddings, and the high computational and storage complexity for large M.
As shown in [2], under a broad set of conditions, as dimensionality increases, the distance to the nearest feature vector approaches the distance to the furthest one for as few as 10-15 dimensions. Consequently, this low distance contrast causes MDS to fail completely in mapping high-dimensional data to 2-D space, producing a meaningless sphere of 2-D points for most high-dimensional datasets. As shown in [14, 21], this adverse effect can be radically mitigated by changing the definitions of the proximity measure and of the cost function. The group of methods based on stochastic neighbor embedding, like the most popular t-SNE [21], defines the similarity of two samples i and j (both in the source and target spaces) in terms of the probabilities (p_ij and q_ij, respectively) that i would pick j as its neighbor, and vice versa. Then the cost function is defined as the KL divergence:

E(P, Q) = KL(P‖Q) = Σ_{i≠j} p_ij · log(p_ij / q_ij),   (5)

where, for the t-SNE algorithm, p_ij is approximated by a Gaussian with variance σ_i² centered at y_i, while q_ij is defined by the heavy-tailed Cauchy distribution in the target space, to avoid both the crowding and the optimization problems of the older SNE method. The respective probabilities are defined as follows:

p_{j|i} = exp(−‖y_i − y_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖y_i − y_k‖² / 2σ_i²),   p_ij = (p_{j|i} + p_{i|j}) / 2M,

q_ij = (1 + ‖x_i − x_j‖²)^{−1} / Σ_{k≠l} (1 + ‖x_k − x_l‖²)^{−1}.   (6)
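The probabilities (6) and the cost (5) can be expressed compactly as follows (a simplified NumPy sketch: a single global bandwidth sigma replaces t-SNE's per-point perplexity calibration, so this is not the full algorithm):

```python
import numpy as np

def tsne_cost(Y, X, sigma=1.0):
    """KL(P||Q) of Eq. (5) for source data Y (M x N) and embedding X (M x 2).
    One global bandwidth sigma is used instead of per-point calibration."""
    M = len(Y)
    D2 = ((Y[:, None] - Y[None, :]) ** 2).sum(-1)
    P = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(1, keepdims=True)          # conditional p_{j|i}
    P = (P + P.T) / (2 * M)               # symmetrized joint p_ij
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    Q = 1.0 / (1.0 + d2)                  # heavy-tailed Cauchy kernel
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()                          # partition function of Eq. (6)
    mask = P > 0
    return (P[mask] * np.log(P[mask] / Q[mask])).sum()
```

An embedding that keeps close source pairs close in 2-D yields a lower KL cost than one that scatters them, which is what the minimization exploits.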
As demonstrated in [21], t-SNE outperforms other data reduction techniques, producing very precise embeddings for all kinds of datasets, including high-dimensional ones. However, unlike in the classical MDS presented above, in t-SNE (and its clones) we cannot use the force-directed approach for minimization, due to the partition function in the denominator of Eqs. (6). The gradient descent (GD) methods (such as stochastic GD or asynchronous stochastic GD) more often get stuck in local minima, which makes the quality of the embedding very sensitive to the choice of parameters [28]. Moreover, similarly to MDS, at least O(M²) distances in D, and the same number of corresponding distances in d, have to be calculated. While D is computed only once, at the beginning of the simulation, the matrix d has to be recomputed in every timestep. Moreover, the two distance matrices have to be kept in operational memory. This results in prohibitively high time complexity (with respect to a single training cycle of the minimization method applied) and storage. This excludes the two methods as acceptable candidates for the embedding engine of an interactive visualization of large datasets. Moreover, in terms of parallel computations on multicore CPUs, the maximal number of samples that can be embedded scales, at most, with the square root of the number of CPU (or GPU) threads involved in the computations [24, 25]. Anyway, the quadratic memory load remains the key problem and puts an upper limit on M. The application of the Barnes-Hut approximation in both the MDS and t-SNE (bh-SNE) algorithms [31] can reduce their computational complexity to O(M log M), but at the cost of an increase in algorithmic complexity and, consequently, a decrease in parallelization efficiency. Despite the reduction in the number of distance computations in bh-SNE, the method still requires the full distance table D with O(M²) floating point numbers. Do we really need so many distances to embed N-D data in 2-D Euclidean space?
2.2 Key concept
Linear time and memory complexity of data embedding can be achieved by using only a limited number of distances from D and the corresponding distances from d (as in [6, 23, 24, 25, 16]). In the context of the so-called theory of structural rigidity [30], not all distances between the samples of an N-D dataset are needed to retain the rigidity of its shape in N-D space. The term 'rigidity' can be understood as the property of a structure made of rods (distances) and joints (data vectors) that it does not bend or flex under an applied force. Therefore, to ensure the rigidity of Y (and of its 2-D embedding X), only a fraction of the distances (rods) from D (and d) is required. What is the minimal number of distances for which both the original and the target datasets remain rigid?
As shown in [30, 3], a minimal rigid structure consisting of M joints (vectors) in n-dimensional space requires at least

K = n·M − n(n + 1)/2   (7)

rigid rods (distances). It means that, in the source space, O(M) distances can ensure a lossless reconstruction of the data structure in N-D. Particularly, in 2-D the structural rigidity can be preserved with K = 2M − 3 distances. However, Eq. (7) defines only the lower bound on the number of distances required to retain structural rigidity. Therefore, herein we try to answer the following questions:

What minimal number of distances in D should be known to obtain a reasonable reconstruction of Y in X while simultaneously preserving the structural rigidity of X? Is it closer to O(M²) or rather to the rigidity lower bound of about 2M?

Which distances should be retained?
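The bound (7) follows Maxwell's counting rule from structural rigidity; as a quick numerical check (an illustrative sketch):

```python
def min_rigid_rods(M, n):
    """Lower bound (7): the number of rods (distances) needed to make
    a structure of M joints rigid in n-dimensional space."""
    return n * M - n * (n + 1) // 2

print(min_rigid_rods(4, 2))  # 5: a 2-D quadrilateral needs one diagonal brace
print(min_rigid_rods(4, 3))  # 6: a tetrahedron's six edges make it rigid in 3-D
```

The bound grows linearly in M for fixed n, which is the key to the linear-memory argument below.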
In general, D need not be a Euclidean distance matrix. It may represent the proximity of samples in an abstract space. Particularly, the samples can occupy a complicated low-dimensional manifold (see Figure 1). Then, one can assume that only the distances of each sample to its nearest neighbors are Euclidean. As shown in [29, 9], the nearest neighbor graph (nn-graph), in which vertices represent data samples and edges the connections to their nearest neighbors, can be treated as an approximation of this low-dimensional manifold immersed in the high-dimensional feature space (see Figure 2). In the CDA and Isomap [34, 29] DE algorithms, the proximities of more distant graph vertices (samples) are calculated as the lengths of the shortest paths between them in the respective graph. However, though in this way we can more precisely approximate the real distances on the low-dimensional manifold, the full distance matrix D has to be calculated by the very time consuming Dijkstra (or Floyd-Warshall [11]) algorithm and, what is even more demanding, has to be stored in operational memory. In fact, the calculation of the full D is not necessary, and the problem of data embedding can be replaced by the congruent problem of nearest neighbor graph visualization.
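For comparison with ivhd's shortcut, the Isomap/CDA-style geodesic proximities mentioned above can be sketched with off-the-shelf tools (scipy/scikit-learn used purely for illustration; note the full M x M output matrix, i.e., exactly the quadratic cost that ivhd avoids):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

# Points along a 3-D curve play the role of a low-dimensional manifold.
t = np.linspace(0.0, 3.0, 100)
Y = np.c_[np.cos(t), np.sin(t), t]

G = kneighbors_graph(Y, n_neighbors=5, mode='distance')  # sparse nn-graph
D_geo = shortest_path(G, method='D', directed=False)     # Dijkstra; O(M^2) memory

# A geodesic distance can only exceed the straight-line (chord) distance:
print(D_geo[0, 99] >= np.linalg.norm(Y[0] - Y[99]))  # True
```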
The methodologies of graph visualization (GV) and DE share many common concepts. For example, the metaphor of a particle system and the force-directed method, used for cost function minimization, were introduced independently in GV and DE [6, 13]. Summing up, as shown in Figure 1, the first step of our DE method (ivhd) consists in the construction of the nn-graph, which approximates the low-dimensional non-Cartesian manifold immersed in the N-D feature space. Then we can use the fast procedure of graph visualization in 2-D space presented in our recent work [11]. In the rest of this paper we demonstrate that the locations of the nn-graph vertices in 2-D space represent a good quality embedding.
Let us assume that G(V, E) is the nearest neighbor graph constructed for a high-dimensional dataset Y. The data samples y_i correspond to the graph vertices v_i ∈ V, while E is the set of edges connecting each v_i with its nn nearest neighbors. Because we have assumed (see Figure 1) that this graph is an approximation of a low-dimensional manifold, we presume that its topology-preserving embedding in 2-D Euclidean space will produce results similar to those of DE algorithms. According to the definition formulated in [27], 'topology is preserved if a connectivity algorithm, such as k-nearest neighbors, can easily recover the edges of the input graph from only the coordinates of the nodes after embedding'. It means that, unlike in DE algorithms, which tend to retain the order of the neighbors, the ordering of the nearest neighbors (usually a few of them) is irrelevant in this case. In fact, the requirement of preserving the order of a small number of nearest neighbors of each y_i, which are subject to uncertainty and measurement errors, is secondary in terms of large data visualization. The crucial problem for formulating ad hoc hypotheses and for deciding about the use of particular machine learning tools in further data analytics is revealing the data topology, i.e., its multiscale cluster structure.
Because we are not interested in the ordering of the nearest neighbors of each y_i, to get a good approximation of the data manifold, for the sake of simplicity, the distances between connected vertices of the nn-graph should be the same and as close to 0 as possible. Conversely, the unconnected vertices of the nn-graph have to be more distant, to increase the contrast between the two types of distances. It is obvious that nn-graphs (for small nn) are largely unconnected and not rigid. Therefore, the nn-graph has to be augmented with additional edges. Here, we assume that each sample, apart from the edges to its nn nearest neighbors, is connected to rn additional randomly selected neighbors. The value of rn should ensure the rigidity of this augmented graph. Thus, NN(i) and RN(i) are the sets of indices of the nn nearest (connected) neighbors and the rn (unconnected) random neighbors of a feature vector y_i in Y, respectively. We also assume that NN(i) and RN(i) are disjoint. However, this assumption is superfluous because the probability of picking a nearest neighbor as a random neighbor is negligibly low for large M. In fact, the random neighbors are typically very distant from y_i. Therefore, having in mind the effect of the 'curse of dimensionality', we can radically simplify the data similarity to a binary one {'nearest' (0), 'random' (1)}, i.e.:
D_ij = 0 if j ∈ NN(i),   D_ij = 1 if j ∈ RN(i).   (8)
In this way, instead of the matrix D, we have as input data the lists of the nn nearest neighbors of each y_i (the nn-graph edges). In the rest of this paper, the respective distances from Eq. (8) we shortly call nn-distances and rn-distances.
Summarizing, we can presume that the identification of nn nearest neighbors and rn random neighbors for each y_i, where nn + rn is small, is sufficient to assure the rigidity of the augmented nn-graph. This is rather obvious, because it is well known [4] that any graph can be embedded, preserving its topology (in the context of the definition presented above), in at most 3-D Euclidean space, while planar graphs can be embedded in 2-D. In most cases, nn-graphs for small nn and rn are planar, though the probability of non-planarity rapidly increases with the graph size [33]. Thus, the total number of connections required to make the augmented nn-graph rigid is of order O(M). In fact, we need to store only the indices of the nn nearest neighbors for every y_i, while the indices of the random neighbors can be generated ad hoc during the embedding process.
Additionally, we presume that by imposing a high contrast between these two types of distances, we will be able to preserve in X the multiscale cluster structure of Y, by employing the MDS mapping on only those binary distances. In this way, we decrease both the computational complexity of data embedding to O(M) (with nn + rn ≪ M) and its memory load to only nn·M integers.
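The input structure described above can be sketched as follows (the function name is ours; brute-force O(M²) neighbor search is used only for clarity, whereas ivhd relies on fast, possibly approximate, GPU kNN search):

```python
import numpy as np

def augmented_graph(Y, nn=2, rn=1, seed=0):
    """Return, for each sample, the indices of its nn nearest neighbors
    (binary distance 0) and rn randomly chosen neighbors (binary distance 1).
    Only M*(nn+rn) integers are kept -- no floating point distance matrix."""
    rng = np.random.default_rng(seed)
    M = len(Y)
    D2 = ((Y[:, None] - Y[None, :]) ** 2).sum(-1)  # squared distances, demo only
    np.fill_diagonal(D2, np.inf)
    NN = np.argsort(D2, axis=1)[:, :nn]            # nearest-neighbor indices
    # Random neighbors; for large M the chance of hitting a true nearest
    # neighbor (or i itself) is negligible, as argued in the text.
    RN = np.array([rng.choice(M, size=rn, replace=False) for _ in range(M)])
    return NN, RN

NN, RN = augmented_graph(np.array([[0.], [0.1], [5.], [5.1]]), nn=1, rn=1)
print(NN.ravel())  # [1 0 3 2]
```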
To obtain the 2-D embedding X of Y, we minimize the following stress function:

E = Σ_i [ Σ_{j∈NN(i)} d_ij² + c · Σ_{j∈RN(i)} (1 − d_ij)² ],   (9)

where c is the scaling factor of the forces coming from the random neighbors.
Consequently, the interparticle force from Eq. (3) in the minimization procedure simplifies to:

f_i = Σ_{j∈NN(i)} (x_j − x_i) + c · Σ_{j∈RN(i)} (1 − d_ij) · (x_i − x_j) / d_ij.   (10)
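Putting Eqs. (2)-(4) and (10) together, the whole embedding engine fits in a few lines (an illustrative NumPy sketch under our own parameter choices, not the paper's C++/CUDA implementation; the toy NN/RN arrays are hypothetical):

```python
import numpy as np

def ivhd_embed(NN, RN, steps=500, dt=0.1, lam=0.9, c=0.01, seed=0):
    """Momentum ('leapfrog', Eqs. (2)-(4)) minimization of the binary-distance
    stress (9) via the force (10).  NN (M x nn) and RN (M x rn) hold the
    integer indices of nearest and random neighbors."""
    rng = np.random.default_rng(seed)
    M = NN.shape[0]
    x = rng.normal(scale=1e-2, size=(M, 2))   # random initial embedding
    v = np.zeros_like(x)
    for _ in range(steps):
        f = np.zeros_like(x)
        for i in range(M):
            f[i] += (x[NN[i]] - x[i]).sum(0)  # attract nearest neighbors (D_ij = 0)
            for j in RN[i]:
                r = x[j] - x[i]
                d = np.linalg.norm(r) + 1e-12
                f[i] += c * (d - 1.0) * r / d  # push random neighbors toward d_ij = 1
        v = lam * (v + dt * f)                # Eq. (2): damped velocity update
        x = x + dt * v                        # Eq. (4): position update
    return x

# Toy input: two 3-point clusters, linked only by random cross-cluster edges.
NN = np.array([[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]])
RN = np.array([[3], [4], [5], [0], [1], [2]])
X2 = ivhd_embed(NN, RN)
```

In the resulting X2 the two clusters collapse into tight groups that the weak random-neighbor forces keep apart, which is the behavior exploited throughout the paper.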
To explain our concept in terms of data embedding, let us follow a toy example.
We assume that Y consists of samples located at the vertices of two identical, but translated and mutually rotated, regular 18-dimensional hypertetrahedrons. In fact, the data create an unconnected graph whose vertices are joined by the hypertetrahedron edges. By employing classical MDS with the cost function (1) and the dataset defined by the full distance table, we obtained the embedding shown in Figure 2a. One can observe that the result is rather unsatisfactory. Though it demonstrates data separability well, the local structure of Y remains very unclear. Let us now assume that we know only the distances from each y_i to its rn random neighbors. As one can expect, the final embedding in Figure 2b shows that the data separation is still visible, but it lacks any fine-grained structure. As shown in Figure 2c, this flaw can be partially eliminated by taking into account also the distances to the nn nearest neighbors of each y_i. Additionally, we reduced the strong bias caused by the long-range interactions between samples and their random neighbors by decreasing 10 times the scaling factor of the respective 'forces' (c in Eq. (10)). Unlike in Figure 2b, in Figure 2c we observe a much better reconstruction of the local structure, but at the cost of worse data separation. The embeddings of the two regular hypertetrahedrons overlap each other.
Let us now assume that all the distances between samples are binarized. To increase the contrast between the nearest and random neighbors of y_i, we set the nn-distances to 0.5 and the rn-distances to 1. Because we now use binary distances {0.5, 1}, we do not need to store the floating point matrix D, but only remember the indices (integers) of the nn nearest neighbors of each y_i (here 70 integers in total). As shown in Figure 2d, the 2-D embedding clearly reconstructs the local and global structure of the graph representing the original 18-dimensional data. Because N and M were small in this example, the nn-distance value of 0.5 was sufficient to properly contrast the nn and rn distances. However, as we showed above (see Eq. (8)), for larger N and M the nn-distances should be equal to 0 to reduce the 'curse of dimensionality' effect. It is worth mentioning here that the neighborhood relation is not symmetrical, thus the effective number of connections of a vertex can be smaller than nn. Consequently, it may produce non-rigid and deformed 2-D embeddings. Moreover, by assuming that rn = 0, one can get meaningless 2-D embeddings even though nn is much greater than 2. Summarizing, we come to the following conjectures.

The number of distances from D sufficient to obtain a 2-D embedding of the original N-D dataset Y that preserves both the local and global properties of Y can be surprisingly small and close to the rigidity lower bound (7).

For each y_i, its vicinity consisting of a few nearest neighbors should be preserved, while the global cluster structure is controlled by a few (often one) randomly selected neighbors.

The distances between y_i and its nearest and random neighbors, and the respective forces in the optimization procedure, should be properly contrasted. For high-dimensional data, the binary distance can be used. This can radically reduce the memory load from 2·M² floating point numbers (two distance arrays D and d) to nn·M integers, where nn ≪ M.
In the following section we demonstrate that these simple improvements can make the classical MDS competitive, both in terms of efficiency and embedding quality, with the state-of-the-art DE algorithms, particularly for the interactive visualization of large datasets.
3 Implementation and tests
On the basis of the DE algorithm described above, we have developed an integrated tool for interactive visualization of high-dimensional data (ivhd). The code was written in C++, while the GUI was created using the OpenGL™ graphics system integrated with the standard Qt GUI interface. All the computations and visualizations were performed on a notebook computer with an Intel® Core™ i5-7200U 2.5-2.71 GHz CPU, 16 GB of memory, and an Nvidia GeForce 940MX 2 GB graphics board.
The DE initialization, which computes the nearest neighbors for the source dataset, was written in CUDA™ and uses two very fast GPU implementations of brute-force and approximate (based on the LargeVis idea [28]) algorithms for nearest neighbor search [18]. This search is performed only once, at the beginning of the simulation, and, as shown in Table 1, it requires a relatively long computational time for the MNIST dataset. Note that the GPU implementation of the brute-force algorithm is faster for larger M and N than the LargeVis approximate algorithm (see [18] for details). The CPU version of these algorithms (MPI, two cores) is about 10 times slower. The nearest neighbors are cached on a fast SSD disk for further use. To evaluate the ivhd embedding concept, we have performed experiments on four high-dimensional and multiclass datasets, which are specified in Table 2.
Data set  Brute-force  Approx.

MNIST (, , )  1m 27s  1m 35s 
MNIST (, , )  1m 17s  39s 
MNIST (, , )  20m 16s  2m 49s 
MNIST (, , )  20m 29s  32m 06s 
Name  N  M  #classes  Short description

MNIST  784  70 000  10  Well balanced set of grayscale handwritten digit images (http://yann.lecun.com/exdb/mnist). 
NORB  2048  43 600  5  Small NORB dataset (NYU Object Recognition Benchmark) contains stereo image pairs of 50 uniformcolored toys under 18 azimuths, 9 elevations, and 6 lighting conditions (http://www.cs.nyu.edu/ylclab/ data/norbv1.0). 
20NG  2000  18 759  20  A balanced collection of documents from 20 various newsgroups. Each sample is a text document represented by a bag-of-words (BOW) feature vector (http://qwone.com/~jason/20Newsgroups). 
RCV1  30  804 409  103  A strongly imbalanced text corpus known as RCV1. We transformed the original dataset (https://old.datahub.io/dataset/rcv1v2lyrl2004) to 30 dimensions by using PCA. 
Before finding the 2-D embeddings, we transformed 20NG and RCV1 to 30-dimensional space using the PCA transformation. We checked that this procedure does not visibly change the results of the embeddings, but it substantially decreases the computational time needed in the initialization phase for finding the nearest (or near) neighbors. We retained the high dimensionality of MNIST (N = 784) and NORB (N = 2048). We used the cosine distance for NORB, RCV1 and 20NG, and the Euclidean one only for MNIST. The results obtained by ivhd were compared with the results generated by two state-of-the-art methods: t-SNE (its faster bh-SNE version [21]) and its improved and recently published clone LargeVis [26]. We used the serial codes published on GitHub (http://alexanderfabisch.github.io/tsneinscikitlearn.html and https://github.com/lferry007/LargeVis, respectively).
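The PCA preprocessing step can be sketched as follows (scikit-learn is used here for illustration only, and the matrix below is a random stand-in, not the real 20NG BOW data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
bow = rng.random((1000, 2000))                  # stand-in for a BOW matrix
Y30 = PCA(n_components=30).fit_transform(bow)   # reduce to 30-D before kNN search
print(Y30.shape)  # (1000, 30)
```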
Because of the extremely high computational load required for computing precision/recall coefficients, to compare data separability and class purity we define the following leave-one-out coefficient:

cf = (1/M) · Σ_{i=1}^{M} n_i / n_N,   (11)

where n_i is the number of the n_N nearest neighbors of x_i in X which belong to the same class as y_i. The value of cf is computed for an arbitrarily chosen value of n_N, matched to the number of points in the classes (the smallest class consists of about 1000 samples). The value of cf ≈ 1 for well separated and pure classes, while cf ≈ 1/K for randomly scattered points from K classes.
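The coefficient (11) amounts to a leave-one-out kNN class-consistency check; a direct (O(M²), illustration-only) sketch:

```python
import numpy as np

def cf(X, labels, n_N=100):
    """Leave-one-out coefficient (11): average fraction of each point's
    n_N nearest neighbors in the embedding X sharing its class label."""
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # leave-one-out: exclude self
    nbrs = np.argsort(d2, axis=1)[:, :n_N]       # n_N nearest neighbors of each x_i
    same = labels[nbrs] == labels[:, None]       # neighbors with the same class
    return same.mean()
```

For two perfectly separated pure classes this returns 1.0; for labels assigned at random over K balanced classes it approaches 1/K.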
In Figure 3 we demonstrate 2-D embeddings of the MNIST dataset for two sets of {nn, rn, c} parameters. In Figure 3a one can see a typical effect of the 'curse of dimensionality': due to the lack of sufficient contrast between the distances, the classes are mixed up. Consequently, we show in Figure 3b that, by assuming binary distances between data vectors, the configuration of classes becomes very similar to existing visualizations of MNIST (see, e.g., L. van der Maaten's web page https://lvdmaaten.github.io).
In Figure 4 we present a confrontation of our method with the bh-SNE and LargeVis algorithms. Unfortunately, we were not able to generate ten separate MNIST classes using these competitive methods. For both of them, cluster 6 divides into two separate classes (a similar result can be found at http://alexanderfabisch.github.io/tsneinscikitlearn.html). One can also find proper 10-class embeddings on the Internet (see https://lvdmaaten.github.io/tsne), but these are mainly visualizations of the original dataset after PCA transformation to 30 dimensions. It seems that this is a problem of choosing a proper parameter set. Moreover, the stochastic gradient descent method used for minimization of the KL cost function (Eq. (5)) can get stuck in a local minimum, so the minimization procedure should be repeated a few times, starting from various initial configurations.
On the other hand, comparing the quality of the embeddings visually, one can see that ivhd produces the most blurred solution (it also has a distinctly lower cf value; see Figures 3 and 4). However, the coarse-grained embeddings are very similar. For example, the groups of similar clusters {3, 5, 8} and {4, 7, 9} can be observed both in Figure 4b and 4c. Though these groups cannot be found in Figure 4a, many pictures of MNIST embeddings generated by the t-SNE algorithm, which can be found on the Internet, confirm this observation. Anyway, ivhd is unbeatable in terms of computational time. The generation of the MNIST mapping needs from 12 to 40 seconds (the difference depends on the numbers of nearest and random neighbors used and on the parameter values of the minimization procedure), while in the case of LargeVis and bh-SNE a single run, without repetitions for finding a better solution, takes 10 and 20 minutes, respectively. The approximate timings are given without the computational time needed for the initialization of the nearest neighbors (see Table 1). However, the high efficiency of ivhd is obtained at the sacrifice of the quality of reconstruction of the fine-grained cluster structure and the local neighborhood. Similarly to SNE [14], ivhd suffers from the crowding problem: most samples are located in the centers of the clusters in the 2-D embeddings.
The better local precision of the t-SNE and LargeVis visualizations can be clearly seen in Figure 5. Although ivhd better separates the classes, the subclusters representing the same toys in various conditions are blurred. Meanwhile, in Figures 5a and 5b one can discern the streaks of points representing the same toys. By decreasing the random-neighbor force scaling in ivhd, this local structure of NORB also becomes sharper (the force from the nearest neighbors dominates over that from the random neighbors), however, at the cost of an even greater crowding effect. The computational time spent by ivhd on the NORB embedding, 30-60 s, is two times greater than that for MNIST (because N is larger), while using LargeVis and t-SNE it grows to about 15 and 20 minutes, respectively.
For a larger number of overlapping classes and a relatively small number of data points – as in the twenty newsgroups (20NG) dataset – the embeddings obtained by the three algorithms look very messy, with relatively low cf coefficients (see Figure 6). In Figure 6, the ten most visible clusters are marked with circles. As demonstrated on the web page https://lvdmaaten.github.io/tsne, much better cluster separation (but of low class purity) can be obtained by using a different text representation than BOW.
In Figure 7 we compare the ivhd embeddings of the 20NG dataset for BOW and doc2vec [19] representations. For the latter, as shown in Figure 7c, more clusters than those marked in Figure 7a can be visually recognized. This is also reflected in the values of the cf coefficient. The average computational time required for the 20NG embedding is about 5 s, which is more than an order of magnitude lower than the times obtained for the competitive algorithms.
Unlike the previous examples, the RCV1 dataset, consisting of samples from scores of partly overlapping classes, is highly imbalanced. Two separate classes, ECAT (Economics) and C151 (Accounts/Earnings), represent 64% of the whole dataset. This fact is mainly responsible for the high cf value (see Figure 8), despite the rather poor separation of the other categories; a high cf is obtained when taking into account the 2-D embedding of only these two categories. However, by excluding them from the dataset, we obtain a very good embedding (see Figure 8b) of the remaining 6 classes, with a high cf value. We were not able to obtain either t-SNE or LargeVis embeddings in a reasonable time on the laptop with the technical parameters specified at the beginning of this section. Meanwhile, ivhd needs about 7-8 minutes to produce the result shown in Figure 8a, and about 5-6 minutes for that in Figure 8b.
The comparison of the plots in Figure 9, presenting the cf values with increasing n_N for the original N-dimensional datasets (dashed lines) and their 2-D mappings (solid lines), confirms the high quality of ivhd embeddings in terms of class separation and purity. The cf values for the embeddings are more stable with n_N than those for the original N-dimensional datasets. Only the 20NG dataset, assuming all 20 classes, gives a rather poor embedding for the BOW tf-idf representation (as shown in Figure 6, the doc2vec text representation gives better results). This is mainly due to the high overlap between the newsgroup categories. Much better results are obtained considering 20NG embeddings for 10 selected classes (see Figure 9b).
4 Related work
Recently, a large body of work has been devoted to data embedding techniques in the context of visual and exploratory analysis of high-dimensional and large datasets. In general, these techniques can be roughly divided into two groups: the classical DE algorithms (such as MDS, CCA, LLE, CDA, Isomap, etc. [34]) and those based on the concept of stochastic neighbor embedding (SNE) [14]. The SNE idea can be treated as a breakthrough and a milestone concept in the visual analysis of large datasets.
There exist many clones of SNE, in which the authors modify the original concept in various respects.
All of these SNE-based methods produce impressively accurate 2D embeddings (see, e.g., lvdmaaten.github.io/tsne/). SNE-based algorithms allow for offline reconstruction of large sets of high-dimensional data in 2D (3D) spaces with a precision not previously available. However, from the perspective of both big data analytics and interactive visualization of large data, they remain too demanding computationally, due to their high memory load and time complexity. As shown in [15, 26, 28, 20, 22, 7], the efficiency of SNE algorithms can be increased, e.g., by employing more sophisticated data structures and by using approximate nearest neighbor search methods, as is done in the bhSNE and qSNE methods, respectively. However, these tSNE clones still suffer from superlinear time complexity. The efficiency of embedding can be further increased by easier identification of clusters [20] and by decreasing the number of calculated distances, which lowers the time complexity accordingly. As shown in [22], by using a special type of data sampling, called compressive embedding acceleration (CE), relatively few samples are sufficient to diffuse the information to all data points. However, in this situation, the increase of efficiency comes at the expense of a serious deterioration of embedding quality [22].
Exactly such an approach is represented by the LargeVis DE algorithm [28], an approximated version of tSNE [31] which exploits both DE/GV duality and data sampling. First, the LargeVis method constructs the approximate kNN graph for the data prior to its visualization (embedding). According to [28], this approximate procedure should increase the efficiency of graph construction while guaranteeing its high accuracy. However, as shown in our tests [18], the approximate nearest neighbor method developed in [28] is accurate only provided that the lengths of the neighbor lists are at least of the same order as the data dimensionality. Otherwise, the accuracy drops rapidly (e.g., for 100-D data the accuracy drops from 95% to 5%). This means that for truly high-dimensional data the neighbor lists should be very long, which considerably deteriorates the embedding quality and decreases the computational efficiency. This is the main reason why the LargeVis results are very unstable and need very careful parameter tuning. From the point of view of interactive visualization of large and high-dimensional data, the timings reported in [28, 20] are also not encouraging.
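The accuracy discussed above can be quantified as kNN recall: the fraction of true nearest neighbors recovered by an approximate search. The following is a minimal sketch in plain Python; the function names are our own, and the toy "approximate" search here simply drops feature dimensions – it is not the random-projection-tree method of [28], only an illustration of how recall against exact neighbor lists can be measured:

```python
import random

def knn_exact(data, i, k):
    # Exact k nearest neighbors of sample i by brute force (squared Euclidean).
    d = [(sum((a - b) ** 2 for a, b in zip(data[i], data[j])), j)
         for j in range(len(data)) if j != i]
    d.sort()
    return {j for _, j in d[:k]}

def knn_projected(data, i, k, dims):
    # Toy "approximate" search: brute force restricted to a subset of dimensions.
    d = [(sum((data[i][t] - data[j][t]) ** 2 for t in dims), j)
         for j in range(len(data)) if j != i]
    d.sort()
    return {j for _, j in d[:k]}

def recall(data, k, n_dims, seed=0):
    # Fraction of exact k-nearest neighbors recovered by the approximate search.
    rng = random.Random(seed)
    dims = rng.sample(range(len(data[0])), n_dims)
    hits, total = 0, 0
    for i in range(len(data)):
        hits += len(knn_exact(data, i, k) & knn_projected(data, i, k, dims))
        total += k
    return hits / total
```

With `n_dims` equal to the full dimensionality the recall is 1 by construction; shrinking `n_dims` mimics the accuracy drop reported above.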
In our position paper [7], published earlier than [28], we reported preliminary results of a linear-time MDS DE method, in which we postulated using only a few nearest and random neighbors. Similar to LargeVis, to mitigate the 'curse of dimensionality' effect, we assumed that all nearest neighbors of each data sample should be kept close to it, with their distances set to 0, while the distances to a few random neighbors remain the original ones. As shown in [7], this approximation appeared to be extremely fast while giving very precise 2D embeddings. In [8] we demonstrated that this method is very robust and resistant to imprecise calculation of the nearest neighbors. Simultaneously, based on the idea from [8], we developed a graph visualization method, ivga, focused on interactive exploration of complex social networks [9]. This method appeared to be extremely efficient and remarkably accurate [10]. It allows for reconstructing the global and local topology of big complex networks with a few million edges on a regular laptop. We also developed a new graph visualization technique based on so-called B-matrices [5], which, merged with ivga, allows for much deeper exploration of graph properties. In the current paper we integrate our DE and GV visualization approaches to develop a new, computationally very efficient embedding method allowing for interactive visualization of large datasets on a laptop computer.
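The neighbor structure postulated in [7] can be sketched as follows. This is a simplified illustration only: the function name and parameter defaults are our own, and the brute-force kNN search stands in for whatever (possibly approximate) search the actual implementation uses.

```python
import math
import random

def build_neighbor_graph(data, nn=2, rn=1, seed=0):
    """For every sample, its nn nearest neighbors get target distance 0
    (they should be kept close in the embedding), while rn randomly chosen
    samples keep their original high-dimensional distances."""
    rng = random.Random(seed)
    M = len(data)
    edges = []  # list of (i, j, target_distance)
    for i in range(M):
        # Brute-force O(M^2) kNN here, for clarity only.
        ranked = sorted((math.dist(data[i], data[j]), j)
                        for j in range(M) if j != i)
        for _, j in ranked[:nn]:
            edges.append((i, j, 0.0))                          # nearest neighbors
        for j in rng.sample([x for x in range(M) if x != i], rn):
            edges.append((i, j, math.dist(data[i], data[j])))  # random neighbors
    return edges
```

The resulting edge list has only (nn + rn) entries per sample, which is the source of the linear memory footprint claimed for the method.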
5 Conclusions
In this paper we present a method for 2D embedding of large and high-dimensional data with minimal memory and computational time requirements. Its main concept consists in arbitrarily increasing the distance contrast between data vectors in the source dataset, and in 2D visualization of the kNN graph constructed for it. We have shown that substituting the huge and computationally demanding distance matrix with the k-nearest-neighbor graph data structure (with a small number of neighbors), and assuming binary distances between data vectors, allows for radically decreasing the time and memory complexity of the data embedding problem from quadratic to linear in the number of samples, with a small proportionality coefficient. We have also paid attention to the theory of structural rigidity, which could be useful for further theoretical investigations, e.g., for finding minimal rigid and uniquely realizable graphs required for data embedding.
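To illustrate the idea, here is a minimal stress-minimization sketch over such a binary-distance neighbor graph, using plain gradient descent in Python. The paper's actual force-directed, particle-based minimizer is more elaborate; all names and parameter values below are illustrative assumptions.

```python
import math
import random

def embed_2d(M, edges, steps=200, lr=0.05, seed=0):
    """Given binary-distance edges (i, j, D_ij) with D_ij in {0, 1},
    move M random 2D points by gradient descent on the stress
    sum over edges of (|y_i - y_j| - D_ij)^2."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(M)]
    for _ in range(steps):
        for i, j, D in edges:
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            d = math.hypot(dx, dy) + 1e-9          # guard against division by zero
            g = lr * (d - D) / d                   # attract if d > D, repel if d < D
            pos[i][0] -= g * dx; pos[i][1] -= g * dy
            pos[j][0] += g * dx; pos[j][1] += g * dy
    return pos
```

On a toy graph with two 0-distance pairs linked by unit-distance random edges, the pairs collapse into tight clusters placed roughly one unit apart – the binary target distances alone are enough to reconstruct the cluster structure.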
We have demonstrated that ivhd outperforms the state-of-the-art DE algorithms in terms of computational time by more than one order of magnitude on standard DE benchmark data. Data embeddings obtained by ivhd are also very effective in reconstructing data separation in high-dimensional and large datasets. This is the principal requirement for knowledge mining, because precisely the multiscale clusters of data represent the basic 'granules of knowledge'. Due to the simplicity of the ivhd algorithm – it is in fact a clone of the classical MDS algorithm – its efficiency can be further increased by implementing parallel versions in GPU and MIC environments, as we have already done for the classical MDS method (see [23, 24, 25]).
It is worth mentioning that ivhd yields less impressive results than recently published SNE clones in reconstructing the local neighborhood. The 'crowding problem', similar to that in SNE, causes most of the data points in 2D embeddings to gather in the centers of clusters. Nevertheless, accurate reconstruction of the nearest neighbors is rather a secondary requirement in interactive visualization of big data. The exact neighbor lists (or their approximations) are known prior to data embedding, and they can be presented visually anytime, e.g., as the data points' spines in the embedding. Moreover, the exact lists of the nearest neighbors are not reliable in the context of both measurement errors and the 'curse of dimensionality' principle. Besides, if a more accurate reconstruction of data locality were required, ivhd can be used to generate the initial configuration for much slower SNE-based DE algorithms. Unlike its SNE competitors, ivhd has only three free parameters which should be fitted to the data. Because the method is fast, they can easily be matched interactively, although a 'universal' set of values worked very well for most of the data (and networks) visualized by the authors of this paper. Summing up, we highly recommend ivhd as a good candidate for an efficient data embedding engine in interactive visualization of truly big data. The recent Win32 version of our software is available here: https://goo.gl/6HzEd3.
References
 [1] Amid, E., and Warmuth, M. K. Transformation invariant and outlier revealing dimensionality reduction using triplet embedding.
 [2] Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. When is “nearest neighbor” meaningful? In International conference on database theory (1999), Springer, pp. 217–235.
 [3] Borcea, C., and Streinu, I. The number of embeddings of minimally rigid graphs. Discrete & Computational Geometry 31, 2 (2004), 287–303.
 [4] Cohen, R. F., Eades, P., Lin, T., and Ruskey, F. Three-dimensional graph drawing. Algorithmica 17, 2 (1997), 199–208.
 [5] Czech, W., Mielczarek, W., and Dzwinel, W. Distributed computing of distance-based graph invariants for analysis and visualization of complex networks. Concurrency and Computation: Practice and Experience 29, 9 (2017).
 [6] Dzwinel, W. Virtual particles and search for global minimum. Future Generation Computer Systems 12, 5 (1997), 371–389.
 [7] Dzwinel, W., and Wcisło, R. Very fast interactive visualization of large sets of high-dimensional data. Procedia Computer Science 51 (2015), 572–581.
 [8] Dzwinel, W., and Wcisło, R. ivhd: A robust linear-time and memory efficient method for visual exploratory data analysis. In International Conference on Machine Learning and Data Mining in Pattern Recognition (2017), Springer, pp. 345–360.
 [9] Dzwinel, W., Wcisło, R., and Czech, W. ivga: A fast force-directed method for interactive visualization of complex networks. Journal of Computational Science 21 (2017), 448–459.
 [10] Dzwinel, W., Wcisło, R., and Strzoda, M. ivga: Visualization of the network of historical events. In International Conference on Internet of Things and Machine Learning (2017).
 [11] Floyd, R. W. Algorithm 97: shortest path. Communications of the ACM 5, 6 (1962), 345.
 [12] France, S. L., and Carroll, J. D. Two-way multidimensional scaling: A review. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 41, 5 (2011), 644–661.
 [13] Fruchterman, T. M., and Reingold, E. M. Graph drawing by force-directed placement. Software: Practice and Experience 21, 11 (1991), 1129–1164.
 [14] Hinton, G. E., and Roweis, S. T. Stochastic neighbor embedding. In Advances in neural information processing systems (2003), pp. 857–864.
 [15] Ingram, S., and Munzner, T. Dimensionality reduction for documents with nearest neighbor queries. Neurocomputing 150 (2015), 557–569.
 [16] Ingram, S., Munzner, T., and Olano, M. Glimmer: Multilevel MDS on the GPU. IEEE Transactions on Visualization and Computer Graphics 15, 2 (2009), 249–261.
 [17] Keim, D. A., Mansmann, F., and Thomas, J. Visual analytics: how much visualization and how much analytics? ACM SIGKDD Explorations Newsletter 11, 2 (2010), 5–8.
 [18] Kłusek, A., and Dzwinel, W. Multi-GPU k-nearest neighbor search in the context of data embedding.
 [19] Le, Q., and Mikolov, T. Distributed representations of sentences and documents. In International Conference on Machine Learning (2014), pp. 1188–1196.
 [20] Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S., and Kluger, Y. Efficient algorithms for t-distributed stochastic neighborhood embedding. arXiv preprint arXiv:1712.09005 (2017).
 [21] Maaten, L. v. d., and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
 [22] Paratte, J., Perraudin, N., and Vandergheynst, P. Compressive embedding and visualization using graphs. arXiv preprint arXiv:1702.05815 (2017).
 [23] Pawliczek, P., and Dzwinel, W. Interactive data mining by using multidimensional scaling. Procedia Computer Science 18 (2013), 40–49.
 [24] Pawliczek, P., Dzwinel, W., and Yuen, D. A. Visual exploration of data by using multidimensional scaling on multicore CPU, GPU, and MPI cluster. Concurrency and Computation: Practice and Experience 26, 3 (2014), 662–682.
 [25] Pawliczek, P., Dzwinel, W., and Yuen, D. A. Visual exploration of data with multithread MIC computer architectures. In International Conference on Artificial Intelligence and Soft Computing (2015), Springer, pp. 25–35.
 [26] Pezzotti, N., Höllt, T., Lelieveldt, B., Eisemann, E., and Vilanova, A. Hierarchical stochastic neighbor embedding. In Computer Graphics Forum (2016), vol. 35, Wiley Online Library, pp. 21–30.
 [27] Shaw, B., and Jebara, T. Structure preserving embedding. In Proceedings of the 26th Annual International Conference on Machine Learning (2009), ACM, pp. 937–944.
 [28] Tang, J., Liu, J., Zhang, M., and Mei, Q. Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web (2016), International World Wide Web Conferences Steering Committee, pp. 287–297.
 [29] Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 290, 5500 (2000), 2319–2323.
 [30] Thorpe, M. F., and Duxbury, P. M. Rigidity theory and applications. Springer Science & Business Media, 1999.
 [31] Van Der Maaten, L. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15, 1 (2014), 3221–3245.
 [32] Van Der Maaten, L., Postma, E., and Van den Herik, J. Dimensionality reduction: a comparative review. Journal of Machine Learning Research 10 (2009), 66–71.
 [33] Wilkinson, L., Anand, A., and Grossman, R. Graph-theoretic scagnostics.
 [34] Yang, L. Data embedding techniques and applications. In Proceedings of the 2nd International Workshop on Computer Vision Meets Databases (2005), ACM, pp. 29–33.
 [35] Yang, Z., Peltonen, J., and Kaski, S. Optimization equivalence of divergences improves neighbor embedding. In International Conference on Machine Learning (2014), pp. 460–468.