# Compressive Embedding and Visualization using Graphs

Visualizing high-dimensional data has been a focus of the data analysis community for decades, which has led to the design of many algorithms, some of which are now considered references (such as t-SNE, for example). In our era of overwhelming data volumes, the scalability of such methods has become more and more important. In this work, we present a method that allows any visualization or embedding algorithm to be applied to very large datasets by considering only a fraction of the data as input and then extending the information to all data points using a graph encoding their global similarity. We show that in most cases, using only O(log(N)) samples is sufficient to diffuse the information to all N data points. In addition, we propose quantitative methods to measure the quality of embeddings and demonstrate the validity of our technique on both synthetic and real-world datasets.



## Background

##### Graph nomenclature

Let us define $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{W})$ as an undirected weighted graph, where $\mathcal{V}$ is the set of vertices and $\mathcal{E}$ the set of edges representing connections between nodes in $\mathcal{V}$. The vertices of the graph are ordered from $1$ to $N = |\mathcal{V}|$. The matrix $\mathbf{W} \in \mathbb{R}^{N \times N}$, which is symmetric with nonnegative entries, is called the weighted adjacency matrix of the graph $\mathcal{G}$. The entry $\mathbf{W}[i,j]$ represents the weight of the edge between vertices $v_i$ and $v_j$, and a value of $0$ means that the two vertices are not connected. The degree of a node $v_i$ is defined as the sum of the weights of all its edges, $d[i] = \sum_j \mathbf{W}[i,j]$. Finally, a graph signal is defined as a vector of scalar values over the set of vertices, $\mathbf{x} \in \mathbb{R}^N$, where the $i$-th component $\mathbf{x}[i]$ is the value of the signal at vertex $v_i$.

##### Spectral theory

The combinatorial Laplacian operator $\mathbf{L}$ can be defined from the weighted adjacency matrix as $\mathbf{L} = \mathbf{D} - \mathbf{W}$, with $\mathbf{D}$ being the degree matrix, defined as the diagonal matrix with $\mathbf{D}[i,i] = d[i]$. One alternative and often used Laplacian definition is the normalized Laplacian $\mathbf{L}_n = \mathbf{D}^{-1/2}(\mathbf{D} - \mathbf{W})\mathbf{D}^{-1/2}$. Since the weight matrix is symmetric, $\mathbf{L}$ is symmetric positive semi-definite by construction. By application of the spectral theorem, we know that $\mathbf{L}$ can be decomposed into an orthonormal basis of eigenvectors noted $\{\mathbf{u}_\ell\}_{\ell=0,\dots,N-1}$. The ordering of the eigenvectors is given by the eigenvalues noted $\{\lambda_\ell\}_{\ell=0,\dots,N-1}$, sorted in ascending order $0 = \lambda_0 \leq \lambda_1 \leq \dots \leq \lambda_{N-1}$. In matrix form we can write this decomposition as $\mathbf{L} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^*$, with $\mathbf{U}$ the matrix of eigenvectors and $\boldsymbol{\Lambda}$ the diagonal matrix containing the eigenvalues in ascending order. Given a graph signal $\mathbf{x}$, its graph Fourier transform is thus defined as $\hat{\mathbf{x}} = \mathbf{U}^* \mathbf{x}$, and the inverse transform as $\mathbf{x} = \mathbf{U} \hat{\mathbf{x}}$. It is called a Fourier transform by analogy to the continuous Laplacian, whose spectral components are Fourier modes, and the matrix $\mathbf{U}$ is sometimes referred to as the graph Fourier matrix (see e.g. [chung1997spectral]). By the same analogy, the set $\{\lambda_\ell\}_{\ell=0,\dots,N-1}$ is often seen as the set of graph frequencies [shuman2013vertex].

##### Graph filtering

In traditional signal processing, filtering can be carried out by a pointwise multiplication in the Fourier domain. Thus, since the graph Fourier transform is defined, it is natural to consider a filtering operation on the graph as a multiplication in the graph Fourier domain. To this end, we define a graph filter as a continuous function $g$ defined directly in the graph Fourier domain. If we consider the filtering of a signal $\mathbf{x}$, whose graph Fourier transform is written $\hat{\mathbf{x}}$, by a filter $g$, the operation in the spectral domain is a simple multiplication $\hat{\mathbf{x}}'(\ell) = g(\lambda_\ell)\, \hat{\mathbf{x}}(\ell)$, with $\mathbf{x}'$ and $\hat{\mathbf{x}}'$ being the filtered signal and its graph Fourier transform, respectively. Using the graph Fourier matrix to recover the vertex-based signal, we get the explicit matrix formulation for graph filtering:

$$\mathbf{x}' = \mathbf{U}\, g(\boldsymbol{\Lambda})\, \mathbf{U}^* \mathbf{x},$$

where $g(\boldsymbol{\Lambda}) = \operatorname{diag}(g(\lambda_0), \dots, g(\lambda_{N-1}))$. The graph filtering operator $g(\mathbf{L}) := \mathbf{U} g(\boldsymbol{\Lambda}) \mathbf{U}^*$ is often used to reformulate the graph filtering equation as a simple matrix-vector operation $\mathbf{x}' = g(\mathbf{L})\mathbf{x}$. Since the filtering equation defined above involves the full set of eigenvectors $\mathbf{U}$, it implies the diagonalization of the Laplacian, which is costly for large graphs. To circumvent this problem, one can represent the filter as a polynomial approximation, since polynomial filtering only involves the multiplication of the signal by powers of $\mathbf{L}$ up to the order of the polynomial. Filtering using good polynomial approximations can be done using Chebyshev or Lanczos polynomials [hammond2011wavelets, susnjara2015accelerated].
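As a minimal numerical sketch of the two routes above (the path graph, heat kernel, and degree-3 least-squares fit are assumptions for illustration, not choices made in the paper), exact spectral filtering can be compared with a polynomial approximation applied through matrix-vector products only:

```python
import numpy as np

# Toy example (an assumption for illustration): path graph with N vertices.
N = 8
W = np.diag(np.ones(N - 1), 1)
W = W + W.T                                  # symmetric adjacency matrix
L = np.diag(W.sum(axis=1)) - W               # combinatorial Laplacian

# Exact spectral filtering: x' = U g(Lambda) U* x, with a heat kernel.
lam, U = np.linalg.eigh(L)                   # eigenvalues in ascending order
g = lambda l: np.exp(-2.0 * l)               # g(lambda) = exp(-tau * lambda)
x = np.random.default_rng(0).standard_normal(N)
x_filt = U @ (g(lam) * (U.T @ x))            # avoids forming g(L) explicitly

# Polynomial route: fit a degree-3 polynomial to g on [0, lambda_max] and
# evaluate p(L) x with Horner's scheme -- only matrix-vector products.
grid = np.linspace(0, lam.max(), 100)
coefs = np.polyfit(grid, g(grid), deg=3)     # highest power first
x_poly = np.zeros(N)
for c in coefs:
    x_poly = L @ x_poly + c * x              # Horner: p(L) x

print(np.linalg.norm(x_filt - x_poly))       # small approximation error
```

The polynomial route never diagonalizes $\mathbf{L}$, which is what makes it viable for large $N$; dedicated Chebyshev or Lanczos recurrences improve on the plain least-squares fit used in this sketch.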

##### Localization operator

The concept of translation, which is well defined in traditional signal processing, cannot be directly applied to graphs, as they can be irregular. However, inspired by the notion of translation, we can define the localization of a function $g$ defined on the graph spectrum as a convolution with a Kronecker delta $\boldsymbol{\delta}_i$, where $T_i$ is called the localization operator, and $T_i g$ means the localization of $g$ at vertex $v_i$. Going back to the vertex domain, we get:

$$T_i g[n] = \mathcal{F}^{-1}\big(g \cdot \hat{\boldsymbol{\delta}}_i\big)[n] = \sum_{\ell=0}^{N-1} g(\lambda_\ell)\, \mathbf{u}_\ell^*[i]\, \mathbf{u}_\ell[n] = \big(g(\mathbf{L})\big)_{in}.$$

The reason for calling $T_i$ a localization operator comes from the fact that, for smooth functions $g$, $T_i g$ is localized around the vertex $v_i$. The proof of this result and more information on the localization operator can be found in [shuman2016vertex]. The localizations $T_i g$ of a filter are quite naturally called atoms, as the filtering of a signal $\mathbf{x}$ by $g$ can be expressed as $g(\mathbf{L})\mathbf{x}[i] = \langle T_i g, \mathbf{x} \rangle$.
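Concretely, the atom $T_i g$ is the $i$-th column of $g(\mathbf{L})$. A short sketch (the toy path graph and heat kernel are assumptions for illustration) showing the localization around $v_i$:

```python
import numpy as np

# Toy path graph (illustrative assumption) and a smooth heat kernel.
N = 30
W = np.diag(np.ones(N - 1), 1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W
lam, U = np.linalg.eigh(L)

tau = 5.0
gL = U @ np.diag(np.exp(-tau * lam)) @ U.T   # g(L) for g(lambda) = exp(-tau*lambda)

i = 15
Tig = gL[:, i]                               # atom T_i g = g(L) delta_i

# For a smooth kernel, the atom's energy concentrates around vertex i:
print(int(np.argmax(np.abs(Tig))))           # the maximum sits at vertex i itself
```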

We use $\|\mathbf{A}\|$ for the induced norm of a matrix $\mathbf{A}$ and $\|\mathbf{A}\|_F$ for the Frobenius norm. The maximum eigenvalue of a matrix $\mathbf{A}$ is written $\lambda_{\max}(\mathbf{A})$. We reserve the number notation for vectors. For example, we write the Euclidean norm as $\|\mathbf{x}\|_2$ and the uniform (sup) norm as $\|\mathbf{x}\|_\infty$. We abusively use $\|\mathbf{x}\|_0$ to count the number of non-zero elements in a vector. Furthermore, when a univariate function $g$ is applied to a vector $\mathbf{x}$, we mean $g(\mathbf{x}) = [g(\mathbf{x}[1]), \dots, g(\mathbf{x}[N])]^T$. As a result, $\|g(\boldsymbol{\lambda})\|_0$ is the number of eigenvalues $\lambda_\ell$ where $g(\lambda_\ell) \neq 0$, i.e. the rank $k$ of $g(\mathbf{L})$. Given a kernel $g$, we define $\mathbf{U}_k$ as the matrix made of the $k$ columns $\mathbf{u}_\ell$ of $\mathbf{U}$ where $g(\lambda_\ell) \neq 0$. Similarly, we denote by $\boldsymbol{\Lambda}_k$ the diagonal matrix containing the associated eigenvalues. Note that we have

$$g(\mathbf{L}) = \mathbf{U}\, g(\boldsymbol{\Lambda})\, \mathbf{U}^* = \mathbf{U}_k\, g(\boldsymbol{\Lambda}_k)\, \mathbf{U}_k^* = \mathbf{U}_k \mathbf{U}_k^*\, g(\mathbf{L}).$$

## Random sampling on graphs

In this section, we first define graph sampling schemes and then prove related theoretical limits. In particular, it is of interest to understand the number of samples needed in order to diffuse energy to every node by localizing filters on the samples. We will prove that the number of samples needed is directly linked to the rank of the filter.

Let us define a probability distribution over the vertices, represented by a vector $\mathbf{p} \in \mathbb{R}^N$. We use two different sampling schemes. Uniform sampling is given by the probability vector

$$\mathbf{p}[i] = \frac{1}{N},$$

and adapted sampling is given by

$$\mathbf{p}[i] = \frac{\|T_i g\|_2^2}{\|g(\boldsymbol{\lambda})\|_2^2}.$$

Remember that we have $\sum_i \|T_i g\|_2^2 = \|g(\boldsymbol{\lambda})\|_2^2$, implying that $\sum_i \mathbf{p}[i] = 1$. Let us associate the matrix

$$\mathbf{P} := \operatorname{diag}(\mathbf{p}) \in \mathbb{R}^{N \times N}$$

to $\mathbf{p}$. Then, we draw independently (with replacement) $M$ indices $\Omega = \{\omega_1, \dots, \omega_M\}$ from the set $\{1, \dots, N\}$ according to the probability distribution $\mathbf{p}$. For any signal $\mathbf{x}$ defined on the vertices of the graph, its sampled version $\mathbf{y} \in \mathbb{R}^M$ satisfies

$$\mathbf{y}[j] := \mathbf{x}[\omega_j], \quad \forall j \in \{1, \dots, M\}.$$

Finally, the downsampling matrix $\mathbf{M} \in \mathbb{R}^{M \times N}$ is defined as $\mathbf{M}[j,i] := \delta_{\omega_j i}$ for all $j \in \{1, \dots, M\}$ and $i \in \{1, \dots, N\}$. Note that $\mathbf{y} = \mathbf{M}\mathbf{x}$.
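The adapted sampling scheme can be sketched numerically as follows (the path graph and heat kernel are again illustrative assumptions, not the paper's data):

```python
import numpy as np

# Toy path graph (illustrative assumption).
rng = np.random.default_rng(1)
N = 50
W = np.diag(np.ones(N - 1), 1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W
lam, U = np.linalg.eigh(L)

# Adapted sampling: p[i] = ||T_i g||_2^2 / ||g(lambda)||_2^2.
g_lam = np.exp(-lam)                          # heat kernel on the spectrum
gL = U @ np.diag(g_lam) @ U.T                 # columns are the atoms T_i g
p = np.sum(gL**2, axis=0) / np.sum(g_lam**2)
p = p / p.sum()                               # guard against round-off

# Draw M indices with replacement and build the downsampling matrix.
M = 12
omega = rng.choice(N, size=M, replace=True, p=p)
Msel = np.zeros((M, N))
Msel[np.arange(M), omega] = 1.0

x = rng.standard_normal(N)
y = Msel @ x                                  # y[j] = x[omega_j]
print(np.allclose(y, x[omega]))               # True
```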

### Embedding Theorems

The first theorem shows that, given enough samples, the random projection conserves the energy contained in $g(\mathbf{L})\mathbf{x}$. In this sense, given enough samples, it is an embedding of $g(\mathbf{L})\mathbf{x}$. Given a graph $\mathcal{G}$ and a kernel $g$ with a given rank $k$, given $\delta, \epsilon \in (0,1)$ and using the sampling scheme defined in the previous section, if

$$M \geq \frac{2}{\delta^2} \frac{\|g(\boldsymbol{\lambda})\|_2^2}{\|g(\boldsymbol{\lambda})\|_\infty^2} \left(1 + \frac{\delta}{3}\right) \log\left(\frac{2k}{\epsilon}\right),$$

we have, with probability at least $1 - \epsilon$, for all $\mathbf{x}$:

$$\left| \frac{\frac{1}{M}\big\|\mathbf{M}\mathbf{P}^{-1/2}\, g(\mathbf{L})\mathbf{x}\big\|_2^2 - \big\|g(\mathbf{L})\mathbf{x}\big\|_2^2}{\|g(\boldsymbol{\lambda})\|_\infty^2} \right| \leq \delta \|\mathbf{U}_k^* \mathbf{x}\|_2^2 \leq \delta \|\mathbf{x}\|_2^2.$$

Note that the above expression is normalized by $\|g(\boldsymbol{\lambda})\|_\infty^2$ in order to remove the scaling factor of the kernel $g$. Let us now analyze the most important term of the bound:

$$\frac{\|g(\boldsymbol{\lambda})\|_2^2}{\|g(\boldsymbol{\lambda})\|_\infty^2} = \frac{\sum_\ell g^2(\lambda_\ell)}{\max_\ell g^2(\lambda_\ell)}.$$

It is a measure of the concentration of the kernel $g$ on its support. It is maximized, with value $k$, when $g$ is a rectangle. In general, it will be small for concentrated kernels. For example, a rapidly decreasing kernel such as the heat kernel ($g(\lambda) = e^{-\tau\lambda}$) will lead to a very small ratio. Note that, contrarily to almost all bounds available in the literature, this bound does not require the kernel to be low rank, but only concentrated. For comparison, [puy2016random, Corollary 2.3] requires

$$M \geq \frac{3}{\delta^2}\, k \log\left(\frac{2k}{\epsilon}\right).$$
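A quick numerical illustration of this concentration ratio, comparing a rank-$k$ rectangular kernel with a rapidly decaying heat kernel (the discretized spectrum is an assumption of the sketch):

```python
import numpy as np

# Illustrative spectrum of 200 "eigenvalues" on [0, 2] (an assumption).
lam = np.linspace(0, 2, 200)

def concentration(g_vals):
    """||g(lambda)||_2^2 / ||g(lambda)||_inf^2 = sum g^2 / max g^2."""
    return np.sum(g_vals**2) / np.max(g_vals**2)

k = 50
rect = (np.arange(lam.size) < k).astype(float)   # rank-k rectangular kernel
heat = np.exp(-10.0 * lam)                       # rapidly decaying heat kernel

print(concentration(rect))   # k = 50.0, the maximum for a rank-k kernel
print(concentration(heat))   # much smaller than the full rank of 200
```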
##### Optimality of the sampling scheme.

Although we have no formal proof of optimality, the sampling scheme presented in the previous section is a good candidate. Indeed, when reading the proof of the embedding theorem, the reader may notice that it minimizes the number of samples $M$. Building on top of the embedding theorem, we establish a lower bound on the number of samples required to capture enough information from each node with a given confidence level. It will ensure that the information diffused from the samples can reach all nodes. Using the sampling scheme described in the previous section, for $\delta, \epsilon \in (0,1)$, a graph $\mathcal{G}$ and a kernel $g$, each node $i$ is guaranteed with probability $1 - \epsilon$ to have

$$\frac{1}{M} \frac{\big\|\mathbf{M}\mathbf{P}^{-1/2}\, T_i g\big\|_2^2}{\|T_i g\|_2^2} \geq 1 - \delta,$$

given that the number of samples satisfies

$$M \geq \frac{2a}{\delta^2} \left(1 + \frac{\delta}{3}\right) \log\left(\frac{k}{\epsilon}\right),$$

where $a$ is a factor depending on the kernel and the graph, made explicit below. This theorem warrants that, given enough samples $M$, the algorithm captures, with probability close to $1 - \epsilon$, at least a good percentage of the energy at node $i$. The factor $a$ is always greater than $1$ and varies depending on the shape of the kernel and of the graph eigenvectors; however, it is exactly equal to $k$ if $g$ is a rectangular kernel. Indeed, a simple transformation shows that

$$a = \frac{\|g(\boldsymbol{\lambda})\|_2^2}{\|g(\boldsymbol{\lambda})\|_\infty^2} \left(\frac{\|g(\boldsymbol{\lambda})\|_\infty^2\, \|\mathbf{U}_k^* \boldsymbol{\delta}_i\|_2^2}{\|T_i g\|_2^2}\right)^2 = \frac{\sum_\ell g^2(\lambda_\ell)}{\max_\ell g^2(\lambda_\ell)} \left( \frac{\max_\ell g^2(\lambda_\ell)\, \sum_{\ell \in \mathcal{K}} \mathbf{u}_\ell^2[i]}{\sum_\ell g^2(\lambda_\ell)\, \mathbf{u}_\ell^2[i]} \right)^2.$$

The first term is smaller than $k$ but is usually close to $k$ for a kernel close to a rectangle. The second term is greater than $1$ but close to $1$ given that the kernel is close to a rectangle. Problematically, this bound becomes loose if the kernel has a large rank $k$, since the first term grows with the rank. To cope with this problem, we can use another kernel $g'$ that is a low-rank approximation of $g$. Given a graph $\mathcal{G}$, let $g'$ (with rank $k' < k$) be the rank-$k'$ approximation of the kernel $g$. Using the sampling scheme described in the previous section with the kernel $g'$, for $\delta, \epsilon \in (0,1)$, each node $i$ is assured with probability $1 - \epsilon$ to have

$$\frac{1}{M} \frac{\big\|\mathbf{M}\mathbf{P}^{-1/2}\, T_i g\big\|_2^2}{\|T_i g\|_2^2} \geq 1 - \delta - \frac{\big\|T_i(|g'| - |g|)\big\|_2^2}{\|T_i g\|_2^2},$$

provided the number of samples satisfies

$$M \geq \frac{2}{\delta^2} \frac{\|g'(\boldsymbol{\lambda})\|_2^2}{\|g'(\boldsymbol{\lambda})\|_\infty^2} \left(\frac{\|g'(\boldsymbol{\lambda})\|_\infty^2\, \|\mathbf{U}_{k'}^* \boldsymbol{\delta}_i\|_2^2}{\|T_i g'\|_2^2}\right)^2 \left(1 + \frac{\delta}{3}\right) \log\left(\frac{k'}{\epsilon}\right).$$

Using this result, the number of samples required can be greatly reduced. Indeed, when the kernel $g$ is well concentrated but not low rank, we trade some approximation error, encoded by the term $\|T_i(|g'| - |g|)\|_2^2$ (which will be low if $g$ is concentrated), for a smaller number of samples, due to the fact that $g'$ is low rank. This theorem can be interesting for a heat kernel, for example.

## Metrics based on localized filters

Before moving on to the diffusion of information from the samples, we need to take a closer look at localized filters and, in particular, see how they can be used to measure distances or correlations between nodes.

### Localized Kernel Distance

Since localized filters are proven to be concentrated in the vertex domain (see [shuman2013vertex, Theorem 1]), it seems natural to use them to obtain geodesic measures or correlations between nodes. To this end, we introduce the Localized Kernel Distance (LKD), which is defined as:

$$\operatorname{LKD}(i,j) = 1 - \frac{T_i g^2[j]}{\|T_i g\|\, \|T_j g\|}.$$

Let us now examine its properties by stating the following theorem: the space $(\mathcal{V}, \operatorname{LKD})$, with $\mathcal{V}$ the vertex set of a graph and $\operatorname{LKD}$ as defined above, is a pseudosemimetric space; that is, for every $x, y \in \mathcal{V}$:

1. $\operatorname{LKD}(x,y) \geq 0$;
2. $\operatorname{LKD}(x,x) = 0$;
3. $\operatorname{LKD}(x,y) = \operatorname{LKD}(y,x)$.

First, let us derive an alternative form of the definition:

$$\operatorname{LKD}(x,y) = 1 - \frac{\langle T_x g, T_y g \rangle}{\|T_x g\|\, \|T_y g\|}.$$

This can be derived as follows:

$$\begin{aligned}
\operatorname{LKD}(x,y) &= 1 - \frac{T_x g^2[y]}{\|T_x g\|\, \|T_y g\|} \\
&= 1 - \frac{\sum_\ell g(\lambda_\ell)^2\, \mathbf{u}_\ell^*[x]\, \mathbf{u}_\ell[y]}{\|T_x g\|\, \|T_y g\|} \\
&= 1 - \frac{\sum_\ell \big(g(\lambda_\ell)\, \mathbf{u}_\ell^*[x]\big)\big(g(\lambda_\ell)\, \mathbf{u}_\ell[y]\big)}{\|T_x g\|\, \|T_y g\|} \\
&= 1 - \frac{\sum_\ell \big(g(\lambda_\ell)\, \mathbf{u}_\ell^*[x]\big)\big(g(\lambda_\ell)\, \mathbf{u}_\ell^*[y]\big) \sum_n \mathbf{u}_\ell^2[n]}{\|T_x g\|\, \|T_y g\|} \\
&= 1 - \frac{\sum_n \sum_\ell \big(g(\lambda_\ell)\, \mathbf{u}_\ell^*[x]\, \mathbf{u}_\ell[n]\big)\big(g(\lambda_\ell)\, \mathbf{u}_\ell^*[y]\, \mathbf{u}_\ell[n]\big)}{\|T_x g\|\, \|T_y g\|} \\
&= 1 - \frac{\langle T_x g, T_y g \rangle}{\|T_x g\|\, \|T_y g\|}.
\end{aligned}$$

Now let us verify the properties one by one:

1. Using the alternative form above, we have:

$$\operatorname{LKD}(x,y) = 1 - \frac{\langle T_x g, T_y g \rangle}{\|T_x g\|\, \|T_y g\|} \geq 0,$$

where the inequality stands because $\langle T_x g, T_y g \rangle \leq \|T_x g\|\, \|T_y g\|$ (Cauchy-Schwarz inequality).

2. Let us verify that $\operatorname{LKD}(x,x) = 0$:

$$\begin{aligned}
\operatorname{LKD}(x,x) &= 1 - \frac{T_x g^2[x]}{\|T_x g\|\, \|T_x g\|} \\
&= 1 - \frac{\sum_\ell g(\lambda_\ell)^2\, \mathbf{u}_\ell^*[x]\, \mathbf{u}_\ell[x]}{\|T_x g\|^2} \\
&= 1 - \frac{\sum_\ell \big(g(\lambda_\ell)\, \mathbf{u}_\ell[x]\big)^2 \sum_n \mathbf{u}_\ell^2[n]}{\|T_x g\|^2} \\
&= 1 - \frac{\sum_n \sum_\ell \big(g(\lambda_\ell)\, \mathbf{u}_\ell[x]\, \mathbf{u}_\ell[n]\big)^2}{\|T_x g\|^2} \\
&= 1 - \frac{\|T_x g\|^2}{\|T_x g\|^2} = 0.
\end{aligned}$$

3. Finally, we have

$$\begin{aligned}
\operatorname{LKD}(x,y) &= 1 - \frac{T_x g^2[y]}{\|T_x g\|\, \|T_y g\|} \\
&= 1 - \frac{\sum_\ell g(\lambda_\ell)^2\, \mathbf{u}_\ell^*[x]\, \mathbf{u}_\ell[y]}{\|T_x g\|\, \|T_y g\|} \\
&= 1 - \frac{\sum_\ell g(\lambda_\ell)^2\, \mathbf{u}_\ell^*[y]\, \mathbf{u}_\ell[x]}{\|T_x g\|\, \|T_y g\|} \\
&= 1 - \frac{T_y g^2[x]}{\|T_x g\|\, \|T_y g\|} = \operatorname{LKD}(y,x).
\end{aligned}$$

The space $(\mathcal{V}, \operatorname{LKD})$, with $\mathcal{V}$ the vertex set of a graph and $\operatorname{LKD}$ as defined above, with $g$ constant, is a semimetric space; that is, for every $x, y \in \mathcal{V}$, in addition to the properties above, $\operatorname{LKD}(x,y) = 0 \iff x = y$.

Properties 1 and 3, as well as the backward implication ($x = y \Rightarrow \operatorname{LKD}(x,y) = 0$), are still valid as stated in the previous theorem. Now let us check that $\operatorname{LKD}(x,y) = 0 \Rightarrow x = y$. We proceed by contradiction and thus search for $x \neq y$ for which $\operatorname{LKD}(x,y) = 0$, implying:

$$\langle T_x g, T_y g \rangle = \|T_x g\|\, \|T_y g\|.$$

We can rewrite this equality as:

$$\sum_\ell g(\lambda_\ell)^2\, \mathbf{u}_\ell^*[x]\, \mathbf{u}_\ell[y] = \sqrt{\sum_\ell g(\lambda_\ell)^2\, \mathbf{u}_\ell^2[x]}\, \sqrt{\sum_\ell g(\lambda_\ell)^2\, \mathbf{u}_\ell^2[y]}.$$

For $g(\lambda) = c$, with $c \neq 0$ a constant, the left-hand side is:

$$\sum_\ell g(\lambda_\ell)^2\, \mathbf{u}_\ell^*[x]\, \mathbf{u}_\ell[y] = c^2 \sum_\ell \mathbf{u}_\ell^*[x]\, \mathbf{u}_\ell[y] = 0.$$

The last equality comes from the fact that two rows of an orthonormal matrix are orthogonal, and $x \neq y$. Now the right-hand side is:

$$\sqrt{\sum_\ell g(\lambda_\ell)^2\, \mathbf{u}_\ell^2[x]}\, \sqrt{\sum_\ell g(\lambda_\ell)^2\, \mathbf{u}_\ell^2[y]} = c^2 \sqrt{\sum_\ell \mathbf{u}_\ell^2[x]}\, \sqrt{\sum_\ell \mathbf{u}_\ell^2[y]} = c^2,$$

with the last equality coming from the fact that $\mathbf{U}$ is an orthonormal basis. Now, since $c^2 \neq 0$, we have a contradiction, and thus the proof is completed.
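The three LKD properties can be checked numerically; in this sketch, the random graph and heat kernel are arbitrary assumptions:

```python
import numpy as np

# Small random graph and heat kernel, chosen arbitrarily for this sketch.
rng = np.random.default_rng(2)
N = 12
W = np.triu((rng.random((N, N)) < 0.3).astype(float), 1)
W = W + W.T                                   # symmetric, zero diagonal
L = np.diag(W.sum(axis=1)) - W
lam, U = np.linalg.eigh(L)

g2 = np.exp(-2.0 * lam) ** 2                  # g(lambda)^2 for a heat kernel
G2 = U @ np.diag(g2) @ U.T                    # G2[i, j] = T_i g^2 [j]
norms = np.sqrt(np.diag(G2))                  # ||T_i g||^2 = T_i g^2 [i]
LKD = 1.0 - G2 / np.outer(norms, norms)

print(np.allclose(np.diag(LKD), 0))           # LKD(x, x) = 0  -> True
print(np.allclose(LKD, LKD.T))                # symmetry       -> True
print(bool((LKD >= -1e-12).all()))            # nonnegativity  -> True
```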

### Kernelized Diffusion Distance

Another approach to using localized atoms to define distances is to measure the norm of the difference between a filter localized at two different nodes. We call this the Kernelized Diffusion Distance (KDD) and define it as:

$$\operatorname{KDD}(i,j) = \|T_i g - T_j g\|,$$

where $g$ is a kernel defined in the graph spectral domain. Before going further, and as it will be useful later, let us derive an equivalent definition of the KDD:

$$\operatorname{KDD}(i,j) = \sqrt{\sum_\ell g(\lambda_\ell)^2 \big(\mathbf{u}_\ell^*[i] - \mathbf{u}_\ell^*[j]\big)^2}.$$

This alternative definition can be quickly derived as follows:

$$\begin{aligned}
\operatorname{KDD}(i,j)^2 &= \|T_i g - T_j g\|^2 \\
&= \sum_n \Big( \sum_\ell g(\lambda_\ell)\, \mathbf{u}_\ell^*[i]\, \mathbf{u}_\ell[n] - \sum_\ell g(\lambda_\ell)\, \mathbf{u}_\ell^*[j]\, \mathbf{u}_\ell[n] \Big)^2 \\
&= \sum_n \Big( \sum_\ell g(\lambda_\ell) \big(\mathbf{u}_\ell^*[i] - \mathbf{u}_\ell^*[j]\big)\, \mathbf{u}_\ell[n] \Big)^2 \\
&= \sum_n \sum_\ell g(\lambda_\ell)^2 \big(\mathbf{u}_\ell^*[i] - \mathbf{u}_\ell^*[j]\big)^2\, \mathbf{u}_\ell^2[n] \\
&= \sum_\ell g(\lambda_\ell)^2 \big(\mathbf{u}_\ell^*[i] - \mathbf{u}_\ell^*[j]\big)^2 \sum_n \mathbf{u}_\ell^2[n] \\
&= \sum_\ell g(\lambda_\ell)^2 \big(\mathbf{u}_\ell^*[i] - \mathbf{u}_\ell^*[j]\big)^2,
\end{aligned}$$

which implies the equivalent definition by taking the square root on both sides. Let us now examine the properties of the KDD by stating the following theorem: the space $(\mathcal{V}, \operatorname{KDD})$, with $\mathcal{V}$ the vertex set of a graph and $\operatorname{KDD}$ as defined above, is a pseudometric space; that is, for every $x, y, z \in \mathcal{V}$:

1. $\operatorname{KDD}(x,y) \geq 0$;
2. $\operatorname{KDD}(x,y) = \operatorname{KDD}(y,x)$;
3. $\operatorname{KDD}(x,z) \leq \operatorname{KDD}(x,y) + \operatorname{KDD}(y,z)$.

Let us verify the properties in order:

1. This property holds trivially due to the positivity of the norm $\|\cdot\|$.

2. We have

$$\begin{aligned}
\operatorname{KDD}(x,y) &= \|T_x g - T_y g\| \\
&= \sqrt{\sum_\ell g(\lambda_\ell)^2 \big(\mathbf{u}_\ell^*[x] - \mathbf{u}_\ell^*[y]\big)^2} \\
&= \sqrt{\sum_\ell g(\lambda_\ell)^2 \big(\mathbf{u}_\ell^*[y] - \mathbf{u}_\ell^*[x]\big)^2} \\
&= \|T_y g - T_x g\| = \operatorname{KDD}(y,x).
\end{aligned}$$

3. We have

$$\begin{aligned}
\operatorname{KDD}(x,z) &= \|T_x g - T_z g\| \\
&= \|T_x g - T_y g + T_y g - T_z g\| \\
&\leq \|T_x g - T_y g\| + \|T_y g - T_z g\| \\
&= \operatorname{KDD}(x,y) + \operatorname{KDD}(y,z),
\end{aligned}$$

which holds using the triangle inequality for vectors.

Now that we have proved that the KDD is a pseudometric, we only need the identity of the indiscernibles to prove that it is a metric. However, we can only do so using an additional hypothesis on $g$. This is formulated in the following theorem: the space $(\mathcal{V}, \operatorname{KDD})$, with $\mathcal{V}$ the vertex set of a graph and $\operatorname{KDD}$ as defined above, with $g(\mathbf{L})$ being full rank, is a metric space; that is, in addition to Properties 1-3, for every $x, y \in \mathcal{V}$, $\operatorname{KDD}(x,y) = 0 \iff x = y$. Properties 1-3 are still valid as stated in the previous theorem. Now let us check the remaining property.

- We first prove the backward implication ($x = y \Rightarrow \operatorname{KDD}(x,y) = 0$):

$$\operatorname{KDD}(x,x) = \|T_x g - T_x g\| = \sqrt{\sum_\ell g(\lambda_\ell)^2 \big(\mathbf{u}_\ell^*[x] - \mathbf{u}_\ell^*[x]\big)^2} = 0.$$

- Now let us check that $\operatorname{KDD}(x,y) = 0 \Rightarrow x = y$. We proceed by contradiction and thus look for a pair $x \neq y$ for which $\operatorname{KDD}(x,y) = 0$. In particular, we need:

$$\operatorname{KDD}(x,y) = \sqrt{\sum_\ell g(\lambda_\ell)^2 \big(\mathbf{u}_\ell^*[x] - \mathbf{u}_\ell^*[y]\big)^2} = 0,$$

with $x \neq y$. Since $g(\mathbf{L})$ is full rank, we have $g(\lambda_\ell) \neq 0$ for all $\ell$, and thus the only way for the equality above to hold is if $\mathbf{u}_\ell[x] = \mathbf{u}_\ell[y]$ for all $\ell$. In other words, it would imply that rows $x$ and $y$ of $\mathbf{U}$ are identical. Since $\mathbf{U}$ is an orthonormal basis, all its rows are orthonormal, which means there exists no pair $x \neq y$ for which the equality holds; thus the contradiction is established, which concludes the proof.

##### Diffusion distance

As hinted by its name, the distance defined above happens to be a generalized diffusion distance. Indeed, taking its spectral formulation, we have:

$$d_g(i,j) = \sqrt{\sum_\ell g(\lambda_\ell)^2 \big(\mathbf{u}_\ell^*[i] - \mathbf{u}_\ell^*[j]\big)^2} = D_t(i,j),$$

where $D_t$ is the diffusion distance associated with specific kernels depending on $t$ (i.e., the diffusion parameter). If we take two common definitions of the diffusion distance, the original works of [nadler2005diffusion] and [coifman2006diffusion] use a kernel built from powers of the spectrum of the diffusion operator, and the Graph Diffusion Distance defined in [hammond2013graph] uses the heat kernel $g(\lambda) = e^{-t\lambda}$.
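A small numerical check that the vertex-domain and spectral forms of the KDD agree, and that the triangle inequality holds (the random graph and heat kernel are assumptions of this sketch):

```python
import numpy as np

# Random graph and heat kernel, chosen arbitrarily for this sketch.
rng = np.random.default_rng(3)
N = 10
W = np.triu((rng.random((N, N)) < 0.4).astype(float), 1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W
lam, U = np.linalg.eigh(L)

g_lam = np.exp(-1.0 * lam)                    # heat kernel -> diffusion distance
gL = U @ np.diag(g_lam) @ U.T                 # columns are the atoms T_i g

def kdd_vertex(i, j):
    """KDD(i, j) = ||T_i g - T_j g|| in the vertex domain."""
    return np.linalg.norm(gL[:, i] - gL[:, j])

def kdd_spectral(i, j):
    """Equivalent spectral form of the KDD."""
    return np.sqrt(np.sum(g_lam**2 * (U[i] - U[j]) ** 2))

print(np.isclose(kdd_vertex(2, 7), kdd_spectral(2, 7)))          # True
x, y, z = 0, 4, 8
print(kdd_vertex(x, z) <= kdd_vertex(x, y) + kdd_vertex(y, z))   # True
```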

## Graph transductive learning

In this section, we cast the problem of diffusing the information obtained on a few samples of the data (e.g. using sampling schemes such as the ones defined above) in a transductive inference framework. In this setting, we observe a label field or signal $\mathbf{x}$ only at a subset of vertices $\Omega = \{\omega_1, \dots, \omega_M\}$, i.e. $\mathbf{y}[j] = \mathbf{x}[\omega_j]$, $j \in \{1, \dots, M\}$, with $\mathbf{y}$ being the observed signal, also called the label function. The goal of transductive learning is to predict the missing signal/labels using both the observed signal and the remaining data points.

### Global graph diffusion

Transductive inference using graphs can be solved in a number of ways, for example using Tikhonov regression:

$$\operatorname*{arg\,min}_{\mathbf{x}} \|\mathbf{y} - \mathbf{M}\mathbf{x}\|_2^2 + \mu\, \mathbf{x}^T \mathbf{L} \mathbf{x},$$

where $\mathbf{M}$ is the sampling operator and $\mathbf{L}$ the graph Laplacian. An alternative to the use of the Dirichlet smoothness constraint is to use the graph Total Variation (TV). The regression would thus become:

$$\operatorname*{arg\,min}_{\mathbf{x}} \|\mathbf{y} - \mathbf{M}\mathbf{x}\|_2^2 + \mu \|\nabla_{\mathcal{G}}\, \mathbf{x}\|_1,$$

with $\nabla_{\mathcal{G}}$ the graph gradient operator. For large-scale learning, solving the optimization problems as described above can be too expensive, and one typically uses accelerated descent methods.
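A minimal sketch of the Tikhonov route (the path graph, ramp signal, and evenly spaced samples are assumptions for illustration): the optimality condition of the quadratic objective reduces to a single linear system.

```python
import numpy as np

# Toy setting (assumed): a smooth ramp signal on a path graph, observed
# at M evenly spaced vertices, recovered by Tikhonov regression.
N, M, mu = 40, 8, 0.5
W = np.diag(np.ones(N - 1), 1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W

x_true = np.linspace(0, 1, N)                     # smooth graph signal
omega = np.round(np.linspace(0, N - 1, M)).astype(int)
Msel = np.zeros((M, N))
Msel[np.arange(M), omega] = 1.0                   # sampling operator
y = Msel @ x_true                                 # observed labels

# argmin_x ||y - M x||^2 + mu x^T L x  <=>  (M^T M + mu L) x = M^T y
x_hat = np.linalg.solve(Msel.T @ Msel + mu * L, Msel.T @ y)
print(float(np.abs(x_hat - x_true).max()))        # small recovery error
```

Note that $\mathbf{M}^T\mathbf{M} + \mu\mathbf{L}$ is positive definite on a connected graph as soon as at least one vertex is observed, so the system is well posed.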

### RKHS transductive learning on graphs

#### Motivation

Our first contribution is to replace the smoothness term arising in the Tikhonov formulation by constraining the solution to belong to the finite-dimensional Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ corresponding to the graph kernel $g(\mathbf{L})$, for some filter $g$. In this case, we instead solve the following problem:

$$\operatorname*{arg\,min}_{\mathbf{x} \in \mathcal{H}} \|\mathbf{y} - \mathbf{M}\mathbf{x}\|_2^2,$$

and show that the solution is given by a simple low-pass filtering step applied to the labelled examples.

#### Transductive learning and graph filters

In this section, we formulate transductive learning as a finite dimensional regression problem. This problem is solved by constructing a reproducing kernel Hilbert space from a graph filter, which controls the smoothness of the solution and provides a fast algorithm to compute it.

##### An empirical reproducing kernel Hilbert space

Let $g$ be a smooth, strictly positive function defining a graph filter as defined earlier. The graph filter defines the following matrix:

$$\mathbf{G}[i,j] = g(\mathbf{L})[i,j] = T_i g[j],$$

where $T_i$ is the localization operator at vertex $v_i$. Since the filter is strictly positive, $\mathbf{G}$ is positive definite and can be written as the Gram matrix of a set of linearly independent vectors. To see this, we use the spectral representation:

$$\mathbf{G} = \mathbf{U}\, g(\boldsymbol{\Lambda})\, \mathbf{U}^* = \big(\mathbf{U}\, g(\boldsymbol{\Lambda})^{1/2}\, \mathbf{U}^*\big)\big(\mathbf{U}\, g(\boldsymbol{\Lambda})^{1/2}\, \mathbf{U}^*\big)^*.$$

Let $\mathbf{r}_i$ be the $i$-th row of $\mathbf{U}\, g(\boldsymbol{\Lambda})^{1/2}\, \mathbf{U}^*$; we immediately see that $\mathbf{G}[i,j] = \langle \mathbf{r}_i, \mathbf{r}_j \rangle$. More explicitly, these vectors are written in terms of the graph filter:

$$\mathbf{r}_i[j] = \sum_\ell \sqrt{g(\lambda_\ell)}\, \mathbf{u}_\ell^*[i]\, \mathbf{u}_\ell[j].$$

These expressions suggest defining the Hilbert space $\mathcal{H}$ as the closure of all linear combinations of localized graph filters $T_k g$. This space is therefore composed of functions of the form:

$$\mathbf{x} = \sum_{k \in \mathcal{V}} \alpha_k\, T_k g.$$

Note that any $\mathbf{x} \in \mathcal{H}$ has a well-defined graph Fourier transform:

$$\hat{\mathbf{x}}(\ell) = g(\lambda_\ell) \sum_{k \in \mathcal{V}} \alpha_k\, \mathbf{u}_\ell[k].$$

This allows us to equip $\mathcal{H}$ with the following scalar product:

$$\langle \mathbf{x}, \mathbf{y} \rangle_{\mathcal{H}} = \sum_\ell \frac{1}{g(\lambda_\ell)}\, \hat{\mathbf{x}}(\ell)^*\, \hat{\mathbf{y}}(\ell),$$

and the vectors $\mathbf{r}_i$ form an orthonormal basis of $\mathcal{H}$:

$$\langle \mathbf{r}_i, \mathbf{r}_j \rangle_{\mathcal{H}} = \sum_\ell \frac{1}{g(\lambda_\ell)} \sqrt{g(\lambda_\ell)}\, \mathbf{u}_\ell[i]^* \sqrt{g(\lambda_\ell)}\, \mathbf{u}_\ell[j] = \sum_\ell \mathbf{u}_\ell[i]^*\, \mathbf{u}_\ell[j] = \delta_{i,j}.$$

Let us now see that $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS). We show that the scalar product with $T_i g$ in $\mathcal{H}$ is the evaluation functional at vertex $v_i$. We first compute:

$$\langle T_i g, T_j g \rangle_{\mathcal{H}} = \sum_\ell \frac{1}{g(\lambda_\ell)}\, g(\lambda_\ell)^2\, \mathbf{u}_\ell[i]^*\, \mathbf{u}_\ell[j] = T_i g[j].$$

By linearity of the scalar product and the definition of $\mathcal{H}$, we have:

$$\langle T_i g, \mathbf{x} \rangle_{\mathcal{H}} = \sum_{k \in \mathcal{V}} \alpha_k \langle T_i g, T_k g \rangle_{\mathcal{H}} = \sum_{k \in \mathcal{V}} \alpha_k\, T_k g[i] = \mathbf{x}[i].$$

Finally, for any $\mathbf{x} \in \mathcal{H}$, $\mathbf{x} = \sum_{i \in \mathcal{V}} \beta_i\, T_i g$, we have the following explicit form of its norm:

$$\begin{aligned}
\|\mathbf{x}\|_{\mathcal{H}}^2 = \langle \mathbf{x}, \mathbf{x} \rangle_{\mathcal{H}} &= \sum_\ell \frac{1}{g(\lambda_\ell)}\, g(\lambda_\ell)^2 \sum_{i,j \in \mathcal{V}} \beta_i \beta_j^*\, \mathbf{u}_\ell[i]\, \mathbf{u}_\ell[j]^* \\
&= \sum_{i,j \in \mathcal{V}} \beta_i \beta_j^* \Big( \sum_\ell g(\lambda_\ell)\, \mathbf{u}_\ell[i]\, \mathbf{u}_\ell[j]^* \Big) \\
&= \sum_{i,j \in \mathcal{V}} \beta_i\, \mathbf{G}[i,j]\, \beta_j^* = \boldsymbol{\beta}^T \mathbf{G} \boldsymbol{\beta}.
\end{aligned}$$
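A numerical sanity check of the construction (the random graph and the filter $g(\lambda) = 1/(1+\lambda)$ are assumptions of this sketch): $\mathbf{G}$ is positive definite, and the $\mathcal{H}$-inner product with $T_i g$ evaluates an element of $\mathcal{H}$ at vertex $i$:

```python
import numpy as np

# Random graph and an arbitrary smooth, strictly positive filter (both
# assumptions for this sketch).
rng = np.random.default_rng(5)
N = 15
W = np.triu((rng.random((N, N)) < 0.4).astype(float), 1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W
lam, U = np.linalg.eigh(L)

g_lam = 1.0 / (1.0 + lam)            # smooth, strictly positive filter
G = U @ np.diag(g_lam) @ U.T         # G[i, j] = T_i g [j], a Gram matrix

def ip_H(x, y):
    """<x, y>_H = sum_l (1 / g(lambda_l)) xhat(l) yhat(l)."""
    return np.sum((U.T @ x) * (U.T @ y) / g_lam)

alpha = rng.standard_normal(N)
x = G @ alpha                        # x = sum_k alpha_k T_k g, an element of H

i = 3
print(np.isclose(ip_H(G[:, i], x), x[i]))        # reproducing property -> True
print(bool((np.linalg.eigvalsh(G) > 0).all()))   # G positive definite  -> True
```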
##### Transductive learning

Now that we have established $\mathcal{H}$ as a valid RKHS, we will seek to recover the full signal by solving the following problem:

 ~\bx=argmin