On the Estimation of Latent Distances Using Graph Distances

04/27/2018 · by Ery Arias-Castro et al.

We are given the adjacency matrix of a geometric graph and the task of recovering the latent positions. We study one of the most popular approaches, which consists in using the graph distances, and derive error bounds under various assumptions on the link function. In the simplest case, where the link function is an indicator function, the bound is (nearly) optimal as it (nearly) matches an information lower bound.


1 Introduction

Suppose that we observe an undirected graph on $n$ nodes with adjacency matrix $A = (A_{ij})$, where $A_{ij} \in \{0,1\}$ and $A_{ij} = A_{ji}$, with $A_{ii} = 0$ by convention. We assume the existence of points $x_1, \dots, x_n \in \mathbb{R}^d$ such that

$$\mathbb{P}(A_{ij} = 1) = \phi(\|x_i - x_j\|), \quad i \ne j, \tag{1}$$

for some non-increasing link function $\phi : [0, \infty) \to [0, 1]$. The $A_{ij}$, $i < j$, are assumed to be independent. We place ourselves in a setting where the adjacency matrix $A$ is observed, but the underlying points $x_1, \dots, x_n$ are unknown. We will be mostly interested in settings where $\phi$ is unknown (and no parametric form is assumed).
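To fix ideas, here is a minimal R sketch simulating an adjacency matrix from (1) (R being the language used for the experiments reported below; see Remark 1). The sample size, dimension, and indicator link are illustrative choices, not prescriptions of the model:

```r
## Simulate the latent-position model (1): a minimal sketch.
## n, d, and the particular link function phi are illustrative only.
set.seed(1)
n <- 200; d <- 2
x <- matrix(runif(n * d), n, d)           # latent positions in [0,1]^d
delta <- as.matrix(dist(x))               # pairwise distances (2)
phi <- function(t) as.numeric(t <= 0.2)   # indicator link, as in Section 2
P <- matrix(phi(delta), n, n)             # probability matrix (3)
A <- matrix(rbinom(n^2, 1, P), n, n)      # independent Bernoulli draws
A[lower.tri(A)] <- t(A)[lower.tri(A)]     # symmetrize (undirected graph)
diag(A) <- 0                              # no self-loops
```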

Our most immediate interest is in the pairwise distances

$$\delta_{ij} := \|x_i - x_j\|, \quad 1 \le i < j \le n. \tag{2}$$

Since we assume the link function is unknown, all we can hope for is to rank these distances. Indeed, the most information we can aspire to extract from $A$ is the probability matrix $P = (p_{ij})$, where

$$p_{ij} := \mathbb{P}(A_{ij} = 1), \tag{3}$$

and even with perfect knowledge of $P$, the distances can only be known up to a monotone transformation, since $p_{ij} = \phi(\delta_{ij})$ and $\phi$ is an arbitrary monotone (here non-increasing) function.

Once a possibly incomplete, and likely only approximate, ranking of the distances has been produced, the problem of recovering the latent positions amounts to a problem of ordinal embedding (also known as non-metric multidimensional scaling), which has a long history [39] dating back to pioneering work by Shepard [29, 30] and Kruskal [15]. Because of this, we focus on the pairwise distances rather than the latent positions themselves.
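For reference, Kruskal-style non-metric MDS is available in standard software. A minimal sketch using MASS::isoMDS, which uses only the ranks of the dissimilarities, precisely the information that survives an unknown monotone link:

```r
## Ordinal embedding (non-metric MDS) in the style of Kruskal [15]:
## a minimal sketch on synthetic dissimilarities.
library(MASS)
set.seed(2)
z <- matrix(rnorm(100 * 2), 100, 2)     # placeholder configuration
D <- dist(z)                            # dissimilarities (only ranks matter)
fit <- isoMDS(D, k = 2, trace = FALSE)  # k is the embedding dimension
y <- fit$points                         # recovered positions, known only up
                                        # to a monotone distance transform
```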

1.1 Related work

This is an example of a latent graph model, and the points $x_1, \dots, x_n$ are often called latent positions. In its full generality, the model includes the planted partition model popular in the area of graph partitioning. To see this, let $K$ denote the number of blocks, take $d = K$ and, with $e_k$ denoting the $k$-th canonical basis vector, set $x_i = e_k$ if $i$ belongs to block $k$. The planted partition model is a special case of the stochastic blockmodel of Holland et al. [13]. The latter is also a special case of our model, as can be seen by placing the blocks at rescaled basis vectors, with the scales chosen so that $\phi(\|x_i - x_j\|)$ equals the connection probability between the blocks containing $i$ and $j$. Mixed-membership stochastic blockmodels as in [2, 38, 1] are also special cases of latent graph models, but of a slightly different kind. The literature on the stochastic blockmodel is now substantial and includes results on the recovery of the underlying communities; see, e.g., [31, 17, 24, 21, 6, 11] and references therein.

Our contribution here is of a different nature, as we focus on the situation where the latent positions are well spread out in space, forming no obvious clusters. This relates more closely to the work of Hoff et al. [12]. Although their setting is more general, in that additional information may be available at each position, their approach in our setting reduces to the following logistic regression model:

$$\operatorname{logit} \mathbb{P}(A_{ij} = 1) = \alpha - \|x_i - x_j\|, \tag{4}$$

which is clearly a special case of (1), with $\phi(t) = 1/(1 + e^{\,t - \alpha})$. Sarkar et al. [25]

consider this same model motivated by a link prediction problem where the nodes are assumed to be embedded in space with their Euclidean distances being the dissimilarity of interest. In fact, they assume that the points are uniformly distributed in some region. They study a method based on the number of neighbors that a pair of nodes have in common, which is one of the main methods for link prediction

[19, 18]. Parthasarathy et al. [23] consider a more general setting where a noisy neighborhood graph is observed: if $x_1, \dots, x_n$ are points in a metric space with pairwise distances $\delta_{ij}$, then an adjacency matrix $A$ is observed, where $A_{ij} = 1$ with probability $p$ if $\delta_{ij} \le r$ and with probability $q$ if $\delta_{ij} > r$. Under fairly general conditions on the metric space and the sampling distribution, and additional conditions on $p$ and $q$, they show that the graph distances computed based on $A$ provide, with high probability, a 2-approximation to the underlying distances in the case where $q = 0$. In the case where $q > 0$, the same is true, under some conditions, if $A$ is replaced by a matrix that connects $i$ and $j$ exactly when their number of common neighbors exceeds a (carefully chosen) threshold parameter of the method.

Scheinerman and Tucker [26] and Young and Scheinerman [40] consider what they call a dot-product random graph model, where $p_{ij} = x_i^\top x_j$, it being implicitly assumed that $x_i^\top x_j \in [0, 1]$ for all $i, j$. This model can be cast as a special case of (1) when, for instance, the positions share a common norm, since $x_i^\top x_j$ is then a non-increasing function of $\|x_i - x_j\|$. Sussman et al. [32] consider recovering the latent positions in this model (with full knowledge of the link function). They devise a spectral method which consists in embedding the items as points in $\mathbb{R}^d$ (where $d$ is assumed known) as the row vectors of $U_d \Sigma_d^{1/2}$, where $A = U \Sigma V^\top$ is the SVD of $A$ and, for a matrix $M$ and an integer $k$, $M_k$ denotes the submatrix formed by its first $k$ columns. They analyze their method in a context where the latent positions are in fact a sample from a (possibly unknown) distribution. The same authors extended their work in [33] to an arbitrary link function, which may be unknown, although the focus there is on a binary classification task in a setting where a binary label is available for each node.
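The spectral method just described can be sketched in a few lines. This is our rendering of the idea, via the eigendecomposition of $A$ (equivalent, up to signs, to the SVD for a symmetric matrix), with the embedding dimension assumed known:

```r
## Adjacency spectral embedding in the spirit of Sussman et al. [32]:
## embed the n nodes in R^d using the top-d spectral components of A.
ase <- function(A, d) {
  e <- eigen(A, symmetric = TRUE)
  top <- order(abs(e$values), decreasing = TRUE)[1:d]   # d leading components
  e$vectors[, top, drop = FALSE] %*% diag(sqrt(abs(e$values[top])), d)
}
# xhat <- ase(A, 2)   # rows estimate the latent positions, up to rotation
```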

von Luxburg and Alamgir [36] consider the closely related problem of recovering the latent positions in a setting where a nearest-neighbor graph is available. They proposed a method based on estimating the underlying density. With the density estimated at each position, a weighted graph is defined on the nodes, with edge weights inversely related to the local density estimates, and each distance $\delta_{ij}$ is then estimated by the weighted graph distance between nodes $i$ and $j$. Terada and von Luxburg [35] derive some theory for this method based on results conjectured in [36].

1.2 Our contribution

Graph distances are well-known estimates for the Euclidean distances in the context of graph drawing [16, 28], where the goal is to embed items in space based on an incomplete distance matrix. They also appear in the literature on link prediction [19, 18] and are part of the method proposed in [36]. We examine the use of graph distances for the estimation of the Euclidean distances (2). As we shall see, the graph distances are directly useful when the link function is compactly supported, which is for example the case in the context of a neighborhood graph, where $\phi = \mathbb{1}_{[0,r]}$ for some connectivity radius $r > 0$. In fact, the method is shown to achieve a minimax lower bound in this setting (under a convexity assumption). This setting is discussed in Section 2. In Section 3, we extend the analysis to other (compactly supported) link functions. In Section 4, we briefly discuss a regularization known as Maximum Variance Unfolding (MVU), due to Weinberger et al. [37]. In Section 5, we discuss the problem of embedding a nearest-neighbor graph in a setting where the positions represent a sample from a uniform distribution. In this context, the graph distance method can be seen as a simple variant of the method proposed by von Luxburg and Alamgir [36] and subsequently analyzed by Terada and von Luxburg [35]. We show that the method is biased due to a boundary effect that persists even as the number of positions increases without bound. We end with Section 6, where we discuss some important limitations of the method based on graph distances and consider some extensions, including localization (to avoid the convexity assumption) and the use of the number of common neighbors (to accommodate non-compact link functions). Proofs are gathered in Section 7.

1.3 Preliminaries

Given the adjacency matrix $A$, the graph distance between nodes $i$ and $j$ is defined as

$$\Gamma_{ij} := \min\big\{k \ge 1 : A_{i_{s-1} i_s} = 1 \text{ for } s = 1, \dots, k, \text{ for some } i_0 = i, i_1, \dots, i_k = j\big\}, \tag{5}$$

where $\Gamma_{ij} := \infty$ by convention when no such path exists. (Here and elsewhere, we will sometimes use subscript shorthand, writing $\delta_{ij}$ for $\|x_i - x_j\|$, $\Gamma_{ij}$ for the graph distance between $i$ and $j$, etc.) We propose estimating, up to a scale factor, the Euclidean distances (2) with the graph distances (5). Note that, since $\phi$ is unknown, the scale factor cannot be recovered from the data.

The method is the analog of the MDS-D method of Kruskal and Seery [16] for graph drawing, which is a setting where some of the distances (2) are known and the goal is to recover the missing distances. Let $E$ denote the set of pairs $(i, j)$ for which $\delta_{ij}$ is known. MDS-D estimates the missing distances with the distances in the graph with node set $[n]$ and edge set $E$, with each edge $(i, j)$ weighted by $\delta_{ij}$. This method was later rediscovered by Shang et al. [28], who named it MDS-MAP(P), and coincides with the IsoMap procedure of Tenenbaum et al. [34] for isometric manifold embedding. (For more on the parallel between graph drawing and manifold embedding, see the work of Chen and Buja [5].)

As we shall see, the graph distance method is most relevant when the positions are sufficiently dense in their convex hull, a limitation it shares with MDS-D. For a set $S \subset \mathbb{R}^d$ and points $x_1, \dots, x_n$, define

$$\varepsilon(S) := \max_{y \in S} \, \min_{i \in [n]} \|y - x_i\|, \tag{6}$$

which measures how dense the latent points are in $S$. We also let $\varepsilon$ denote (6) when $S$ is the convex hull of $x_1, \dots, x_n$.
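Since (6) involves a maximum over a continuous region, it is typically approximated numerically. A Monte Carlo sketch, assuming for illustration that the region of interest is the unit square:

```r
## Monte Carlo approximation of the fill measure (6): a sketch assuming
## S = [0,1]^2; replace the probe sampler for other regions.
fill_distance <- function(x, m = 10000) {
  d <- ncol(x)
  y <- matrix(runif(m * d), m, d)    # probe points covering S
  # squared distance from every probe point to every latent point
  d2 <- outer(rowSums(y^2), rowSums(x^2), "+") - 2 * y %*% t(x)
  max(sqrt(pmax(apply(d2, 1, min), 0)))   # worst-case distance to the sample
}
# fill_distance(x)   # small values mean the points are dense in S
```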

2 Simple setting

In this section, we focus on the simple, yet emblematic, case of a neighborhood (ball) graph, that is, a setting where the link function is given by $\phi = \mathbb{1}_{[0,r]}$ for some $r > 0$. We start with a performance bound for the graph distance method and then establish a minimax lower bound. Similar results are available in [3, 23, 4], among other places, and we only provide a proof for completeness, and also to pave the way to the more sophisticated Theorem 3.

Theorem 1.

Assume that $\phi = \mathbb{1}_{[0,r]}$ for some $r > 0$, and define $\Gamma$ as in (5) and $\varepsilon$ as in (6). Then

$$\delta_{ij} \;\le\; r\,\Gamma_{ij} \;\le\; \frac{r}{r - 2\varepsilon}\,\delta_{ij} + r, \quad \text{for all } i \ne j, \tag{7}$$

for any set of points $x_1, \dots, x_n$ that satisfy $\varepsilon < r/2$.

For a numerical example, see Figure 1. In Figure 2, we confirm numerically that the method is biased when the underlying domain from which the positions are sampled is not convex. That said, the method is robust to mild violations of the convexity constraint, as shown in Figure 3, where the positions correspond to US cities. (These were sampled at random from the dataset available at simplemaps.com/data/us-cities.)

Remark 1.

Computations were done in R, with the graph distances computed using the igraph package, to which classical scaling was applied, followed by a Procrustes alignment and scaling using the vegan package.
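Remark 1 translates into a short script. A sketch of that pipeline, assuming A and the true positions x from the earlier simulation, and taking the connectivity radius r = 0.2 as the (in general unknown) scale factor:

```r
## Pipeline of Remark 1: graph distances (igraph), classical scaling,
## then Procrustes alignment against the true positions (vegan).
library(igraph)
library(vegan)
r <- 0.2                                   # connectivity radius (scale factor)
g <- graph_from_adjacency_matrix(A, mode = "undirected")
Gamma <- distances(g)                      # graph distances (5)
Gamma[!is.finite(Gamma)] <- max(Gamma[is.finite(Gamma)])  # cap disconnected pairs
xhat <- cmdscale(r * Gamma, k = 2)         # classical scaling of r * Gamma
fit <- procrustes(x, xhat, scale = TRUE)   # align and rescale to the truth
# plot(fit)   # compare recovered positions with the latent ones
```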

Figure 1 [(a) latent positions; (b)–(d) recovered positions for three values of the connectivity radius]: A numerical example illustrating the setting of Theorem 1. Here the positions were sampled uniformly at random from three regions.
Figure 2 [(a) latent positions; (b) recovered positions]: A numerical example illustrating the setting of Theorem 1, showing that the convexity constraint is indeed required for the graph distance method to be unbiased. Here the positions were sampled uniformly at random from a rectangle with a hole (cf. Section 6.1).
Figure 3 [(a) latent positions; (b)–(d) recovered positions for three values of the connectivity radius]: A numerical example illustrating the setting of Theorem 1. The latent positions are located at the coordinates of US cities and the connectivity radius varies (in degrees).

Note that $r\,\Gamma_{ij}$ is not a true estimator in general, as it relies on knowledge of $r$, which is a feature of the unknown link function. Nevertheless, the result says that, up to that scale parameter, the graph distances achieve a nontrivial level of accuracy.

It turns out that the graph distance method comes close to achieving the best possible performance (understood in a minimax sense) in this particularly simple setting. Indeed, we are able to establish the following general lower bound, which applies to any method.

Theorem 2.

Assume that $\phi = \mathbb{1}_{[0,r]}$ with $r = 1$ (without loss of generality). Then there is a numeric constant $c > 0$ with the property that, for any $n$ and any estimator (an estimator here being a function on the set of $n$-by-$n$ symmetric binary matrices), there is a configuration of points $x_1, \dots, x_n$ on the real line such that, for at least half of the pairs $(i, j)$,

(8)

Note that $r$ is known in Theorem 2 ($r = 1$) and that we are talking about bona fide estimators.

3 General setting

Beyond the setting of a neighborhood graph considered in Section 2, the graph distance method in fact performs similarly in greater generality. We consider in this section the case of a link function that is compactly supported and establish a performance bound comparable to that of Theorem 1.

Theorem 3.

Assume that $\phi$ has support $[0, r]$, for some $r > 0$, and define $\Gamma$ as in (5) and $\varepsilon$ as in (6). Assume also that, for some $b > 0$ and $\alpha \ge 0$, $\phi(t) \ge b\,(1 - t/r)^{\alpha}$ for all $t \in [0, r]$. Then there are constants depending only on $b$ and $\alpha$ such that, for $n$ large enough, with probability tending to one,

(9)

for any points $x_1, \dots, x_n$ whose fill distance $\varepsilon$ is sufficiently small relative to $r$.

See Figure 4 for a numerical example. In the bound (9), the behavior depends mainly on how $\phi$ behaves near the edge of its support, namely as $t \to r$. If $\alpha = 0$, then $\phi$ is discontinuous at $r$, and we recover a bound similar to that of Theorem 1, proved there for the case where $\phi = \mathbb{1}_{[0,r]}$. The bound obtained in this case appears near-optimal in view of Theorem 2. For $\alpha > 0$, we do not know whether the bound (9) is similarly near-optimal. In general, we expect the graph distance procedure to be less accurate as $\alpha$ increases, and in particular, even though the result does not cover the case of a link function converging to zero more quickly near the edge of its support, we speculate that the graph distance does not perform well in this case.

Figure 4 [(a) latent positions; (b)–(d) recovered positions]: Same setting as in Figure 3, with the support radius fixed and the link function varied. (In fact, to ease the comparison, we coupled the different adjacency matrices, in the sense that each sparser matrix was built by erasing edges from a denser one independently with the appropriate probability.)

4 Regularization by maximum variance unfolding

With the graph distances (5) computed, it might be tempting to regularize them by approximating them with Euclidean distances. We say that a metric $\gamma = (\gamma_{ij})$ on $[n]$ is Euclidean if there are points $y_1, \dots, y_n$ in some Euclidean space such that $\gamma_{ij} = \|y_i - y_j\|$ for all $i, j$. Let $\mathcal{E}$ denote the set of such metrics on $[n]$ and consider the following optimization problem

(10)

Let $\hat\gamma$ denote a solution and let $\hat y_1, \dots, \hat y_n$ be points in a Euclidean space such that $\hat\gamma_{ij} = \|\hat y_i - \hat y_j\|$.

Other measures of discrepancy between $\gamma$ and $\Gamma$ are of course possible. The reason we consider this particular form is that it corresponds to Maximum Variance Unfolding (MVU), a well-known method for graph drawing proposed by Weinberger, Sha, Zhu, and Saul [37], although it is perhaps best known as a method for manifold embedding. Given a weighted graph on $[n]$ with weight matrix $W = (w_{ij})$, MVU consists in solving the following optimization problem:

(11)
(12)
Paprotny and Garcke [22] have shown that the two formulations are equivalent when the weights are taken to be the corresponding graph distances.
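For concreteness, MVU is a semidefinite program over the Gram matrix of the embedded points and can be prototyped with a generic convex solver. A minimal sketch using the CVXR package with the SCS solver; this is our reading of the standard MVU formulation (equality constraints on the observed edges), not code from the paper, and it only scales to small graphs:

```r
## Maximum Variance Unfolding as an SDP: a sketch using CVXR.
## Maximize the total variance (trace of the Gram matrix K) subject to
## preserving the target lengths D[i, j] across the edges of A.
library(CVXR)
mvu <- function(A, D, ndim = 2) {
  n <- nrow(A)
  K <- Variable(n, n, PSD = TRUE)        # Gram matrix of the embedding
  cons <- list(sum_entries(K) == 0)      # center the configuration
  for (i in 1:(n - 1)) for (j in (i + 1):n) if (A[i, j] == 1)
    cons <- c(cons, list(K[i, i] + K[j, j] - 2 * K[i, j] == D[i, j]^2))
  sol <- solve(Problem(Maximize(matrix_trace(K)), cons), solver = "SCS")
  e <- eigen(sol$getValue(K), symmetric = TRUE)
  # read off an ndim-dimensional configuration from the optimal Gram matrix
  e$vectors[, 1:ndim] %*% diag(sqrt(pmax(e$values[1:ndim], 0)), ndim)
}
```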

Proposition 1.

We have $\hat\gamma_{ij} \le \Gamma_{ij}$ for all $i, j$.

Proof.

We only need to consider a pair of nodes $(i, j)$ such that $\Gamma_{ij} < \infty$, for otherwise the bound holds by convention. Let $i = i_0, i_1, \dots, i_k = j$ denote a shortest path in the graph connecting $i$ and $j$, so that $\Gamma_{ij} = k$. We then derive

(13)

using the triangle inequality and then the fact that $\hat\gamma_{i_{s-1} i_s} \le 1$ for all $s$, since $\hat\gamma$ satisfies (12) and $A_{i_{s-1} i_s} = 1$ for all $s$. ∎

We also have the following, which results from a straightforward adaptation of [22, Th 3].

Proposition 2.

Assume there are constants $c_0$ and $c_1$ such that $\Gamma$ satisfies $\Gamma_{ij} \le c_0\,\delta_{ij} + c_1$ for all $i, j$. Then there is a universal constant $C$ such that

(14)

This proposition, together with Theorem 3, allows one to bound the performance of the MVU regularization of the graph distances.

5 Embedding a nearest-neighbor graph

In this section, we turn our attention to a different, yet closely related, setting: that of embedding a nearest-neighbor graph. We observe the $k$-nearest-neighbor graph given by the adjacency matrix $A$, where $A_{ij} = 1$ if $x_j$ is among the $k$ nearest neighbors of $x_i$, and $A_{ij} = 0$ otherwise. Note that $A$ is not symmetric in general. The goal remains the same, that is, to recover the Euclidean distances (2).
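Building this adjacency matrix directly from the positions is straightforward; a base-R sketch (note the asymmetry):

```r
## k-nearest-neighbor adjacency matrix: A[i, j] = 1 iff x_j is among the
## k nearest neighbors of x_i; not symmetric in general.
knn_graph <- function(x, k) {
  D <- as.matrix(dist(x))
  diag(D) <- Inf                          # exclude self-matches
  A <- matrix(0L, nrow(D), ncol(D))
  for (i in seq_len(nrow(D)))
    A[i, order(D[i, ])[1:k]] <- 1L
  A
}
```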

For a set $S \subset \mathbb{R}^d$, define its radius, $\mathrm{rad}(S)$, as half its diameter; its inradius, $\mathrm{inrad}(S)$, as the radius of the largest ball included in $S$; and, for $t > 0$, the set $S_{\ominus t} := \{y \in S : B(y, t) \subseteq S\}$, where $B(y, t)$ denotes the closed ball of radius $t$ centered at $y$; this set is possibly empty (the operation is called an erosion in mathematical morphology). If $S$ is measurable, then $|S|$ denotes its Lebesgue measure. Let $\omega_d$ denote the volume of the unit ball in $\mathbb{R}^d$.

Theorem 4.

Let $k \ge 1$ and assume that $x_1, \dots, x_n$ were generated iid from the uniform distribution on a compact, convex set $\Omega \subset \mathbb{R}^d$ with non-empty interior, and let $\Gamma$ denote the graph distance (5) in the resulting $k$-nearest-neighbor graph. There are constants depending only on $\Omega$ and $d$ such that the following holds. Define the scaling factor $\rho := (k\,|\Omega| / (n\,\omega_d))^{1/d}$. If $k$ is large enough and $k/n$ is small enough, then, with probability tending to one,

(15)

as well as

(16)

In fact, (16) holds for all pairs of positions that are sufficiently far from the boundary of $\Omega$.

Remark 2.

A similar, yet slightly different, result may be obtained for a variant of this setting. We leave the details to the reader's curiosity.

Thus $\rho$, as defined in Theorem 4, is an appropriate scaling for the graph distances for the pairs of items satisfying (16). However, there are pairs for which this scaling is not accurate, and the two statements combined show that there is no scaling that works uniformly, in the sense of making all the graph distances close to the Euclidean distances. That boundary effect does not go away in the large-sample limit. See Figure 5 for a numerical example. What happens is that the boundary serves as a 'freeway', because for a point near the boundary, its nearest neighbors are farther away. See Figure 6 for an illustration.

Rather than argue the existence of pairs for which $\rho$ is not an accurate scaling for a general $\Omega$ satisfying the conditions of Theorem 4, which we claim is true as stated, we focus on an example. Arguably, this is the most extreme situation, but it is not hard to see that it generalizes to other sets by zooming in on the boundary. The example is in the plane, but only for concreteness, as the same phenomenon is easily seen to generalize to higher dimensions.

Figure 5 [(a) latent positions; (b) recovered positions]: A numerical example illustrating the setting of Proposition 3.
Figure 6: This continues the setting of Figure 5. Each colored polygonal line is the shortest path in the k-nearest-neighbor graph between its endpoints. While the distance estimate for the pair of positions that are nearby (joined by the green path) is accurate, the distance estimate for the pair of positions that are farther apart (joined by the red path) is not. This is congruent with our statements in Theorem 4 and Proposition 3.
Proposition 3.

Consider the statement of Theorem 4 for the planar example just described. When $n$ is sufficiently large, there is a constant depending only on this example such that, with probability tending to one,

(17)
Remark 3.

Above, we considered the approach put forth by von Luxburg and Alamgir [36], naturally adapted to the setting where the underlying distribution is uniform. The actual method proposed by von Luxburg and Alamgir [36], however, may well be consistent for estimating the distances (again, up to scaling). Indeed, it is quite possible that the density estimate along the boundary is biased in just the proper way.

Remark 4.

We mention that the nearest-neighbor setting is also considered in [8], where a synchronization approach is proposed. More generally, we are looking at an ordinal embedding problem, since triplet distance comparisons can be obtained as follows:

$$A_{ij} = 1 \ \text{and} \ A_{ik} = 0 \;\Longrightarrow\; \delta_{ij} < \delta_{ik}. \tag{18}$$

However, none of the methods we know of come with theoretical guarantees, although [14] comes close.

6 Discussion

The method based on graph distances studied in the previous sections suffers from a number of serious limitations:

  1. The positions need to span a convex set. (In practice, the method is robust to mild violations of this constraint as exemplified in Figure 3.)

  2. Even in the most favorable setting of Section 2, the relative error is still of order one for pairs at distance comparable to $r$, as established in (9), and this does not seem improvable. (To see this, contrast the situation where $\delta_{ij}$ is just below $r$ with the one where it is just above.)

  3. The link function needs to be compactly supported. Indeed, the method can be grossly inaccurate in the presence of long edges, as in the interesting case where the link function has full support and is of the form

    (19)

    for some decay parameter.

We address each of these three issues in what follows.

6.1 Localization

A possible approach to addressing Issue 1 is to operate locally. This is well understood and is what led Shang and Ruml [27] to suggest MDS-MAP(P), which effectively localizes MDS-MAP [28]. (As we discussed earlier, the latter is essentially a graph-distance method and thus bound by the convexity constraint.) More recent methods for graph drawing based on 'synchronization' also operate locally [9, 7].

Experimentally, this strategy works well. See Figure 7 for a numerical example, which takes place in the context of the rectangle with a hole of Figure 2. We adopted a simple approach: we kept the graph distances that were below a threshold, leaving the other ones unspecified, and then applied a method for multidimensional scaling with missing values, specifically SMACOF [10] (initialized with the output of the graph distance method).
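A sketch of this localization strategy, assuming the smacof package, with `thresh` a hypothetical cutoff below which graph distances are kept:

```r
## Localization: keep only short graph distances, then run SMACOF with the
## unspecified entries given zero weight, initialized at the global solution.
library(smacof)
Gs <- r * Gamma                           # scaled graph distances, as before
Gs[!is.finite(Gs)] <- max(Gs[is.finite(Gs)])
thresh <- 0.5                             # hypothetical cutoff (tuning parameter)
W <- (Gs <= thresh) * 1                   # weight 1 if kept, 0 if unspecified
diag(W) <- 0
init <- cmdscale(Gs, k = 2)               # output of the graph distance method
fit <- smacofSym(Gs, ndim = 2, weightmat = W, init = init)
xhat <- fit$conf                          # locally refined embedding
```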

Figure 7 [(a) latent positions; (b) recovered positions]: Same setting as in Figure 2, with localization.

6.2 Regularization by multidimensional scaling

Regarding Issue 2, in numerical experiments we have found that the graph distances, although grossly inaccurate, are nevertheless useful for embedding the points using (classical) multidimensional scaling. This remains surprising to us and we do not have a good understanding of the situation. (Note that if one is truly interested in estimating the Euclidean distances, one may use the graph distances, apply multidimensional scaling, and then compute the distances between the embedded points.) For a numerical illustration, see Figure 8.

Figure 8 [(a) latent positions; (b) recovered positions]: Here the positions were sampled uniformly at random. The graph distances take only a few distinct values, so they are rather discrete, yet the embedding computed by classical multidimensional scaling is surprisingly accurate.

6.3 Number of common neighbors

A possible approach to addressing Issue 3 (and also Issue 2) is to work with the number of common neighbors, which provides an avenue to 'super-resolution' in a way. By this we mean that, say in the simple setting of Section 2, although the adjacency matrix only tells us whether two positions are within distance $r$ of each other, it is possible to aggregate all this information to refine this assessment. Similarly, in the setting where (19) is the link function, it is possible to tell whether two positions are nearby or not. This sort of concentration is well known to the expert and seems to be at the foundation of spectral methods (see, e.g., [32, Prop 4.2]). We refer the reader to [25, 23], where such an approach is considered in great detail.
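In practice, the common-neighbor counts come from a single matrix product. A minimal sketch, with the threshold a tuning parameter in the spirit of [25, 23]:

```r
## Number of common neighbors: for a 0/1 symmetric A, entry (i, j) of
## A %*% A counts the common neighbors of i and j.
C <- A %*% A
theta <- 10                  # illustrative threshold, to be tuned
Atilde <- (C >= theta) * 1L  # refined graph: connect if many shared neighbors
diag(Atilde) <- 0L
```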

7 Proofs

7.1 Proof of Theorem 1

Fix $i, j \in [n]$ distinct.

Let $\delta := \delta_{ij}$ and let $m := \lceil \delta / (r - 2\varepsilon) \rceil$. For $s = 0, 1, \dots, m$, let $z_s := x_i + (s/m)(x_j - x_i)$. We have $z_0 = x_i$ and $z_m = x_j$, and $z_1, \dots, z_{m-1}$ are on the line joining $x_i$ and $x_j$ and satisfy $\|z_s - z_{s-1}\| \le r - 2\varepsilon$ for all $s$. Let $i_s$ be such that $\|x_{i_s} - z_s\| \le \varepsilon$, with $i_0 = i$ and $i_m = j$. Note that $i_s$ is well-defined since $z_s$ belongs to the convex hull of $x_1, \dots, x_n$, over which the fill distance is at most $\varepsilon$. By the triangle inequality, for all $s$,

$$\|x_{i_s} - x_{i_{s-1}}\| \;\le\; \|x_{i_s} - z_s\| + \|z_s - z_{s-1}\| + \|z_{s-1} - x_{i_{s-1}}\| \;\le\; r. \tag{20}$$

Hence, $(i_0, i_1, \dots, i_m)$ forms a path in the graph, and as a consequence, $\Gamma_{ij} \le m$. In turn, this implies that

$$r\,\Gamma_{ij} \;\le\; r\,m \;\le\; \frac{r}{r - 2\varepsilon}\,\delta_{ij} + r, \tag{21}$$

using the fact that $\lceil t \rceil \le t + 1$.

Resetting the notation, let $i = i_0, i_1, \dots, i_k = j$ denote a shortest path joining $i$ and $j$, so that $\Gamma_{ij} = k$. By the triangle inequality,

$$\delta_{ij} \;\le\; \sum_{s=1}^{k} \|x_{i_s} - x_{i_{s-1}}\| \;\le\; r\,k \;=\; r\,\Gamma_{ij}, \tag{22}$$

using the fact that $\|x_{i_s} - x_{i_{s-1}}\| \le r$ for all $s$.

7.2 Proof of Theorem 2

We construct two point configurations that yield the same adjacency matrix and then measure the largest difference between the corresponding sets of pairwise distances.

Assume that is an integer for convenience.

  • Configuration 1. In this configuration,

    (23)

    Note that .

  • Configuration 2. In this configuration,

    (24)

    for some chosen small later on. When , which we assume, is increasing with and . Note that

The two configurations coincide when , but we will choose in what follows. Under Configuration 1, the adjacency matrix is given by . For the design matrix to be the same under Configuration 2, it suffices that have (exactly) neighbors (to the right) and that have (exactly) neighbors (to the left); this is because is decreasing in this configuration. These two conditions correspond to four equations, given by

(25)

We need only consider the first and fourth as they imply the other two. After some simplifications, we see that the first one holds when , while the fourth holds when and . Since , when , and we choose for example. Then in Configuration 2 (same as in Configuration 1). We choose just large enough that in both configurations. In particular, as . Since the result only needs to be proved for small, we may take as large as we need.

Now that the two designs have the same adjacency matrix, we cannot distinguish them with the available information. It therefore suffices to look at the difference between the pairwise distances. Indeed, if $\delta_{ij}^{(c)}$ denotes the distance between the $i$-th and $j$-th positions in Configuration $c \in \{1, 2\}$, then for any estimator $\hat\delta$,

$$\max_{c = 1, 2} \big|\hat\delta_{ij} - \delta_{ij}^{(c)}\big| \;\ge\; \tfrac12 \big|\delta_{ij}^{(1)} - \delta_{ij}^{(2)}\big|. \tag{26}$$

For , we have

(27)

so that

(28)

Let

(29)

and note that include at least half of all pairs in . For , so that , and given the condition on and our choice for ,

(30)

for some universal constants and . We then conclude with the fact that for some universal constant , due to the fact that .

7.3 Proof of Theorem 3

As before in (22), we have $\delta_{ij} \le r\,\Gamma_{ij}$ for all $i, j$. Recall the definition of the probability matrix in (3). Below, $C_1$, $C_2$, etc., will denote positive constants that depend only on the parameters of the theorem. Since the result only needs to be proved for $n$ large, we will take this quantity as large as needed. In what follows, we connect each node in the graph to itself. This is only for convenience and has no impact on the validity of the resulting arguments.


Special case. Suppose that for all . In that case, for all , . For distinct, forms a path in the graph if and only if , which happens with probability . By independence, therefore,

(31)

Hence, by the union bound, with probability at least , we have , implying , for all .

Henceforth, we assume that

(32)

Claim 1. By choosing the constants large enough, the following event happens with probability at least ,

(33)

Take such that . We first note that there is such that , for otherwise, for all , , which would contradict our assumption (32).

Define

(34)

where . By construction, each is on the line segment joining and , and so belongs to the convex hull of ; hence, by the fact that , there is such that . By the triangle inequality,

(35)

and

(36)

Therefore, for each , forms a path with probability at least . By independence, there is such an with probability at least .

With the union bound and the fact that , we may conclude that, if is chosen large enough, the event

(37)

has probability at least , since

(38)

eventually, by our assumption that , where will be chosen large.

Next, we prove that implies , which will suffice to establish the claim. For this, it remains to consider the other cases. Therefore, take such that , and define and

(39)

As before, for each , there is such that . We let and . The latter is possible since .

We have

(40)

so that, under ,

(41)

implying that when . Thus, by the triangle inequality, under ,

(42)

By the triangle inequality,

(43)

and it is not hard to verify that , so that , implying under that . Hence, under ,

(44)

when is small enough, as we needed to prove.


The claim, of course, falls quite short of proving the theorem, but we use it in the remainder of the proof. However, it does prove that (9) holds for all such that . Thus, for the remainder of the proof, we need only focus on such that . Define and as before, and also the corresponding .

As before, (40) implies that

(45)

and in particular