 # Randomized Near Neighbor Graphs, Giant Components, and Applications in Data Science

If we pick $n$ random points uniformly in $[0,1]^d$ and connect each point to its $k$-nearest neighbors, then it is well known that there exists a giant connected component with high probability. We prove that in $[0,1]^d$ it suffices to connect every point to $c_{d,1} \log\log n$ points chosen randomly among its $c_{d,2} \log n$-nearest neighbors to ensure a giant component of size $n - o(n)$ with high probability. This construction yields a much sparser random graph with $\sim n \log\log n$ instead of $\sim n \log n$ edges that has comparable connectivity properties. This result has nontrivial implications for problems in data science where an affinity matrix is constructed: instead of picking the $k$-nearest neighbors, one can often pick $k' \ll k$ random points out of the $k$-nearest neighbors without sacrificing efficiency. This can massively simplify and accelerate computation; we illustrate this with several numerical examples.


## 1. Introduction and Main Results

### 1.1. Introduction.

The following problem is classical (we refer to the book of Penrose  and references therein).

Suppose $n$ points are randomly chosen in $[0,1]^d$ and we connect every point to its $k$ nearest neighbors, what is the likelihood of obtaining a connected graph?

It is not very difficult to see that $k \sim \log n$ is the right order of magnitude. Arguments for both directions are sketched in the first section of a paper by Balister, Bollobás, Sarkar & Walters. Establishing precise results is more challenging; the same paper shows that $k \le c_1 \log n$ leads to a disconnected graph and $k \ge c_2 \log n$ leads to a connected graph, for explicit constants $0 < c_1 < c_2$, with probabilities going to 1 as $n \to \infty$. We refer to [4, 5, 9, 33, 34] for other recent developments.

We contrast this problem with one that is encountered on a daily basis in applications.

Suppose points are randomly sampled from a set with some geometric structure (say, a submanifold in high dimensions); how should one create edges between these vertices to best reflect the underlying geometric structure?

This is an absolutely fundamental problem in data science: data is usually represented as points in high dimensions and for many applications one creates an affinity matrix that may be considered an estimate of how ‘close’ two elements in the data set are; equivalently, this corresponds to building a weighted graph with data points as vertices. Taking the $k$-nearest neighbors is a standard practice in the field (see e.g. [2, 11, 23]) and will undoubtedly preserve locality. Points are only connected to nearby points and this gives rise to graphs that reflect the overall structure of the underlying geometry. The main point of our paper is that this approach, while correct from the local geometric perspective, is often not optimal for how it is used in applications. We first discuss the main results from a purely mathematical perspective and then explain what this implies for applications.

### 1.2. k−nearest Neighbors.

We now explore what happens if $k$ is fixed and $n \to \infty$. More precisely, the question being treated in this section is as follows.

Suppose $n$ points are randomly sampled from a nice (compactly supported, absolutely continuous) probability distribution and every point is connected to its $k$-nearest neighbors. What can be said about the number of connected components as $n \to \infty$? How do these connected components behave?

The results cited above already imply that the arising graph is disconnected with very high likelihood. We aim to answer these questions in more detail. By a standard reduction (a consequence of everything being local, see e.g. Beardwood, Halton & Hammersley), it suffices to study the case of uniformly distributed points on $[0,1]^d$. More precisely, we will study the behavior of random graphs generated in the following manner: a random sample from a Poisson process of intensity $n$ on $[0,1]^d$ yields a number of uniformly distributed points (roughly $n$), and we connect each of these points to its $k$-nearest neighbors, where $k \in \mathbb{N}$ is fixed.

###### Theorem 1 (There are many connected components.).

Let $X_n$ denote the number of connected components of a graph obtained from connecting point samples from a Poisson process with intensity $n$ to their $k$-nearest neighbors. There exists a constant $c_{d,k} > 0$ such that

$$\lim_{n \to \infty} \frac{\mathbb{E} X_n}{n} = c_{d,k}.$$

Moreover, the expected diameter of a connected component is $\lesssim_{d,k} n^{-1/d}$.

In terms of number of clusters, this is the worst possible behavior: the number of clusters is comparable to the number of points. The reason why this problem is not an issue in applications is that the constant $c_{d,k}$ decays quickly in both parameters (see Table 1). The second part of the statement is also rather striking: a typical cluster lives essentially at the scale of nearest neighbor distances; again, one would usually expect this to be a noticeable concern but in practice the implicit constant in the diameter estimate is growing extremely rapidly in both parameters. It could be of interest to derive some explicit bounds on the growth of these constants. Our approach could be used to obtain some quantitative statements but they are likely far from the truth.

We emphasize that this is a statement about the typical clusters and there are usually clusters that have very large diameter; this is also what turns $k$-nearest neighbor graphs into a valuable tool in practice: usually, one obtains a giant connected component. Results in this direction were established by Balister & Bollobás and Teng & Yao: more precisely, in dimension 2 the 11-nearest neighbor graph percolates (it is believed that 11 can be replaced by 3, see ).

### 1.3. Randomized Near Neighbors.

Summarizing the previous sections, it is clear that if we are given points in $[0,1]^d$ coming from a Poisson process with intensity $n$, then the associated $k$-nearest neighbor graph will have $\sim n$ connected components for $k$ fixed (as $n \to \infty$) and will be connected with high likelihood as soon as $k \gtrsim \log n$. The main contribution of our paper is to show that there is a much sparser random construction that has better connectivity properties; this is of intrinsic interest but also has a number of remarkable applications in practice (and, indeed, was inspired by those).

###### Theorem 2.

There exist constants $c_{d,1}, c_{d,2}$, depending only on the dimension, such that if we connect every one of $n$ points, i.i.d. uniformly sampled from $[0,1]^d$,

$$\text{to each of its } c_{d,1} \log n \text{ nearest neighbors with likelihood } p = \frac{c_{d,2} \log\log n}{\log n},$$

then the arising graph has a connected component of size $n - o(n)$ with high probability.

1. This allows us to create graphs on $\sim n \log\log n$ edges that have one large connected component of size proportional to $n$ with high likelihood.

2. While the difference between $\log n$ and $\log\log n$ may seem negligible for any practical applications, there is a sizeable improvement in the explicit constant that can have a big effect (we refer to §2.2 for numerical examples).

3. The result is sharp in the sense that the graph is not going to be connected with high probability (see §5). In practical applications the constants scale favorably and the graphs are connected (in practice, even large $n$ are too small for asymptotic effects). We furthermore believe that another randomized construction, discussed in §5, has conceivably the potential of yielding connected graphs without using significantly more edges; we believe this to be an interesting problem.

Figure 2. Theorem 1 and Theorem 2 illustrated: 5000 uniformly distributed random points are connected by 2-nearest neighbors (left) and 2 out of the 4-nearest neighbors, randomly selected (right). Connected components are distinguished by color; we observe a giant component on the right.
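The comparison shown in Figure 2 can be imitated on a small scale. The following is a minimal sketch, not the authors' code: it uses only the Python standard library, a brute-force neighbor search (so it is only practical for a few hundred points), and the "pick `k_prime` out of the `k` nearest" variant from the figure rather than the independent coin flips of Theorem 2. All function names and the parameter values are illustrative assumptions.

```python
import math
import random


def nearest_neighbors(points, k):
    """Brute-force k-nearest neighbors; returns one index list per point."""
    neighbors = []
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(order[:k])
    return neighbors


def component_sizes(n, edges):
    """Sizes of the connected components (union-find), largest first."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in edges:
        parent[find(a)] = find(b)
    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


def knn_edges(points, k):
    """Edges of the plain k-nearest neighbor graph."""
    nbrs = nearest_neighbors(points, k)
    return [(i, j) for i, row in enumerate(nbrs) for j in row]


def randomized_near_neighbor_edges(points, k, k_prime, rng):
    """Each point keeps k_prime random neighbors out of its k nearest."""
    nbrs = nearest_neighbors(points, k)
    return [(i, j) for i, row in enumerate(nbrs) for j in rng.sample(row, k_prime)]


rng = random.Random(0)
n = 400  # far smaller than the 5000 points of Figure 2, to keep brute force fast
points = [(rng.random(), rng.random()) for _ in range(n)]

sizes_knn = component_sizes(n, knn_edges(points, 2))
sizes_rnn = component_sizes(n, randomized_near_neighbor_edges(points, 4, 2, rng))
print("2-nearest neighbors:     components =", len(sizes_knn), "largest =", sizes_knn[0])
print("2 random of 4 nearest:   components =", len(sizes_rnn), "largest =", sizes_rnn[0])
```

Both graphs use the same number of directed edges ($2n$), so any difference in the component counts comes from how the edges are allocated, not from how many there are.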

### 1.4. The Big Picture.

We believe this result to have many applications. Many approaches in data science require the construction of a local graph reflecting underlying structures and one often chooses a $k$-nearest neighbor graph. If we consider the example of spectral clustering, then Maier, von Luxburg & Hein describe theoretical results regarding how this technique, applied to a $k$-nearest neighbor graph, approximates the underlying structure in the data set. Naturally, in light of the results cited above, the parameter $k$ has to grow at least like $\log n$ for such results to become applicable. Our approach allows for a much sparser random graph that has comparable connectivity properties and several other useful properties.

However, we believe that there is also a much larger secondary effect that should be extremely interesting to study: suppose we are given a set of points $\{x_1, \dots, x_n\} \subset \mathbb{R}^d$. If these points are well-separated then the $k$-nearest neighbor graph is an accurate representation of the underlying structure; however, even very slight inhomogeneities in the data, with some points being slightly closer than some other points, can have massive repercussions throughout the network (Figure 3). Even a slight degree of clustering will produce strongly localized clusters in the $k$-nearest neighbor graph; the required degree of clustering is so slight that even randomly chosen points will be subjected to it (and this is one possible interpretation of Theorem 1).

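Smoothing at logarithmic scales. It is easy to see that for points in $[0,1]^d$ coming from a Poisson process with intensity $n$, local clustering is happening at spatial scale $\sim n^{-1/d}$. The number of points contained in a ball $B$ with volume $\sim c \log n / n$ is given by a Poisson distribution with parameter $\lambda \sim c \log n$ and satisfies

$$\mathbb{P}(B \text{ contains less than } \ell \text{ points}) \sim \frac{(c \log n)^{\ell}}{\ell!} \frac{1}{n^{c}} \lesssim \frac{1}{n^{c - \varepsilon}}.$$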

This likelihood is actually quite small, which means that it is quite unlikely to find isolated clusters at that scale. In particular, an algorithm as proposed in Theorem 2 that picks random elements at that scale, will then destroy the nonlinear concentration effect in the nearest neighbor construction induced by local irregularities of uniform distribution. We believe this to be a general principle that should have many applications.
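To get a feeling for how small this probability is, one can evaluate the Poisson tail directly. The sketch below assumes a ball with exactly $c \log n$ expected points; the values of `c` and `ell` are arbitrary illustrative choices that do not come from the text, and the function name is hypothetical.

```python
import math


def poisson_prob_fewer_than(ell, lam):
    """P(N < ell) for N ~ Poisson(lam), summed term by term."""
    return sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(ell))


c, ell = 3.0, 5  # illustrative choices, not taken from the paper
for n in (10**3, 10**6, 10**9):
    lam = c * math.log(n)  # expected number of points in the ball
    exact = poisson_prob_fewer_than(ell, lam)
    # Leading-order expression from the display: (c log n)^ell / ell! * n^(-c)
    leading = lam**ell / math.factorial(ell) / n**c
    print(f"n={n:>10}  exact tail={exact:.3e}  displayed expression={leading:.3e}")
```

Both columns shrink polynomially in $n$, which is the sense in which isolated sparse regions become unlikely at the logarithmic scale.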

When dealing with inhomogeneous data, it can be useful to select $k$ nearest neighbors and then subsample $k' \ll k$ random elements. Here, $k$ should be chosen at such a scale that localized clustering effects disappear.

An important assumption here is that clusters at local scales are not intrinsic but induced by unavoidable sampling irregularities. We also emphasize that we believe the best way to implement this principle in practice and in applications to be very interesting and far from resolved.

## 2. Applications

### 2.1. Implications for Spectral Clustering.

The first step in spectral clustering of $n$ points $\{x_1, \dots, x_n\}$ into $p$ clusters involves computation of a kernel $w(x_i, x_j)$ for all pairs of points, resulting in an $n \times n$ matrix $W$. The kernel is a measure of affinity and is commonly chosen to be a Gaussian with bandwidth $\sigma$,

$$w(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / \sigma^2\right).$$

$W$ defines a graph with $n$ nodes where the weight of the edge between $x_i$ and $x_j$ is $w(x_i, x_j)$. The Graph Laplacian is defined as:

$$L = D - W$$

where $D$ is a diagonal matrix with the sum of each row of $W$ on the diagonal. Following , the Graph Laplacian can be normalized symmetrically

$$L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2},$$

giving a normalized Graph Laplacian. The eigenvectors corresponding to the $p$ smallest eigenvalues of $L_{sym}$ are then calculated and concatenated into the $n \times p$ matrix $V$. The rows of $V$ are normalized to have unit norm and are then clustered using the $k$-means algorithm into $p$ clusters. Crucially, the multiplicity of the eigenvalue $0$ of $L_{sym}$ equals the number of connected components, and the eigenvectors corresponding to the eigenvalue $0$ are piecewise constant on each connected component. In the case of well-separated clusters, each connected component corresponds to a cluster, and the first $p$ eigenvectors contain the information necessary to separate the clusters.

Figure 4. 16000 points arranged in 4 clusters and a spiral. We compare the effect of connecting every point to its 2-nearest neighbors (left) and connecting every point to 2 randomly chosen out of its 7-nearest neighbors (right). Connected components are colored; the graph on the left has ∼700 connected components; the graph on the right consists of the actual 5 clusters.

However, the computational complexity and memory storage requirements of $W$ scale quadratically with $n$, typically making computation intractable when $n$ exceeds tens of thousands of points. Notably, as the distance between any two points $x_i$ and $x_j$ increases, $w(x_i, x_j)$ decays exponentially. That is, for all $x_i, x_j$ that are not close (relative to $\sigma$), $w(x_i, x_j)$ is negligibly small. Therefore, $W$ can be approximated by a sparse matrix $W'$, where $W'_{ij} = w(x_i, x_j)$ if $x_j$ is among $x_i$'s $k$ nearest neighbors or $x_i$ is among $x_j$'s $k$ nearest neighbors, and $W'_{ij} = 0$ otherwise. Fast algorithms have been developed to find approximate nearest neighbors (e.g. ), allowing for efficient computation of $W'$, and Lanczos methods can be used to efficiently compute its eigenvectors. When the number of neighbors, $k$, is chosen sufficiently large, $W'$ is a sufficiently accurate approximation of $W$ for most applications. However, when $k$ cannot be chosen large enough (e.g. due to memory limitations when $n$ is on the order of millions of points), the connectivity of the $k$-nearest neighbor graph represented by $W'$ can be highly sensitive to noise, so that $W'$ overfits the data and models the overall structure poorly. In the extreme case, $W'$ can lead to a large number of disconnected components within each cluster, such that the smallest eigenvectors correspond to each of these disconnected components and not the true structure of the data. On the other hand, choosing a random $k'$-sized subset of the $k$ nearest neighbors, for $k' < k$, results in a graph with the same number of edges as the $k'$-nearest neighbor graph but which is much more likely to be connected within each cluster, and hence allows for spectral clustering (Figure 4). The latter strategy is a more effective “allocation” of the edges in the resource-limited setting.
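The link between connected components and the eigenvalue $0$ of $L_{sym}$ can be checked by hand on a toy graph. The sketch below is an illustration in plain Python, assuming a small dense weight matrix with arbitrary positive weights (a real pipeline would use sparse matrices and a Lanczos solver). It verifies that for each component $C$ the vector $D^{1/2} \mathbf{1}_C$ is annihilated by $L_{sym}$; after the row normalization described above, such vectors become constant on each component.

```python
import math


def sym_normalized_laplacian(W):
    """L_sym = I - D^{-1/2} W D^{-1/2} for a symmetric weight matrix W (list of lists)."""
    n = len(W)
    deg = [sum(row) for row in W]
    return [
        [(1.0 if i == j else 0.0) - W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def matvec(A, v):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


# Toy graph with two components {0, 1, 2} and {3, 4}; weights are arbitrary.
W = [
    [0.0, 1.0, 0.5, 0.0, 0.0],
    [1.0, 0.0, 2.0, 0.0, 0.0],
    [0.5, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 3.0, 0.0],
]
L_sym = sym_normalized_laplacian(W)
deg = [sum(row) for row in W]

for component in ([0, 1, 2], [3, 4]):
    # D^{1/2} times the indicator vector of the component.
    v = [math.sqrt(deg[i]) if i in component else 0.0 for i in range(len(W))]
    residual = max(abs(x) for x in matvec(L_sym, v))
    print(component, "max |L_sym v| =", residual)
```

Two components give two independent null vectors, so the eigenvalue $0$ has multiplicity at least two here; a fragmented nearest neighbor graph would inflate this multiplicity and hide the true cluster structure.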

### 2.2. Numerical Results.

We demonstrate the usefulness of this approach on the MNIST8M dataset generated by InfiMNIST , which provides an unlimited supply of handwritten digits derived from MNIST using random translations and permutations. For simplicity of visualization, we chose digits , resulting in a dataset of in dimensional space. We then computed the first ten principal components (PCs) using randomized principal component analysis and performed subsequent analysis on the PCs. Let denote the symmetrized Laplacian of the graph formed by connecting each point to its nearest neighbors, with each edge weighted using the Gaussian kernel from above with an adaptive bandwidth equal to the point’s distance to its th neighbor. Similarly, let refer to the Laplacian of the graph formed by connecting each point to a sized subset of its nearest neighbors, where each edge is weighted with a Gaussian of bandwidth equal to the squared distance to its th nearest neighbor. We then used the Lanczos iterations as implemented in MATLAB’s ‘eigs’ function to compute the first three eigenvectors of , , and which we plot in Figure 5.

Figure 5. Eigenvectors of sparse Graph Laplacian of three digits in the Infinite-MNIST data set. Connecting to $k$-nearest neighbors when $k$ is too small leads to catastrophic results (left). Connecting to 2 randomly chosen out of the 100-nearest neighbors is comparable to connecting to the 50 nearest neighbors, but it requires fewer edges and computing the top three eigenvectors is much faster (time under each plot).

The first three eigenvectors of do not separate the digits, nor do they reveal the underlying manifold on which the digits lie. Increasing the number of nearest neighbors in provides a meaningful embedding. Remarkably, the same quality embedding can be obtained with , despite it being a much sparser graph. Furthermore, computing the first three eigenvectors of took only 4 minutes, as compared to 48 and 16 minutes for and .

### 2.3. Potential Implementation

In order to apply this method to large datasets on resource-limited machines, an efficient algorithm for finding a $k'$-sized subset of the $k$ nearest neighbors of each point is needed. For the above experiments we simply computed all $k$ nearest neighbors and randomly subsampled, which is clearly suboptimal. Given a dataset so large that $k$ nearest neighbors cannot be computed, how can we find $k'$-sized random subsets of the $k$ nearest neighbors for each point? Interestingly, this corresponds to an “inaccurate” nearest neighbors algorithm, in that the “near” neighbors of each point are sought, not actually the “nearest.” From this perspective, it appears to be an easier problem than that of finding the nearest neighbors. We suggest a simple and fast implementation, which we have found to be empirically successful, in Algorithm 1.

On average $k'$ points from the chosen subset will be among the $k$ nearest neighbors of any point in the data set. As such, every point will connect to a $k'$-sized subset of its $k$ nearest neighbors. Choosing a single subset of points, however, dramatically reduces the randomness in the algorithm, and hence is not ideal. We include it here for its simplicity and its success in our preliminary experiments.
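Algorithm 1 itself is not reproduced in this text, so the following is only one possible reading of the description above: draw a single random subset containing a fraction $k'/k$ of the points and connect every point to its $k'$ nearest neighbors inside that subset. The function name, the brute-force search, and the subset-size formula are assumptions made for illustration.

```python
import math
import random


def subset_near_neighbor_edges(points, k, k_prime, rng):
    """Connect every point to its k_prime nearest neighbors within one random subset.

    The subset holds roughly n * k_prime / k points, so that on average
    k_prime of them fall among the k nearest neighbors of any given point.
    """
    n = len(points)
    subset_size = min(n, max(k_prime + 1, round(n * k_prime / k)))
    subset = rng.sample(range(n), subset_size)
    edges = []
    for i, p in enumerate(points):
        candidates = sorted(
            (j for j in subset if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        edges.extend((i, j) for j in candidates[:k_prime])
    return edges


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(300)]
edges = subset_near_neighbor_edges(points, k=20, k_prime=2, rng=rng)
print(len(edges), "edges, all pointing into a subset of",
      len({j for _, j in edges}), "distinct targets")
```

The nearest neighbor search now runs against the small subset only, which is where the savings come from; the price, as noted above, is that all edges point into the same subset, so the construction is less random than independent subsampling per point.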

### 2.4. Further outlook.

We demonstrate the application of our approach in the context of spectral clustering but this is only one example. There are a great many other methods of dimensionality reduction that start by constructing a graph that roughly approximates the data, for example t-distributed Stochastic Neighborhood Embedding (t-SNE) [16, 31], diffusion maps  or Laplacian Eigenmaps . Basically, this refinement could possibly be valuable for a very wide range of algorithms that construct graph approximations out of underlying point sets – determining the precise conditions under which this method is effective for which algorithm will strongly depend on the context, but needless to say, we consider the experiments shown in this section to be extremely encouraging. We believe that this paper suggests many possible directions for future research: are there other natural randomized near neighbor constructions (we refer to §5 for an example)? Other questions include the behavior of the spectrum and the induced random walk – here we would like to briefly point out that random graphs are natural expanders [14, 19, 21]. This should imply several additional degrees of stability that the standard nearest neighbor construction does not have.

## 3. Proof of Theorem 1

### 3.1. A simple lower bound

We start by showing that there exists a constant $\varepsilon_{d,k} > 0$ such that

$$\mathbb{E} X_n \ge \varepsilon_{d,k}\, n.$$

This shows that the number of connected components grows at least linearly.

###### Proof.

We assume that we are given a Poisson process with intensity $n$ in $[0,1]^d$.

It is possible to place balls of radius in such that the balls with the same center and radius do not intersect. (The implicit constant is related to the packing density of balls and decays rapidly in the dimension.) The probability of finding points is

$$\mathbb{P}(\Omega \text{ contains } \ell \text{ points}) = e^{-|\Omega| n} \frac{(|\Omega| n)^{\ell}}{\ell!}.$$

This implies that the likelihood of finding points in the ball of radius and 0 points in the spherical shell obtained from taking a ball with radius and removing the ball of radius is given by a fixed constant independent of (because these sets all have measure ). This implies the result since the events in these disjoint balls are independent. ∎

### 3.2. Bounding the degree of a node

Suppose we are given a set of points $\{x_1, \dots, x_n\} \subset \mathbb{R}^d$ and assume that every vertex is connected to its $k$-nearest neighbors.

Packing Problem. What is the maximal degree of a vertex in a $k$-nearest neighbor graph created by any set of points in $d$ dimensions?

We will now prove the existence of a constant $c_d$, depending only on the dimension, such that the maximum degree is bounded by $c_d k$. It is not difficult to see that this is the right order of growth in $k$: in $\mathbb{R}^d$, we can put a point in the origin and find a set of distinguished points at distance 1 from the origin and distance 1.1 from each other. Placing $k$ points close to each of the distinguished points yields a construction where the degree of the point in the origin is $\sim m k$, where $m$ is roughly the largest number of points one can place on the unit sphere so that each pair is at least $1.1$-separated.

###### Lemma 1.

The maximum degree of a vertex in a $k$-nearest neighbor graph on points in $\mathbb{R}^d$ is bounded from above by $c_d k$.

###### Proof.

In a $k$-nearest neighbor graph any node $x$ has at least $k$ edges since it connects to its $k$ nearest neighbors. It therefore suffices to bound the number of vertices that have $x$ among their $k$ nearest neighbors. Let now $C$ be a cone with apex in $x$ and opening angle $\alpha = \pi/3$.

Then, by definition, for any two points $a, b \in C$, we have that

$$\langle a - x, b - x \rangle \ge \frac{1}{2} \|a - x\| \|b - x\|.$$

We will now argue that if $a$ has a bigger distance to $x$ than $b$, then $a$ is closer to $b$ than to $x$. Formally, we want to show that $\|a - x\| > \|b - x\|$ implies $\|a - b\| < \|a - x\|$. We expand the scalar product and use the inequality above to write

$$\|a - b\|^2 = \|(a - x) - (b - x)\|^2 \le \|a - x\|^2 + \|b - x\|^2 - \|a - x\| \|b - x\| < \|a - x\|^2.$$

Now we proceed as follows: we cover $\mathbb{R}^d$ with cones of opening angle $\pi/3$ and apex in $x$. Then, clearly, the previous argument implies that every such cone can contain at most $k$ vertices different from $x$ that have $x$ as one of their $k$ nearest neighbors. $c_d$ can thus be chosen as one more than the smallest number of such cones needed to cover the space. ∎

A useful Corollary.

We will use this statement in the following way: we let the random variable $X_n$ denote the number of clusters of $n$ randomly chosen points w.r.t. some probability measure on $\mathbb{R}^d$ (we will apply it in the special case of uniform distribution on $[0,1]^d$ but the statement itself is actually true at a greater level of generality).

###### Corollary 1.

The expected number of clusters cannot grow dramatically; we have

$$\mathbb{E} X_{n+1} \le \mathbb{E} X_n + c_d k.$$

###### Proof.

We prove a stronger statement: for any given set of points $\{x_1, \dots, x_n\} \subset \mathbb{R}^d$ and any $x \in \mathbb{R}^d$, the number of clusters in $\{x_1, \dots, x_n, x\}$ is at most $c_d k$ larger than the number of clusters in $\{x_1, \dots, x_n\}$. Adding $x$ is going to induce a localized effect in the graph: the only new edges that are being created are those to the $k$-nearest neighbors of $x$ as well as changes coming from the fact that some of the points will now have $x$ as one of their $k$ nearest neighbors. We have already seen in the argument above that this number is bounded by $c_d k$. This means that at most $c_d k$ of the existing edges are being removed. Removing an edge can increase the number of clusters by at most 1 and this then implies the result. ∎

### 3.3. The diameter of connected components

The fact that most connected components are contained in a rather small region of space follows relatively quickly from Theorem 1 and the following consequence of the degree bound.

###### Lemma 2.

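Let $\{x_1, \dots, x_n\} \subset [0,1]^d$. The sum of the distances over all pairs where one is a $k$-nearest neighbor of the other is bounded by

$$\sum_{x_i, x_j \text{ knn}} \|x_i - x_j\| \lesssim_{k,d} n^{\frac{d-1}{d}}.$$

###### Proof.

Whenever $x_j$ is a $k$-nearest neighbor of $x_i$, we put a ball $B(x_i, \|x_j - x_i\|)$ of radius $\|x_j - x_i\|$ around $x_i$. A simple application of Hölder’s inequality shows that

$$\sum_{x_i, x_j \text{ knn}} \|x_i - x_j\| \le (kn)^{\frac{d-1}{d}} \left( \sum_{x_i, x_j \text{ knn}} \|x_i - x_j\|^d \right)^{\frac{1}{d}} \lesssim_{k,d} n^{\frac{d-1}{d}} \left( \sum_{x_i, x_j \text{ knn}} \left| B(x_i, \|x_j - x_i\|) \right| \right)^{\frac{1}{d}}.$$

Lemma 1 shows that each point in $[0,1]^d$ can be contained in at most $c_d k$ balls (otherwise adding a point would create a vertex with too large a degree). This implies

$$\sum_{x_i, x_j \text{ knn}} \left| B(x_i, \|x_j - x_i\|) \right| \le c_d\, k\, 5^d \lesssim_{k,d} 1$$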

and we have the desired result. ∎

We note that this result has an obvious similarity to classical deterministic upper bounds on the length of a traveling salesman path; we refer to [10, 24, 27] for examples. Nonetheless, while the statements are similar, the proof of this simple result here is quite different in style. It could be of interest to obtain some good upper bounds for this problem.

###### Corollary 2.

The diameter of a typical connected component is $\lesssim_{d,k} n^{-1/d}$.

This follows easily from the fact that we can bound the sum of the diameters of all connected components by the sum over all distances of $k$-nearest neighbors. Put differently, the typical cluster is actually contained in a rather small region of space; however, we do emphasize that the implicit constants (especially in the estimate on the number of clusters) are rather small and thus the implicit constant in this final diameter estimate is bound to be extremely large. This means that this phenomenon is also not usually encountered in practice even for a moderately large number of points. As for Lemma 2 itself, we can get much more precise results if we assume that the points stem from a Poisson process with intensity $n$ on $\mathbb{R}^d$. We prove a Lemma that is fairly standard; the special cases are easy to find (see e.g.  and references therein); we provide the general case for the convenience of the reader.

###### Lemma 3.

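The probability density function of the distance of a fixed point to its $k$-th nearest neighbor in a Poisson process with intensity $n$ is

$$f_{k,d}(r) = \frac{d\, n^k \omega_d^k\, r^{kd-1}}{(k-1)!} e^{-n \omega_d r^d}, \quad \text{where } \omega_d = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)}.$$

###### Proof.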

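The proof proceeds by induction; the base case is elementary. We derive the cumulative distribution function and then differentiate it. First, recall that for a Borel measurable region $B \subset \mathbb{R}^d$, the probability of finding $\ell$ points in $B$ is

$$\mathbb{P}(B \text{ contains } \ell \text{ points}) = e^{-n|B|} \frac{(n|B|)^{\ell}}{\ell!}.$$

Let $F_{k,d}(r)$ denote the probability that the $k$-th nearest neighbor is at distance at most $r$ and let, as usual, $B_r$ denote a ball of radius $r$. Then

$$F_{k,d}(r) = 1 - \sum_{\ell=0}^{k-1} \mathbb{P}(B_r \text{ contains } \ell \text{ points}) = 1 - \left( e^{-n \omega_d r^d} + \sum_{\ell=1}^{k-1} \frac{n^{\ell} \omega_d^{\ell} r^{\ell d}}{\ell!} e^{-n \omega_d r^d} \right).$$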

Differentiating in $r$ and summing a telescoping sum yields

$$f_{k,d}(r) = \frac{d\, n^k \omega_d^k\, r^{kd-1}}{(k-1)!} e^{-n \omega_d r^d}.$$
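For the reader's convenience, the telescoping step can be spelled out as follows, using the shorthand $\lambda(r) = n \omega_d r^d$ (a notation introduced here only for this computation), so that $\lambda'(r) = d\, n\, \omega_d\, r^{d-1}$:

$$\frac{d}{dr} \sum_{\ell=0}^{k-1} e^{-\lambda} \frac{\lambda^{\ell}}{\ell!} = \lambda'(r)\, e^{-\lambda} \left( \sum_{\ell=1}^{k-1} \frac{\lambda^{\ell-1}}{(\ell-1)!} - \sum_{\ell=0}^{k-1} \frac{\lambda^{\ell}}{\ell!} \right) = -\lambda'(r)\, e^{-\lambda} \frac{\lambda^{k-1}}{(k-1)!}.$$

All terms cancel in pairs except the last one. Since the cumulative distribution function is one minus this sum, its derivative is

$$\lambda'(r)\, e^{-\lambda} \frac{\lambda^{k-1}}{(k-1)!} = d\, n\, \omega_d\, r^{d-1} \cdot \frac{(n \omega_d r^d)^{k-1}}{(k-1)!}\, e^{-n \omega_d r^d} = \frac{d\, n^k \omega_d^k\, r^{kd-1}}{(k-1)!}\, e^{-n \omega_d r^d},$$

which is the stated density.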

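The distance to the $k$-th neighbor therefore has expectation

$$\int_0^{\infty} r f_{k,d}(r)\, dr = \int_0^{\infty} \frac{d\, n^k \omega_d^k\, r^{kd}}{(k-1)!} e^{-n \omega_d r^d}\, dr = \frac{\Gamma\left(k + \frac{1}{d}\right)}{\omega_d^{1/d} (k-1)!} \frac{1}{n^{1/d}}.$$

For example, in two dimensions, the expected distance to the first five nearest neighbors is

$$\frac{1}{2\sqrt{n}}, \quad \frac{3}{4\sqrt{n}}, \quad \frac{15}{16\sqrt{n}}, \quad \frac{35}{32\sqrt{n}}, \quad \frac{315}{256\sqrt{n}}, \quad \dots$$

respectively. We note, but do not prove or further pursue, that the sequence has some amusing properties and seems to be given (up to a factor of 2 in the denominator) by the series expansion

$$(1 - x)^{-\frac{3}{2}} = 1 + \frac{3}{2} x + \frac{15}{8} x^2 + \frac{35}{16} x^3 + \frac{315}{128} x^4 + \frac{693}{256} x^5 + \frac{3003}{1024} x^6 + \dots$$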
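The closed form for the expected distance and the observation about the series can both be checked numerically. The sketch below evaluates $\Gamma(k + 1/d) / (\omega_d^{1/d} (k-1)!\, n^{1/d})$ with the standard library and compares it, for $d = 2$, with the five listed fractions; it also computes half of the Taylor coefficients of $(1-x)^{-3/2}$ exactly. The function names and the choice $n = 10000$ are illustrative assumptions.

```python
import math
from fractions import Fraction


def expected_knn_distance(k, d, n):
    """Expected distance to the k-th nearest neighbor, Poisson intensity n in R^d."""
    omega_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # volume of the unit ball
    return math.gamma(k + 1 / d) / (omega_d ** (1 / d) * math.factorial(k - 1) * n ** (1 / d))


def half_series_coefficient(m):
    """Half of the coefficient of x^m in the expansion of (1 - x)^(-3/2)."""
    coeff = Fraction(1)
    for i in range(1, m + 1):
        coeff *= Fraction(2 * i + 1, 2 * i)
    return coeff / 2


n = 10_000
listed = [Fraction(1, 2), Fraction(3, 4), Fraction(15, 16), Fraction(35, 32), Fraction(315, 256)]
for k, frac in enumerate(listed, start=1):
    closed_form = expected_knn_distance(k, 2, n)
    from_list = float(frac) / math.sqrt(n)
    print(k, closed_form, from_list, half_series_coefficient(k - 1) == frac)
```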

### 3.4. A Separation Lemma.

The proof of Theorem 1 shares a certain similarity with arguments that one usually encounters in subadditive Euclidean functional theory (we refer to the seminal work of Steele [25, 26]). The major difference is that our functional, the number of connected components, is scaling invariant, and slightly more troublesome, not monotone: adding a point can decrease the number of connected components. Suppose now that we are dealing with points and try to group them into sets of points each. Here, one should think of as a very large constant and . Ideally, we would like to argue that the number of connected components among the is smaller than the sum of the connected components of each of the sets of points. This, however, need not be generally true (see Fig. 9).

###### Lemma 4 (Separation Lemma.).

For every , fixed, there exists such that, for all sufficiently large, we can subdivide into sets of the same volume whose combined volume is at least such that the expected number of connected components of points following a Poisson distribution of intensity in (each connecting to its nearest neighbors) is the sum of connected components of each piece with error .

###### Proof.

Recall that Lemma 2 states that for all sets of points

$$\sum_{x_i, x_j \text{ knn}} \|x_i - x_j\| \le c \cdot n^{\frac{d-1}{d}},$$

where the implicit constant depends on the dimension and but on nothing else. Let us now fix and see how to obtain such a decomposition. We start by decomposing the entire cube into cubes of width . This is followed by merging cubes in a cube-like fashion starting a corner. We then leave a strip of width cubes in all directions and keep constructing bigger cubes assembled out of smaller cubes and separated by strips in this manner. We observe that is a fixed constant: in particular, by making sufficiently large, the volume of the big cubes can be made to add up to of the total volume.

A typical realization of will now have, if is sufficiently large, an arbitrarily small portion of points in the strips. These points may add or destroy clusters in the separate cube: in the worst case, each single point is a connected component in itself (which, of course, cannot happen but suffices for this argument), which would change the total count by an arbitrarily small factor. Or, in the other direction, these points, once disregarded, might lead to the separation of many connected components; each deleted edge can only create one additional component and each vertex has a uniformly bounded number of edges, which leads to the same error estimate as above. However, there might also be a separation of edges that connected two points in big cubes. Certainly, appealing to the total length, this number will satisfy

$$\#\{\text{edges connecting different } m\text{-cubes}\} \le \varepsilon^2 \cdot n.$$

Since $\varepsilon$ was arbitrarily small, the result follows. ∎

### 3.5. Proof of Theorem 1

###### Proof.

We first fix the notation: let $X_n$ denote the number of connected components of points drawn from a Poisson process with intensity $n$ in $[0,1]^d$ where each point is connected to its $k$ nearest neighbors. The proof has several different steps. A rough summary is as follows.

1. We have already seen that . This implies that if the limit does not exist, then is sometimes large and sometimes small. Corollary 1 implies that if is small, then cannot be much larger as long as . The scaling shows that can actually be chosen to grow linearly with . This means that whenever is small, we actually get an entire interval where that number is rather small and can grow linearly in .

2. The next step is a decomposition of into many smaller cubes such that each set has an expected value of points. It is easy to see with standard bounds that most sets will end up with points. Since can grow linearly with , this is in the interval with likelihood close to 1.

3. The final step is to show that the sum of the clusters is pretty close to the sum of the clusters in each separate block (this is where the separation Lemma comes in). This then concludes the argument and shows that for all sufficiently large number of points , we end up having The small error decreases for larger and larger values of and this will end up implying the result.

Step 1. We have already seen that

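$$\varepsilon_{d,k}\, n \le \mathbb{E} X_n \le n,$$

where the upper estimate is, of course, trivial. The main idea can be summarized as follows: if the statement were to fail, then both quantities

$$\underline{a} := \liminf_{n \to \infty} \frac{\mathbb{E} X_n}{n} \quad \text{and} \quad \overline{a} := \limsup_{n \to \infty} \frac{\mathbb{E} X_n}{n}$$

exist, are positive, and satisfy $\underline{a} < \overline{a}$. This implies that we can find arbitrarily large numbers $n$ for which $\mathbb{E} X_n / n$ is quite small (i.e. close to $\underline{a}$). We set, for convenience, $\delta := \overline{a} - \underline{a}$. By definition, there exist arbitrarily large values of $n$ such that

$$\mathbb{E} X_n \le \left( \underline{a} + \frac{\delta}{10} \right) n.$$

It follows then, from Corollary 1, that

$$\mathbb{E} X_{n+m} \le \left( \underline{a} + \frac{\delta}{10} \right) n + c_d k\, m \le \left( \underline{a} + \frac{\delta}{5} \right) (n + m) \quad \text{for all } m \le \frac{\delta n}{10\, c_d k} =: m_0.$$

This means that for all bad values $n$, all the values $n + m$ with $0 \le m \le m_0$ are still guaranteed to be very far away from achieving any value close to $\overline{a}$. We also note explicitly that the value of $m_0$ can be chosen as a fixed proportion of $n$ independently of the size of $n$, i.e. $m_0$ is growing linearly with $n$.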

Step 2. Let us now consider a Poisson distribution with intensity being given by It is easy to see that, for every and all sufficiently large (depending only on )

$$\mathbb{P}(n \le P_{\lambda} \le n + m_0) = e^{-\lambda} \sum_{i=n}^{n+m_0} \frac{\lambda^i}{i!} \ge 1 - \varepsilon.$$

This follows immediately from the fact that the variance of a Poisson distribution with intensity $\lambda$ is $\lambda$ and the classical Chebyshev inequality arguments. Indeed, much stronger results are true since the scaling of a standard deviation is at the square root of the sample size and we have an interval growing linearly in the sample size; moreover, there are large deviation tail bounds, so one could really show massively stronger quantitative results but these are not required here. We now set

and henceforth only consider values of that are so large that the above inequality holds.

Step 3. When dealing with a Poisson process of intensity , we can decompose, using the Separation Lemma, the unit cube into disjoint, separated cubes with volume with a volume error of size (due to not enough little cubes fitting exactly). When considering the effect of this Poisson process inside a small cube, we see that with very high likelihood (), the number of points is in the interval . The Separation Lemma moreover guarantees that the number of points that end up between the little cubes (‘fall between the cracks’) is as small a proportion of as we wish provided is sufficiently large. Let us now assume that the total number of connected components among the points is exactly the same as the sum of the connected components in the little cubes. Then we would get that

$$\mathbb{E} X_N \leq \left(\underline{a} + \frac{\delta}{5}\right) N.$$

This is not entirely accurate: there are points that fell between the cracks (with sufficiently small if is sufficiently large) and there are points that end up in cubes that have a total number of points outside the regime. However, any single point may only add new clusters and thus

$$\mathbb{E} X_N \leq \left(\underline{a} + \frac{\delta}{5}\right) N + c_d k (\varepsilon + \varepsilon_2) N$$

and by making (possibly by increasing ), we obtain

$$\overline{a} \leq \underline{a} + \frac{2\delta}{5},$$

which is a contradiction, since . ∎

## 4. Proof of Theorem 2

### 4.1. The Erdős-Renyi Lemma.

Before embarking on the proof, we describe a short statement. The proof is not subtle and follows along completely classical lines but occurs in an unfamiliar regime: we are interested in ensuring that the likelihood of obtaining a disconnected graph is very small. The subsequent argument, which is not new but not easy to immediately spot in the literature, is included for the convenience of the reader (much stronger and more subtle results usually focus on the threshold ).

###### Lemma 5.

Let an Erdős-Renyi graph with Then, for sufficiently large,

$$\mathbb{P}\left(G(n,p) \text{ is disconnected}\right) \lesssim e^{-pn/3}.$$
###### Proof.

It suffices to bound the likelihood of finding an isolated set of vertices from above, where . For any fixed set of vertices of it being isolated is bounded from above by

$$\mathbb{P}\left(\text{fixed set of } k \text{ vertices being disconnected}\right) \leq (1-p)^{k(n-k)}$$

and thus, using the union bound,

$$\mathbb{P}\left(G(n,p) \text{ is disconnected}\right) \leq \sum_{k=1}^{n/2} \binom{n}{k} (1-p)^{k(n-k)}.$$

We use

$$\binom{n}{k} \leq \left(\frac{ne}{k}\right)^k$$

to rewrite the expression as

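$$
\begin{aligned}
\sum_{k=1}^{n/2} \binom{n}{k} (1-p)^{k(n-k)} &\leq \sum_{k=1}^{n/2} e^{k + k \log n + k \log k + [\log(1-p)] k (n-k)} \\
&\leq \sum_{k=1}^{n/2} e^{k \left(3 \log n + [\log(1-p)](n-k)\right)} \\
&\leq \sum_{k=1}^{n/2} e^{k \left(3 \log n + [\log(1-p)](n/2)\right)} \\
&\lesssim e^{3 \log n + [\log(1-p)] n/2},
\end{aligned}
$$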

where the last step is merely the summation of a geometric series and valid as soon as

$$3 \log n + [\log(1-p)] \frac{n}{2} < 0,$$

which is eventually true for sufficiently large since . ∎
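
The union bound from the proof can be evaluated directly and compared with simulations. The sketch below is only an illustration under our own choices: the function names, the union-find connectivity check and the parameter values used in any experiment are assumptions and do not appear in the argument above.

```python
import math
import random


def union_bound(n, p):
    """Sum over k = 1..n/2 of C(n, k) * (1 - p)^(k (n - k)), for 0 < p < 1.

    Terms are evaluated in log-space since the binomial coefficients are huge.
    """
    total = 0.0
    for k in range(1, n // 2 + 1):
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                     - math.lgamma(n - k + 1))
        total += math.exp(log_binom + k * (n - k) * math.log1p(-p))
    return total


def is_connected(n, edges):
    """Union-find check whether the graph on vertices 0..n-1 is connected."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components == 1


def sample_gnp(n, p, rng):
    """Sample the edge list of an Erdős-Renyi graph G(n, p)."""
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if rng.random() < p]


def disconnected_frequency(n, p, trials, seed=0):
    """Empirical frequency of disconnected samples of G(n, p)."""
    rng = random.Random(seed)
    bad = sum(not is_connected(n, sample_gnp(n, p, rng))
              for _ in range(trials))
    return bad / trials
```

For moderately large `n` and fixed `p`, one can compare `union_bound(n, p)` with `math.exp(-p * n / 3)` and with `disconnected_frequency(n, p, trials)` to see how much room the estimate leaves.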

### 4.2. A Mini-Percolation Lemma

The purpose of this section is to derive rough bounds for a percolation-type problem.

###### Lemma 6.

Suppose we are given a grid graph on and remove each of the points with likelihood for some . Then, for sufficiently large, there is a giant component with expected size .

The problem seems so natural that some version of it must surely be known. It seems to be dual to classical percolation problems (in the sense that one randomly deletes vertices instead of edges). It is tempting to believe that the statement remains valid for all the way up to some critical exponent that depends on the dimension (and grows as the dimension gets larger). Before embarking on a proof, we show a separate result. We will call a subset connected if the resulting graph is connected: here, edges are given by connecting every node to all of its adjacent nodes that differ by at most one in each coordinate (that number is bounded from above by ).

###### Lemma 7.

The number of connected components in the grid graph over with cardinality is bounded from above

$$\#\left\{\text{connected components of size } \ell\right\} \leq n^d \left(2^{3^d - 1}\right)^{\ell}.$$
###### Proof.

The proof proceeds in a fairly standard way by constructing a combinatorial encoding. We show how this is done in two dimensions, giving an upper bound of – the construction immediately transfers to higher dimensions in the obvious way.

The encoding is given by a direct algorithm.

1. Pick an initial vertex . Describe which of the 8 adjacent squares are occupied by picking a subset of .

2. Implement a depth-first search as follows: pick the smallest number in the set attached to and describe its neighbors, if any, that are distinct from previously selected nodes as a subset of .

3. Repeat until all existing neighbors have been mapped out (the attached set is the empty set) and then go back and describe the next branch.

Just for clarification, we quickly show the algorithm in practice. Suppose we are given an initial point and the sequence of sets

 {4,5},{3,4},{},{},{5},{4},{},

then this uniquely identifies the set showing in Figure 12.

Clearly, this description returns subsets of of which there are 256. Every element in generates exactly one such subset and every connected component can thus be described by giving the initial points and then a list of subsets of . This implies the desired statement; we note that the actual number should be much smaller since this way of describing connected components has massive amounts of redundancy and overcounting. ∎
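
To get a feeling for how much the encoding overcounts, one can enumerate connected subsets of a very small grid by brute force and compare with the bound of Lemma 7. The sketch below is an illustration only: it checks every subset of the given size, so it is feasible only for tiny grids, and all function names are our own.

```python
from itertools import combinations, product


def neighbors(cell, n):
    """Grid cells differing from `cell` by at most one in each coordinate."""
    for delta in product((-1, 0, 1), repeat=len(cell)):
        if any(delta):  # skip the zero offset (the cell itself)
            nb = tuple(c + d for c, d in zip(cell, delta))
            if all(0 <= x < n for x in nb):
                yield nb


def is_connected_set(cells, n):
    """Depth-first search restricted to `cells`."""
    cells = set(cells)
    start = next(iter(cells))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for nb in neighbors(current, n):
            if nb in cells and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(cells)


def count_connected_sets(n, d, size):
    """Number of connected subsets of {0,...,n-1}^d with `size` elements."""
    grid = list(product(range(n), repeat=d))
    return sum(is_connected_set(s, n) for s in combinations(grid, size))


def encoding_bound(n, d, size):
    """The upper bound n^d * (2^(3^d - 1))^size from Lemma 7."""
    return n ** d * (2 ** (3 ** d - 1)) ** size
```

Comparing `count_connected_sets(n, d, size)` with `encoding_bound(n, d, size)` for small values illustrates the redundancy mentioned at the end of the proof.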

###### Proof of Lemma 6.

The proof is actually fairly lossy and proceeds by massive overcounting. The only way to remove mass from the giant block is to remove points in an organized manner: adjacent squares have to be removed in a way that encloses a number of squares that are not removed (see Fig. 13).

The next question is how many other points can possibly be captured by a connected component on nodes. The isoperimetric principle implies

$$\#\left\{\text{blocks captured by } \ell \text{ nodes}\right\} \lesssim_d \ell^{\frac{d}{d-1}} \leq \ell^2.$$

Altogether, this implies we expect to capture at most

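$$\sum_{\ell=1}^{n^d} n^d \left(2^{3^d - 1}\right)^{\ell} (\log n)^{-c\ell} \, \ell^2 \leq n^d \sum_{\ell=1}^{\infty} \left(\frac{2^{3^d - 1}}{(\log n)^c}\right)^{\ell} \ell^2 \lesssim \frac{n^d}{(\log n)^c},$$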

where the last inequality holds as soon as and follows from the derivative geometric series

$$\sum_{\ell=1}^{\infty} \ell^2 q^{\ell} = \frac{q(1+q)}{(1-q)^3} \qquad \text{whenever } |q| < 1.$$
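
The closed form of the series can be sanity-checked numerically. The snippet below is a minimal sketch; the truncation length `terms` is an arbitrary choice of ours, and the check is only meaningful for $|q|$ not too close to 1.

```python
def partial_sum(q, terms=10_000):
    """Truncated series: sum of l^2 * q^l for l = 1..terms."""
    return sum(l * l * q ** l for l in range(1, terms + 1))


def closed_form(q):
    """Closed form q (1 + q) / (1 - q)^3, valid for |q| < 1."""
    return q * (1 + q) / (1 - q) ** 3
```

In the proof the role of $q$ is played by the ratio appearing in the preceding sum, which is small once $n$ is large, so the series is dominated by its first term.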

Remark. There are two spots where the argument is fairly lossy. First of all, every connected component on nodes is, generically, counted as connected components of length , as connected components of size and so on. The second part of the argument is the application of the isoperimetric inequality: a generic connected component on nodes will capture other nodes. These problems seem incredibly close to existing research and it seems likely that they either have been answered already or that techniques from percolation theory might provide rather immediate improvements.

### 4.3. Outline of the Proof

The proof proceeds in three steps.

1. Partition the unit cube into smaller cubes such that each small cube has an expected number of points (and thus, the number of cubes is ). Show that the likelihood of a single cube containing significantly more or significantly fewer points is small.

2. Show that graphs within the cube are connected with high probability.

3. Show that there are connections between the cubes that ensure connectivity.

### 4.4. Step 1.

We start by partitioning in the canonical manner into axis-parallel cubes having side-length for some constant to be chosen later. There are roughly cubes and they have measure . We start by bounding the likelihood of one such cube containing points. Clearly, the number of points in a fixed cube is a binomial random variable

$$\text{number of points in cube} = B\left(n, \frac{c \log n}{n}\right).$$

The Chernoff-Hoeffding theorem  implies

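$$\mathbb{P}\left(B\left(n, \frac{c \log n}{n}\right) \leq \frac{\log n}{100}\right) \leq \exp\left(-n \, D\left(\frac{\log n}{100 n} \,\Big\|\, \frac{c \log n}{n}\right)\right),$$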

where is the relative entropy

$$D(a \| b) = a \log\frac{a}{b} + (1-a) \log\frac{1-a}{1-b}.$$

Here, we have, for large,

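$$D\left(\frac{\log n}{100 n} \,\Big\|\, \frac{c \log n}{n}\right) \sim \frac{\log n}{n} \left(c - \frac{1}{100} + \frac{1}{100} \log\frac{1}{100 c}\right).$$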

This implies that for sufficiently large, we have

$$\mathbb{P}\left(\text{fixed cube has less than } \frac{\log n}{100} \text{ points}\right) \lesssim_{c,\varepsilon} \frac{1}{n^{c-\varepsilon}}$$

and the union bound implies

$$\mathbb{P}\left(\text{there exists cube that has less than } \frac{\log n}{100} \text{ points}\right) \lesssim_{c,\varepsilon} \frac{1}{n^{c-1-\varepsilon}}$$

The same argument also shows that

$$\mathbb{P}\left(\text{exists cube with more than } 10 c \log n \text{ points}\right) \lesssim \frac{1}{n^{c}}.$$

This means we have established the existence of a constant such that with likelihood tending to 1 as (at arbitrary inverse polynomial speed provided is big enough)

$$\forall \text{ cubes } Q: \qquad \frac{\log n}{100} \leq \#\left\{\text{points in } Q\right\} \leq 10 c \log n.$$

We henceforth only deal with cases where these inequalities are satisfied for all cubes.
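
The asymptotic expression for the relative entropy used in Step 1 can be compared with the exact value. The following sketch does this for the lower-tail exponent only; the function names and the sample values of `n` and `c` are our own choices for illustration.

```python
import math


def relative_entropy(a, b):
    """D(a || b) = a log(a/b) + (1 - a) log((1 - a)/(1 - b)), 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))


def chernoff_exponent(n, c):
    """Exact exponent n * D(log n / (100 n) || c log n / n)."""
    x = math.log(n) / n
    return n * relative_entropy(x / 100, c * x)


def asymptotic_exponent(n, c):
    """Leading-order expression log n * (c - 1/100 + (1/100) log(1/(100 c)))."""
    return math.log(n) * (c - 1 / 100 + (1 / 100) * math.log(1 / (100 * c)))
```

For large `n` the ratio `chernoff_exponent(n, c) / asymptotic_exponent(n, c)` should be close to 1, and dividing either exponent by `math.log(n)` gives the power of $n$ in the resulting tail bound for a single cube.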

### 4.5. Step 2.

We now study what happens within a fixed cube . The cube is surrounded by at most other cubes each of which contains at most points. This means that if, for any , we compile a list of its nearest neighbours, we are guaranteed that every other element in is on that list. Let us suppose that the rule is that each point is connected to each of its nearest neighbors with likelihood

$$p = \frac{m}{3^d \, 10 c \log n}.$$

Then, Lemma 5 implies that for the likelihood of obtaining a connected graph strictly within is at least . Lemma 6 then implies the result provided we can ensure that points in cubes connect to their neighboring cubes.

### 4.6. Step 3.

We now establish that the likelihood of a cube having, for every adjacent cube , a point that connects to a point in is large. The adjacent cube has points. The likelihood of a fixed point in not connecting to any point in is

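$$\leq \left(1 - \frac{\frac{\log n}{100}}{3^d \, 10 c \log n}\right)^{c \log\log n} = \left(1 - \frac{1}{3^d \, 1000 c}\right)^{c \log\log n} \lesssim (\log n)^{-\varepsilon_{c,d}}.$$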

The likelihood that this is indeed true for every point is then bounded from above by

$$(\log n)^{-\varepsilon_{c,d} \log n / 100} \lesssim n^{-1},$$

which means, appealing again to the union bound, that this event occurs with a likelihood going to 0 as .

Connectedness. It is not difficult to see that this graph is unlikely to be connected. For a fixed vertex , there are possible other vertices it could connect to and other vertices might possibly connect to . Thus

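$$\mathbb{P}\left(v \text{ is isolated}\right) \lesssim \left(1 - \frac{c_2 \log\log n}{\log n}\right)^{c_3 \log n} \leq e^{-c_2 c_3 \log\log n} = \frac{1}{(\log n)^{c_2 c_3}}.$$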

This shows that we can expect at least isolated vertices. This also shows that the main obstruction to connectedness is the nontrivial likelihood of vertices not forming edges to other vertices. This suggests a possible variation of the graph construction that is discussed in the next section.
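
The interplay between a giant component and a small number of isolated vertices can be observed in a simple simulation. The sketch below is only an illustration under several assumptions of ours: it works in dimension 2, finds nearest neighbors by brute force (quadratic cost), and treats the number of neighbors `k` and the connection probability `p` as free parameters rather than the specific scalings used in the proof.

```python
import random


def simulate_graph(n, k, p, seed=0):
    """Sample n uniform points in the unit square; every point connects to
    each of its k nearest neighbors independently with probability p.

    Returns (size of largest component / n, number of isolated vertices).
    Requires 1 <= k <= n - 1.
    """
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    parent = list(range(n))
    size = [1] * n
    degree = [0] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        parent[rb] = ra
        size[ra] += size[rb]

    for i, (xi, yi) in enumerate(pts):
        # Brute-force nearest neighbors: sort all other points by distance.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: (pts[j][0] - xi) ** 2 + (pts[j][1] - yi) ** 2)
        for j in others[:k]:
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
                union(i, j)

    giant = max(size[find(i)] for i in range(n))
    isolated = sum(1 for deg in degree if deg == 0)
    return giant / n, isolated
```

Running `simulate_graph` for increasing `n` with `k` of order $\log n$ and small `p` gives a rough picture of how many vertices remain outside the largest component and how many of those are isolated.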

## 5. An Ulam-type modification

There is an interesting precursor to the Erdős-Renyi graph that traces back to a question of Stanislaw Ulam in the Scottish Book.

Problem 38: Ulam. Let there be given elements (persons). To each element we attach others among the given at random (these are friends of a given person). What is the probability that from every element one can get to every other element through a chain of mutual friends? (The relation of friendship is not necessarily symmetric!) Find (0 or 1?). (Scottish Book, )

We quickly establish a basic variant of the Ulam-type question sketched in the introduction since the argument itself is rather elementary. It is a natural variation on the Ulam question (friendship now being symmetric) and the usual Erdős-Renyi argument applies. A harder problem (start by constructing a directed graph, every vertex forms an outgoing edge to other randomly chosen vertices, and then construct an undirected graph by including edges where both and are in the directed graph) was solved by Jungreis .

Question. If we are given randomly chosen points in and connect each vertex to exactly of its nearest neighbors, is the arising graph connected with high probability?

We have the following basic Lemma that improves on the tendency of Erdős-Renyi graphs to form small disconnected components.