Reconstructing Latent Orderings by Spectral Clustering

07/18/2018 ∙ by Antoine Recanati, et al.

Spectral clustering uses a graph Laplacian spectral embedding to enhance the cluster structure of some data sets. When the embedding is one-dimensional, it can be used to sort the items (spectral ordering). A number of empirical results also suggest that a multidimensional Laplacian embedding enhances the latent ordering of the data, if any. This also extends to circular orderings, a case where unidimensional embeddings fail. We tackle the task of retrieving linear and circular orderings in a unifying framework, and show how a latent ordering on the data translates into a filamentary structure on the Laplacian embedding. We propose a method to recover it, illustrated with numerical experiments on synthetic data and real DNA sequencing data. The code and experiments are available at https://github.com/antrec/mdso.


1 Introduction

The seriation problem seeks to recover a latent ordering from similarity information. We typically observe a matrix measuring pairwise similarity between a set of elements and assume they have a serial structure, i.e. they can be ordered along a chain where the similarity between elements decreases with their distance within this chain. In practice, we observe a random permutation of this similarity matrix, where the elements are not indexed according to that latent ordering. Seriation then seeks to find that global latent ordering using only (local) pairwise similarity.

Seriation was introduced in archaeology to find the chronological order of a set of graves. Each contained artifacts, assumed to be specific to a given time period. The number of common artifacts between two graves defines their similarity, resulting in a chronological ordering where two contiguous graves belong to the same time period. It also has applications in, e.g., envelope reduction (Barn95), bioinformatics (atkins1996physical; cheema2010thread; jones2012anges) and DNA sequencing (Meid98; Garr11; recanati2016spectral).

In some applications, the latent ordering is circular. For instance, in de novo genome assembly of bacteria, one has to reorder DNA fragments subsampled from a circular genome.

In biology, a cell evolves according to a cycle: a newborn cell passes through diverse states (growth, DNA-replication, etc.) before dividing itself into two newborn cells, hence closing the loop. Problems of interest then involve collecting cycle-dependent data on a population of cells at various, unknown stages of the cell-cycle, and trying to order the cells according to their cell-cycle stage. Such data include gene expression (liu2017reconstructing), or DNA 3D conformation data (liu2018unsupervised). In planar tomographic reconstruction, the shape of an object is inferred from projections taken at unknown angles between 0 and 2π. Recovering the ordering of the angles then enables the tomographic reconstruction (coifman2008graph).

The main structural hypothesis on similarity matrices related to seriation is the concept of R-matrix, which we introduce below, together with its circular counterpart.

Definition 1.1.

We say that A ∈ S_n is an R-matrix (or Robinson matrix) iff it is symmetric and satisfies A_{i,j} ≤ A_{i,j+1} and A_{i+1,j} ≤ A_{i,j} in the lower triangle, where 1 ≤ j < i ≤ n.

Definition 1.2.

We say that A ∈ S_n is a circular R-matrix iff it is symmetric and satisfies, for all i ∈ [n], that the sequences (A_{i,i}, A_{i,i+1}, …, A_{i,n}) and (A_{i,i}, A_{i,i−1}, …, A_{i,1}) are unimodal: they decrease to a minimum and then increase.

Here S_n is the set of real symmetric matrices of dimension n. Definition 1.1 states that when moving away from the diagonal in a given row or column of A, the entries are non-increasing, whereas in Def. 1.2, the non-increase is followed by a non-decrease. For instance, the proximity matrix of points embedded on a circle follows Def. 1.2. Figure 1 displays examples of such matrices.
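As a quick illustration, here is a minimal Python/NumPy sketch of a checker for Definition 1.1 (the function name is ours, not from the paper's code):

```python
# A minimal sketch of a checker for Definition 1.1 (R-matrix): entries must be
# non-increasing when moving away from the diagonal along each row.
import numpy as np

def is_robinson(A, tol=1e-12):
    """Return True if the symmetric matrix A satisfies the Robinson property."""
    A = np.asarray(A)
    if not np.allclose(A, A.T):
        return False
    for i in range(A.shape[0]):
        left = A[i, :i + 1]    # entries left of the diagonal: non-decreasing
        right = A[i, i:]       # entries right of the diagonal: non-increasing
        if np.any(np.diff(left) < -tol) or np.any(np.diff(right) > tol):
            return False
    return True

# Example: a matrix with exponentially decaying entries is Robinson.
n, rho = 10, 0.5
A = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
assert is_robinson(A)
```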

(a) R-matrix
(b) circular R-matrix
(c) permuted R-matrix
Figure 1: From left to right: an R-matrix (a), a circular R-matrix (b), and a randomly permuted observation of an R-matrix (c). Seriation seeks to recover (a) from its permuted observation (c).

In what follows, we write R_n (resp., CR_n) for the set of R (resp., circular-R) matrices of size n, and P_n for the set of permutations of n elements. A permutation can be represented by a vector π (lower case) or a matrix Π ∈ {0,1}^{n×n} (upper case) defined by Π_{ij} = 1 iff π(i) = j. We refer to both representations by π and may omit the subscript n whenever the dimension is clear from the context. We say that A is pre-R (resp., pre-CR) if there exists a permutation π such that the matrix Π A Π^T (whose entry (i, j) is A_{π(i),π(j)}) is in R_n (resp., CR_n). Given such A, Seriation seeks to recover this permutation π,

find π ∈ P_n such that Π A Π^T ∈ R_n    (Linear Seriation)
find π ∈ P_n such that Π A Π^T ∈ CR_n    (Circular Seriation)

A widely used method for Linear Seriation is a spectral relaxation based on the graph Laplacian of the similarity matrix. It transposes Spectral Clustering (von2007tutorial) to the case where we wish to infer a latent ordering rather than a latent clustering on the data. Roughly speaking, both methods embed the elements on a line and associate a coordinate f_i to each element i. Spectral Clustering addresses a graph-cut problem by grouping these coordinates into two clusters. Spectral Ordering (Atkins) addresses Linear Seriation by sorting the f_i.

Most Spectral Clustering algorithms actually use a Laplacian embedding of dimension d > 1, denoted d-LE in the following. The latent cluster structure is assumed to be enhanced in the d-LE, and the k-means algorithm (macqueen1967some; hastie2009unsupervised) seamlessly identifies the clusters from the embedding. In contrast, Spectral Ordering is restricted to d = 1 by the sorting step (there is no total order relation on R^d for d > 1). Still, the latent linear structure may emerge from the d-LE, if the points are distributed along a curve. Also, for d ≥ 2, it may capture the circular structure of the data and allow for solving Circular Seriation. One must then recover a (circular) ordering of points lying in a one-dimensional manifold (a curve, or filament) embedded in R^d.

In Section 2, we review the Spectral Ordering algorithm and the Laplacian embedding used in Spectral Clustering. We mention graph-walk perspectives on this embedding and how it relates to dimensionality reduction techniques. Finally, we recall how these perspectives relate the discrete Laplacian to continuous Laplacian operators, providing insights about the curve structure of the Laplacian embedding through the spectrum of the limit operators. These asymptotic results were used to infer circular orderings in a tomography application in, e.g., coifman2008graph. In Section 3, we evidence the filamentary structure of the Laplacian embedding, and provide theoretical guarantees about the Laplacian-embedding-based method for Circular Seriation. We then propose a method in Section 4 to leverage the multidimensional Laplacian embedding in the context of Linear Seriation and Circular Seriation. Finally, we present numerical experiments illustrating how the spectral method gains in robustness by using a multidimensional Laplacian embedding.

2 Related Work

2.1 Spectral Ordering for Linear Seriation

Linear Seriation can be addressed with a spectral relaxation of the following combinatorial problem,

minimize_{π ∈ P_n} Σ_{i,j=1}^{n} A_{ij} (π(i) − π(j))²    (2-SUM)

Intuitively, the optimal permutation compensates high similarity values A_{ij} with small squared differences (π(i) − π(j))², thus laying similar elements nearby. For any f ∈ R^n, the objective of 2-SUM can be written as a quadratic form (with simple algebra using the symmetry of A, see von2007tutorial),

Σ_{i,j=1}^{n} A_{ij} (f_i − f_j)² = 2 f^T L_A f,    (1)

where L_A = diag(A1) − A is the graph Laplacian of A. From (1), L_A is positive semi-definite when A has non-negative entries, and 1 = (1, …, 1)^T is an eigenvector associated to the eigenvalue 0.

The spectral method drops the constraint π ∈ P_n in 2-SUM and enforces only norm and orthogonality constraints, f^T 1 = 0 and ∥f∥_2 = 1, to avoid the trivial solutions f = 0 and f ∝ 1, yielding,

minimize_{f ∈ R^n : f^T 1 = 0, ∥f∥_2 = 1} f^T L_A f    (Relax. 2-SUM)

This is an eigenvalue problem on L_A, solved by f^{(1)}, the eigenvector associated to the second smallest eigenvalue λ_1 of L_A. If the graph defined by A is connected (which we assume in the following), then λ_1 > 0. From f^{(1)}, one can recover a permutation by sorting its entries. The spectral relaxation of 2-SUM is summarized in Algorithm 1. For pre-R matrices, Linear Seriation is equivalent to 2-SUM (Fogel), and can be solved with Algorithm 1 (Atkins), as stated in Theorem 2.1.

Input: Connected similarity matrix A ∈ S_n
1: Compute Laplacian L_A = diag(A1) − A
2: Compute second smallest eigenvector of L_A, f^{(1)}
3: Sort the values of f^{(1)}
Output: Permutation π such that f^{(1)}_{π(1)} ≤ … ≤ f^{(1)}_{π(n)}
Algorithm 1: Spectral ordering (Atkins)
Theorem 2.1 (Atkins).

If A is a pre-R matrix, then Algorithm 1 recovers a permutation π such that Π A Π^T ∈ R_n, i.e., it solves Linear Seriation.
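For concreteness, here is a minimal NumPy sketch of Algorithm 1 (the names are ours; the reference implementation is in the mdso package linked above):

```python
# A minimal sketch of Algorithm 1: sort the entries of the Fiedler vector.
import numpy as np

def spectral_ordering(A):
    """Order items by the eigenvector of the 2nd smallest eigenvalue of L_A."""
    L = np.diag(A.sum(axis=1)) - A       # L_A = diag(A 1) - A
    _, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    return np.argsort(eigvecs[:, 1])     # sort the Fiedler vector f^(1)

# Sanity check on a pre-R matrix: the latent order is recovered up to reversal.
n = 50
A = 0.9 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
perm = spectral_ordering(A)
assert (perm == np.arange(n)).all() or (perm == np.arange(n)[::-1]).all()
```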

2.2 Laplacian Embedding

Let L_A = Φ Λ Φ^T, with Λ = diag(λ_0, …, λ_{n−1}), 0 = λ_0 ≤ λ_1 ≤ … ≤ λ_{n−1}, and Φ = (f^{(0)}, …, f^{(n−1)}), be the eigendecomposition of L_A. Algorithm 1 embeds the data in 1D through the eigenvector f^{(1)} (the 1-LE). For any d < n, the map

i ↦ (f^{(1)}_i, …, f^{(d)}_i)    (d-LE)

defines a d-dimensional embedding (the d-LE), which solves the following embedding problem,

minimize_{F ∈ R^{n×d} : F^T F = I_d, F^T 1 = 0} Σ_{i,j} A_{ij} ∥F_i − F_j∥²    (Lap-Emb)

Indeed, like in (1), the objective of Lap-Emb can be written 2 Tr(F^T L_A F) (see belkin2003laplacian for a similar derivation). The 2-SUM intuition still holds: the d-LE lays similar elements nearby, and dissimilar elements apart, in R^d. Other dimensionality reduction techniques such as Multidimensional Scaling (MDS) (kruskal1978multidimensional), kernel PCA (scholkopf1997kernel), or Locally Linear Embedding (LLE) (roweis2000nonlinear) could be used as alternatives to embed the data in a way that intuitively preserves the latent ordering. However, guided by the generalization of Algorithm 1 and the theoretical results that follow, we restrict ourselves to the Laplacian embedding.
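In code, the d-LE is simply the matrix of the first d non-trivial eigenvectors of L_A; a dense sketch follows (for large n, a sparse eigensolver such as scipy.sparse.linalg.eigsh would be preferable):

```python
# A minimal sketch of the d-LE of a connected similarity matrix A.
import numpy as np
from scipy.linalg import eigh

def laplacian_embedding(A, d):
    """Row i of the returned (n, d) array is the d-LE of item i."""
    L = np.diag(A.sum(axis=1)) - A    # unnormalized Laplacian L_A
    _, eigvecs = eigh(L)              # columns sorted by ascending eigenvalue
    return eigvecs[:, 1:d + 1]        # skip the constant eigenvector f^(0)
```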

2.2.1 Normalization and Scaling

Given the weighted adjacency matrix A of a graph, its Laplacian reads L_A = D − A, where D = diag(A1) has diagonal entries D_{ii} = Σ_j A_{ij} (the degree of i). Normalizing by D^{−1} or D^{−1/2} leads to the normalized Laplacians,

L^{rw}_A = D^{−1} L_A = I − D^{−1} A,    L^{sym}_A = D^{−1/2} L_A D^{−1/2} = I − D^{−1/2} A D^{−1/2}.    (2)

They correspond to graph-cut normalizations (normalized cut or ratio cut). Moreover, L^{rw}_A has a Markov chain interpretation, where a random walker at node i jumps to node j between times t and t + 1 with transition probability P_{ij} = A_{ij}/D_{ii}. It has connections with diffusion processes, governed by the heat equation ∂H/∂t = −ΔH, where Δ is the Laplacian operator, H the heat kernel, and t is time (qiu2007clustering). These connections lead to diverse Laplacian embeddings backed by theoretical justifications, where the eigenvectors of L_A are sometimes scaled by decaying weights α_1 ≥ α_2 ≥ … ≥ α_d > 0 (thus emphasizing the first eigenvectors),

i ↦ (α_1 f^{(1)}_i, …, α_d f^{(d)}_i).    ((α, d)-LE)

Laplacian eigenmaps (belkin2003laplacian) is a nonlinear dimensionality reduction technique based on the spectral embedding of L^{rw}_A (the (α, d)-LE with α_k = 1 for all k). Specifically, given points x_1, …, x_n ∈ R^m, the method computes a heat-kernel similarity matrix A_{ij} = exp(−∥x_i − x_j∥²/t) and outputs the first d non-trivial eigenvectors of L^{rw}_A as a lower-dimensional embedding. The choice of the heat kernel is motivated by connections with the heat diffusion process on a manifold, a partial differential equation involving the Laplacian operator. This method has been successful in many machine learning applications such as semi-supervised classification (belkin2004semi) and search-engine-type ranking (zhou2004ranking). Notably, it provides a global, nonlinear embedding of the points that preserves the local structure.

The commute time distance (CTD) between two nodes i and j on the graph is the expected time for a random walker to travel from node i to node j and then return. The full (α, d)-LE, with d = n − 1 and α_k = λ_k^{−1/2}, satisfies ∥F_i − F_j∥² ∝ CTD(i, j). Given the decay of the α_k, the d-LE with d ≪ n approximately preserves the CTD. This embedding has been successfully applied to vision tasks, e.g., anomaly detection (albano2012euclidean), image segmentation and motion tracking (qiu2007clustering).

Another, closely related dimensionality reduction technique is that of diffusion maps (coifman2006diffusion), where the embedding is derived to preserve diffusion distances, resulting in the (α, d)-LE with α_k = (1 − λ_k)^t, for a diffusion time t ≥ 0 (the 1 − λ_k being the eigenvalues of the transition matrix D^{−1}A).

coifman2006diffusion; coifman2008graph also propose a normalization of the similarity matrix, Ã = D^{−1} A D^{−1}, to extend the convergence of L_Ã towards the Laplace-Beltrami operator on a curve when the similarity is obtained through a heat kernel on points that are non-uniformly sampled along that curve.

Finally, we will use in practice a heuristic scaling with decaying weights α_k to damp the high dimensions, as explained in Appendix B.5.
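The sketch below gathers these scalings in one place; the CTD and diffusion-map weights follow the formulas above, while the exact heuristic damping used in our experiments is described in Appendix B.5, so treat the weights here as illustrative assumptions rather than the exact implementation.

```python
# A sketch of the (alpha, d)-LE with illustrative weight choices.
import numpy as np
from scipy.linalg import eigh

def scaled_embedding(A, d, scaling="ctd", t=1.0):
    """Eigenvectors of the random-walk Laplacian, rescaled by weights alpha_k."""
    dsqrt = np.sqrt(A.sum(axis=1))
    L_sym = np.eye(len(A)) - A / np.outer(dsqrt, dsqrt)  # symmetric normalization
    eigvals, eigvecs = eigh(L_sym)
    V = eigvecs[:, 1:d + 1] / dsqrt[:, None]  # eigenvectors of L^rw = I - D^-1 A
    lam = eigvals[1:d + 1]
    if scaling == "ctd":          # commute-time weights: alpha_k = 1/sqrt(lambda_k)
        alpha = 1.0 / np.sqrt(lam)
    elif scaling == "diffusion":  # diffusion maps: alpha_k = (1 - lambda_k)^t
        alpha = (1.0 - lam) ** t  # assumes lambda_k < 1 for the first d eigenvalues
    else:                         # alpha_k = 1 recovers the plain (normalized) d-LE
        alpha = np.ones(d)
    return V * alpha
```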

For a deeper discussion about spectral graph theory and the relations between these methods, see for instance qiu2007clustering and chung2000discrete.

2.3 Link with Continuous Operators

In the context of dimensionality reduction, when the data points lie on a manifold M embedded in R^m, the graph Laplacian of the heat kernel (A_{ij} = exp(−∥x_i − x_j∥²/t)) used in belkin2003laplacian is a discrete approximation of Δ_M, the Laplace-Beltrami operator on M (a differential operator akin to the Laplace operator, adapted to the local geometry of M). singer2006graph specifies the hypotheses on the data and the rate of convergence of L_A towards Δ_M when n grows and the heat-kernel bandwidth t shrinks. von2005limits also explores the asymptotics of the spectrum of L_A to prove the consistency of spectral clustering.

This connection with continuous operators gives hints about the Laplacian embedding in some settings of interest for Linear Seriation and Circular Seriation. Indeed, consider n points distributed along a curve Γ parameterized by a smooth function γ : [0, 1] → R^m, say x_i = γ(s_i). If their similarity measures their proximity along the curve, then the similarity matrix is a circular-R matrix if the curve is closed (γ(0) = γ(1)), and an R matrix otherwise. coifman2008graph motivate a method for Circular Seriation with the spectrum of the Laplace-Beltrami operator Δ_Γ on Γ when Γ is a closed curve. Indeed, Δ_Γ is simply the second-order derivative with respect to the arc-length s, Δ_Γ f = f''(s) (for f twice continuously differentiable), and its eigenfunctions are the solutions of,

f''(s) = −λ f(s).    (3)

With periodic boundary conditions, f(0) = f(1) and f'(0) = f'(1), and smoothness assumptions, the first eigenfunction is constant with eigenvalue 0, and the remaining ones are {cos(2πms), sin(2πms)}_{m≥1}, associated to the eigenvalues (2πm)² of multiplicity 2. Hence, the 2-LE, s ↦ (cos(2πs), sin(2πs)), should approximately lay the points on a circle, allowing for solving Circular Seriation (coifman2008graph). More generally, the 2d-LE, s ↦ (cos(2πs), sin(2πs), …, cos(2πds), sin(2πds)), is a closed curve in R^{2d}.

If Γ is not closed, we can also find the eigenfunctions of Δ_Γ. For instance, with Neumann boundary conditions (vanishing normal derivative), f'(0) = f'(1) = 0, the non-trivial eigenfunctions of (3) are f_m(s) = cos(πms), with associated eigenvalues (πm)² of multiplicity 1. The 1-LE, s ↦ cos(πs), respects the monotonicity of s, which is consistent with Theorem 2.1. lafon2004diffusion invoked this asymptotic argument to solve an instance of Linear Seriation but seemed unaware of the existence of Atkins' Algorithm 1. Note that here too, the d-LE, s ↦ (cos(πs), cos(2πs), …, cos(πds)), follows a curve in R^d, but one with endpoints.

These asymptotic results hint that the Laplacian embedding preserves the latent ordering of data points lying on a curve embedded in . However, these results are only asymptotic and there is no known guarantee for the Circular Seriation problem as there is for Linear Seriation. Also, the curve (sometimes called filamentary structure) stemming from the Laplacian embedding has been observed in more general cases where no hypothesis on a latent representation of the data is made, and the input similarity matrix is taken as is (see, e.g., diaconis2008horseshoes for a discussion about the horseshoe phenomenon).

2.4 Ordering points lying on a curve

Finding the latent ordering of some points lying on (or close to) a curve can also be viewed as an instance of the traveling salesman problem (TSP), for which a plethora of (heuristic or approximation) algorithms exist (reinelt1994traveling; laporte1992traveling). We can think of this setting as one where the cities to be visited by the salesman already lie along a single road; such TSP instances are thus easy and may be solved by simple heuristic algorithms.

Existing approaches for Linear Seriation and Circular Seriation have only used 2D embeddings so far, for simplicity. kuntz2001iterative use the 2-LE to find a circular ordering of the data. They use a somewhat exotic TSP heuristic which maps the 2D points onto a pre-defined "space-filling" curve, and unrolls the curve through its closed-form inverse to obtain a 1D embedding and sort the points. friendly2002corrgrams uses the angle between the first two coordinates of the 2D-MDS embedding and sorts the points accordingly to perform Linear Seriation. coifman2008graph use the 2-LE to perform Circular Seriation in a tomographic reconstruction setting, and use a simple algorithm that sorts the points by the inverse tangent of the angle between the two components. liu2018unsupervised use a similar approach to solve Circular Seriation in a cell-cycle related problem, but with the 2D embedding given by MDS.

3 Spectral properties of some (circular) Robinson matrices

We have claimed that the d-LE enhances the latent ordering of the data, and we now present some theoretical evidence. We adopt a point of view similar to Atkins, where the feasibility of Linear Seriation relies on structural assumptions on the similarity matrix (A pre-R). For a subclass of CR_n (the set of circular-R matrices), we show that the d-LE lays the points on a closed curve, and that for d = 2, the elements are embedded on a circle according to their latent circular ordering. This is a counterpart of Theorem 2.1 for Circular Seriation. It extends the asymptotic results motivating the approach of coifman2008graph, shifting the structural assumptions on the elements (data points lying on a curve embedded in R^m) to assumptions on the raw similarity matrix that can be verified in practice. Then, we develop a perturbation analysis to bound the deformation of the embedding when the input matrix is in CR_n up to a perturbation. Finally, we discuss the spectral properties of some (non-circular) R-matrices that shed light on the filamentary structure of their d-LE for d > 2.

For simplicity, we assume n odd in the following. The results for even n are relegated to the Appendix, together with the technical proofs.

3.1 Circular Seriation with Symmetric, Circulant matrices

Let us consider the set of matrices in CR_n that are circulant, in order to have a closed-form expression of their spectrum. A matrix A is Toeplitz if its entries are constant on each diagonal, A_{ij} = a_{i−j} for a vector of values a of size 2n − 1. A symmetric Toeplitz matrix satisfies A_{ij} = b_{|i−j|}, with b of size n. In the case of circulant symmetric matrices, we also have b_k = b_{n−k} for 1 ≤ k ≤ n − 1, thus symmetric circulant matrices are of the form,

A_{ij} = b_{|i−j|}, with b_k = b_{n−k} for 1 ≤ k ≤ n − 1,    (4)

where b = (b_0, …, b_m) is a vector of values of size m + 1 (recall that n = 2m + 1). The circular-R assumption (Def. 1.2) imposes that the sequence (b_0, …, b_m) be non-increasing. We thus define the set of circulant matrices of CR_n as follows.

Definition 3.1.

A matrix A ∈ S_n is a circulant circular-R matrix iff it satisfies A_{ij} = b_{|i−j|} and b_k = b_{n−k} for k ∈ {1, …, n − 1}, i.e., equation (4), with (b_0, …, b_m) a non-increasing sequence.

The spectrum of symmetric circulant matrices is known (reichel1992eigenvalues; gray2006toeplitz; massey2007distribution): for a matrix of size n = 2m + 1, it is given by,

ν_k = b_0 + 2 Σ_{j=1}^{m} b_j cos(2πkj/n), for k = 0, …, m.    (5)

For k ≥ 1, ν_k is an eigenvalue of multiplicity 2 with associated eigenvectors y^{(k,cos)} = (cos(2πki/n))_{i=1,…,n} and y^{(k,sin)} = (sin(2πki/n))_{i=1,…,n}. For any k, the pair (y^{(k,cos)}, y^{(k,sin)}) embeds the points on a circle, but for k ≥ 2, the circle is walked through k times, hence the ordering of the points on the circle does not follow their latent ordering. The ν_k from equation (5) are in general not sorted. It is the Robinson property (monotonicity of (b_j)) that guarantees that ν_1 ≥ ν_k for k ≥ 2, and thus that the 2-LE embeds the points on a circle that follows the latent ordering and allows one to recover it by scanning through the unit circle. This is formalized in Theorem 3.2, which is the main result of our paper, proved in Appendix C. It provides guarantees in the same form as in Theorem 2.1, with the simple Algorithm 2, used in coifman2008graph, that sorts the angles.

Input: Connected similarity matrix A ∈ S_n
1: Compute normalized Laplacian L^{rw}_A = I − D^{−1}A
2: Compute the two first non-trivial eigenvectors of L^{rw}_A, (f^{(1)}, f^{(2)})
3: Sort the values of θ_i = arctan(f^{(2)}_i / f^{(1)}_i) (adding π when f^{(1)}_i < 0)
Output: Permutation π such that θ_{π(1)} ≤ … ≤ θ_{π(n)}
Algorithm 2: Circular Spectral Ordering (coifman2008graph)
Theorem 3.2.

Given a permuted observation Π A Π^T of a circulant matrix A ∈ CR_n (Definition 3.1), the 2-LE maps the items on a circle, equally spaced by an angle of 2π/n, following the circular ordering in Π. Hence, Algorithm 2 recovers a permutation π such that Π A Π^T ∈ CR_n, i.e., it solves Circular Seriation.
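A minimal sketch of Algorithm 2 (names are ours; np.arctan2 takes care of the quadrant handling when sorting the angles):

```python
# A minimal sketch of Algorithm 2: sort items by their angle in the 2-LE.
import numpy as np
from scipy.linalg import eigh

def circular_spectral_ordering(A):
    dsqrt = np.sqrt(A.sum(axis=1))
    L_sym = np.eye(len(A)) - A / np.outer(dsqrt, dsqrt)  # normalized Laplacian
    _, eigvecs = eigh(L_sym)
    F = eigvecs[:, 1:3] / dsqrt[:, None]    # two first non-trivial eigenvectors
    theta = np.arctan2(F[:, 1], F[:, 0])    # angle of each embedded point
    return np.argsort(theta)                # circular order, up to shift/reversal
```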

3.2 Perturbation analysis

The spectrum is a continuous function of the matrix. Let us bound the deformation of the 2-LE under a perturbation of the matrix, using the Davis-Kahan theorem (davis1970rotation), as presented in (von2007tutorial, Theorem 7). We give more detailed results in Appendix D for the subclass of KMS matrices defined further.

Proposition 3.3 (Davis-Kahan).

Let L_A and L_B be the Laplacian matrices of A, B ∈ S_n, respectively, and let F_A, F_B ∈ R^{n×2} be the associated 2-LE of A and B, i.e., the concatenation of the two eigenvectors associated to the two smallest non-zero eigenvalues, written F = (f^{(1)}, f^{(2)}). Then, there exists an orthonormal rotation matrix O ∈ R^{2×2} such that

∥F_A − F_B O∥_F ≤ √2 ∥L_A − L_B∥_F / δ,    (6)

where δ is the eigengap separating the two smallest non-zero eigenvalues of L_A from the rest of its spectrum.

3.3 Robinson Toeplitz matrices

Let us investigate how the latent linear ordering of Toeplitz matrices in R_n translates to the d-LE. Remark that, from Theorem 2.1, the 1-LE suffices to solve Linear Seriation. Yet, for perturbed observations of A ∈ R_n, the d-LE may be more robust to the perturbation than the 1-LE, as the experiments of §5 indicate.

Tridiagonal Toeplitz matrices are defined by A_{ij} = b_0 if i = j, b_1 if |i − j| = 1, and 0 otherwise. For k = 1, …, n, they have eigenvalues ν_k = b_0 + 2b_1 cos(kπ/(n + 1)), each with multiplicity 1, associated to the eigenvector y^{(k)} (trench1985eigenvalue),

y^{(k)}_i = sin(ikπ/(n + 1)), for i = 1, …, n,    (7)

thus matching the spectrum of the Laplace operator on a curve with endpoints from §2.3 (up to a shift). This type of matrix can indeed be viewed as a limit case, with points uniformly sampled on a line and a similarity decaying so fast that only the two nearest neighbors of each point have non-zero similarity.
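The eigenpairs (7) can be checked directly:

```python
# Numerical check of equation (7) for a tridiagonal Toeplitz matrix.
import numpy as np

n, b0, b1 = 8, 2.0, 1.0
A = b0 * np.eye(n) + b1 * (np.eye(n, k=1) + np.eye(n, k=-1))
k = np.arange(1, n + 1)
nu = b0 + 2 * b1 * np.cos(k * np.pi / (n + 1))                  # eigenvalues
Y = np.sin(np.outer(np.arange(1, n + 1), k) * np.pi / (n + 1))  # Y[i-1, k-1] = y^(k)_i
assert np.allclose(A @ Y, Y * nu)   # each column of Y is an eigenvector of A
```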

Kac-Murdock-Szegö (KMS) matrices are defined, for 0 < ρ < 1, by A_{ij} = ρ^{|i−j|}. For k = 1, …, m, there exists an angle θ_k such that ν_k is a double eigenvalue associated to a pair of sinusoidal eigenvectors y^{(k,cos)}, y^{(k,sin)},

y^{(k,cos)}_i = cos(θ_k i), y^{(k,sin)}_i = sin(θ_k i).    (8)

Linearly decreasing Toeplitz matrices, defined by A_{ij} = n − |i − j|, have spectral properties analogous to those of KMS matrices (trigonometric expressions, interlacement, low frequencies assigned to the largest eigenvalues), but with more technical details, available in bunger2014inverses. This goes beyond the asymptotic case modeled by tridiagonal matrices.

Banded Robinson Toeplitz matrices typically include similarity matrices from DNA sequencing. Actually, any Robinson Toeplitz matrix becomes banded under a thresholding operation. Also, fast decaying Robinson matrices such as KMS matrices are almost banded. There is a rich literature dedicated to the spectrum of generic banded Toeplitz matrices (boeottcher2005spectral; gray2006toeplitz; bottcher2017asymptotics). However, it mostly provides asymptotic results on the spectra. Notably, some results indicate that the eigenvectors of some banded symmetric Toeplitz matrices become, up to a rotation, close to the sinusoidal, almost equi-spaced eigenvectors observed in equations (7) and (8) (bottcher2010structure; ekstrom2017eigenvalues).

3.4 Spectral properties of the Laplacian

For circulant matrices A, L_A and A have the same eigenvectors, since L_A = D − A = cI − A, where c is the (constant) row sum of A. For general symmetric Toeplitz matrices, this property no longer holds, as D_{ii} varies with i. Yet, for fast-decaying Toeplitz matrices, D_{ii} is almost constant, except for i near the edges, namely i close to 1 or to n. Therefore, the eigenvectors of L_A resemble those of A except for the "edgy" entries.

4 Recovering Ordering on Filamentary Structure

We have seen that (some) similarity matrices with a latent ordering lead to a filamentary d-LE. The d-LE integrates local proximity constraints into a global, consistent embedding. We expect isolated (or uncorrelated) noise on A to be averaged out by the spectral picture. Therefore, we present Algorithm 3, which redefines the similarity between two items from their proximity within the d-LE. Basically, it fits the points locally by a line, in the same spirit as LLE, which makes sense when the data lies on a one-dimensional manifold (curve) embedded in R^d. Note that Spectral Ordering (Algorithm 1) projects all points on a given line (it only looks at the first coordinates f^{(1)}_i) to reorder them. Our method does so in a local neighborhood, allowing for reordering points on a curve with several oscillations. We then run the basic Algorithm 1 (or Algorithm 2 for Circular Seriation). Hence, the d-LE is eventually used to pre-process the similarity matrix.

Input: A similarity matrix A, a neighborhood size k, a dimension d of the Laplacian embedding.
1: F ← d-LE of A {compute Laplacian embedding}
2: S ← 0_{n×n} {initialize new similarity matrix}
3: for i = 1, …, n do
4:   V ← k nearest neighbors of F_i {find nearest neighbors of i}
5:   w ← direction of the line fitting {F_j : j ∈ V} {fit V by a line}
6:   d_{ij} ← |w^T (F_i − F_j)|, for j ∈ V {compute distances on the line}
7:   S_{ij} ← S_{ij} + h(d_{ij}), for j ∈ V, with h non-increasing {update similarity}
8: end for
9: Compute π from the matrix S with Algorithm 1 (resp., Algorithm 2) for a linear (resp., circular) ordering.
Output: A permutation π.
Algorithm 3: Ordering Recovery on Filamentary Structure in R^d

In Algorithm 3, we compute a d-LE in line 1, and then a 1-LE (resp., a 2-LE) for a linear (resp., circular) ordering in line 9. For a reasonable number of neighbors k in the k-NN of line 4 (in practice, 5 ≤ k ≤ 20), the complexity of computing the d-LE dominates Algorithm 3. We shall see in Section 5 that our method, while being almost as computationally cheap as the base Algorithms 1 and 2 (roughly a factor of 2), yields substantial improvements. In line 7, we can update the similarity by adding any non-increasing function h of the distance d_{ij}, e.g., h(d) = exp(−d) or h(d) = −d (the latter case requires adding an offset to S afterwards to ensure it has non-negative entries; it is what we implemented in practice). In line 9, the matrix S needs to be connected in order to use Algorithm 1, which is not always verified in practice (for low values of k, for instance). In that case, we reorder each connected component of S separately with Algorithm 1, and then merge the partial orderings into a global ordering by using the input matrix A, as detailed in Algorithm 4, Appendix A.
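The following condensed Python sketch puts the whole of Algorithm 3 together (helper names are ours; the reference implementation, with connected-component handling and scalings, is at https://github.com/antrec/mdso):

```python
# A condensed sketch of Algorithm 3: rebuild a similarity matrix from local
# line fits in the d-LE, then reorder it with the basic spectral step.
# Assumes the new similarity S stays connected (else merge, cf. Algorithm 4).
import numpy as np
from scipy.linalg import eigh

def laplacian_eigvecs(A, d):
    """First d non-trivial eigenvectors of the unnormalized Laplacian of A."""
    _, vecs = eigh(np.diag(A.sum(axis=1)) - A)
    return vecs[:, 1:d + 1]

def filamentary_ordering(A, k=15, d=10, circular=False):
    n = len(A)
    F = laplacian_eigvecs(A, d)                    # line 1: d-LE
    S = np.zeros((n, n))                           # line 2: new similarity
    for i in range(n):
        dist2 = ((F - F[i]) ** 2).sum(axis=1)
        V = np.argsort(dist2)[:k + 1]              # line 4: i and its k neighbors
        P = F[V] - F[V].mean(axis=0)
        w = np.linalg.svd(P, full_matrices=False)[2][0]  # line 5: best-fit direction
        t = np.abs((F[V] - F[i]) @ w)              # line 6: distances along the line
        S[i, V] += np.exp(-t)                      # line 7: h(d_ij), h non-increasing
    S = (S + S.T) / 2                              # keep S symmetric
    G = laplacian_eigvecs(S, 2)                    # line 9 (unnormalized Laplacian
    if circular:                                   #   here, for simplicity)
        return np.argsort(np.arctan2(G[:, 1], G[:, 0]))  # Algorithm 2 on S
    return np.argsort(G[:, 0])                           # Algorithm 1 on S
```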

5 Numerical Results

5.1 Synthetic Experiments

We performed synthetic experiments with noisy observations of Toeplitz matrices A, either linear (A ∈ R_n) or circular (A ∈ CR_n). We added uniform noise on all the entries, with an amplitude parameter a controlling the maximum value of the noise. The matrices used are either banded (sparse), with linearly decreasing entries when moving away from the diagonal, or dense, with exponentially decreasing entries (KMS matrices). We used several values for the parameters k (number of neighbors) and d (dimension of the d-LE), and various scalings of the d-LE (the weights α in the (α, d)-LE), yielding similar results (see the sensitivity to the number of neighbors and to the scaling in Appendices B.4 and B.5). In a given experiment, the matrix A is randomly permuted with a ground-truth permutation π*. We report the Kendall-Tau scores between π* and the solution of Algorithm 3 for different choices of dimension d and varying noise amplitude a, in Figure 2, for banded (circular) matrices. For the circular case, the ordering is defined up to a shift; to compute a Kendall-Tau score between two permutations describing a circular ordering, we computed the best Kendall-Tau score between the first permutation and all shifts of the second, as detailed in Algorithm 5. The analogous results for exponentially decaying (KMS) matrices are given in Appendix B.3, Figure 7. For a given combination of parameters, the scores are averaged over 100 experiments, and the standard deviation divided by 10 (for ease of reading; see Appendix B.2) is plotted in transparent above and below each curve. The baseline (in blue) corresponds to the basic spectral method, Algorithm 1 for linear and Algorithm 2 for circular seriation. The other lines correspond to given choices of the dimension d of the d-LE, as written in the legend.
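For reference, here is a sketch of the synthetic data generation under the assumptions stated above (the exact sizes, bandwidths and noise grids used for the figures are not reproduced here):

```python
# A sketch of the synthetic setup: a noisy, randomly permuted, banded
# (optionally circular) Toeplitz matrix with linearly decreasing entries.
import numpy as np

def noisy_banded_matrix(n=500, band=50, a=1.0, circular=False, seed=0):
    rng = np.random.default_rng(seed)
    off = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    if circular:
        off = np.minimum(off, n - off)     # circular distance between indices
    A = np.maximum(0.0, 1.0 - off / band)  # linear decrease within the band
    noise = rng.uniform(0.0, a, size=(n, n))
    A = A + (noise + noise.T) / 2          # symmetric uniform noise, amplitude a
    perm = rng.permutation(n)
    return A[np.ix_(perm, perm)], perm     # permuted observation + ground truth
```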

(a) Linear Banded
(b) Circular Banded
Figure 2: Kendall-Tau scores for Linear (a) and Circular (b) Seriation, for noisy observations of banded Toeplitz matrices, displayed for several values of the dimension d of the d-LE, with a fixed number of neighbors k.

We observe that leveraging the additional dimensions of the d-LE, unused by the baseline Algorithms 1 and 2, substantially improves the robustness of Seriation. For instance, in Figure 2(a), the performance of Algorithm 3 is almost optimal for a noise amplitude going from 0 to 4, whereas that of Algorithm 1 falls by half. We illustrate the effect of the pre-processing of Algorithm 3 in Figures 12 and 13, Appendix B.6.

5.2 Genome assembly experiment

In de novo genome assembly, a whole DNA strand is reconstructed from randomly sampled sub-fragments (called reads) whose positions within the genome are unknown. The genome is oversampled so that all parts are covered by multiple reads with high probability. Overlap-Layout-Consensus (OLC) is a major assembly paradigm based on three main steps. First, compute the overlaps between all pairs of reads. This provides a similarity matrix A, whose entry A_{ij} measures how much reads i and j overlap (and is zero if they do not). Then, determine the layout from the overlap information, that is to say, find an ordering and positioning of the reads that is consistent with the overlap constraints. This step, akin to solving a one-dimensional jigsaw puzzle, is a key step in the assembly process. Finally, given the tiling of the reads obtained in the layout stage, the consensus step aims at determining the most likely DNA sequence that can be explained by this tiling. It essentially consists in performing multiple sequence alignments.

In the true ordering (corresponding to the sorted positions of the reads along the genome), a given read overlaps much with the next one, slightly less with the one after it, and so on, until a point where it has no overlap with the reads that are further away. This makes the read similarity matrix Robinson and roughly band-diagonal (with non-zero values confined to a diagonal band). Finding the layout of the reads therefore fits the Linear Seriation framework (or Circular Seriation for circular genomes, as illustrated in Figure 5, Appendix B.1). In practice, however, there are some repeated sequences (called repeats) along the genome that induce false positives in the overlap detection tool (Pop04), resulting in non-zero similarity values outside (and possibly far away from) the diagonal band. The similarity matrix ordered with the ground truth is then the sum of a Robinson band matrix and a sparse "noise" matrix, as in Figure 3(a). Because of this sparse "noise", the basic spectral Algorithm 1 fails to find the layout, as the quadratic loss appearing in 2-SUM is sensitive to outliers.

recanati2018robust tackle this issue by modifying the loss in 2-SUM to make it more robust. Instead, we show that the simple multi-dimensional extension proposed in Algorithm 3 suffices to capture the ordering of the reads despite the repeats.

(a) similarity matrix
(b) ordering found
Figure 3: Overlap-based similarity matrix (a) from E. coli reads, and the ordering found with Algorithm 3 (b), plotted against the position of the reads within a reference genome (obtained by mapping the reads to the reference with minimap2). The genome being circular, the ordering is defined up to a shift, which is why we observe two lines instead of one in (b).

We used our method to perform the layout of an E. coli bacterial genome. We used reads sequenced with third-generation sequencing technology, and computed the overlaps with dedicated software, as detailed in Appendix B.1. The new similarity matrix S computed from the embedding in Algorithm 3 was disconnected, resulting in several connected components instead of one global ordering (see Figure 6(b)). However, the sub-orderings could be unambiguously merged into one, in the simple way described in Algorithm 4, resulting in the ordering shown in Figure 3(b). The Kendall-Tau score between the ordering found and the one obtained by sorting the positions of the reads along the genome (obtained by mapping the reads to a reference with minimap2 (li2018minimap2)) is 99.5%, using Algorithm 5 to account for the circularity of the genome.

6 Conclusion

In this paper, we bring together results that shed light on the filamentary structure of the Laplacian embedding of serial data. This allows for tackling Linear Seriation and Circular Seriation in a unifying framework. Notably, we provide theoretical guarantees for Circular Seriation analogous to those existing for Linear Seriation. These guarantees do not make assumptions about the underlying generation of the data matrix, and can be verified a posteriori by the practitioner. We then propose a simple method to leverage the filamentary structure of the embedding. It can be seen as a pre-processing of the similarity matrix. Although its complexity is comparable to that of the baseline methods, experiments on synthetic and real data indicate that this pre-processing substantially improves robustness to noise.

Acknowledgements

AA is at the département d'informatique de l'ENS, École normale supérieure, UMR CNRS 8548, PSL Research University, 75005 Paris, France, and the INRIA Sierra project-team. The authors would like to acknowledge support from the data science joint research initiative with the fonds AXA pour la recherche and Kamet Ventures. TK acknowledges funding from the CFM-ENS chaire les modèles et sciences des données. Finally, the authors would like to thank Raphael Berthier for fruitful discussions.

References

Appendix A Additional Algorithms

A.1 Merging connected components

The new similarity matrix S computed in Algorithm 3 is not necessarily the adjacency matrix of a connected graph, even when the input matrix A is. For instance, when the number of nearest neighbors k is low and the points in the embedding are non-uniformly sampled along a curve, S may have several disjoint connected components (let us say there are C of them in the following). Still, the baseline Algorithm 1 requires a connected similarity matrix as input. When S is disconnected, we run Algorithm 1 separately in each of the C components, yielding C sub-orderings instead of one global ordering.

However, since A is connected, we can use the edges of A between the connected components to merge the sub-orderings together. Specifically, given the C ordered subsequences, we build a meta-similarity matrix between them as follows. For each pair of ordered subsequences (c, c′), we check whether the elements in one of the two ends of c have edges with those in one of the two ends of c′ in the graph defined by A. According to that measure of similarity, and to the direction of these meta-edges (i.e., whether it is the beginning or the end of c and c′ that are similar), we merge together the two subsequences that are closest to each other. We repeat this operation with the rest of the subsequences and the sequence formed by the latter merge step, until there is only one final sequence, or until the meta-similarity between subsequences is zero everywhere. We formalize this procedure in the greedy Algorithm 4, which is implemented in the package at https://github.com/antrec/mdso.

Given the C reordered subsequences (one per connected component of S), (c_1, …, c_C), which form a partition of {1, …, n}, and a window size w defining the length of the ends we consider (w must be smaller than half the length of the smallest subsequence), we denote by c^b (resp., c^e) the first (resp., the last) w elements of c, and by s(c_i^x, c_j^y) the similarity between the ends c_i^x and c_j^y, computed from the input matrix A, for any pair i ≠ j and any combination of ends x, y ∈ {b, e}. Then, we define the meta-similarity between c_i and c_j by,

m(c_i, c_j) = max_{x, y ∈ {b, e}} s(c_i^x, c_j^y),    (9)

and (x*, y*) the combination of ends where the argmax is realized, i.e., such that m(c_i, c_j) = s(c_i^{x*}, c_j^{y*}). Finally, we will use c̄ to denote the ordered subsequence c read from the end to the beginning; for instance, if c = (1, 2, 3), then c̄ = (3, 2, 1).

Input: C ordered subsequences (c_1, …, c_C) forming a partition of {1, …, n}, an initial similarity matrix A, a neighborhood parameter w.
1: while C > 1 do
2:   Compute the meta-similarity m(c_i, c_j) and the meta-orientation (x*, y*), for all pairs of subsequences, with equation (9).
3:   if m = 0 for all pairs then
4:     break
5:   end if
6:   Find (i, j) ∈ argmax m(c_i, c_j), and the corresponding orientations (x*, y*).
7:   if (x*, y*) = (e, b) then
8:     c ← (c_i, c_j)
9:   else if (x*, y*) = (e, e) then
10:     c ← (c_i, c̄_j)
11:   else if (x*, y*) = (b, b) then
12:     c ← (c̄_i, c_j)
13:   else if (x*, y*) = (b, e) then
14:     c ← (c̄_i, c̄_j)
15:   end if
16:   Remove c_i and c_j from the set of subsequences.
17:   Add c to the set of subsequences.
18:   C ← C − 1
19: end while
Output: Total reordered sequence c, which is a permutation if C = 1, or a set of reordered subsequences if the loop broke at line 4.
Algorithm 4: Merging connected components

A.2 Computing the Kendall-Tau score between two permutations describing a circular ordering

Suppose we have data with a circular structure, i.e., n items that can be laid on a circle such that the higher the similarity between two elements, the closer they are on the circle. Then, given an ordering of the points that respects this circular structure (i.e., a solution to Circular Seriation), we can shift this ordering without affecting the circular structure. For instance, in Figure 4, the graph has the same affinity matrix whether we use the indexing printed in black (outside the circle) or the shifted version printed in purple (inside the circle).

Figure 4: Illustration of the shift-invariance of the permutations that solve a Circular Seriation problem.

Therefore, we transpose the Kendall-Tau score between two permutations to the case where we want to compare the two permutations up to a shift, with Algorithm 5.

Input: Two permutation vectors of size n, π_1 and π_2
1: for s = 0 to n − 1 do
2:   score(s) ← KendallTau(π_1, (π_2 + s) mod n)
3: end for
4: best ← max_s score(s)
Output: best score
Algorithm 5: Comparing two permutations defining a circular ordering
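A short Python sketch of Algorithm 5 (we additionally take the absolute value of the score, to account for the direction-reversal invariance discussed in Appendix B.1):

```python
# A sketch of Algorithm 5: best Kendall-Tau over all cyclic shifts.
import numpy as np
from scipy.stats import kendalltau

def circular_kendalltau(perm1, perm2):
    perm1, perm2 = np.asarray(perm1), np.asarray(perm2)
    n = len(perm1)
    best = -1.0
    for s in range(n):
        tau = kendalltau(perm1, (perm2 + s) % n)[0]  # shift the circular ordering
        best = max(best, abs(tau))  # abs: circular orderings have no direction
    return best
```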

Appendix B Additional Numerical Results

B.1 Genome assembly experiment (detailed)

Here we provide background about the application of seriation methods to genome assembly, and details about our experiment. We used the E. coli reads from Loman15. They were sequenced with an Oxford Nanopore Technology (ONT) MinION device. The sequencing experiment is detailed at http://lab.loman.net/2015/09/24/first-sqk-map-006-experiment, where the data is available. The overlaps between raw reads were computed with minimap2 (li2018minimap2) with the ONT preset. The similarity matrix was constructed directly from the output of minimap2: for each pair of reads where an overlap was found, we let the number of matching bases be the associated similarity value (and zero where no overlap is found). The only preprocessing on the matrix is that we set a threshold to remove short overlaps. In practice, we set the threshold to the median of the similarity values, i.e., we discard the lower half of the overlaps. We then apply our method to the similarity matrix. The Laplacian embedding is shown in Figure 6(a). We used no scaling of the Laplacian, as it corrupted the filamentary structure of the embedding, but we normalized the similarity matrix beforehand, with Ã = D^{−1} A D^{−1}, as in coifman2006diffusion. The resulting similarity matrix S computed from the embedding in Algorithm 3 is disconnected. Then, Algorithm 1 is applied in each connected component, yielding a fragmented assembly with correctly ordered contigs, as shown in Figure 6(b). However, while the new similarity matrix S is disconnected, the input matrix A is connected. The fragmentation happened while "scanning" the nearest neighbors from the embedding. One can therefore merge the ordered contigs using the input matrix A, as follows. For each contig, we check from A whether there are non-zero overlaps between reads at the edges of that contig and reads at the edges of another contig. If so, we merge the two contigs, and repeat the procedure until there is only one contig left (or until there are no more overlaps between the edges of any two contigs). This procedure is detailed in Algorithm 4. Note that the E. coli genome is circular, therefore computing the layout should be cast as a Circular Seriation problem, as illustrated in Figure 5. Yet, since the genome is fragmented into subsequences when S is disconnected, we end up using Algorithm 1 in each connected component, i.e., solving an instance of Linear Seriation in each contig.
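For completeness, here is a sketch of how such a similarity matrix can be assembled from minimap2's PAF output (column 10 of the PAF format is the number of matching bases; the function name and the max-aggregation of duplicate pairs are our choices):

```python
# A sketch of building the overlap similarity matrix from a PAF file.
import numpy as np
from collections import defaultdict

def similarity_from_paf(paf_path):
    pairs, idx = defaultdict(float), {}
    with open(paf_path) as f:
        for line in f:
            c = line.rstrip("\n").split("\t")
            q, t, matches = c[0], c[5], float(c[9])  # query, target, n. of matches
            if q == t:
                continue                             # skip self-overlaps
            i = idx.setdefault(q, len(idx))          # map read names to indices
            j = idx.setdefault(t, len(idx))
            key = (min(i, j), max(i, j))
            pairs[key] = max(pairs[key], matches)
    n = len(idx)
    A = np.zeros((n, n))
    for (i, j), v in pairs.items():
        A[i, j] = A[j, i] = v
    A[A < np.median(A[A > 0])] = 0   # threshold: discard the weaker half
    return A, idx
```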

Figure 5: Illustration of why the overlap-based similarity matrix of an ideal circular genome should be a circular-R matrix.
(a) 3-LE
(b) partial orderings
Figure 6: 3D Laplacian embedding from the E. coli reads overlap-based similarity matrix (a), and the orderings found in each connected component of the new similarity matrix created in Algorithm 3 (b), plotted against the position of the reads within a reference genome obtained by mapping the reads to the reference with minimap2 (all shown on the same plot for compactness). The orderings have no absolute direction, i.e., (1, 2, …, n) and (n, …, 2, 1) are equivalent, which is why the lines in subfigure (b) can be either diagonal or anti-diagonal.

The experiment can be reproduced with the material at https://github.com/antrec/mdso, and the parameters can easily be varied. Overall, the final ordering found is correct when the threshold on the overlap-based similarity is sufficiently high (in practice, above a given fraction of the non-zero values). When the threshold increases, or when the number of nearest neighbors k from Algorithm 3 decreases, the new similarity matrix gets more fragmented, but the final ordering remains the same after the merging procedure.

B.2 Gain over baseline

In Figure 2, each curve is the mean of the Kendall-Tau score (a score directly interpretable by practitioners) over many random realizations of the noise. The shaded confidence interval represents the area in which the true expectation lies with high probability, not the area in which the score of an experiment with a given noisy similarity matrix would lie. As mentioned in the main text, the shaded interval is the standard deviation divided by 10, since otherwise the plot was hard to read, as the intervals crossed each other.

Practitioners may use this method in one shot (e.g., on one particular data set). In that case, it would be more relevant to display the standard deviation directly on the plots, which is the same as what is displayed, but multiplied by 10. Then, the confidence intervals of the baseline and of our method would cross each other. However, the standard deviation across all experiments is due to the fact that some instances are more difficult to solve than others; on the difficult instances, both the baseline and our method perform more poorly than on the easy ones. We therefore also computed the gain over the baseline, i.e., the difference of score between our method and the baseline, for each experiment, and it is almost always positive, i.e., our method almost always beats the baseline, although the confidence intervals cross each other.

B.3 Numerical results with KMS matrices

In Figure 7, we show the same plots as in Section 5, but with KMS matrices, i.e., A_{ij} = ρ^{|i−j|}, with 0 < ρ < 1.

(a) Linear KMS
(b) Circular KMS
Figure 7: Kendall-Tau scores for Linear (a) and Circular (b) Seriation, for noisy observations of KMS Toeplitz matrices, displayed for several values of the dimension d of the d-LE.

B.4 Sensitivity to the parameter k (number of neighbors)

Here we show how our method performs when we vary the parameter k (the number of neighbors at step 4 of Algorithm 3), both for linearly decreasing, banded matrices (as in Section 5), in Figure 8, and for KMS matrices A_{ij} = ρ^{|i−j|}, in Figure 9.

(a) Linear Banded
(b) Circular Banded
Figure 8: Kendall-Tau scores for Linear (a) and Circular (b) Seriation, for noisy observations of banded Toeplitz matrices, displayed for several values of the number of nearest neighbors k, with a fixed dimension d of the d-LE.
(a) Linear KMS
(b) Circular KMS
Figure 9: Kendall-Tau scores for Linear (a) and Circular (b) Seriation, for noisy observations of KMS Toeplitz matrices, displayed for several values of the number of nearest neighbors k, with a fixed dimension d of the d-LE.

We observe that the method performs roughly equally well for k in a range from 5 to 20, and that the performance drops when k gets too large. This can be interpreted as follows: when k is too large, the assumption that the points in the embedding are locally fitted by a line no longer holds. Note also that in practice, for small values of k, the new similarity matrix S can be disconnected, and we have to resort to the merging procedure described in Algorithm 4.

B.5 Sensitivity to the normalization of the Laplacian

We performed experiments to compare the performance of the method with the default Laplacian embedding (d-LE; red curves in Figures 10 and 11) and with two possible normalized embeddings ((α, d)-LE; blue and black curves). We observed that with the default d-LE, the performance first increases with d, and then collapses when d gets too large. The CTD scaling (blue) has the same issue, as the first eigenvalues are roughly of the same magnitude in our settings. The heuristic scaling (black), which damps the higher dimensions, yields better results when d increases, with a plateau rather than a collapse when d gets large. We interpret these results as follows. With the d-LE, line 5 of Algorithm 3 treats all dimensions of the embedding equally. However, the curvature of the embedding tends to increase with the dimension (for circulant matrices, the frequency of the cosines increases linearly with the dimension). The filamentary structure is thus less smooth, and hence more sensitive to noise, in high dimensions, which is why the results are improved by damping the high dimensions (or by using a reasonably small value of d).

(a) Linear Banded
(b) Circular Banded
Figure 10: Mean Kendall-Tau scores for Linear (a) and Circular (b) Seriation, for noisy observations of banded Toeplitz matrices, displayed for several scalings of the Laplacian embedding, with a fixed number of neighbors k and dimension d of the d-LE.
(a) Linear Banded
(b) Circular Banded
Figure 11: Mean Kendall-Tau scores for Linear (a) and Circular (b) Seriation, for noisy observations of banded Toeplitz matrices, displayed for several scalings of the Laplacian embedding, with a different fixed number of neighbors k and dimension d of the d-LE.

B.6 Illustration of Algorithm 3

Here we provide some visual illustrations of the method on a circular banded matrix. Given a matrix A (Figure 12(a)), Algorithm 3 computes the d-LE; the 2-LE is plotted for visualization in Figure 12(b). Then, it creates a new matrix S (Figure 13(a)) from the local alignment of the points in the d-LE. Finally, from the new matrix S, it computes the 2-LE (Figure 13(b)), on which it runs the simple method from Algorithm 2.

Figures 12 and 13 give a qualitative illustration of how the method behaves compared to the basic Algorithm 2.

(a) Noisy circular banded matrix
(b) Noisy 2-LE
Figure 12: Noisy circular banded matrix (a) and associated 2D Laplacian embedding (b).
(a) Matrix from Algorithm 3
(b) New 2-LE
Figure 13: Matrix created through Algorithm 3 (a), and associated 2D Laplacian embedding (b).

Appendix C Proof of Theorem 3.2

In this section, we prove Theorem 3.2. There are many technical details, notably the distinction between the cases of n even and n odd. The key idea is to compare the sums involved in the eigenvalues of the circulant matrices (equation (5)): each eigenvalue is a sum of the values b_j weighted by cosines. For k = 1, we roughly have a reordering inequality, where the ordering of the b_j matches that of the cosines. For the following eigenvalues, the set of values taken by the cosines is roughly the same, but it no longer matches the ordering of the b_j. Finally, the eigenvectors of the Laplacian of A are the same as those of A for circulant matrices A, as observed in §3.4.

We now introduce a few lemmas that will be useful in the proof.

Notation. In the following, we denote ω = 2π/n and S_p(k) = Σ_{j=1}^{p} cos(ωkj), the partial sums of cosines appearing in (5). Depending on the parity of n, we will write n = 2m + 1 or n = 2m + 2; hence we always have m < n/2. Also, when k and n are not coprime, we will write k = r k′ as well as n = r n′, with k′ and n′ coprime.

C.1 Properties of sums of cosines

The following lemma describes the behavior of the partial sum sequence (S_p(k))_p, and establishes its symmetry in (11).

Lemma C.1.

For k ∈ {1, …, n − 1} and any p ∈ {1, …, n − 1},

S_p(k) = Σ_{j=1}^{p} cos(2πkj/n) = sin(πkp/n) cos(πk(p + 1)/n) / sin(πk/n).    (10)

Also, for 0 ≤ p ≤ n − 1,

S_p(k) + S_{n−1−p}(k) = −1.    (11)

For n even (n = 2m + 2), we have

S_m(k) = −(1 + (−1)^k)/2.    (12)