The use of simplicial complexes as a means for estimating topology has found many applications to data analysis in recent times. For example, unsupervised learning techniques such as persistent homology [MR2476414, MR2405684] often use what are known as Čech or Vietoris-Rips filtrations to capture multi-scale topological features of a point cloud $P \subseteq \mathbb{R}^n$. The simplicial complexes in these filtrations are not individually always a good representation of the actual physical shape underlying $P$ since, for example, they often have a higher dimension than it. Our aim is to give an algorithm that can approximate $P$ to the greatest extent possible when fed any simplicial complex $K$ mapped linearly into $\mathbb{R}^n$. This algorithm has several nice properties, including a tendency towards preserving embeddings, as well as reducing to k-means clustering when $K$ is $0$-dimensional. The resulting fitting is further refined by deleting simplices that have been poorly matched with $P$, the end result being a locally linear approximation of $P$. A lower dimensional representation of $P$ in terms of barycentric coordinates follows by projecting $P$ onto this approximation.
2. Algorithm Description
Fix a (geometric) simplicial complex $K$, and let $V$ be the set of vertices of $K$. The facets of $K$ are the simplices of $K$ that have the highest dimension, in the sense that they are not contained in the boundary of any other simplex. $K$ may then be represented as a collection of facets, each of which is represented by the set of vertices that it contains. The dimension of a facet is equal to the number of vertices it contains minus $1$. When we refer to the boundary of a simplex $\sigma$, we mean the union of its smaller dimensional boundary simplices, even when the simplex is embedded as a subset of $\mathbb{R}^n$. Its interior is $\sigma$ minus this boundary.
Any point $x \in K$ is contained in a unique smallest dimensional simplex $\sigma_x$. We may represent $x$ uniquely as a convex combination
$$x \;=\; \sum_{v_i \in \sigma_x} t_i v_i$$
over the vertices $v_i$ that are contained in $\sigma_x$, where $t_i > 0$ and $\sum_i t_i = 1$ for some barycentric coordinates $t_i$. A map $f \colon K \to \mathbb{R}^n$ is said to be linear if it is linear on each simplex of $K$. Namely,
$$f\Big(\sum_i t_i v_i\Big) \;=\; \sum_i t_i f(v_i).$$
So $f$ restricts to a linear map on each simplex of $K$, and is uniquely determined by the values $f(v_i)$ that it takes on its vertices $v_i \in V$. Thus, we have a convenient representation of $f$ in terms of the $f(v_i)$'s.
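The vertex-value representation can be made concrete as follows. This is a minimal sketch with names of our own choosing (`eval_linear`, `f_vals`), assuming the map is evaluated one simplex at a time:

```python
import numpy as np

def eval_linear(vertex_values, t):
    """Evaluate a linear map f at the point of a simplex with
    barycentric coordinates t, given f's values on the simplex's
    vertices (one row of vertex_values per vertex)."""
    t = np.asarray(t, dtype=float)
    assert np.all(t >= 0) and np.isclose(t.sum(), 1.0)  # convex combination
    return t @ np.asarray(vertex_values, dtype=float)

# f maps the three vertices of a 2-simplex to points in R^2
f_vals = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
center = eval_linear(f_vals, [1/3, 1/3, 1/3])  # image of the barycenter
```

Since $f(\sum_i t_i v_i) = \sum_i t_i f(v_i)$, storing the rows of `f_vals` determines $f$ everywhere on the simplex.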
Fix a finite set of data points $P \subseteq \mathbb{R}^n$. Suppose $f_0 \colon K \to \mathbb{R}^n$ is any choice of linear map, meant to represent our initial fitting of $K$ to $P$. Starting with $f_0$, our aim is to obtain successively better fittings $f_1, f_2, \ldots$ of $K$ to $P$, at each iteration giving a better reflection of the shape and structure of $P$. We do this as follows.
The assignment for $f_{m+1}(v)$ in Step 8 can be generalized to
$$f_{m+1}(v) \;=\; \frac{1}{|S_v|}\sum_{x \in S_v}\Big(f_m(v) + \big((1-\lambda)\,t_v(x) + \lambda\big)(x - p_x)\Big),$$
where $0 \le \lambda \le 1$ is the learning rate. Larger values for $\lambda$ lead to faster convergence, but poorer fittings and reduced stability. Taking $\lambda$ to be small (close to $0$) but nonzero has every advantage, in addition to preventing the change in the mapping of a vertex $v$ between iterations from becoming sluggish when the barycentric coordinates $t_v(x)$ are all near zero. On the other hand, this does not prevent the mapping $f_m(v)$ of $v$ from becoming stuck in its current position when each $t_v(x)$ is zero (regardless of the value of $\lambda$), since $S_v$ consists only of those $x \in P$ for which $p_x$ lies on the interior of the image of some simplex that contains $v$ (meaning those $p_x$ on the boundary faces opposite to $v$ are ignored). There are advantages and disadvantages to this. On the positive side, higher dimensional simplices are often prevented from being collapsed onto lower dimensional linear patches of data, but this can also prevent simplices from being fitted in situations where it would be desired. For example, when fitting a $K$ homeomorphic to a sphere to data sampled from a sphere (Figure LABEL:LF8), craters can form and remain in subsequent iterations if data points surrounding the crater are in fact nearest to points lying on its rim. This issue can be circumvented (but our advantages reversed!) by taking the slightly larger set
$$S'_v \;=\; \{\, x \in P \;:\; p_x \in f_m(\sigma) \text{ for some simplex } \sigma \text{ containing } v \,\}$$
in place of $S_v$, together with $\lambda > 0$.
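The update can be sketched in the simplest nontrivial case, a single 1-simplex mapped to a segment. This is our own reconstruction under stated assumptions: every data point contributes (the larger set $S'_v$ with $\lambda > 0$), the weight on an endpoint is $(1-\lambda)t_v + \lambda$, and the function name `fit_segment_step` is hypothetical:

```python
import numpy as np

def fit_segment_step(a, b, P, lam=0.05):
    """One vertex-update iteration for K a single 1-simplex mapped to
    the segment [a, b] in R^n.  Each data point x pulls its nearest
    point p_x on the segment; the pull felt by an endpoint v is scaled
    by (1 - lam) * t_v + lam, with t_v the barycentric coordinate of
    p_x with respect to v."""
    a, b, P = map(np.asarray, (a, b, P))
    d = b - a
    # barycentric coordinate t of the nearest point p_x = (1 - t) a + t b
    t = np.clip((P - a) @ d / (d @ d), 0.0, 1.0)
    p = a + t[:, None] * d                       # nearest points p_x
    w_a = (1 - lam) * (1 - t) + lam              # weight felt by endpoint a
    w_b = (1 - lam) * t + lam                    # weight felt by endpoint b
    new_a = a + np.mean(w_a[:, None] * (P - p), axis=0)
    new_b = b + np.mean(w_b[:, None] * (P - p), axis=0)
    return new_a, new_b
```

With `lam=0.0` this reduces to averaging the barycentrically weighted pulls $t_v(x)(x - p_x)$; iterating the step drags the segment toward the data.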
There is a straightforward intuition underlying Algorithm 1. Consider the case where $f_m$ at the $m$-th iteration is a linear embedding. Then $K$ can be identified with the subspace $f_m(K) \subseteq \mathbb{R}^n$, and Algorithm 1 in essence has $f_m(K)$ attract towards $P$ by having each point $x \in P$ exert a pull on a nearest point $p_x \in f_m(K)$. The caveat here is that in doing so the fitting must be kept linear, so the net effect of this pull must come down to its influence on the individual vertices of the simplex whose image contains $p_x$. The influence on each of these vertices $v$ should in turn decrease with some measure of distance of $p_x$ to $f_m(v)$. This is analogous to pulling on a string attached at some point along a perpendicular uniform rod floating in space, in which case the acceleration of an endpoint of the rod in the direction of the pull decreases with increasing distance from the string. Since the size or shape of the simplex in our context is irrelevant, the distance from $p_x$ to $f_m(v)$ is measured in terms of the barycentric coordinate $t_v(x)$ of the preimage of $p_x$ at $v$. In particular, if $p_x$ lies near the image of a boundary simplex opposite to $v$, then $t_v(x)$ is near $0$ and $x$ has little influence on $f_m(v)$, while $x$ has full influence on $f_m(v)$ when $p_x = f_m(v)$, or equivalently, when $t_v(x) = 1$. The accumulation of these pulling forces on a vertex $v$ leads us to take the centroid of the points $f_m(v) + t_v(x)(x - p_x)$ over all $x \in S_v$; equivalently, over all $x \in P$ that are closest to a point lying on the interior of the image of a simplex that has $v$ as a vertex.
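The barycentric coordinates that govern this influence can be computed by solving a small affine system. A minimal sketch, with the helper name `barycentric` our own:

```python
import numpy as np

def barycentric(simplex, x):
    """Barycentric coordinates of x with respect to the vertices of a
    full-dimensional simplex (one vertex per row), solved from the
    constraints sum_i t_i v_i = x and sum_i t_i = 1."""
    V = np.asarray(simplex, dtype=float)
    A = np.vstack([V.T, np.ones(len(V))])   # stack the affine constraint
    b = np.append(np.asarray(x, dtype=float), 1.0)
    return np.linalg.solve(A, b)
```

For the triangle with vertices $(0,0)$, $(1,0)$, $(0,1)$: the point $(0.5, 0.5)$ lies on the face opposite the first vertex, so its first coordinate is $0$ (no influence), while at the vertex $(0,0)$ itself the first coordinate is $1$ (full influence), matching the discussion above.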
When $K$ is $0$-dimensional – namely, a collection of disjoint vertices – there is only a single barycentric coordinate $t_v(x) = 1$ for each $v \in V$ and $x \in S_v$, so Algorithm 1 reduces to the classical k-means algorithm with initial cluster centers the $f_0(v)$'s. Algorithm 1 is thereby a higher dimensional non-discrete generalization of k-means clustering. This is perhaps in the same spirit as persistent homology being a higher dimensional generalization of hierarchical clustering (also by way of simplicial complexes).
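In the $0$-dimensional case the update specializes as follows. This sketch (with our own name `zero_dim_step`) makes the reduction explicit: each $p_x$ is the nearest mapped vertex, $t_v(x) = 1$, and each vertex moves to the centroid of its cluster $S_v$, which is exactly one Lloyd iteration of k-means:

```python
import numpy as np

def zero_dim_step(centers, P):
    """One fitting iteration when K is 0-dimensional: assign each data
    point to its nearest mapped vertex, then move each vertex to the
    centroid of its cluster S_v (a vertex with empty S_v stays put)."""
    centers, P = np.asarray(centers, float), np.asarray(P, float)
    nearest = np.argmin(((P[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    return np.array([P[nearest == i].mean(axis=0) if np.any(nearest == i)
                     else c for i, c in enumerate(centers)])
```

Iterating `zero_dim_step` from the initial centers $f_0(v_i)$ is precisely the classical k-means algorithm.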
2.2. Preserving embeddings
To obtain our fitting, why not simply apply the k-means algorithm to the vertices $V$? This would likely be a poor fitting of $P$ since the arrangement of simplices comprising $K$ is ignored completely. Moreover, the result would probably not be an embedding irrespective of the initial fitting $f_0$. Take for example the case where $K$ is the following four-vertex graph embedded in