, most of them assume concurrent access to all data being clustered. Our interest is in efficiently clustering each datum as it becomes available, for applications that require unsupervised learning in real time.
The Links approach is to estimate the probability distribution of each cluster from its current constituent vectors, to use those estimates to assign new vectors to clusters, and to update the estimated distributions with each added vector. The update step also revises past cluster assignments where the additional data indicates a change. This revision serves primarily to improve the internal model over time, since in typical online usage scenarios each cluster assignment is provided only once, at the time a new vector is made available.
Prior work by Straub et al. addressing online clustering of unit vectors employs a small-variance approximation and is applied to low-dimensional problems such as segmentation of surface normals in 3D. Our approach is complementary in that it uses a high-dimensional approximation, and has been applied to problems with relatively high variance.
2.1 Generative model for a cluster
Let $\{v_i\}$ be a set of unit-length vectors in $\mathbb{R}^N$. They are confined to the submanifold $S^{N-1}$, and to determine proximity for the purpose of clustering these vectors, we will use the natural metric on this submanifold, which is simply the angle between vectors:
$$\theta(v_1, v_2) = \cos^{-1}(v_1 \cdot v_2) \tag{1}$$
We address the problem of cluster distributions within this submanifold with the following properties:
Each cluster has a center vector $v_c$, and its member vectors are generated by a probability density $p(\theta)$ that is isotropic in the sense that it depends only on the distance from the center, $\theta = \theta(v, v_c)$.
The function $p$ is the same for every cluster, so that probability densities for different clusters are related by isometry.
$p(\theta)$ decreases exponentially with $\theta^2$; for example, as a Gaussian suitably normalized on $S^{N-1}$:
$$p(\theta) \propto e^{-\theta^2 / 2\sigma^2} \tag{2}$$
This ensures that the distribution is reasonably localized, since the exponential decrease compensates for a polynomial factor in the marginal distribution of $\theta$:
$$P(\theta) = A_{N-2} \, \sin^{N-2}\theta \; p(\theta) \tag{3}$$
where $A_{N-2}$ is a constant equal to the hypersurface area of $S^{N-2}$,
$$A_{N-2} = \frac{2 \pi^{(N-1)/2}}{\Gamma\left(\frac{N-1}{2}\right)} \tag{4}$$
The prior distribution for the center of a cluster is constant on $S^{N-1}$ (no unit vector is preferred).
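The localization claim can be checked numerically. The sketch below is not part of the original method: it evaluates the unnormalized marginal density $\sin^{N-2}\theta \, e^{-\theta^2/2\sigma^2}$ on a grid, assuming the Gaussian form of $p(\theta)$. The values $N = 256$ and $\sigma = 0.05$ are arbitrary illustrative choices, and all function names are hypothetical.

```python
import math


def log_marginal(theta, n_dim, sigma):
    """Unnormalized log marginal density of theta (Gaussian p assumed)."""
    return (n_dim - 2) * math.log(math.sin(theta)) - theta ** 2 / (2 * sigma ** 2)


def marginal_on_grid(n_dim, sigma, steps=20000):
    """Return grid midpoints over (0, pi) and normalized probability weights."""
    thetas = [math.pi * (i + 0.5) / steps for i in range(steps)]
    logs = [log_marginal(t, n_dim, sigma) for t in thetas]
    peak = max(logs)  # subtract the peak before exponentiating to avoid underflow
    weights = [math.exp(value - peak) for value in logs]
    total = sum(weights)
    return thetas, [w / total for w in weights]


def mode_and_mass(n_dim, sigma, half_width=0.1):
    """Mode of the marginal and the probability mass within half_width of it."""
    thetas, weights = marginal_on_grid(n_dim, sigma)
    best = max(range(len(weights)), key=weights.__getitem__)
    mode = thetas[best]
    mass = sum(w for t, w in zip(thetas, weights) if abs(t - mode) <= half_width)
    return mode, mass


# Illustrative values only: N = 256 and sigma = 0.05 are assumptions.
mode, mass = mode_and_mass(n_dim=256, sigma=0.05)
print(f"mode = {mode:.3f} rad, mass within 0.1 rad of the mode = {mass:.4f}")
```

With these assumed values, nearly all of the probability mass sits in a narrow band of angles around a single nonzero mode, which is the behavior that the high-dimensional approximation later relies on.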
2.2 Estimated distribution
Given a set $\{v_i\}$ chosen randomly from the same cluster, but without knowledge of the center of the cluster, we would like to estimate the cluster’s probability distribution. The likelihood of the center value $v_c$ is
$$L(v_c) = \prod_i p(\theta(v_i, v_c)) \tag{5}$$
Since the prior is constant, the posterior is also proportional to the expression in equation 5. The maximum likelihood (and maximum a posteriori) center is therefore
$$\hat{v}_c = \arg\max_{v_c} \prod_i p(\theta(v_i, v_c)) \tag{6}$$
which is the same as the centroid of the vectors as defined for a hypersphere according to Buss and Fillmore. The estimated probability distribution for the cluster is
The probability that a new vector belongs to the same cluster can then be estimated as the cumulative amount
2.3 High-dimensional approximation
Our primary interest is in problems of relatively large dimension $N$, as is the case for our typical embedding vectors. For large enough $N$, the following are true:
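- Lemma 1
Two randomly chosen vectors $v_1$ and $v_2$ are almost always almost perpendicular, i.e.,
$$\Pr\left(|v_1 \cdot v_2| < \delta\right) > 1 - \epsilon$$
for some positive numbers $\epsilon \ll 1$ and $\delta \ll 1$.
- Lemma 2
The angle between a cluster center $v_c$ and a random vector $v$ from that cluster is almost always almost equal to a global constant $\theta_c$, i.e.,
$$\Pr\left(|\theta(v, v_c) - \theta_c| < \delta\right) > 1 - \epsilon$$
for some positive numbers $\epsilon \ll 1$ and $\delta \ll 1$.
- Lemma 3
Given two randomly chosen vectors $v_1$ and $v_2$ from a cluster with center $v_c$, their components perpendicular to $v_c$, $v_i^{\perp} = v_i - (v_i \cdot v_c)\, v_c$, will almost always be almost perpendicular to each other, i.e.,
$$\Pr\left(|v_1^{\perp} \cdot v_2^{\perp}| < \delta\right) > 1 - \epsilon$$
for some positive numbers $\epsilon \ll 1$ and $\delta \ll 1$.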
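Lemma 1 is easy to probe empirically. The following sketch, which is an illustration rather than a proof, draws pairs of uniformly random unit vectors and compares the mean absolute cosine similarity in a low dimension with that in a high dimension. The dimensions 3 and 256, the sample size, and the helper names are all assumptions made for this example.

```python
import math
import random


def random_unit_vector(n_dim, rng):
    """Draw a vector uniformly from the unit sphere by normalizing Gaussian samples."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n_dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_abs_cosine(n_dim, pairs=200, seed=0):
    """Average |v1 . v2| over independently drawn pairs of random unit vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = random_unit_vector(n_dim, rng)
        b = random_unit_vector(n_dim, rng)
        total += abs(sum(x * y for x, y in zip(a, b)))
    return total / pairs


low = mean_abs_cosine(n_dim=3)     # low-dimensional reference
high = mean_abs_cosine(n_dim=256)  # assumed "large" dimension
print(f"mean |cos| in 3 dimensions:   {low:.3f}")
print(f"mean |cos| in 256 dimensions: {high:.3f}")
```

The high-dimensional average is much closer to zero than the low-dimensional one, consistent with random vectors being almost always almost perpendicular.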
To assess whether to add a new vector to an existing cluster known to include the vectors , we determine a threshold
on the cosine similarity between the new vector and the centroid of the existing vectors. Using the approximation in lemmas 2 and 3, and assuming , we can compute vector components in an orthonormal basis including , and . This yields
and a threshold of
where , which we call the cluster similarity threshold.
which confirms that as we accumulate more vectors in a given cluster, the center and cosine similarity threshold of the estimated distribution approach the center and cosine similarity threshold of the generative distribution (i.e., the estimate improves). Since the threshold is a strictly increasing function of the number of vectors $k$ in the cluster, the variance of the estimated distribution decreases with $k$.
Similarly, to assess whether two clusters are the same, we determine a threshold on the cosine similarity between their centroids where, for and ,
Note that equation 13 is the special case with ,
The latter confirms that the centers estimated from the two sets of cluster points converge.
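The limiting behavior described in this section can be reproduced with a short calculation. The sketch below is a derivation under stated assumptions and may differ in form from the source's equations 13 and 16. It idealizes lemmas 2 and 3 as exact, so that each member is $\cos\theta_c \, v_c + \sin\theta_c \, u_i$ with mutually orthogonal unit vectors $u_i$ perpendicular to $v_c$. It takes the centroid to be the normalized vector sum, and it assumes the cluster similarity threshold corresponds to $T_c = \cos^2\theta_c$. Under those assumptions the cosine similarity between the centroids of $k$ and $k'$ members works out to $T_c / \sqrt{(T_c + (1 - T_c)/k)(T_c + (1 - T_c)/k')}$. The function names are hypothetical.

```python
import math


def centroid_similarity(k1, k2, t_c):
    """Idealized cosine similarity between centroids of k1 and k2 cluster members.

    Assumes lemmas 2 and 3 hold exactly and t_c = cos(theta_c) ** 2.
    """
    if k1 < 1 or k2 < 1:
        raise ValueError("each set needs at least one vector")
    if not 0.0 < t_c < 1.0:
        raise ValueError("t_c must lie strictly between 0 and 1")
    return t_c / math.sqrt((t_c + (1.0 - t_c) / k1) * (t_c + (1.0 - t_c) / k2))


def vector_to_centroid_similarity(k, t_c):
    """Special case: a single new vector against the centroid of k members."""
    return centroid_similarity(k, 1, t_c)


t_c = 0.25  # arbitrary illustrative value, i.e. cos(theta_c) = 0.5
for k in (1, 2, 10, 1000):
    print(k, round(vector_to_centroid_similarity(k, t_c), 4),
          round(centroid_similarity(k, k, t_c), 4))
```

In this idealization a single pair of members has similarity $T_c$, the single-vector case grows monotonically toward $\sqrt{T_c} = \cos\theta_c$ as $k$ increases, and the similarity between two large sets approaches 1, matching the convergence statements in the text.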
3.1 Online clustering
Each new input vector is assigned to a cluster as soon as it is produced, with no knowledge of future vectors and no backtracking. A unique ID for that cluster is returned. The clusterer keeps statistical information about the vectors received so far. Although it cannot change a previous answer, it can change the internal representation of cluster statistics, such as improvements to estimated distributions as well as cluster splits and merges when indicated by new information.
3.2 Internal representation
The Links algorithm’s internal representation is a two-level hierarchy: clusters are collections of subclusters, and subclusters are collections of input vectors. The subclusters are represented as nodes in a graph whose edges join ‘nearby’ nodes (meaning subclusters that likely belong to the same cluster given the data so far), and clusters are defined as connected components of the graph. Whereas subclusters are indivisible, clusters can be split along graph edges in response to changes in subcluster estimated probability distributions as new data is added. Conversely, subclusters joined by an edge can be merged in response to such changes.
The reasons for maintaining this two-level hierarchy (rather than, say, an arbitrary number of levels) are efficiency and practicality. It is efficient because the algorithm scales with the number of subclusters rather than the number of vectors. It is practical because the key cluster substructure that can affect future cluster IDs is the set of potential split points.
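The graph view of this hierarchy can be sketched in a few lines. The class below is a minimal illustration, not the authors' implementation: nodes stand for subclusters, edges join nearby subclusters, and clusters are read off as connected components. The class and method names are hypothetical.

```python
from collections import defaultdict


class SubclusterGraph:
    """Nodes are subcluster IDs; clusters are the connected components."""

    def __init__(self):
        self.nodes = set()
        self.edges = defaultdict(set)  # node ID -> set of neighboring node IDs

    def add_node(self, node):
        self.nodes.add(node)

    def add_edge(self, a, b):
        self.nodes.update((a, b))
        self.edges[a].add(b)
        self.edges[b].add(a)

    def remove_edge(self, a, b):
        self.edges[a].discard(b)
        self.edges[b].discard(a)

    def clusters(self):
        """Return connected components as a list of sets, ordered by smallest ID."""
        seen, components = set(), []
        for start in sorted(self.nodes):
            if start in seen:
                continue
            component, stack = set(), [start]
            while stack:  # iterative depth-first search
                node = stack.pop()
                if node in component:
                    continue
                component.add(node)
                stack.extend(self.edges[node] - component)
            seen |= component
            components.append(component)
        return components


graph = SubclusterGraph()
graph.add_edge(0, 1)
graph.add_edge(1, 2)
graph.add_node(3)
before = graph.clusters()  # subclusters 0, 1, 2 form one cluster; 3 is alone
graph.remove_edge(1, 2)    # removing an edge splits the cluster at that point
after = graph.clusters()
print(before, after)
```

Because a split can only happen where an edge is removed, the edges are exactly the potential split points mentioned above.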
3.3 Assessing cluster membership
When a new vector $v$ is available, compute its cosine similarity to each (unit-length) subcluster centroid $c_i$, and add it to the most-similar subcluster if the similarity is above a fixed threshold $T_s$. In other words, let
$$j = \arg\max_i \, (v \cdot c_i); \quad \text{if} \quad v \cdot c_j \ge T_s \tag{20}$$
then add $v$ to subcluster $j$. $T_s$, called the subcluster similarity threshold, is a hyperparameter determining the granularity of cluster substructure appropriate for the data.
If inequality 20 does not hold, then start a new subcluster containing just . Next, use the estimated probability distribution of subcluster to determine whether to include the new subcluster in the same cluster as , by thresholding the cumulative probability in expression 8. In the high-dimensional approximation, this means the subcluster is included in the cluster whenever
where is the number of vectors in the subcluster . To a first approximation, is as given in equation 13. This will be further refined in section 3.5. If inequality 21 does hold, then add an edge to the graph joining the new subcluster to subcluster .
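The membership test of this section can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: centroids are approximated by the normalized vector sum, the cluster-level threshold uses an idealized first approximation derived from lemmas 2 and 3 (with the cluster similarity threshold `t_c` taken as the expected similarity of a single pair of members), and the refinements of section 3.5 are omitted. All names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def first_order_threshold(k, t_c):
    """Assumed first approximation of the threshold for a subcluster of k vectors."""
    return t_c / math.sqrt(t_c + (1.0 - t_c) / k)


def assign(v, subclusters, edges, t_s, t_c):
    """Place unit vector v and return the index of the subcluster that received it.

    subclusters: list of {'sum': vector sum, 'count': number of vectors}.
    edges: set of frozenset({i, j}) pairs joining subclusters of the same cluster.
    """
    if not subclusters:
        subclusters.append({"sum": list(v), "count": 1})
        return 0
    sims = [dot(v, normalize(s["sum"])) for s in subclusters]
    j = max(range(len(sims)), key=sims.__getitem__)  # most-similar subcluster
    if sims[j] >= t_s:  # close enough: absorb v into subcluster j
        target = subclusters[j]
        target["sum"] = [a + b for a, b in zip(target["sum"], v)]
        target["count"] += 1
        return j
    subclusters.append({"sum": list(v), "count": 1})  # otherwise start a new one
    new = len(subclusters) - 1
    if sims[j] >= first_order_threshold(subclusters[j]["count"], t_c):
        edges.add(frozenset((j, new)))  # same cluster, different subcluster
    return new


subclusters, edges = [], set()
inputs = [[1.0, 0.0, 0.0], normalize([0.99, 0.1, 0.0]),
          normalize([0.7, 0.7, 0.0]), [0.0, 0.0, 1.0]]
ids = [assign(v, subclusters, edges, t_s=0.9, t_c=0.5) for v in inputs]
print(ids, edges)
```

In this toy run the second vector joins the first subcluster, the third starts a new subcluster linked to the first by an edge, and the fourth starts an unlinked subcluster, that is, a new cluster.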
3.4 Updating clusters
When a new vector is added to an existing subcluster, the subcluster’s centroid may change. If this brings it within the subcluster similarity threshold of the centroid of another subcluster currently joined to the first by an edge, then the two are merged. In other words, if $c_i \cdot c_j \ge T_s$, where $c_i$ and $c_j$ are the unit-length centroids of the two subclusters and $T_s$ is the subcluster similarity threshold, then nodes $i$ and $j$ are replaced with a single node containing the vectors of both, and with the edge connections of both. Since the merging process also results in a new subcluster centroid, this check is continued recursively on affected subclusters.
Next, the edges joining affected nodes are checked for validity. The edge joining subclusters and is removed if the following does not continue to hold:
where is approximately as given in equation 16, but with improvements to follow in section 3.5. After severing a cluster in two by removing an edge, an attempt is made to re-join the two parts by adding an edge from the affected node to a new partner node that does satisfy inequality 22. If no such partner is found, then the cluster remains permanently split.
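The two update operations, merging nodes and validating edges, can be sketched as below. This is an illustration under assumptions rather than the authors' implementation: subclusters are summarized by their vector sums and counts, the edge threshold is an idealized first approximation derived from lemmas 2 and 3 without the adjustments of section 3.5, and the recursive re-checking and re-joining steps are omitted. All names are hypothetical.

```python
import math


def pair_threshold(k_i, k_j, t_c):
    """Assumed first approximation of the centroid-to-centroid threshold."""
    return t_c / math.sqrt((t_c + (1.0 - t_c) / k_i) * (t_c + (1.0 - t_c) / k_j))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def edge_holds(sums, counts, i, j, t_c):
    """True if the edge between subclusters i and j is still justified."""
    return cosine(sums[i], sums[j]) >= pair_threshold(counts[i], counts[j], t_c)


def merge_nodes(adjacency, sums, counts, keep, drop):
    """Fold node `drop` into node `keep`, pooling vectors and edge connections."""
    sums[keep] = [a + b for a, b in zip(sums[keep], sums[drop])]
    counts[keep] += counts[drop]
    for neighbor in adjacency.pop(drop):
        adjacency[neighbor].discard(drop)
        if neighbor != keep:  # re-attach the dropped node's other neighbors
            adjacency[neighbor].add(keep)
            adjacency[keep].add(neighbor)
    del sums[drop], counts[drop]


# Toy state: three subclusters in a chain 0 - 1 - 2.
adjacency = {0: {1}, 1: {0, 2}, 2: {1}}
sums = {0: [2.0, 0.2], 1: [1.0, 0.3], 2: [0.0, 1.0]}
counts = {0: 2, 1: 1, 2: 1}

valid_01 = edge_holds(sums, counts, 0, 1, t_c=0.5)
valid_12 = edge_holds(sums, counts, 1, 2, t_c=0.5)
merge_nodes(adjacency, sums, counts, keep=0, drop=1)
print(valid_01, valid_12, adjacency, counts)
```

In a fuller implementation the merge would be followed by a fresh validity check on every edge of the merged node, since its centroid has moved.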
Equations 13 and 16 were used to determine thresholds for membership in the same cluster as a given subcluster, effectively treating the subcluster’s members as randomly chosen from the cluster and not correlated with each other. If one were to properly take into account intra-subcluster correlations, then one consequence is that the limit in equation 18 would be reduced to a positive number , which we call the pair similarity maximum,
whereas the value of , which is , would remain unchanged. Any implicit anisotropy in the cluster distribution, such as an elongation along a preferred axis, will further reduce the value of without changing . A simple though approximate way to incorporate these adjustments into the algorithm is to replace and
by the following interpolated versions:
3.6 Hyperparameter Tuning
The similarity thresholds $T_s$, $T_c$, and $T_p$ (the subcluster similarity threshold, the cluster similarity threshold, and the pair similarity maximum) need to be tuned to best represent the data source. This is done by manually labeling a dataset with cluster IDs, running the clusterer on the data, and adjusting the hyperparameters to improve the accuracy of the output cluster IDs. Accuracy is simply the fraction of correct IDs. Prior to evaluation, the Hungarian algorithm (Kuhn, 1955) is used to map a subset of output cluster IDs bijectively to a subset of ground truth cluster IDs in the way that produces the best possible accuracy. For some applications an alternate objective has been used; for example, one that gives different weights to conflating IDs and to fracturing IDs, to reflect the seriousness of each type of error in practice.
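The accuracy measure can be illustrated with a small example. The sketch below deliberately swaps the Hungarian algorithm for a brute-force search over all one-to-one mappings, which gives the same optimum but is only practical for a handful of cluster IDs; a real evaluation would use an assignment solver such as `scipy.optimize.linear_sum_assignment`. The function name and the sample labels are hypothetical.

```python
from itertools import permutations


def best_match_accuracy(predicted, truth):
    """Fraction of correct IDs under the best one-to-one ID mapping (brute force)."""
    if not predicted or len(predicted) != len(truth):
        raise ValueError("need two non-empty label sequences of equal length")
    overlap = {}
    for p, t in zip(predicted, truth):
        overlap[(p, t)] = overlap.get((p, t), 0) + 1
    pred_ids, true_ids = sorted(set(predicted)), sorted(set(truth))
    # Map every ID of the smaller set to a distinct ID of the larger set.
    if len(pred_ids) <= len(true_ids):
        small, large, swap = pred_ids, true_ids, False
    else:
        small, large, swap = true_ids, pred_ids, True
    best = 0
    for chosen in permutations(large, len(small)):
        matched = 0
        for s, l in zip(small, chosen):
            key = (l, s) if swap else (s, l)
            matched += overlap.get(key, 0)
        best = max(best, matched)
    return best / len(predicted)


predicted = ["a", "a", "b", "b", "c"]  # clusterer output IDs
truth = [1, 1, 2, 2, 2]                # manually labeled IDs
accuracy = best_match_accuracy(predicted, truth)
print(accuracy)
```

Here the last vector was fractured into its own cluster, so four of the five IDs can be matched under the best mapping.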
The authors would like to thank Dr. Brian Budge and Dr. Navid Shiee for help with APIs and evaluation frameworks used in the implementation of the Links algorithm.
- Brian Everitt, Cluster Analysis, John Wiley & Sons, 2011.
- Christian Hennig, Marina Meila, Fionn Murtagh, and Roberto Rocci, Handbook of Cluster Analysis, Chapman and Hall/CRC, December 2015.
- Julian Straub, Trevor Campbell, Jonathan P. How, and John W. Fisher, “Small-variance nonparametric clustering on the hypersphere,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 334–342.
- F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 815–823.
- Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, “Generalized end-to-end loss for speaker verification,” arXiv preprint arXiv:1710.10467, 2017.
- Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno, “Speaker diarization with lstm,” arXiv preprint arXiv:1710.10468, 2017.
- Samuel R. Buss and Jay P. Fillmore, “Spherical averages and applications to spherical splines and interpolation,” ACM Transactions on Graphics, vol. 20, no. 2, pp. 95–126, 2001.
- Harold W. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, pp. 83–97, 1955.