 # Maximum Likelihood Estimation for Single Linkage Hierarchical Clustering

We derive a statistical model for estimation of a dendrogram from single linkage hierarchical clustering (SLHC) that takes account of uncertainty through noise or corruption in the measurements of separation of data. Our focus is on just the estimation of the hierarchy of partitions afforded by the dendrogram, rather than the heights in the latter. The concept of estimating this "dendrogram structure" is introduced, and an approximate maximum likelihood estimator (MLE) for the dendrogram structure is described. These ideas are illustrated by a simple Monte Carlo simulation that, at least for small data sets, suggests the method outperforms SLHC in the presence of noise.

Comments

There are no comments yet.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Distance-based clustering is the task of grouping objects by some measure of similarity, so that objects in the same group (or cluster

) are more similar or closer (with respect to a prescribed notion of distance) than those in different clusters. Clustering is a common technique for statistical data analysis, widely used in data mining, machine learning, pattern recognition, image analysis, bioinformatics and cyber security.

Conventional (“flat”, “hard”) clustering methods accept a finite metric space as input and return a partition of as their output. Hierarchical clustering (HC) methods have a different philosophy: their output is an entire hierarchy of partitions, called a dendrogram, capable of exhibiting multi-scale structure in the data set [1, 2]. Rather than fixing the required number of clusters in advance, as is common for many flat clustering algorithms, it is more informative to furnish a hierarchy of clusters, providing an opportunity to choose a partition at a scale most natural for the context of the task at hand.

Many HC methods require linkage functions to provide a measure of dissimilarity between clusters (see [3, 4] for a fairly recent review). Some commonly used linkage functions are single linkage, complete linkage, average linkage, etc. The SLHC method, though suffering from the so called “chaining effect”, remains popular for large scale applications  because of the low complexity of implementing it using minimum spanning trees (MST) . This work relies on a particularly useful representation of dendrograms using ultra-metrics, introduced by Jardine and Sibson . Their point of view enabled redefinition of HC methods as maps from the collection of finite metric spaces to the collection of finite ultra-metric spaces [1, 8]. This enables a discussion of two essential properties  —  stability and convergence with respect to the Gromov-Hausdorff metric  —  that characterize SLHC within a broad class of HC methods .

Motivation: As described in , distance-based clustering methods, hierarchical as well as flat and overlapping, are deeply rooted in several mathematical disciplines, and are ubiquitous in bio-informatics applications. For example, in contemporary applications to the analysis of gene expression data [10, 11, 12], the raw data generated by microarrays is usually preprocessed to extract normalized expression values from which distance measures are computed, to be subsequently fed as input to a clustering algorithm. Depending on the kind of information sought, different variants of the conventional HC methods are applied, such as, for instance, hybrid HC  or improved Pearson correlation proximity-based HC .

More generally, HC methods play an important role wherever learning and analysis of data have to be performed in an unsupervised fashion. For example, clustering is a key underpinning technology in most algorithms for cyber-security. In this context, clustering arises in a large number of applications, including malware detection , identification of compromised domains in DNS traffic , classification of sources and methods of attacks , identification of hacker and cyber-criminal communities , detection of repackaged code in applications for mobile devices , and classification of files and documents . There is an urgent need for more robust and reliable clustering algorithms.

Essentially all approaches to clustering, hierarchical or otherwise, accept the distances as “the truth”. It is assumed uncorrupted by noise or artifacts. Particularly at the level of analogue data such as timing and device dependent parameters, but also even with some digital data, this is far from a correct model. For example, measures of dissimilarity between code samples are often engineered to reflect an opinion of the algorithm designer regarding the significance of specific features of executable code; it is more plausible to treat distance measures produced in this way as (quantifiably) uncertain measurements of the code sample rather than regard them as an objective truth.

Thus, practical necessities lead us to require that the output of clustering algorithms should account for uncertainty in the distance data and, to do that, a rigorous statistical approach is required. Obtaining dendrograms through statistical estimation (with an appropriate noise model for the data) will, in principle, result in improved HC methods to meet the needs of applications.

Conventional approaches to statistical estimation of partitions and hierarchies view the objects to be clustered as random samples of certain distributions over a prescribed geometry (e.g. Gaussian mixture model estimation using expectation-maximization in Euclidean spaces), and clusters can then easily be described in terms of their most likely origin. Thus, these are really

distribution-based clustering methods  —  not distance-based ones. Our approach directly attributes uncertainty to the process of obtaining values for the pairwise distances rather than distort the data by mapping it into one’s “favorite space”. To the best of our knowledge, very little work has been done in this vein. Of note is , where similar ideas have been applied to the estimation of spanning trees in a communication network (see related work below).

Related work: In phylogenetic applications, the use of MLE and Bayesian methods for the estimation of evolutionary trees is a time-honored tradition spanning decades [22, 23, 24, 25, 26, 27, 28, 29], and various clustering methods having been introduced for purposes of “phenetic clustering” [30, 31]. In a rough outline, one estimates a tree structure to describe a population of samples from distributions of the form , where is the evolutionary tree structure and the measurement is a gene character (such as gene or nucleotide frequency); alternatively, one assigns (deterministically!) distance measures to reflect uncertainties in the quantities measured in the population. Estimation relying only on the noise model of the underlying dissimilarity measure would clearly constitute a much more general apparatus, compressing all the uncertainty about the data into the noise model, but otherwise treating all data sources with equal mathematical rigor.

A serious hurdle in the way of brute-force MLE estimation of dendrogram structure is the super-exponential growth of the number of such structures with the size of the data set. Naturally, this aspect of the estimation problem is more readily seen in applications related to cyber-space, where large data sets dominate the scene. The work of Castro et al. [32, 21] needs to be credited for having inspired an MCMC-based hypothesis-pruning procedure we have applied in this paper. However, we point out that both the clustering and the estimation problems that are the foci of their work are quite different from ours, and much more limited in scope. First, and most important, is that Castro et al. restrict attention to similarities with constant inter-cluster values, which effectively corresponds to postulating an ultra-metric setting ab initio. This is well-suited for the purposes of their application (network topology identification), but is unsatisfactory for the general case. Secondly, the network model of Castro et al. is not, strictly speaking a metric model, as they do not enforce the constraints coming from the triangle inequalities. The notorious complexity [33, 34] of this set of inequalities poses significant additional challenges to the problem of estimating structure from a measurement of a metric.

## 2 Preliminaries

Distance-Based Clustering. Given a set of objects , a hard clustering method generates a partition of   —  a collection of pairwise-disjoint subsets (clusters) of whose union is . For any and any partition of , we denote the cluster of containing by .

We focus on distance-based clustering methods, where a data set undergoes initial processing to produce a symmetric, real-valued, non-negative function on , whose values satisfy the triangle inequality, and serve to quantify the “degree of dissimilarity” between data entries. In applications, the user has some freedom to determine the values of the according to the requirements of the application in hand.

Hierarchical Clustering. Attempts have been made to anchor distance-based clustering in a firm axiomatic foundation, but results so far have been negative:  studies a seemingly intuitive and minimalistic axiomatic system for distance-based clustering which fails to support any clustering method;  works in a similar vein to demonstrate that reasonable axiomatic notions of distance between outputs of a flat clustering method are equally elusive. As later described in , the key obstruction to such axiomatic approaches lies with the requirement to produce a single partition as its output. To resolve this issue they proposed HC methods, producing dendrograms.

Recall that a partition is said to refine a partition , if every cluster of is contained in a cluster of ; we denote this by . A hierarchy is a collection of partitions, where every satisfy either or . Intuitively, a dendrogram is a hierarchy with assigned heights, or resolutions; this is usually represented visually as a rooted tree  —  see Figure 1 (left). Figure 1: A rooted tree with labelled leaves as a dendrogram (left) and as an ultra-metric on O={1,…,5} (right)

Formally, following , we describe a dendrogram as a pair , where is a map of to the collection of partitions on satisfying the following:

• There exists such that for all ;

• If then refines ;

• For all , there exists s.t. for .

Clusters of are called clusters at resolution .

Encoding Dendrograms. Ultra-metrics provide a convenient tool for encoding dendrograms . Recall that a metric on is said to be an ultra-metric, if

 dxy⩽max(dxz,dzy),∀x,y,z∈O. (1)

The correspondence between dendrograms and ultra-metrics  is described as follows: any dendrogram gives rise to an ultra-metric , as shown in Figure 1:

 u(β)xy:=inf{r>0∣∣β(r)x=β(r)y}. (2)

Conversely, the dendrogram may be reconstructed from an ultra-metric by setting

 β(r)x:={y∈O∣∣uxy≤r}. (3)

Single Linkage Hierarchical Clustering (SLHC). Single-linkage hierarchical clustering is defined, from the point of view of dendrograms, as follows. Given the metric space , for each , a dendrogram is constructed by setting to lie in the same cluster of the partition if and only if there exists a finite sequence of points with , and for all . Such a sequence is called an -chain from to in .

SLHC is often implemented by constructing an MST in . The partition is obtained from any MST of by removing all edges of of length; the clusters of the corresponding dendrogram are the connected components of the resulting forest, and the corresponding ultra-metric distance () then equals the length of the longest edge of (with respect to the distance ) separating from in . Henceforth, we will write to denote the single linkage mapping of a metric to the ultra-metric encoding of the corresponding dendrogram .

It is a central result of , that SLHC is the unique hierarchical clustering method enjoying certain naturality properties, in stark contrast with the flat clustering situation. For a detailed discussion of the map , we refer the reader to [1, 2].

Notation. Note that metrics and ultra-metrics are conveniently written in matrix form, after fixing an order on . Thus, writing , we will use , to denote the metric in matrix form, and to denote the ultra-metric obtained from it by applying the map .

## 3 Statistical Model

It is useful to separate the metric information in a dendrogram (the grading by resolution) from the combinatorial information it conveys: a dendrogram/ultra-metric is uniquely represented by a pair of parameters , where:

• The parameter, , denotes the structure of : the hierarchy defined by (with the resolutions forgotten), ordered by refinement.

• The parameter, , is the

height vector of

, whose coordinates, in order, indicate the minimum resolution at which each partition in the structure occurs  —  see Figure 1.

In what amounts to a choice of scale, we focus attention on the subset of the space of all metrics satisfying for all . This is a compact convex set in . Restricting attention to is equivalent to placing a restriction on to lie in the set of all vectors satisfying . Note that coincides with the pre-image under of the set of all ultra-metrics with .

###### Remark 3.1.

It must be observed that degenerate structures; that is, structures containing fewer than partitions (or, equivalently, corresponding to dendrograms that are not binary trees), occur in a set of metrics of Lebesgue measure zero, and therefore do not have any effect on statistical considerations regarding SLHC. Other clustering algorithms, such as hierarchical -means [37, 38], for example, do not have this property and, therefore, require more delicate analysis.

Our statistical model, introduced in , is as follows:

• The measurement only depends on a metric through a specific distribution .

• The ultra-metric is the parameter to be estimated from , with unknown playing the rule of nuisance (latent) parameter.

• A reasonable assumption for this noise model is that the measurements of the different values of are sampled independently from the same parametrized distribution , :

 p(X|θ)=∏ijGθij(Xij); (4)

Following the recommendations of , we pick integrated likelihood for our method of eliminating the nuisance parameter . Given a measurement the likelihood function is:

 L(u;x):=p(x|u)=∫p(x|θ)p(θ|u)dθ=∫∏ijGθij(xij)p(θ|u)dθ (5)

In the context of our problem the support of is restricted to .

The height vector is implicit in the likelihood function, because of the complex integral, and so we focus on estimating the dendrogram structure , treating height vector as a nuisance parameter for this task. It is reasonable to assume that the structure and height functions are independent parameters. Replacing by , we obtain

 p(x|τ)=∫p(x|u)p(a)da=∫p(x|τ,a)p(a)da. (6)

The MLE dendrogram structure is given by

 ^τ(x)=arg max τp(x|τ). (7)

### 3.1 The likelihood of the metric given the data

Since entries of are measurements of distance, they are assumed to satisfy and, for the purposes of this paper, we assume each

follows a log-normal distribution

. Other noise distributions could be considered here. Therefore,

 p(x|θ)=∏1≤i

where and are distribution parameters. We model the relationship between the measurement and the true metric as follows: we require and , which sets the distribution parameters accordingly to be

 μij=lnθij−12ln(1+vθ2ij),σij=√ln(1+vθ2ij). (9)

### 3.2 The likelihood of the dendrogram given the metric

In the absence of assumptions on the latent parameters ,

is taken to be a uniform distribution on its support. This is justified by the fact that, considering the total weight/length of an MST as a natural energy functional on the space of metrics, it follows directly from (

13) below that a maximum entropy distribution (subject to a total energy constraint) on this space restricts to a uniform distribution on the pre-image of under the single linkage map. Hence:

 p(θ|u)=⎧⎪⎨⎪⎩1Vol(sl−1(u)) when θ∈sl−1(u)0 otherwise (10)

In more detail, following , denote

 MST(θ)={T∣∣T is an MST of θ}. (11)

For any spanning tree of , the complete graph with vertex set , write

 C(T):={θ∣∣θ is a metric on O with T∈MST(θ)} (12)

and observe that any two of the cones intersect in a set of measure zero. In addition, the following identity is proved in :

 sl−1(u)=⋃T∈MST(u)C(T,u) (13)

where is defined to be the set of all metrics in which coincide with on the edges of . In particular, any integration over decomposes as a sum of integrals over the relevant . It is easy to verify that each is a polytope given by the inequalities

 C(T,u):⎧⎪⎨⎪⎩θij−θik−θkj⩽0−θij⩽−uijθij⩽Uij (14)

where is defined to be the sum of the -lengths of all the edges separating from in the tree for all . Since membership in is easily verified using the inequalities in (14), this enables Monte-Carlo evaluation of integrals over these domains.

### 3.3 The Prior on the Space of Dendrograms

We will assume that the parameter is uniformly distributed in . As a result,

 p(a)=⎧⎪⎨⎪⎩1Vol(Ω) when a∈Ω0otherwise (15)

Thus, computing using Equation (6) requires the computation of integrals both over and over the polytopes . Existing techniques for these computations, especially the latter, are computationally extremely complex [41, 42, 43]. In this paper we resort to numerical approximation techniques to be introduced in the next section.

## 4 Monte Carlo integration

Monte Carlo integration  is applied here to approximate the integrals.

### 4.1 Equation (5)

Setting up the Approximation. The inner integral splits, by (13), as a sum of integrals computed separately over each polytope . More precisely, we enumerate the MSTs associated with , where may depend on , and set . Then becomes

 p(x|u)=∫sl−1(u)p(x|θ)p(θ|u)dθ=K∑k=1p(Tk|u)∫Ckp(x|θ)p(θ|Tk)dθ. (16)

Since is uniformly distributed in , is uniform in and equals .

For the Monte-Carlo approximation we draw samples of metrics uniformly from , with of these metrics drawn from each . Then the Monte-Carlo approximation of is and, since , the Monte Carlo integration carried out for (16) yields the expression:

 ϕ(a):=K∑k=1NkN⋅1Nk∑θ(l)∈Ckp(x|θ(l))=1NN∑l=1p(x|θ(l)). (17)

Drawing the Samples. According to Equation (14), lower and upper bounds on the value of a metric are, respectively,

 θmin=min({uij}),θmax=% max({Uij}). (18)

Uniformly sampling metrics from the box defined by these bounds, and keeping only those lying in as our samples for (17) is a straightforward approach, but fails to produce enough samples, because has measure zero in the space of all nonnegative symmetric dissimilarity matrices in . Instead, we draw metrics separately in each , as described below and then combine these counts:

• Step 1. Set when edge , then uniformly draw values between and for the other parameters;

• Step 2. Check the first constraint in (14) (triangle inequality) on each drawn vector, keep a certain number of these metrics;

• Step 3. Check the last two constraints in (14) for , and keep the metrics satisfying them.

### 4.2 Equation (6)

Setting up the Approximation. The Monte Carlo approximation for the likelihood function obtained from samples drawn uniformly from is:

 L(τ;x)≈∫Ωϕ(a)p(a)% da≈1NΩNΩ∑l=1ϕ(a(l)). (19)

Drawing the Samples. Uniformly drawing a vector from is not so straightforward a task. We observe that the uniform distribution on the standard simplex is a Dirichlet distribution  with parameter vector . This can be mapped to using the linear change of variables replacing the standard basis of by . Rewriting the vectors in this basis yields:

 a=n∑i=1γiwi,n∑i=1γi=1,γi∈[0,1]. (20)

Equivalently, for any , we can write it as for . This transformation is volume preserving, so that uniform sampling of is equivalent to uniform sampling of . This method produce samples of the required kind.

Algorithm 1 gives the pseudo-code for the numerical integration.

## 5 Reducing the Complexity

There are elements in the set of combinatorial types of dendrograms on particles . Denote the space of all such types by . The explosive growth of this set as a function of makes brute-force maximization over it a prohibitive task even for reasonably small values of . Nevertheless, our data (see Figures 35) indicates that structures of sufficiently low likelihood very rarely coincide with the target structure. The removal from consideration of such structures will result in little information loss for the outcome of MLE while significantly reducing computational cost. This suggests adaptation of a similar approach to that of  to produce an approximation of the MLE estimator: for a fixed measurement , we regard the likelihood as a distribution over up to a normalizing factor denoted by and draw a collection of metrics from using the Metropolis–Hastings (MH) algorithm  with target distribution . From the resulting collection of structures, we choose the subset of those structures appearing with highest frequencies, and run the computation from the previous section only for structures . (intuitively, as long as is a measurement of reasonable quality, more metrics in the set are close enough to the true metric in order for to support the same structure as does).

For our implementation of the MH algorithm, we choose a proposal distribution

 g(θold→θnew)=(n−1)n∏ij=121√2πσexp⎧⎪⎨⎪⎩−(θ% \text{new}ij−θoldij)22σ2⎫⎪⎬⎪⎭ (21)

with , keeping in mind the possibility that the sample might not be a metric, in which case the sample will be discarded (thus, is, in fact, zero in the complement of and is otherwise only proportional to the above expression).

As the proposal distribution is symmetric, for the acceptance probability we may use the Metropolis Choice:

 (22)

We recall that samples are generated iteratively. At each iteration, a new state is drawn from the proposal distribution for the current state. A real number is drawn uniformly at random from , and the new state is accepted if . Otherwise, the new state is rejected and the process remains in the same state. With additional burn-in and thinning, the iteration ends when a required number of metrics, , is obtained. Algorithm 2 summarizes the whole process.

## 6 Simulation

We demonstrate the effectiveness of the proposed hypothesis pruning process in the 5-particle case. A 5-point dendrogram may have any one of different dendrogram structures, which can be enumerated and indexed using the algorithm proposed in .

First, we randomly draw a structure from and a height vector from to construct an objective ultra-metric (or, equivalently, a dendrogram). Then a random metric is sampled from the pre-image (under ) of this ultra-metric. This serves as the ground-truth metric used later to generate one measurement with a specified noise level. Finally, we implement MH as in Algorithm 2 with this measurement for its input to obtain a sequence of metrics from the distribution . For each measurement of steps of Algorithm 2 are computed, including steps of burn-in, and applying a thinning of steps. The resulting output of observations is processed as follows.

1. For each structure in its observed appearance frequency is calculated from among the (the frequency is set to for structures which did not arise).

2. We subdivide the range into 20 bins of equal lengths, and generate a vector of 20 integers, indicating for each bin the number of structures occurring with a frequency in .

3. This binning produces a ranking of the structures, in descending order, according to the frequency of occurrence.

Figure 3 shows the averaged histogram of the output vectors generated from iid measurements of

. Figure 3 shows the corresponding distribution of the rank of the true structure.

The inset in Figure 3 provides an enlarged plot of the bars excluding the leftmost one: observe that more than 170 of the 180 possible structures have insignificant appearance ratios (in the interval ).

At the same time, Figure 3 shows that the rank of the true structure is almost completely distributed among the 10 highest ranked, with a nearly 70% chance of the true structure ranking first.

Figures 3 and  3 support our contention that most of the structures may be removed from consideration, and that with little chance of harm we may restrict attention to just a few of the highest ranking structures.

Figures 5 and 5 show our simulation results for noise with a standard deviation of . The majority of structures still appear with extremely low probabilities, though the rank of the true structure displays a more scattered distribution because of the increased noise. Figure 2: Average number of structures in each interval with std 0.1. Figure 4: Average number of structures in each interval with std 0.3.

To provide an idea of the overall performance of the proposed MLE estimator, Figure 6 compares the success rates of the MLE estimator with those of SLHC performed directly on the measurements. In this example, height vectors were generated uniformly and, for each height vector, a dendrogram with the indicated structure was created. For measurements, metrics were sampled from , and for each of them a measurement , drawn according to our data generation model with the prescribed standard deviation. For each , we first restricted attention to the most highly ranked structures. Finally, the MLE estimator and SLHC were run on and for each the frequency of instances of successful identification of the initial were recorded. These ratios were then averaged over the heights. Figure 6

indicates that, on average, the proposed MLE estimator has a better error performance than SLHC, especially so for higher measurement noise variance. Figure 6: Comparison of MLE and SLHC

## 7 Conclusion

This paper introduces a rigorous MLE approach to statistical estimation of SLHC under fairly general assumptions regarding the data generation process. Simulations with particles demonstrate that the current approach used in all applications of SLHC  —  calculating SLHC directly from measured data  —  is significantly outperformed by the MLE estimation method.

A clear weakness of our current approach is its computational complexity which, as presented here, increases very rapidly with data size. This is largely because of the increased number of MSTs and the problem of sampling metrics in high dimensions, even though mitigated by our introduction of MCMC to cull the vast majority of structures. Further reducing the population of “MLE-eligible” MSTs is necessary, and suitable models are being investigated by us. It is likely that some variant on Kruskal moves, such as , will play a useful role here in providing a means to navigate spanning trees more effectively.

We also plan to consider a mixture of “top-down” and agglomerative methods of hierarchical approaches to further reduce the complexity of finding the “top split” in a computationally feasible way, and proceeding recursively from there, thereby reducing the search space in a step-by-step fashion.

## Acknowledgements

This work was supported by the National Science Fund of China under Grant 61025006 and 61422114, and by the US Air Force Office of Science Research under a MURI FA9550-10-1-0567, and under FA9550-12-1-0418.

## References

•  Carlsson G, Mémoli F. Persistent clustering and a theorem of J. Kleinberg. arXiv preprint arXiv:08082241. 2008.
•  Carlsson G, Mémoli F. Characterization, stability and convergence of hierarchical clustering methods. J Mach Learn Res. 2010 Aug;11:1425–1470; Available from: http://dl.acm.org/citation.cfm?id=1756006.1859898.
• 

Everitt BS, Landau S, Leese M, Stahl D. Cluster analysis (5th edition). 2011.

•  Müllner D. Modern hierarchical agglomerative clustering algorithms. (preprint) http://arxivorg/abs/11092378. 2011;Available from: http://arxiv.org/abs/1109.2378.
•  Jain AK, Murty MN, Flynn PJ. Data clustering: A review. ACM Comput Surv. 1999 Sep;31(3):264–323; Available from: http://doi.acm.org/10.1145/331499.331504.
•  Gower JC, Ross GJS. Minimum spanning trees and single linkage cluster analysis. Applied Statistics. 1969;54–64.
•  N Jardine RS. Mathematical taxonomy. New York: John Wiley & Sons; 1971.
• 

Carlsson G, Mémoli F. Classifying clustering schemes. arXiv preprint arXiv:10115270. 2010.

•  Sturmfels B. Can biology lead to new theorems. Annual report of the Clay Mathematics Institute. 2005;13–26.
•  Shannon W, Culverhouse R, Duncan J. Analyzing microarray data using cluster analysis. Pharmacogenomics. 2003;4(1):41–52.
•  Butte A. The use and analysis of microarray data. Nature reviews drug discovery. 2002;1(12):951–960.
•  Levenstien MA, Yang Y, Ott J. Statistical significance for hierarchical clustering in genetic association and microarray expression studies. BMC bioinformatics. 2003;4(1):62.
•  Chipman H, Tibshirani R. Hybrid hierarchical clustering with applications to microarray data. Biostatistics. 2006;7(2):286–301.
•  Booma P, Prabhakaran S, Dhanalakshmi R. An improved pearson’s correlation proximity-based hierarchical clustering for mining biological association between genes. The Scientific World Journal. 2014.
•  Wang J, Miller D, Kesidis G. Efficient mining of the multidimensional traffic cluster hierarchy for digesting, visualization, and anomaly identification. IEEE Journal on Selected Areas in Communications. 2006 Oct;24(10):1929–1941.
• 

Biggio B, Rieck K, Ariu D, Wressnegger C, Corona I, Giacinto G, Roli F. Poisoning behavioral malware clustering. In Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop. ACM; 2014. p. 27–36.

•  Kang J, Zhang Y, Ju J. Classifying DDoS attacks by hierarchical clustering based on similarity. In Machine Learning and Cybernetics 2006 International Conference; Aug; 2006. p. 2712–2717.
•  Lu Y, Luo X, Polgar M, , Cao Y. Social network analysis of a criminal hacker community. Journal of Computer Information Systems. 2010;51(2):31–41.
•  Hanna S, Huang L, Wu E, Li S, Chen C, Song D. Juxtapp: A scalable system for detecting code reuse among android applications. In Proceedings of the 9th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment; Heraklion, Crete, Greece; DIMVA’12. Berlin, Heidelberg: Springer-Verlag; 2013. p. 62–81; Available from: http://dx.doi.org/10.1007/978-3-642-37300-8_4.
•  Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In KDD Workshop on Text Mining; 2000.
•  Castro RM, Member S, Coates MJ, Nowak RD. Likelihood based hierarchical clustering. IEEE Trans on Signal Processing. 2004;52:2308–2321.
•  Cavalli-Sforza LL EA. Phylogenetic analysis: Models and estimation procedures. Am J Hum Genet. 1967 May;19:233–257.
•  Edwards AW. Estimation of the branch points of a branching diffusion process. Journal of the Royal Statistical Society Series B (Methodological). 1970;155–174.
•  Felsenstein J. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Systematic Biology. 1973;22(3):240–249; Available from: http://sysbio.oxfordjournals.org/content/22/3/240.abstract.
•  Felsenstein J. Maximum-likelihood estimation of evolutionary trees from continuous characters. Am J Hum Genet. 1973;25:471–492.
•  Felsenstein J. Evolutionary trees from dna sequences: A maximum likelihood approach. Journal of Molecular Evolution. 1981;17(6):368–376; Available from: http://dx.doi.org/10.1007/BF01734359.
•  Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. Journal of Molecular Evolution. 1994;39(3):306–314; Available from: http://dx.doi.org/10.1007/BF00160154.
•  Rannala B, Yang Z. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. Journal of Molecular Evolution. 1996;43(3):304–311; Available from: http://dx.doi.org/10.1007/BF02338839.
•  Yang Z, Rannala B. Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo method. Molecular Biology and Evolution. 1997;14(7):717–724; Available from: http://mbe.oxfordjournals.org/content/14/7/717.abstract.
•  Farris JS. Estimating phylogenetic trees from distance matrices. American Naturalist. 1972;645–668.
•  Felsenstein J. Distance methods for inferring phylogenies: a justification. Evolution. 1984;16–24.
• 

Castro R, Nowak R. Likelihood based hierarchical clustering and network topology identification. Energy minimization methods in computer vision and pattern recognition. Springer Berlin Heidelberg; 2003.

•  Deza A, Fukuda K, Pasechnik DV, Sato M. On the skeleton of the metric polytope. In Revised Papers from the Japanese Conference on Discrete and Computational Geometry; JCDCG ’00. London, UK: Springer-Verlag; 2001. p. 125–136; Available from: http://dl.acm.org/citation.cfm?id=646320.686393.
•  Deza MM, Laurent M. Geometry of cuts and metrics. Vol. 15 of Algorithms and Combinatorics. Springer, Heidelberg; 2010; first softcover printing of the 1997 original [MR1460488]; Available from: http://dx.doi.org/10.1007/978-3-642-04295-9.
•  Kleinberg J. An impossibility theorem for clustering. Advances in neural information processing systems. 2003;463–470.
•  Meilǎ M. Comparing clusterings: an axiomatic view. In Proceedings of the 22nd international conference on Machine learning. ACM; 2005. p. 577–584.
•  Arslan O, Guralnik DP, Koditschek DE. Hierarchically clustered navigation of distinct euclidean particles. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE; 2012. p. 946–953.
•  Arslan O, Guralnik DP, Koditschek DE. Navigation of distinct euclidean particles via hierarchical clustering. In 2014 The Eleventh International Workshop on the Algorithmic Foundations of Robotics (WAFR2014); August; 2014.
•  Zhu D, Guralnik DP, Wang X, Li X, Moran B. Statistical properties of the single linkage hierarchical clustering. arXiv preprint arXiv:1511.07715. 2015. Available from: http://arxiv.org/abs/1511.07715.
•  Berger JO, Liseo B, Wolpert RL, et al. Integrated likelihood methods for eliminating nuisance parameters. Statistical Science. 1999;14(1):1–28.
•  Barvinok AI. Computing the volume, counting integral points, and exponential sums. Discrete & Computational Geometry. 1993;10(1):123–141.
•  Barvinok A, Hartigan J. Maximum entropy Gaussian approximations for the number of integer points and volumes of polytopes. Advances in Applied Mathematics. 2010;45(2):252–289.
•  Lasserre JB, Zeron ES. A Laplace transform algorithm for the volume of a convex polytope. Journal of the ACM (JACM). 2001;48(6):1126–1140.
•  MacKay DJ. Introduction to Monte Carlo methods. In Learning in graphical models. Springer; 1998. p. 175–204.
•  Ng KW, Tian GL, Tang ML. Dirichlet and related distributions: Theory, methods and applications. Vol. 888. John Wiley & Sons; 2011.
•  Chib S, Greenberg E. Understanding the Metropolis-Hastings algorithm. The American Statistician. 1995;49(4):327–335.
•  Osipov V, Sanders P, Singler J. The filter-kruskal minimum spanning tree algorithm. In ALENEX. SIAM; 2009. p. 52–61; Available from: http://dblp.uni-trier.de/db/conf/alenex/alenex2009.html#OsipovSS09.