optimalFlow: Optimal-transport approach to flow cytometry gating and population matching

07/18/2019 ∙ by Eustasio del Barrio, et al. ∙ 0

Data used in Flow Cytometry present pronounced variability due to biological and technical reasons. Biological variability is a well known phenomenon produced by measurements on different individuals, with different characteristics such as age, sex, etc... The use of different settings for measurement, the variation of the conditions during experiments or the different types of flow cytometers are some of the technical sources of variability. This high variability makes difficult the use of supervised machine learning for identification of cell populations. We propose optimalFlowTemplates, based on a similarity distance and Wasserstein barycenters, which clusterizes cytometries and produces prototype cytometries for the different groups. We show that supervised learning restricted to the new groups performs better than the same techniques applied to the whole collection. We also present optimalFlowClassification, which uses a database of gated cytometries and optimalFlowTemplates to assign cell types to a new cytometry. We show that this procedure can outperform state of the art techniques in the proposed datasets. Our code and data are freely available as R packages at https://github.com/HristoInouzhe/optimalFlow and https://github.com/HristoInouzhe/optimalFlowData.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Flow cytometry (FC) works with ‘high-dimensional quantitative measurement of light scatter and fluorescence emission properties of hundreds of thousands of individual cells in each analysed sample’ (see [1]

). These quantitative measurements allow to analyze and classify individual cells providing diverse applications. For example, as mentioned in

[29], ‘flow cytometry is used to identify and quantify populations of immune cells’ in order to monitor the immune state of patients or to detect relevant biomarkers by comparing flow cytometries from different patient groups.

A main component in FC is gating, the assignment of individual cells (data records) into discrete cell types. Manual gating, i.e., an expert assigning cell types (labels) to individual cells, using a set of rules on one or two-dimensional projections, has been the prevalent option. However, this manual approach has some shortcomings. First, it is subjective since it depends on the expertise of the user, on the sequence of markers (measured variables) used to do the projections and on the locations of the gates on those projections. Second, it can be very time consuming because it is ‘roughly quadratic in the number of markers’ (see [23]). Third, the recent increase in the number of markers and number of cells per cytometry makes human error a relevant factor.

In order to avoid some of the difficulties related to manual gating there have been different approaches to automated gating. Some are unsupervised, therefore, there is no use of previously gated cytometries. Hence, gating is done through a clustering procedure. We present a small selection of such unsupervised automated gating procedures. FLOCK [28]

, which dose grid-based density estimation (with merging) and then k-means; FLAME

[27]

, which performs skew

model based clustering and flowClust [24, 25], which does robust based clustering through mixture models with Box-Cox transformation. Other related clustering procedures are: flowPeaks [19]

performs Gaussian mixture model based clustering (with modified covariances) and merging and flowMeans

[2] does k-means with initialization via mode detection through kernel density based estimation. More information about state of the art methods can be found in [1, 29].

Accuracy of cell type assignation can be improved using supervised machine learning where historical information is contained in previously gated cytometries (manually or otherwise). Recently, some methods have been produced addressing this problem. In [23]

, DeepCyTOF was introduced, essentially combining de-noising, deep-learning algorithms and domain adaptation. In

[26]

, flowLearn was introduced, combining density features of the data, manually selected gating thresholds and derivative-based density alignments. We stress that other more classical approaches for supervised learning are also available. For example, random forest algorithms, support vector machines or quadratic discriminant analysis can be used when learning from some previously gated cytometry. Supervised machine learning is a well documented topic and for more detailed explanations we refer to

[3].

There are two main set-ups for using supervised learning. First, the classical one, where there is an available data base of historical information. In the FC context, this means that a collection of gated flow cytometries is available and we want to use this information in order to gate a new cytometry. Second, an alternative one, where we have a collection of ungated cytometries and it is possible to gate some of them and use these gated cytometries to classify the rest of the cytometries in the collection.

In both set-ups there is a fundamental problem intrinsic to FC. That is, flow cytometry data has considerable technical and biological variability. Biological variability appears due to intrinsic differences between individuals such as illness, age, sex, etc… Technical variability can appear due to the use of different settings for the measurements, due to the variations of the conditions during the experiments or due to the use of different types of flow cytometers. Precisely this high variability makes the use of supervised methods a hard task.

In this work we provide novel methods for grouping (clustering) gated cytometries. By clustering a set of cytometries we are producing groups (clusters) of cytometries that have lower variability than the whole collection. This in turn allows to improve greatly the performance of any supervised learning procedure. We provide evidence of this below. Once we have a partition (clustering) of a collection of cytometries, we provide several methods for obtaining an artificial cytometry (prototype, template) that represents in some optimal way the cytometries in each respective group. These prototypes can be used, among other things, for matching populations between different cytometries as suggested in [7, 20]. Even more, a procedure able to group similar cytometries could help to detect individuals with a common particular condition.

optimalFlowTemplates is our procedure for clustering cytometries and obtaining templates. It is based on recent developments in the field of optimal transport such as a similarity distance between clusterings, introduced in [12], and a barycenter (Frechet mean, see [22, 10]) and k-barycenters (see [4, 8, 5]

) of probability distributions.

We introduce a supervised classification tool, optimalFlowClassification, for the case when a database of gated cytometries is available. The procedure uses the prototypes obtained by optimalFlowTemplates on the database. These are used to initialise tclust, a robust extension of k-means that allows for non-spherical shapes, for gating a new cytometry (see [18], not to be confused with TCLUST, [14]). By using a similarity distance between the best clustering obtained by tclust and the artificial cytometries provided by optimalFlowTemplates we can assign the new cytometry to the most similar template (and the respective group of cytometries). We provide several options of how to assign cell types to the new cytometry using the most relevant information, represented by the assigned template and the respective cluster of cytometries.

2 Methods

We start with the mathematical treatment of flow cytometry data. We can view a gated flow cytometry, say , as a collection of multidimensional points with their associated labels (cell types or group labels) forming a set of different labels. Hence, a gated cytometry can be described as where and . Alternatively we could describe it as a partition (clustering) of all into groups (clusters) formed by points sharing the same labels. That is, where is a cluster and is a weight associated with label . A third useful description is to view a gated cytometry as a clustering but coming from a mixture of location-scatter multivariate distributions. With some abuse of notation where are the multivariate mean and covariance of the points in cluster .

2.1 optimalFlowTemplates

Due to the the high variability in flow cytometry data we should expect that learning form different elements in a database should produce significantly different results on the classification of a new cytometry . Our approach is to search for clusters of existing cytometries in the database. In this way we pursuit for a notable reduction of variability thus allowing a good representation of the cytometries in each of these groups through prototypic cytometries. Therefore, using a prototype of a group for learning should produce a similar result for classifying to the one obtained when using any other cytometry in the same group.

2.1.1 Clustering cytometries

Since gated cytometries can be viewed as partitions (clusterings) and we want to clusterize cytometries in order to reduce variability, we want to do clustering of clusterings, also known as metaclustering. The methodology we will develop in this work is to use some meaningful distance between partitions and then apply hierarchical clustering methods. As a distance between clusterings we propose to use the similarity distance introduced in

[12]. It is based on two auxiliary distances. The optimal transport distance between two partitions and , defined as

where is a distance between clusters and .

are the solutions of the optimal transport linear program

measures the cost of the optimal way of transforming one partition into the other. For more detailed explanations on optimal transport see Supplementary Material Section A. The other auxiliary distance is the naive transport distance, defined as

It measures the cost of naively transforming one partition into the other.

The similarity distance is defined as the quotient

(2)

We recall that , where means that partitions are represented by the same clusters with the same weights and means that every cluster in is transported proportionally to every cluster in . Therefore, values of close to 0 can be interpreted as high similarity between clusterings, and values of close to 1 can be interpreted as very dissimilar clusterings.

Input: ,

1:for  do
2:     while  and enough for covariance estimation do
3:         ;
4:         if  then
5:              
6:         else
7:              
8:         end if
9:         
10:     end while
11:end for
12:for  do
13:     for  do
14:         
15:     end for
16:end for
17: hierarchical clustering with
18:for  do
19:      template obtention on cytometries in
20:end for
21:

Output:

Algorithm 1 optimalFlowTemplates

In order to completely define , we need to specify a distance between clusters. Our choice is to use the well known Wasserstein distance (see Supplementary Material Section A) so

(3)

In essence, we are treating clusters as multivariate normal distributions,

and , with means and covariances calculated from the clusters. Our choice of the Wasserstein distance is based on the desire to account for the spatial shapes of the clusters and to obtain templates for the groups of cytometries. We stress that all results in this work are also valid when understanding clusters as members of a location-scatter family. Another interesting measure for cluster difference is, , the (entropy) regularized Wasserstein distance (see Supplementary material Section A) where clusters are understood as empirical distributions.

However, any other dissimilarity measure can be used, for example the symmetric Kullback-Leibler was used in [7]

or Friedman-Rafsky test statistic was used in

[20], in the context of cluster comparison in flow cytometry. When we see clusters as collections of points, the Adjusted Rand Index, the Jaccard distance or other similar can be used, at the expense of loosing spatial information.

The clustering of cytometries is presented in lines 1-17 in Algorithm 1, resulting in a partition . Lines 12-16 are concerned with the obtention of a distances matrix that in line 17 is used to perform hierarchical clustering. Classical agglomerative algorithms can be used, but also density based algorithms as DBSCAN and HDBSCAN.

2.1.2 Template obtention through consensus clustering

Once we have a partition, , of the collection of cytometries , we want to obtain a prototype cytometry, , for every group of cytometries, , in the partition (lines 18-21 in Algorithm 1). To address this goal we resort to k-barycenters using Wasserstein distance, which provide a suitable tool for consensus on probability distributions (see Supplementary Material Section A for more details). We propose three different methods on how to obtain a template cytometry from a group of cytometries. That is, on how to do consensus (ensemble) clustering on flow cytometries. More on this topic can be found in Section B in the Supplementary material.

Pooling: Suppose that . This is the case for a set of gated cytometries with identified cell populations.

Input: ,

1:for  do
2:      set of all clusters associated with label for the cytometries in group .
3:     if  then
4:          take 1-barycenter of the clusters in viewed as multivariate normals.
5:     else
6:          is empty
7:     end if
8:end for
9:

Output:

Density based hierarchical clustering:

Input: ,

1: set formed by every cluster of every cytometry in group .
2:for  do
3:     
4:end for
5: density based hierarchical clustering based on .
6:for  do
7:      barycenter of elements with label in .
8:end for
9:

Output:

k-barycenter:

Input: , ,

1: set formed by every cluster of every cytometry in group .
2: -barycenter of the elements in .

Output:

The intuition behind pooling, once we have groups of similar cytometries and cell types are known, is the following. A prototype of a cell type is the 1-barycenter (barycenter), a consensus representation, of the clusters (multivariate distributions) representing the same cell type in the cytometries that are members of the same group in . A prototype cytometry is the collection of prototypes of each cell type.

However, our templates could be obtained even when we have gated cytometries but without cell type identification between them. Density based hierarchical clustering and k-barycenter are based on the idea that clusters that are close in Wasserstein distance should be understood as representing the same cell type, although we may not know which cell type. When using k-barycenters we have to specify the number of cell types, , that we want for the artificial cytometry. However, when using density based hierarchical clustering as HDBSCAN (see [11]) or DBSCAN (see [15]) the selection of the number of cell types for the prototype cytometry is automatic. Recall that both k-barycenters, through trimming, and density based hierarchical clustering are robust clustering procedures.

2.2 optimalFlowClassification

Our goal is to do supervised classification, i.e., assign cell types to a new cytometry , using the information given in a database of gated cytometries . The different sources of variability, mainly those of technical nature and those which are properly due to different cohorts present in the database, advise to search for different cytometric structures. Hence, we should assign to the group of cytometries that is more similar to it and then use supervised techniques. Indeed, this is the purpose of optimalFlowClassification, as shown in Algorithm 2. As an input we apply optimalFlowTemplates to the database in order to obtain the partition and the templates .

Input: , ,

1:for  do
2:      tclust on initialized with
3:end for
4: of tclust objective function over all
5:for  do
6:     
7:end for
8:;
9: labelling of using transfer labelling or supervised classification based on or .

Output:

Algorithm 2 optimalFlowClassification

Lines 1-4 in Algorithm 2 are dedicated to finding an unsupervised partition of the new cytometry

using as initialization for tclust the prototypes of the database. We favour the use of tclust over k-means since it allows for non-spherical clusters and for trimming, making partitions more robust to outliers. Additional information about tclust is given in Supplementary Material Section

C. Initializing with the database entries attempts to use optimally the available information. Hence, if is similar to some of the cytometries in the database, supervised initialization should be advantageous. However, some other suitable unsupervised initializations can be used, as the ones proposed in FLOCK, flowPeaks or flowMeans. We need to cluster in order to compare it with the template cytometries.

In lines 5-8 we look to assign , using the clustering , produced in the previous step, to the template that is closest in similarity distance to . With this we hope to use only the most relevant information of the database, summarized in and .

The last step in algorithm 2, line 9, is concerned with assigning cell types to . To do this we have several options. We can try to relabel in an optimal way using or , i.e, do label transfer. Alternatively, we can use to do Quadratic Discriminant Analysis (QDA) or we can find the most similar partition in similarity distance (2) from to and use it to do QDA or random forest classification. In short, we can do label transfer or supervised classification.

For supervised classification we use standard tools, random forest and QDA, however, other methods can be used in a straightforward fashion. We remark that when using QDA and , we are using non-linear multidimensional gating regions obtained from in order to classify . This can be taught as an extension of the method presented in [26] where only linear one-dimensional regions are used. Another interesting fact is that the use of allows us to select the most similar real cytometry to , hence supervised tools should be more effective.

The problem of relabelling a clustering with respect to another clustering is usually stated as a weighted bipartite matching problem, where weights are related to the similarity between clusters in the two partitions. This problem can be solved by the hungarian method [21]. Generalized edge cover is another possible solution to relabelling (see [7]).

Additionally we introduce an approach to obtain a fuzzy relabelling based on solving the optimal transport linear program (2.1.1). The solution, , is the base for this fuzzy relabelling. We define the score of cluster in to come from cluster in as . In words, is the proportion of probability coming from cluster , with respect to the probability in cluster , that arrives at cluster . Clearly, , and the closer to 1 the score is the more evidence we have that cluster and represent the same cluster. A fuzzy relabelling for cluster in is the collection of all the scores . A variation of the previous score is , where we are weighting by the proportion of cluster that goes to cluster , with respect to the probability contained in cluster . In this way we down-weight the effect of a small proportion of a big cluster with respect to a big proportion of a small cluster arriving to . From these fuzzy relabellings a hard relabelling can be easily obtained.

Again, a suitable distance between clusters can be the Wasserstein distance as in (3). However, another possibility is to use

(4)

(3) is computationally very efficient but does not allow to label very small clusters in .(4) does allow labelling small clusters in , at the price of using sub-sampling to compare bigger clusters (for example more than 10000 points).

3 Results

In this section we present several experiments and comparisons of our methods with other state of the art procedures on a real database. In Figure 1 we provide visual intuition of what a cytometry looks like when understood as a mixture of normal distributions. We also provide an example of a prototype cytometry. In Figure 2 we provide comparisons between the result of optimalFlowTemplates, flowMatch and what can be considered a ground truth. In Table 1 we provide comparisons between state of the art unsupervised gating as given by flowMeans and our proposal for initializing tclust with supervised information. We also provide comparisons between the state of the art supervised method deepCyTOF and our own supervised procedure optimalFlowClassification.

3.1 Data

Our database is formed by 21 gated flow cytometries, , obtained following the Euroflow protocols, kindly provided by Centro de Investigación del Cancer (CIC) in Salamanca, Spain. All 21 cytometries have been obtained in a BD FACSCanto flow cytometer but in three different centres. The size of the cytometry datasets vary from 50,000 cells to 254,450 cells. The samples are from adult male and female individuals, with a varied range of ages, that have been diagnosed as healthy. More information about the data set can be found in Table 2 in the Supplementary material.

Clearly, there is biological variability, since there are different individuals with different ages and other different characteristics. Moreover, we have technical variability since we have different centres, different dates of measurement and different incubation times. However, we remark that all individuals belong to the same class of healthy people.

3.2 Measures of performance

We need appropriate measures of the performance of the different automated gating procedures that appear in this work. We recall that we use both unsupervised and supervised methods. In this set-up an appropriate tool is the F-measure statistic which has been used in [1, 2, 19, 23]. With our notation we have

with . We make the convention and . Another appealing measure is the median F-measure used in [26] specifically for supervised learning. The formal definition is

where is the considered ground truth, in our case a manual gating, and is another classification of the same data.

To measure how similar are two cytometries, i.e., how well we do when learning from one to classify the other and how well we do when learning with the later to classify the former we introduce the following distance.

where is the partition resulting from the classification of the data in using a random forest learned in . is the partition resulting from the classification of the data in using a random forest learned in . This measure gives us a notion of how close in terms of being good predictors for one another are two cytometries. We have that , and two cytometries are interchangeable for learning if is close to 0. A variation of this measure is

3.3 Clustering cytometries and template obtention

Figure 1: Representation of five cell types. Top-left: as a collection of points. Top-right: as a mixture of gaussians. Bottom-left: Pooling of and . Bottom-right: barycenters of the cell types in the pooling.

We start with a visual example to grasp the intuition behind our approach to flow cytometry. Top left in Figure 1 we have a three dimensional projection of five cell types present in : Basophils (black), CD4+CD8- (red), Eosinophils (green), Monocyts (Blue), Neutrophils (Cyan). From the picture we can view each cell type as a cluster, hence having , omitting weight information for simplicity, where corresponds to Basophils, to CD4+CD8- and so on…On the other hand, top right in Figure 1 we have a representation of the same groups when we understand them as multivariate normals, that is where are the multivariate mean and covariance of the points gated as Basophils, of points gated as CD4+CD8-… It becomes apparent that the multivariate geometrical information is a reasonable representation of the cytometry, even more, it allows to obtain templates in a computationally efficient way.

Suppose that we have a database, which is a subset of 15 cytometries, given by . We also have a partition of the dabase, , where one group is . If we plot together the five cell types for all three cytometries in we get the representation bottom left in Figure 1. A prototype for cluster , obtained by pooling as indicated in Section 2.1.2, is represented bottom right in Figure 1. We see that the artificial cytometry (template, prototype) gives a nice representation of the information contained in bottom left of Figure 1. Even more, this artificial cytometry is quite similar to any of the original cytometries in .

Figure 2: Hierarchical trees for the database . Top-left: result of optimalFlowTemplates. Top-right: result of flowMatch with Mahalanobis distance. Bottom-left: single linkage with . Bottom-right: single linkage with .

Once we have seen the intuitive meaning of clustering cytometries and the obtention of templates, we want to compare different methods to cluster the database . We use for a ground truth the simple linkage hierarchical clusterings obtained using , bottom right of Figure 2, and using , bottom left of Figure 2. For a state of the art comparison, we use flowMatch, described in [7], using Mahalanobis distance, depicted top right in Figure 2. The clustering obtained by optimalFlowTemplates when using single linkage hierarchical clustering is shown top left in Figure 2. More comparisons can be seen in Figure 4 in the Supplementary material. At a first glance it is clear that the results form optimalFlowTemplates are much more similar to the ground truth than those of flowMatch. This should be interpreted as the fact that optimalFlowTemlates captures more accurately the similarity between cytometries than flowMatch. Two additional facts should be stated: first, the similarity distance is independent of parameters, something that is not the case for the generalized edge cover distance used in flowMatch. Second, optimalFlowTemplates produces templates only at one stage, once the number of clusters is determined, while flowMatch produces templates at every stage of the hierarchical clustering procedure.

3.4 Gating and classification

Figure 3: Result of optimalFlowTemplates on the databse after gating each cytometry with flowMeans.

We will apply optimalFlowTemplates+optimalFlowClassification to the database introduced in the previous section. We will use as a test set . For the cytometries in , we also perform and unsupervised gating given by flowMeans and an unsupervised procedure given by tclust initialized with the templates obtained by optimalFlowTemplates.

Results can be seen in columns 3-5 of Table 1. In Table 3 and 4 in the Supplementary material we have a full description of the results of optimalFlowClassification. We see that tclust initialized with optimalFlowTemplates is competitive with flowMeans, but more importantly, optimalFlowTemplates+optimalFlowClassification is superior in every of the test cytometries, giving 5 form 6 F-measures higher than 0.95 and the other higher than 0.92. Clearly our supervised procedure is working well and, as expected, is giving better performance than state of the art unsupervised alternatives.

However, we also want to compare with a state of the art supervised procedure. In this case we will use deepCyTOF, with some bug corrections and some adaptations to our setting of the github version, implemented in Python with tensorflow 0.12 and keras 1.2.2. In order to use deepCyTOF we need cytometries with the same number and type of cell types so we use a data set

, where we have eliminated one group from each cytometry. We recall that deepCyTOF only uses the supervised information of one of the cytometries in to classify all other members. We see the results of deepCyTOF, with domain adaptation and without de-noising, since all entries are classified, in column 1 of Table 1. DeepCyTOF’s performance is rather poor, achieving worst F-measure than flowMeans in 6 of the 9 cases and also for all applicable cases (cytometries 1,18,20) than optimalFlowTemplates+optimalFlowClassification.

However, these poor results are due to the high variability of the cytometries that can not be accommodated by the domain adaptation procedure of deepCyTOF. Hence if we were able to reduce this variability, deepCyTOF should give better results. Indeed, if we use flowMeans to gate the cytometries in , and then we use optimalFlowTemplates, we obtain the hierarchical three presented in Figure 3. It suggests to split into and . We recall that until now we have not used supervised information. Applying deepCyTOF to and we obtain the results in column 2 of Table 1. Now deepCyTOF performs better than flowMeans in 7 of the 9 cases, however it is better than optimalFlowTemplates+optimalFlowClassification only for Cytometry 1, which is the one that deepCyTOF uses for learning in .

DeepCyTOF DeepCyTOF 2 flowMeans tclust optimalFlowC
Citometry 1 0.9641 0.9857 0.9501 0.9504 0.9740
Citometry 2 0.9420 0.9585 0.8988
Citometry 5 0.8728 0.8720 0.8977
Citometry 6 0.9195 0.9335 0.9522
Citometry 7 0.8763 0.8062 0.9508
Citometry 10 0.8610 0.8141 0.9595
Citometry 11 0.8653 0.9170 0.9256
Citometry 14 0.9825 0.9295 0.9004
Citometry 17 0.6982 0.9816 0.8978
Citometry 18 0.6840 0.9797 0.9003 0.8716 0.9853
Citometry 20 0.6884 0.9760 0.9223 0.8661 0.9834
Citometry 21 0.6942 0.9699 0.9269
Table 1: Table of F-measure statistics. DeepCyTOF: results of deepCyTOF on . DeepCyTOF 2: results of deepCyTOF on and . flowMeans: results of flowMeans. tclust: results of optimalFlowTemplates initialized tclust on . optimalFlowC: results ot optimalFlowClassification on .

Supplementary material

Appendix A Notions on optimal transport

Following [33], let us take , the space of probability distributions on . For in , let us define the set of all probability measures on with first marginal and second marginal . The optimal transport cost between the two measures is defined as

(5)

where is the cost of transporting one unit of mass from to . A probability that achieves the minimum in (5

) is called an optimal coupling, with an associated random variable

that has joint distribution

. When and are discrete, i.e., and , with , the optimal transport problem can be solved as the folowing linear program (see [9])

are the solutions of the optimal transport linear program

For a Polish metric space and , the Wasserstein distance between and is defined as

where refers to the law of .

We present the entropy regularized Wasserstein distance, since it is strictly convex and there are efficient solutions based on the Sinkhorn algorithm (see [13]). For a fixed the regularized Wasserstein distance is defined as

where are the solutions of the optimal transport linear program

Let us denote the set of probability measures on

with finite second moment and let us take

for . In [8] the notions of -barycenter and trimmed -barycenter were introduced, building on the concept of Wasserstein barycenter introduced in [10, 22]. A -barycenter of probabilities in with weights is any k-set in such that for any we have that

(6)

An -trimmed -barycenter of with weights as before is any k-set with weights such that

(7)

where .

Broadly speaking k-barycenters can be thought of as an extension of k-means to the space of probabilities with finite second order, since we can rewrite (6) as

(8)

where is a partition of and is the barycenter of the elements in . Therefore, trimmed k-barycenters may be matched to trimmed k-means. As stated in [8], efficient computations can be done when dealing with location-scatter families of absolutely continuous distributions in

. A notable example being the family of multivariate Gaussian distributions.

Appendix B Consensus clustering

Consensus clustering is the problem of combining different partitions. These partitions (clusterings), can come from different clustering algorithms applied to the same data, the same algorithm looking for different number of clusters or with a different initialization or a combination of the previous. A possible approach to the problem is to use some type of relabelling of the original partitions and an optimization procedure to extract a consensus clustering that is the closest in some sense to the relabelled partitions [6, 32]. Other alternatives comprising cluster-based similarity partitioning algorithm, hypergraph partitioning algorithm and meta-clustering algorithm (clustering of partitions) were proposed in [31]. Consensus clustering results in a gain in robustness and in the possibility of parallelization and reuse of information.

A shortcoming of the previous procedures is the difficulty to handle a combination of clusterings of different data. This shortcoming can be addressed when clusters in a partition can be viewed as a distribution. This is the case when mixture models are used for clustering. Also, in many cases we can represent clusters, obtained by some clustering procedure, by their means and covariance matrices. Even more, some fields work with data represented as probability distributions.

In this setting, we would have several partitions, coming from the same or from different data, where ,…, are subsets of some space of probability measures. A natural idea is to pool together all the distributions, obtaining the data set with , and to do clustering in the abstract space of probabilities. The resulting partition of can be used to obtain a set of distributions that will represent the consensus clustering.

Let us fix that . A particular implementation of the previous strategy was presented in [8] where the consensus cluster is taken to be the (trimmed) -barycenter as in ((7))(8). Here we present some alternative but related strategies to obtain a consensus clustering. We can define the distance matrix , where , and use hierarchical clustering to obtain a partition of . Then taking the barycenter (1-barycenter) of the elements in each we obtain the consensus clustering . With hierarchical clustering we mean single linkage, complete linkage or average linkage, but also density based hierarchical clustering as DBSCAN [15] or HDBSCAN [11]. The former are able to select automatically the appropriate number of clusters, which can be seen as a desirable quality, even more, they are also robust as the trimmed -barycenter.

Appendix C Initializing an unsupervised procedure

The unsupervised clustering procedure we have selected is tclust, introduced in [18], which is a robust model based clustering procedure that allows for non spherical clusters. Nontheless, it is possible to use any other unsupervised procedure that allows an initialization with a clustering defined by probability distributions. For example, this is the case for the popular mclust [16, 30], a finite Gaussian mixture model based clustering solved by an EM-algorithm.

tclust searches for a partition of , with , vectors , positive definite matrices and weights that approximately maximize the pseudo-likelihood

(9)

under restrictions over the scatter matrices . By we denote the density function of the multivariate normal . is the cluster of trimmed observations, where the trimming level is .

The details of the algorithm can be found in [17]. For us it is relevant to recall only the initialization step, i.e, to provide an initial . Hence, to initialize tclust we only need a set of weights with corresponding means and covariances.

Participant Final diagnosis Type of tested sample Type of coagulant (preservation) Sex Age (years) Incubation period Type of Flow Cytometer
Centre 1 HD PB EDTA M 53 30 min BD FACSCanto
Centre 1 HD PB EDTA M 50 30 min BD FACSCanto
Centre 1 HD PB EDTA M 61 30 min BD FACSCanto
Centre 2 HD PB Heparin M 29 30 min BD FACSCanto
Centre 2 HD PB Heparin M 38 30 min BD FACSCanto
Centre 2 HD PB Heparin F 27 30 min BD FACSCanto
Centre 2 HD PB Heparin F NA 30 min BD FACSCanto
Centre 2 HD PB Heparin M NA 30 min BD FACSCanto
Centre 2 HD PB Heparin F NA 30 min BD FACSCanto
Centre 2 HD PB Heparin F NA 30 min BD FACSCanto
Centre 3 HD PB NA M 34 15 min BD FACSCanto
Centre 3 HD PB NA F 33 15 min BD FACSCanto
Centre 3 HD PB NA M 32 15 min BD FACSCanto
Centre 3 HD PB NA M 33 15 min BD FACSCanto
Centre 3 HD PB NA F 35 15 min BD FACSCanto
Centre 3 HD PB EDTA NA Adult 15 min BD FACSCanto
Centre 3 HD PB EDTA NA Adult 15 min BD FACSCanto
Centre 3 HD PB EDTA NA Adult 15 min BD FACSCanto
Centre 3 HD PB EDTA NA Adult 15 min BD FACSCanto
Centre 3 HD PB EDTA NA Adult 15 min BD FACSCanto
Table 2: Detailed information about the participants and the measurments for 20 of the 21 cytometries.
Figure 4: Hierarchical trees for the database . Top-left: result of optimalFlowTemplates. Top-right: result of flowMatch with symmetric Kulback-Leibler divergence. Bottom-left: complete linkage with . Bottom-right: complete linkage with .
Cytometry 1
Cytometries clustering Templates formation Method 1 Method 2 Method 3 Method 4 Method 5
complete, k = 5 pooling 0.9064 0.9339 0.8992 0.9196 0.8521
HDBSCAN pooling 0.9064 0.9323 0.8992 0.9196 0.8521
complete, k = 5 HDBSCAN 0.0000 0.9398 0.8825 0.3542 0.7429
HDBSCAN HDBSCAN 0.0000 0.9111 0.7235 0.5263 0.9186
complete, k = 5 37-barycenter, alpha = 0.05 0.0000 0.9498 0.9512 0.2298 0.8707
HDBSCAN 37-barycenter, alpha = 0.05 0.0000 0.9485 0.7755 0.6846 0.5988
Ctometry 6
Cytometries clustering Templates formation Method 1 Method 2 Method 3 Method 4 Method 5
complete, k = 5 pooling 0.7692 0.8524 0.7613 0.8373 0.8373
HDBSCAN pooling 0.8787 0.8571 0.7762 0.8427 0.8427
complete, k = 5 HDBSCAN 0.7212 0.9265 0.8398 0.0653 0.8163
HDBSCAN HDBSCAN 0.8270 0.9276 0.8399 0.7924 0.8034
complete, k = 5 37-barycenter, alpha = 0.05 0.0000 0.9184 0.8399 0.8110 0.7572
HDBSCAN 37-barycenter, alpha = 0.05 0.1468 0.9112 0.7849 0.6314 0.6314
Citometry 10
Cytometries clustering Templates formation Method 1 Method 2 Method 3 Method 4 Method 5
complete, k = 5 pooling 0.9525 0.9431 0.9440 0.9306 0.9306
HDBSCAN pooling 0.9525 0.9421 0.9440 0.9306 0.9306
complete, k = 5 HDBSCAN 0.9009 0.9451 0.9445 0.0739 0.8181
HDBSCAN HDBSCAN 0.9009 0.9423 0.9445 0.0739 0.8181
complete, k = 5 37-barycenter, alpha = 0.05 0.0000 0.9345 0.9407 0.9358 0.1163
HDBSCAN 37-barycenter, alpha = 0.05 0.0000 0.6694 0.6325 0.7901 0.1610
Table 3: Median -measure for and for different combinations of our procedures. Cytometries clustering indicates the method used for clustering in optimalFlowTemplates. Templates formation indicates the method used for obtaining templates for the clusters of cytometries. Method 1 refers to using the best template and QDA. Method 2 refers to using random forest with the cytometry closest in similarity distance in the assigned group. Method 3 is like Method 2 but using QDA. Method 4 is label matching between the best template and the best tclust result. Method 5 is label matching through a vote between label matchings of every cytometry in the best group.
Citometry 11
Cytometries clustering Templates formation Method 1 Method 2 Method 3 Method 4 Method 5
complete, k = 5 pooling 0.9069 0.9506 0.9336 0.9378 0.9363
HDBSCAN pooling 0.9069 0.9496 0.9336 0.9378 0.9363
complete, k = 5 HDBSCAN 0.4304 0.7500 0.6864 0.1264 0.5928
HDBSCAN HDBSCAN 0.4304 0.7556 0.6864 0.1264 0.5928
complete, k = 5 37-barycenter, alpha = 0.05 0.6319 0.9383 0.9116 0.6206 0.9095
HDBSCAN 37-barycenter, alpha = 0.05 0.0000 0.9494 0.9116 0.7151 0.8975
Citometry 18
Cytometries clustering Templates formation Method 1 Method 2 Method 3 Method 4 Method 5
complete, k = 5 pooling 0.9088 0.9611 0.9038 0.6433 0.3232
HDBSCAN pooling 0.8465 0.8585 0.7045 0.7901 0.4740
complete, k = 5 HDBSCAN 0.9043 0.9618 0.9171 0.6415 0.7912
HDBSCAN HDBSCAN 0.9043 0.9617