1 Introduction
Flow cytometry (FC) works with ‘high-dimensional quantitative measurement of light scatter and fluorescence emission properties of hundreds of thousands of individual cells in each analysed sample’ (see [1]). These quantitative measurements allow us to analyze and classify individual cells, enabling diverse applications. For example, as mentioned in [29], ‘flow cytometry is used to identify and quantify populations of immune cells’ in order to monitor the immune state of patients or to detect relevant biomarkers by comparing flow cytometries from different patient groups.
A main component in FC is gating, the assignment of individual cells (data records) to discrete cell types. Manual gating, i.e., an expert assigning cell types (labels) to individual cells using a set of rules on one- or two-dimensional projections, has been the prevalent option. However, this manual approach has some shortcomings. First, it is subjective, since it depends on the expertise of the user, on the sequence of markers (measured variables) used to produce the projections and on the locations of the gates on those projections. Second, it can be very time consuming, because it is ‘roughly quadratic in the number of markers’ (see [23]). Third, the recent increase in the number of markers and the number of cells per cytometry makes human error a relevant factor.
In order to avoid some of the difficulties related to manual gating, there have been different approaches to automated gating. Some are unsupervised and therefore make no use of previously gated cytometries; gating is done through a clustering procedure. We present a small selection of such unsupervised automated gating procedures: FLOCK [28], which does grid-based density estimation (with merging) and then k-means; FLAME [27], which performs skew-t model based clustering; and flowClust [24, 25], which does robust model-based clustering through mixture models with a Box-Cox transformation. Other related clustering procedures are flowPeaks [19], which performs Gaussian mixture model based clustering (with modified covariances) and merging, and flowMeans [2], which does k-means with initialization via mode detection through kernel density estimation. More information about state of the art methods can be found in [1, 29].
Accuracy of cell type assignment can be improved using supervised machine learning, where historical information is contained in previously gated cytometries (gated manually or otherwise). Recently, some methods have been produced addressing this problem. In [23], DeepCyTOF was introduced, essentially combining denoising, deep-learning algorithms and domain adaptation. In [26], flowLearn was introduced, combining density features of the data, manually selected gating thresholds and derivative-based density alignments. We stress that other, more classical, approaches for supervised learning are also available. For example, random forests, support vector machines or quadratic discriminant analysis can be used when learning from some previously gated cytometry. Supervised machine learning is a well documented topic and for more detailed explanations we refer to [3].
There are two main setups for using supervised learning. First, the classical one, where a database of historical information is available. In the FC context, this means that a collection of gated flow cytometries is available and we want to use this information in order to gate a new cytometry. Second, an alternative one, where we have a collection of ungated cytometries and it is possible to gate some of them and use these gated cytometries to classify the rest of the cytometries in the collection.
In both setups there is a fundamental problem intrinsic to FC: flow cytometry data has considerable technical and biological variability. Biological variability appears due to intrinsic differences between individuals, such as illness, age or sex. Technical variability can appear due to the use of different settings for the measurements, variations in the conditions during the experiments, or the use of different types of flow cytometers. Precisely this high variability makes the use of supervised methods a hard task.
In this work we provide novel methods for grouping (clustering) gated cytometries. By clustering a set of cytometries we produce groups (clusters) of cytometries that have lower variability than the whole collection. This in turn greatly improves the performance of any supervised learning procedure, as we show below. Once we have a partition (clustering) of a collection of cytometries, we provide several methods for obtaining an artificial cytometry (prototype, template) that represents, in some optimal way, the cytometries in each respective group. These prototypes can be used, among other things, for matching populations between different cytometries, as suggested in [7, 20]. Moreover, a procedure able to group similar cytometries could help to detect individuals with a common particular condition.
optimalFlowTemplates is our procedure for clustering cytometries and obtaining templates. It is based on recent developments in the field of optimal transport, such as a similarity distance between clusterings, introduced in [12], and barycenters (Fréchet means, see [22, 10]) and k-barycenters (see [4, 8, 5]).
We introduce a supervised classification tool, optimalFlowClassification, for the case when a database of gated cytometries is available. The procedure uses the prototypes obtained by optimalFlowTemplates on the database. These are used to initialise tclust, a robust extension of k-means that allows for non-spherical shapes, for gating a new cytometry (see [18], not to be confused with TCLUST, [14]). By using the similarity distance between the best clustering obtained by tclust and the artificial cytometries provided by optimalFlowTemplates, we can assign the new cytometry to the most similar template (and the respective group of cytometries). We provide several options for assigning cell types to the new cytometry using the most relevant information, represented by the assigned template and the respective cluster of cytometries.
2 Methods
We start with the mathematical treatment of flow cytometry data. We can view a gated flow cytometry as a collection of multidimensional points with their associated labels (cell types or group labels), the labels forming a finite set. Hence, a gated cytometry can be described as a set of point-label pairs. Alternatively, we could describe it as a partition (clustering) of all the points into groups (clusters) formed by points sharing the same label, each cluster carrying a weight associated with its label. A third useful description is to view a gated cytometry as a clustering coming from a mixture of location-scatter multivariate distributions, where each component has the multivariate mean and covariance of the points in the corresponding cluster.
2.1 optimalFlowTemplates
Due to the high variability in flow cytometry data, we should expect that learning from different elements of a database will produce significantly different results on the classification of a new cytometry. Our approach is to search for clusters of existing cytometries in the database. In this way we pursue a notable reduction of variability, thus allowing a good representation of the cytometries in each of these groups through prototypic cytometries. Therefore, using a prototype of a group for learning should produce a classification result similar to the one obtained when using any other cytometry in the same group.
2.1.1 Clustering cytometries
Since gated cytometries can be viewed as partitions (clusterings), and we want to cluster cytometries in order to reduce variability, we want to do clustering of clusterings, also known as metaclustering. The methodology we develop in this work is to use some meaningful distance between partitions and then apply hierarchical clustering methods. As a distance between clusterings we propose to use the similarity distance introduced in
[12]. It is based on two auxiliary distances. The first is the optimal transport distance between two partitions, which is defined through a distance between clusters and the weights solving the optimal transport linear program, and measures the cost of the optimal way of transforming one partition into the other. For more detailed explanations on optimal transport see Supplementary Material Section A. The other auxiliary distance is the naive transport distance.
It measures the cost of naively transforming one partition into the other.
The similarity distance is defined as the quotient of the optimal transport distance and the naive transport distance (2).
We recall that the similarity distance takes values in [0, 1], where 0 means that the partitions are represented by the same clusters with the same weights, and 1 means that every cluster in one partition is transported proportionally to every cluster in the other. Therefore, values close to 0 can be interpreted as high similarity between clusterings, and values close to 1 can be interpreted as very dissimilar clusterings.
In order to completely define the similarity distance, we need to specify a distance between clusters. Our choice is the well known Wasserstein distance (see Supplementary Material Section A), so that

(3) W_2^2(N(m_1, Sigma_1), N(m_2, Sigma_2)) = ||m_1 - m_2||^2 + trace(Sigma_1 + Sigma_2 - 2(Sigma_1^{1/2} Sigma_2 Sigma_1^{1/2})^{1/2}).

In essence, we are treating clusters as multivariate normal distributions, with means and covariances calculated from the clusters. Our choice of the Wasserstein distance is based on the desire to account for the spatial shapes of the clusters and to obtain templates for the groups of cytometries. We stress that all results in this work are also valid when understanding clusters as members of a location-scatter family. Another interesting measure of cluster difference is the (entropy) regularized Wasserstein distance (see Supplementary Material Section A), where clusters are understood as empirical distributions. However, any other dissimilarity measure can be used; for example, the symmetric Kullback-Leibler divergence was used in [7]
or the Friedman-Rafsky test statistic was used in [20], in the context of cluster comparison in flow cytometry. When we see clusters as collections of points, the Adjusted Rand Index, the Jaccard distance or other similar indices can be used, at the expense of losing spatial information. The clustering of cytometries is presented in lines 1-17 of Algorithm 1, resulting in a partition of the database. Lines 12-16 are concerned with obtaining a distance matrix that is used in line 17 to perform hierarchical clustering. Classical agglomerative algorithms can be used, but also density based algorithms such as DBSCAN and HDBSCAN.
2.1.2 Template obtention through consensus clustering
Once we have a partition of the collection of cytometries, we want to obtain a prototype cytometry for every group of cytometries in the partition (lines 18-21 in Algorithm 1). To address this goal we resort to k-barycenters with respect to the Wasserstein distance, which provide a suitable tool for consensus on probability distributions (see Supplementary Material Section A for more details). We propose three different methods for obtaining a template cytometry from a group of cytometries, that is, for doing consensus (ensemble) clustering on flow cytometries. More on this topic can be found in Section B in the Supplementary Material.

Pooling: suppose that all cytometries in the group share the same set of cell type labels. This is the case for a set of gated cytometries with identified cell populations.
Input: ,
1: for each cell type label do
2:   collect the set of all clusters associated with that label in the cytometries of the group.
3:   if this set is not empty then
4:     take the 1-barycenter of the clusters in the set, viewed as multivariate normals.
5:   else
6:     the corresponding template component is empty.
7:   end if
8: end for
9: Output: the template formed by the barycenters obtained.

Density based hierarchical clustering:
Input: ,
1: form the set of every cluster of every cytometry in the group.
2: for each pair of clusters do
3:   compute their pairwise Wasserstein distance.
4: end for
5: perform density based hierarchical clustering on the distance matrix.
6: for each resulting label do
7:   take the barycenter of the elements with that label.
8: end for
9: Output: the template formed by the barycenters obtained.

k-barycenter:
Input: , ,
1: form the set of every cluster of every cytometry in the group.
2: compute the k-barycenter of the elements in the set.
Output: the template formed by the k-barycenter.
The intuition behind pooling, once we have groups of similar cytometries and the cell types are known, is the following. A prototype of a cell type is the 1-barycenter (barycenter), a consensus representation, of the clusters (multivariate distributions) representing the same cell type in the cytometries that are members of the same group. A prototype cytometry is the collection of the prototypes of each cell type.
However, our templates can be obtained even when we have gated cytometries without cell type identification between them. Density based hierarchical clustering and k-barycenters are based on the idea that clusters that are close in Wasserstein distance should be understood as representing the same cell type, although we may not know which cell type. When using k-barycenters we have to specify the number of cell types that we want for the artificial cytometry. However, when using density based hierarchical clustering such as HDBSCAN (see [11]) or DBSCAN (see [15]), the selection of the number of cell types for the prototype cytometry is automatic. Recall that both k-barycenters, through trimming, and density based hierarchical clustering are robust clustering procedures.
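The 1-barycenter of a set of Gaussian clusters, used by all three template methods, can be approximated with the standard fixed-point iteration for Wasserstein barycenters of Gaussians. A sketch under the assumption that clusters are summarized by (mean, covariance, weight) triples; the function name and iteration count are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def gaussian_barycenter(means, covs, weights, n_iter=50):
    """Wasserstein barycenter of Gaussians N(means[i], covs[i]) with given weights.
    The barycenter mean is the weighted mean of the means; the covariance is the
    fixed point of S -> S^{-1/2} (sum_i w_i (S^{1/2} S_i S^{1/2})^{1/2})^2 S^{-1/2}."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    m_bar = sum(wi * np.asarray(m) for wi, m in zip(w, means))
    S = np.mean(covs, axis=0)  # any positive definite initialization works
    for _ in range(n_iter):
        S_half = np.real(sqrtm(S))
        S_half_inv = inv(S_half)
        T = sum(wi * np.real(sqrtm(S_half @ Si @ S_half)) for wi, Si in zip(w, covs))
        S = S_half_inv @ T @ T @ S_half_inv
    return m_bar, S
```

For covariances that commute (e.g. all diagonal), the barycenter covariance is simply the square of the weighted mean of the matrix square roots, which gives a quick check of the iteration.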
2.2 optimalFlowClassification
Our goal is to do supervised classification, i.e., to assign cell types to a new cytometry using the information given in a database of gated cytometries. The different sources of variability, mainly those of technical nature and those properly due to the different cohorts present in the database, advise searching for different cytometric structures. Hence, we should assign the new cytometry to the group of cytometries that is most similar to it and then use supervised techniques. Indeed, this is the purpose of optimalFlowClassification, as shown in Algorithm 2. As a first step, we apply optimalFlowTemplates to the database in order to obtain the partition and the templates.
Lines 1-4 in Algorithm 2 are dedicated to finding an unsupervised partition of the new cytometry, using the prototypes of the database as initializations for tclust. We favour the use of tclust over k-means since it allows for non-spherical clusters and for trimming, making partitions more robust to outliers. Additional information about tclust is given in Supplementary Material Section C. Initializing with the database entries attempts to use the available information optimally. Hence, if the new cytometry is similar to some of the cytometries in the database, supervised initialization should be advantageous. However, other suitable unsupervised initializations can be used, such as the ones proposed in FLOCK, flowPeaks or flowMeans. We need to cluster the new cytometry in order to compare it with the template cytometries.
In lines 5-8 we assign the new cytometry, through the clustering produced in the previous step, to the template that is closest to it in similarity distance. With this we hope to use only the most relevant information in the database, summarized in the assigned template and its group of cytometries.
The last step in Algorithm 2, line 9, is concerned with assigning cell types to the new cytometry. To do this we have several options. We can try to relabel its clustering in an optimal way using the assigned template or its group of cytometries, i.e., do label transfer. Alternatively, we can use the assigned template to do Quadratic Discriminant Analysis (QDA), or we can find the most similar partition in similarity distance (2) within the assigned group and use it to do QDA or random forest classification. In short, we can do label transfer or supervised classification.
For supervised classification we use standard tools, random forest and QDA; however, other methods can be used in a straightforward fashion. We remark that, when using QDA with a template, we are using non-linear multidimensional gating regions obtained from it in order to classify the new cytometry. This can be thought of as an extension of the method presented in [26], where only linear one-dimensional regions are used. Another interesting fact is that the similarity distance allows us to select the most similar real cytometry to the new one, hence supervised tools should be more effective.
The problem of relabelling a clustering with respect to another clustering is usually stated as a weighted bipartite matching problem, where weights are related to the similarity between clusters in the two partitions. This problem can be solved by the Hungarian method [21]. Generalized edge cover is another possible approach to relabelling (see [7]).
Additionally, we introduce an approach to obtain a fuzzy relabelling based on solving the optimal transport linear program (2.1.1). The solution, the optimal transport plan, is the basis for this fuzzy relabelling. We define the score of a cluster in the new partition to come from a cluster in the reference partition as the proportion of probability coming from the reference cluster, with respect to the probability in the new cluster, that arrives at the new cluster. Scores lie between 0 and 1, and the closer to 1 a score is, the more evidence we have that the two clusters represent the same population. A fuzzy relabelling for a cluster in the new partition is the collection of all its scores. A variation of the previous score weights it by the proportion of the reference cluster that goes to the new cluster, with respect to the probability contained in the reference cluster. In this way we downweight the effect of a small proportion of a big cluster with respect to a big proportion of a small cluster arriving at the new cluster. From these fuzzy relabellings a hard relabelling can easily be obtained.
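Given the transport plan solving the linear program, the scores above are simple normalizations of its entries. A sketch (our own helper; the row/column conventions are assumptions about how the plan is stored):

```python
import numpy as np

def relabel_scores(plan):
    """Fuzzy relabelling scores from a transport plan between two partitions.
    receive[i, j]: share of target cluster j's mass (columns) received from source cluster i.
    send[i, j]:    share of source cluster i's mass (rows) sent to target cluster j."""
    plan = np.asarray(plan, dtype=float)
    receive = plan / plan.sum(axis=0, keepdims=True)
    send = plan / plan.sum(axis=1, keepdims=True)
    return receive, send
```

A hard relabelling is then obtained by assigning to each target cluster the source label with the highest score.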
Again, a suitable distance between clusters can be the Wasserstein distance as in (3). However, another possibility is to use
(4) 
(3) is computationally very efficient but does not allow labelling very small clusters. (4) does allow labelling small clusters, at the price of using subsampling to compare bigger clusters (for example, those with more than 10,000 points).
3 Results
In this section we present several experiments and comparisons of our methods with other state of the art procedures on a real database. In Figure 1 we provide visual intuition of what a cytometry looks like when understood as a mixture of normal distributions. We also provide an example of a prototype cytometry. In Figure 2 we provide comparisons between the result of optimalFlowTemplates, flowMatch and what can be considered a ground truth. In Table 1 we provide comparisons between state of the art unsupervised gating as given by flowMeans and our proposal for initializing tclust with supervised information. We also provide comparisons between the state of the art supervised method deepCyTOF and our own supervised procedure optimalFlowClassification.
3.1 Data
Our database is formed by 21 gated flow cytometries obtained following the EuroFlow protocols, kindly provided by the Centro de Investigación del Cáncer (CIC) in Salamanca, Spain. All 21 cytometries have been obtained on a BD FACSCanto flow cytometer, but in three different centres. The sizes of the cytometry datasets vary from 50,000 cells to 254,450 cells. The samples are from adult male and female individuals, with a varied range of ages, that have been diagnosed as healthy. More information about the data set can be found in Table 2 in the Supplementary material.
Clearly, there is biological variability, since there are different individuals with different ages and other different characteristics. Moreover, we have technical variability since we have different centres, different dates of measurement and different incubation times. However, we remark that all individuals belong to the same class of healthy people.
3.2 Measures of performance
We need appropriate measures of the performance of the different automated gating procedures that appear in this work. We recall that we use both unsupervised and supervised methods. In this setup an appropriate tool is the F-measure statistic, which has been used in [1, 2, 19, 23]. With our notation, it combines the precision and recall of each cluster with respect to each ground-truth population, with the convention that terms involving empty clusters are set to zero. Another appealing measure is the median F-measure, used in [26] specifically for supervised learning. It is formally defined as the median of the per-population F-measures, where the considered ground truth, in our case a manual gating, is compared against another classification of the same data.
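One common form of the F-measure statistic for evaluating a clustering against a manual gating, consistent with the references above, matches each ground-truth population to its best-scoring cluster and weights by population size. A sketch (details such as the weighting are our reading of the standard definition, not a verbatim transcription of the formula used in the paper):

```python
import numpy as np

def f_measure(truth, pred):
    """Weighted F-measure: for each ground-truth population take the best F1
    score over predicted clusters, weighted by the population's relative size."""
    truth, pred = np.asarray(truth), np.asarray(pred)
    n = len(truth)
    total = 0.0
    for t in np.unique(truth):
        in_t = truth == t
        best = 0.0
        for p in np.unique(pred):
            in_p = pred == p
            inter = np.sum(in_t & in_p)
            if inter == 0:
                continue  # convention: empty intersections contribute zero
            precision = inter / in_p.sum()
            recall = inter / in_t.sum()
            best = max(best, 2 * precision * recall / (precision + recall))
        total += in_t.sum() / n * best
    return total
```

A perfect gating scores 1 regardless of how cluster labels are named; a median F-measure variant replaces the weighted sum by the median of the per-population scores.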
To measure how similar two cytometries are, i.e., how well we do when learning from one to classify the other and vice versa, we introduce the following distance,
where the two partitions involved result from classifying the data in each cytometry using a random forest learned on the other one. This measure gives us a notion of how close two cytometries are in terms of being good predictors for one another. It takes values between 0 and 1, and two cytometries are interchangeable for learning if it is close to 0. A variation of this measure is
3.3 Clustering cytometries and template obtention
We start with a visual example to grasp the intuition behind our approach to flow cytometry. Top left in Figure 1 we have a three dimensional projection of five cell types: Basophils (black), CD4+CD8 (red), Eosinophils (green), Monocytes (blue), Neutrophils (cyan). From the picture we can view each cell type as a cluster, so that, omitting weight information for simplicity, one cluster corresponds to Basophils, another to CD4+CD8 and so on. On the other hand, top right in Figure 1 we have a representation of the same groups when we understand them as multivariate normals, with the multivariate means and covariances of the points gated as Basophils, as CD4+CD8, and so on. It becomes apparent that the multivariate geometric information is a reasonable representation of the cytometry; even more, it allows us to obtain templates in a computationally efficient way.
Suppose that we have a database consisting of a subset of 15 of the cytometries, together with a partition of this database in which one group contains three of the cytometries. If we plot together the five cell types for all three cytometries in this group, we get the representation bottom left in Figure 1. A prototype for the group, obtained by pooling as indicated in Section 2.1.2, is represented bottom right in Figure 1. We see that the artificial cytometry (template, prototype) gives a good representation of the information contained in the bottom left of Figure 1. Even more, this artificial cytometry is quite similar to each of the original cytometries in the group.
Once we have seen the intuitive meaning of clustering cytometries and of obtaining templates, we want to compare different methods for clustering the database. We use as ground truth the single linkage hierarchical clusterings obtained using the random-forest-based distance, bottom right of Figure 2, and its variation, bottom left of Figure 2. For a state of the art comparison, we use flowMatch, described in [7], with the Mahalanobis distance, depicted top right in Figure 2. The clustering obtained by optimalFlowTemplates when using single linkage hierarchical clustering is shown top left in Figure 2. More comparisons can be seen in Figure 4 in the Supplementary material. At first glance it is clear that the results from optimalFlowTemplates are much more similar to the ground truth than those of flowMatch. This should be interpreted as optimalFlowTemplates capturing the similarity between cytometries more accurately than flowMatch. Two additional facts should be stated: first, the similarity distance is independent of parameters, something that is not the case for the generalized edge cover distance used in flowMatch. Second, optimalFlowTemplates produces templates only at one stage, once the number of clusters is determined, while flowMatch produces templates at every stage of the hierarchical clustering procedure.
3.4 Gating and classification
We will apply optimalFlowTemplates+optimalFlowClassification to the database introduced in the previous section, with a test set formed by some of the cytometries. For the cytometries in the test set, we also perform an unsupervised gating given by flowMeans and an unsupervised gating given by tclust initialized with the templates obtained by optimalFlowTemplates.
Results can be seen in columns 3-5 of Table 1. In Tables 3 and 4 in the Supplementary material we have a full description of the results of optimalFlowClassification. We see that tclust initialized with optimalFlowTemplates is competitive with flowMeans but, more importantly, optimalFlowTemplates+optimalFlowClassification is superior in every one of the test cytometries, giving 5 of 6 F-measures higher than 0.95 and the remaining one higher than 0.92. Clearly our supervised procedure is working well and, as expected, gives better performance than state of the art unsupervised alternatives.
However, we also want to compare with a state of the art supervised procedure. In this case we use deepCyTOF, with some bug corrections and some adaptations of the GitHub version to our setting, implemented in Python with TensorFlow 0.12 and Keras 1.2.2. In order to use deepCyTOF we need cytometries with the same number and types of cell populations, so we use a data set where we have eliminated one group from each cytometry. We recall that deepCyTOF only uses the supervised information of one of the cytometries to classify all the other members. We see the results of deepCyTOF, with domain adaptation and without denoising, since all entries are classified, in column 1 of Table 1. DeepCyTOF's performance is rather poor, achieving a worse F-measure than flowMeans in 6 of the 9 cases, and also worse than optimalFlowTemplates+optimalFlowClassification for all applicable cases (cytometries 1, 18 and 20). However, these poor results are due to the high variability of the cytometries, which cannot be accommodated by the domain adaptation procedure of deepCyTOF. Hence, if we were able to reduce this variability, deepCyTOF should give better results. Indeed, if we use flowMeans to gate the cytometries and then apply optimalFlowTemplates, we obtain the hierarchical tree presented in Figure 3. It suggests splitting the data set into two groups. We recall that up to this point we have not used supervised information. Applying deepCyTOF to each group separately, we obtain the results in column 2 of Table 1. Now deepCyTOF performs better than flowMeans in 7 of the 9 cases; however, it is better than optimalFlowTemplates+optimalFlowClassification only for Cytometry 1, which is the one that deepCyTOF uses for learning.
DeepCyTOF  DeepCyTOF 2  flowMeans  tclust  optimalFlowC  
Cytometry 1  0.9641  0.9857  0.9501  0.9504  0.9740 
Cytometry 2  0.9420  0.9585  0.8988  
Cytometry 5  0.8728  0.8720  0.8977  
Cytometry 6  0.9195  0.9335  0.9522  
Cytometry 7  0.8763  0.8062  0.9508  
Cytometry 10  0.8610  0.8141  0.9595  
Cytometry 11  0.8653  0.9170  0.9256  
Cytometry 14  0.9825  0.9295  0.9004  
Cytometry 17  0.6982  0.9816  0.8978  
Cytometry 18  0.6840  0.9797  0.9003  0.8716  0.9853 
Cytometry 20  0.6884  0.9760  0.9223  0.8661  0.9834 
Cytometry 21  0.6942  0.9699  0.9269
Supplementary material
Appendix A Notions on optimal transport
Following [33], let us take , the space of probability distributions on . For in , let us define the set of all probability measures on with first marginal and second marginal . The optimal transport cost between the two measures is defined as
(5) 
where the integrand is the cost of transporting one unit of mass from one point to the other. A probability that achieves the minimum in (5) is called an optimal coupling, with an associated random vector that has this joint distribution. When both measures are discrete, the optimal transport problem can be solved as the following linear program (see [9]):
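The discrete problem is a standard linear program and can be solved with a generic LP solver; a small self-contained sketch (the function name is ours):

```python
import numpy as np
from scipy.optimize import linprog

def discrete_ot(a, b, C):
    """Solve min <w, C> over couplings w with row sums a and column sums b."""
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                      # row-sum constraints
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                      # column-sum constraints
        A_eq[n + j, j::m] = 1.0
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, m), res.fun
```

Dedicated network-simplex solvers scale much better than a dense LP, but this formulation matches the definition above directly.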
For a Polish metric space and , the Wasserstein distance between and is defined as
where refers to the law of .
We present the entropy regularized Wasserstein distance, since it is strictly convex and there are efficient solution methods based on the Sinkhorn algorithm (see [13]). For a fixed regularization parameter, the regularized Wasserstein distance is defined as
where are the solutions of the optimal transport linear program
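The regularized problem can be solved by alternating scaling updates (Sinkhorn iterations). A minimal sketch for a fixed regularization parameter eps (no log-domain stabilization, so very small eps may underflow; names are our own):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=500):
    """Entropy-regularized optimal transport via Sinkhorn scaling.
    Returns the regularized coupling and its (unregularized) transport cost."""
    K = np.exp(-np.asarray(C) / eps)  # Gibbs kernel
    u = np.ones_like(a, dtype=float)
    for _ in range(n_iter):
        v = b / (K.T @ u)  # scale to match column marginals
        u = a / (K @ v)    # scale to match row marginals
    plan = u[:, None] * K * v[None, :]
    return plan, float(np.sum(plan * C))
```

After the final row update, the row marginals of the plan match a exactly, while the column marginals match b up to the convergence tolerance.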
Let us consider the set of probability measures with finite second moment, endowed with the 2-Wasserstein distance. In [8] the notions of k-barycenter and trimmed k-barycenter were introduced, building on the concept of Wasserstein barycenter introduced in [10, 22]. A k-barycenter of a set of probabilities with given weights is any k-set such that, for any other k-set, we have that (6)
A trimmed k-barycenter, with weights as before, is any k-set with weights such that
(7) 
where .
Broadly speaking, k-barycenters can be thought of as an extension of k-means to the space of probabilities with finite second moment, since we can rewrite (6) as
(8) 
where the k-set induces a partition of the probabilities and each element of the k-barycenter is the barycenter of the elements in its group. Therefore, trimmed k-barycenters may be matched to trimmed k-means. As stated in [8], efficient computations can be done when dealing with location-scatter families of absolutely continuous distributions, a notable example being the family of multivariate Gaussian distributions.
Appendix B Consensus clustering
Consensus clustering is the problem of combining different partitions. These partitions (clusterings) can come from different clustering algorithms applied to the same data, from the same algorithm looking for a different number of clusters or using a different initialization, or from a combination of the previous. A possible approach to the problem is to use some type of relabelling of the original partitions and an optimization procedure to extract a consensus clustering that is the closest in some sense to the relabelled partitions [6, 32]. Other alternatives, comprising the cluster-based similarity partitioning algorithm, the hypergraph partitioning algorithm and the metaclustering algorithm (clustering of partitions), were proposed in [31]. Consensus clustering results in a gain in robustness and in the possibility of parallelization and reuse of information.
A shortcoming of the previous procedures is the difficulty to handle a combination of clusterings of different data. This shortcoming can be addressed when clusters in a partition can be viewed as a distribution. This is the case when mixture models are used for clustering. Also, in many cases we can represent clusters, obtained by some clustering procedure, by their means and covariance matrices. Even more, some fields work with data represented as probability distributions.
In this setting, we would have several partitions, coming from the same or from different data, whose clusters are elements of some space of probability measures. A natural idea is to pool together all the distributions into one data set and to do clustering in the abstract space of probabilities. The resulting partition can be used to obtain a set of distributions that will represent the consensus clustering.
A particular implementation of the previous strategy was presented in [8], where the consensus clustering is taken to be the (trimmed) k-barycenter as in (7)-(8). Here we present some alternative but related strategies to obtain a consensus clustering. We can define the matrix of pairwise Wasserstein distances between the pooled clusters and use hierarchical clustering to obtain a partition. Then, taking the barycenter (1-barycenter) of the elements in each group, we obtain the consensus clustering. By hierarchical clustering we mean single linkage, complete linkage or average linkage, but also density based hierarchical clustering such as DBSCAN [15] or HDBSCAN [11]. The latter are able to select the appropriate number of clusters automatically, which can be seen as a desirable quality; even more, they are also robust, as is the trimmed k-barycenter.
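The grouping step of this strategy is straightforward to sketch: given the pairwise Wasserstein distances between all pooled clusters, agglomerative clustering groups them, and a barycenter per group would then give the consensus clustering. A sketch of the grouping step, assuming the distance matrix has already been computed (function name is ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_clusters(D, n_groups, method="single"):
    """Group pooled clusters from a symmetric matrix D of pairwise
    Wasserstein distances, using agglomerative hierarchical clustering."""
    Z = linkage(squareform(D, checks=False), method=method)
    return fcluster(Z, t=n_groups, criterion="maxclust")
```

Density based alternatives such as HDBSCAN would instead choose the number of groups automatically, as noted above.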
Appendix C Initializing an unsupervised procedure
The unsupervised clustering procedure we have selected is tclust, introduced in [18], a robust model based clustering procedure that allows for non-spherical clusters. Nonetheless, it is possible to use any other unsupervised procedure that allows an initialization with a clustering defined by probability distributions. For example, this is the case for the popular mclust [16, 30], a finite Gaussian mixture model based clustering solved by an EM algorithm.
tclust searches for a partition of the data into a fixed number of clusters plus a set of trimmed observations, with cluster centres, positive definite scatter matrices and weights that approximately maximize the pseudo-likelihood
(9) 
under restrictions on the scatter matrices. The densities appearing in the pseudo-likelihood are those of multivariate normals, and the remaining observations form the trimmed cluster, whose size is fixed by the trimming level.
The details of the algorithm can be found in [17]. For us it is relevant to recall only the initialization step, i.e., providing an initial parameter set $\theta^0 = (\pi^0_1, m^0_1, S^0_1, \dots, \pi^0_k, m^0_k, S^0_k)$. Hence, to initialize tclust we only need a set of weights with corresponding means and covariances.
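A sketch, under the univariate simplification, of how such an initialization (weights with corresponding means and covariances) could be assembled from any clustering given as integer labels; `init_from_labels` is a hypothetical helper, not part of the tclust package:

```python
def init_from_labels(xs, labels):
    # Derive initial (weight, mean, variance) triples, one per cluster,
    # from data xs and integer cluster labels.
    groups = {}
    for x, j in zip(xs, labels):
        groups.setdefault(j, []).append(x)
    n = len(xs)
    init = []
    for j in sorted(groups):
        g = groups[j]
        m = sum(g) / len(g)                      # cluster mean
        v = sum((x - m) ** 2 for x in g) / len(g)  # cluster variance
        init.append((len(g) / n, m, v))          # weight = cluster proportion
    return init

init = init_from_labels([0.0, 2.0, 10.0, 12.0], [0, 0, 1, 1])
```

In the multivariate case the variance would simply be replaced by the empirical covariance matrix of each cluster.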
Participant  Final diagnosis  Type of tested sample  Type of anticoagulant (preservation)  Sex  Age (years)  Incubation period  Type of Flow Cytometer 
Centre 1  HD  PB  EDTA  M  53  30 min  BD FACSCanto 
Centre 1  HD  PB  EDTA  M  50  30 min  BD FACSCanto 
Centre 1  HD  PB  EDTA  M  61  30 min  BD FACSCanto 
Centre 2  HD  PB  Heparin  M  29  30 min  BD FACSCanto 
Centre 2  HD  PB  Heparin  M  38  30 min  BD FACSCanto 
Centre 2  HD  PB  Heparin  F  27  30 min  BD FACSCanto 
Centre 2  HD  PB  Heparin  F  NA  30 min  BD FACSCanto 
Centre 2  HD  PB  Heparin  M  NA  30 min  BD FACSCanto 
Centre 2  HD  PB  Heparin  F  NA  30 min  BD FACSCanto 
Centre 2  HD  PB  Heparin  F  NA  30 min  BD FACSCanto 
Centre 3  HD  PB  NA  M  34  15 min  BD FACSCanto 
Centre 3  HD  PB  NA  F  33  15 min  BD FACSCanto 
Centre 3  HD  PB  NA  M  32  15 min  BD FACSCanto 
Centre 3  HD  PB  NA  M  33  15 min  BD FACSCanto 
Centre 3  HD  PB  NA  F  35  15 min  BD FACSCanto 
Centre 3  HD  PB  EDTA  NA  Adult  15 min  BD FACSCanto 
Centre 3  HD  PB  EDTA  NA  Adult  15 min  BD FACSCanto 
Centre 3  HD  PB  EDTA  NA  Adult  15 min  BD FACSCanto 
Centre 3  HD  PB  EDTA  NA  Adult  15 min  BD FACSCanto 
Centre 3  HD  PB  EDTA  NA  Adult  15 min  BD FACSCanto 
Cytometry 1  
Cytometries clustering  Templates formation  Method 1  Method 2  Method 3  Method 4  Method 5 
complete, k = 5  pooling  0.9064  0.9339  0.8992  0.9196  0.8521 
HDBSCAN  pooling  0.9064  0.9323  0.8992  0.9196  0.8521 
complete, k = 5  HDBSCAN  0.0000  0.9398  0.8825  0.3542  0.7429 
HDBSCAN  HDBSCAN  0.0000  0.9111  0.7235  0.5263  0.9186 
complete, k = 5  37-barycenter, alpha = 0.05  0.0000  0.9498  0.9512  0.2298  0.8707 
HDBSCAN  37-barycenter, alpha = 0.05  0.0000  0.9485  0.7755  0.6846  0.5988 
Cytometry 6  
Cytometries clustering  Templates formation  Method 1  Method 2  Method 3  Method 4  Method 5 
complete, k = 5  pooling  0.7692  0.8524  0.7613  0.8373  0.8373 
HDBSCAN  pooling  0.8787  0.8571  0.7762  0.8427  0.8427 
complete, k = 5  HDBSCAN  0.7212  0.9265  0.8398  0.0653  0.8163 
HDBSCAN  HDBSCAN  0.8270  0.9276  0.8399  0.7924  0.8034 
complete, k = 5  37-barycenter, alpha = 0.05  0.0000  0.9184  0.8399  0.8110  0.7572 
HDBSCAN  37-barycenter, alpha = 0.05  0.1468  0.9112  0.7849  0.6314  0.6314 
Cytometry 10  
Cytometries clustering  Templates formation  Method 1  Method 2  Method 3  Method 4  Method 5 
complete, k = 5  pooling  0.9525  0.9431  0.9440  0.9306  0.9306 
HDBSCAN  pooling  0.9525  0.9421  0.9440  0.9306  0.9306 
complete, k = 5  HDBSCAN  0.9009  0.9451  0.9445  0.0739  0.8181 
HDBSCAN  HDBSCAN  0.9009  0.9423  0.9445  0.0739  0.8181 
complete, k = 5  37-barycenter, alpha = 0.05  0.0000  0.9345  0.9407  0.9358  0.1163 
HDBSCAN  37-barycenter, alpha = 0.05  0.0000  0.6694  0.6325  0.7901  0.1610 
Cytometry 11  
Cytometries clustering  Templates formation  Method 1  Method 2  Method 3  Method 4  Method 5 
complete, k = 5  pooling  0.9069  0.9506  0.9336  0.9378  0.9363 
HDBSCAN  pooling  0.9069  0.9496  0.9336  0.9378  0.9363 
complete, k = 5  HDBSCAN  0.4304  0.7500  0.6864  0.1264  0.5928 
HDBSCAN  HDBSCAN  0.4304  0.7556  0.6864  0.1264  0.5928 
complete, k = 5  37-barycenter, alpha = 0.05  0.6319  0.9383  0.9116  0.6206  0.9095 
HDBSCAN  37-barycenter, alpha = 0.05  0.0000  0.9494  0.9116  0.7151  0.8975 
Cytometry 18  
Cytometries clustering  Templates formation  Method 1  Method 2  Method 3  Method 4  Method 5 
complete, k = 5  pooling  0.9088  0.9611  0.9038  0.6433  0.3232 
HDBSCAN  pooling  0.8465  0.8585  0.7045  0.7901  0.4740 
complete, k = 5  HDBSCAN  0.9043  0.9618  0.9171  0.6415  0.7912 
HDBSCAN  HDBSCAN  0.9043  0.9617 