1 Introduction
Manually labeling and counting cells based on morphology is an essential component in diagnosing blood diseases such as leukemia or anemia. Consequently, the quality of labels directly affects diagnoses and patient outcomes. In many cases, however, cell types from the same development line are difficult to distinguish, and in combination with image noise, lack of concentration, and subjectivity, humans are prone to labeling mistakes. The labels are often inconsistent and not reproducible; for example, in an experiment on classifying lymphocytes, “… of the morphologists were not able to reproduce their previous classification” [vanderMeer.2007].
Existing approaches to improve labeling reproducibility and consistency focus on automation via machine-learning-based classification, such as decision trees, support vector machines, or neural networks
[Akrimi.2014, Habibzadeh.2018, Kihm.2018, Shpilman.2017, Tomari.2014, Wheeless.1994]. However, training these algorithms relies on manually generated, noisy labels, and propagating this noise through training may result in biased predictions. To mitigate the adverse effect of label noise on predictive power, different approaches exist [Frenay.2014]: Algorithms can be applied that are relatively robust to label noise [Abellan.2010, Ratsch.2003, Sastry.2010], filtering methods can be used to remove mislabeled instances from the training set [Sanchez.1997b, Thongkam.2008, Wilson.2000], or label noise can be modeled explicitly [Bouveyron.2009, Lawrence.2001, Rantalainen.2011]. For example, the algorithms nearest neighbors (NN) and nearest centroid neighbors (NCN) [Sanchez.1997] find instances whose classes do not agree with the classes of their neighbors [Saez.2015, Sanchez.2003]. Alternative approaches find labeling errors based on confident learning [cleanlab, Northcutt.2019, Northcutt.2017], which estimates the joint distribution between noisy labels and true labels to identify noisy labels and improve training. While these algorithms are able to find noisy or wrong labels, they do not provide suggestions on what the correct labels should be. In addition, they are based on the premise of complete automation, which, especially in the medical domain, is often not feasible due to regulatory constraints requiring human oversight and accountability.
In this work, we follow an orthogonal, human-centered approach: Rather than taking the human (the hematologist performing the labeling task) out of the loop, we develop a human-centered interpretable AI algorithm that comprises two stages. In the first stage, our algorithm identifies those labels that are inconsistent with the morphology of the cells in an unsupervised manner, i.e., without training on noisy labels. In the second stage, an alternative, consistent label is suggested to the hematologist performing the labeling task, based on the labels of the cells in the direct neighborhood of the cell in question. We ensure interpretability of the method by explicitly incorporating prior knowledge on the biological system and an expert-driven error model into the state space model.
More specifically, pseudotime inference with Markov models for labeling consistency (TIMELY) combines pseudotime inference algorithms with inhomogeneous hidden Markov trees, which are an extension of hidden Markov models. Hidden Markov trees are used since blood stem and progenitor cells differentiate into more mature cell types during their development process, where blood cell lineages can branch into functionally distinct lineages. The differentiation process itself is commonly described as a stochastic process following the Markov assumption, where cells can either remain in the present type or differentiate into a child cell type [abkowitz1996evidence]. Hidden Markov trees reflect this differentiation process and can be used to model the true, unobservable cell types (i.e., true labels) together with the noisy, observed expert labels. Given a set of cell images and noisy labels, we establish an intrinsic ordering of the cells based on pseudotime inference methods. The ordered cells are used as input for the hidden Markov tree. We extend standard hidden Markov trees to the inhomogeneous case, propose parametric transition matrices, and derive an efficient inference scheme. Finally, we identify inconsistent labels and propose alternative, consistent labels to the practitioner. An overview of TIMELY is shown in Figure 1.
The manuscript is structured as follows. In Section 2, we first describe how we establish an intrinsic ordering of the cells using pseudotime inference algorithms. We then briefly review hidden Markov trees and introduce an inhomogeneous extension with parametric transition matrices. Our method TIMELY, which combines both, is then described in detail. In Section 3, we demonstrate, based on extensive simulations, that our modeling approach is able to identify and correct noisy labels with higher precision and recall than state-of-the-art methods for identifying noisy labels. Finally, we apply our algorithm to two real-world datasets of white blood cells with noisy labels in Section 4 and demonstrate that our method is able to identify inconsistent labels and suggest alternative, consistent labels to practitioners in an interpretable manner. We validate these suggested new labels by reclassification via a domain expert.
2 Methods
2.1 Pseudotime inference
The pseudotime of a cell describes the developmental progress of the cell along a dynamic process such as cell differentiation. The greater the pseudotime of a cell, the more mature the cell. By using pseudotime inference algorithms, we can create a pseudotemporal ordering for all cells in a population. Pseudotime inference algorithms are usually applied to single-cell gene expression similarity measurements [Haghverdi.2016], where adjacent cells have higher expression similarity. We can apply these algorithms to medical images by interpreting the pixels of a cell image as information about the cell, similar to gene expression data, to obtain an ordering of the cells along trajectories.
There is a multitude of pseudotime inference methods to date, which differ in the prior information they require, their scalability, and the type of topology they support [Saelens.2019]. Most pseudotime inference methods consist of two parts. The first part is the calculation of a low-dimensional representation from the given expression data of the cells, and the second part is the ordering of the cells along an inferred trajectory.
Here, we use the algorithms SCORPIUS [Cannoodt.2016] and STREAM [Chen.2019]. SCORPIUS shows very good performance for linear datasets [Saelens.2019], while STREAM is well suited for datasets with tree-like topologies.
In brief, given the expression profiles of the cells, SCORPIUS obtains a low-dimensional representation using multidimensional scaling (MDS). Next, SCORPIUS applies k-means clustering and sets the initial trajectory by connecting the cluster centers. The final trajectory results from an iterative refinement through the principal curves algorithm [Hastie.1989]. The pseudotime is calculated by projecting the low-dimensional representations onto the trajectory.
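The two-part structure (embedding, then ordering) can be illustrated with a deliberately simplified sketch. It is not SCORPIUS: it uses classical MDS implemented in NumPy and replaces the clustering and principal-curve refinement with a projection onto the leading embedding axis, which only makes sense for roughly linear data. The function names `classical_mds` and `linear_pseudotime` and the toy data are assumptions made for illustration.

```python
import numpy as np

def classical_mds(features, n_components=2):
    """Embed the rows of `features` with classical MDS on Euclidean distances."""
    sq_norms = np.sum(features ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    n = features.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq_dists @ centering  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

def linear_pseudotime(features):
    """Stand-in for trajectory inference: leading MDS axis rescaled to [0, 1]."""
    axis = classical_mds(features, n_components=1)[:, 0]
    return (axis - axis.min()) / (axis.max() - axis.min())

# Toy data (hypothetical): a 1-D latent progression embedded in 8 dimensions.
rng = np.random.default_rng(0)
latent = np.sort(rng.uniform(0.0, 1.0, size=50))
features = np.outer(latent, rng.normal(size=8)) + 0.01 * rng.normal(size=(50, 8))

pseudotime = linear_pseudotime(features)
ordering = np.argsort(pseudotime)  # pseudotemporal ordering of the cells
```

The sign of an embedding axis is arbitrary, so the direction of the ordering has to be fixed separately, for example by requiring that cells of the root type receive small pseudotimes.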
Similarly, STREAM first determines relevant features and then performs dimensionality reduction using modified locally linear embedding (MLLE). In the new embedding, an implementation of elastic principal graphs (ElPiGraph) [Albergante.2018] is used to infer the trajectory and branching points. ElPiGraph approximates datasets with complex topologies by minimizing the elastic energy of the embedding and applying graph transformations. The cells are then projected onto the resulting tree according to their pseudotimes and their assigned branches.
2.2 Hidden Markov trees
Hidden Markov trees are used to describe the differentiation process of the cells, which is a stochastic process following the Markov assumption [abkowitz1996evidence]. There is one root cell type, and all other cell types develop from it and can be mapped onto a tree-like topology reflecting their respective progeny. Assume that we know the topology of the dataset, i.e., we know the shape of the Markov tree.
Definition 1.
A tree is a Markov tree if, for each leaf, the directed path connecting the root and the leaf is a Markov chain.
A hidden Markov tree is an extension of a Markov tree, and it is used for applications where the Markov property does not hold or where the states can only be observed indirectly. The model consists of observed variables and hidden variables, where only the hidden variables follow the Markov property. The present observed variable depends on the present hidden state, but neither on previous observed states nor on previous hidden states.
We consider a hidden tree and an observed tree, each consisting of one variable per node. Each tree has a root, and both trees have the same indexing structure.
Definition 2.
Let two trees with the same indexing structure be given, where one is the observed tree and the other is the hidden tree. The pair is a hidden Markov tree (HMT) if
(i) the hidden tree is a Markov tree, and
(ii) the distribution of each observed variable depends only on the hidden variable at the same node.
For the application to cell image labels, the observed variable at each node corresponds to the noisy expert label, and the hidden variable represents the true (unobservable) label of the image, which may differ from the expert label. The sequence of images is sorted by increasing pseudotime, which has been calculated beforehand by a suitable pseudotime inference algorithm. Let $K$ be the number of cell types and $N$ be the number of images in the dataset.
Definition 3.
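The hidden Markov tree is governed by the parameters $\pi$, $A^{(u)}$, and $B$. Let $X_u$ and $Y_u$ denote the hidden and the observed variable at node $u$, respectively, let node $1$ be the root, and let $K$ be the number of cell types. For $i, j \in \{1, \dots, K\}$, we define
(1)  $\pi_i = P(X_1 = i)$,
(2)  $A^{(u)}_{ij} = P(X_u = j \mid X_{\rho(u)} = i)$ for every non-root node $u$,
(3)  $B_{ij} = P(Y_u = j \mid X_u = i)$ for every node $u$,
where $\rho(u)$ denotes the parent of node $u$. We call $\pi$ the start probabilities, $A^{(u)}$ the transition matrix at node $u$, and $B$ the emission matrix. If the transition matrix is independent of $u$, the model is called homogeneous; otherwise, the model is called inhomogeneous.
The transition matrix describes the probability of staying in the present cell type or changing to a child cell type. The emission matrix represents the expert labeling error model, where $B_{ij}$ is the probability that the expert predicts label $j$ when the true cell type of the cell in the image is $i$.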
A hidden Markov model (HMM) is a special case of an HMT, where the underlying topology is a chain. Figure 2 shows a visualization of an HMM for our application.
Time-dependent transition matrices
We use the following information to set up the parametric transition matrices. We know the topology of the dataset, and following the Markov assumption of blood cell differentiation [abkowitz1996evidence], it is only possible for a cell to stay in the same cell type or to transition to one of the child cell types (see Figure 4 for a topology example). There is no way to skip one cell type or to go back to a previous cell type. Once one of the end stages is reached, there are no transitions anymore.
Standard homogeneous HMMs/HMTs are based on the assumption that the transition between states is independent of the node, which would correspond to cells sampled uniformly across the development trajectory. In practice, however, the sampled (i.e., labeled) cells come from arbitrary points on the development trajectory, which is reflected by a large variation in the pseudotime difference between neighboring cells. This difference directly affects the probability that a cell transitions to a different cell type. The larger the pseudotime difference between two cells, the greater the likelihood of a transition (and the lower the likelihood of remaining in the same cell type). Consequently, the entries of the transition matrix at a given node should depend not only on the cell type of the previous cell, but also on the pseudotime difference between the present cell and the previous cell. To model the dependency of the transition matrix on the pseudotime, we extend the algorithms for HMMs and HMTs to the inhomogeneous case and derive appropriate parametric transition matrices.
Define the pseudotime difference at a node as the difference in pseudotime between that node and its predecessor, after the nodes have been ordered by increasing pseudotime. To find reasonable entries for the transition matrices, we rewrite the transition probabilities at a given node:
(4)  
(5)  
(6)  
(7) 
Define to be the transition probability from cell type to cell type . Let be a constant independent of , with condition for all .
For the distribution of the pseudotime difference, we know that its support is $[0, \infty)$, since the cells are ordered by increasing pseudotime. Since we have no more information about the distribution of the pseudotime difference, we use the maximum entropy probability distribution. The least informative distribution for a random variable with support $[0, \infty)$ and a given mean $\mu$ is the exponential distribution with rate $1/\mu$. Let the rate depend on the two cell types involved in the transition.
Then, for each possible transition in the cell lineage tree, the entry in the transition matrix after normalization has the form
(8) 
for and .
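To make the construction concrete, the following sketch builds a pseudotime-dependent transition matrix for a linear chain of five cell types. The functional form used here (a constant weight times an exponential density in the pseudotime difference, normalized over the allowed transitions of each row) is an assumption that is consistent with the description above; it is not confirmed to be identical to (8). The names `transition_matrix`, `base_probs`, and `rates`, as well as all numerical values, are hypothetical.

```python
import numpy as np

def transition_matrix(delta, base_probs, rates, allowed):
    """Assumed form: base * rate * exp(-rate * delta), row-normalized over allowed moves."""
    weights = np.where(allowed, base_probs * rates * np.exp(-rates * delta), 0.0)
    return weights / weights.sum(axis=1, keepdims=True)

n_types = 5
# Chain topology: a cell may stay in its type or move to the single child type.
allowed = np.eye(n_types, dtype=bool)
for i in range(n_types - 1):
    allowed[i, i + 1] = True

# Hypothetical parameters; in TIMELY the transition parameters are learned.
base_probs = np.where(allowed, 0.5, 0.0)
base_probs[-1, -1] = 1.0  # the end stage has no outgoing transitions
rates = np.where(np.eye(n_types, dtype=bool), 20.0, 2.0)

small_gap = transition_matrix(0.01, base_probs, rates, allowed)
large_gap = transition_matrix(0.50, base_probs, rates, allowed)
```

With a larger rate for staying than for advancing, a larger pseudotime difference shifts probability mass from the diagonal to the child type, which matches the qualitative behavior described in the text. Skipping a type or going back has probability zero by construction.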
The parameters in (8) are learned using the generalized EM algorithm [Neal.1998] since the corresponding objective function is intractable. The generalized Viterbi algorithm [Durand.2004] then computes the most probable hidden variables.
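For intuition, the decoding step can be sketched for the chain special case (an HMM) with one transition matrix per step. This is the standard Viterbi recursion in log space, not the generalized tree algorithm of [Durand.2004]; the function name `viterbi_inhomogeneous` and the toy parameters are assumptions.

```python
import numpy as np

def viterbi_inhomogeneous(observations, start_probs, transitions, emission):
    """Most probable hidden path for a chain with one transition matrix per step.

    transitions[t] maps the state at position t to the state at position t + 1;
    emission[i, j] is the probability of observing label j in hidden state i.
    """
    n_steps = len(observations)
    tiny = 1e-300  # avoids log(0) for forbidden transitions
    log_delta = np.log(start_probs + tiny) + np.log(emission[:, observations[0]] + tiny)
    backpointers = np.zeros((n_steps, len(start_probs)), dtype=int)
    for t in range(1, n_steps):
        scores = log_delta[:, None] + np.log(transitions[t - 1] + tiny)
        backpointers[t] = np.argmax(scores, axis=0)
        log_delta = scores.max(axis=0) + np.log(emission[:, observations[t]] + tiny)
    path = np.zeros(n_steps, dtype=int)
    path[-1] = int(np.argmax(log_delta))
    for t in range(n_steps - 1, 0, -1):
        path[t - 1] = backpointers[t, path[t]]
    return path

# Toy example (hypothetical values): three cell types in a chain.
stay_or_advance = np.array([[0.7, 0.3, 0.0],
                            [0.0, 0.7, 0.3],
                            [0.0, 0.0, 1.0]])
observations = [0, 0, 1, 0, 1, 1, 2, 1, 2, 2]  # noisy labels sorted by pseudotime
transitions = [stay_or_advance] * (len(observations) - 1)
emission = np.full((3, 3), 0.1) + 0.7 * np.eye(3)
start_probs = np.array([0.98, 0.01, 0.01])

path = viterbi_inhomogeneous(observations, start_probs, transitions, emission)
```

Because backward transitions have probability zero, the decoded path is non-decreasing, so each change of state marks a single border between two consecutive cell types.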
2.3 Our algorithm: TIMELY
TIMELY combines pseudotime inference methods with inhomogeneous HMTs. The pseudotime inference algorithm establishes an intrinsic ordering of the cells based on morphology, and the HMT then finds inconsistent labels and proposes correct labels of the cells corresponding to the true cell types.
The input of TIMELY is a set of images together with noisy expert labels. First, we use a convolutional neural network to learn meaningful feature representations of the cell images that are consistent with the morphology of the cells. The convolutional network consists of three convolutional layers, each followed by a max-pooling layer. A bottleneck layer, whose units provide the resulting feature vectors, is followed by two dense layers and an output layer. As an alternative, we also explored unsupervised methods such as autoencoders to learn feature representations of the images so that the training is not affected by noisy labels; this yielded qualitatively similar findings (results not shown).
Next, a suitable pseudotime inference method is applied to calculate the pseudotimes, and the cells are sorted by increasing pseudotime. We use either SCORPIUS or STREAM, depending on the topology of the data. The sorted expert labels serve as the observed information in the HMT, and the hidden labels are the true cell types to be determined. We can use our background information about the dataset to fix the start probabilities and the emission matrix, while the parameters of the transition matrices are learned by the generalized EM algorithm. Through the generalized Viterbi algorithm, we find the most probable true labels and the estimated cell type borders, which are unique due to the Markov assumption [abkowitz1996evidence].
Any inconsistencies between the true labels and the expert labels are potential mistakes by the expert (Figure 3). Hematologists can reconsider the affected images and, if necessary, correct the labels of the cells.
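The comparison between decoded labels and expert labels amounts to a simple element-wise check. The sketch below lists each inconsistent image together with the expert label and the proposed label; the helper `flag_inconsistent` and the toy label sequences are hypothetical.

```python
import numpy as np

def flag_inconsistent(expert_labels, decoded_labels, class_names):
    """Return (index, expert label, proposed label) for every disagreement."""
    expert = np.asarray(expert_labels)
    decoded = np.asarray(decoded_labels)
    indices = np.flatnonzero(expert != decoded)
    return [(int(i), class_names[expert[i]], class_names[decoded[i]]) for i in indices]

class_names = ["PMY", "MY", "MMY", "BNE", "SNE"]
# Labels sorted by pseudotime (toy values); decoded labels are non-decreasing.
expert = [0, 0, 1, 0, 1, 2, 1, 2, 3, 3, 4, 4]
decoded = [0, 0, 0, 0, 1, 1, 1, 2, 3, 3, 4, 4]

suggestions = flag_inconsistent(expert, decoded, class_names)
```

The labels are not overwritten; the list only tells the hematologist which images to reconsider and which label the model would propose.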
The method is summarized in Algorithm 1. We implemented TIMELY in Python, and the library SciPy is used for maximizing the objective function in the generalized EM algorithm. The source code for the implementation can be found at https://github.com/liuyushan/labelconsistency.
3 Comparison to Baselines
3.1 Baseline methods
We compare TIMELY to three baseline methods. As discussed in Section 1, most algorithms are either robust to noise, find and remove noisy labels, or model label noise explicitly, but they do not propose new labels.
The algorithms nearest neighbors (NN) and nearest centroid neighbors (NCN) [Sanchez.1997] find neighbors for each instance for a given distance measure. A commonly used distance measure for NN is the Euclidean distance, while for NCN, we add instances to the set of nearest neighbors for which the centroid of the new set is nearest to the considered instance. The label of the considered instance is then obtained by a majority vote. If the majority vote yields a different label than the initial label of the instance, or if there is a tie, the instance might be incorrectly labeled. To compare our method with other methods that also propose corrections, we extend these two methods with generalized editing [Koplowitz.1981], i. e., we choose numbers and with for NN and NCN. For each instance, if there are at least nearest neighbors from a different cell type, the cell type of the instance is changed to that type. Unlike in [Koplowitz.1981], we do not delete any samples. For both methods, we choose the numbers and , which are common values in the literature [Saez.2015, Sanchez.2003].
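A minimal sketch of the nearest-neighbor baseline with a relabeling rule in the spirit of generalized editing is given below. It uses Euclidean distance, votes with the original labels for all instances at once (a batch variant, whereas editing during application changes labels sequentially), and never deletes samples. The parameter names `k` and `k_prime` and their default values are assumptions and not the values used in the experiments.

```python
import numpy as np

def knn_generalized_editing(features, labels, k=5, k_prime=4):
    """Relabel an instance if at least k_prime of its k nearest neighbors share another label."""
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # an instance is not its own neighbor
    new_labels = labels.copy()
    flagged = []
    for idx in range(len(labels)):
        neighbors = np.argsort(dists[idx])[:k]
        values, counts = np.unique(labels[neighbors], return_counts=True)
        top = values[np.argmax(counts)]
        if top != labels[idx] and counts.max() >= k_prime:
            new_labels[idx] = top
            flagged.append(idx)
    return new_labels, flagged

# Toy data (hypothetical): two well-separated 1-D clusters with one flipped label.
features = np.concatenate([np.arange(10) * 0.1, 10.0 + np.arange(10) * 0.1])[:, None]
clean = np.array([0] * 10 + [1] * 10)
noisy = clean.copy()
noisy[3] = 1

edited, flagged = knn_generalized_editing(features, noisy)
```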
We also compare TIMELY to cleanlab [cleanlab, Northcutt.2019], which is based on confident learning [Northcutt.2017] and finds labeling errors. It estimates the noise rates by calculating the joint distribution between noisy and uncorrupted labels and then prunes inconsistent samples.
3.2 Simulation data
Since expert labels from real-world datasets are often noisy, we do not know the ground truth labels of the images. For comparing our algorithm to other methods in finding inconsistent labels, we simulate three datasets with different noise levels that mimic the cell differentiation setting. Each dataset consists of samples from five cell types, where the underlying topology is a chain. The process of simulating the datasets is the following:

Sample , where

Sort the columns of by increasing , .

Define the corresponding ground truth labels , where the entries are for .

Apply mapping to project to a higherdimensional space: . We choose to be consistent with our realworld datasets.

Add noise level to the ground truth labels by randomly changing of the entries in to different labels. The steps to are repeated for each noise level.
The idea is that the samples have a lowdimensional ordering, corresponding to the pseudotemporal ordering, which can be retrieved by dimensionality reduction of the higherdimensional feature vectors.
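The simulation procedure can be sketched as follows. The sampling distribution, the mapping to the higher-dimensional space, the feature dimension, and the sample size are not recoverable from the description here, so the uniform latent variable, the random nonlinear projection, and all default values below are assumptions chosen only to illustrate the steps.

```python
import numpy as np

def simulate_chain_dataset(n_samples=500, n_types=5, n_features=50,
                           noise_level=0.1, seed=0):
    """Simulate ordered samples from a chain of cell types with flipped labels."""
    rng = np.random.default_rng(seed)
    # Sample a 1-D latent position per cell and sort by it (assumed uniform).
    latent = np.sort(rng.uniform(0.0, 1.0, size=n_samples))
    # Ground truth labels: equally wide consecutive intervals of the latent axis.
    true_labels = np.minimum((latent * n_types).astype(int), n_types - 1)
    # Map to a higher-dimensional feature space (assumed random nonlinear map).
    projection = rng.normal(size=(1, n_features))
    features = np.tanh(latent[:, None] @ projection)
    features += 0.05 * rng.normal(size=(n_samples, n_features))
    # Label noise: change a fixed fraction of labels to a different class.
    noisy_labels = true_labels.copy()
    n_flips = int(round(noise_level * n_samples))
    for idx in rng.choice(n_samples, size=n_flips, replace=False):
        alternatives = [c for c in range(n_types) if c != true_labels[idx]]
        noisy_labels[idx] = rng.choice(alternatives)
    return features, true_labels, noisy_labels

features, true_labels, noisy_labels = simulate_chain_dataset(noise_level=0.1)
```

Only the label-noise step depends on the noise level, so the same features and ground truth labels can be reused across noise levels by fixing the seed.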
3.3 Simulation results
The results of the comparison are shown in Table 1. The methods NN+edit and NCN+edit modify the labels during application, while NN, NCN, and cleanlab only find possible labeling errors. TIMELY finds labeling errors and proposes new labels without changing them directly.
We compare the proposed labels with the ground truth labels to calculate the accuracy. The selected items are the instances that the algorithm marked as labeling errors. While TIMELY finds errors in a magnitude that is similar to the noise level, the other methods mostly find too many errors, without increasing the recall. Only in one case does NCN have a higher recall than TIMELY. Our method has the highest accuracy, precision, recall, and $F_1$ score in all the other cases.
Editing in NN and NCN often improves the $F_1$ score compared to the versions without editing. However, editing labels during application influences the classification of subsequent samples so that the accuracy drops if there are too many false positives.
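The reported metrics can be computed from the ground truth labels, the noisy labels, and the labels proposed by a method. In the sketch below, an instance counts as selected when its proposed label differs from its noisy label; for methods that only flag errors without proposing labels, a boolean mask would be used instead. The function name `selection_metrics` and the toy labels are assumptions.

```python
import numpy as np

def selection_metrics(true_labels, noisy_labels, proposed_labels):
    """Accuracy of proposed labels plus precision, recall, and F1 of the selection."""
    true_labels = np.asarray(true_labels)
    noisy_labels = np.asarray(noisy_labels)
    proposed_labels = np.asarray(proposed_labels)
    is_error = noisy_labels != true_labels      # actual labeling errors
    selected = proposed_labels != noisy_labels  # instances marked as errors
    true_pos = int(np.sum(is_error & selected))
    precision = true_pos / selected.sum() if selected.any() else 0.0
    recall = true_pos / is_error.sum() if is_error.any() else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return {
        "accuracy": float(np.mean(proposed_labels == true_labels)),
        "selected": int(selected.sum()),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
    }

metrics = selection_metrics(
    true_labels=[0, 0, 1, 1, 2, 2],
    noisy_labels=[0, 1, 1, 1, 2, 0],
    proposed_labels=[0, 0, 1, 2, 2, 2],
)
```

In this toy case, three instances are selected, two of which are real errors, so the precision is 2/3, the recall is 1, and the $F_1$ score is 0.8.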
Noise level  Metric  TIMELY  NN  NN+edit  NCN  NCN+edit  cleanlab
10  Accuracy
10  Selected items
10  Precision
10  Recall
10  $F_1$ score
20  Accuracy
20  Selected items
20  Precision
20  Recall
20  $F_1$ score
30  Accuracy
30  Selected items
30  Precision
30  Recall
30  $F_1$ score
4 Application to Real-World Datasets
We apply TIMELY to two image datasets of stained white blood cells. All images were generated from a digital microscope (Cellavision, Siemens Healthineers AG) and labeled by an expert. Due to the challenges in manual labeling outlined in Section 1, the labels are noisy and partly incorrect. For the preparation of the images, a thin blood film is wedged on a glass slide and stained. The digital microscope then locates the blood cells and creates corresponding images. Our datasets contain images from several patients. TIMELY is applied to the whole dataset to first find an ordering of all images; it then suggests a label for each image. For a new patient, images from the same developmental tree can be mapped onto the already calculated tree, and consistent labels can be read off the tree directly by making use of the already computed transition borders.
4.1 Cell lineage line
Dataset
The first dataset consists of cell images that contain five cell types of the development line granulopoiesis. The topology is a linear chain, and there are images labeled by an expert as belonging to each of the cell types promyelocyte (PMY), myelocyte (MY), metamyelocyte (MMY), band neutrophil (BNE), and segmented neutrophil (SNE). Figure 4 shows the differentiation hierarchy; example images in this order are shown in Figure 5.
Parameters in HMM
We use our background knowledge about the dataset to fix the start probabilities and the emission matrix. The dataset has five cell types, and we know that the root type in the development process is PMY. So we can fix the start probabilities to be
(9) 
The first cell should be in the first cell type with high probability and in the other cell types with low probability.
The constant emission matrix is based on estimations of an expert who could realistically estimate the probability of labeling errors. The emission matrix for the first dataset is shown in (10). The more mature cell types band neutrophil and segmented neutrophil are fairly easy for humans to differentiate, while the first three cell types, especially myelocytes, are more difficult to label.
(10) 
Pseudotime inference
We use the algorithm SCORPIUS to compute the pseudotimes. Before SCORPIUS is applied, we use diffusion maps [Coifman.2006, Haghverdi.2015] for dimensionality reduction, after which SCORPIUS directly infers the trajectory without performing MDS.
Figure 6(a) shows the trajectory (black line) that is inferred by SCORPIUS for the cell development line. Each cell is represented by a point in the plot, which is colored according to its expert label. The pseudotime is normalized to be between 0 and 1. The value 0 stands for the beginning of the development process, while the value 1 represents the end of the development process. In Figure 6(b), the cells are grouped by cell type, and we see that the pseudotime increases as the cells develop. For myelocytes, for example, there are some cells with smaller pseudotimes; these might be instances with wrong labels.
Visualization tool
After parameter optimization, the HMM finds unique transition borders between the cell types. We provide a visualization tool for viewing the images, which is shown in Figure 7. The images are ordered by the inferred pseudotime. Each image corresponds to one point, which is highlighted in the color corresponding to the expert label. The inferred transition borders from the HMM are integrated, and the spaces between the borders are colored according to the proposed cell types. Inconsistent classifications can be identified by mismatching colors. The expert can click on each point to see the corresponding image and navigate to neighboring cells by clicking on the arrows, thereby getting an intuition why a specific cell was marked as having an inconsistent label.
Inconsistent labels
The amount of consistent labels, where the hidden labels and expert labels coincide, is according to the HMM , which means that there are images with potentially wrong labels.
The confusion matrix in Figure 8 shows that the consistency for myelocytes and metamyelocytes is particularly low. Overall, the tendency of the values is similar to the expert’s estimation of the emission matrix (10). Experiments showed that the results are quite robust with respect to the emission matrix so that small changes in the estimations will not affect the results significantly.
The inconsistent images were given to a domain expert for reclassification. For some of these images, the expert confirmed the previous labels. For the remaining cells, the expert either relabeled them as the cell types the HMM proposed, or she could not give a label with high confidence, which means that these remaining inconsistent images might have wrong labels. Most of the reclassifications affect the first three cell types in the development line, where the changes in the morphology can be very subtle.
4.2 Cell lineage tree
Dataset
The second dataset consists of cell images from ten classes that are part of a development process with branching points. There are images labeled by an expert as belonging to each of the cell types promyelocyte (PMY), myelocyte (MY), metamyelocyte (MMY), band neutrophil (BNE), segmented neutrophil (SNE), blast (BL), basophil (BA), eosinophil (EO), and lymphocyte (LY). For the last class plasma cell (PC), there are only images. Figure 4 shows the underlying lineage tree of the dataset, where we have four end stages with no further transitions. Eosinophils and basophils also have the precursors myelocyte, metamyelocyte, and band neutrophil. However, these would have different staining colors than the precursors of the segmented neutrophil. Because those cell types are quite rare in the blood, they are not included in the dataset.
Parameters in HMT
As we can see in Figure 4, the root cell should be a blast cell so that the entry for blast is very high in the start probabilities. The constant emission matrix is again based on discussions with an expert and is a consistent extension of (10). The five additional cell types should not be too difficult to differentiate from the cell types of the first dataset because they are part of different development lines. Only the blasts have some similarities to the promyelocytes, which are descendants of the blasts. The end stages segmented neutrophil, basophil, and eosinophil should be easy for experts to classify.
Pseudotime inference
We apply the algorithm STREAM for inferring a reasonable tree for the dataset. Two of the three branching points in Figure 6(c) match the branching points from the cell lineage tree in Figure 4. However, the last one, where the eosinophils branch off from the metamyelocytes, is different. Generally, the eosinophils are far away from the other cell types in the feature space after dimensionality reduction, so the connection point to the remaining tree might not be accurate. Another reason could be that the precursor cells of segmented neutrophils and eosinophils look alike. Eosinophils have the same progenitor stages as the neutrophils, which are only stained in a different color. The algorithm might therefore also identify the metamyelocyte as a previous development stage of the eosinophil. The ranges of the pseudotimes are still plausible for all cell types.
Inconsistent labels
The percentage of consistent labels according to the HMT is , which means that there are images with potentially wrong labels. The blasts and promyelocytes seem to be mixed up often, while basophils and eosinophils have high agreement between hidden labels and expert labels, probably because of their distinct staining colors. The agreement for lymphocytes is also very high since cells from different development lines are usually easier to differentiate.
The inconsistent images were given to a domain expert for reclassification. The expert confirmed for images their previous labels so that up to of the inconsistent images might have wrong labels. Most reclassifications affect promyelocytes, myelocytes, and metamyelocytes, the first three cell types in the granulopoiesis line. The cells are mostly reclassified as the progenitor of the cell type that the experts have determined. Figure 9 shows five example images that the expert has reclassified with the proposed labels from the HMT.
5 Discussion
We have introduced TIMELY, a human-centered approach for increasing labeling consistency in medical imaging for cell type classification. TIMELY takes as input cell microscopy images along with noisy expert labels, identifies inconsistent labels, and suggests alternative, consistent labels based on a two-step procedure. In the first step, TIMELY establishes an intrinsic order between cells using a pseudotime inference algorithm. In the second step, TIMELY builds a Markov model upon the ordered cells and their noisy labels. Depending on the complexity of the dataset’s topology, an HMM or an HMT is used.
We combine pseudotime estimations with interpretable HMTs to establish a system that assists the annotating hematologist in generating more consistent cell classifications. By sorting the cells according to the pseudotime, we enable the hematologist to consider each cell in a neighborhood of cells that have a similar morphology, thereby assisting them in making consistent decisions (Figure 7). In addition, we transparently and explicitly encode domain knowledge in the form of differentiation hierarchies (Figure 4), start probabilities (9), and an expert-driven emission matrix (10), reflecting prior experience on the likelihood of labeling errors. Taken together, this allows the hematologist to develop an intuitive understanding of why specific cells are suggested as being inconsistently labeled and facilitates adoption in practice.
Manually labeling cells is also a time-consuming process, and our method can be applied to reduce the time experts spend on this task. Once the parameters of an HMT are optimized, new images from the same developmental tree can be mapped onto the already calculated tree, and consistent labels can be read off the tree directly by making use of the already computed transition borders.
Some modern digital microscopes have a functionality that automatically suggests labels for cell images. An additional use case of TIMELY is the application to such automatically generated labels since they are often noisy, and in addition, the classification algorithm does not include all possible cell types. These labels would then serve as the observed information of the HMT, and only the inconsistent labels will be given to the expert for reclassification. Such a system would be a further step towards an automated machine learning method that supports humans in a meaningful way.
6 Conclusion
TIMELY is a probabilistic approach for improving the cell labelings of experts that combines pseudotime inference algorithms and hidden Markov trees. Using pseudotime for ordering cells intrinsically leads to labels that are consistent with the morphology of the cells. We incorporate necessary background information about the data into an inhomogeneous hidden Markov tree, which makes use of the pseudotemporal ordering. Our model not only finds noisy labels like other filtering methods, but also suggests alternative, consistent labels. It is able to identify and correct noisy labels with higher accuracy and precision than state-of-the-art methods for identifying noisy labels. Application to two real-world datasets and the subsequent reclassification by an expert demonstrate that the labelings have indeed been improved by our algorithm.