Malaria is a mosquito-borne disease caused by Plasmodium species (P. falciparum, P. vivax, P. ovale and P. malariae in humans) infecting more than 200 million and killing nearly half a million people annually. Manual microscopy examination of Giemsa-stained blood films is a widespread malaria diagnosis method. Key use-cases include diagnosis; species identification (ID) to guide treatment; and quantitation of parasites for drug resistance studies, to track how fast a drug clears parasites from the blood. However, a lack of training, high inter-sample variability in preparation and presentation, and difficult field conditions can result in poor accuracy [4, 3]. Also, lack of trained personnel limits the number of drug resistance sentinel sites.
Malaria microscopy is a difficult task for automated image-processing and machine learning (ML) systems for two reasons: Field-prepared blood films vary widely in quality and presentation; and parasites are small (with feature size close to optical limits of resolution), rare, highly variable, and easily confused with non-parasite objects (artifacts). But it is also a high-value target, due to the potential benefit for so many people, and also because automated systems have some concrete advantages: They can be widely deployed, solving the expert-training bottleneck; they can examine more blood volume per patient, reducing variability in quantitation caused by Poisson statistics; and their results are reproducible.
Thin and thick blood films have distinct uses. Thick films are typically used for diagnosis and for quantitation of low-parasitemia infections, because the larger blood volume gives a lower limit of detection (LoD) and more stable parasite counts. Thin films are used for species ID, and for quantitation of high-parasitemia infections.
Field-prepared Giemsa-stained thin films vary greatly in presentation, e.g. in red blood cell (RBC) color and morphology, parasite appearance, type and number of artifacts, and amount of RBC clumping. Fig. 1 shows typical thin film fields of view (FoVs).
Malaria parasites display several developmental stages in blood [5, 7], starting as ring-stage trophozoites (hereafter “rings”), seen in Fig. 2a, then maturing into trophozoites, schizonts, and gametocytes (hereafter “late stages”). Rings predominate in P. falciparum infections, and are the targets for quantitation in drug resistance studies. Late stages are used for species ID. Artifacts (“distractors”) are very common in field-prepared slides, and often heavily outnumber parasites. Examples are seen in Fig. 2b and in Supplementary Information (S.I.).
A recent review highlights several key problems with published automated malaria detection studies: (i) datasets are too small; (ii) reported metrics are often incomplete and not comparable between studies; (iii) reported metrics are often object-based (not patient-based) and are thus not relevant to the clinical task, which is entirely patient-focused; and (iv) train and validation sets often contain objects from the same patient. As with other applications of ML to health care tasks, an understanding of the domain-specific constraints and use-cases is vital but is often missing.
Studies including [13, 12, 14, 17, 15, 18, 16, 11, 6] discuss issues central to automated processing of thin films. These issues include (i) the importance of sample-level variability [11, 6]; (ii) the importance of false-positive (FP) rates per unit blood [12, 6] as a determinant of both LoD and quantitation accuracy; (iii) the centrality of patient-level (not object-level) metrics [13, 11, 14, 6]; and (iv) irregularity and clumping of RBCs in thin films and the high computational cost of separating clumped RBCs [13, 16, 15].
This paper describes a complete, fully-automated thin film malaria assessment system, intended to complement the thick film system described in prior work. Its goals are quantitation and species ID (thick film handles diagnosis, since the greater blood volume yields a lower LoD). The system includes two branches, one for rings and one for late stages, and modules for: FoV quality control (QC); RBC counting; object detection; distractor filters; CNN classifiers; species ID; branch arbitration; and patient-level disposition. Fig. 3 gives a schematic.
The main contributions of this work include: (1) Use of field-prepared (rather than in-house) slides, sourced from clinics across four continents; (2) A complete, fully-automated system for patient-level results on thin blood films; (3) Three classifiers arranged in series, as a means to handle the high numbers of distractors vs relatively rare parasites; (4) Use of convolutional neural nets (CNNs), supplied with sufficient data for the task complexity; (5) Quantitation and species ID results close to sufficiently accurate, on field-prepared slides, to meet drug resistance study and case management use-cases; (6) Analysis of machines’ advantage over microscopists due to reduced Poisson variability; and (7) Description of patient-level metrics that realistically target malaria microscopy use-cases, for use in algorithm evaluation.
A field-deployable malaria assessment system (especially one with data-hungry CNNs) requires a large and diverse set of training images, because high variability in field-prepared slides is a central challenge, and because algorithms need to generalize to as-yet unexamined slides from new clinics and regions. In conjunction with our field partners, we assembled 798 image sets of field-prepared blood films from 765 patients, totaling 323k thin film FoV images. Regions included South America, Africa, Asia, and London (returning travellers), encompassing over 12 countries. The slide collection was rich in negative, Pf, and Pv samples, but suffered from a relative lack of Po and Pm samples since these are rarer species. We annotated over 92k objects of the four major malaria species. Slide metadata and object annotations were stored in a SQL database, which simplified maintenance of patient-level structure during algorithm development.
II-A Image capture
Field-prepared slides were mounted with #0 coverslips and digitized with Motic EasyScan-Go automated scanning microscopes. The microscope has a 40x, NA=0.75 objective, infinity-corrected optical train, and 10W LED Kohler illumination. A CMOS camera captures images (2048×1536 pixels) at an approximate pixel pitch of 8.3 pixels/µm. A single high-quality FoV can contain 400 well-separated RBCs. Each FoV consists of a stack of 7 z-slices, to ensure that every object has at least one in-focus version (a stack is needed to handle non-level slides and artifacts that derail the auto-focus). All slides were prepared at field locations, and were scanned at either field locations or a central lab.
II-B Object types
Because parasites have visually distinct stages, we used three categories (ring, late stage, and transitional) for algorithm development. When stained with Giemsa, rings have one or two distinct purple nuclei and (ideally) a characteristic blue cytoplasm. They are typically inside RBCs but sometimes appear to be just outside (applique forms). We lumped all mature stages together as “late stages.” Parasites intermediate on the continuum between ring and late stage were labelled “transitional.” Because objects in blood films are not always identifiable, we added an important fourth annotation category, “doubtful”. Excluding transitional and doubtful objects from the training set yielded classifiers with improved accuracy.
II-C Train/Validation sets
The malaria use-case requires that objects from a given blood sample be used in either training or validation, never in both. Dividing objects from a single sample can substantially improve object-level classifier results, but it is highly unrealistic. At every point in our framework, sample-level integrity was preserved in training-validation splits.
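The sample-level constraint can be illustrated with a minimal sketch. It assumes each annotated object carries a sample identifier; the function name, the `(sample_id, payload)` layout, and the 20% validation fraction are illustrative choices, not details of the authors' database or pipeline.

```python
import random
from collections import defaultdict

def split_by_sample(objects, val_fraction=0.2, seed=0):
    """Split objects into train/validation so no sample appears in both.

    `objects` is a list of (sample_id, payload) pairs. The layout and the
    default validation fraction are hypothetical, chosen for illustration.
    """
    by_sample = defaultdict(list)
    for sample_id, payload in objects:
        by_sample[sample_id].append(payload)

    sample_ids = sorted(by_sample)            # sorted for reproducibility
    random.Random(seed).shuffle(sample_ids)   # shuffle samples, not objects
    n_val = max(1, round(val_fraction * len(sample_ids)))
    val_ids = set(sample_ids[:n_val])

    train = [(s, p) for s in sample_ids if s not in val_ids for p in by_sample[s]]
    val = [(s, p) for s in sample_ids if s in val_ids for p in by_sample[s]]
    return train, val
```

Shuffling the sample identifiers rather than the objects is what keeps every object of a given blood sample on one side of the split.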
For training, ground-truth was defined as follows: Positives came from parasites (rings or late stages) annotated in our malaria database. Parasite annotations were examined by at least two trained humans and were further vetted at least once. Distractors were defined as any detected object that had no annotation. This definition of distractor required complete parasite annotations of training images, to prevent unmarked parasites being included in the distractor pool. A way to relax the annotation task (on validation) is described in S.I.
Transitional stages and doubtful objects (i.e. objects that may or may not have been parasites) were excluded from both positive and negative pools. This allowed the classifiers to focus on a single parasite type (ring or late stage). Transitional parasites, though excluded from training, were still accurately identified as parasites by one or other of the branches.
III Algorithm framework
III-A Constraints on architecture
Algorithm architecture was constrained by the need for low computational complexity. On average, each blood slide generates 200 FoVs, each a stack of 7 images. Each FoV may yield multiple objects to process. Processing time was limited to 15 minutes on a standard laptop CPU (no GPU). Since parasites are rare and greatly outnumbered by distractors, the object detector and classifiers must have high sensitivity (percentage of parasites correctly classified) and also very high specificity, i.e. a low False Positive (FP) rate.
Generic deep learning based detection methods [22, 23, 21, 20] have achieved strong results on natural images. However, two-stage detectors, such as SPPnet and Faster R-CNN, are too slow for this application, while faster single-stage detectors, such as SSD and YOLO, have lower performance. Both single- and two-stage detectors have unacceptably poor accuracy on small objects. To combine fast processing and sufficient object-level sensitivity and specificity, we used three classifiers in series: an initial detector and a distractor filter using low-cost manual features, and finally a CNN.
III-B System overview
Given an FoV, the system first runs quality control, counts RBCs, and runs initial detection of objects. It then splits into two branches, one for rings (for quantitation), and one for late stages (for species ID). Each branch has two parts: First, a high-sensitivity distractor filter culls the bulk of the more obvious distractors; then a CNN classifies the remaining objects as parasites or distractors. The detected late stage parasites also pass through a species ID module. Finally, the various outputs are combined to deliver patient-level quantitation and species ID predictions.
The following subsections describe the various modules.
III-C Quality control module
Quality of slide preparation and image acquisition varies widely in the field. The QC module identifies and culls FoVs that are blurry or empty. A FoV may be blurry when the thickness of the cover slip on the slide is not compatible with the scanning microscope, the microscope is unable to focus on the slide, or when it focuses on objects on a higher plane than the blood film. FoVs that are empty or blurry tend to have pixel values within narrow ranges.
Blur detection remains challenging because image content can affect image sharpness. Further, our application requires a no-reference, fast method. We apply a two-stage QC process. First, we calculate the standard deviation (std dev) of grayscale pixel values and the dynamic range of the gradient of the grayscale FoV. If either of these is lower than its pre-set threshold, the FoV is rejected. Second, we calculate focus metrics on the FoV and on a corresponding artificially-blurred version. Using a previously published method, we selected a small, computationally-efficient subset of these focus metrics as features for blur detection, then trained a gradient boosted trees (GBT) classifier to identify blurry FoVs.
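The first QC stage can be sketched as follows. This is a minimal pure-Python illustration, assuming the FoV is given as a 2D list of grayscale values and that "dynamic range of the gradient" means the spread (max minus min) of a finite-difference gradient magnitude; both threshold values are hypothetical placeholders, not the values used in this work.

```python
from statistics import pstdev

def passes_first_stage_qc(gray, min_std=5.0, min_grad_range=10.0):
    """Return False for FoVs that look empty or blurry.

    `gray` is a 2D list of grayscale values. Both thresholds are
    hypothetical and would need tuning on real FoVs.
    """
    pixels = [v for row in gray for v in row]
    if pstdev(pixels) < min_std:      # narrow pixel range: empty or blurry
        return False

    # Finite-difference gradient magnitude (L1 norm) at each interior pixel.
    grads = []
    for y in range(len(gray) - 1):
        for x in range(len(gray[0]) - 1):
            gx = gray[y][x + 1] - gray[y][x]
            gy = gray[y + 1][x] - gray[y][x]
            grads.append(abs(gx) + abs(gy))

    # A sharp FoV has both flat regions and strong edges, so a wide spread.
    return (max(grads) - min(grads)) >= min_grad_range
```

Both checks are cheap, single-pass statistics, which fits the no-reference, low-compute requirement stated above.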
III-D RBC counter module
Accurate RBC counts are needed for quantitation of Pf rings. (Species ID depends on distinguishing different parasite morphologies, not RBC count.) Only high parasitemia samples are quantitated on thin film because at low parasitemias, the lower blood volume causes high Poisson variance in ring counts (low densities are quantitated on thick film). We can thus assume that parasitemia is high: for microscopists, over 5k p/µL or 16k p/µL depending on protocol; for automated systems, over 80k p/µL.
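We estimate ring branch parasitemia per µL, $\hat{P}$, as follows (late stage is similar):
$$\hat{P} = \frac{N_\tau - \hat{f}\,N_{\mathrm{RBC}}/5\mathrm{e}6}{\hat{s}} \cdot \frac{5\mathrm{e}6}{N_{\mathrm{RBC}}} \tag{1}$$
where
$N_\tau$ = number of alleged rings found by the algorithm, i.e. the number of ring branch objects with CNN scores above some threshold $\tau$,
$\hat{f}$ = expected number of FPs/µL,
$\hat{s}$ = expected sensitivity of the ring classifier,
5e6 = number of RBCs/µL, and
$N_{\mathrm{RBC}}$ = number of RBCs counted.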
Hyperparameters such as the expected sensitivity $\hat{s}$ and the expected FP rate $\hat{f}$ are determined on a validation set ($\hat{f}$ uses negative samples only). $\hat{s}$ could be the mean (or median) of sample sensitivities over all positive samples, and $\hat{f}$ could be the mean (or median) of sample FP rate over all negative samples.
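The estimate can be sketched in a few lines. The form used here is an assumption inferred from the definitions above: subtract the FPs expected in the examined blood volume from the alleged count, divide by the expected sensitivity, then rescale to one microliter using 5e6 RBCs/µL. Function and argument names are illustrative, and the clipping at zero is an added safeguard, not something stated in the text.

```python
RBC_PER_UL = 5e6  # nominal number of RBCs per microliter, as in the text

def estimate_parasitemia(n_alleged, n_rbc, sens_hat, fp_per_ul_hat):
    """Estimate parasites per microliter from one branch's counts.

    Assumed form: (alleged count - expected FPs in the examined volume)
    / expected sensitivity, scaled by RBC_PER_UL / n_rbc.
    """
    if n_rbc <= 0 or not 0 < sens_hat <= 1:
        raise ValueError("need n_rbc > 0 and 0 < sens_hat <= 1")
    volume_ul = n_rbc / RBC_PER_UL               # blood volume examined
    expected_fps = fp_per_ul_hat * volume_ul     # FPs expected in that volume
    # Clip at zero so a clean sample cannot yield a negative estimate
    # (an illustrative safeguard).
    corrected = max(0.0, n_alleged - expected_fps) / sens_hat
    return corrected / volume_ul
```

For example, 300 alleged rings among 20k counted RBCs, with an expected sensitivity of 0.9 and an expected 2500 FPs/µL, give an estimate of roughly 80.6k parasites/µL.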
Error in the RBC count directly impacts quantitation accuracy through the denominator of the second term of Eqn.1, and must be minimized.
The biggest challenge in RBC counting is to accurately segment clumps of overlapped RBCs, which are common in field-prepared thin films. However, automated scanning microscopes can image more FoVs than needed, which enables the following strategy: Ignore clumped RBCs and use only single (and double) RBCs, which are easily counted. We count the single RBCs as we scan, directing the microscope to continue collecting FoVs until 20k single RBCs have been tallied (a human microscopist examines 5k RBCs). This is generally possible even on field slides, and suffices to mitigate Poisson variance error given high parasitemia (at 80k p/µL, 20k RBCs yield roughly 320 parasites).
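The arithmetic behind the 20k RBC target can be checked with a short sketch. It assumes parasite counts follow a Poisson distribution, so the relative standard deviation of a count with mean n is 1/sqrt(n); the function names are illustrative.

```python
import math

RBC_PER_UL = 5e6  # nominal number of RBCs per microliter

def expected_parasites(parasitemia_per_ul, n_rbc):
    """Expected parasite count among n_rbc examined RBCs."""
    return parasitemia_per_ul * n_rbc / RBC_PER_UL

def poisson_relative_sd(expected_count):
    """Relative std dev of a Poisson count: sqrt(n) / n = 1 / sqrt(n)."""
    return 1.0 / math.sqrt(expected_count)

# Compare a microscopist's 5k RBCs with the 20k RBC target at 80k p/uL.
for n_rbc in (5_000, 20_000):
    n = expected_parasites(80_000, n_rbc)
    print(n_rbc, round(n), f"{poisson_relative_sd(n):.1%}")
```

At 80k p/µL, 5k RBCs contain about 80 parasites (relative Poisson spread near 11%), while 20k RBCs contain about 320 (near 6%), so quadrupling the RBC count halves the irreducible error.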
In a given FoV, we detect RBCs with simple binary gray-scale clustering, and divide the detected RBCs into singles and clumps based on blob size. We count the single RBCs, and create a ‘quantitation mask’ containing only these RBCs plus a margin to capture applique rings. This RBC count is highly accurate (i.e. the RBC count in Eqn. 1 has very low error). Only rings within the quantitation mask are included in the ring count in Eqn. 1.
We also detect and classify all objects in the FoV. For species ID, we use all the suspected late stages, whether in single or clumped RBCs, since only their morphology matters.
III-E Object detector module
The object detection module generates a list of candidate objects in each FoV. A domain-specific detail enables a simple yet effective object-of-interest detector. Giemsa stain colors DNA (e.g. parasite nuclei, WBCs) purple and RNA (e.g. in cytoplasm) blue, modulo variations due to pH. RBCs and background stain to pink, green, or gray.
We project the color image to gray scale via
where R, G and B are the red, green and blue channels of the color image. This grayscale image highlights purple pixels and suppresses green pixels in the image, and detects most parasite candidates (as well as distractors that stain purple). The grayscale image is thresholded pixel-wise using dynamic local thresholding. Candidate objects are chosen by finding connected-components in the thresholded image.
This method is applied to all z-slices in a given FoV. Distance-based clustering groups together instances of the same object detected in multiple z-slices, to account for microscope stage jitter and variation in x, y coordinates due to focus. The most in-focus object, i.e. the object with the highest Brenner focus score, is retained for further processing.
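A minimal sketch of the z-stack step is given below. It assumes a simple greedy distance-based grouping and the usual Brenner measure (sum of squared differences between pixels two columns apart); the detection dictionary keys and the `max_dist` tolerance are hypothetical, and the clustering actually used may differ.

```python
def brenner_score(patch):
    """Brenner focus measure: sum of squared differences two pixels apart."""
    return sum(
        (row[x + 2] - row[x]) ** 2
        for row in patch
        for x in range(len(row) - 2)
    )

def best_focus_per_object(detections, max_dist=5.0):
    """Group detections across z-slices and keep the sharpest of each group.

    `detections` is a list of dicts with keys "x", "y", "z", "patch"
    (illustrative names). `max_dist`, in pixels, is a hypothetical
    tolerance for stage jitter between slices.
    """
    clusters = []  # each entry holds the best detection seen so far
    for det in detections:
        score = brenner_score(det["patch"])
        for c in clusters:
            dx, dy = det["x"] - c["x"], det["y"] - c["y"]
            if dx * dx + dy * dy <= max_dist ** 2:
                if score > c["score"]:
                    c.update(det, score=score)  # sharper instance replaces it
                break
        else:
            clusters.append(dict(det, score=score))
    return clusters
```

Each returned entry is the most in-focus instance of one physical object, which is the only version passed on to the distractor filters.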
III-F Distractor filter module(s)
Malaria parasites are rare in thin blood film images (1 per 100 RBCs in high parasitemia cases, and more commonly fewer than 1 per 1000 RBCs), so distractors typically outnumber parasites. The detector needs high sensitivity, high specificity, and computational efficiency.
Many detected distractors can be efficiently culled via manual features. We trained GBT classifiers, one for each branch, using region properties of the detected objects as features (see S.I. for details). The distractor filters achieved 0.96 (ring) and 0.94 (late stage) areas under the ROC curve on a validation set, and culled most potential distractors.
III-G CNN classifier module(s)
In the second stage of each branch, a CNN classifier distinguishes parasites from the remaining, most difficult distractors. CNNs are state-of-the-art technology in many computer vision and biomedical image processing applications. Published CNN architectures are most often designed for large-scale datasets (e.g. ImageNet with 1000 output classes), so they overfit our dataset. We therefore tailored CNN architectures for our domain-specific case, with two output classes (parasite vs distractor). We explored various architectures, including Inception-style networks, fully convolutional networks and VGG-style networks. We developed and tested CNNs in Caffe using cross entropy loss with stochastic gradient descent.
The ring branch CNN had 3 convolutional layers, followed by two Inception modules and one fully connected layer. This architecture enabled identification of features at multiple scales, e.g. small features near the nucleus and larger features in the cytoplasm of the parasite. The Inception modules had convolutional layers with kernel sizes 1×1, 3×3, and 5×5 and a dimensionality reduction kernel of size 1×1. The number of kernels within each branch of the Inception module was chosen such that the number of parameters to be learned in each Inception module was equal. Thus not all multi-scale features were weighted the same. Deeper layers had more filters per convolutional layer. Thumbnails were 64×64 pixels.
The late stage branch CNN was a fully convolutional neural network with 7 convolutional layers and increasing number of kernels as network depth increased. Spatial reduction was achieved by strided convolutions which could learn the spatial reduction operation. Thumbnails were 144×144 pixels.
The thumbnails had 4 channels, namely, red, blue, green and an inverse gray channel. The 4th channel gave stronger test accuracy. Thumbnails were augmented in three ways: Flipping and rotating (90 degree increments); random horizontal or vertical spatial translations of pixels; and random gamma transformation of each color channel.
Weights were initialized by the Xavier method. Other parameters included: Weight decay 5e-4, momentum 0.9, batch size 128, “poly” learning policy with learning rate 1e-3, dropout 0.3 (late branch) and 0.5 (ring branch). CNN architectures are shown in S.I.
Distractors typically outnumber total parasites, because distractors are derived from all (not just positive) samples, parasites are relatively rare objects, and field slides can be distractor-rich. The number of distractors selected from each training sample was capped, both to keep training imbalances within 2- or 3-to-1, and to ensure that a few very dirty samples did not dominate training. To get a training set that covered the whole distribution of distractors, while emphasizing difficult types, we randomly selected 80% distractors from those that passed the distractor filter (i.e. from the relatively difficult distractors), and the remaining 20% from the remainder.
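The per-sample distractor selection can be sketched as below. It assumes the cap is a fixed number per training sample and that the 80/20 split is applied to that cap; the function name, the `cap` argument, and the handling of samples with too few distractors are illustrative assumptions.

```python
import random

def sample_distractors(hard, easy, cap, hard_fraction=0.8, seed=0):
    """Pick at most `cap` distractors from one training sample.

    `hard` holds distractors that passed the distractor filter and `easy`
    holds those it culled. If a pool is too small, fewer objects are
    returned rather than topping up from the other pool (an assumption).
    """
    rng = random.Random(seed)
    hard_quota = round(cap * hard_fraction)
    n_hard = min(len(hard), hard_quota)
    n_easy = min(len(easy), cap - hard_quota)
    return rng.sample(hard, n_hard) + rng.sample(easy, n_easy)
```

Capping per sample is what prevents a few very dirty slides from dominating the distractor pool, while the fixed quotas keep the emphasis on difficult distractors.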
To avoid the CNN training set being dominated by a few high parasitemia samples, the number of parasites each sample could contribute to training was capped. When assessing network training, we watched for sample-level effects. For example, if one high parasitemia validation sample had faintly-stained parasites that largely went undetected, it would disproportionally affect the object-level statistics. However, it would only represent one failure mode in the CNN, viz failure to detect faint parasites.
III-H Species ID module
The four malaria species have very similar ring forms, while the mature (late) stages exhibit distinctive features. Thus the species ID module tries to identify species of objects detected and classified as parasites by the late stage branch. Geographic priors are not used (though these can be very informative). Due to the many species and variety of late stage forms, the classifier has 13 categories: Four ring categories, i.e. one for each species (Pf, Pv, Po, Pm); similarly four transitional and four late stage categories; plus distractors.
We trained a 13-class GBT classifier that used manual features (details in S.I.) on the segmented late stage objects in each thumbnail. Each training sample was allowed to contribute a maximum of 100 objects per output class, to ensure wide sample-level variety. To aid segmentation of objects, we enhanced contrast as follows: We converted the thumbnail to the luminance and chrominance space; performed adaptive histogram equalization so that the pixel values followed a Rayleigh distribution; converted back to RGB color space; and morphologically eroded using a ‘ball’ structuring element. Foreground and background were found via k-means using the luminance and chrominance of this enhanced image, and the greyscale image (Eqn. 2), as features.
We also built a CNN for species ID, using transfer learning due to the smaller numbers of late stage parasites for training. Calendar constraints prevented us from testing it.
The parasite stage classification probabilities are used by the patient-level disposition module to predict malaria species.
III-I Object Arbitration module
Because the ring and late stage branches of the decision tree architecture each apply their own distractor filters to the set of detected objects, there will be three kinds of objects: (1) detected only by the ring branch; (2) detected only by the late branch; and (3) detected by both branches. The arbitration module decides, for objects detected by both branches, whether they should be judged as possible rings or as possible late stages. Modulo complications (not discussed) due to the species ID module, an object’s proper category is decided simply by which branch gave it a higher CNN score.
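The basic arbitration rule can be written as a short sketch. It ignores the species ID complications mentioned above, uses `None` to mark a branch whose distractor filter culled the object, and breaks exact ties in favor of the ring branch; those conventions are assumptions made for illustration.

```python
def arbitrate(ring_score, late_score):
    """Assign an object to a branch from its per-branch CNN scores.

    A score of None means that branch's distractor filter culled the
    object. Ties go to the ring branch (an illustrative choice).
    """
    if ring_score is None and late_score is None:
        return None                 # culled by both branches
    if late_score is None:
        return "ring"               # seen only by the ring branch
    if ring_score is None:
        return "late"               # seen only by the late stage branch
    return "ring" if ring_score >= late_score else "late"
```

Because each object is assigned to exactly one branch, it is counted only once when ring and late stage parasitemias are later summed.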
III-J Patient-level disposition module
Estimated parasitemia is the sum of ring and late stage parasitemias, each made according to Eqn. 1. Drug resistance studies require only quantitation of Pf rings.
III-J2 Species ID
Species ID primarily uses late stage forms. But it also considers ring counts (sometimes from thick film) because of two unique Pf traits: mature Pf parasites sequester in the microvasculature, so Pf typically presents only ring parasites in blood samples; and Pf can reach much higher ring densities than other species.
Thus, predicting the species of malaria parasites infecting a patient is based on three factors: the species probabilities of the late stage parasites; the ring density per µL; and the ratio of ring and late stage parasite counts.
For the late stage species prediction, we sum the species probabilities of all late stage parasites (i.e. objects with late branch scores above a threshold $\tau$). This sum up-weights objects with strong likelihood of one species and down-weights objects with uncertain species. The highest sum gives the predicted species.
If either the density of ring forms or the ratio of rings to late stages is above an empirically determined threshold, the species is reported as Pf. Otherwise, the late stage species prediction is reported. An exception occurs when there is a high density of parasites, indicating the presence of Pf, together with a high number of late stages (atypical for Pf). Then a mixed infection is reported: Pf and the late stage species prediction.
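The patient-level species rule can be sketched as follows. All three threshold values are hypothetical placeholders (the text says only that they are determined empirically), the names are illustrative, and the choice to report a single species when the late stage call is itself Pf is an added assumption.

```python
def late_stage_species(prob_rows):
    """Sum per-object species probabilities; the largest sum wins."""
    totals = {}
    for probs in prob_rows:                  # one dict per late stage object
        for species, p in probs.items():
            totals[species] = totals.get(species, 0.0) + p
    return max(totals, key=totals.get) if totals else None

def patient_species(ring_density, n_rings, n_late, prob_rows,
                    density_thresh=100_000, ratio_thresh=20, many_late=10):
    """Patient-level species call; every threshold here is hypothetical."""
    late_call = late_stage_species(prob_rows)
    ratio = n_rings / max(n_late, 1)         # avoid division by zero
    high_density = ring_density > density_thresh

    # Exception: high density (suggesting Pf) plus many late stages.
    if high_density and n_late >= many_late and late_call not in (None, "Pf"):
        return ["Pf", late_call]             # mixed infection
    if high_density or ratio > ratio_thresh:
        return ["Pf"]
    return [late_call] if late_call else []
```

Summing probabilities rather than counting hard votes means a few confidently classified late stages can outweigh many ambiguous ones.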
The algorithm delivers a patient report with parasitemia, species, and thumbnails of top-scoring objects for use by technicians. Typical reports are shown in S.I.
IV-A Important metrics
Patient-level results are by far the most relevant to the malaria use-case, for diagnosis, quantitation, and species ID. This section describes key Figures of Merit (FoMs) which guided our development and assessment of algorithms.
IV-A1 Quantitation error
Three forms of error affect quantitation: (i) RBC counts; (ii) parasite counts; and (iii) irreducible Poisson error.
RBC counting errors contribute to quantitation error in a straightforward way via the second term of Eqn. 1, $5\mathrm{e}6/N_{\mathrm{RBC}}$, where $N_{\mathrm{RBC}}$ is the number of RBCs counted.
Parasite counting errors (first term of Eqn. 1) stem from sample-level variations in sensitivity and in FP rate (derivation is given below). Thus, two FoMs for quantitation error are
$$\frac{\sigma_s}{\mu_s} \quad \text{and} \quad \frac{\sigma_f}{\mu_s P}, \tag{3}$$
where
$\sigma_s$ = std dev of sample sensitivities (over all samples),
$\mu_s$ = mean of sample sensitivities (over all samples),
$\sigma_f$ = std dev of sample FP rates per µL (over all samples),
$P$ = parasitemia per µL.
At high parasitemias (i.e. the thin film use-case) the first term dominates because the second term shrinks as $1/P$.
Irreducible Poisson variation affects the actual number of parasites contained in the examined blood, and will result in different counts if a perfect counter examines two distinct sections of a film. The magnitude of this variation depends on parasitemia and number of RBCs examined. It can be mitigated by high parasitemias and by examining high numbers of RBCs. Automated systems have a powerful advantage over microscopists in this regard. See S.I. for discussion.
IV-A2 Derivation of Eqn. 3
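Eqn. 1 gives the estimated parasitemia $\hat{P}_i$ of a patient $i$. The first term contains error from multiple sources. For a patient $i$, let $n_{TP}$ = number of TPs found, $n_{FP}$ = number of FPs found, $s_i$ = sensitivity on this sample, $f_i$ = FP rate per µL on this sample, $P_i$ = true parasitemia per µL, $N_\tau = n_{TP} + n_{FP}$ = number of alleged parasites, and $V = N_{\mathrm{RBC}}/5\mathrm{e}6$ = blood volume examined in µL (with $N_{\mathrm{RBC}}$ the number of RBCs counted), with $\hat{s}$ and $\hat{f}$ the expected sensitivity and expected FP rate per µL as in Eqn. 1. Then the first term
$$\frac{N_\tau - \hat{f}V}{\hat{s}} = \frac{n_{TP} + n_{FP} - \hat{f}V}{\hat{s}}.$$
Let $\Delta_{TP} = n_{TP} - \hat{s}P_iV = (s_i - \hat{s})P_iV$, i.e. the discrepancy between our actual TP count and the count we would get if $s_i = \hat{s}$.
Let $\Delta_{FP} = n_{FP} - \hat{f}V = (f_i - \hat{f})V$, i.e. the discrepancy between our actual FP count and the count we would get if $f_i = \hat{f}$. Then
$$\hat{P}_i = \frac{\hat{s}P_iV + \Delta_{TP} + \Delta_{FP}}{\hat{s}} \cdot \frac{1}{V} = P_i + \frac{(s_i - \hat{s})P_i}{\hat{s}} + \frac{f_i - \hat{f}}{\hat{s}}. \tag{6}$$
The leading term of Eqn. 6 is the true parasitemia $P_i$. So the relative error on patient $i$ is
$$\frac{\hat{P}_i - P_i}{P_i} = \frac{s_i - \hat{s}}{\hat{s}} + \frac{f_i - \hat{f}}{\hat{s}P_i},$$
where $s_i - \hat{s}$ is the difference between actual sensitivity and expected sensitivity.
This implies that for the population of samples, parasite counting error can be characterized by
$$\frac{\sigma_s}{\mu_s} + \frac{\sigma_f}{\mu_s P}, \tag{9}$$
where $\sigma_s$ = std dev of sample sensitivities (over all samples),
$\mu_s$ = mean of sample sensitivities (standing in for $\hat{s}$),
$\sigma_f$ = std dev of sample FP rates (over all samples).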
The two terms of Eqn. 9 are the FoMs given in Eqn.3. The first term is the error due to variance (over all samples) of sample sensitivities, scaled by mean sample sensitivity. This error can be reduced by increasing overall sensitivity and/or by reducing variation in sensitivity by sample.
The second term is the error due to variation in FP rates. It decreases as $1/P$, so it is a dominant effect at low parasitemias (it is a noise floor in diagnosis and LoD calculations) but a minor effect at high parasitemias.
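The behavior of the two error terms can be explored numerically with a small sketch. It assumes the terms take the forms sigma_s / mu_s and sigma_f / (mu_s * P), which follow from the definitions above but are a reconstruction; the function name and the example numbers are illustrative, not results from this work.

```python
from statistics import mean, pstdev

def quantitation_foms(sample_sens, sample_fp_per_ul, parasitemia):
    """Return (sensitivity term, FP term) of the counting-error estimate.

    Assumed forms: sigma_s / mu_s and sigma_f / (mu_s * P), with
    sensitivities taken over positive samples and FP rates (per uL)
    over negative samples.
    """
    mu_s = mean(sample_sens)
    sens_term = pstdev(sample_sens) / mu_s
    fp_term = pstdev(sample_fp_per_ul) / (mu_s * parasitemia)
    return sens_term, fp_term
```

With sample sensitivities of 0.8, 0.9 and 1.0 and sample FP rates of 1k, 2k and 3k per µL, the sensitivity term is about 0.09 at any parasitemia, while the FP term is about 0.01 at 80k p/µL but about 0.9 at 1k p/µL, which illustrates why FP variation matters mainly at low parasitemias.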
We can trade off sensitivity and specificity at the object level according to our goal, by varying threshold operating points. For diagnosis, very low $\sigma_f$ in Eqn. 9 is needed to achieve low LoD. On the other hand, to quantitate high parasitemia samples one must minimize $\sigma_s/\mu_s$ in Eqn. 9, while $\sigma_f$ can be larger. Operating points with low FP rates typically have lower sensitivities (lower left of the ROC curve). Since the two terms of Eqn. 9 move in opposite directions, the operating point depends on the goal.
IV-B Comparison to other methods
It is customary to provide a comparison of results for a proposed method vs other methods in the literature. This is problematic here because most studies do not give patient-level results, due to data limitations and/or chosen methodologies. For example, if a method used a train/val split that allows a sample’s objects into both train and validation, or if it did not report patient-level results, then its results are not comparable to ours.
Due to this lack of common metrics, we do not provide a comparison table. This is not to cast shade on prior work: The lack of comparability is in large part due to our good fortune in having a large, varied dataset. We can offer some comparisons to certain prior results from the literature, with a caveat that the prior studies listed here used clean in-house slides and counts, while our results used field-prepared slides, which vary more widely and often contain more distractor objects.
IV-B1 Median quantitation error
Linder et al. attain 21% median quantitation error on 17 slides (20k - 40k RBCs per sample). Le et al. report 20% median quantitation error (but with very small samples, 1800 RBCs per sample). Our method had 18% median quantitation error vs in-house counts on 24 holdout slides (20k - 80k RBCs per sample), and 31% median error vs field counts on 81 holdout slides (see Fig. 4).
IV-B2 FP rates
By Eqn. 3 the key FoM for FP rate is $\sigma_f$, the std dev of sample FP rates per µL. However, previous studies do not report this, and we can only calculate a (very) rough proxy, as follows (ring case): 5k (Linder), 12k (Tek), 15k (Anggraini), 25k (Ross), 70k (Gopakumar). Our method, set to a diagnosis operating point, has a corresponding value of 1.6k. This operating point gives 90% sample-level specificity on holdout sets (i.e. 90% of negative samples are correctly diagnosed as negatives), a requirement based on the “WHO 56” evaluation method.
IV-C Quantitation results
We report results for Pf rings since these are the most important quantitation target, as used in drug resistance studies, where error should ideally be under 25% [40, 41, 42]. Rings are a more difficult target than late stages, due to their small size and similarity to distractors.
IV-C1 Object-level results
Object-level results are relevant only as an interim step to patient-level results. Also, object-level specificity and area under ROC curve (AUC) depend on the raw number of distractors but do not reflect their difficulty, so these metrics can be boosted arbitrarily by surplus easy distractors. Considering the distractors that passed both object detection and the distractor filter, the CNNs had 0.99 AUC (both ring and late stage CNNs), while validation accuracy was 94.8% (ring) and 96.6% (late).
IV-C2 Patient-level results
In the ring branch, , and for P 60k/µL.
Based on Eqn. 9, we expect our ring quantitation error to usually be less than
Estimated quantitations, on a holdout set of 81 Pf samples from 10 clinics, are shown in Fig. 4. In-house parasitemia counts are preferable for use as ground truth since field counts are highly variable due to Poisson variability and the difficulty of manual RBC counting. Also, quantitations on the thick vs thin film can differ by 30% due to wash-off and different methods used on the two types of films. Here we compare to field counts to allow a larger holdout set. 38% of holdout samples had under 25% discrepancy vs field counts, and 50% of the holdout set had under 33% discrepancy.
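Patient-level summaries of this kind can be computed with a short sketch. It assumes discrepancy is the absolute difference between estimated and reference parasitemia, relative to the reference; the function name and the example numbers in the note below are illustrative and are not data from this study.

```python
from statistics import median

def discrepancy_summary(estimates, references, tol=0.25):
    """Median relative discrepancy and share of samples within `tol`.

    Each estimate is compared with its reference count (e.g. a field or
    in-house count for the same patient sample).
    """
    rel = [abs(e - r) / r for e, r in zip(estimates, references)]
    share_within = sum(d < tol for d in rel) / len(rel)
    return median(rel), share_within
```

For instance, estimates of 90, 120 and 200 against references of 100 each give a median discrepancy of 20%, with two of the three samples under the 25% tolerance.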
Algorithm undercounts are typically due to poor sensitivity, e.g. on a slide with faintly-stained parasites. Overcounts are typically due to high FP rates. Discrepancies may also be due to Poisson variability and errors in the field counts.
IV-D Species ID results
WHO’s 56-slide evaluation requires 90% accuracy for expert level. Our algorithm attains this accuracy on Pf and Pv, but not Po and Pm (perhaps due to less training data). Table I shows our algorithm’s species ID results on 42 holdout samples (10 , 20 , 9 , 3 ). Since the thick film
Malaria assessment using microscopy blood films is a difficult but high-value target for machine learning. The fully-automated thin film system presented here delivers accuracy that is close to sufficient for quantitation and species ID use-cases in the field. Crucially, it works with field-prepared slides.
We have found that to produce clinically useful algorithms, one must focus on the particular needs of the malaria use-case, including (i) patient-level deliverables, (ii) computational constraints, and (iii) the high variability of field slides.
Use-case deliverables require metrics focused on the patient-level, since standard ML metrics such as ROC curves are insufficient to assess patient-level accuracy.
Computational constraints sometimes require forgoing certain methods (e.g. R-CNN, Hough circle detection) and finding simpler, faster approaches.
High slide variability requires sufficient variety and quantity of training slides from many clinics to capture patient-level variation; training and validation sets organized at the patient (not object) level; pre-processing methods to normalize images, and classifiers robust to variations in slide presentation; and methods to minimize inter-sample variance in parasite sensitivity and FP rates, since these are the main sources of error in quantitation and diagnosis.
Conversely, one can leverage domain-specific details to simplify the task. Examples include using the particular effects of Giemsa staining, shortcuts to RBC counting, and the assumption of high parasitemias during thin film quantitation. Machines also have some intrinsic advantages over human microscopists (to offset their nontrivial drawbacks), including reduced Poisson variance and lower RBC counting error. In addition, algorithms do not fatigue, their results are replicable, and new units need no extra training.
Honoring the constraints imposed, and leveraging the advantages offered, by the use-case requires close consultation with experts working in the field.
In our experience, their domain expertise is of first importance when developing algorithms.
VI Supplementary Information
VI-A Malaria species and stages
The malaria parasite has various developmental stages during its human blood life-cycle. In the early stages after entering the bloodstream, the parasites are in the ring stage (immature trophozoite), examples of which are shown in Figs. 5 and 6. Later stages of development include trophozoite, schizont, and gametocyte (sexual reproductive form). We collectively refer to these as “late stage”. Example late stage parasites from the four species are shown in Fig. 5. In the ring stage, there are few differences between the various species of malaria, and they cannot be reliably distinguished by eye. They start to differentiate in the subsequent stages of development, which show marked distinctions between species. An automated system must also differentiate between actual parasites and distractor objects that resemble parasites. Some examples of distractors are shown in Fig. 7.
VI-B Thin blood film
Malaria can be diagnosed with two types of blood films: thick film and thin film. Thick film allows for a larger volume of blood to be examined and thus provides a lower limit of detection in terms of parasites per microliter (μL) of blood. A calculation using the Poisson distribution (see Section VI-F) indicates that, at a limit of detection (LoD) of 50 p/μL, one must examine about 0.1 μL of blood to be fairly certain of at least one parasite being present. This corresponds to 800 white blood cells (WBCs), which is easy to do on thick film; or 5e5 red blood cells (RBCs) on thin film, which is unworkable since thin films are (ideally) a monolayer and also have large unusable regions.
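The blood-volume figure above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: parasite counts in a volume follow a Poisson distribution, "fairly certain" is read as a 99% detection probability (an assumed threshold, not one given in the text), and the function names are illustrative.

```python
import math

WBC_PER_UL = 8000   # white blood cells per microliter, as assumed in the text
RBC_PER_UL = 5e6    # red blood cells per microliter, as assumed in the text


def prob_at_least_one(parasitemia_per_ul, volume_ul):
    """Poisson probability that the examined volume holds at least one parasite."""
    expected_parasites = parasitemia_per_ul * volume_ul
    return 1.0 - math.exp(-expected_parasites)


def volume_for_confidence(parasitemia_per_ul, confidence):
    """Smallest volume (uL) that reaches the requested detection probability."""
    return -math.log(1.0 - confidence) / parasitemia_per_ul


lod = 50  # parasites/uL
volume = volume_for_confidence(lod, 0.99)  # 0.99 is an assumed threshold
print(f"volume for 99% detection: {volume:.3f} uL")
print(f"P(>=1 parasite in 0.1 uL): {prob_at_least_one(lod, 0.1):.4f}")
print(f"WBCs in 0.1 uL: {round(0.1 * WBC_PER_UL)}")
print(f"RBCs in 0.1 uL: {round(0.1 * RBC_PER_UL)}")
```

Under these assumptions the required volume comes out just under 0.1 μL, in line with the 800 WBC and 5e5 RBC figures quoted above.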
Thus thick film is used for diagnosis. However, distinguishing the three species P. vivax, P. ovale and P. malariae is very difficult on thick films. The thin film preparation preserves the morphology of parasites and RBCs, thus permitting species identification (ID).
It is too time-consuming to scan a large volume of blood on thin film, making it less useful for quantitating low parasitemia samples. Thin films are used for quantitation when the parasite load is high because the number of parasites per field-of-view (FoV) is more manageable.
A good thin film slide is difficult to prepare. One places a drop of blood on a microscope slide, then uses another slide to spread the drop across the slide by capillary action. The RBCs on the edge of the blood film form a monolayer where distinct RBCs can be seen. The slide is then dried, fixed and stained with a Romanowsky-type stain, such as Giemsa. Due to differences in lab protocols, stain pH level, and technician skill, the background color of the thin film varies, as seen in Figs. 5-7.
Under field conditions, an automated system must handle color variation of background and RBCs, blurriness, out-of-focus images, overly-clumped RBCs, and distractors.
VI-C Distractor filter manual features
Manual features included area, intensity, extent, Euler number, eccentricity, mean gradient, prominence, bumpiness, roundness, ridgeness, donut shapeness, external contrast, internal contrast, mean of the red, green and blue channels, and coefficient of variation of the grayscale, gradient, red, green and blue channels.
VI-D Species ID module manual features
The manual features included standard region properties such as elongation, perimeter, major and minor axis length, histogram of the grayscale image, pixel statistics of the gray-scale image; and properties of the gray-level co-occurrence matrix  such as contrast, correlation, energy and homogeneity.
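As an illustration of the co-occurrence features named above, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes a single horizontal pixel offset, a non-symmetric normalized matrix, energy defined as the sum of squared entries (some libraries report the square root), and homogeneity weighted by 1/(1 + |i - j|) (other libraries use the squared difference).

```python
import math


def glcm(image, levels, dr=0, dc=1):
    """Normalized gray-level co-occurrence matrix for the pixel offset (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        raise ValueError("image too small for this offset")
    return [[n / total for n in row] for row in counts]


def glcm_features(p):
    """Contrast, correlation, energy and homogeneity of a normalized GLCM."""
    cells = [(i, j) for i in range(len(p)) for j in range(len(p))]
    mu_i = sum(i * p[i][j] for i, j in cells)
    mu_j = sum(j * p[i][j] for i, j in cells)
    var_i = sum((i - mu_i) ** 2 * p[i][j] for i, j in cells)
    var_j = sum((j - mu_j) ** 2 * p[i][j] for i, j in cells)
    cov = sum((i - mu_i) * (j - mu_j) * p[i][j] for i, j in cells)
    denom = math.sqrt(var_i * var_j)
    return {
        "contrast": sum((i - j) ** 2 * p[i][j] for i, j in cells),
        # Correlation is undefined for a constant image, so report NaN there.
        "correlation": cov / denom if denom > 0 else float("nan"),
        "energy": sum(p[i][j] ** 2 for i, j in cells),
        "homogeneity": sum(p[i][j] / (1 + abs(i - j)) for i, j in cells),
    }


# Tiny two-level example image (hypothetical gray levels 0 and 1).
image = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
features = glcm_features(glcm(image, levels=2))
print(features)
```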
VI-E Method for relaxing the annotation task
Annotating ground truth on large datasets is expensive and time-consuming. We were able to relax our parasite vetting task as follows: when assessing algorithm performance (see “Important metrics” section) on the validation set, we used false positive (FP) rates from negative samples only. Missed (i.e. unannotated) parasites were treated as distractors by the algorithm. But if the algorithm detected and classified these objects as parasites they counted as FPs and were thus disregarded, since they were on positive samples. The relevant metric on positive validation samples was sensitivity, which by definition considered only annotated parasites. So while distractors mislabeled as parasites were harmful, missed parasites (on validation samples only) did not affect algorithm evaluation. This method allowed imperfect annotations of validation samples, and thus allowed us to focus annotation resources on the more important group of training set samples.
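The relaxed evaluation described above can be sketched as follows. The record layout and names are hypothetical, and the FP rate is expressed here as false positives per negative sample, which is an assumption (the text does not give the unit). The key point is that detections on positive samples that match no annotation are never counted.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    is_positive: bool          # patient-level ground truth
    annotated: int             # annotated parasites (0 on negative samples)
    detected_annotated: int    # detections that match an annotated parasite
    detected_unannotated: int  # detections that match no annotation


def relaxed_metrics(results):
    """Sensitivity from positive samples only, FP rate from negative samples only."""
    positives = [r for r in results if r.is_positive]
    negatives = [r for r in results if not r.is_positive]
    total_annotated = sum(r.annotated for r in positives)
    sensitivity = sum(r.detected_annotated for r in positives) / total_annotated
    # Unannotated detections on positive samples are deliberately ignored:
    # they may be real parasites that the annotator missed.
    fp_per_negative = sum(r.detected_unannotated for r in negatives) / len(negatives)
    return sensitivity, fp_per_negative


# Hypothetical validation results.
results = [
    SampleResult(True, 10, 8, 5),
    SampleResult(True, 20, 16, 0),
    SampleResult(False, 0, 0, 2),
    SampleResult(False, 0, 0, 4),
]
sensitivity, fp_per_negative = relaxed_metrics(results)
print(sensitivity, fp_per_negative)
```

Changing `detected_unannotated` on a positive sample leaves both metrics unchanged, which is why imperfect annotation of positive validation samples is tolerable under this scheme.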
VI-F Poisson variability and irreducible quantitation error
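Rare events are governed by the Poisson distribution:
$$P(k) = \frac{(np)^k e^{-np}}{k!},$$
where $p$ = the probability of an event in one draw, $n$ = the number of draws, and $k$ = the number of events observed.
This can be thought of as the limit of the binomial distribution:
$$P(k) = \binom{n}{k} p^k (1-p)^{n-k}$$
as $p \to 0$ (with $np$ held fixed). This pushes the binomial probability mass function up against 0, i.e. it becomes asymmetrical, with $P(k)$ highest for small $k$.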
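A quick numerical check of the binomial-to-Poisson limit described above is sketched below. It assumes the Poisson mean equals the number of draws times the per-draw probability, and the parameter values are illustrative only.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent draws."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson variable with the given mean."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


def max_pmf_gap(n, p, k_max=20):
    """Largest pointwise gap between the two distributions for k = 0..k_max."""
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, n * p))
        for k in range(min(k_max, n) + 1)
    )


# Same mean (5 events) reached with a common event and with a rare event.
print("n=10,    p=0.5   :", max_pmf_gap(10, 0.5))
print("n=10000, p=0.0005:", max_pmf_gap(10000, 0.0005))
```

The gap shrinks sharply when the same mean is produced by many draws of a rare event, which is the regime of parasite counting.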
Let the parasitemia $\rho$ = parasites/μL. Let $p$ be the probability that a particular RBC contains a parasite. Then $p = \rho / (5 \times 10^6)$, assuming 5e6 RBCs/μL and at most one ring in any RBC (i.e. ignore the case of multiple rings in one RBC). Consider each RBC as a coin toss with likelihood $p$ of coming up as “parasite”. Then the total number of actual parasites in the $n$ RBCs examined follows a binomial distribution $B(n, p)$. This implies that even a perfect annotator will count different numbers of parasites in different groups of $n$ RBCs from the same sample. The variation depends on $n$ and $p$, and decreases as $n$ and/or $p$ increase. A similar situation holds for thick film counts, where $p$ is the probability that a particular volume of blood (corresponding to one WBC) contains a parasite. Then $p = \rho / 8000$, assuming 8000 WBCs/μL, a “coin toss” is examining 1/8000 μL of blood, and $n$ = number of WBCs counted (as a proxy for this blood volume).
We can quantify the amount of this “irreducible Poisson error” in quantitation using the relative standard error
$$\mathrm{RSE} = \frac{\sigma}{\mu} = \frac{\sqrt{np(1-p)}}{np} \approx \frac{1}{\sqrt{np}} \quad \text{for small } p.$$
In thick films, microscopists typically count 500 WBCs, while an automated scanner plus algorithm counts 1000 or 2000 WBCs. Values of RSE for $n$ = 500, 1000, and 2000 are given in Fig. 8. The advantage of machines, due to their ability to scan larger areas, is clear.
Microscopists switch to thin films for quantitation at around $\rho$ = 8k or 16k parasites/μL due to the difficulty of keeping track of counts on thick films. On thin film, microscopists typically count 1000 RBCs, while an automated scanner plus algorithm can count 10k to 20k RBCs. Values of RSE for $n$ = 1k, 10k, 20k are given in Fig. 9, again showing the clear advantage of machines. Machines have an additional advantage due to their ability to accurately count parasites on thick films at parasitemias up to 80k/μL. This drastically reduces Poisson error vs microscopists in the 16k to 80k range, because of the much larger volume of blood examined on thick vs thin film. This is seen in Fig. 10, which combines Figs. 8 and 9 and plots relative standard error for thick and thin film quantitation, at all parasitemias and for a variety of WBC and RBC counts.
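The comparison above can be sketched numerically. This is a minimal illustration, not the code behind Figs. 8-10. It assumes the Poisson form of the relative standard error, 1/sqrt(expected parasites counted), which stays well defined on thick film even when there is more than one parasite per WBC. The parasitemia values are chosen for illustration.

```python
import math

RBC_PER_UL = 5e6   # thin film reference count, as assumed in the text
WBC_PER_UL = 8000  # thick film reference count, as assumed in the text


def relative_standard_error(parasitemia_per_ul, cells_counted, cells_per_ul):
    """Poisson relative standard error of a parasite count (assumed form)."""
    expected_parasites = parasitemia_per_ul * cells_counted / cells_per_ul
    return 1.0 / math.sqrt(expected_parasites)


# Thick film at a low parasitemia: human (500 WBCs) vs machine (2000 WBCs).
for wbcs in (500, 1000, 2000):
    rse = relative_standard_error(200, wbcs, WBC_PER_UL)
    print(f"thick, 200 p/uL, {wbcs} WBCs: RSE = {rse:.3f}")

# Thin film at 16k p/uL: human (1000 RBCs) vs machine (10k to 20k RBCs).
for rbcs in (1000, 10000, 20000):
    rse = relative_standard_error(16000, rbcs, RBC_PER_UL)
    print(f"thin, 16k p/uL, {rbcs} RBCs: RSE = {rse:.3f}")

# Machine staying on thick film at 16k p/uL with 2000 WBCs.
print(f"thick, 16k p/uL, 2000 WBCs: RSE = "
      f"{relative_standard_error(16000, 2000, WBC_PER_UL):.3f}")
```

Doubling the cells counted reduces the error by a factor of sqrt(2), and staying on thick film at high parasitemia reduces it much further because far more blood is examined.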
VII CNN architectures
architecture. The late stage branch network is a fully convolutional architecture with strided convolutions to achieve size reduction. Each convolutional layer also includes ReLU, and convolutional layers 2-7 are followed by dropout (0.3%-0.5%).
VII-A Output Report
Two sample reports generated by our thin film malaria assessment system are shown in Figs. 12 and 13.
The report lists the detected malaria species, the ring and late-stage quantitations, and number of RBCs examined.
It also shows a mosaic of the highest-scoring thumbnails from both the ring branch and the late stage branch.
These thumbnails can serve as a decision aid to a microscopist in low-resource settings: they collect objects of interest from a large region of blood film for visual examination.
They can also serve to reassure a trained technician that the algorithm (nominally a “black box”) is delivering reasonable results.
- World Health Organization, “World malaria report 2018”, Geneva: World Health Organization, 2018.
- CDC, https://www.cdc.gov/malaria/diagnosis_treatment/clinicians2.html
- G. Nazare-Pembele, L. Rojas, F.A. Nunez, “Lack of knowledge regarding the microscopic diagnosis of malaria by technicians of the laboratory network in Luanda, Angola”, Biomedica, 2016.
- D. Zurovac, B. Midia, S.A. Ochola, M. English, R.W. Snow, “Microscopy and outpatient malaria case management among older children and adults in Kenya”, Trop Med Int Health, 2006.
- World Health Organization and Center for Disease Control, “Basic Malaria Microscopy: Tutor’s guide”, WHO, 2010.
- C. Mehanian, et al., “Computer-Automated Malaria Diagnosis and Quantitation Using Convolutional Neural Networks”, CVPR, 2017.
- P.C.C. Garnham, “Malaria parasites and other haemosporidia”, Blackwell Scientific Publications Ltd., 1966.
- C.B. Delahunt, M.S. Jaiswal et al., “Supplementary Information for ‘Fully-automated patient-level malaria assessment on field-prepared thin blood film microscopy images’”, arXiv, 2019.
- M. Poostchi, K. Silamut, R. Maude, S. Jaeger and G. Thoma, “Image analysis and machine learning for detecting malaria”, Transl Res, 2018.
- D. Koller and Y. Bengio, “A fireside chat with Daphne Koller”, ICLR 2018, https://www.youtube.com/watch?v=N4mdV1CIpvI
- M.T. Le, T. R. Bretschneider, C. Kuss and P. R. Preiser, “A novel semi-automatic image processing approach to determine P. falciparum parasitemia in Giemsa-stained thin blood smears”, BMC Cell Biol, 2008.
- F. B. Tek, A. G. Dempster and I. Kale, “Computer vision for microscopy diagnosis of malaria”, Malaria J, 2009.
- N. Linder et al., “A malaria diagnostic tool based on computer vision screening and visualization of Plasmodium falciparum candidate areas in digitized blood smears”, PLoS One, 2014.
- D. Anggraini et al., “Automated status identification of microscopic images obtained from malaria thin blood smears using bayes decision: A study case in Plasmodium falciparum”, ICEEI, 2011.
- G. P. Gopakumar et al., “Convolutional neural network-based malaria diagnosis from focus stack of blood smear images acquired using custom-built slide scanner”, J Biophotonics, 2018.
- N. Abbas, et al., “Machine aided malaria parasitemia detection in Giemsa-stained thin blood smears”, Neural Comp & Application, 2018.
- A. Loddo, C. Di Ruberto and M. Kocher, “Recent Advances of Malaria Parasites Detection Systems Based on Mathematical Morphology”, Sensors, 2018.
- N. E. Ross, C. J. Pritchard, D. M. Rubin and A. G. Duse, “Automated image processing method for the diagnosis and classification of malaria on thin blood smears”, Med & Bio Engineering & Computing, 2006.
- Motic Optical, https://www.motic.com
- K. He, X. Zhang, S. Ren and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV, 2014.
- S. Ren, et al., “Faster R-CNN: Towards real-time object detection with region proposal networks”, Adv Neural Inf Process Syst, 2015.
- W. Liu, et al., “SSD: Single shot multibox detector”, ECCV, 2016.
- J. Redmon, et al., “You only look once: Unified, real-time object detection”, CVPR, 2016.
- S. Pertuz, D. Puig and M. A. Garcia, “Analysis of focus measure operators for shape-from-focus”, Pattern Recognit, 2013.
- L. Kang, P. Ye, Y. Li and D. Doermann, “Convolutional neural networks for no-reference image quality assessment”, CVPR, 2014.
- M. Jaiswal, et al., “Characterization of cervigram image sharpness using multiple self-referenced measurements and random forest classifiers”, Optics and Biophotonics in Low-Resource Settings IV, 2018.
- T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System”, Proc 22nd ACM SIGKDD, San Francisco, 2016.
- J. F. Brenner, et al., “An automated microscope for cytologic research a preliminary evaluation”, Journal of Histochem & Cytochem, 1976.
- Y. LeCun, Y. Bengio and G. Hinton, “Deep learning”, Nature, 2015.
- G. Litjens, et al., “A survey on deep learning in medical image analysis”, Med Image Analysis, 2017.
- O. Russakovsky, et al., “ImageNet Large Scale Visual Recognition Challenge”, International Journal of Computer Vision, 2015.
- C. Szegedy, et al., “Going Deeper with Convolutions”, CVPR, 2015.
- J. T. Springenberg, A. Dosovitskiy, T. Brox and M. Riedmiller, “Striving for simplicity: The all convolutional net”, arXiv:1412.6806, 2014.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, arXiv:1409.1556, 2014.
- Y. Jia, et al., “Caffe: Convolutional architecture for fast feature embedding”, Proc ACM Intern’l Conf Multimedia, 2014.
- X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks”, Proc 13th Intern’l Conf AI & Stats, 2010.
- B. Autino, A. Noris, R. Russo, F. Castelli, “Epidemiology of malaria in endemic areas”, Mediterr J Hematol Infect Diseases, 2012.
- R. Haralick, K. Shanmugam and I. Dinstein, “Textural features for image classification”, IEEE Trans on Systems, Man, and Cybernetics, 1973.
- J. Yosinski, J. Clune, Y. Bengio and H. Lipson, “How transferable are features in deep neural networks?”, Adv Neural Inf Process Syst, 2014.
- WHO, “Malaria Microscopy Quality Assurance Manual - Ver2”, 2016.
- M. Dhorda, WWARN, Private Communication.
- N. White, “The parasite clearance curve”, Malaria J, 2011.
- K. Torres et al., “Automated microscopy for routine malaria diagnosis: a field comparison on Giemsa-stained blood films in Peru”, Malaria J, 2018.