Artificial Intelligence in Dry Eye Disease

09/02/2021 ∙ by Andrea M. Storås, et al. ∙ Simula Research Lab

Dry eye disease (DED) has a prevalence of between 5 and 50%, depending on the diagnostic criteria used and the population under study. However, it remains one of the most underdiagnosed and undertreated conditions in ophthalmology. Many tests used in the diagnosis of DED rely on an experienced observer for image interpretation, which may be considered subjective and can result in variation in diagnosis. Since artificial intelligence (AI) systems are capable of advanced problem solving, use of such techniques could lead to more objective diagnosis. Although the term 'AI' is commonly used, recent success in its applications to medicine is mainly due to advancements in the sub-field of machine learning, which has been used to automatically classify images and predict medical outcomes. Powerful machine learning techniques have been harnessed to understand nuances in patient data and medical images, aiming for consistent diagnosis and stratification of disease severity. This is the first literature review on the use of AI in DED. We provide a brief introduction to AI, report its current use in DED research and its potential for application in the clinic. Our review found that AI has been employed in a wide range of DED clinical tests and research applications, primarily for interpretation of interferometry, slit-lamp and meibography images. While initial results are promising, much work is still needed on model development, clinical testing and standardisation.

1 Introduction

Dry Eye Disease (DED) is one of the most common eye diseases worldwide, with a prevalence of between 5 and 50%, depending on the diagnostic criteria used and study population [stapleton2017tfos]. Yet, although symptoms stemming from DED are reported as the most common reason to seek medical eye care [stapleton2017tfos], it is considered one of the most underdiagnosed and undertreated conditions in ophthalmology [geerling2011international]. Symptoms of DED include eye irritation, photophobia and fluctuating vision. The condition can be painful and might result in lasting damage to the cornea through irritation of the ocular surface. Epidemiological studies indicate that DED is most prevalent in women [matossian2019dry] and increases with age [stapleton2017tfos]. However, the incidence of DED is likely to increase in all age groups in coming years due to longer screen time and more prevalent use of contact lenses, both of which are risk factors [nichols2005self]. Other risk factors include diabetes mellitus [zhang2016dry] and exposure to air pollution [mandell2020impact]. DED can have a substantial effect on quality of life, and may impose significant direct and indirect public health costs as well as a personal economic burden due to reduced work productivity.

DED is divided into two subtypes defined by the underlying mechanism of the disease: (i) aqueous-deficient DED, where tear production from the lacrimal gland is insufficient, and (ii) evaporative DED (the most common form), which is typically caused by dysfunctional meibomian glands in the eyelids. The meibomian glands supply meibum, a concentrated substance that normally covers the surface of the cornea to form a protective superficial lipid layer that guards against evaporation of the underlying tear film. The ability to reliably distinguish between aqueous-deficient and evaporative DED, their respective severity levels and mixed aqueous/evaporative forms is important in deciding the ideal modality of treatment. A fast and accurate diagnosis relieves patient discomfort and also spares patients unnecessary expense and exposure to the potential side effects associated with some treatments. A tailor-made treatment plan can yield improved treatment response and maximize health provider efficiency.

The main clinical signs of DED are decreased tear volume, more rapid break-up of the tear film (measured as the fluorescein tear break-up time, TBUT) and microwounds of the ocular surface [willcox2017tfos]. In the healthy eye, the tear film naturally 'breaks up' after ten seconds and the protective tear film is reformed with blinking. Available diagnostic tests often do not correlate with the severity of clinical symptoms reported by the patient. No single clinical test is considered definitive in the diagnosis of DED [stapleton2017tfos]. Therefore, multiple tests are typically used in combination and supplemented by information on patient symptoms, recorded through questionnaires. These tests demand a significant amount of time and resources at the clinic. Tests for determining the physical parameters of tears include TBUT, the Schirmer's test, tear osmolarity and tear meniscus height. Other useful tests in DED diagnosis include ocular surface staining, corneal sensibility, interblink frequency, corneal surface topography, interferometry, aberrometry and imaging techniques such as meibography and in vivo confocal microscopy (IVCM), as well as visual function tests.

Artificial intelligence (AI) was defined in 1955 as “the science and engineering of making intelligent machines” [mccarthy2006AIproposal], where intelligence is the “ability to achieve goals in a wide range of environments” [legg2007intelligence]. Within AI, machine learning denotes a class of algorithms capable of learning from data rather than being programmed with explicit rules. AI, and particularly machine learning, is increasingly becoming an integral part of health care systems. The sub-field of machine learning known as deep learning uses deep artificial neural networks, and has gained increased attention in recent years, especially for its image and text recognition abilities. In the field of ophthalmology, deep learning has so far mainly been used in the analysis of data from the retina to segment regions of interest in images, automate diagnosis and predict disease outcomes [schmidt2018artificial]. For instance, the combination of deep learning and optical coherence tomography (OCT) technologies has allowed reliable detection of retinal diseases and improved diagnosis [deFauw2018clinically]. Machine learning also has potential for use in the diagnosis and treatment of anterior segment diseases, such as DED. Many of the tests used for DED diagnosis and follow-up rely on the experience of the observer for interpretation of images, which may be considered subjective [yedidya2007automatic]. AI tools can be used to interpret images automatically and objectively, saving time and providing consistency in diagnosis.

Several reviews have been published that discuss the application of AI in eye disease, including screening for diabetic retinopathy [nielsen2019review], detection of age-related macular degeneration [pead2019review] and diagnosis of retinopathy of prematurity [gensure2020review]. We are, however, not aware of any review on AI in DED. In this article, we therefore provide a critical review of the AI systems developed within the field of DED, discuss their current use and highlight future work.

2 Artificial intelligence

AI is information technology capable of performing activities that require intelligence. It has gained substantial popularity within the field of medicine due to its ability to solve ubiquitous medical problems, such as classification of skin cancer [esteva2017melanom], prediction of hypoxemia during surgeries [lundberg2018hypoxemia] and identification of diabetic retinopathy [gulshan2016diabetic]. Machine learning is a sub-field of AI encompassing algorithms capable of learning from data without being explicitly programmed. All AI systems used in the studies included in this review fall within the class of machine learning. The process by which a machine learning algorithm learns from data is referred to as training. The outcome of the training process is a machine learning model, and the model's outputs are referred to as predictions. Different learning algorithms are categorised according to the type of data they use, and are referred to as supervised, unsupervised and reinforcement learning. The latter is excluded from this review, as none of the studies use it, while the former two are introduced in this section. A complete overview of the algorithms encountered in the reviewed studies is provided in Figure 1, sorted according to the categories described below.

2.1 Supervised learning

Supervised learning denotes the learning process of an algorithm using labelled data, meaning data that contains the target value for each data instance, e.g., the tear film lipid layer category. The learning process involves extracting patterns linking the input variables to the target outcome. The performance of the resulting model is evaluated by letting it predict on a previously unseen data set and comparing the predictions to the true data labels. See Section 2.5 for a brief discussion of evaluation metrics. Supervised learning algorithms can perform regression and classification, where regression involves predicting a numerical value for a data instance, and classification involves assigning data instances to predefined categories.

Figure 1 contains an overview of supervised learning algorithms encountered in the reviewed studies.
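As a concrete illustration of this workflow, the following is a minimal sketch in Python (using scikit-learn) of training a supervised classifier on labelled data and evaluating it on a held-out, previously unseen test set; the synthetic data and choice of classifier are illustrative assumptions, not taken from any reviewed study.

    # Minimal supervised learning sketch: train on labelled data,
    # evaluate on a held-out test set (illustrative, synthetic data).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)  # training = learning from labels
    # Compare predictions on unseen data to the true labels.
    print(accuracy_score(y_test, model.predict(X_test)))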

2.2 Unsupervised learning

Unsupervised learning denotes the training process of an algorithm using unlabelled data, i.e., data not containing target values. The task of the learning algorithm is to find patterns or data groupings by constructing a compact representation of the data. This type of machine learning is commonly used for grouping observations together, detecting relationships between input variables, and for dimensionality reduction. As unsupervised learning data contains no labels, a measure of model performance depends on considerations outside the data (see [hastie2009Unsupervised], chap. 14), e.g., how the task would have been solved by someone in the real world. For clustering algorithms, similarity or dissimilarity measures such as the distance between cluster points can be used to measure performance, but whether this is relevant depends on the task [palacio2019evaluation]. The unsupervised algorithms encountered in the reviewed studies can be divided into those performing clustering and those used for dimensionality reduction; see Figure 1 for an overview.
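To make the evaluation issue concrete, the sketch below clusters unlabelled synthetic data and scores the result with a dissimilarity-based measure (the silhouette score); the data and the choice of three clusters are illustrative assumptions only.

    # Minimal unsupervised learning sketch: k-means clustering of
    # unlabelled points, scored by cluster separation (silhouette).
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=150, centers=3, random_state=0)  # labels discarded
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(silhouette_score(X, labels))  # higher = tighter, better-separated clusters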

Figure 1: An overview of the machine learning algorithms used in the reviewed studies.

2.3 Artificial neural networks and deep learning

Artificial neural networks are loosely inspired by the neurological networks in the biological brain, and consist of artificial neurons organised in layers. How the layers are organised within the network is referred to as its architecture. Artificial neural networks have one input layer, responsible for passing the data to the network, and one or more hidden layers. Networks with more than one hidden layer are called deep neural networks. The final layer is the output layer, providing the output of the entire network. Deep learning is a sub-field of machine learning involving the training of deep neural networks, which can be done both in a supervised and an unsupervised manner. We encounter several deep architectures in the reviewed studies. The two more advanced types are convolutional neural networks (CNNs) and generative adversarial networks (GANs). The CNN is the commonly used architecture for image analysis and object detection problems, named for its so-called convolutional layers that act as filters identifying relevant features in images. CNNs have gained popularity recently, and all of the reviewed studies that apply CNNs were published in or later. Advanced deep learning techniques will most likely replace the established image analysis methods, a trend that has been observed within other medical fields such as gastrointestinal diseases and radiology [le2020application; thrall2018artificial]. A GAN is a combination of two neural networks, a generator and a discriminator, competing against each other. The goal of the generator is to produce fake data similar to a set of real data. The discriminator receives both real data and the fake data from the generator, and its goal is to discriminate between the two. GANs can be used, among other things, to generate synthetic medical data, alleviating privacy concerns [thambawita2021id].
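For readers unfamiliar with these architectures, the following is a minimal sketch of a small CNN classifier in Python (Keras); the input size, layer sizes and three output classes are illustrative assumptions and do not reproduce any model from the reviewed studies.

    # Minimal CNN sketch: convolutional layers act as learned filters,
    # followed by a small classifier head (illustrative configuration).
    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        layers.Input(shape=(128, 128, 1)),          # e.g., grayscale images
        layers.Conv2D(16, 3, activation="relu"),    # filters detecting local features
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(3, activation="softmax"),      # one probability per class
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])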

2.4 Workflow for model development and validation

The data used for developing machine learning models is ideally divided into three independent parts: a training set, a validation set and a test set. The training set is used to tune the model, the validation set to evaluate performance during training, and the test set to evaluate the final model. A more advanced form of training and validation is k-fold cross-validation. Here, the data is split into k parts, of which one part is set aside for validation while the model is trained on the remaining data. This is repeated k times, each time using a different part of the data for validation. The model performance can be calculated as the average performance of the k different models (see [hastie2009Unsupervised], chap. 7). It is considered good practice not to use the test data during model development and, vice versa, not to tune the model further once it has been evaluated on the test data (see [hastie2009Unsupervised], chap. 7). In cases of class imbalance, i.e., an unequal number of instances from the different classes, there is a risk of developing a model that favors the prevalent class. If the data is stratified for training and testing, this might not be captured during testing. Class imbalance is common in medical data sets, as there are, for instance, usually more healthy than ill people in the population [gianfrancesco2018classImbalance]. Whether to choose a class distribution that represents the population, a balanced distribution or some other distribution depends on the objective. Regardless, various performance scores should always be used to provide a full picture of the model's performance.
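A minimal sketch of k-fold cross-validation with scikit-learn follows; the classifier and the choice of k = 5 are illustrative.

    # Minimal k-fold cross-validation sketch: performance is reported as
    # the average over the k held-out folds (here k = 5, synthetic data).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print(scores.mean(), scores.std())  # average validation performance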

2.5 Performance scores

In order to assess how well a machine learning model performs, its performance can be assigned a score. In supervised learning, this is based on comparing the model's output to the desired output. Here, we introduce the scores used most frequently in the reviewed studies. Their definitions, as well as the remaining scores used, are provided in Section A.1. A commonly used performance score in classification is accuracy, eq. (A.3), which denotes the proportion of correctly predicted instances. Its use is inappropriate in cases of strong class imbalance, as it can reach high values if the model always predicts the prevalent class. The sensitivity, also known as recall, eq. (A.4), denotes the true positive rate. If the goal is to detect all positive instances, a high sensitivity indicates success. The precision, eq. (A.5), denotes the positive predictive value. The specificity, eq. (A.6), denotes the true negative rate, and is the negative-class counterpart of the sensitivity. The F1 score, eq. (A.7), is the harmonic mean of the sensitivity and the precision. It is not symmetric between the classes, meaning it depends on which class is defined as positive.
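For reference, the sketch below computes these scores from the four cells of a binary confusion matrix; the counts are invented for illustration.

    # Computing the scores above from a binary confusion matrix
    # (tp/fp/fn/tn counts are illustrative).
    tp, fp, fn, tn = 40, 5, 10, 45
    accuracy    = (tp + tn) / (tp + tn + fp + fn)   # proportion correct
    sensitivity = tp / (tp + fn)                    # recall, true positive rate
    precision   = tp / (tp + fp)                    # positive predictive value
    specificity = tn / (tn + fp)                    # true negative rate
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # harmonic mean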

Image segmentation involves partitioning the pixels in an image into segments [geron2019hands]. This can, for example, be used to place all pixels representing the pupil into one segment while pixels representing the iris are placed in another. The identified segments can then be compared to manual annotations. Performance scores used include the average Pompeiu-Hausdorff distance, eq. (A.17), the Jaccard index and the support, all described in Section A.1.
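As an illustration, the Jaccard index between a predicted and a manually annotated binary mask can be computed as intersection over union; the tiny masks below are invented for the example.

    # Jaccard index (intersection over union) of two binary masks.
    import numpy as np

    pred = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]], dtype=bool)
    true = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 0]], dtype=bool)
    jaccard = np.logical_and(pred, true).sum() / np.logical_or(pred, true).sum()
    print(jaccard)  # 3 overlapping pixels / 4 pixels in the union = 0.75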

2.6 AI regulation

Approved AI devices will be a major part of the medical service landscape in the future. Currently, many countries are actively working on releasing AI regulations for healthcare, including the European Union (EU), the United States, China, South Korea and Japan. On 21 April 2021, the EU released a proposal for a regulatory framework for AI [AI_regulation_EU]. The US Food and Drug Administration (FDA) is also working on AI legislation for healthcare [AI_regulation_FDA].

In the framework proposed by the EU, AI systems are divided into four categories: low risk, minimal risk, high risk and unacceptable risk [AI_regulation_EU]. AI systems that fall into the high-risk category are expected to be subject to strict requirements, including data governance, technical documentation, transparency and provision of information to users, human oversight, robustness and cyber security, and accuracy. It is highly likely that medical devices using AI will end up in the high-risk category. Looking at the legislation proposals [AI_regulation_EU; AI_regulation_FDA] from an AI research perspective, it is clear that explainable AI, transparency, uncertainty assessment, robustness against adversarial attacks, high quality of data sets, proper performance assessment, continuous post-deployment monitoring, human oversight and interaction between AI systems and humans will be major research topics for the development of AI in healthcare.

3 Methods

3.1 Search methods

A systematic literature search was performed in PubMed and Embase in the period between March and May. The goal was to retrieve as many studies as possible applying machine learning to DED-related data. The following keywords were used: all combinations of “dry eye” and “meibomian gland dysfunction” with “artificial intelligence”, “machine learning”, “computer vision”, “image recognition”, “bayesian network”, “decision tree”, “neural network”, “image based analysis”, “gradient boosting”, “gradient boosting machine” and “automatic detection”. In addition, searches for “ocular surface” combined with both “artificial intelligence” and “machine learning” were made. An overview of the search terms and combinations is given in Figure 2. No time period limitations were applied for any of the searches.

Figure 2: Search term combinations used in the literature search. Three of the studies found in the searches including “ocular surface” were also found among the studies in the searches including “dry eye”.

3.2 Selection criteria

The studies included in the review had to be available in full-text English. Studies not investigating the medical aspects of DED were excluded (e.g., studies of other ocular diseases and cost analyses of DED). Moreover, the studies had to describe the use of a machine learning model in order to be considered. Reviews were not considered. The studies were selected in a three-step process. One review author screened the titles on the basis of the inclusion criteria. The full texts were then retrieved and assessed for relevance. The search gave studies in total, of which were regarded as relevant according to the selection criteria. After removing duplicates, studies were left. The three-step process is shown in Figure 3(a).

4 Artificial intelligence in dry eye disease

4.1 Summary of the studies

Most studies were published in recent years, especially after 2014; see Figure 3(b). An overview of the studies is provided in Tables 1 and 2 for the clinical studies, Table 3 for the biochemical studies and Table 4 for the demographical studies. Information on the data used in each study is shown in Table 5. We grouped studies according to the type of clinical test or type of study: TBUT, interferometry and slit-lamp images, IVCM, meibography, tear osmolarity, proteomics analysis, OCT, population surveys and other clinical tests. We found that most studies employed machine learning for interpretation of interferometry, slit-lamp and meibography images.

Figure 3: (a) Illustration of the three steps in the study selection process and the number of studies (N) included in each step; (b) the number of studies published over time, counting the studies included in this review.
Study Objective N Clinical Tests Type of Data Type of Algorithm Performance Score(s)
Aggarwal S et al. (aggarwal2021immunecell) DED mechanism, effect of therapy Subjective symptoms, Schirmer's test with anesthesia, TBUT, vital staining of cornea and conjunctiva, laser IVCM images, subbasal layer of cornea: DC density and morphology Images of cornea GLM, MLR GLM: p-values < for DC density and number of DCs, MLR: p-values < between DC density and CFS, number of DCs and CFS, DC size and CFS, DC density and conjunctival staining, number of DCs and TBUT, corresponding β-coefficients = , , , and
Deng X et al. (DENG2021TearMeniscus Estimate tear meniscus height Oculus Keratograph Tear meniscus images CNN (U-net) Accuracy = %, sensitivity = , precision = , F1 score =
Elsawy A et al. (ELSAWY2021252 Diagnose DED AS-OCT Ocular surface images Pretrained CNN (VGG19) AUCROC = (model ) and (model ), AUCPRC = (model ) and (model ), F1 score = (model ) and (model )*
Khan ZK et al. (Khan2021image) Detect MGD Meibomian gland 3D IR-images, lower and upper eyelid Meibomian gland images GAN F1 score = , P-HD = , aggregated JI = , r = (clinician 1) and (clinician 2), p-values < , mean difference = (clinician 1) and (clinician 2)
Xiao P et al. (xiao2021Meibo) Detect MGD (images) Oculus Keratograph IR meibography images Prewitt operator, Graham scan algorithm, fragmentation algorithm and SA (used sequentially) Gland area: KI = , FPR = %, FNR = %. Gland segmentation: KI = , FPR = %, FNR = %*
Yeh C-H et al. (yeh2021Meibo) Detect MGD (images) Oculus Keratograph IR meibography images Nonparametric instance discrimination, pretrained CNN (ImageNet), hierarchical clustering Accuracy: meiboscore grading = %, 2-class classification = %, 3-class classification = %, 4-class classification = %*
da Cruz LB et al. (dacruz2020interferometer) Classify tear film patterns (images) Doane interferometer Tear film lipid layer images SVM, RF, RT, Naive Bayes, DNN, simple NN RF: accuracy = %, SD = %, F1 score = , KI = , AUCROC = **
da Cruz LB et al. (dacruz2020ripleysk Classify tear film patterns (images) Doane interferometer Tear film lipid layer images SVM, RF, RT, Naive Bayes, DNN, simple NN RF: accuracy = %, SD = %, F1 score = , KI = , AUCROC = ***
Fu P-I et al. (Fu2020LLT) Compare methods Oculus Keratograph Tear film lipid layer images (with and without preprocessing) GLM β-coefficients = ,
Fujimoto K et al. (fujimoto2020comparison) Compare methods Pentacam vs AS-OCT CCT, TCT, thinnest point of cornea Multivariable regression Severe DED: β-coefficients = (CCT) and (TCT), p-values = (CCT) and (TCT), % CI = (CCT) and (TCT)
Maruoka S et al. (Maruoka2020Meibo Detect MGD IVCM Meibomian gland images Combinations of CNNs Single CNN: AUROC = , sensitivity = , specificity = , ensemble CNNs: AUROC = , sensitivity = , specificity =
Prabhu SM et al. (Prabhu2020Meibo) Quantify and detect MGD (images) Oculus Keratograph, digital camera CNN (U-net) p-values > between model output and clinical experts
Stegmann H et al. (Stegmann2020TearMeniscus Detect tear meniscus in images Optical coherence tomography Tear meniscus images CNNs Meniscus localization: JI = , sensitivity = , meniscus segmentation best CNN: accuracy = , sensitivity = , specificity = , JI = , F1 score = , support = *, ***
Wei S et al. (wei2020therapeutic DED mechanism, effect of therapy Corneal IVCM with anesthesia Images of cornea Pretrained CNN (U-net) AUROC = , sensitivity = %
Giannaccare G et al. (Giannaccare2019SubbasalNerve) Subbasal nerve plexus characteristics for diagnosing DED IVCM Images of subbasal nerve plexus Earlier developed method involving RF and NN [Chen2017ACCMed; Dabbah2011NerveFiber] Nan

Abbreviations: N = number of subjects; DED = dry eye disease; IVCM = in vivo confocal microscopy; DC = dendritic cell; GLM = generalized linear model; MLR = multiple linear regression; CFS = corneal fluorescein score; AS-OCT = anterior segment optical coherence tomography; CNN = convolutional neural network; AUROC = area under receiver operating characteristic curve; AUPRC = area under precision-recall curve; MGD = meibomian gland dysfunction; GAN = generative adversarial network; P-HD = average Pompeiu-Hausdorff distance; JI = Jaccard index; KI = Kappa index; CTRL = healthy; FPR = false positive rate; FNR = false negative rate; SVM = support vector machine; RF = random forest; RT = random tree; DNN = deep neural network; SD = standard deviation; CCT = central corneal thickness; TCT = thinnest corneal thickness; r = Pearson's correlation coefficient; Nan = not available; NN = neural network; RMSE = root mean squared error; CI = confidence interval; TBUT = fluorescein tear break-up time; PA = pruning algorithm; SA = skeletonization algorithm; FFA = flood-fill algorithm; * = standard deviations not included in table; ** = % confidence intervals not included in table; *** = metrics are calculated as the average of repetitions; **** = metrics are calculated as the average of repetitions; ***** = metrics are calculated as the average from -fold cross-validation; = metrics are calculated as the average from -fold cross-validation; = metrics are calculated as the average of models
Table 1: Overview of the reviewed studies using clinical investigations, part 1 of 2.
Study Objective N Clinical Tests Type of Data Type of Algorithm Performance Score(s)
Llorens-Quintana C et al. (Llorens-QuintanaClara2019ANAA) Evaluate meibomian gland atrophy Oculus Keratograph Meibography images Sobel operator, polynomial function, fragmentation algorithm, Otsu's method (used sequentially) p-values < between automatic method and clinicians
Wang J et al. (Wang2019Meibo Evaluate meibomian gland atrophy (images) Oculus Keratograph Meibography images CNNs Meiboscore grading: accuracy = %, eyelid detection: accuracy = %, JI = , atrophy detection: accuracy = %, JI = , RMSE = (average across meiboscores)
Yabusaki K (yabusaki2019diagnose Diagnose DED (images) Tear interferometer Tear film lipid layer images SVM KI = , CTRL: F1 score = , SD = , aqueous-deficient DED: F1 score = , SD = , evaporative DED: F1 score = , SD = ****
Yang J et al. (yang2019meniscus Estimate tear meniscus height for DED Slit-lamp images with fluorescence staining Ocular surface images Connected component labelling Mean: p-value < (x and x magnification), r = (x) and (x), max: p-value < (x and x), r = (x) and (x)
Szyperski PD (Szyperski2018fractal Diagnose DED Interferometry Videos from lateral shearing interferometry different fractal dimension estimators, linear regression Best estimator: AUROC =
Hwang H et al. (hwang2017image Estimate tear film lipid layer thickness Lipiscanner , slit-lamp microscope Tear film lipid layer videos Flood-fill algorithm, Canny edge detection p-value < between all MGD groups
Koprowski R et al. (Koprowski2017Meibo Detect MGD Oculus Keratograph Meibography images Riesz pyramid (?), Bezier curve (used sequentially) Accuracy = %, sensitivity = , specificity =
Peteiro-Barral D et al. (peteiro2017evaluation Classify tear film patterns (images) Tearscope plus images Tear film lipid layer images SVM, Decision tree, Naive Bayes, simple NN, Fisher´s linear discriminant NN: accuracy = %, sensitivity = %, specificity = %, precision = %, F1 score = , AUCROC =
Koprowski et al. (Koprowski20161Meibo Detect MGD Oculus Keratograph Meibography images Otsu’s method, SA, watershed algorithm (used sequentially) Sensitivity = , specificity =
Remeseiro B et al. (remeseiro2016iDEAS Classify tear film patterns (images) Tearscope-plus images Tear film lipid layer images SVM accuracy = %, precision = %, sensitivity = %, specificity = %, F1 score = %, processing time = s
Remeseiro B et al. (remeseiro2016CASDES Classify tear film patterns (images) Tearscope-plus images Tear film lipid layer images SVM accuracy = %, sensitivity = %, precision = %, specificity = %
Kanellopoulos AJ et al. (kanellopoulos2014invivo Diagnose DED Fourier-domain AS-OCT system: corneal and corneal epithelial thickness maps Corneal examination Linear regression (correlation between DED and thickness) Nan
Ramos L et al. (ramos2014automatic Estimate TBUT (videos) Videos from TBUT (slit-lamp) TBUT videos Polynomial function specificity = % (parameter b) and % (parameter e), specificity = % and %
Ramos L et al. (Ramos2014CCLRU Estimate TBUT (videos) Videos from TBUT (slit-lamp) TBUT videos Polynomial function accuracy = "more than %"
Remeseiro et al. (remeseiro2014tearfilm Classify tear film patterns (images) Tearscope-plus images Tear film lipid layer images Markov random field, SVM (used sequentially) accuracy = %, accuracy (noisy data) = *****
García-Resúa C et al. (garcia2013new Classify tear film patterns Tearscope-plus images Tear film lipid layer images K-nearest neighbors Cramér’s V = , r = , p-value < , accuracy = %
Rodriguez JD (Rodriguez2013Redness) Evaluate ocular redness Slit-lamp, digital camera Images of conjunctiva Sobel operator, MLR (used sequentially) Accuracy = %, r = , concordance correlation = (compared to investigators)*
Koh YW et al. (koh2012detection Detect MGD Slit-lamp biomicroscope, upper eye lid IR meibography images PA, SA, FFA, SVM (used sequentially) specificity = %, SD = %, sensitivity = %, SD = %
Yedidya T et al. (yedidya2009enforcing Estimate TBUT (videos) Video from TBUT TBUT videos Markov random field average difference in TBUT = s
Yedidya T et al. (yedidya2007automatic Detect dry areas (videos) Video from TBUT TBUT videos Levenberg-Marquardt accuracy = % (%), SD = %
Mathers WD et al. (mathers2004cluster Investigate DED Schirmer´s test, meibomian gland drop-out, lipid viscosity and volume, tear evaporation Clinical test results Hierarchical clustering, decision tree Nan

Abbreviations: N = number of subjects; DED = dry eye disease; IVCM = in vivo confocal microscopy; DC = dendritic cell; GLM = generalized linear model; MLR = multiple linear regression; CFS = corneal fluorescein score; AS-OCT = anterior segment optical coherence tomography; CNN = convolutional neural network; AUROC = area under receiver operating characteristic curve; AUPRC = area under precision-recall curve; MGD = meibomian gland dysfunction; GAN = generative adversarial network; P-HD = average Pompeiu-Hausdorff distance; JI = Jaccard index; KI = Kappa index; CTRL = healthy; FPR = false positive rate; FNR = false negative rate; SVM = support vector machine; RF = random forest; RT = random tree; DNN = deep neural network; SD = standard deviation; CCT = central corneal thickness; TCT = thinnest corneal thickness; r = Pearson's correlation coefficient; Nan = not available; NN = neural network; RMSE = root mean squared error; CI = confidence interval; TBUT = fluorescein tear break-up time; PA = pruning algorithm; SA = skeletonization algorithm; FFA = flood-fill algorithm; * = standard deviations not included in table; ** = % confidence intervals not included in table; *** = metrics are calculated as the average of repetitions; **** = metrics are calculated as the average of repetitions; ***** = metrics are calculated as the average from -fold cross-validation; = metrics are calculated as the average from -fold cross-validation; = metrics are calculated as the average of models
Table 2: Overview of the reviewed studies using clinical investigations, part 2 of 2.
Study Objective N Clinical Tests Type of Data Type of Algorithm Performance Score(s)
Cartes C et al. (cartes2019dry Diagnose DED Tear-Lab Osmometer Tear osmolarity measurements LR, Naive Bayes, SVM, RF LR: accuracy = %
Jung JH et al. (jung2017proteomic Detect protein patterns in DED Pooled tear and lacrimal fluid, analysed with LC-MS, trypsin digestion, RP-LC fractionation Proteins in tears and lacrimal fluid "Network model" based on betweenness centrality Nan
Gonzalez N (Gonzalez2014Protein Diagnose DED Peptide/protein analysis: gel electrophoresis (SDS-PAGE) Peptides and proteins in tears Discriminant analysis, PCA, NN Accuracy = %, CTRL: sensitivity = , specificity = , MGD: sensitivity = , specificity = , aqueous-deficient DED: sensitivity = , specificity = *
Grus FH et al. (grus2005seldi Diagnose DED Schirmer´s test with anesthesia, tears analysed by LC-MS Proteins in tears Discriminant analysis, DNN (used sequentially) AUROC = , sensitivity and specificity = “approx. % each”
Grus FH et al. (grus1999analysis Diagnose DED Protein analysis: gel electrophoresis (SDS-PAGE) Proteins in tears DNN, discriminant analysis DNN: accuracy = %, discriminant analysis: accuracy = %
Grus FH et al. (grus1998clustering Diagnose DED Protein analysis: gel electrophoresis (SDS-PAGE) Proteins in tears Principal component analysis, K-means clustering (used sequentially), discriminant analysis K-means: accuracy = % (DED vs CTRL) and % (DED, diabetes-DED, CTRL), discriminant analysis: accuracy = % (DED vs CTRL) and % (DED, diabetes-DED, CTRL)

Abbreviations: N = number of subjects; DED = dry eye disease; LR = logistic regression; SVM = support vector machine; RF = random forest; AUROC = area under receiver operating characteristic curve; MGD = meibomian gland dysfunction; CTRL = healthy; DNN = deep neural network; Nan = not available; NN = neural network; LC-MS = liquid chromatography mass spectrometry; RP-LC = reverse-phase liquid chromatography; SDS-PAGE = sodium dodecyl sulphate-polyacrylamide gel electrophoresis; OSDI = ocular surface disease index; * = metrics are calculated as the average of repetitions
Table 3: Overview of the reviewed studies using biochemical investigations.
Study Objective N Clinical Tests Type of Data Type of Algorithm Performance Score(s)
Choi HR et al. (choi2020association Investigate DED and dyslipidemia association OSDI score, health examination, questionnaire Population studies, Korea GLM, LR Nan
Nam SM et al. (nam2020explanatory Detect risk factors for DED Health examination, health survey, nutrition survey National health survey, Korea Decision tree, Lasso, LR (used sequentially) AUROC = , % CI = , specificity = %, sensitivity = %
Kaido M et al. (Kaido2015computer Diagnose DED Blink frequency, visual maintenance ratio, questionnaire Functional VA measurement and questionnaire, Japanese visual display terminal workers Discriminant analysis sensitivity = %, specificity = %, precision = %, NPV = %
Abbreviations: N = number of subjects; DED = dry eye disease; GLM = generalized linear model; AUROC = area under receiver operating characteristic curve; Nan = not available; CI = confidence interval; LR = logistic regression; OSDI = ocular surface disease index; VA = visual acuity; NPV = negative predictive value
Table 4: Overview of the reviewed studies using demographical investigations.
Study Type of Input Data Training Dataset Testing Dataset Reference Standard
Clinical Investigations
Aggarwal S et al. (aggarwal2021immunecell Tabular Nan Nan (clinical test results, subjective report)
Deng X (DENG2021TearMeniscus Images (images) (images) Senior clinician
Elsawy A et al. (ELSAWY2021252 Images (train), (val) Certified cornea specialist
Khan ZK et al. (Khan2021image) Images Clinician
Xiao P et al. (xiao2021Meibo Images Nan ophthalmologists
Yeh C-H et al. (yeh2021Meibo Images (train), (val) Trained clinician
da Cruz LB et al. (dacruz2020interferometer) Tabular (-fold CV) Nan Optometrist
da Cruz LB et al. (dacruz2020ripleysk Tabular (-fold CV) Nan Optometrist
Fu P-I et al. (Fu2020LLT Tabular Nan Nan (clinical test results, subjective report)
Fujimoto K et al. (fujimoto2020comparison Tabular Nan Nan (kerato-conjunctival staining for DED)
Maruoka S et al. (Maruoka2020Meibo Images (-fold CV) Nan eyelid specialists
Prabhu SM et al.(Prabhu2020Meibo Images Clinical experts
Stegmann H et al. (Stegmann2020TearMeniscus Images (images) (-fold CV) Nan Experienced investigator
Wei S et al. (wei2020therapeutic Images * ( per patient) Experienced investigator
Giannaccare G et al. (Giannaccare2019SubbasalNerve Tabular Nan Experienced investigator Chen2017ACCMed
Llorens-Quintana et al. (Llorens-QuintanaClara2019ANAA Images Nan Clinicians
Wang J et al. (Wang2019Meibo Images (train) (val) Experienced clinician
Yabusaki K et al. (yabusaki2019diagnose Tabular ** ** Skilled ophthalmologist
Yang J et al. (yang2019meniscus Images Nan ImageJ software
Szyperski PD (Szyperski2018fractal Tabular Nan Nan
Hwang H et al. (hwang2017image Frames Nan Meibomian gland expert
Koprowski R et al. (Koprowski2017Meibo Images (images) Nan Specialized clinicians
Peteiro-Barral D et al. (peteiro2017evaluation Tabular (LOO CV) Nan Experts
Koprowski R et al. (Koprowski20161Meibo Images (images) Nan Ophthalmology expert
Remeseiro B et al. (remeseiro2016iDEAS Tabular Nan Optometrists
Remeseiro B et al. (remeseiro2016CASDES Tabular Sampled from test set optometrists
Kanellopoulos AJ et al. (kanellopoulos2014invivo Tabular Nan Ophthalmologist
Ramos L et al. (ramos2014automatic Videos Nan / experts
Ramos L et al. (Ramos2014CCLRU Videos experts
Remeseiro et al. (remeseiro2014tearfilm Tabular (-fold CV) Nan Experts
García-Resúa C et al. (garcia2013new Tabular (-fold CV) Nan Experienced investigator
Rodriguez R et al. (Rodriguez2013Redness Tabular (images) Nan trained investigators
Koh YW et al. (koh2012detection Tabular *** *** Experts
Yedidya T et al. (yedidya2009enforcing Videos Nan Clinician
Yedidya T et al. (yedidya2007automatic Frames **** Nan Optometrist (evaluated of the patients)
Mathers WD et al. (mathers2004cluster Tabular (-fold CV) Nan Nan (clinical test results)
Biochemical Investigations
Cartes C et al. (cartes2019dry Tabular (noise added) (no noise) Nan (clinical test results, subjective report)
Jung JH et al. (jung2017proteomic Tabular Nan Ophthalmologist
Gonzalez N et al. (Gonzalez2014Protein Tabular % of ** % of ** Nan (clinical tests)
Grus FH et al. (grus2005seldi Tabular % of % of Nan (clinical test results, subjective report)
Grus FH et al. (grus1999analysis Tabular Nan (clinical test results, subjective report)
Grus FH et al. (grus1998clustering Tabular Nan (clinical test results, subjective report)
Demographical Investigations
Choi HR et al. (choi2020association Tabular Nan Nan (subjective report)
Nam SM et al. (nam2020explanatory) Tabular % of % of Ophthalmologist
Kaido M et al. (Kaido2015computer Tabular Nan Dry eye specialists
Abbreviations: Nan = not available; val = validation; CV = cross-validation; DED = dry eye disease; LOO = leave one out; * = pretraining images; ** = randomly selected samples, process repeated times; *** = randomly selected samples, process repeated times; **** = sequences of video per patient; = for multivariate analysis model, but the number of samples was not mentioned

Table 5: Overview of the data applied for the analyses.

4.2 Fluorescein tear break-up time

Shorter break-up time indicates an unstable tear film and a higher probability of DED. Machine learning has been employed to detect dry areas in TBUT videos and estimate TBUT [yedidya2007automatic; yedidya2009enforcing; ramos2014automatic; Ramos2014CCLRU]. Use of the Levenberg-Marquardt algorithm to detect dry areas achieved an accuracy of % compared to assessments by an optometrist [yedidya2007automatic]. Application of Markov random fields to label pixels based on degree of dryness was used to estimate TBUT, resulting in an average difference of seconds compared to clinician assessments [yedidya2009enforcing]. Polynomial functions have also been used to determine dry areas, where threshold values were fine-tuned before estimation of TBUT [ramos2014automatic]. This method resulted in more than % of the videos deviating by less than seconds compared to analyses done by four experts on videos not used for training [Ramos2014CCLRU]. Taken together, these studies indicate that TBUT values obtained using automatic methods are within an acceptable range compared to experts. However, we only found four studies, all of them including a small number of subjects. Further studies are needed to verify the findings and to test the models on external data.
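To illustrate the general idea behind these approaches (not any specific study's implementation), break-up time can be estimated by fitting a smooth curve to a per-frame dryness signal and thresholding it; the synthetic signal, polynomial degree and threshold below are all illustrative assumptions.

    # Illustrative TBUT estimation: fit a polynomial to a noisy
    # per-frame intensity signal, then find when it crosses a threshold.
    import numpy as np

    t = np.linspace(0, 15, 60)                       # seconds since last blink
    rng = np.random.default_rng(0)
    intensity = 1 / (1 + np.exp(t - 8)) + rng.normal(0, 0.02, t.size)  # synthetic decay
    smooth = np.polyval(np.polyfit(t, intensity, deg=4), t)            # fitted curve
    tbut = t[np.argmax(smooth < 0.5)]                # first time below threshold
    print(f"estimated TBUT: {tbut:.1f} s")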

4.3 Interferometry and slit-lamp images

Interferometry is a useful tool that gives a snapshot of the status of the tear film lipid layer, which can be used to aid diagnosis of DED. Machine learning systems have been applied to interferometry and slit-lamp images for lipid layer classification based on morphological properties [garcia2013new; remeseiro2014tearfilm; remeseiro2016CASDES; remeseiro2016iDEAS; peteiro2017evaluation; dacruz2020interferometer; dacruz2020ripleysk], estimation of the lipid layer thickness [hwang2017image; Fu2020LLT], diagnosis of DED [Szyperski2018fractal; yabusaki2019diagnose], determination of ocular redness [Rodriguez2013Redness] and estimation of tear meniscus height [yang2019meniscus; DENG2021TearMeniscus].

Diagnosis of DED can be based on the following morphological properties: open meshwork, closed meshwork, wave, amorphous and color fringe [guillon1998lipid]. Most studies used these properties to automatically classify interferometer lipid layer images using machine learning. Garcia et al. used a K-nearest neighbors model trained to classify images, resulting in an accuracy of [garcia2013new]. Remeseiro et al. explored various support vector machine (SVM) models for use in final classification [remeseiro2014tearfilm; remeseiro2016CASDES; remeseiro2016iDEAS]. In one of the studies, the same data was used for training and testing, which is not ideal [remeseiro2016CASDES]. Another study did not report the data their system was trained on [remeseiro2016iDEAS]. Peteiro et al. evaluated images using five different machine learning models [peteiro2017evaluation]. In this study, unlike the others, the amorphous property was not included as one of the possible classifications. A simple neural network achieved the overall best performance with an accuracy of %. However, because leave-one-out cross-validation was applied, the model may have overfitted on the training data [hastie2009Unsupervised]. da Cruz et al. compared six different machine learning models and found that the random forest was the best classifier, regardless of the pre-processing steps used [dacruz2020interferometer; dacruz2020ripleysk]. The highest performance was achieved by applying Ripley's K function in the image pre-processing phase, with the Greedy Stepwise technique used simultaneously with the machine learning models for feature selection [dacruz2020ripleysk]. Since all models were evaluated with cross-validation, the system should be externally evaluated on new images before being considered for routine use in the clinic.

Hwang et al. investigated whether tear film lipid layer thickness can be used to distinguish meibomian gland dysfunction (MGD) severity groups [hwang2017image]. Machine learning was used to estimate the thickness from Lipiscanner and slit-lamp videos with promising results. Images were pre-processed, and the flood-fill algorithm and Canny edge detection were applied to locate and extract the iris from the pupil. A significant difference between two MGD severity groups was detected, suggesting that the technique could be used for the evaluation of MGD. Keratograph images can also be used to determine tear film lipid layer thickness. Comparison of two different image analysis methods using a generalized linear model showed a high correlation between the two techniques [Fu2020LLT]. The authors concluded that the simple technique was sufficient for evaluation of tear film lipid layer thickness. However, only subjects were included in the study.

The use of fractal dimension estimation techniques was investigated for feature extraction from interferometer videos for diagnosis of DED [Szyperski2018fractal]. The technique was found to be fast and had an area under the receiver operating characteristic curve (AUC) value of , compared to a value of for an established method (see Figures 3(a) and A.1 for a description of the receiver operating characteristic curve). Tear film lipid interferometer images were analysed using an SVM [yabusaki2019diagnose]. Extracted features from the images were passed to the SVM model, which classified the images as either healthy, aqueous-deficient DED or evaporative DED. The agreement between the model and a trained ophthalmologist was high, with a reported Kappa value of . The model performed best when detecting aqueous-deficient DED.

Ocular redness is an important indicator of dry eyes. Only one of the reviewed studies described an automated system for evaluation of ocular redness associated with DED [Rodriguez2013Redness]. Slit-lamp images were acquired from subjects with a history of DED. Features representing the ocular redness intensity and the horizontal vascular component were extracted with a Sobel operator. A multiple linear regression model was trained to predict ocular redness based on the extracted features. The system achieved an accuracy of %. The authors suggested that an objective system like this could replace subjective grading by clinicians in multicentre clinical studies.

The tear meniscus contains % of the aqueous tear volume [Holly1985meniscus]. Consequently, the tear meniscus height can be used as a quantitative indicator for DED caused by aqueous deficiency. When connected component labelling was applied to slit-lamp images, the Pearson's correlation between the predicted meniscus heights and an established software methodology (ImageJ [imageJref]) was high, ranging between and [yang2019meniscus]. The machine learning system was found to be more accurate than four experienced ophthalmologists. The tear meniscus height can also be estimated from keratograph images using a CNN [DENG2021TearMeniscus]. This automatic machine learning system achieved an accuracy of % and was found to be more effective and consistent than a well-trained clinician working with limited time.

Many of the studies applied an SVM as their machine learning model without testing how other types of models perform. However, three of the studies tested several types of models and found that the SVM did not perform best [peteiro2017evaluation; dacruz2020interferometer; dacruz2020ripleysk]. It is difficult to compare the studies due to their different applications and evaluation metrics. Despite promising results, most of the studies [garcia2013new; remeseiro2014tearfilm; remeseiro2016CASDES; peteiro2017evaluation; dacruz2020interferometer; dacruz2020ripleysk; hwang2017image; Fu2020LLT; Szyperski2018fractal; Rodriguez2013Redness; yang2019meniscus] did not evaluate their systems on external data. The systems should be tested on independent data before they can be considered for clinical application. Moreover, some studies were small [Rodriguez2013Redness; Fu2020LLT] or pilots [yang2019meniscus; DENG2021TearMeniscus], and the suggested models should be tested on a larger number of subjects.

4.4 In vivo confocal microscopy

IVCM is a valuable non-invasive tool used to examine the corneal nerves and other features of the cornea [Cruzat2016IVCM]. IVCM images were used in a small study to assess characteristics of the corneal subbasal nerve plexus for diagnosis of DED [Giannaccare2019SubbasalNerve]. Application of a random forest and a deep neural network [Chen2017ACCMed] gave promising results, with an AUC value of for detecting DED [Giannaccare2019SubbasalNerve]. IVCM images of corneal nerves can also be analyzed by machine learning models to estimate the length of the nerve fibers [wei2020therapeutic]. The authors used a CNN with a U-net architecture that had been pre-trained on more than IVCM images of corneal nerves. The model showed that nerve fiber length was significantly longer after intense pulsed light treatment in MGD patients, which agreed with manual annotations from an experienced investigator, with an AUC value of and a sensitivity of . High-resolution IVCM images were also used to detect obstructive MGD [Maruoka2020Meibo]. Combinations of nine different CNNs were trained and tested on the images using -fold cross-validation. Classification by the models was compared to diagnoses made by three eyelid specialists. The best performance was achieved when four different models were combined, with high sensitivity, specificity and AUC values; see Table 1. These promising results suggest that CNNs can be useful for detection and evaluation of MGD. Deep learning methods such as CNNs have the advantage that feature extraction from the images prior to analysis is not required, as this is performed automatically by the model.
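As a sketch of the kind of model combination described above (and not the reviewed study's actual code), predictions from several trained CNNs can be ensembled by averaging their predicted class probabilities:

    # Illustrative ensembling of trained classifiers by averaging
    # their predicted class probabilities (models assumed Keras-like).
    import numpy as np

    def ensemble_predict(models, images):
        probs = np.mean([m.predict(images) for m in models], axis=0)
        return probs.argmax(axis=1)  # most probable class per image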

IVCM images have been investigated for changes in immune cells across different severities of DED for diagnostic purposes [aggarwal2021immunecell]. A generalized linear model showed significant differences in dendritic cell density and morphology between DED patients and healthy individuals, but not between the different DED subgroups; see Table 1. While results using machine learning to interpret IVCM images are promising, larger clinical studies are needed to validate the findings before clinical use can be considered.

4.5 Meibography

The meibomian glands are responsible for producing meibum, important for protecting the tear fluid from evaporation. Reduced secretion of meibum due to a reduced number of functional meibomian glands and/or obstruction of the ducts is a major cause of evaporative DED and MGD. Classification of meibomian glands using meibography is routine for experienced experts, but this is not the case for all clinicians. Moreover, automatic methods can be faster than human assessment.

Meibography images may require several pre-processing steps before they can be classified. One study trained an SVM on features extracted from the images [koh2012detection]. Pre-processing included the dilation, flood-fill, skeletonization and pruning algorithms. The model achieved a sensitivity of and specificity of . However, in contrast to all the other image analysis methods, this method is not completely automatic, as the images need to be manipulated manually before they are passed on to the system.

A combination of Otsu's method and the skeletonization and watershed algorithms was useful in automatically quantifying meibomian glands [Koprowski20161Meibo]. This method was faster than an ophthalmologist and achieved a sensitivity and specificity of and , respectively. Another automatic method applied Bézier curve fitting as part of the analysis [Koprowski2017Meibo]. The reported sensitivity was , while the specificity was . Xiao et al. sequentially applied a Prewitt operator, Graham scan, fragmentation and skeletonization algorithms for image analysis to quantify meibomian glands [xiao2021Meibo]. The agreement between the model results and two ophthalmologists was high, with Kappa values larger than and low false positive rates (). The false negative rate was , suggesting that some glands were missed by the method. A considerable weakness of this study was that only images were used for model development, and consequently it might not work well on unseen data. Another study automatically graded MGD severity using a Sobel operator, polynomial functions, a fragmentation algorithm and Otsu's method [Llorens-QuintanaClara2019ANAA]. While the method was found to be faster, the results were significantly different from clinician assessments.
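The classical pipelines above chain standard image-processing operations; a minimal sketch of an Otsu-threshold-plus-skeletonization step using scikit-image is shown below, with a hypothetical input file.

    # Illustrative classical meibography pre-processing: Otsu threshold
    # to segment glands, then skeletonize to one-pixel-wide centrelines.
    from skimage import filters, io, morphology

    img = io.imread("meibography.png", as_gray=True)  # hypothetical image file
    glands = img > filters.threshold_otsu(img)        # binary gland mask
    skeleton = morphology.skeletonize(glands)         # gland centrelines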

Deep learning approaches were used by four studies evaluating meibomian gland features [Wang2019Meibo; Prabhu2020Meibo; yeh2021Meibo; Khan2021image]. These systems are fully automated and apply some of the latest technologies within image analysis. Wang et al. used four different CNNs to determine meibomian gland atrophy [Wang2019Meibo]. The CNNs were trained to identify meibomian gland drop-out areas and estimate the percentage atrophy in a set of images. Comparison of model predictions with experienced clinicians indicated that the best CNN (ResNet50 architecture) was superior. Yeh et al. developed a method to evaluate meibomian gland atrophy by extracting features from meibography images with a special type of unsupervised CNN before applying a K-nearest neighbors model to allocate a meiboscore [yeh2021Meibo]. The system achieved an accuracy of %, outperforming annotations by the clinical team. Moreover, hierarchical clustering of the features extracted by the CNN could reveal relationships between meibography images. Another study used a CNN to automatically assess meibomian gland characteristics [Prabhu2020Meibo]. Images from two different devices collected from various hospitals were used to train and evaluate the CNN. This is an example of uncommonly good practice, as most medical AI systems are developed and evaluated on data from only one device and/or hospital. The only study to use a GAN architecture tested it on infrared 3D images of meibomian glands in order to evaluate MGD [Khan2021image]. Comparing the model output with true labels, the performance scores were better than for state-of-the-art segmentation methods. The Pearson correlations between the new automated method and two clinicians were and .

Four of the studies did not evaluate their proposed systems on external data [Koprowski20161Meibo; Koprowski2017Meibo; Llorens-QuintanaClara2019ANAA; xiao2021Meibo]. Since the number of images used for model development was limited, the models may have overfit, and external evaluations should be performed to test how well the systems generalize to new data.

4.6 Tear osmolarity

Tear osmolarity is a measure of tear concentration, and high values can indicate dry eyes. Cartes et al. [cartes2019dry] investigated the use of machine learning to detect DED based on this test. Four different machine learning models were compared. Noise was added to the osmolarity measurements during the training phase, while the original data without noise was used for the final evaluation. The logistic regression model achieved % accuracy. However, since the models were trained and tested on the same data, the reported score is most likely not representative of how well the model generalizes to new data.

4.7 Proteomic analysis

Proteomic analysis describes the qualitative and quantitative composition of the proteins present in a sample. Grus et al. compared tear proteins in individuals with diabetic DED, non-diabetic DED and healthy controls for discrimination between the groups [grus1998clustering]. The authors used discriminant analysis and principal component analysis combined with k-means clustering. Both models achieved low accuracies when predicting all three categories. However, classification into DED and non-DED achieved accuracies of % and % for discriminant analysis and k-means clustering, respectively. In another study by the same group, tear proteins analyzed using deep learning discriminated subjects as healthy or having DED with an accuracy of [grus1999analysis]. An accuracy of % was achieved using discriminant analysis. A combination of discriminant analysis for detecting the most important proteins and a deep neural network for classification was also investigated [grus2005seldi]. High accuracy, sensitivity and specificity were reported. Discriminant analysis was also used by Gonzalez et al. in an analysis of the tear proteome [Gonzalez2014Protein]. The most important proteins were selected to train an artificial neural network to classify tear samples as aqueous-deficient DED, MGD or healthy. The model gave an overall accuracy of %. Principal component analysis yielded good separation of healthy controls, aqueous-deficient DED and MGD data points, indicating that the proteins were good candidates for classification of the three conditions. This system achieved the highest accuracy of all the reviewed proteomic studies. Considered together, the results from the four studies [grus1998clustering; grus1999analysis; grus2005seldi; Gonzalez2014Protein] suggest that neural networks, applied alone or together with other techniques, perform better than discriminant analysis for detecting DED-related protein patterns in the tear proteome.
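A minimal sketch of the PCA-plus-k-means pipeline described above, applied to a hypothetical matrix of protein intensities (rows = tear samples, columns = proteins), might look as follows; the data shape and the number of clusters are illustrative assumptions.

    # Illustrative PCA + k-means pipeline for protein-intensity data.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).random((30, 200))   # hypothetical intensities
    scores = PCA(n_components=5).fit_transform(X)    # compress correlated proteins
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)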

Jung et al. used a network model based on modularity analysis to describe the tear proteome with respect to immunological and inflammatory responses related to DED [jung2017proteomic]. In this study, patterns in tears and lacrimal fluid were investigated in patients with DED. Since only subjects were included, the study should be repeated on a larger cohort of patients to verify the results.

4.8 Optical coherence tomography

Thickening of the corneal epithelium can be a sign of abnormalities in the cornea. Moreover, corneal thickness could potentially be a marker for DED. Kanellopoulos et al. developed a linear regression model to look for possible correlations between corneal thickness metrics measured using anterior segment optical coherence tomography (AS-OCT) and DED [kanellopoulos2014invivo]. However, neither the model predictions nor its performance were reported, making it difficult to assess the usefulness of the study. The type of instrument used to determine the corneal thickness was found to affect the results [fujimoto2020comparison]. Measurements from AS-OCT and Pentacam were compared, and multivariable regression was used to detect differences between the two techniques regarding the measured central corneal thickness and the thinnest corneal thickness. Individuals with mild DED, severe DED and healthy subjects were examined. The two techniques gave significantly different results in terms of the resulting β-coefficients in the multivariable regression model for individuals with severe DED. Images from clinical examinations with AS-OCT were used to diagnose DED [ELSAWY2021252]. A pretrained VGG19 CNN [vgg2017Elsawy] was fine-tuned using separate images for training and validation. Two similar CNN models were developed, and evaluation was performed on an external test set. Both achieved impressively high performance scores. The AUC values were and . This is one of only two studies in this review that used an independent test set after model development. Such practice is essential for a realistic impression of how well the model generalizes to new data not used during model development. The good performance is likely linked to the large amount of training data ( images), which is essential for deep learning methods. Most of the reviewed studies use significantly smaller data sets, which constitutes a disadvantage. Stegmann et al. analysed OCT images from healthy subjects for automatic detection of the lower tear meniscus [Stegmann2020TearMeniscus]. Two different CNNs were trained and evaluated using -fold cross-validation. The tear menisci detected by the models were compared to evaluations from an experienced grader. The best CNN achieved an average accuracy of %, sensitivity of and specificity of . The system is promising for fast and accurate segmentation of OCT images. However, more images from different OCT systems, including from non-healthy subjects, should be used to verify and improve the analysis.

The two studies ELSAWY2021252; Stegmann2020TearMeniscus showed that CNNs can be an appropriate tool for image analysis. CNNs are likely to increase in popularity within the field of DED due to their promising results on image-related tasks, including feature extraction.

4.9 Other clinical tests

Machine learning models have been used to analyse results from a variety of clinical tests in order to expand understanding of the DED disease process mathers2004cluster. The study included subjects with DED as well as healthy subjects, and subjective cutoff values from clinical tests were used to assign subjects to the DED class. Hierarchical clustering and a decision tree were applied sequentially to group the subjects based on their clinical test results, and the resulting groups were compared to the original groups (a sketch of this two-stage idea follows below). Because the analysis was based on objective measurements, it could be used to develop more objective diagnostic criteria, which in turn could lead to earlier detection and more effective treatment of DED.
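A minimal sketch of clustering followed by a descriptive decision tree, assuming invented clinical test data and an arbitrary cluster count:

    # Illustrative sketch (not the original pipeline): hierarchical
    # clustering groups subjects by clinical test results, then a decision
    # tree is fit to describe the clusters via the original features.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.tree import DecisionTreeClassifier

    X = np.random.rand(80, 6)                 # 80 subjects x 6 clinical tests (dummy)

    Z = linkage(X, method='ward')             # agglomerative clustering
    clusters = fcluster(Z, t=3, criterion='maxclust')

    tree = DecisionTreeClassifier(max_depth=3).fit(X, clusters)
    print(tree.feature_importances_)          # which tests drive the grouping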

4.10 Population surveys

Population surveys can provide valuable insight regarding the prevalence of DED and help detect risk factors for developing the disease. Japanese visual display terminal workers were surveyed with the objective of detecting DED Kaido2015computer. Dry eye exam data and subjective reports were used for diagnosis, and the resulting data were passed to a discriminant analysis model. Compared to diagnosis by a dry eye specialist, the model showed a high sensitivity of , but a low specificity of . Such a low specificity is not necessarily a problem if the aim is to detect as many cases of DED as possible and misclassification of healthy individuals is of less concern. Data from a national health survey were analysed in order to detect risk factors for DED nam2020explanatory. Here, individuals were regarded as having DED if they had been diagnosed by an ophthalmologist and were experiencing dryness. Feature modifications were performed by a decision tree, the most important features were selected using the lasso, and β-coefficients from a logistic regression trained on the selected features were used to rank them (a sketch of this pipeline is given below). Women, individuals who had undergone refractive surgery, and those with depression were found to have the highest risk of developing DED. Even though the models in this study were trained on data from more than participants, the reported performance scores were among the poorest in this review, with a sensitivity of and a specificity of . A possible reason could be that the selected features were not ideal for detecting DED. However, the detected risk factors have previously been shown to be associated with DED matossian2019dry; dartt2004dysfunctional; wan2016depression. The findings suggest that data quality in population surveys might not be as high as in other types of studies, which could lead to misinterpretation by the machine learning model.
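A hedged sketch of such a lasso-then-logistic-regression pipeline, with invented survey features and labels standing in for the real data:

    # Hedged sketch of lasso-based feature selection followed by logistic
    # regression, loosely in the spirit of nam2020explanatory; all data
    # and dimensions are hypothetical.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LassoCV, LogisticRegression
    from sklearn.feature_selection import SelectFromModel

    X = np.random.rand(500, 20)               # 500 respondents x 20 survey features (dummy)
    y = np.random.randint(0, 2, 500)          # 1 = DED, 0 = no DED (dummy labels)

    X_scaled = StandardScaler().fit_transform(X)

    # The lasso shrinks uninformative coefficients to zero; keep the survivors.
    selector = SelectFromModel(LassoCV(cv=5)).fit(X_scaled, y)
    X_selected = selector.transform(X_scaled)

    # Rank the selected features by the magnitude of their coefficients.
    logreg = LogisticRegression().fit(X_selected, y)
    print(np.argsort(-np.abs(logreg.coef_[0])))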

The association between DED and dyslipidemia was investigated by combining data from two population surveys in Korea choi2020association. A generalized linear model was used to investigate linear relationships between features and the severity of DED. The model showed significant increases in age, blood pressure and prevalence of hypercholesterolemia over the range from no DED to severe DED. Evaluation of the association between dyslipidemia and DED using linear regression showed that the odds ratio was higher for men with dyslipidemia than for men without dyslipidemia. This association was not found in women. The study results suggest a positive association between DED and dyslipidemia in men, but not in women.

4.11 Future perspectives

In order to benchmark existing and future models, we advocate that the field of DED establish a common, centralised and openly available data set for testing and evaluation. The data should be fully representative of the relevant clinical tests. To ensure that models are applicable to all populations of patients, medical institutions and types of equipment around the world, they must be evaluated on data from different demographic groups across several clinics and, where relevant, from different medical devices. Moreover, the test data set should not be available during model development, but only for final evaluation. A common standard for these processes would increase the reproducibility and comparability of studies. In addition, a cross-hospital/centre data set would address important challenges of applying AI in clinical practice, such as metrics not reflecting clinical applicability, difficulties in comparing algorithms, and underspecification. These have all been identified as being among the main obstacles to adoption of any medical AI system in clinical practice d2020underspecification; kelly2019key.

A possible challenge regarding implementation in the clinic is that hospitals do not necessarily use the same data platforms, which might prevent widespread use of machine learning systems. Consequently, solutions for implementing digital applications across hospitals should be considered.

Model explanations are important in order to understand why a complex machine learning model produces a certain prediction. For healthcare providers to trust these systems and adopt them in the clinic, the systems should provide understandable and sound explanations of their decision-making process. Moreover, such explanations could assist clinicians in making medical decisions lundberg2018hypoxemia. When developing new machine learning systems within DED, effort should be made to present the workings of the resulting models and their predictions in an easy-to-interpret fashion.
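To make this concrete, the sketch below applies SHAP, the explanation technique used in lundberg2018hypoxemia, to a hypothetical tree-based DED classifier; the model, features and data are invented stand-ins:

    # Hedged sketch: SHAP values attribute each prediction to the input
    # features. Model, features and data here are hypothetical.
    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestClassifier

    X = np.random.rand(200, 5)                # 200 patients x 5 clinical features (dummy)
    y = np.random.randint(0, 2, 200)          # 1 = DED, 0 = healthy (dummy labels)

    model = RandomForestClassifier(random_state=0).fit(X, y)

    # TreeExplainer computes SHAP values efficiently for tree ensembles.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # Depending on the shap version, shap_values is a list (one array per
    # class) or a single array; either way, attributions with large
    # magnitude mark the features driving the model's predictions.
    print(np.abs(np.asarray(shap_values)).mean())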

5 Conclusions

We observed a large variation in the types of clinical tests and data used in the reviewed studies. This also holds for the extent of pre-processing applied to the data before it is passed to the machine learning models. The studies analysing images can be divided into those applying deep learning techniques directly to the images, and those performing extensive pre-processing and feature extraction before passing the data to the machine learning model in tabular format. The number of studies belonging to the first group has increased significantly in recent years. As deep learning techniques become more established, they will probably replace more traditional image pre-processing and feature extraction techniques.

We noted a lack of consensus regarding how best to perform model development, including evaluation. This made it difficult to estimate how well some models would perform in the clinic with new patients, and to compare the different models. Comparison was further complicated by the use of different types of performance scores. In addition, there was no culture of data and code sharing, which makes reproduction of the results impossible. In the future, focus should be put on establishing data and code sharing as standard procedure.

In conclusion, the results from the reviewed studies’ machine learning models are promising, although much work is still needed on model development, clinical testing and standardisation. AI has high potential for use in many different applications related to DED, including automatic detection and classification of DED, investigation of the etiology and risk factors for DED, and detection of potential biomarkers. Effort should be made to create common guidelines for the model development process, especially regarding model evaluation. Prospective testing is recommended in order to evaluate whether proposed models can improve the diagnosis of DED, and the health and quality of life of patients with DED.

Disclosure

The authors report no conflicts of interest.

References

Appendix A Supporting information

A.1 Performance scores used

If there are two categories, the task is referred to as binary classification, while more than two categories gives a multi-class task. For binary classification, the true outcome belongs to one of two categories, e.g., healthy or ill, often referred to as positive (P) or negative (N). A binary classifier assigns new data instances to these two categories, and each prediction can be either true (T), meaning correct, or false (F), meaning incorrect. Every outcome then belongs to one of the four categories true positive (TP), true negative (TN), false positive (FP) and false negative (FN), and these sum to the total number of instances in the data set. From these, we can calculate a variety of performance scores, some of which are listed in Section 2.5. We provide mathematical expressions for these below, followed by the remaining performance scores encountered in the reviewed studies.

Positive predictive value $= \frac{TP}{TP + FP}$ (A.1)
Negative predictive value $= \frac{TN}{TN + FN}$ (A.2)
Accuracy $= \frac{TP + TN}{TP + TN + FP + FN}$ (A.3)
Sensitivity $= \frac{TP}{TP + FN}$ (A.4)
Precision $= \frac{TP}{TP + FP}$ (A.5)
Specificity $= \frac{TN}{TN + FP}$ (A.6)
F1 score $= \frac{2TP}{2TP + FP + FN}$ (A.7)
False positive rate $= \frac{FP}{FP + TN}$ (A.8)
False negative rate $= \frac{FN}{FN + TP}$ (A.9)
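For concreteness, the sketch below evaluates eqs. (A.1)-(A.9) directly from the four confusion-matrix counts; it is an illustration, not code from any reviewed study:

    # Performance scores from confusion-matrix counts, following eqs. (A.1)-(A.9).
    def binary_scores(tp, tn, fp, fn):
        precision = tp / (tp + fp)            # also the positive predictive value
        sensitivity = tp / (tp + fn)          # also called recall
        return {
            "positive predictive value": precision,
            "negative predictive value": tn / (tn + fn),
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": sensitivity,
            "precision": precision,
            "specificity": tn / (tn + fp),
            "F1 score": 2 * precision * sensitivity / (precision + sensitivity),
            "false positive rate": fp / (fp + tn),
            "false negative rate": fn / (fn + tp),
        }

    # Example: 40 true positives, 45 true negatives, 5 false positives, 10 false negatives.
    print(binary_scores(tp=40, tn=45, fp=5, fn=10))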

Although binary classification tasks involve assigning instances to one of two classes, e.g., 0 and 1, most machine learning classifiers can output the distance of an instance to the decision boundary, i.e., a decimal number in the interval [0, 1]. A common interpretation of this number is class probability or classification confidence, meaning that an output close to either 0 or 1 indicates a confident classification, while an output close to the classification threshold indicates that the classifier is not capable of confidently assigning the instance to a class. The classification threshold is thus the numerical value that separates the two classes, and the confusion matrix entries vary with this threshold. Unless otherwise specified, its value is usually 0.5. Here, we introduce two metrics that can be constructed by varying this threshold from 0 to 1. First, the receiver operating characteristic curve is constructed from the curves of the true and false positive rates obtained by varying the classification threshold. Optimally, the true positive rate is 1 for any threshold, while a classifier which always guesses randomly produces a diagonal line, as shown in Figure 4(a). The AUC value is calculated as the area under the receiver operating characteristic curve, and its maximum value is 1.
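A minimal worked example of this construction, using scikit-learn on invented scores:

    # Hedged sketch: a ROC curve and its AUC, obtained by sweeping the
    # classification threshold over dummy predicted scores.
    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.55])

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(roc_auc_score(y_true, y_score))     # area under the ROC curve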

There is a trade-off between precision and sensitivity: high precision minimises the number of false positives, which might come at the cost of missing positive instances, while high sensitivity minimises the number of false negatives, which can result in an increased number of false alarms. Which one should be prioritised depends on the problem at hand, and a study prioritising or reporting only one of them should argue why. Precision and sensitivity are visualised in Figure 4(b), which highlights the trade-off between the two. They can be combined into a single number by plotting them against each other for different classification threshold values and calculating the area under the resulting precision-recall curve.
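The corresponding computation for the precision-recall curve, on the same invented scores:

    # Companion sketch: the precision-recall curve and its area, again by
    # sweeping the classification threshold over dummy scores.
    import numpy as np
    from sklearn.metrics import precision_recall_curve, auc

    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.55])

    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    print(auc(recall, precision))             # area under the precision-recall curve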

Figure 4: (a) A receiver operating characteristic curve, and (b) a visual representation of the sensitivity, eq. (A.4), and the precision, eq. (A.5), highlighting the trade-off between the two.

Pearson’s correlation coefficient measures the linear correlation between two data sets, and is calculated as

$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$ (A.10)

where $r$ is the Pearson’s correlation coefficient, $x_i$ and $y_i$ are the observed values in each data set, and $\bar{x}$ and $\bar{y}$ are the mean values of each data set. The value ranges from $-1$ to $1$, where $-1$ indicates perfect negative linear correlation and $1$ perfect positive linear correlation, while $0$ indicates no linear correlation between the data. For binary classification, Pearson’s correlation coefficient takes on a simple form, referred to as the Matthews correlation coefficient mcc. It measures the correlation between the true and predicted class instances, and ranges from $-1$ to $1$. Here, $0$ indicates that the classifier guesses randomly, while $1$ and $-1$ indicate complete agreement and disagreement, respectively, between the model predictions and the true outcome. It can be calculated from the confusion matrix entries as

$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ (A.11)
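A small numeric check of eq. (A.11) against scikit-learn's implementation, on invented labels:

    # Quick numeric check of eq. (A.11) against scikit-learn.
    from math import sqrt
    from sklearn.metrics import matthews_corrcoef

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    assert abs(mcc - matthews_corrcoef(y_true, y_pred)) < 1e-12
    print(mcc)                                 # 0.5 for these labels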

The concordance correlation coefficient measures the agreement between two data sets by measuring the variation around the 45-degree concordance line through the origin Lin1989ConcordanceCorr. The value ranges between $-1$ and $1$. When the two data sets share mean and standard deviation, the concordance correlation coefficient equals the Pearson’s correlation coefficient; in all other cases, the concordance correlation coefficient will be lower than the Pearson’s correlation coefficient. The value is calculated as

$\rho_c = \frac{2\sigma_{xy}}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$ (A.12)

where $\mu_x$ and $\mu_y$ are the mean values of the two data sets $x$ and $y$, $\sigma_x^2$ and $\sigma_y^2$ are the variances of each data set, and $\sigma_{xy}$ is the covariance between the data sets Lin1989ConcordanceCorr.
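A minimal sketch of eq. (A.12) in NumPy, using population (biased) variances and covariance:

    # Minimal sketch of the concordance correlation coefficient, eq. (A.12).
    import numpy as np

    def concordance_cc(x, y):
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # population covariance
        return 2 * cov_xy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

    print(concordance_cc([1.0, 2.0, 3.0, 4.0], [1.1, 2.0, 2.9, 4.2]))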

Root mean squared error is commonly used for regression problems and represents the difference between the model predictions and the observed values. The value is calculated as

$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$ (A.13)

where $n$ is the number of instances in the data set, and $\hat{y}_i$ and $y_i$ are the model prediction and observed value for instance $i$, respectively.
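A one-line check of eq. (A.13) on invented predictions:

    # RMSE on dummy predictions, following eq. (A.13).
    import numpy as np

    y_true = np.array([3.0, 5.0, 2.5, 7.0])
    y_pred = np.array([2.8, 5.4, 2.9, 6.6])
    print(np.sqrt(np.mean((y_pred - y_true) ** 2)))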

The Kappa index measures the agreement between two raters, e.g., the model predictions and the labels during classification kappa_index. It is calculated as

$\kappa = \frac{p_o - p_e}{1 - p_e}$ (A.14)

where $p_o$ is the observed probability of agreement, which equals the accuracy defined in eq. (A.3), and $p_e$ is the expected probability of agreement due to chance, defined as

$p_e = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{N^2}$ (A.15)

where $N$ is the total number of instances. The highest possible value is $1$, representing perfect agreement, and values above 0.8 are typically regarded as excellent kappa_index. An illustration of the index values for the proportion of correct model predictions is provided in Figure 5.
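A numeric check of eqs. (A.14)-(A.15) against scikit-learn, on invented labels:

    # Sketch of Cohen's kappa from confusion-matrix counts, eqs. (A.14)-(A.15).
    from sklearn.metrics import cohen_kappa_score

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    tp, tn, fp, fn = 3, 3, 1, 1               # counts for the labels above
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n                        # observed agreement (accuracy)
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)

    assert abs(kappa - cohen_kappa_score(y_true, y_pred)) < 1e-12
    print(kappa)                               # 0.5 for these labels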

Figure 5: Kappa values for different degrees of agreement. The illustration is based on (kappaIllustration, Figure 2).

Cramér’s V measures the association between two categorical variables with more than two categories each. When each variable has exactly two categories, Cramér’s V equals the $\phi$ coefficient Akoglu2018CramerGuide. It is calculated via

$V = \sqrt{\frac{\chi^2 / n}{\min(k - 1, r - 1)}}$ (A.16)

where $\chi^2$ is the usual chi-squared statistic, $n$ is the number of instances, and $k$ and $r$ are the number of possible categories for each variable. The value ranges from $0$ to $1$, representing no and perfect correlation between the variables, respectively CramerV1945.
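A sketch of eq. (A.16) on an invented contingency table, using SciPy for the chi-squared statistic:

    # Cramér's V, eq. (A.16), from a dummy contingency table.
    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[20, 5, 10],            # rows/columns: categories of
                      [8, 15, 7]])            # the two variables (dummy counts)

    chi2, _, _, _ = chi2_contingency(table)
    n = table.sum()
    r, k = table.shape
    print(np.sqrt((chi2 / n) / min(k - 1, r - 1)))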

In hypothesis testing, the $p$-value is the probability, under a specific model, of obtaining test results at least as extreme as those observed, given that the null hypothesis $H_0$ is true. $H_0$ is commonly defined as there being no difference between two data sets, while the alternative hypothesis $H_a$ states that there is a difference. Consequently, a low $p$-value indicates that the result is unlikely under the null hypothesis, and thus strengthens the belief in $H_a$ introStatistics.

The average Pompeiu-Hausdorff distance reflects the distance between estimated values and true values in a metric space pomp_hausd. Lower values imply smaller differences between the two sets. The Pompeiu-Hausdorff distance between the subsets $X$ and $Y$ is calculated via

$d_H(X, Y) = \max\left\{ \sup_{x \in X} \inf_{y \in Y} d(x, y), \; \sup_{y \in Y} \inf_{x \in X} d(x, y) \right\}$ (A.17)
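A sketch of the symmetric distance in eq. (A.17) for finite point sets, using SciPy's directed Hausdorff distances:

    # Symmetric Pompeiu-Hausdorff distance between two dummy point sets.
    import numpy as np
    from scipy.spatial.distance import directed_hausdorff

    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    Y = np.array([[0.1, 0.1], [1.2, 0.0]])

    d = max(directed_hausdorff(X, Y)[0], directed_hausdorff(Y, X)[0])
    print(d)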

The aggregated Jaccard index is an extension of the global Jaccard index, which is also used to measure the similarity between two sample sets jaccardOrig1912. A high value indicates small differences between the sample sets. The calculation of the aggregated Jaccard index is described by Kumar et al. agg_jaccard, and Figure 6 shows a visualisation.

Figure 6: A visual representation of the Jaccard index. FN = False Negative; TN = True Negative; TP = True Positive; FP = False Positive; Jacc = Jaccard index.

For image segmentation, the support for a segmented area can be calculated as the number of pixels in the segmented area divided by the number of background pixels Stegmann2020TearMeniscus.

A.2 Measuring model uncertainty

Uncertainty estimates are useful for evaluating how certain a machine learning model is about its predictions. High uncertainty might suggest that a human expert should also have a look at the instance uncertainty_ml_med. Among the reviewed studies, some chose not to use the model predictions of DED when the predicted probabilities were too close to the classification threshold, reflecting that the model was uncertain Kaido2015computer. Others reported the standard deviation of the model performance scores yedidya2007automatic; dacruz2020interferometer; dacruz2020ripleysk; koh2012detection; xiao2021Meibo; yeh2021Meibo; Stegmann2020TearMeniscus; yabusaki2019diagnose; Rodriguez2013Redness. Some computed confidence intervals for the model performance scores ELSAWY2021252; fujimoto2020comparison; nam2020explanatory; Rodriguez2013Redness. A comprehensive discussion of quantifying uncertainty for medical machine learning models can be found in uncertainty_ml_med.
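A minimal sketch of such an abstention rule, with an assumed uncertainty band around the 0.5 threshold (the margin and probabilities below are invented):

    # Hedged sketch of a simple abstention rule: defer predictions whose
    # probability falls too close to the classification threshold.
    import numpy as np

    probs = np.array([0.93, 0.52, 0.08, 0.47, 0.85])   # dummy predicted P(DED)
    margin = 0.1                                        # assumed uncertainty band

    confident = np.abs(probs - 0.5) >= margin
    labels = (probs >= 0.5).astype(int)

    for p, keep, lab in zip(probs, confident, labels):
        print(p, "->", lab if keep else "defer to human expert")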