The Stanford Acuity Test: A Probabilistic Approach for Precise Visual Acuity Testing

06/05/2019 ∙ by Chris Piech, et al. ∙ Stanford University

Chart-based visual acuity measurements are used by billions of people to diagnose and guide treatment of vision impairment. However, the ubiquitous eye exam has no mechanism for reasoning about uncertainty and, as such, suffers from a well-documented reproducibility problem. In this paper we uncover a new parametric probabilistic model of visual acuity response based on measurements of patients with eye disease. We present a state-of-the-art eye exam which (1) reduces acuity exam error by 75% without increasing exam length, (2) knows how confident it should be, (3) can trace predictions over time and incorporate prior beliefs, and (4) provides insight for educational Item Response Theory. For patients with more serious eye disease, the novel ability to finely measure acuity from home could be a crucial part of early diagnosis. We provide a web implementation of our algorithm for anyone in the world to use.


1 Introduction

Reliably measuring a person’s visual ability is an essential component in the detection and treatment of eye diseases around the world. However, quantifying how well an individual can distinguish visual information is a surprisingly difficult task—without invasive techniques or expensive equipment to finely observe the eye, physicians rely on chart-based eye exams where patients are asked visual questions and their responses observed.

Historically, vision has been evaluated by measuring a patient's visual acuity: the letter size at which a patient can correctly identify the letter being shown to them with some target probability $p_t$ (where $p_t$ varies slightly between exams). To determine this statistic, traditional eye exams like the Snellen test march down a set of discrete letter sizes, asking the patient a fixed number of questions per row (the exact number of letters per row varies between charts) to estimate their probability of correctly identifying the letters; the letter size below which the patient's accuracy drops below $p_t$ is deemed their acuity. This approach is simple and has been used ubiquitously in the treatment of patients, yet it suffers from some notable shortcomings. Acuity exams such as these exhibit high variance in their results due to: (1) the large role that chance plays in the final diagnosis, (2) the approximation error incurred by the need to discretise letter sizes on a chart, and (3) the absence of a notion of uncertainty/confidence in the final acuity result.

The contributions of this paper are as follows. First, we present a novel parametric form for the human Visual Response Function (VRF): a function relating the size of a letter to the probability that a person identifies it correctly, which both better fits the data and tells a more compelling generative story. After demonstrating the efficacy of this parametric form on real patient data, we present an adaptive Bayesian approach to measuring a person's acuity. This involves using likelihood-weighted particles to determine a posterior distribution over an individual's acuity, coupled with an "optimistic sampling" technique for determining the next letter size to query. This approach leads to a state-of-the-art eye exam which (1) reduces acuity exam error by 75% compared to the traditional Snellen exam without increasing exam length, (2) provides robust, calibrated notions of confidence and uncertainty, and (3) can trace predictions over time and incorporate a patient's prior belief about their acuity. We also draw connections between our ideas and work in education based on Item Response Theory.

For patients with more serious eye disease, the novel ability to finely measure acuity from home could play a crucial role in early diagnosis and effective treatment. We provide a web implementation for anyone in the world to use: https://myeyes.ai

2 Background

2.1 Visual Acuity

Visual acuity is a measurement that captures a patient's visual ability in a succinct manner. It is defined as the letter size at which the patient can correctly identify which optotype (letter) is shown with some target probability $p_t$ (where $p_t$ varies slightly between exams).

2.2 Chart-based Acuity Exams

In 1862 Herman Snellen developed the now-ubiquitous eye exam: a chart is placed 6 meters (20 feet) from the patient, who attempts to identify optotypes (specifically chosen letters) of progressively smaller sizes written on different "lines," e.g. 20/20, 20/25, 20/30, etc. The goal is to find the optotype size at which the patient can no longer identify at least half of the letters on the line. To keep the exam a reasonable duration, there is a small, discrete set of lines that differ substantially in size. The Snellen chart remains the most common acuity test, but there are other charts (the logMAR ETDRS chart [6, 7], Tumbling-E, Lea, HOTV) that generally use the same procedure with a different set of optotypes [10]. They all share the same core limitations:

Guessing.

Making guesses is a critical part of an acuity exam. As the patient progresses to smaller optotypes, the letters become gradually harder to identify and the probability of a correct identification decreases. This has the predictable problem that chance plays a large role in the final acuity score. As a concrete example, imagine an optotype size at which the patient has a 0.5 probability of correctly identifying a letter. Using the binomial distribution, we can calculate that after attempting five letters of this size, there is a 50% chance that they "pass" the current line (at least 3 out of 5 correct) and a 50% chance that they do not.
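
That arithmetic is just a binomial tail probability; the short Python snippet below (illustrative only, not part of the StAT implementation) reproduces the 50/50 split.

```python
from math import comb

def pass_probability(p_correct: float, n_letters: int = 5, n_needed: int = 3) -> float:
    """Probability of identifying at least `n_needed` of `n_letters` when each
    letter is identified independently with probability `p_correct`."""
    return sum(
        comb(n_letters, k) * p_correct**k * (1 - p_correct)**(n_letters - k)
        for k in range(n_needed, n_letters + 1)
    )

print(pass_probability(0.5))  # 0.5 -- passing the line is a coin flip
```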

Discretization.

Because of the limits of printing, it is necessary to discretize acuity scores in printed eye-chart exams. This makes it hard to obtain an acuity measure more precise than the predetermined discretization. Discretization is particularly limiting for patients who need to detect a small decrease in vision, as such a change could indicate a time-sensitive need for an intervention.

Confidence.

Another limitation of all current tests is their inability to articulate their confidence in the final measured acuity. Contemporary eye exams result in a “hard" number for their acuity prediction as opposed to a “soft" probability distribution. As an example, a soft prediction can make claims such as, “there is a 75% chance that the patient’s true vision is within one line of our predicted acuity score." Current tests can only say how many letters were missed on the last line, but don’t provide probabilistic uncertainty.

Figure 1: a) ETDRS, b) Snellen and c) StAT eye exams.

2.3 Digital Acuity Challenge

Computers enable digital, adaptive eye exams. A digital acuity exam proceeds as follows: the computer chooses an optotype font size, the user responds (either correctly or incorrectly), and the computer incorporates that response to choose the next font size to present. The test continues until either a fixed number of letters has been shown or the model has determined an acuity score with sufficient accuracy. A digital exam has two potential advantages: (1) a computer can draw optotypes at any continuous size, and (2) a computer can adaptively choose the next letter size to show.

The digital acuity challenge is to develop a policy for a digital eye exam that can home in on a patient's true acuity statistic as quickly and as accurately as possible.

In this paper, we present a new digital visual acuity algorithm with the following novel additions:

  1. Uses a new parametric form of the human Visual Response Function (VRF).

  2. Uses a posterior sampling algorithm to trade off between exploration and greedy search.

  3. Returns a soft inference prediction of the patient's acuity, enabling us to represent our confidence in the result.

  4. Accepts a patient's prior belief of their acuity, or alternatively, traces their vision over time.

  5. Incorporates "slip" estimation for unintended mistakes in the eye test process.

Each of these additions leads to a more precise, accurate, and descriptive acuity exam. We measure the improvement from each individual component as well as the benefit achieved by combining all the ideas. The different acuity exams discussed so far can be seen in Figure 1.

2.4 Prior Work

The current state-of-the-art digital optotype size discrimination exam, the Freiburg Acuity Contrast Test (FrACT), was first developed in 1996 and has been used successfully in medical contexts since [3]. FrACT has grown in popularity and has remained relatively unchanged since its conception [2].

FrACT builds an underlying model of human visual acuity which assumes that the probability of a human correctly identifying a letter is a logistic function of the letter size $s$, governed by two parameters, $\alpha$ and $\beta$, that change from person to person:

$$P(\text{correct} \mid s) \;=\; \gamma + \frac{1 - \gamma}{1 + e^{-\beta (s - \alpha)}} \qquad (1)$$

Here, $\gamma$ is the probability that the human randomly guesses a letter correctly. When choosing the next item size, FrACT selects the item with the highest predicted probability slope. (The FrACT paper calls this the bestPEST choice; in this context, it reduces to maximum likelihood estimation. Note that the original FrACT paper uses "decimal" units, i.e. 1/visual angle, and equation (1) is the FrACT assumption written for visual-angle units.) Digital exams like FrACT work especially well for patients with low vision [11].
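
For concreteness, here is a minimal Python sketch of this logistic visual-response assumption as written in equation (1). The parameter names and example values are illustrative, not FrACT's actual identifiers or defaults.

```python
import numpy as np

def logistic_vrf(size_arcmin, alpha, beta, gamma):
    """Logistic visual response: probability of a correct identification as a
    function of optotype size (arcmin). `gamma` is the chance rate of a
    correct random guess (e.g. 1/4 for a four-choice tumbling-E test)."""
    return gamma + (1.0 - gamma) / (1.0 + np.exp(-beta * (size_arcmin - alpha)))

# A hypothetical patient whose response curve is centred near 2 arcmin.
sizes = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
print(logistic_vrf(sizes, alpha=2.0, beta=2.5, gamma=0.25))
```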

The FrACT test can be shown to reduce to Birnbaum's 3PL model, which is the basis of the Item Response Theory (IRT) literature [4]. From an IRT perspective, each letter shown to a patient is an "item." The goal of the test is to uncover the latent "ability" of a patient to see optotypes, whose "size" is a function of their difficulty. The 3PL model, the most popular of the IRT models, makes the same logistic assumption about the relationship between difficulty and probability of a correct response that is made by the FrACT algorithm. There have been several advances in IRT that may also be relevant, such as non-parametric models [8]. The improvements we develop for the eye exam in this paper are likely to be relevant to the many applications of IRT beyond ophthalmology.

2.5 Units of Acuity.

Visual acuity, and optotype size, measure the visual angle subtended at the eye by an optotype, in minutes of visual arc (arcmin). Because the semantic meaning of vision loss is better expressed on a logarithmic scale, logMAR, the $\log_{10}$ of the minimum angle of resolution, is a popular choice of units. In the Snellen chart, the visual angle is articulated via a fraction in meters (e.g. 6/6) or in feet (e.g. 20/20). The ETDRS chart represents acuity in logMAR units. In this paper we use visual angle (arcmin) as our unit space, along with its logarithm, logMAR.
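
These unit conversions are mechanical; the short snippet below (illustrative, using the standard Snellen-to-MAR convention) shows them for a 20/40 patient.

```python
import math

def snellen_to_mar(test_distance_ft: float, letter_distance_ft: float) -> float:
    """Convert a Snellen fraction (e.g. 20/40) to the minimum angle of
    resolution (MAR) in arcmin; 20/20 corresponds to 1 arcmin."""
    return letter_distance_ft / test_distance_ft

def mar_to_logmar(mar_arcmin: float) -> float:
    """logMAR is the base-10 logarithm of the MAR in arcmin."""
    return math.log10(mar_arcmin)

print(snellen_to_mar(20, 40))             # 2.0 arcmin
print(round(mar_to_logmar(2.0), 2))       # 0.3 logMAR
```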

3 Human Visual Acuity Curve

A central assumption of a visual exam is the function relating the size of a letter, $s$, to the probability that the person being tested correctly identifies it, $P(\text{correct} \mid s)$. This psychometric function is called the Visual Response Function (VRF) [3].

Figure 2:

An example of a single person’s visual response function. The logistic FrACT model is inaccurate for low probabilities. Error bars are Beta distribution standard deviation after > 500 measurements.

For a single human, with enough patience, one can precisely observe their VRF. This is very different from a typical psychometric exam where it would be unreasonable to ask a patient the same question hundreds of times. Previous studies have measured VRF curves and concluded that they are best fit by a logistic function, an assumption that was adopted by FrACT [14].

For our research, we conducted an IRB-approved experiment at the Stanford University Eye Institute and carefully measured the visual response function of patients with different vision-limiting eye diseases. We tested three charts: (1) the semi-electronic digital Snellen chart used for typical patients, (2) an ETDRS chart used for research experiments, and (3) a digital tumbling-E chart.

Patients were shown randomly selected optotypes of a fixed size until we were confident in their probability of responding correctly. We represented our uncertainty about the correct-response probability for a fixed optotype size as a Beta distribution (the conjugate prior of the Bernoulli) and continued testing until that uncertainty fell below a fixed threshold. Surprisingly, we found that the traditional assumption for the VRF, a logistic curve, struggled to fit responses to letters small enough that the patient's probability of answering correctly was no better than random chance.
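
A minimal sketch of this stopping rule, assuming a uniform Beta(1, 1) prior and a standard-deviation threshold (both illustrative choices, not necessarily the exact settings used in the study):

```python
import math

def beta_std(alpha: float, beta: float) -> float:
    """Standard deviation of a Beta(alpha, beta) distribution."""
    n = alpha + beta
    return math.sqrt(alpha * beta / (n * n * (n + 1)))

def measure_response_probability(ask_patient, std_threshold: float = 0.05) -> float:
    """Query one optotype size until the Beta posterior over the patient's
    correct-response probability is tight enough. `ask_patient()` should
    return True/False for a single presented letter."""
    alpha, beta = 1.0, 1.0            # uniform prior over the probability
    while beta_std(alpha, beta) > std_threshold:
        if ask_patient():
            alpha += 1.0              # one more correct response
        else:
            beta += 1.0               # one more incorrect response
    return alpha / (alpha + beta)     # posterior mean estimate
```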

Figure 2 shows an example of a single patient who volunteered to answer over 500 optotype questions of varying sizes (we took breaks between questions and randomized the order of letter sizes to remove confounds such as tear-film dryness, which can lead to vision decrease over the course of an exam). Based on these results, we developed a theory that posits the VRF as a mixture of two processes. For letters too small to discern, the patient was unable to see and simply guessed, succeeding at the chance rate. For letters large enough for the patient to discern information, the probability of a correct response appeared to follow an exponential function, parameterised by a location $a$ and a scale $b$, i.e. $1 - e^{-(s - a)/b}$. The resulting equation can be reparameterised with values that eye care providers find meaningful:

Floored Exponential. A floored exponential is the maximum of a constant floor and an exponential function. For visual acuity we parameterise it as:

$$f(s) \;=\; \max\!\left(\gamma,\; 1 - e^{-(s - a)/b}\right) \qquad (2)$$

where the location $a$ and scale $b$ are fixed by requiring $f(k_0) = \gamma$ and $f(k_1) = p_t$. Here $s$ is the font size of the letter being tested and $f(s)$ is the probability that the patient correctly identifies the letter. $\gamma$ is the probability of a correct answer when guessing randomly, and $k_0$ is the font size at which a patient starts to discern information. In an acuity test, we are trying to identify $k_1$, the font size at which a patient can see with probability $p_t$. $p_t$ is a constant "target probability" such that $f(k_1) = p_t$. In this paper we use $p_t = 0.8$, which means that at font size $k_1$ a patient correctly identifies letters with 80% probability.
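
A minimal Python sketch of this floored-exponential VRF under the reparameterisation above (the constraint-solving step reflects our reading of $f(k_0) = \gamma$ and $f(k_1) = p_t$; the released implementation may differ):

```python
import math

def floored_exp_vrf(s: float, k0: float, k1: float,
                    gamma: float = 0.25, p_t: float = 0.8) -> float:
    """Floored exponential visual response function.

    s     : optotype size (arcmin)
    k0    : size at which the patient starts to discern information
    k1    : acuity -- size identified correctly with probability p_t
    gamma : chance rate of a random guess (1/4 for a tumbling-E test)
    """
    # Solve for the exponential's location a and scale b so that the curve
    # passes through (k0, gamma) and (k1, p_t).
    b = (k1 - k0) / math.log((1 - gamma) / (1 - p_t))
    a = k0 + b * math.log(1 - gamma)
    return max(gamma, 1 - math.exp(-(s - a) / b))

# Example: a patient with acuity k1 = 2 arcmin (roughly 20/40 vision).
for size in [0.5, 1.0, 2.0, 4.0]:
    print(size, round(floored_exp_vrf(size, k0=1.0, k1=2.0), 3))
```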

We observed that this theory held for a range of 12 different patients with different forms of eye disease on three different test types (see Figure 5).

4 The Stanford Acuity Test (StAT)

The StAT test is a novel eye exam, named after the city in which it was invented, based on an improved model of acuity and an intelligent inference process. StAT uses the Floored Exponential as its VRF and likelihood-weighted sampling to determine the posterior over $k_1$ given the patient's responses so far. The next letter size to query is then selected by sampling from this posterior. Such an approach balances exploration and exploitation in an optimistic way, similar to Thompson sampling. We also include a probability term for the chance that a user "slips" and chooses the wrong answer.

Algorithm.

We formalize the algorithm as follows. At all times, a StAT digital eye exam keeps track of its belief over the visual acuity of the test taker based on the sequence of answers seen so far, $D_t = \{(s_i, y_i)\}_{i=1}^{t}$. Each observation is a tuple of the size $s_i$ of the letter shown to the patient and whether the letter was correctly identified, $y_i \in \{0, 1\}$.

This past data is used both to determine which letter size to query next and also to diagnose the final acuity of the patient at the end of the exam.

The StAT algorithm is formally defined in Algorithm 1.

Inputs:

  • A patient with an unknown VRF, $f$, in the Floored Exponential family.

  • A maximum number of questions, $n$, to ask the patient.

Algorithm:

  1. Initialise the belief over $(k_0, k_1)$ with the prior.

  2. For $t = 1, \dots, n$:

    a) Sample $\tilde{k}_1$ from the current belief over $k_1$.

    b) Query the patient at letter size $s_t = \tilde{k}_1$ and record whether the response was correct as $y_t$. Store $(s_t, y_t)$.

    c) Update the posterior belief over $(k_0, k_1)$ to incorporate $(s_t, y_t)$.

  3. Return the final posterior belief over $k_1$.

Algorithm 1 The Stanford Acuity Test (StAT)
Computing posterior.

The continuous distribution over the joint assignment of our two latent variables, $k_0$ and $k_1$, given a set of observations $D_t$ can be calculated by applying Bayes' rule:

$$P(k_0, k_1 \mid D_t) = \frac{P(D_t \mid k_0, k_1)\, P(k_0, k_1)}{P(D_t)} \qquad (3)$$

$$\propto P(k_0, k_1) \prod_{i=1}^{t} P(y_i \mid s_i, k_0, k_1) \qquad (4)$$

where $P(y_i = 1 \mid s_i, k_0, k_1) = f(s_i)$ and $P(y_i = 0 \mid s_i, k_0, k_1) = 1 - f(s_i)$. Recall that $f$ is given by equation (2).

Likelihood Weighting.

Exact inference of the marginalized posterior of $k_1$ given $D_t$ is:

$$P(k_1 \mid D_t) = \int_{k_0} P(k_0, k_1 \mid D_t)\, \mathrm{d}k_0$$

To the best of our knowledge this integral does not have an analytical solution. However, using likelihood weighting [13], we can sample particles from the joint posterior given by Equation (4). We first sample $k_0$ from its prior and then sample $k_1$ from its prior, weighting each particle by the likelihood of the observed responses, $P(D_t \mid k_0, k_1)$. We sample a total of 5,000 particles, which densely covers the two parameters. After drawing particles from the posterior, the values of those particles represent the distribution, and as such these particles approximate a soft belief about acuity over the continuous range of possible acuity scores.

We don’t discard any particles for a patient between patient queries. After we receive a new datapoint , we simply re-weight each particle by multiplying their previous weight by , using the particle’s values for and . This makes the computation time of the update step grow linearly with the number of particles and constant with respect to the length of the exam.

Figure 3 shows an example of the posterior distribution over $k_1$ (the visual acuity statistic) changing over the course of one acuity exam. Initially the belief about the patient's acuity is uncertain; as the exam progresses, the posterior converges to the true acuity.

Figure 3: Our model maintains a soft belief about the posterior at each timestep in the test.
Prior over $k_1$.

This Bayesian approach requires us to provide a prior probability for $k_1$. Thanks to Bach et al. [2], we obtained over a thousand acuity scores from patients. Based on this data, we observed that the log of the acuity score was well fit by a Gumbel distribution. In acknowledgement of the fact that we cannot be sure that users of our test will come from the same distribution as that collected by FrACT, we set our Gumbel prior to be less confident (larger scale) than the best fit to that data.

Although we fit a generic prior, if a patient (or doctor) has a belief about the patient's acuity score, they can express that belief via a different Gumbel prior, $\text{Gumbel}(\mu, \beta)$, where $\mu$ is the best-guess acuity (in logMAR units) and $\beta$ reflects confidence in that guess. If a user has a reasonable idea of their vision, our acuity algorithm will be quicker and more accurate.
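
As an illustration of how such a prior could be expressed, the snippet below draws acuity samples from a Gumbel over logMAR; the location and scale values are placeholders, not the paper's fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_acuity_prior(n: int, mu_logmar: float = 0.3, beta: float = 0.15):
    """Draw acuity samples (in arcmin) from a Gumbel prior over logMAR.

    mu_logmar : best-guess acuity in logMAR (0.3 is roughly 20/40)
    beta      : Gumbel scale; smaller means a more confident prior
    """
    logmar = rng.gumbel(loc=mu_logmar, scale=beta, size=n)
    return 10.0 ** logmar                 # convert logMAR back to arcmin

print(np.median(sample_acuity_prior(10_000)))   # roughly 2.3 arcmin here
```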

Slip Probability.

Even if a user can see a letter, they sometimes get the wrong answer, either because they "slip" and accidentally provide the wrong response or because their answer is entered incorrectly. Explicitly modelling this source of uncertainty is as important for a digital eye exam as it is in traditional testing [5].

To account for this, we replace $f(s)$ with a slip-adjusted response probability $f_{\text{slip}}(s)$, where $\delta$ is the slip probability:

$$f_{\text{slip}}(s) = (1 - \delta)\, f(s)$$

We included this extension after observing that slip mistakes would lead to inaccurate predictions unless explicitly modelled (see noSlip in Table 1).

Choosing query letter.

An essential step in the intelligence of this algorithm is deciding which letter size to query next. One simple approach would be to query at the MAP estimate of $k_1$ under the current belief. Although sensible, this method suffers from being overly greedy in its search for the true acuity of the patient, an issue we observe in the performance of this variant (see greedyMAP in Table 1).

The problem with greedily using the current MAP estimate of a distribution comes up often in a different setting in artificial intelligence: the multi-armed bandit problem. Specifically, the Thompson sampling algorithm models the posterior reward distribution of each arm and samples from this distribution, rather than picking the most likely value, in an effort to balance exploration and exploitation.

We use a similar idea in our algorithm: to determine the next letter size to query, the algorithm samples from its current posterior belief over $k_1$. This means the algorithm is likely to pick exploratory letter sizes at the start, when it is less confident (high variance), and to become increasingly greedy as its confidence increases.

In contrast, the FrACT test uses a purely greedy variance-minimization strategy for choosing the next letter size. In particular, it selects the optotype size that maximizes the likelihood of the observation (and thus minimizes the variance of the acuity belief). This is a reasonable strategy, but because the test is only a fixed number of steps long, it suffers from the aforementioned problem: we found that it tends to fail at initial exploration of the space.

5 Experiments

Figure 4: (a) The tradeoff between length of exam and error for the different algorithms. (b) A visualization of the predictions made by StAT. (c) Calibration test: StAT confidences correspond to how often it is correct.

5.1 Setup

To evaluate the performance of our algorithm relative to other policies, we simulate patients by sampling parameters for the floored-exponential VRF, in a manner similar to [12]. Specifically, for all experiments we sample 1000 random patients and use them to simulate the performance of each policy. Since we know each virtual patient's true VRF (and thus their true acuity), we can simulate the exam process and measure the accuracy of each policy. Acuity scores, $k_1$, are sampled from a high-variance Gumbel with a mode of 2 arcmin. We add a small slip probability to responses.
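
A sketch of how such virtual patients could be generated; the Gumbel scale, the $k_0$ heuristic, and the slip value are illustrative guesses, since the paper only specifies a high-variance Gumbel with mode 2 arcmin and a small slip probability.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_virtual_patients(n: int = 1000, mode_arcmin: float = 2.0,
                            scale: float = 0.3, p_slip: float = 0.02):
    """Sample floored-exponential VRF parameters for simulated patients."""
    # Gumbel over log-acuity with its mode at log10(2 arcmin).
    k1 = 10.0 ** rng.gumbel(loc=np.log10(mode_arcmin), scale=scale, size=n)
    k0 = k1 * rng.uniform(0.3, 0.8, size=n)   # assumed: discernment threshold below acuity
    return [{"k0": a, "k1": b, "p_slip": p_slip} for a, b in zip(k0, k1)]

patients = sample_virtual_patients()
print(len(patients), round(float(np.median([p["k1"] for p in patients])), 2))
```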

Measuring error.

After a virtual acuity test has been run, we have two numbers: the true acuity $k_1$ of the virtual patient and the acuity $\hat{k}_1$ that the algorithm diagnosed. From these two numbers we calculate error. In this paper we use relative error, the deviation of the prediction from the true acuity expressed as a fraction of the true acuity:

$$\text{relative error} = \frac{|\hat{k}_1 - k_1|}{k_1}$$

For example, imagine a patient whose true acuity score is 2.0 arc mins (20/40). If our algorithm predicted the patient had acuity 2.5 arc mins (20/50), our prediction would have a relative error of 0.25. If instead our algorithm predicted an acuity of 2.1 arc mins (20/42), our prediction would have a relative error of 0.05.

We use relative error in place of absolute error because of the logarithmic nature of visual acuity. It is generally meaningful to say that a prediction is off by 10%. In contrast, a prediction with an absolute error of 1.0 arcmin could be a terrible prediction for a patient with perfect vision (prediction: 20/40, truth: 20/20) but a good prediction for a patient with low vision (prediction: 20/120, truth: 20/100).

5.2 Baseline Acuity Tests

We use the following baselines and prior algorithms to compare against the StAT algorithm.

Const Policy.

This policy always predicts the most common visual acuity in our data. This serves as a true null model because it doesn’t take patient responses into account at all.

Snellen and ETDRS.

The Revised 2000 Series ETDRS charts and the Traditional Snellen Eye Chart were programmed so that we could simulate their responses to different virtual patients. Both exams continue until the user answers more than half of the letters on a line incorrectly. ETDRS has a function for the predicted acuity score that takes into account both the last line passed and how many letters were read on the last line not passed. Both tests use 19 unique optotypes.

FrACT.

We use an implementation of the FrACT algorithm [3], with the help of code graciously shared by the original author. We also included the ability to learn the additional model parameter suggested by the 2006 paper [2], and verified that it improved performance.

6 Results and Evaluation

Algorithm         Acuity error   Test length
Const             0.536          0
Snellen           0.264          27
ETDRS             0.254          42
FrACT             0.212          20
StAT              0.069          20
StAT-noSlip       0.150          20
StAT-greedyMAP    0.132          20
StAT-logistic     0.125          20
StAT-noPrior      0.090          20
StAT-goodPrior    0.047          20
StAT-star         0.038          63
Table 1: Average relative error and test length for each algorithm. Except for Const, Snellen, ETDRS, and StAT-star, each test was allowed 20 letters. Results are averaged over 1000 simulated tests. Snellen and ETDRS used 19 unique optotypes.

The results of the experiments can be seen in Table 1.

Accuracy and error.

As can be seen from Table 1, the StAT test has substantially less error than all the other baselines. After 20 optotype queries, our algorithm is capable of predicting acuity with an average relative error of 0.069. This prediction is a 75% reduction in error from our implementation of the ubiquitous Snellen test (average error = 0.276), as well as a 67% reduction in error from the FrACT test (average error = 0.212). The improved accuracy of the StAT test suggests our Bayesian approach to measuring acuity is fruitful. Figure 4 (b) visualizes what StAT's small relative error means in terms of predictions. Each point in the plot is a single patient; the x-axis is the true acuity of the patient and the y-axis is the predicted acuity. We can qualitatively observe that the predictions are often accurate, that there are no truly erroneous predictions, and that the exam is similarly accurate for patients of all visual acuities.

Moreover, as seen in Figure 4 (a), StAT's significant improvement in error rate holds even when the length of the exam is increased. It is also evident that increasing exam length reduces our error rate: if we increase the exam length to 200 letters, the average error of StAT falls to 0.020. While this is highly accurate, it is far too long an exam, even for patients who need to know their acuity to high precision.

StAT Star Exam.

Our primary experiments had a fixed exam length of 20 letters. However, since our algorithm models the entire belief distribution over $k_1$, we can run an alternative test that keeps querying the patient until it has 95% confidence that the relative error is less than 0.10. We call this the StAT-star test, and it should be the preferred test for patients who want high confidence in their score.

After running StAT-star 1000 times, 95.1% of results had error less than 0.10, suggesting that the algorithm’s confidence is well calibrated. The exam is longer with an average length of 63 optotypes, but had the lowest average error of all tests: 0.038.

Improved prior.

We experimentally verified that if a user already had a reasonable understanding of their vision, they could express this as a prior and get more accurate exam results. For example, we saw that if a patient was able to guess their vision to within one line on a Snellen chart, the average error of the standard 20-question StAT test would drop to 0.051.

More optotype choices.

StAT was evaluated using four unique optotype choices (the tumbling-E optotype set). Our algorithm improved slightly as the number of optotype options increased. If we instead use 19 unique optotypes (and thus a guess probability of $\gamma = 1/19$), error drops to an average of 0.052.

Robustness to slip.

Our results proved to be quite invariant to an increase in slip probability, as long as the slip probability remained below a modest threshold; for larger slip likelihoods, our performance started to degrade.

Importance Analysis.

Since our model contributed several extensions to the state of the art, we performed an importance analysis to understand the impact of each individual decision: (1) model slip or noSlip, (2) posterior sample or greedyMAP, (3) floored exponential VRF or logistic, and (4) Gumbel prior or noPrior. For each decision we re-ran the error analysis with that decision "turned off".

Turning off any of the four decisions produced a large increase in error, suggesting that all of them contribute to a low-error test.

Turning off the decision to explicitly model "slip," the low-probability event where a patient accidentally gives a wrong answer even when they can see a letter, produced the largest increase in error. This suggests that, of our four tested decisions, it was the most important. The least important decision was using a Gumbel prior for the visual acuity parameter $k_1$.

Calibrated uncertainty.

One of the novel abilities of the StAT algorithm is that it can express its confidence in terms of probabilities.

To evaluate the reliability of the confidences computed by the StAT test, we plot a calibration curve for the algorithm (see Figure 4 (c)). We ran 10,000 StAT exams; for each run, we recorded both the final predicted value $\hat{k}_1$ and the probability, according to the algorithm, that $\hat{k}_1$ was within a relative error of 0.1 of the true acuity $k_1$. We then binned these probabilities and, for all the entries in a bin, computed the empirical fraction of times the algorithm was correct (the "empirical success rate"). We compare the predicted confidence to the empirical success rate.

For a perfectly calibrated model, this plot should look like the straight line $y = x$. As we can see in Figure 4 (c), the model's confidence is well calibrated and is thus reliable as a measure of uncertainty. The figure also shows that, after 20 questions, the algorithm often predicts an 80% probability that the relative error is within 0.1.
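
A sketch of how such a calibration curve can be computed from logged exam results (the array names are hypothetical, and the binning granularity is a free choice):

```python
import numpy as np

def calibration_curve(predicted_conf, was_correct, n_bins: int = 10):
    """Bin predicted confidences and compare each bin's mean confidence
    with its empirical success rate.

    predicted_conf : model-reported probabilities that the prediction is
                     within the error tolerance
    was_correct    : booleans, True when the prediction really was within
                     the tolerance
    """
    predicted_conf = np.asarray(predicted_conf, dtype=float)
    was_correct = np.asarray(was_correct, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(predicted_conf, edges) - 1, 0, n_bins - 1)
    mean_conf, success_rate = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            mean_conf.append(predicted_conf[mask].mean())
            success_rate.append(was_correct[mask].mean())
    # A well-calibrated model gives points close to the line y = x.
    return np.array(mean_conf), np.array(success_rate)
```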

Test/retest.

As a side remark, test/retest reliability is commonly used in the related literature as an evaluation metric for visual acuity algorithms. We argue this is a poor measure because it rewards algorithms that make the wrong prediction, as long as the wrong acuity score is repeated; two wrongs should not make a right. The retest rate of the Snellen chart is 68%, while the retest rate of the StAT test is 80%. To give an idea of how poor a measurement retest rate is, the "constant" algorithm, which predicts the same acuity for every individual, has a retest rate of 100%.

7 Discussion and Future Work

The algorithm we have presented in the paper demonstrates a promising approach to measuring the visual acuity of patients, allowing for huge improvements in precision while also providing robust notions of uncertainty. In this section, we discuss the implications of this idea, highlighting important limitations and potential future work.

7.1 Real World Considerations

Although the work here has the potential for huge impact on the way vision-related illnesses are diagnosed and treated, caution must be taken before using this algorithm in a real-world setting.

Floored Exponential Assumption.

One of the biggest assumptions in our paper is that human VRFs match the floored exponential function. Although we tested this assumption on a number of actual patients with real eye diseases and found promising fits, larger-scale clinical trials would be needed to be confident in this facet of the algorithm and to understand whether there are eye diseases for which it is not the correct parametric form. A similar limitation exists in other eye exams, for example the logistic assumption built into the FrACT exam, which is used in clinical settings. See Figure 5 for our initial results in this deeper exploration.

Peripheral vision.

A possible concern for medical practitioners in using a test like StAT involves the role peripheral vision plays in traditional eye exams. According to the literature, checking acuity with single optotypes instead of lines over-estimates true acuity due to an effect known as the crowding phenomenon [9]. If this consideration has medical significance, the scheme discussed in this paper can be used to present multiple letters at a time.

Convention.

Lastly, there is the larger question of how much the inaccuracies of traditional eye exams have permeated the medical practice of ophthalmology. Our results show that current measures of visual acuity may be inaccurate, yet people are still treated successfully for vision-related illnesses. This suggests the field may have naturally adapted to the inaccuracies of traditional exams when designing diagnoses and prescriptions. Switching to a more accurate system like StAT could therefore require a recalibration of the medical literature that was built on traditional acuity exams.

7.2 Beyond Eye Exams

The core idea behind the VRF extends beyond just visual acuity. In educational Item Response Theory, the probability of a student answering a multiple choice question correctly is also modelled in a similar manner as a sigmoid, with the input representing the easiness of the question and the output representing the probability of a correct answer from the student.

The effectiveness of the floored exponential as a model for the acuity function suggests that it may be useful in education as well. Intuitively, the generative story makes sense: when a question is absurdly difficult, the best a student can do is guess; otherwise, they possess useful information about the question which combines in an exponential manner. Exploring this model for understanding student responses to questions is an interesting future direction. (As a small aside, we have used this problem as a context for teaching probability to computer scientists, and it serves as an excellent pedagogical example of probabilistic reasoning in the real world.)

7.3 Future Work

We hope the ideas here provide a foundation for further research into improving our ability to diagnose and treat eye-related diseases. We outline some promising directions for future research.

Clinical trials.

An essential next step in demonstrating the usefulness of these ideas is to actually try them on real patients with a series of controlled trials. These experiments would provide insight into the failure modes of our approach as well as other unforeseen factors such as the cognitive load of taking a StAT acuity test. Such research, in conjunction with input from the medical community, could truly transform the ideas in this paper into an actionable reality.

Smarter letter querying.

The StAT algorithm is adaptive in nature, meaning it decides which letter size to present to the patient based on its current belief about the patient's acuity. Our current approach picks this letter by sampling from the current acuity distribution, in a manner similar to Thompson sampling. However, there is potential to investigate more intelligent ways to pick the next letter size based on the current belief. One direction we want to explore is proving optimality bounds for our approach. An orthogonal investigation would be learning a policy for picking the next letter size that optimises an objective such as minimising test length or maximising confidence.

8 Conclusion

Vision-limiting eye diseases are prevalent, affecting billions of people across the world [1]. For patients with serious eye diseases, the ability to finely measure acuity could be a crucial part in early diagnosis and treatment of vision impairment. In this paper, we present a novel algorithm based on Bayesian principles for measuring the acuity of a patient. This algorithm outperforms all prior approaches for this task while also providing reliable, calibrated notions of uncertainty for its final acuity prediction. Our approach is incredibly accurate, easy to implement, and can even be used at home on a computer. With further research and input from the medical community, we hope for this work to be used as a foundation for revolutionising the way we approach visual acuity testing for people around the world.

Acknowledgements

On a personal note, we embarked on this research because several of the authors have experienced eye disease. This work is meant as a “thank you" to the many doctors who have helped us keep our vision. We also thank Jason Ford and Ben Domingue for their guidance.

Figure 5: As part of our ongoing research we are verifying that the Floored Exponential fits different patients with different eye diseases. Here are the curves from eight patients from our study.

References

  • [1] Stevens, G. A., White, R., Flaxman, S. R., Price, H., Jonas, J. B., Keeffe, J., Leasher, J., Naidoo, K., Pesudovs, K., Resnikoff, S., Taylor, H., and Bourne, R. Global prevalence of vision impairment and blindness: magnitude and temporal trends, 1990–2010. Ophthalmology 120, 12 (2013), 2377–2384.
  • [2] Bach, M. The Freiburg Visual Acuity Test: variability unchanged by post-hoc re-analysis. Graefe's Archive for Clinical and Experimental Ophthalmology 245, 7 (2006), 965–971.
  • [3] Bach, M., et al. The Freiburg Visual Acuity Test: automatic measurement of visual acuity. Optometry and Vision Science 73, 1 (1996), 49–53.
  • [4] Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. Statistical Theories of Mental Test Scores (1968).
  • [5] Cao, J., and Stokes, S. L. Bayesian IRT guessing models for partial guessing behaviors. Psychometrika 73, 2 (2008), 209.
  • [6] National Research Council, et al. Recommended Standard Procedures for the Clinical Measurement and Specification of Visual Acuity. S. Karger, 1980.
  • [7] Ferris III, F. L., Kassoff, A., Bresnick, G. H., and Bailey, I. New visual acuity charts for clinical research. American Journal of Ophthalmology 94, 1 (1982), 91–96.
  • [8] Junker, B. W., and Sijtsma, K. Nonparametric item response theory in action: an overview of the special issue. Applied Psychological Measurement 25, 3 (2001), 211–220.
  • [9] Lalor, S. J., Formankiewicz, M. A., and Waugh, S. J. Crowding and visual acuity measured in adults using paediatric test letters, pictures and symbols. Vision Research 121 (2016), 31–38.
  • [10] Rosser, D., Laidlaw, D., and Murdoch, I. The development of a "reduced logMAR" visual acuity chart for use in routine clinical practice. British Journal of Ophthalmology 85, 4 (2001), 432–436.
  • [11] Schulze-Bonsel, K., Feltgen, N., Burau, H., Hansen, L., and Bach, M. Visual acuities "hand motion" and "counting fingers" can be quantified with the Freiburg Visual Acuity Test. Investigative Ophthalmology & Visual Science 47, 3 (2006), 1236–1240.
  • [12] Shamir, R. R., Friedman, Y., Joskowicz, L., Mimouni, M., and Blumenthal, E. Z. Comparison of Snellen and Early Treatment Diabetic Retinopathy Study charts using a computer simulation. International Journal of Ophthalmology 9, 1 (2016), 119.
  • [13] Shwe, M., and Cooper, G. An empirical analysis of likelihood-weighting simulation on a large, multiply connected medical belief network. Computers and Biomedical Research 24, 5 (1991), 453–475.
  • [14] Westheimer, G. Scaling of visual acuity measurements. Archives of Ophthalmology 97, 2 (1979), 327–330.