Precision and Recall

What are Precision and Recall?

Precision and recall are two numbers which together are used to evaluate the performance of classification or information retrieval systems. Precision is defined as the fraction of relevant instances among all retrieved instances. Recall, sometimes referred to as 'sensitivity', is the fraction of all relevant instances that were retrieved. A perfect classifier has precision and recall both equal to 1.

It is often possible to calibrate the number of results returned by a model and improve precision at the expense of recall, or vice versa.

Precision and recall should always be reported together. If a single numerical measurement of a system's performance is required, they are sometimes combined into the F-score.

Precision and Recall Formulas

Mathematical definition of precision:

$$\mathrm{precision} = \frac{tp}{tp + fp}$$

Mathematical definition of recall:

$$\mathrm{recall} = \frac{tp}{tp + fn}$$

Precision and Recall Formula Symbols Explained

tp: The number of true positives, that is, the number of instances which are relevant and which the model correctly identified as relevant.

fp: The number of false positives, that is, the number of instances which are not relevant but which the model incorrectly identified as relevant.

fn: The number of false negatives, that is, the number of instances which are relevant and which the model incorrectly identified as not relevant.

Calculating Precision and Recall

Example Calculation of Precision and Recall #1: Search engine

Imagine that you are searching for information about cats on your favorite search engine. You type 'cat' into the search bar.

The search engine finds four web pages for you. Three pages are about cats, the topic of interest, and one page is about something entirely different, and the search engine gave it to you by mistake. In addition, there are four relevant documents on the internet, which the search engine missed.

In this case we have three true positives, so tp = 3. There is one false positive, fp = 1. And there are four false negatives, so fn = 4. Note that to calculate precision and recall, we do not need to know the total number of true negatives (the irrelevant documents which were not retrieved).

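The precision is given by

$$\mathrm{precision} = \frac{tp}{tp + fp} = \frac{3}{3 + 1} = 0.75$$

and the recall is

$$\mathrm{recall} = \frac{tp}{tp + fn} = \frac{3}{3 + 4} \approx 0.43$$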
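As a minimal sketch, the search engine calculation can be reproduced in Python by treating the retrieved and relevant documents as sets. The page identifiers below are hypothetical placeholders, chosen only so that the counts match the example (four retrieved pages, three of them relevant, four relevant pages missed).

```python
# Hypothetical page identifiers; only the counts matter for the example.
retrieved = {"cat_1", "cat_2", "cat_3", "unrelated_1"}
relevant = {"cat_1", "cat_2", "cat_3", "cat_4", "cat_5", "cat_6", "cat_7"}

tp = len(retrieved & relevant)   # relevant pages that were retrieved
fp = len(retrieved - relevant)   # retrieved pages that are not relevant
fn = len(relevant - retrieved)   # relevant pages that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"tp={tp}, fp={fp}, fn={fn}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Note that the set of irrelevant pages that were not retrieved (the true negatives) never appears in the calculation.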

Example Calculation of Precision and Recall #2: Disease diagnosis

Suppose we have a medical test which is able to identify patients with a certain disease. We test 20 patients and the test identifies 8 of them as having the disease. Of the 8 identified by the test, 5 actually had the disease (true positives), while the other 3 did not (false positives). We later find out that the test missed 4 additional patients who turned out to really have the disease (false negatives).

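We can represent the 20 patients in the following confusion matrix. The remaining 20 - 5 - 3 - 4 = 8 patients are true negatives:

| | Has the disease | Does not have the disease |
| Test positive | 5 (true positives) | 3 (false positives) |
| Test negative | 4 (false negatives) | 8 (true negatives) |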

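The relevant values for calculating precision and recall are tp = 5, fp = 3, and fn = 4. Putting these values into the formulae for precision and recall, we obtain:

$$\mathrm{precision} = \frac{tp}{tp + fp} = \frac{5}{5 + 3} = 0.625$$

$$\mathrm{recall} = \frac{tp}{tp + fn} = \frac{5}{5 + 4} \approx 0.56$$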
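In practice the counts are usually tallied from per-patient labels rather than given directly. The following sketch assumes two parallel lists of booleans (the true condition and the test result) and a hypothetical helper named confusion_counts; the lists are constructed so that they match the 20 patients of the example.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Tally (tp, fp, fn, tn) from parallel lists of booleans."""
    counts = Counter(zip(actual, predicted))
    tp = counts[(True, True)]    # has disease, test positive
    fp = counts[(False, True)]   # healthy, test positive
    fn = counts[(True, False)]   # has disease, test negative
    tn = counts[(False, False)]  # healthy, test negative
    return tp, fp, fn, tn

# 20 patients: 5 true positives, 3 false positives,
# 4 false negatives, and the remaining 8 true negatives.
actual = [True] * 5 + [False] * 3 + [True] * 4 + [False] * 8
predicted = [True] * 5 + [True] * 3 + [False] * 4 + [False] * 8

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn, tn)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```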

Precision and Recall vs F-score

Usually, precision and recall scores are given together and are not quoted individually. This is because it is easy to vary the sensitivity of a model to improve precision at the expense of recall, or vice versa.

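If a single number is required to describe the performance of a model, the most convenient figure is the F-score, which is the harmonic mean of the precision and recall:

$$F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$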

This allows us to combine the precision and recall into a single number.

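If we consider either precision or recall to be more important than the other, then we can use the Fβ score, which is a weighted harmonic mean of precision and recall. This is useful, for example, in the case of a medical test, where a false negative may be extremely costly compared to a false positive. The Fβ score formula is more complex:

$$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$$

Values of β greater than 1 give more weight to recall, values less than 1 give more weight to precision, and β = 1 recovers the ordinary F-score.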
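A small sketch of the Fβ score in Python may make the effect of β concrete. The function name f_beta is an illustrative choice, and returning 0.0 when both inputs are zero is a convention assumed here rather than part of the definition. The usage lines apply it to the disease diagnosis example (precision 5/8, recall 5/9), with β = 2 standing in for a setting where false negatives are considered more costly.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention: avoid division by zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 5 / 8, 5 / 9  # disease diagnosis example

f1 = f_beta(precision, recall)           # equal weighting
f2 = f_beta(precision, recall, beta=2)   # recall weighted more heavily

print(f"F1={f1:.3f}, F2={f2:.3f}")
```

Because recall is the lower of the two values in this example, weighting it more heavily pulls the F2 score below the F1 score.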

Calculating Precision and Recall vs. F-score

For the above example of the search engine, we obtained precision of 0.75 and recall of 0.43.

Imagine that we consider precision and recall to be of equal importance for our purposes. In this case, we will use the F-score to summarize precision and recall together.

Putting the figures for the precision and recall into the formula for the F-score, we obtain:

$$F = 2 \cdot \frac{0.75 \cdot 0.43}{0.75 + 0.43} \approx 0.55$$

Note that the F-score of 0.55 lies between the recall and precision values (0.43 and 0.75). This illustrates how the F-score can be a convenient way of averaging the precision and recall in order to condense them into a single number.

Precision and Recall vs Sensitivity and Specificity

When we need to express model performance in two numbers, an alternative two-number metric to precision and recall is sensitivity and specificity. This is commonly used for medical devices, such as virus testing kits and pregnancy tests. You can often find the manufacturer's stated sensitivity and specificity for a device or testing kit printed on the side of the box, or in the instruction leaflet.

Sensitivity and specificity are defined as follows. Note that sensitivity is equivalent to recall:

$$\mathrm{sensitivity} = \frac{tp}{tp + fn}$$

$$\mathrm{specificity} = \frac{tn}{tn + fp}$$

Specificity also uses tn, the number of true negatives. This means that sensitivity and specificity use all four numbers in the confusion matrix, as opposed to precision and recall which only use three.

The number of true negatives corresponds to the number of patients identified by the test as not having the disease who indeed did not have it, or alternatively the number of irrelevant documents which the search engine correctly did not retrieve.

Taking a probabilistic interpretation, we can view specificity as the probability of a negative test given that the patient is well, while the sensitivity is the probability of a positive test given that the patient has the disease.

Sensitivity and specificity are preferred to precision and recall in the medical domain, while precision and recall are the most commonly used metrics for information retrieval. This initially seems strange, since both pairs of metrics are measuring the same thing: the performance of a binary classifier.

The reason for this discrepancy is that when we are measuring the performance of a search engine, we only care about the returned results and the relevant documents, so neither precision nor recall takes the true negatives into account. However, if we are testing a medical device, it is important to take into account the number of true negatives, since these represent the large number of patients who do not have the disease and were correctly categorized by the device.

Calculating Precision and Recall vs. Sensitivity and Specificity

Let us calculate the sensitivity and specificity for the above case of the disease diagnosis. Recalling the confusion matrix again, we have tp = 5, fp = 3, fn = 4, and tn = 8.

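Sensitivity of course comes out as the same value as recall:

$$\mathrm{sensitivity} = \frac{tp}{tp + fn} = \frac{5}{5 + 4} \approx 0.56$$

Whereas specificity gives:

$$\mathrm{specificity} = \frac{tn}{tn + fp} = \frac{8}{8 + 3} \approx 0.73$$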
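The same calculation can be sketched in a few lines of Python. The function name sensitivity_specificity is illustrative only, and the counts are those of the disease diagnosis example.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from the four confusion matrix counts."""
    sensitivity = tp / (tp + fn)  # identical to recall
    specificity = tn / (tn + fp)  # fraction of healthy patients correctly cleared
    return sensitivity, specificity

sensitivity, specificity = sensitivity_specificity(tp=5, fp=3, fn=4, tn=8)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```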

Precision and Recall vs ROC curve and AUC

Let us imagine that the manufacturer of a pregnancy test needed to reach a certain level of precision, or of specificity, for FDA approval. The pregnancy test shows one line if it is moderately confident of the pregnancy, and a double line if it is very sure. If the manufacturer decides to only count the double lines as positives, the test will return far fewer positives overall, but the precision will improve, while the recall will go down. This shows why precision and recall should always be reported together.

Adjusting threshold values like this enables us to improve either precision or recall at the expense of the other. For this reason, it is useful to have a clear view of how the false positive rate and true positive rate vary together.
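The trade-off can be illustrated with a short sketch. The scores and labels below are invented for illustration, and the helper precision_recall_at is a hypothetical name; an instance is counted as a predicted positive when its score is at or above the threshold. Reporting precision as 0.0 when nothing is predicted positive is a convention assumed here.

```python
# Invented (score, is_actually_positive) pairs, for illustration only.
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]

def precision_recall_at(threshold, scored):
    """Precision and recall when scores >= threshold count as positive."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

strict = precision_recall_at(0.85, scored)   # few positives returned
lenient = precision_recall_at(0.35, scored)  # many positives returned

print("strict threshold: ", strict)
print("lenient threshold:", lenient)
```

With this data, the strict threshold yields perfect precision but misses half of the positives, while the lenient threshold finds every positive at the cost of lower precision.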

A common visualization of this is the ROC curve, or Receiver Operating Characteristic curve. The ROC curve plots the true positive rate against the false positive rate for all values of the manually-defined threshold.

For example, if a search engine assigns a score to all candidate documents that it has retrieved, we can set the search engine to display all documents with a score greater than 10, or 11, or 12. The freedom to set this threshold value generates a smooth curve as below.

ROC curve for a binary classifier with AUC = 0.93. The orange line shows the model's true positive rate plotted against its false positive rate as the threshold varies, and the dotted blue line is the baseline of a random classifier with zero predictive power, achieving AUC = 0.5.

The area under the ROC curve (AUC) is a good metric for measuring the classifier's performance. This value is normally between 0.5 (for a useless classifier) and 1.0 (a perfect classifier). The better the classifier, the closer the ROC curve will be to the top left corner.
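As a rough sketch of how the curve and its area can be computed, the code below sweeps the threshold from the highest score to the lowest, records the (false positive rate, true positive rate) point after each step, and integrates with the trapezoidal rule. The data are invented, the function names are illustrative, and the sketch assumes that all scores are distinct and that both classes are present; tied scores would need to be grouped before a point is emitted.

```python
# Invented (score, is_actually_positive) pairs, for illustration only.
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]

def roc_points(scored):
    """ROC points (fpr, tpr) as the threshold is lowered past each score."""
    positives = sum(1 for _, y in scored if y)
    negatives = len(scored) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for _, is_positive in sorted(scored, key=lambda p: p[0], reverse=True):
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under a piecewise-linear curve, by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

points = roc_points(scored)
area = auc(points)
print(f"AUC={area:.4f}")
```

The resulting area equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, which is 13 out of 16 for this data.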

Applications of Precision and Recall

Precision and Recall in Information Retrieval

Precision and recall are best known for their use in evaluating search engines and other information retrieval systems.

Search engines must index large numbers of documents, and display a small number of relevant results to a user on demand. It is important for the user experience that as many relevant results as possible are identified, and that as few irrelevant documents as possible are displayed to the user. For this reason, precision and recall are the natural choice for quantifying the performance of a search engine, with some small modifications.

Over 90% of users do not look past the first page of results. This means that the results on the second and third pages are not very relevant for evaluating a search engine in practice. For this reason, rather than calculating the standard precision and recall, we often calculate the precision for the first 10 results and call this precision @ 10. This allows us to have a measure of the precision that is more relevant to the user experience, for a user who is unlikely to look past the first page. Generalizing this, the precision for the first k results is called the precision @ k.
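A minimal sketch of precision @ k follows, assuming the search results are given as a list of booleans in rank order (True meaning the result at that rank is relevant). The ranking is invented, the function name precision_at_k is illustrative, and dividing by k even when fewer than k results were returned is one common convention rather than the only one.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_relevance[:k]
    return sum(top_k) / k  # True counts as 1, False as 0

# Invented relevance judgments for the first ten results of one query.
ranking = [True, False, True, True, False, False, True, False, False, False]

p_at_3 = precision_at_k(ranking, 3)
p_at_10 = precision_at_k(ranking, 10)
print(f"precision@3={p_at_3:.2f}, precision@10={p_at_10:.2f}")
```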

In fact, search engine overall performance is often expressed as mean average precision, which is the average of precision @ k, for a number of k values, and for a large set of search queries. This allows an evaluation of the search precision taking into account a variety of different user queries, and the possibility of users remaining on the first results page, vs scrolling through to the subsequent results pages.
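One widely used formulation, sketched below, computes the average precision of a single query as the mean of the precision @ k values taken at each rank k where a relevant document appears, and then averages this over all queries. This is an assumption about which variant is meant: the sketch divides by the number of relevant documents found in the ranking, which matches the common definition only when every relevant document appears somewhere in the list. The rankings are invented and the function names are illustrative.

```python
def average_precision(ranked_relevance):
    """Mean of precision@k over the ranks k at which a relevant result appears."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)  # precision@rank at this hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """Average of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Invented relevance judgments for two queries, in rank order.
query_1 = [True, False, True]   # hits at ranks 1 and 3
query_2 = [False, True]         # hit at rank 2

ap_1 = average_precision(query_1)
ap_2 = average_precision(query_2)
map_score = mean_average_precision([query_1, query_2])
print(f"AP1={ap_1:.3f}, AP2={ap_2:.3f}, MAP={map_score:.3f}")
```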

Precision and Recall History

Precision and recall were first defined by the American scientist Allen Kent and his colleagues in their 1955 paper Machine literature searching VIII. Operational criteria for designing information retrieval systems.

Kent served in the US Army Air Corps in World War II, and was assigned after the war by the US military to a classified project at MIT in mechanized document encoding and search.

In 1955, Kent and his colleagues Madeline Berry, Fred Luehrs, and J.W. Perry were working on a project in information retrieval using punch cards and reel-to-reel tapes. The team found a need to be able to quantify the performance of an information retrieval system objectively, allowing improvements in a system to be measured consistently, and so they published their definition of precision and recall.

They described their ideas as a theory underlying the field of information retrieval, just as the second law of thermodynamics "underlies the design of a steam engine, regardless of its type or power rating".

Since then, the definitions of precision and recall have remained fundamentally the same, although for search engines the definitions have been modified to take into account certain nuances of human behavior, giving rise to the modified metrics precision @ k and mean average precision, which are the values normally quoted in information retrieval contexts today.

In 1979 the Dutch computer science professor Cornelis Joost van Rijsbergen recognized the problems of defining search engine performance in terms of two numbers and decided on a convenient scalar function that combines the two. He called this metric the Effectiveness function and assigned it the letter E. This was later modified to the F-score, or Fβ score, which is still used today to summarize precision and recall.


References

Jurafsky and Martin, Speech and Language Processing (2019)

Goutte and Gaussier, A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation (2005)

Van Rijsbergen, Information Retrieval (2nd ed.). Butterworth-Heinemann (1979)

Kent et al, Machine literature searching VIII. Operational criteria for designing information retrieval systems (1955)