F-Score

What is the F-score?

The F-score, also called the F1-score, is a measure of a model’s predictive performance on a dataset. It is used to evaluate binary classification systems, which classify examples as ‘positive’ or ‘negative’.

The F-score is a way of combining the precision and recall of the model, and it is defined as the harmonic mean of the model’s precision and recall.

The F-score is commonly used for evaluating information retrieval systems such as search engines, and also for many kinds of machine learning models, in particular in natural language processing.

It is possible to adjust the F-score to give more importance to precision over recall, or vice-versa. Common adjusted F-scores are the F0.5-score and the F2-score, as well as the standard F1-score.

F-score Formula

The standard F1-score is the harmonic mean of the precision and recall. A perfect model has an F-score of 1, and the lowest possible value is 0.

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{tp}{tp + \frac{1}{2}(fp + fn)}$$

F-score Formula Symbols Explained

precision	Precision is the fraction of true positive examples among the examples that the model classified as positive. In other words, the number of true positives divided by the number of false positives plus true positives.
recall	Recall, also known as sensitivity, is the fraction of all positive examples that the model classified as positive. In other words, the number of true positives divided by the number of true positives plus false negatives.
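tp	The number of true positives: positive examples that the model correctly classified as positive.
fn	The number of false negatives: positive examples that the model incorrectly classified as negative.
fp	The number of false positives: negative examples that the model incorrectly classified as positive.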
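
As a minimal sketch of these definitions, the following Python function computes precision, recall and the F1-score from the three counts. The function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than part of the definition.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from confusion-matrix counts."""
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts chosen only for illustration.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```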

Generalized F_β-score Formula

The adjusted F-score allows us to weight precision or recall more highly if it is more important for our use case. Its formula is slightly different:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

F_β-score Formula Symbols Explained

β	A factor indicating how much more important recall is than precision. For example, if we consider recall to be twice as important as precision, we can set β to 2. The standard F-score is equivalent to setting β to one.
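
To illustrate the effect of β, here is a short Python sketch of the generalized formula. The function name and the precision and recall values are hypothetical; returning 0.0 when precision and recall are both zero is an assumed convention.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta score: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    # Assumed convention: return 0.0 when both precision and recall are zero.
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Hypothetical model with high precision and low recall.
precision, recall = 0.9, 0.5
scores = {beta: f_beta(precision, recall, beta) for beta in (0.5, 1.0, 2.0)}
for beta, score in scores.items():
    print(f"beta={beta:g}: F = {score:.3f}")
```

Because this hypothetical model has better precision than recall, its score falls as β grows and recall is weighted more heavily.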

Calculating F-score

Example Calculation of F-score #1: Basic F-score

Let us imagine we have a tree with ten apples on it. Seven are ripe and three are still unripe, but we do not know which is which. We have an AI that is trained to recognize which apples are ripe, and whose task is to pick all the ripe apples and no unripe ones. We would like to calculate the F-score, and we consider both precision and recall to be equally important, so we will set β to 1 and use the F1-score.

The AI picks five ripe apples but also picks one unripe apple.

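We can represent the true and false positives and negatives in a confusion matrix. Here there are five true positives (ripe apples that were picked), one false positive (the unripe apple that was picked), two false negatives (ripe apples left on the tree) and two true negatives (unripe apples left on the tree).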
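
As a minimal sketch, the four confusion-matrix counts for this scenario can be tallied in Python from per-apple labels. The ordering of the apples in the two lists is a hypothetical choice; only the totals (seven ripe, three unripe, five ripe and one unripe picked) come from the example.

```python
# 1 = ripe (positive class), 0 = unripe (negative class).
# The order of the apples is hypothetical; only the totals come from the example.
actual    = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # seven ripe, three unripe
predicted = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0]  # 1 = picked by the AI

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # ripe and picked
fp = sum(a == 0 and p == 1 for a, p in pairs)  # unripe but picked
fn = sum(a == 1 and p == 0 for a, p in pairs)  # ripe but left on the tree
tn = sum(a == 0 and p == 0 for a, p in pairs)  # unripe and left on the tree

print("          picked  not picked")
print(f"ripe      {tp:6d}  {fn:10d}")
print(f"unripe    {fp:6d}  {tn:10d}")
```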

The model’s precision is the number of ripe apples that were correctly picked, divided by all apples that the model picked.

The recall is the number of ripe apples that were correctly picked, divided by the total number of ripe apples.

We can now calculate the F-score
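
Written out as a worked calculation, assuming the counts tp = 5, fp = 1 and fn = 2 that follow from the scenario above (values rounded to two decimal places):

```latex
\begin{align*}
\text{precision} &= \frac{tp}{tp + fp} = \frac{5}{5 + 1} \approx 0.83 \\
\text{recall} &= \frac{tp}{tp + fn} = \frac{5}{5 + 2} \approx 0.71 \\
F_1 &= 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
     = \frac{2 \cdot \frac{5}{6} \cdot \frac{5}{7}}{\frac{5}{6} + \frac{5}{7}}
     = \frac{10}{13} \approx 0.77
\end{align*}
```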

We recall that the F-score is the harmonic mean of precision and recall. Like the arithmetic mean, the harmonic mean of two values always lies between them, so the F-score is between the precision and recall.

Example Calculation of F-score #2: F2-score

Let us imagine that we now consider recall to be twice as important as precision in our model. We consider a convolutional neural network in the medical domain, which evaluates mammograms and detects tumors. We consider it much worse to miss a tumor than to raise a false alarm for a nonexistent tumor.

Using the same figures from the last example, let us imagine that we run the model on ten mammograms. The model detects a tumor in six of the mammograms and gives the all-clear to four mammograms.

Later we discover that of the six detected tumors, one was a false alarm, in other words, a false positive. Of the four clear mammograms, two really did contain a tumor and were false negatives.

The numbers tp, fp, tn and fn are the same as in the last example, and therefore so are the precision and recall. Since we are weighting recall as twice as important as precision, we must use the formula for the F2-score. Setting β = 2, we obtain:
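
A reconstruction of the working, assuming precision = 5/6 and recall = 5/7 as implied by tp = 5, fp = 1 and fn = 2 (result rounded to two decimal places):

```latex
\begin{align*}
F_2 &= (1 + 2^2) \cdot \frac{\text{precision} \cdot \text{recall}}{2^2 \cdot \text{precision} + \text{recall}}
     = \frac{5 \cdot \frac{5}{6} \cdot \frac{5}{7}}{4 \cdot \frac{5}{6} + \frac{5}{7}}
     = \frac{25}{34} \approx 0.74
\end{align*}
```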

Since we have weighted recall more highly, and the model has good precision but poor recall, our F-score has gone down from 0.77 to 0.74 compared to the example of the apple picker, where precision and recall were weighted equally. This shows how the F2-score can be used when the cost of a false positive is not the same as the cost of a false negative. This is a common scenario when using AI for healthcare.

Example Calculation of F-score #3: F2-score

Let us imagine we have adjusted the mammogram classifier. We test it again on another set of ten mammograms.

We find that there are now two false positives and only one false negative, while the number of true positives and true negatives remained the same.

Putting the values from the confusion matrix into the precision and recall formulas again, we get:
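
As a worked step, assuming the counts tp = 5, fp = 2 and fn = 1 described for the adjusted classifier (values rounded to two decimal places):

```latex
\begin{align*}
\text{precision} &= \frac{tp}{tp + fp} = \frac{5}{5 + 2} \approx 0.71 \\
\text{recall} &= \frac{tp}{tp + fn} = \frac{5}{5 + 1} \approx 0.83
\end{align*}
```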

The recall of our model has improved since the last example, and the precision has gone down.

We calculate the F2-score again:
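
A reconstruction of the working, assuming precision = 5/7 and recall = 5/6 for the adjusted classifier (result rounded to two decimal places):

```latex
\begin{align*}
F_2 &= (1 + 2^2) \cdot \frac{\text{precision} \cdot \text{recall}}{2^2 \cdot \text{precision} + \text{recall}}
     = \frac{5 \cdot \frac{5}{7} \cdot \frac{5}{6}}{4 \cdot \frac{5}{7} + \frac{5}{6}}
     = \frac{25}{31} \approx 0.81
\end{align*}
```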

The recall has improved at the expense of the precision, and this has caused the F2-score to improve. This is because by using the F2-score, we are prioritizing recall over precision.

It is instructive to note here that the F2-score has improved, but the model’s accuracy (the proportion of correctly classified examples) remains the same, as the model has still categorized seven examples correctly.

F-score vs Accuracy

There are a number of metrics which can be used to evaluate a binary classification model, and accuracy is one of the simplest to understand. Accuracy is defined as simply the number of correctly categorized examples divided by the total number of examples. Accuracy can be useful but does not take into account the subtleties of class imbalances, or differing costs of false negatives and false positives.

The F-score is useful:

• where false positives and false negatives carry different costs, such as in the mammogram example, since β can be chosen to weight the more costly error more heavily

• or where there is a large class imbalance, such as if only 10% of apples on trees tend to be unripe. In this case the accuracy would be misleading, since a classifier that classifies all apples as ripe would automatically get 90% accuracy but would be useless for real-life applications.

The accuracy has the advantage that it is very easily interpretable, but the disadvantage that it is not robust when the data is unevenly distributed, or where there is a higher cost associated with a particular type of error.

Calculating F-score vs Accuracy

In all three of the above examples, the classifier correctly classified seven of ten examples (apples or mammograms), and misclassified three examples.

The accuracy is defined as the ratio of correctly classified examples among all examples, so for all three cases, we have:
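
Written out, assuming five true positives and two true negatives among the ten examples in each case:

```latex
\begin{align*}
\text{accuracy} &= \frac{tp + tn}{tp + tn + fp + fn} = \frac{5 + 2}{10} = 0.7
\end{align*}
```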

However, we have seen that when the recall improved in the last example, the F2-score improved because the F2-score assigns more importance to recall than precision.

For the model with lower recall and higher precision, F2 came to approximately 0.74.

And for the model with higher recall, F2 came to approximately 0.81.

This illustrates how the accuracy as a metric is, in general, less robust and unable to capture the nuances of the different types of errors.
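
The comparison can be reproduced with a short Python sketch. The function names and dictionary labels are hypothetical; the counts are those of the two mammogram classifiers described in the examples, and F_β is computed directly from counts using the equivalent form (1 + β²)·tp / ((1 + β²)·tp + β²·fn + fp).

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all examples that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f_beta_from_counts(tp, fp, fn, beta):
    """F_beta computed directly from counts (assumes tp + fp + fn > 0)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


models = {
    "lower recall, higher precision": dict(tp=5, fp=1, fn=2, tn=2),
    "higher recall, lower precision": dict(tp=5, fp=2, fn=1, tn=2),
}

results = {}
for name, c in models.items():
    acc = accuracy(**c)
    f2 = f_beta_from_counts(c["tp"], c["fp"], c["fn"], beta=2)
    results[name] = (acc, f2)
    print(f"{name}: accuracy={acc:.2f}, F2={f2:.2f}")
```

Both classifiers receive the same accuracy, while the F2-score separates them.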

Calculating F-score vs Accuracy with a Class Imbalance

Let us imagine a tree with 100 apples, 90 of which are ripe and ten are unripe.

We have an AI which is very trigger-happy: it classifies all 100 apples as ripe and picks everything. Clearly a model which classifies all examples as positive is not of much use.

In this case, our confusion matrix would contain 90 true positives, 10 false positives, no false negatives and no true negatives.

The accuracy is the 90 correctly classified apples out of 100, or 0.9.

We can see that our model has achieved a 90% accuracy without actually making any useful decisions.

Calculating the precision and recall, we obtain
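
As a worked step, assuming tp = 90, fp = 10 and fn = 0, since every apple is classified as ripe:

```latex
\begin{align*}
\text{precision} &= \frac{tp}{tp + fp} = \frac{90}{90 + 10} = 0.9 \\
\text{recall} &= \frac{tp}{tp + fn} = \frac{90}{90 + 0} = 1
\end{align*}
```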

Putting these into the formula for F₁, we get:
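
A reconstruction of the working, assuming precision = 0.9 and recall = 1 for the classifier that picks every apple (result rounded to two decimal places):

```latex
\begin{align*}
F_1 &= 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
     = \frac{2 \cdot 0.9 \cdot 1}{0.9 + 1}
     = \frac{1.8}{1.9} \approx 0.95
\end{align*}
```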

Taking the class imbalance into account, if we suspected in advance that our model suffers from low precision, we might choose an adjusted F-score with β = 0.5 to prioritize precision:
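
A reconstruction of the working, again assuming precision = 0.9 and recall = 1 (result rounded to two decimal places):

```latex
\begin{align*}
F_{0.5} &= (1 + 0.5^2) \cdot \frac{\text{precision} \cdot \text{recall}}{0.5^2 \cdot \text{precision} + \text{recall}}
         = \frac{1.25 \cdot 0.9 \cdot 1}{0.25 \cdot 0.9 + 1}
         = \frac{1.125}{1.225} \approx 0.92
\end{align*}
```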

From this example, we can see that the accuracy is far less robust when there is a large class imbalance, and the F-score can be adjusted to take into account whether we consider precision or recall to be more important for a given task.

Applications of F-score

There are a number of fields of AI where the F-score is a widely used metric for model performance.

F-score in Information Retrieval

Information retrieval applications such as search engines are often evaluated with the F-score.

A search engine must index potentially billions of documents, and return a small number of relevant results to a user in a very short time. Typically the first page of results returned to the user only contains up to ten documents. Most users do not click through to the second results page, so it is very important that the first ten results contain relevant pages.

Originally the F₁-score was mainly used to evaluate search engines, but nowadays a calibrated F_β-score is normally preferred, as it gives finer control and allows us to prioritize precision or recall. A search engine should ideally not miss any relevant documents for a query, but should also not return a large number of irrelevant documents on the first page.

The F-score is a set-based measure, meaning that if the F-score of the first ten results is calculated, it does not take account of the relative ranking of those documents. For this reason, the F-score is often used in conjunction with other metrics, such as mean average precision, or the 11-point interpolated average precision, to get a good overview of the search engine’s performance.
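
The set-based behaviour can be illustrated with a small Python sketch. The function name, the document IDs and the relevance judgments are all hypothetical; the point is only that reordering the same ten results leaves the F-score unchanged.

```python
def f1_at_k(ranked_ids, relevant_ids, k=10):
    """Set-based F1 over the top-k results; order inside the top k is ignored."""
    retrieved = set(ranked_ids[:k])
    relevant = set(relevant_ids)
    tp = len(retrieved & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical query with four relevant documents, three of which are retrieved.
relevant = {"d1", "d2", "d3", "d4"}
good_ranking = ["d1", "d2", "d3", "d9", "d8", "d7", "d6", "d5", "d10", "d11"]
bad_ranking = list(reversed(good_ranking))  # same documents, relevant ones last

print(f1_at_k(good_ranking, relevant))
print(f1_at_k(bad_ranking, relevant))
```

Both calls print the same value, even though the first ranking is clearly more useful to a user; a rank-aware metric is needed to tell them apart.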

F-score in Natural Language Processing

There are many natural language processing applications that are most easily evaluated with the F-score. For example, in named entity recognition, a machine learning model parses a document and must identify any personal names and addresses in the text.

In biomedical sciences, named entity recognition models are often used to recognize names of proteins in documents, since these are often similar to everyday English words or abbreviations and very difficult for software to identify accurately. An example sentence would be:

Cross-linking CD 40 on B cells rapidly activates…

The model must identify that “CD 40” is the name of a protein.

The model is trained on data where individual words have been annotated as being the start of a protein, or inside one:
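
As an illustrative sketch of such an annotation, the Python snippet below tags the example sentence using the common B/I/O convention (B = start of an entity, I = inside one, O = outside). The tag names, the tokenization and the helper function are assumptions made for illustration and are not taken from a specific corpus.

```python
# Hypothetical tokenization and tags for the example sentence.
tokens = ["Cross-linking", "CD", "40", "on", "B", "cells", "rapidly", "activates"]
tags = ["O", "B-protein", "I-protein", "O", "O", "O", "O", "O"]


def extract_spans(tokens, tags):
    """Collect the text of each entity marked by a B- tag followed by I- tags."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous entity before starting a new one
                spans.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:  # entity that runs to the end of the sentence
        spans.append(" ".join(current))
    return spans


print(extract_spans(tokens, tags))
```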

When the model is run, it is possible to compare the list of true proteins (the ground truths) to the proteins recognized by the model (the predicted values).

Comparing the lists, the precision and recall can be calculated, and then the F₁, F₂, F_0.5 or other F_β-score can be chosen to evaluate the model as appropriate.
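
A minimal sketch of this comparison in Python follows. The entity lists are hypothetical, and counting a prediction as correct only when it matches a ground-truth entity exactly is an assumed matching criterion; other evaluation schemes allow partial matches.

```python
def evaluate_entities(gold, predicted, beta=1.0):
    """Return (precision, recall, F_beta) using exact matching of entities."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # entities found by the model
    fp = len(predicted - gold)   # spurious predictions
    fn = len(gold - predicted)   # entities the model missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Hypothetical data: each entity is (sentence index, entity text).
gold = [(0, "CD 40"), (1, "IL-2"), (2, "NF-kappa B")]
predicted = [(0, "CD 40"), (1, "IL-2"), (2, "B")]

p, r, f1 = evaluate_entities(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```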

F-Score History

The F-score is believed to have been first defined by the Dutch professor of computer science Cornelis Joost van Rijsbergen, who is viewed as one of the founding fathers of the field of information retrieval. In his 1979 book ‘Information Retrieval’ he defined a function very similar to the F-score, recognizing the inadequacy of accuracy as a metric for information retrieval systems.

In his book, he called his metric the Effectiveness function, and assigned it the letter E, because it “measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision”. It is not known why the F-score is assigned the letter F today.

References

Van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). Butterworth-Heinemann.

Sasaki, Y. (2007). The truth of the F-measure. https://www.cs.odu.edu/~mukka/cs795sum09dm/Lecturenotes/Day3/F-measure-YS-26Oct07.pdf