Why Is the Harmonic Mean Used to Calculate the F1-Measure?
In this post, we will discuss why we use the harmonic mean to calculate the F1 score and what the intuition behind it is. Why can’t we just use the product of precision and recall? That would also combine the two into a single metric, and since both precision and recall lie between 0 and 1, so does their product. By the end of this post, I will answer all these questions.
Firstly, why do we even calculate the F1 score? Simply because we want to combine two metrics, precision and recall, into one. And why do we need that? So that we can compare the results of two models. The harmonic mean in particular is chosen because it punishes extreme values more.
E.g. suppose we have a few models with different precision and recall values: some have high precision but low recall, while others have high recall but low precision. How do we compare all these models? If we had a single metric that summarized both precision and recall, we could easily compare the models using that one value.
Here comes the F1 score, the harmonic mean of precision and recall. It combines both values into a single number that we can easily interpret and compare.
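Concretely, the formula is:

F1 score = 2 * (Precision * Recall) / (Precision + Recall)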
Now, why do we take the harmonic mean to calculate the F1 score? Why can’t we simply take the product of precision and recall, which would also combine them into a single metric that stays in the range of 0 to 1? The answer is that you can: there is a metric called the G score, which is the square root of the product of precision and recall, i.e., their geometric mean. But we don’t use it much in real-life problems.
The reason the harmonic mean is preferred over the product (or the geometric mean) here is that it is never higher than the geometric mean. It also tends toward the smaller of the two numbers, minimizing the impact of large values and maximizing the impact of small ones. The F-measure therefore tends to favor balanced systems.
Let’s say, for one model:

Precision = 0.8
Recall = 0.8

then:

Precision * Recall = 0.64
F1 score = 0.8

Now, for another model, let:

Precision = 0.1
Recall = 0.1

then:

Precision * Recall = 0.01
F1 score = 0.1
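These numbers are easy to verify with a few lines of plain Python. The sketch below reproduces the example above and also prints the G score for comparison; the third precision/recall pair is made up to show an imbalanced model:

```python
from math import sqrt

# F1 score: the harmonic mean of precision and recall.
def f1_score(p, r):
    return 2 * p * r / (p + r)

# G score: the geometric mean of precision and recall.
def g_score(p, r):
    return sqrt(p * r)

# The first two pairs come from the example above; the third is a
# hypothetical imbalanced model.
for p, r in [(0.8, 0.8), (0.1, 0.1), (0.9, 0.3)]:
    print(f"P={p}, R={r}: product={p * r:.2f}, "
          f"G score={g_score(p, r):.2f}, F1={f1_score(p, r):.2f}")

# P=0.8, R=0.8: product=0.64, G score=0.80, F1=0.80
# P=0.1, R=0.1: product=0.01, G score=0.10, F1=0.10
# P=0.9, R=0.3: product=0.27, G score=0.52, F1=0.45
```

Note that F1 is never above the G score, and for the imbalanced pair it is pulled toward the smaller value (0.3).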
What’s happening with the product of precision and recall is that it is sometimes significantly lower than both precision and recall, and in fact never higher than either of them (since both lie between 0 and 1).
The F1 score, on the other hand, works much better as a combined metric, especially when precision and recall are close: it then sits at or near their common value instead of dropping below it.
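To see how strongly the harmonic mean punishes extreme values, compare two hypothetical models whose precision and recall have the same arithmetic mean:

```python
def f1_score(p, r):
    return 2 * p * r / (p + r)

# Both hypothetical models average precision and recall to 0.5,
# but F1 heavily penalizes the imbalanced one.
print(f1_score(0.5, 0.5))    # 0.5
print(f1_score(0.99, 0.01))  # ~0.0198
```

A simple average would rate both models at 0.5; the harmonic mean makes it obvious that the second one is nearly useless on recall.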
For more detailed theory: https://www.cs.odu.edu/~mukka/cs795sum09dm/Lecturenotes/Day3/F-measure-YS-26Oct07.pdf
Originally published at semanticerror.com on October 18, 2019.