How to Evaluate an Object Detection Model: Explaining IoU, Precision, Recall, and mAP with Examples

Jamie Geng
8 min read · Dec 9, 2022


You have trained your first object detection model, YOLO or R-CNN, and it has predicted a set of rectangular bounding boxes. Some of them enclose the object you want to identify perfectly, but some are too big, too small, or too far from the correct location. So how do we measure the performance of an object detection model? Can we evaluate object detection models the same way we evaluate image classification models? Can we use accuracy as a metric to describe the performance of an object detection model?

Before we answer those questions, let’s first review some basic statistical terminology.

True Positive, False Positive, False Negative, and True Negative

You probably already know what True Positives, False Positives, False Negatives, and True Negatives are. But in object detection they can still be confusing.

True Positive: A true positive is simply a correct prediction. In the example below, the green bounding box represents the predicted bounding box and the yellow bounding box represents the ground truth bounding box. Although the two boxes do not overlap completely, an overlapping region this large is normally good enough. The predicted class is also correct: the model correctly identifies the object in the bounding box as a car.

A True Positive Prediction

False Positive: Now things get a little more complicated, because you can predict a bounding box wrong in more than one way, and it is important to know in which ways the predictions are wrong.

The first way to get it wrong is the location of the predicted bounding box: there is either too little overlap between the prediction and the ground truth, or no overlap at all. You can have multiple False Positives even if you only have one ground truth bounding box.

Two false positive predictions because of insufficient overlap

Another way to get a False Positive is to misclassify the object in the bounding box. In the example below, even though the predicted bounding box matches the ground truth bounding box quite closely, the predicted class is wrong: the model predicted Truck instead of Car.

Misclassifying a car as a truck: one False Positive in the image

False Negative: Every ground truth bounding box that is not correctly predicted is a False Negative. In the example below, there are two cars in the image but only one bounding box is predicted, and it fails to correctly match either of the cars. So we have two False Negatives.

Two False Negatives and One False Positive

True Negative: In object detection we don’t count True Negatives. Every region of the image that contains no object and has no predicted box would be a True Negative, and there are effectively infinitely many such regions, so the count is meaningless.

You think you understand the concepts now, but do you? Let’s try an exercise. How many true positives, false positives, and false negatives does the image below have?

In the image above, one bounding box is predicted and it covers both cats in the image. But one prediction can only be matched to one ground truth bounding box, so in this case we can have at most one true positive. The same principle applies the other way around. Let’s look at another example.

The model predicted two bounding boxes, but there is only one dog in the image. We can only pick the best-matching prediction; the other prediction is counted as a False Positive.
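To make the matching rule concrete, here is a minimal sketch in Python. The `count_tp_fp_fn` helper and all the IoU values are made up for illustration; how IoU itself is computed, and why 50% is used as the threshold, are covered in the next section.

```python
def count_tp_fp_fn(iou_matrix, iou_threshold=0.5):
    """iou_matrix[i][j] = IoU between prediction i and ground truth j,
    with predictions already sorted by descending confidence."""
    matched_gt = set()
    tp, fp = 0, 0
    for pred_ious in iou_matrix:
        # Find the best still-unmatched ground truth box for this prediction.
        best_j, best_iou = -1, 0.0
        for j, iou in enumerate(pred_ious):
            if j not in matched_gt and iou > best_iou:
                best_j, best_iou = j, iou
        if best_iou >= iou_threshold:
            matched_gt.add(best_j)  # each ground truth box can be used once
            tp += 1
        else:
            fp += 1
    num_gt = len(iou_matrix[0]) if iou_matrix else 0
    fn = num_gt - len(matched_gt)
    return tp, fp, fn

# One predicted box covering both cats: it can match only one of them.
print(count_tp_fp_fn([[0.55, 0.52]]))      # -> (1, 0, 1)
# Two predicted boxes, one dog: only the best match counts.
print(count_tp_fp_fn([[0.80], [0.60]]))    # -> (1, 1, 0)
```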

Before we start calculating the metrics, there is some more information you need to know.

IoU: IoU stands for Intersection over Union, and it measures how much two bounding boxes overlap: the area of their intersection divided by the area of their union. The formula is shown below.

Source: PyImageSearch

In our examples, we will use IoU = 50% as the acceptance criterion: if IoU is greater than or equal to 50%, we say the location prediction is good; if IoU is less than 50%, the prediction is too far away from the ground truth bounding box.
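For reference, here is one way IoU could be computed for axis-aligned boxes given as [x1, y1, x2, y2]. This is a minimal sketch, not the implementation of any particular framework:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A slightly shifted prediction can still clear the 50% threshold.
print(iou([10, 10, 110, 110], [30, 30, 130, 130]))  # ~0.47 -> rejected
print(iou([10, 10, 110, 110], [20, 20, 120, 120]))  # ~0.68 -> accepted
```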

Confidence: In the output images, you can see a number next to the predicted class. This number indicates how confident the model is that the object in the bounding box belongs to that category. In the image below, the model is 96% confident that the object in the box is a car.

Prediction with 96% confidence

Normally we set a minimum confidence for displaying a predicted bounding box, otherwise far too many boxes show up in the prediction output.

Things can get out of hand when you set the display confidence threshold too low
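In code, this display filter usually amounts to a simple comparison against a threshold. A tiny sketch, with made-up detections and a made-up 0.5 cut-off:

```python
# Each detection: (predicted class, confidence, box). Values are illustrative.
detections = [
    ("car",   0.96, [34, 50, 210, 180]),
    ("car",   0.41, [250, 60, 400, 170]),
    ("truck", 0.12, [40, 55, 215, 185]),
]

CONF_THRESHOLD = 0.5
kept = [d for d in detections if d[1] >= CONF_THRESHOLD]
print(kept)  # only the 0.96-confidence car survives
```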

Accuracy, Precision and Recall

Let’s try to answer the questions we asked at the beginning of the article. Can we use accuracy as a metric in object detection?

Accuracy is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Now it should be clear that accuracy doesn’t help us interpret the result very well: the number of True Negatives is overwhelming, so accuracy would always come out close to 100%. You have probably encountered a similar situation with imbalanced datasets, for example fraud detection. When you only have one fraud case out of thousands of cases, all you need to do is predict everything as non-fraudulent to get near-perfect accuracy.
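A quick back-of-the-envelope check with made-up numbers shows just how misleading this gets:

```python
# Hypothetical fraud-detection counts: 9,999 legitimate transactions,
# 1 fraud, and a "model" that predicts everything as non-fraudulent.
tp, fp, fn, tn = 0, 0, 1, 9999

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.9999 -- near-perfect accuracy, useless model
```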

Since True Negatives are getting in the way, it is a good idea to avoid using them in our metric calculations. That’s where Precision and Recall come in.

Precision and Recall are defined as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

If you have trouble remembering the definitions, just remember that 100% Precision means 0 False Positives (no wrong predictions) and 100% Recall means 0 False Negatives (all ground truth bounding boxes are correctly predicted). There is a trade-off between Precision and Recall: normally when precision is high, recall is low, and vice versa. We have chosen IoU ≥ 50% as our localization threshold, but we can still adjust the confidence threshold. Let’s look at another example.
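Written as functions, the two definitions are only a couple of lines. A sketch with illustrative counts:

```python
def precision(tp, fp):
    """Fraction of predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """Fraction of ground truth boxes that are found."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# 100% precision: no false positives. 100% recall: no false negatives.
print(precision(tp=5, fp=0), recall(tp=5, fn=0))   # 1.0 1.0
# High precision, low recall: a cautious model that misses objects.
print(precision(tp=2, fp=0), recall(tp=2, fn=4))   # 1.0 ~0.33
```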

In the prediction outcomes above, we found 7 bounding boxes in total, and there are 6 race cars across all the images combined. Some of the Formula 1 race cars are not detected even when the confidence threshold is very low. But as the famous racing driver Max Verstappen once said: it is what it is.

Let’s rank all the predictions in descending order based on their confidence.

We are finally ready to calculate Precision and Recall at each confidence level. We start from the prediction that has the highest confidence.

The prediction with 0.93 confidence is a correct prediction, so we have True Positive = 1, False Positive = 0, and Precision = 1. We have 6 race cars in total and only 1 is correctly identified, so Recall = 1/6.

Then we move down to the prediction with the second-highest confidence. The prediction with 0.91 confidence is also correct, so we have True Positive = 2, False Positive = 0, and Precision = 1. We have 6 race cars in total and 2 are correctly identified, so Recall = 2/6.

Then we move down to the prediction with the third-highest confidence. The prediction with 0.67 confidence is a wrong prediction because its IoU is less than 50%. So we have True Positive = 2, False Positive = 1, Total Predictions = 3, and Precision = 2/3. Still only 2 of the 6 race cars are correctly identified, so Recall remains 2/6.

We repeat the steps until we have precision and recall calculated for all predictions.
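The same bookkeeping is easy to script. The sketch below replays just the three predictions worked through above; the remaining four predictions from the ranked table are omitted, so the printed numbers only cover the first three rows:

```python
# Predictions ranked by descending confidence, with a flag for whether
# each one is correct (right class and IoU >= 50%).
ranked = [
    (0.93, True),   # correct
    (0.91, True),   # correct
    (0.67, False),  # IoU < 50% -> false positive
]
NUM_GROUND_TRUTH = 6  # six race cars across all images

tp = fp = 0
for conf, is_correct in ranked:
    if is_correct:
        tp += 1
    else:
        fp += 1
    prec = tp / (tp + fp)
    rec = tp / NUM_GROUND_TRUTH
    print(f"conf={conf:.2f}  precision={prec:.2f}  recall={rec:.2f}")

# conf=0.93  precision=1.00  recall=0.17
# conf=0.91  precision=1.00  recall=0.33
# conf=0.67  precision=0.67  recall=0.33
```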

With this information, we are now ready to plot the precision-recall curve to visualize the trade-off between the two metrics.

Precision-Recall Curve

As we can see in the plot, precision-recall curves have a distinctive saw-tooth shape: if a prediction is incorrect, recall stays the same as for the prediction ranked one position higher in confidence, but precision drops. If it is correct, both precision and recall increase, and the curve jags up and to the right. It is often useful to remove these jiggles, and the standard way to do this is with interpolated precision: the interpolated precision at a certain recall level r is defined as the highest precision found at any recall level r′ with r′ ≥ r.

Interpolated Precision — Recall Curve

The justification is that almost anyone would be prepared to look at a few more images if it increased the percentage of true positive predictions (that is, if the precision of the larger set is higher).
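That definition is short enough to sketch directly. The precision and recall values below are hypothetical, chosen only to show the saw-tooth being flattened (the first three points match our walkthrough):

```python
def interpolate(precisions, recalls):
    """Interpolated precision: at each recall level, take the highest
    precision found at any recall level greater than or equal to it."""
    return [
        max(p for p, rr in zip(precisions, recalls) if rr >= r)
        for r in recalls
    ]

# A hypothetical saw-tooth curve.
prec = [1.0, 1.0, 0.67, 0.75, 0.6]
rec  = [0.17, 0.33, 0.33, 0.5, 0.5]
print(interpolate(prec, rec))  # [1.0, 1.0, 1.0, 0.75, 0.75]
```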

Finally, we are there. With interpolated precision and recall, we can express how good our model is with one single number: the Average Precision. The general definition of Average Precision (AP) is the area under the interpolated precision-recall curve above. In some contexts, AP is calculated for each class and then averaged across classes to get mAP (mean Average Precision); in others the two terms are used interchangeably, so align with your colleagues before you do the calculation. Since we are only doing this for one class, “Race Cars”, mAP is equal to AP in our example.

We add up the area under the interpolated curve over each recall interval and we get the mAP value at IoU 50% (often written mAP@0.5).
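Here is one possible sketch of that area calculation, reusing the hypothetical values from the interpolation sketch above. Real evaluation toolkits differ in details (for example, sampling recall at fixed points or extending the curve to recall 1), so treat this as an illustration rather than a reference implementation:

```python
def average_precision(precisions, recalls):
    """Area under the interpolated precision-recall curve, summed over
    the recall intervals between consecutive points."""
    points = sorted(zip(recalls, precisions))  # sort by recall
    ap, prev_recall = 0.0, 0.0
    for r, p in points:
        # Interpolated precision at recall r: best precision at any recall >= r.
        p_interp = max(pp for rr, pp in points if rr >= r)
        ap += (r - prev_recall) * p_interp
        prev_recall = r
    return ap

prec = [1.0, 1.0, 0.67, 0.75, 0.6]
rec  = [0.17, 0.33, 0.33, 0.5, 0.5]
print(average_precision(prec, rec))  # ~0.46
```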

