Metrics on Object Detection

gandham vignesh babu
9 min read · Jan 21, 2020

In this blog we will look at the evaluation metrics used by the most popular object detection competitions.

Before we dive into the evaluation metrics, we should know the following concepts, which are necessary to understand the mathematical details.

Those important concepts are :

  1. Precision
  2. Recall
  3. F1-score
  4. Area under curve
  5. IOU (Intersection Over Union)
  6. mAP (Mean Average Precision)

In layman's terms, precision is about how many of the detected objects are correct. That is to say, how many objects are correctly detected with respect to the class label and IOU in the image.

Recall is about how many of the objects present in the image are actually found. That is to say, it tells us how many of the ground truth objects in the image are detected (and, conversely, how many are missed) with respect to the class label and IOU.

The higher the precision, the more of the objects predicted by the model are correct with respect to classification and localization. The lower the precision, the more of the predicted objects are incorrect.

The higher the recall, the more of the objects in the image are correctly detected with respect to classification and localization. The lower the recall, the more of the objects in the image are missed.

Things to know:

  • Each object predicted over the image by the algorithm comes with a confidence score.
  • In the case of object detection, precision = TP / (total number of predicted objects).
  • In the case of object detection, recall = TP / (total number of ground truths).

These are not different from the basic definitions; it is just easier to understand them this way.

  • Region proposals are different from regions of interest.

In evaluation of object detection there are two distinct tasks to measure.

  1. Determining whether objects exist in the image (classification).
  2. Determining the location of the object (localization, a regression task).

We have a “confidence score” (or model score) with each detected bounding box, which lets us assess the model at various confidence levels.

Precision measures the ratio of the true object detections to the number of objects that the classifier predicted. If you have a precision score close to 1.0, there is a high likelihood that whatever the classifier predicts as a positive detection is a correct prediction.

Recall measures the ratio of the true object detections to the total number of ground truth objects in the dataset. If you have a recall score close to 1.0, then almost all the ground truth objects are detected by the model.
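
As a minimal sketch (assuming we already have the counts of true positives, false positives and false negatives for a class), precision, recall and the F1-score can be computed like this; the function names and counts are just illustrative:

```python
def precision(tp, fp):
    # Precision = TP / (TP + FP): fraction of predicted boxes that are correct.
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    # Recall = TP / (TP + FN): fraction of ground truth boxes that were found.
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Example: 7 correct detections, 3 spurious detections, 2 missed objects.
print(precision(7, 3), recall(7, 2), f1_score(7, 3, 2))
```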

When we need to check or visualize the performance of a multi-class classification problem, we use the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve. It is one of the most important evaluation metrics for checking any classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating Characteristic curve).

The ROC curve is plotted with TPR against the FPR where TPR is on y-axis and FPR is on the x-axis.

Figure: ROC curve, plotting TPR against FPR.
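
As a small sketch using scikit-learn (the labels and scores below are hypothetical, just to show how the curve and the area under it are obtained):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical ground truth labels and model confidence scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.5])

# roc_curve returns the FPR (x-axis) and TPR (y-axis) at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUROC is the area under the ROC curve.
print("AUROC:", auc(fpr, tpr))
```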

Intersection Over Union (IOU)

Intersection Over Union (IOU) is a measure based on the Jaccard Index that evaluates the overlap between two bounding boxes: a ground truth bounding box and a predicted bounding box. By applying the IOU we can tell if a detection is valid (True Positive) or not (False Positive).

IOU is the ratio of the area of intersection of the predicted bounding box and the ground truth bounding box to the area of their union:

IOU = area(prediction ∩ ground truth) / area(prediction ∪ ground truth)

The image below illustrates the IOU between a ground truth bounding box (in green) and a detected bounding box (in red).
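
A minimal sketch of the IOU computation, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (the example coordinates are made up):

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])

    # Width/height of the intersection rectangle (0 if the boxes do not overlap).
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Ground truth (green) vs. predicted (red) box, with hypothetical coordinates.
print(iou((10, 10, 60, 60), (30, 30, 80, 80)))
```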

True Positive, False Positive, False Negative and True Negative

Some basic concepts used by the metrics:

  • True Positive (TP): A correct detection. Detection with IOU ≥ threshold
  • False Positive (FP): A wrong detection, Detection with IOU < threshold
  • False Negative (FN): A ground truth not detected, i.e. no detection of the correct class with IOU ≥ threshold matches it
  • True Negative (TN): Does not apply. It would represent a corrected misdetection. In the object detection task there are many possible bounding boxes that should not be detected within an image. Thus, TN would be all possible bounding boxes that were correctly not detected (so many possible boxes within an image). That’s why it is not used by the metrics.

threshold: depending on the metric, it is usually set to 50%, 75% or 95%.
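
As a small illustration (assuming each detection has already been matched to its best-overlapping ground truth of the same class; the IOU values below are hypothetical):

```python
IOU_THRESHOLD = 0.5  # could also be 0.75 or 0.95 depending on the metric

# (detection_id, IOU with its best-matching ground truth of the same class)
detections = [("A", 0.82), ("B", 0.46), ("C", 0.61), ("D", 0.10)]

labels = {d: ("TP" if overlap >= IOU_THRESHOLD else "FP") for d, overlap in detections}
print(labels)  # {'A': 'TP', 'B': 'FP', 'C': 'TP', 'D': 'FP'}

# Any ground truth box left without a TP detection counts as a FN.
# TN is not used, since "everything that is not an object" is unbounded.
```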

Precision

Precision is the ability of a model to identify only the relevant objects. It is the percentage of correct positive predictions and is given by:

Precision = TP / (TP + FP) = TP / (all detections)

Recall

Recall is the ability of a model to find all the relevant cases (all ground truth bounding boxes). It is the percentage of true positives detected among all relevant ground truths and is given by:

Recall = TP / (TP + FN) = TP / (all ground truths)

For evaluating the model we will be considering the IOU score. How?

How predictions work:

  • When multiple boxes detect the same object, the box with the highest IOU is considered TP, while the remaining boxes are considered FP.
  • If the object is present and the predicted box has an IOU < threshold with the ground truth box, the prediction is considered a FP. More importantly, because no box detected it properly, the object’s class receives a FN.
  • If the object is not in the image, yet the model detects one, then the prediction is considered a FP.
  • Recall and precision are then computed for each class by applying the above-mentioned formulas, where the counts of TP, FP and FN are accumulated (a sketch of this accumulation is shown after this list).
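
Below is a rough sketch of how these rules can be accumulated for a single class in a single image. The box format, the IOU helper and the greedy highest-IOU matching are assumptions for illustration, not a specific library’s implementation:

```python
def iou(a, b):
    # a, b are (x1, y1, x2, y2) boxes.
    xa, ya = max(a[0], b[0]), max(a[1], b[1])
    xb, yb = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_detections(predictions, ground_truths, iou_threshold=0.5):
    """predictions: list of (box, confidence) for one class in one image.
    ground_truths: list of boxes for the same class. Returns (tp, fp, fn)."""
    # Process predictions from highest to lowest confidence.
    predictions = sorted(predictions, key=lambda p: p[1], reverse=True)
    matched = set()          # ground truths already claimed by a detection
    tp = fp = 0
    for box, _conf in predictions:
        # Find the unmatched ground truth with the highest IOU.
        best_iou, best_gt = 0.0, None
        for i, gt in enumerate(ground_truths):
            overlap = iou(box, gt)
            if i not in matched and overlap > best_iou:
                best_iou, best_gt = overlap, i
        if best_iou >= iou_threshold:
            tp += 1
            matched.add(best_gt)   # further detections of this object become FP
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)   # ground truths nobody detected
    return tp, fp, fn
```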

E.g.: I am performing object detection on an image with a dog and a cat. We will get the precision, recall and F1-score for each class.

For this example I will consider the threshold to be 0.5. In this example, consider that we are calculating the metrics for ***Dog***.

{ For a clear understanding, fix the ground truth class label first, then check the IOU and then the predicted class label. The ground truth class is the label for which we are calculating the metric; here it is dog.

If IOU > 0.5, the label == ground truth, and the IOU is the highest among the correct predictions -> True Positive.

If IOU > 0.5, the label == ground truth, but the IOU is not the highest among the correct predictions -> False Positive.

If IOU < 0.5 but the label == ground truth is predicted in the image -> False Positive.

If IOU > 0.5 but the label != ground truth -> False Negative.

If the object is not properly detected for the ground truth -> False Negative.

If IOU < 0.5 and the label != ground truth -> Ignore.

}

If the predicted label is not present in the test image at all, then it will be a FP for that predicted label.
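
A hedged sketch of the decision rules above for a single class (here “dog”). The inputs (the prediction’s label, its IOU with the dog ground truth, and whether it has the highest IOU among the correct predictions) are assumed to be known already:

```python
def classify_prediction(pred_label, iou_with_gt, is_best_match,
                        gt_label="dog", iou_threshold=0.5):
    # Correct label and enough overlap.
    if iou_with_gt >= iou_threshold and pred_label == gt_label:
        return "TP" if is_best_match else "FP"   # duplicate detections are FP
    # Correct label but poor localization.
    if iou_with_gt < iou_threshold and pred_label == gt_label:
        return "FP"
    # Good overlap but wrong label: the dog is effectively missed.
    if iou_with_gt >= iou_threshold and pred_label != gt_label:
        return "FN"
    # Poor overlap and wrong label: ignored for the dog metric.
    return "ignore"

print(classify_prediction("dog", 0.7, True))    # TP
print(classify_prediction("cat", 0.8, False))   # FN (for the dog metric)
```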

Get clear with this example:

References :

https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52

https://www.youtube.com/watch?v=oz2dDzsbXr8

If we look at the above 7 images, we have 14 ground truth boxes in green and 24 objects detected by the model. Each detected object has a confidence level and is identified by a letter (A, B, …, Y). The green boxes are the objects to be detected; for now, assume each object marked in green can be either a dog or a cat.

The following table shows the bounding boxes with their corresponding confidences. The last column identifies the detections as TP or FP. In this example a detection is considered a TP if its IOU ≥ 30%, otherwise it is a FP. By looking at the images above we can roughly tell if the detections are TP or FP.

In some images there are more than one detection overlapping a ground truth (Images 2, 3, 4, 5, 6 and 7). For those cases the detection with the highest IOU is considered TP and the others are considered FP. This rule is applied by the PASCAL VOC 2012 metric: “e.g. 5 detections (TP) of a single object is counted as 1 correct detection and 4 false detections”.

The Precision x Recall curve is plotted by calculating the precision and recall values of the accumulated TP or FP detections. For this, first we need to order the detections by their confidences, then we calculate the precision and recall for each accumulated detection as shown in the table below:

Recall is calculated as the accumulated TP divided by the total number of ground truth boxes for that particular class label.
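
A sketch of how the accumulated precision and recall columns of such a table can be computed, assuming each detection has already been labeled TP or FP. The detection letters, confidences and TP/FP labels below are hypothetical, not necessarily the ones in the table above:

```python
# (detection_id, confidence, is_true_positive) — hypothetical values.
detections = [("A", 0.95, True), ("B", 0.95, False), ("C", 0.91, True),
              ("D", 0.88, False), ("E", 0.84, True), ("F", 0.80, True)]
total_ground_truths = 14   # number of green boxes in the example above

# Order detections by confidence, highest first.
detections.sort(key=lambda d: d[1], reverse=True)

acc_tp = acc_fp = 0
precisions, recalls = [], []
for _, _, is_tp in detections:
    acc_tp += int(is_tp)
    acc_fp += int(not is_tp)
    precisions.append(acc_tp / (acc_tp + acc_fp))      # accumulated precision
    recalls.append(acc_tp / total_ground_truths)       # accumulated recall

print(list(zip(recalls, precisions)))
```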

Plotting the precision and recall values we have the following Precision x Recall curve:

As mentioned before, there are two different ways to measure the interpolated average precision: 11-point interpolation and interpolating all points. Below we make a comparison between them:

Calculating the 11-point interpolation

The idea of the 11-point interpolated average precision is to average the precisions at a set of 11 equally spaced recall levels (0, 0.1, …, 1). The interpolated precision at a recall level r is the maximum precision found at any recall greater than or equal to r:

p_interp(r) = max { p(r') : r' ≥ r }

AP = (1/11) * Σ p_interp(r), for r in {0, 0.1, …, 1}

By applying the 11-point interpolation, we have:
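
A minimal sketch of the 11-point interpolated AP, assuming we already have the accumulated (recall, precision) pairs; the example values below are hypothetical (assuming 14 ground truths), not the exact ones from the table above:

```python
def eleven_point_ap(recalls, precisions):
    """recalls/precisions: accumulated values, ordered by descending confidence."""
    ap = 0.0
    for r in [i / 10 for i in range(11)]:          # recall levels 0, 0.1, ..., 1.0
        # Interpolated precision: the maximum precision at any recall >= r.
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        ap += max(candidates) if candidates else 0.0
    return ap / 11

# Hypothetical accumulated values.
recalls    = [0.071, 0.071, 0.143, 0.143, 0.214, 0.286]
precisions = [1.0,   0.5,   0.667, 0.5,   0.6,   0.667]
print(eleven_point_ap(recalls, precisions))
```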

Calculating the interpolation performed in all points

By interpolating all points, the Average Precision (AP) can be interpreted as an approximated AUC of the Precision x Recall curve. The intention is to reduce the impact of the wiggles in the curve. By applying the equations presented before, we can obtain the areas as demonstrated here. We can also obtain the interpolated precision points visually by looking at the recalls starting from the highest (0.4666) down to 0 (looking at the plot from right to left) and, as we decrease the recall, collecting the precision values that are the highest, as shown in the image below:

Looking at the plot above, we can divide the AUC into 4 areas (A1, A2, A3 and A4):

Calculating the total area, we have the AP:
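
A sketch of the every-point (all-points) interpolation, which sums the rectangular areas under the interpolated precision envelope. It mirrors the widely used PASCAL-VOC-style computation rather than any specific library, and the (recall, precision) points below are hypothetical:

```python
def all_points_ap(recalls, precisions):
    # Prepend recall 0 and append recall 1 sentinels, as in the usual computation.
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]

    # Make precision monotonically decreasing from right to left
    # (each point takes the maximum precision to its right).
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])

    # Sum the areas of the rectangles where the recall changes.
    ap = 0.0
    for i in range(1, len(mrec)):
        ap += (mrec[i] - mrec[i - 1]) * mpre[i]
    return ap

# Hypothetical points where the accumulated recall increases.
recalls    = [0.066, 0.133, 0.2,   0.266, 0.333, 0.4,   0.466]
precisions = [1.0,   0.666, 0.6,   0.571, 0.555, 0.545, 0.538]
print(all_points_ap(recalls, precisions))
```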

The results of the two interpolation methods are a little different: 24.56% with the every-point interpolation and 26.84% with the 11-point interpolation.
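
Finally, mAP (Mean Average Precision) is simply the mean of the per-class AP values. A trivial sketch, assuming the AP of each class has already been computed by either interpolation method (the numbers are hypothetical):

```python
# Hypothetical per-class AP values.
ap_per_class = {"dog": 0.2456, "cat": 0.3120}

# mAP: average the AP over all classes.
mean_ap = sum(ap_per_class.values()) / len(ap_per_class)
print("mAP:", mean_ap)
```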
