Explaining the reasoning of models trained with machine learning algorithms, such as object detection models, has become ever more important. This is due both to regulatory standards that demand insight into these models and to the increasing complexity of state-of-the-art models. We have already written about explainable AI (XAI) on our blog and given a talk about it. This time we want to write about explainable object detection.
In this article we investigate one AI application where explaining the model output is particularly interesting and challenging: detecting objects in images. Object detection is a task under intense, active development in the field of computer vision, with a continuously improving state of the art. Practical applications range from counting animal populations to providing better thumbnails for social media. See for example here what you can do with object detection models – and why it is important to understand your model beyond summary evaluation metrics. While explainable AI is not a new topic in computer vision, there are no explainable object detection tools yet available to analyze an object detection model down to a specific input. Therefore, we will now show one approach to a detailed instance-based explanation.
We want to obtain humanly interpretable reasons for a certain model decision. For this we make use of the SHAP library to calculate Shapley values. In one sentence, Shapley values determine the marginal contribution of a feature towards a model result, taking into account a background distribution of the other features. You can find a detailed explanation here. In this article we build a model where we can selectively remove part of the input, i.e. hide patches of pixels in the image. These patches serve as surrogate features for which we calculate and visualize Shapley values.
XAI libraries like SHAP and LIME support a range of standard models and inputs, including (deep) neural networks and image input. Object detection differs from the tasks supported out of the box: the nearest analogue would be image classification. In image classification the input is the same kind of image, and the employed models are typically deep neural networks. However, the output of object detection goes beyond just one value for a class or a range of class probabilities: one image can show multiple objects of varying sizes. State-of-the-art models for object detection also include a so-called non-maximum-suppression (NMS) step, which filters the output of the last layer of the neural network down to the final predictions.
Non-maximum-suppression step (NMS)
NMS iteratively removes lower scoring boxes which have an intersection over union greater than a given threshold with another (higher scoring) box. The threshold choice is up to the user and depends e.g. on the expected situations: Are objects frequently grouped close together or kept further apart?
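As a conceptual sketch (not the actual YOLOv5 implementation; the box format and function names here are our own), NMS can be written as a greedy loop over score-sorted boxes:

```python
import numpy as np

def iou(box, boxes):
    """Intersection over Union of one box [x1, y1, x2, y2] with an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    intersection = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return intersection / (area(box) + area(boxes) - intersection)

def nms(boxes, scores, iou_threshold=0.45):
    """Keep the highest scoring box, drop all remaining boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```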
You don’t need to train this step and it is non-differentiable. This means that there is no direct connection from the input to the final output via the gradient of the neural network. This, as well as the problem of multiple outputs from the model, prohibits using tools like DeepLift or the DeepExplainer in the SHAP library. However, SHAP provides a tool we can use for generic black box models, the KernelExplainer. The challenge of our explainable object detection example lies in two aspects. First, we have to fit the object detection task in the scheme of the KernelExplainer. And second, we need to connect model and XAI framework on the technical level.
You can find the code accompanying this article here in Colab. We will explain the major steps and discuss them in the following.
For this article, we use the object-detection algorithm YOLOv5. It runs on PyTorch, which is supported in Colab, and competes for top performance. Additionally, the models allow easy fine-tuning for new object types.
Of the available pretrained models, we will use YOLOv5s. It is the smallest model and is pretrained on the COCO dataset and its 80 classes.
PyTorch is already available in the Colab environment, but we need to install the SHAP library and YOLOv5, and download the pretrained model weights. From the YOLOv5 code, we need helper functions for NMS and for checking the overlap of bounding boxes (the Intersection over Union, or IoU).
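In the Colab notebook the setup looks roughly like this (package versions, the repository layout and the exact release URL of the weights may have changed since):

```python
# Install SHAP and get the YOLOv5 code (PyTorch already ships with Colab)
!pip install shap
!git clone https://github.com/ultralytics/yolov5
%cd yolov5

# Download the pretrained YOLOv5s weights (the release tag may differ)
!wget https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt

# Helper functions from the YOLOv5 code base
from utils.general import non_max_suppression
from utils.metrics import box_iou   # lives in utils.general in older YOLOv5 versions
```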
Using the OpenCV library, we read one image, pad it into a square and resize it to 160×160 pixels. We could also use smaller or larger input sizes, but the square's width/height must be a multiple of 32 to fit the input of the pretrained model. Object detection with deep neural networks and XAI methods are both rather resource- and time-consuming, so we downscale input images to speed up the computation. However, if you are interested in detections and explanations at higher resolution, feel free to try different image sizes here.
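A minimal version of this preprocessing, assuming an input file image.jpg, could look like this:

```python
import cv2
import numpy as np

IMG_SIZE = 160   # must be a multiple of 32 to fit the pretrained model

image = cv2.imread("image.jpg")            # BGR, shape (height, width, 3)
height, width, _ = image.shape
side = max(height, width)

# Pad to a square (image in the top-left corner, black border elsewhere) ...
padded = np.zeros((side, side, 3), dtype=image.dtype)
padded[:height, :width, :] = image

# ... and resize to the model input size
image_small = cv2.resize(padded, (IMG_SIZE, IMG_SIZE))
```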
Now, to match the expected model input, we must reorder the color channels (OpenCV reads images in BGR format, while the model expects RGB), cast the image to a PyTorch tensor and pass it to the model. We simply load the model weights and are ready to do inference on images.
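Sketched out, with the model loaded through YOLOv5's attempt_load helper (whose exact signature differs slightly between YOLOv5 versions):

```python
import torch
from models.experimental import attempt_load

# BGR -> RGB, HWC -> CHW, scale to [0, 1] and add a batch dimension
img = image_small[:, :, ::-1].transpose(2, 0, 1).copy()
img_tensor = torch.from_numpy(img).float().unsqueeze(0) / 255.0

# Load the pretrained weights downloaded above
model = attempt_load("yolov5s.pt")
model.eval()

with torch.no_grad():
    raw_output = model(img_tensor)[0]   # raw predictions before NMS
```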
The immediate output of this model is, for each candidate detection, the x and y coordinates of the center of the identified object as well as its width and height. We also get the probability that there is an object at these coordinates, plus the probabilities for each of the 80 classes. This vector lists detections for all possible anchor points, most of which will have very low scores:
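For a 160×160 input this raw tensor looks roughly like this (the exact number of anchor positions depends on the model configuration):

```python
print(raw_output.shape)   # e.g. torch.Size([1, 1575, 85])

# Each of the 85 values per anchor position means:
#   0:4   -> box center x, center y, width, height
#   4     -> objectness score (is there any object here?)
#   5:85  -> probabilities for the 80 COCO classes
```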
When we collapse and filter the predictions using NMS, two detected objects with high confidence remain. They are, as we expected, the two people in the image.
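Using the non_max_suppression helper imported above:

```python
# Collapse the raw predictions into the final detections for the first (and only) image
detections = non_max_suppression(raw_output, conf_thres=0.25, iou_thres=0.45)[0]
print(detections)   # one row per detection: x1, y1, x2, y2, confidence, class index
```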
For each detection we get the coordinates, now in the format x1, y1, x2, y2. We also receive the product of the object score and the highest class score, as well as the index of the most likely class (COCO class index 0 corresponds to the object category 'person'). To see how well we did in finding the person on the right, we look at the combined score of the second detection: the model achieved a score of 68.6%.
Potential approaches to increase the confidence score are to use bigger model weights or to increase the image resolution.
Explaining the whole model output with respect to the input image is hard, simply because there is not one well delineated outcome. If we focus on one prediction per image, the question from the XAI perspective is much narrower and better defined: What part of the image contributes to this particular detection and how much?
To use this model with the KernelExplainer of SHAP, we need to fit the steps above into one PyTorch model:
1. Casting the image to a PyTorch tensor
2. Applying the core model
3. Applying NMS
4. Calculating the score of the detection we are interested in
We implement these steps as individual layers in PyTorch. We handle step 1 with the class Numpy2TorchCaster. Step 2 is the model we used above. Steps 3 and 4 are handled by the class OD2Score, as sketched below.
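Sketches of the two custom layers; the notebook contains the full versions, and the attribute names and thresholds here are our own:

```python
import torch
import torch.nn as nn
from utils.general import non_max_suppression
from utils.metrics import box_iou   # utils.general in older YOLOv5 versions

class Numpy2TorchCaster(nn.Module):
    """Step 1: cast a numpy image (H, W, C, values 0-255, RGB) to the tensor the model expects."""
    def forward(self, x):
        x = torch.from_numpy(x).float() / 255.0
        return x.permute(2, 0, 1).unsqueeze(0)   # -> (1, C, H, W)

class OD2Score(nn.Module):
    """Steps 3 and 4: apply NMS and score one specific target detection."""
    def __init__(self, target_box, target_class, conf_thres=0.001, iou_thres=0.45):
        super().__init__()
        self.target_box = torch.tensor(target_box, dtype=torch.float32)  # x1, y1, x2, y2
        self.target_class = target_class
        self.conf_thres = conf_thres   # very low threshold, so even weak detections get scored
        self.iou_thres = iou_thres

    def forward(self, raw_output):
        if isinstance(raw_output, (list, tuple)):   # YOLOv5 returns (predictions, ...) in eval mode
            raw_output = raw_output[0]
        detections = non_max_suppression(raw_output, self.conf_thres, self.iou_thres)[0]
        score = torch.zeros(1)
        for *box, conf, cls in detections:
            if int(cls) != self.target_class:
                continue
            # Weight the confidence with the overlap between detected box and target box
            overlap = box_iou(torch.stack(box).unsqueeze(0), self.target_box.unsqueeze(0))
            score = torch.max(score, conf * overlap.squeeze())
        return score
```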
When we extract the score for our target detection, the detected box can deviate from the target box we set. If this happens, we multiply the score with the overlap between the target box and the detected box. The final score therefore depends both on how confident the model is in predicting the person and on how well the box is positioned.
Note also that we just want to find out how the model comes to this particular result. It does not matter here if the target is a correct detection as judged by a human or compared to some gold standard.
We can chain these layers in sequence with PyTorch, as we show here:
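For example (the target box coordinates and the class index 0 for 'person' are placeholders; in the notebook they come from the detection we just inspected):

```python
scoring_model = nn.Sequential(
    Numpy2TorchCaster(),                                       # numpy image -> torch tensor
    model,                                                     # YOLOv5s core model, raw predictions
    OD2Score(target_box=[90, 20, 150, 150], target_class=0),   # score for our 'person' target
)
```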
The scoring_model takes an image as input and returns a single score combining two things: how confident the model is in detecting a person in the right area and how well the box is positioned. However, we will make one more adjustment to the input.
Now we could start to change the input image and see how the output changes. Unfortunately, the image has 160×160 = 25,600 pixel values – 76,800 if you count the three color channels separately. Calculating the influence of each individual pixel requires excessive computing resources. On top of that, it is likely that each single pixel contributes only very little towards the detection. So, instead we aggregate the pixels into superpixels, i.e. connected patches of pixels. We can segment an image into superpixels in different ways. We’ll keep it simple and use a rectangular grid with fixed width and height.
To implement this segmentation approach, we create a new layer, the SuperPixler. This layer accepts a list indicating which superpixels should be active or hidden for an image and builds the matching image as input for the next layer. The patch of an inactive superpixel is replaced with the mean color of the image.
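A sketch of such a layer, assuming a fixed grid of 8×8 pixel patches (the notebook version additionally handles whole batches of masks; here we keep it to a single mask and loop outside, see below):

```python
class SuperPixler(nn.Module):
    """Turn a binary superpixel mask into the image with the inactive patches greyed out."""
    def __init__(self, image, super_pixel_size=8):
        super().__init__()
        self.image = image                         # (H, W, 3) numpy array, RGB
        self.size = super_pixel_size
        self.mean_color = image.mean(axis=(0, 1))  # fill color for hidden patches
        self.per_row = image.shape[1] // super_pixel_size

    def forward(self, mask):
        # mask: flat vector of 0/1 values, one entry per superpixel
        img = self.image.copy()
        for idx, active in enumerate(mask):
            if not active:
                y = (idx // self.per_row) * self.size
                x = (idx % self.per_row) * self.size
                img[y:y + self.size, x:x + self.size, :] = self.mean_color
        return img

# Prepend the SuperPixler to the scoring model from above (BGR -> RGB for the stored image)
super_pixel_model = nn.Sequential(SuperPixler(image_small[:, :, ::-1].copy()), *scoring_model)
```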
The super_pixel_model takes as input the information which superpixels are active and which are greyed out, internally converts this to the image containing our target and again returns the score of how well we detected the target. With superpixels of 8×8 pixels and an image of 160×160 pixels, the input space is just a vector of 400 binary values mapping to the 400 superpixels covering the whole image.
The KernelExplainer can finally interpret the super_pixel_model, which has a manageable input space. Now, we feed the model to the explainer and let it determine the contribution of each superpixel to the output of the detector.
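A possible way to wire this up (the all-zero background mask corresponds to a fully greyed-out image; the number of samples is a compute budget you can tune):

```python
import shap

N_SUPERPIXELS = 400   # (160 / 8) * (160 / 8)

def score_for_masks(masks):
    # The KernelExplainer passes a batch of binary mask vectors; evaluate them one by one
    with torch.no_grad():
        return np.array([super_pixel_model(m).item() for m in masks])

background = np.zeros((1, N_SUPERPIXELS))              # "all superpixels hidden"
explainer = shap.KernelExplainer(score_for_masks, background)
shap_values = explainer.shap_values(np.ones((1, N_SUPERPIXELS)), nsamples=3000)
```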
We map the superpixels back to their constituent pixels (easy, since they conform to a grid) and scale the values to the range 0 to 1. In the colormap we use, red means a positive contribution of a pixel, blue a negative contribution.
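One way to do this mapping and plotting, assuming the 20×20 grid of 8×8 superpixels from above:

```python
import matplotlib.pyplot as plt

# One Shapley value per superpixel -> arrange them on the 20x20 grid
values = np.array(shap_values).reshape(20, 20)

# Blow each superpixel up to its 8x8 pixel patch and normalize to [0, 1]
heatmap = np.kron(values, np.ones((8, 8)))
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())

# Overlay on the image: red = positive contribution, blue = negative contribution
plt.imshow(cv2.cvtColor(image_small, cv2.COLOR_BGR2RGB))
plt.imshow(heatmap, cmap="bwr", alpha=0.5)
plt.axis("off")
plt.show()
```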
We see that the patches with high contribution are indeed located within the bounding box of our target. Interestingly, the highest contribution seems to come from head and shoulders. The inner space and the lower body play a lesser role. It seems that for the model, a person is most strongly associated with a face and upper body contours. This makes sense and probably also correlates to the way people appear in the original COCO training data: We tend to focus on the head and upper body when taking pictures.
We see few patches contributing against the detection. This also makes sense since nothing in the image obscures the target and makes it less likely to see a person. The one blue patch here might correspond to an extended arm and the absence of it could weaken the overall impression of a person for the model.
There are all kinds of refinements we could implement here.
For example, the superpixels can be chosen in a more sophisticated way. A better approach could be a clustering of image elements, where the replacement value is the average color of the neighboring superpixels.
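As one possible variant for the clustering part (this is not what the notebook does; it uses scikit-image's SLIC segmentation in place of our fixed grid):

```python
from skimage.segmentation import slic

# Cluster the image into roughly 400 irregular superpixels based on color and position
segments = slic(cv2.cvtColor(image_small, cv2.COLOR_BGR2RGB), n_segments=400, compactness=10)

# `segments` assigns a superpixel label to every pixel; a SuperPixler variant would then
# grey out all pixels carrying a given label instead of a fixed 8x8 patch
```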
We can also define more than one replacement value for each superpixel. This gives a better characterization of what the “absence” of a pixel would look like.
Different scales for the image and superpixel also play a role. You can experiment with these values within the limits of your available processing power.
We only looked at a case where the model gets the detection right, both with regard to the position and the type of the object. Cases where the model makes one or both errors can potentially tell us more about its “assumptions” and “preconceptions”. The example image here shows a misclassification as a giraffe. The correct prediction is, in fact, out of scope for the model (COCO has no 'kangaroo' class). Nevertheless, the patches contributing to the detection correspond to the elongated head and neck of the kangaroo.
Another limitation here is the explanation of false negatives: we could modify the scoring procedure to look for a target that is only very weakly predicted. But to determine what input the model would need to see for a hit, we would have to add information to the image. Say, a person is not detected because the head is obscured. To check whether the occlusion is the problem, we would have to place a head into the image in the expected position. Not only is this more difficult to implement, but the space of possible imputations is also much larger for a given missed detection.
XAI tools currently don't natively support the task of object detection. Therefore, we had to adapt the input and output of a detection model to fit a generic explainer. After successfully connecting all the components, we were able to inspect a single target detection and determine which parts of the image contribute to it. The model seems to correctly focus on areas of the image that are related to the target. However, we see a bias in the 'person' class towards the upper body and head, which means that detecting people from some angles, or only partially visible persons, might be more difficult for the model. On a case-by-case basis, we have gained valuable insight into the model's decision process through explainable object detection, and maybe also some more trust in its detections.