As deep learning is applied to high stakes scenarios, it is increasingly important that a model is not only making accurate decisions, but doing so for the right reasons. Common explainability methods provide pixel attributions as an explanation for a model's decision on a single image. However, using these input-level explanations to understand patterns in model behavior is challenging for large datasets as it requires selecting and analyzing an interesting subset of inputs. By utilizing the human generated bounding boxes that represent ground truth object locations, we introduce metrics for scoring and ranking inputs based on the correspondence between the input’s ground truth object location and the explainability method’s explanation region. Our methodology is agnostic to model architecture, explanation method, and modality allowing it to be applied to many tasks and domains. By aligning model explanations with human annotations, our method surfaces patterns in model behavior when applied to two high profile case studies: a widely used image classification model and a cancer prediction model.

Try the demo and check out the code!



In AI applications such as cancer diagnosis, autonomous driving, and facial recognition, it is crucial to not only understand model performance, but also the reason behind model decisions. Various prior work has demonstrated weaknesses in these models — even highly accurate ones — including reliance on non-salient regions or on background information only . Explanation methods help identify these pitfalls by providing explanations for model predictions, enabling humans to identify the features on which a model decision is based. However, these methods provide explanations on the image level making it challenging to understand global model behavior or dataset limitations.

In this work, we explore model decisions using saliency methods in conjunction with the ground truth object bounding boxes provided in many computer vision datasets. We apply three scoring functions — explanation coverage, ground truth coverage, and intersection over union — to sort images based on the overlap between the explanation and the ground truth object location. By sorting images in this way, we discover insights into when and why the model was “right for the right reasons”, “wrong for the wrong reasons”, or perhaps most interestingly “right for the wrong reasons”. We show our methodology is applicable to various model architectures, explanatory methods, and input data by evaluating on two representative tasks: ImageNet vehicle classification using a pretrained ResNet50 model and melanoma prediction using a VGG11 model and the ISIC lesions dataset . In both tasks, we identify insights into model behavior and uncover unexpected features of our datasets.


Related Work

Image datasets from domains such as object detection , facial recognition , and medical classification often contain ground truth object segmentations. In our method, we utilize these ground truth annotations to help humans understand model behavior. We evaluate our model on two image datasets with ground truth annotations: ImageNet which contains human generated bounding boxes and the ISIC lesions dataset which contains clinician defined object segmentations.

In conjunction with ground truth annotations, our method relies on explanation methods (e.g., vanilla gradients, integrated gradients, LIME, SmoothGrad, and SIS) that provide pixel-level explanations for model behavior. While these methods successfully expose explanations for a single input, our method enables users to sort and rank explanations over all images as a way to gain insight into global model behavior. Our methodology is agnostic to the explanation method allowing for flexibility across use cases.

Aside from explainability methods, a growing number of techniques have been developed to help users interpret image models . However, these methods often focus on understanding patterns learned by intermediate nodes , require additional datasets, or demand architecture specific algorithms. On the other hand, our method is agnostic to model architecture, saliency method, and dataset enabling users across domains to understand global patterns in model behavior by aligning model explanations and ground truth annotations.



In our method, we leverage the ground truth annotations along with instance-level explanations to compute coverage scores for each image. By sorting the images using these scores, we are able to query for instances that give us insight into model behavior. Our method only assumes a set of inputs with ground truth regions and explanations regions, making it agnostic to model architecture, dataset, and explanation technique.

We compute three coverage metrics to allow a breadth of exploration: explanation coverage, ground truth coverage, and intersection over union (IoU). Each score takes as input a set of pixels $$GT$$ corresponding to the known ground truth region and a set of pixels $$E$$ corresponding to the explanation region.

\begin{aligned} \text{Explanation Coverage} &= \frac{|GT \cap E|}{|E|} \\\\ \text{Ground Truth Coverage} &= \frac{|GT \cap E|}{|GT|} \\\\ \text{IoU} &= \frac{|GT \cap E|}{|GT \cup E|} \end{aligned}

As shown in Figure 1, a low score under all three metrics indicates that an image’s explanation region and ground truth region are disjoint. In Figure 2, we show example scenarios using an image classification task. When a correctly classified image has a low score, it often indicates the model was relying on background information such as a snowmobile helmet to make the prediction of snowmobile or train tracks to make the prediction of electric locomotive. When an image has a low score and is incorrectly classified, it can indicate the model is focusing on a secondary object in the image or incorrectly relying on background context (e.g., using snow to predict arctic fox).

Visual summary of the overlap between the model explanation and human annotation. Visual summary of explanation coverage. Visual summary of ground truth coverage. Visual summary of intersection over union.
Examples where each of the metrics is maximized and minimized.
Figure 1. A visual representation of all three coverage metrics. Ground truth coverage represents the proportion of the ground truth region covered by the explanation region, explanation coverage represents the proportion of the explanation region covered by the ground truth, and IoU represents how similar the two regions are.

Explanation coverage represents the proportion of the explanation region covered by the ground truth region. High explanation coverage indicates the entire explanation region lies within the ground truth region, meaning the model is relying on a subset of salient features to make its prediction. Filtering for correctly classified inputs with high explanation coverage can surface instances where a subset of the object, such as the dog’s face, was sufficient to make a correct prediction. Looking at incorrectly classified images with high explanation coverage can help us find instances where the model uses an insufficient portion of the object to make a prediction (e.g., using a small region of black and white spots to predict dalmatian).

Ground truth coverage represents the proportion of the ground truth region covered by the explanation region. High ground truth coverage indicates that the model is using the entire object to make its decision. In Figure 2, we see filtering for correctly classified images with high ground truth coverage uncovers instances where the model relies on the object and relevant background pixels (e.g., the cab and the street), to make a correct prediction. Looking at incorrectly classified instances with high ground truth coverage shows examples where the model overrelies on contextual information such as using the keyboard and person’s lap to predict laptop.

IoU is the strictest metric. A high IoU score indicates the explanation and ground truth are very similar and IoU is maximized when the explanation and ground truth regions are identical. Looking at correctly classified images with high IoU scores can help identify instances where the model was right for the exactly the right reasons. Incorrectly classified images with high IoU scores can surface examples where the image labels are ambiguous such as moped and motor scooter.

Figure 2. Examples of low and high values of all three coverage metrics. Looking at correctly and incorrectly classified images at each of the score extremes enables users to find patterns in model behavior.

Case Study: Image Classification

In our first case study, we apply our methodology to an image classification task using a publicly available PyTorch ResNet50 model pretrained on ImageNet. Since the pretrained model is publicly available and ImageNet is a standard benchmark in the computer vision community, the model is widely used for out of the box classification as well as fine-tuning and transfer learning tasks.

In this case study, we apply the model to an ImageNet classification task on a subset of ImageNet9 images containing vehicles. Each image in the subset contains a single ground truth bounding box annotation as its label. For each image we compute a LIME explanation. Using the overlap metrics between the explanation and the ground truth bounding box, our methodology allows us to gain insight into model behavior and identify known flaws in ImageNet.

We begin our exploration by looking at instances where the model performs well. In particular, we choose to look at images labeled as Jeep that are classified correctly. To see if the explanations correspond to the ground truth regions, we look at images with high IoU scores. We see the model explanations have high agreement with the ground truth regions, suggesting its performance on these images is due to having learned salient features of Jeeps.

Figure 3. Filtering to correctly classified Jeep images with high IoU scores reveals instances with high agreement between the ground truth object region and the explanation.

Looking at the other end of the score distribution, we filter for correctly classified Jeep images with low IoU scores. Many of the images still have explanations focused on salient features of Jeeps such as their wheels and distinct body shape; however, we notice an example where the explanation region for the class Jeep is focused on a black dog. This may indicate that the model has memorized the existence of the black dog in the image, raising the question of whether the pixels of the dog contain adversarial properties that could cause the model to predict Jeep for any image edited to contain the dog.

Figure 4. Looking at correctly classified Jeep images with low IoU scores confirms the model is latching onto salient features except in the first image where the explanation corresponds to the image of a dog.

Finally, we look at incorrectly classified images with low explanation coverage to determine what causes the model to fail. Images with low explanation coverage have disjoint explanation and ground truth regions. Without looking at the results, we may hypothesize the model makes the incorrect prediction and has a disjoint explanation because it is guessing at random. However, looking at the images we see the main failure mode occurs when the model predicts a secondary object in the image. Despite each image only having a single label and ground truth annotation, we see a large number of images contain multiple objects.

Figure 5. Incorrectly classified images with low explanation coverage identify an unexpected dataset feature: images with only one ground truth annotation may contain multiple objects.

Our method gives insight into the pretrained PyTorch model predictions on vehicle images from ImageNet, showing that the model uses human interpretable explanations for some classes. Further sorting by our metrics allows us to discover the dataset contains images with multiple objects which is unexpected given the images contain a single label and ground truth annotation.


Case Study: Melanoma Diagnosis

In our second case study, we evaluate our method using a melanoma diagnostic task. This case study represents a real-world scenario where AI-based melanoma classification models exist online for at-home risk assessments. Since this is a high-stakes task with impact to patient health, it is critical the models rely on salient features when making a prediction.

In this case study, we use dermoscopic image data from the ISIC Skin Lesion Analysis Towards Melanoma Detection 2016 Challenge. Each image in the dataset is an up close image of a skin lesion labeled as either benign or malignant and is annotated with a lesion segmentation created by an expert clinician. We train a VGG11 pretrained on ImageNet to learn a binary benign/malignant classification from the original images and achieve accuracy on par with 2016 challenge winners. From the trained model, we extract LIME explanations. We evaluate the model by applying our coverage metrics to the ground truth lesion segmentations and LIME explanations.

We begin our exploration by analyzing correctly classified images with the highest IoU scores. These examples show the instances where the lesion segmentation and explanation region are most similar. We see there are a number of images for which the explanation is focused on the lesion, suggesting the model has learned a relationship between lesion characteristics and malignancy.

Figure 6. Filtering to correctly classified malignant tumors with high IoU shows images where the explanation corresponds with the ground truth lesion.

Since our model seems to be learning some salient features, we next filter to malignant lesions incorrectly classified as benign. Sorting by low ground truth coverage, we see there are instances where our model makes incorrect predictions relying only on peripheral skin regions. This is particularly concerning in the case of at home risk assessment where a cancerous lesion could be classified as benign due to skin surrounding the lesion.

Figure 7. Looking at images with low ground truth coverage that have been incorrectly classified as benign, shows examples where the explanation is disjoint from the lesion.

Since our model incorrectly classified malignant lesions using non-salient background information, we explore if the model can also correctly classify lesions without looking at the lesion. We filter to correctly classified benign lesions and look for images with low explanation coverage. We find a number of images where the model relies on the existence of in-frame dermatological tools to make a benign prediction. While not salient, these dermatological tools only exist in benign images and are sufficient to make a correct classification.

Figure 8. Searching for correctly classified benign images with low explanation coverage uncovers cases where the explanation corresponds to dermatological tools within the image.

Using our methodology reveals insight into melanoma model behavior showing that while the model uses salient pixels to make some decisions, it dangerously misclassifies malignant tumors due to peripheral skin regions and latches onto spurious dataset features.



In this work, we present a methodology that enables humans to understand model behavior using the alignment between ground truth object labels and saliency method explanations. Our method is agnostic to model architecture, explanation method, and image dataset, allowing it to be used in a range of applications. Using real world case studies, we show our method allows users to identify where the model is “right for the right reasons”, when the model makes correct predictions using non-salient features, and when the dataset contains unexpected features.


Try The Demo or Check Out The Code!