Abstract
For an AI system built around a machine learning classification model, we present a framework,
denoted SafetyCage, for systematically detecting and explaining misclassifications. We show how the framework
can be used during deployment of the AI system, when true labels are unknown. Specifically, a misclassification
detector measures the reliability of an individual model prediction and flags the prediction as either trustworthy
or not. Unfortunately, most existing misclassification detectors are not easily interpretable for the purpose of
finding the root cause of a misclassification. Hence, if the prediction is deemed untrustworthy, our approach
provides additional so-called local misclassification explorations to further assess the trustworthiness of the
prediction. The purpose of the framework is to enable systematic exploration of the root cause of a particular
misclassification, thereby motivating procedures to enhance the AI system even further. We showcase
our framework with three ML models of different architectures, trained on images, tabular data, and text,
respectively, and present three generic local misclassification explorations, showing how they can be
adapted to each use case.