Abstract
For an AI system built around a machine learning classification model, we present a framework,
denoted SafetyCage, for systematically detecting and explaining misclassifications. We show how the framework
can be used during deployment of the AI system, when true labels are unknown. Specifically, a misclassification
detector measures the reliability of an individual model prediction and flags the prediction as either trustworthy
or not. Unfortunately, most existing misclassification detectors are not easily interpretable for the purpose of
finding the root cause of a misclassification. Hence, if the prediction is deemed untrustworthy, our approach
provides additional so-called local misclassification explorations to further assess the trustworthiness of the
prediction. The purpose of the framework is to enable systematic exploration of the root cause of a particular
misclassification, thereby motivating procedures to enhance the AI system even further. We showcase
our framework with three ML models of different architectures, trained on images, tabular data, and text,
respectively, and present three generic local misclassification explorations, showing how they can be
adapted to each use case.