Interpreting "Black Box" Classifiers to Evaluate Explanations of Explanation Methods

Title: Interpreting "Black Box" Classifiers to Evaluate Explanations of Explanation Methods
Author: Murtaza, Adnan
Contributor: University of Helsinki, Faculty of Science
Publisher: Helsingin yliopisto
Date: 2020
Language: eng
Thesis level: master's thesis
Discipline: Algorithms and Machine Learning
Abstract: Interpretability in machine learning aims to explain the behavior of complex predictive models, widely referred to as black boxes. Generally, interpretability means understanding how a model works internally, whereas explanations are one way to make machine learning models interpretable, e.g., by using transparent and simple models. Numerous explanation methods have been proposed that strive to interpret black-box models. These methods mainly approximate the local behavior of a model and then explain it in a human-understandable way. The primary reason for explaining local behavior is that explaining the global behavior of a black box is difficult and remains an unsolved challenge. A further challenge concerns the quality and stability of the generated explanations. One way to evaluate the quality of explanations is to use robustness as a property. In this work, we define an explanation evaluation framework that attempts to measure the robustness of explanations. The framework consists of two distance-based measures: stability and separability. We adopt the stability measure from existing literature and introduce a new separability measure, which complements stability in quantifying the robustness of explanations. We examine model-agnostic (LIME, SHAP) and model-dependent (DeepExplain) explanation methods to interpret the predictions of various supervised predictive models, especially classifiers. We build classifiers using UCI classification benchmark datasets and the MNIST handwritten digits dataset. Our results illustrate that current model-agnostic and model-dependent explanation methods do not perform adequately with respect to our explanation evaluation framework. They show that these explanation methods are not robust to variations in feature values and often produce different explanations for similar values and similar explanations for different values, which leads to unstable explanations. Our results demonstrate that the developed explanation evaluation framework is useful for assessing the robustness of explanations and should inspire further exploration and work.
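The abstract names two distance-based measures but does not define them, so the following is only an illustrative sketch of how such measures could be computed, not the thesis's actual definitions. It assumes Euclidean distances, a hypothetical `explain` callable that returns a feature-attribution vector, and hypothetical function names `stability_ratio` and `separability_ratio`. The toy explainer (weights times inputs for a linear model) is likewise an assumption chosen to keep the example self-contained; it stands in for a method such as LIME or SHAP.

```python
import numpy as np

def stability_ratio(explain, x, neighbors, eps=1e-12):
    """Largest ratio of explanation distance to input distance over
    points near x. A large value means that similar inputs receive
    very different explanations. (Assumed definition, for illustration.)"""
    e_x = explain(x)
    ratios = []
    for x_n in neighbors:
        d_in = np.linalg.norm(x - x_n)
        if d_in < eps:  # skip duplicates of x to avoid division by zero
            continue
        ratios.append(np.linalg.norm(e_x - explain(x_n)) / d_in)
    return max(ratios) if ratios else 0.0

def separability_ratio(explain, x, distant_points, eps=1e-12):
    """Smallest ratio of explanation distance to input distance over
    points far from x. A value near zero means that different inputs
    receive nearly identical explanations. (Assumed definition.)"""
    e_x = explain(x)
    ratios = []
    for x_d in distant_points:
        d_in = np.linalg.norm(x - x_d)
        if d_in < eps:
            continue
        ratios.append(np.linalg.norm(e_x - explain(x_d)) / d_in)
    return min(ratios) if ratios else 0.0

# Toy explainer (assumption): attributions of a linear model, w * x.
w = np.array([2.0, 0.0])
explain = lambda v: w * v

x = np.array([1.0, 1.0])
neighbors = [np.array([1.1, 1.0]), np.array([1.0, 1.1])]
distant_points = [np.array([1.0, 5.0]), np.array([4.0, 1.0])]

stability = stability_ratio(explain, x, neighbors)
separability = separability_ratio(explain, x, distant_points)
print(stability, separability)
```

In this toy setting the second feature has zero weight, so a point that differs from `x` only in that feature gets the same explanation as `x`. The separability value therefore drops to zero, which is the "similar explanations for different values" failure the abstract describes.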
Subject: Explanation evaluation
Interpretable models
Black-box classifiers
Interpretability in machine learning

Files in this item

File: murtaza_adnan_2020.pdf
Size: 5.427 MB
Format: PDF
