Interpreting "Black Box" Classifiers to Evaluate Explanations of Explanation Methods

Title: Interpreting "Black Box" Classifiers to Evaluate Explanations of Explanation Methods
Author: Murtaza, Adnan
Contributor: University of Helsinki, Faculty of Science
Publisher: Helsingin yliopisto
Date: 2020
Language: eng
Permanent link (URI): http://urn.fi/URN:NBN:fi:hulib-202004211911
http://hdl.handle.net/10138/314279
Level: Master's thesis
Discipline: Algorithms and Machine Learning
Abstract: Interpretability in machine learning aims to explain the behavior of complex predictive models, widely referred to as black boxes. Generally, interpretability means understanding how a model works internally, whereas explanations are one way to make machine learning models interpretable, e.g., by using transparent and simple models. Numerous explanation methods have been proposed that strive to interpret black-box models. These methods mainly approximate the local behavior of a model and then explain it in a human-understandable way. The primary reason for explaining local behavior is that explaining the global behavior of a black box is difficult and remains an unsolved challenge. A further challenge concerns the quality and stability of the generated explanations. One way to evaluate the quality of explanations is to use robustness as a property. In this work, we define an explanation evaluation framework that attempts to measure the robustness of explanations. The framework consists of two distance-based measures: stability and separability. We adopt the stability measure from the existing literature and introduce a new separability measure, which complements stability in quantifying the robustness of explanations. We examine model-agnostic (LIME, SHAP) and model-dependent (DeepExplain) explanation methods to interpret the predictions of various supervised predictive models, especially classifiers. We build classifiers using UCI classification benchmark datasets and the MNIST handwritten digits dataset.
Our results illustrate that current model-agnostic and model-dependent explanation methods do not perform adequately with respect to our explanation evaluation framework. They are not robust to variations in feature values and often produce different explanations for similar values and similar explanations for different values, which leads to unstable explanations. Our results demonstrate that the developed explanation evaluation framework is useful for assessing the robustness of explanations and motivates further work.
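The abstract describes stability and separability only informally, as distance-based measures of explanation robustness. The exact definitions used in the thesis are not given here, so the following is an illustrative sketch under common assumptions from the robustness literature: stability is estimated as the worst-case change in explanation relative to a small change in input (a local Lipschitz-style ratio), and separability as the distance between explanations of dissimilar inputs. The `explanation` function below is a toy gradient-times-input attribution, not the thesis's method.

```python
import numpy as np

def explanation(model_grad, x):
    """Toy feature attribution: gradient-times-input saliency (illustrative)."""
    return model_grad(x) * x

def stability(model_grad, x, n_perturb=100, radius=0.1, seed=0):
    """Worst-case explanation change near x (lower = more stable).

    Estimates max ||e(x) - e(x')|| / ||x - x'|| over small perturbations x'.
    """
    rng = np.random.default_rng(seed)
    e_x = explanation(model_grad, x)
    worst = 0.0
    for _ in range(n_perturb):
        x_p = x + rng.uniform(-radius, radius, size=x.shape)
        e_p = explanation(model_grad, x_p)
        worst = max(worst, np.linalg.norm(e_x - e_p) / np.linalg.norm(x - x_p))
    return worst

def separability(model_grad, x, y):
    """Distance between explanations of two dissimilar inputs (higher = more separable)."""
    return np.linalg.norm(explanation(model_grad, x) - explanation(model_grad, y))

# Example: a linear model f(x) = w . x has constant gradient w,
# so its explanations change smoothly and stability stays bounded.
w = np.array([2.0, -1.0, 0.5])
grad = lambda x: w
x = np.array([1.0, 1.0, 1.0])
y = np.array([3.0, 0.0, -1.0])
```

A robust explanation method, in this reading, yields low stability values (similar inputs get similar explanations) together with high separability values (different inputs get different explanations); the thesis reports that LIME, SHAP, and DeepExplain often violate both.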
Subject: Explanation evaluation
Interpretable models
Black-box classifiers
Interpretability in machine learning


Files in this item


Files Size Format
murtaza_adnan_2020.pdf 5.427Mb PDF

This item appears in the following collection:
