Interpreting "Black Box" Classifiers to Evaluate Explanations of Explanation Methods

Permanent link

http://urn.fi/URN:NBN:fi:hulib-202004211911
Title: Interpreting "Black Box" Classifiers to Evaluate Explanations of Explanation Methods
Author: Murtaza, Adnan
Contributor: University of Helsinki, Faculty of Science
Publisher: University of Helsinki
Date: 2020
Language: eng
URI: http://urn.fi/URN:NBN:fi:hulib-202004211911
http://hdl.handle.net/10138/314279
Thesis level: Master's thesis
Discipline: Algorithms and Machine Learning
Abstract: Interpretability in machine learning aims to explain the behavior of complex predictive models, widely referred to as black boxes. Broadly, interpretability means understanding how a model works internally, whereas explanations are one way to make machine learning models interpretable, e.g., by using transparent and simple models. Numerous explanation methods have been proposed that strive to interpret black-box models. These methods mainly approximate the local behavior of a model and then explain it in a human-understandable way; the main reason for focusing on local behavior is that explaining the global behavior of a black box is difficult and remains an unsolved challenge. A further challenge concerns the quality and stability of the generated explanations. One way to evaluate the quality of explanations is through the property of robustness. In this work, we define an explanation evaluation framework that attempts to measure the robustness of explanations. The framework consists of two distance-based measures: stability and separability. We adopt the stability measure from existing literature and introduce a new separability measure, which complements stability in quantifying the robustness of explanations. We examine model-agnostic (LIME, SHAP) and model-dependent (DeepExplain) explanation methods to interpret the predictions of various supervised predictive models, especially classifiers. We build classifiers on UCI classification benchmark datasets and the MNIST handwritten digits dataset. Our results show that current model-agnostic and model-dependent explanation methods do not perform adequately with respect to our explanation evaluation framework: they are not robust to variations in feature values and often produce different explanations for similar inputs and similar explanations for different inputs, leading to unstable explanations. These outcomes demonstrate that the developed explanation evaluation framework is useful for assessing the robustness of explanations and motivates further work.
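
As a rough, hypothetical illustration (not the thesis implementation), the sketch below shows how a distance-based stability check of the kind described in the abstract could be computed for any explanation method that returns a vector of feature-importance scores. The function name explanation_stability and the callable explain_fn are assumptions made for this example.

# Minimal sketch of a distance-based stability check (an assumption, not the thesis code).
# `explain_fn` maps an input vector x to a vector of feature-importance scores,
# e.g. a wrapper around LIME, SHAP, or DeepExplain attributions.
import numpy as np

def explanation_stability(explain_fn, x, n_neighbors=20, radius=0.05, seed=0):
    """Largest ratio ||e(x') - e(x)|| / ||x' - x|| over sampled neighbors x'.

    Small values mean similar inputs receive similar explanations (stable);
    large values flag the instability discussed in the abstract.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    e_x = np.asarray(explain_fn(x), dtype=float)
    worst = 0.0
    for _ in range(n_neighbors):
        x_prime = x + rng.normal(scale=radius, size=x.shape)  # nearby perturbed input
        e_prime = np.asarray(explain_fn(x_prime), dtype=float)
        ratio = np.linalg.norm(e_prime - e_x) / (np.linalg.norm(x_prime - x) + 1e-12)
        worst = max(worst, ratio)
    return worst

In practice, explain_fn could wrap, for example, LIME's LimeTabularExplainer.explain_instance (reading the feature weights from the returned explanation) or per-feature SHAP values. The separability measure introduced in the thesis complements such a check by additionally requiring that clearly different inputs do not receive near-identical explanations.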
Keywords: Explanation evaluation
Interpretable models
Black-box classifiers
Interpretability in machine learning


Files

File(s) Size Format
murtaza_adnan_2020.pdf 5.427 MB PDF
