EDUCE: Explaining model Decisions through Unsupervised Concepts Extraction
With the advent of deep neural networks, a growing body of research focuses on understanding their black-box behavior. In this paper, we propose a new type of self-interpretable model, that is, an architecture designed to provide explanations along with its predictions. Our method proceeds in two stages and is trained end-to-end: first, the model builds a low-dimensional binary representation of any input, where each feature denotes the presence or absence of a concept. Then, it computes a prediction based only on this binary representation, through a simple linear model. This allows an easy interpretation of the model's output in terms of the presence of particular concepts in the input. The originality of our approach lies in the fact that concepts are automatically discovered at training time, without the need for additional supervision. Concepts correspond to sets of patterns, built on local low-level features (e.g., a part of an image, a word in a sentence), that are easily distinguishable from the other concepts. We experimentally demonstrate the relevance of our approach on classification tasks over two types of data, text and images, by showing its predictive performance and interpretability.
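To make the two-stage structure concrete, the sketch below shows one possible way to wire such a model: a concept scorer produces binary presence/absence indicators, and a linear classifier predicts only from that binary vector. This is an illustrative assumption, not the authors' implementation; in particular, the straight-through binarization, the hidden layer size, and the class/function names are placeholders chosen to keep the example trainable end-to-end.

```python
# Minimal sketch (not the EDUCE code): a self-interpretable classifier that
# predicts from a binary concept-presence vector only.
import torch
import torch.nn as nn


class StraightThroughBinarize(torch.autograd.Function):
    """Hard 0/1 threshold in the forward pass, identity gradient backward (assumption)."""

    @staticmethod
    def forward(ctx, scores):
        return (scores > 0.5).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


class ConceptClassifier(nn.Module):
    def __init__(self, input_dim: int, num_concepts: int, num_classes: int):
        super().__init__()
        # Stage 1: score the presence of each concept in the input.
        self.concept_scorer = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_concepts),
            nn.Sigmoid(),
        )
        # Stage 2: a simple linear model over the binary concept vector,
        # so each prediction can be read off from which concepts are present.
        self.classifier = nn.Linear(num_concepts, num_classes)

    def forward(self, x):
        scores = self.concept_scorer(x)                   # concept scores in [0, 1]
        concepts = StraightThroughBinarize.apply(scores)  # binary presence/absence
        logits = self.classifier(concepts)
        return logits, concepts


# Usage: the returned `concepts` vector serves as the explanation of the prediction.
model = ConceptClassifier(input_dim=300, num_concepts=8, num_classes=2)
logits, concepts = model(torch.randn(4, 300))
```

Because the final layer is linear and its input is binary, each predicted class score decomposes into a sum of weights over the concepts detected as present, which is what makes the explanation readable.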