1 Introduction
As the scope of application of deep neural networks has greatly widened, two main directions of research have developed to make their behavior more understandable to humans. The first direction aims at developing algorithms able to explain, a posteriori, any black-box model (Ribeiro:etal:2016). The second direction proposes new models and architectures that exhibit predictive performance close to that of deep learning models, while being
interpretable by the users (see for example (Goyal:etal:2019; Yu:etal:2019; Melis:Jaakkola:2018; Quint:etal:2018)). In both research directions, producing explainable or interpretable models usually relies on two core components. (i) First, the interpretation or explanation has to be grounded in concepts, that is, notions that make sense to a human. For instance, to explain an image-classification prediction, an explanation at the pixel level (e.g., based on the importance of each pixel in the prediction) would be difficult to parse. Therefore, existing approaches typically use higher-level features such as superpixels (Ribeiro:etal:2016). (ii) Second, the interpretation/explanation should be supported by only a few human-understandable concepts, since grounding the decision on too many parameters would hinder interpretation. In many existing methods, the list of concepts on which the prediction is grounded is given to the model as additional knowledge (e.g., human labels that define concepts (Kim:etal:2018)). This has two important drawbacks: it requires an additional effort to manually label data at the concept level, and it may introduce bias in the interpretation, since a priori human-defined concepts have no guarantee of being relevant for the given task.

In this article, we propose a new deep learning model whose objective is to be self-interpretable by automatically discovering relevant concepts, thus avoiding the need for extra labeling or the introduction of any artificial bias. Furthermore, the proposed model uses a very general architecture based on convolutions, both for text and image data. As a consequence, our method is easily applicable to new datasets. To do so, we rely on two main principles: i) our model learns to represent any input as a binary vector where each feature corresponds to a concept being absent or present in that input, and ii) it computes its final prediction using only this binary representation. This allows an interpretation of the model's prediction based solely on the appearance of concepts, and results in very simple decision rules. The presence of a concept is defined by the appearance of a local pattern that must be easily identifiable as belonging to a homogeneous set of patterns. We enforce this through a
concept-identification constraint, which facilitates the interpretation of the extracted concepts. Our contributions are threefold: i) we propose a new deep self-interpretable model that predicts solely from the presence or absence of concepts in the input, where the concepts are discovered without supervision; ii) we instantiate this model for both images and text through convolutional neural networks and describe how its parameters are efficiently learned with stochastic optimization; iii) we analyze the quality of the learned models on different text categorization and image classification datasets, and show that our model reaches good classification accuracy while extracting meaningful concepts.
The paper is organized as follows: in Section 2 we describe the general idea of our approach and explain how it can be cast as a learning problem optimized with stochastic optimization techniques. We experimentally train our model and analyze the results for both images and text in Sections 3 and 4. In Section 5, we connect our method to existing interpretable models.
2 The EDUCE model
In this paper, we tackle multi-class classification tasks, but our model can easily be extended to any supervised problem such as regression or multi-label classification. We consider a training dataset of inputs and their corresponding labels. The goal is to learn a predictive function that exhibits high classification performance while being easily understandable by humans.
2.1 Principles
The general idea of our approach is the following: classical (deep learning) models take as input low-level features describing the inputs (e.g., pixels, words) and directly provide a high-level output, such as a category label. The final prediction is entirely based on complex computations over low-level input features, which makes the model's behavior hard for a human to interpret. Even if Deep Neural Network (DNN) based models build intermediate representations, these are often high-dimensional and not constrained to extract meaningful information.
Our model, called EDUCE (Explaining model Decisions through Unsupervised Concepts Extraction), relies on two main principles:

Low-dimensional Binary Representation for Classification: EDUCE builds a mid-level representation of any input in the form of a low-dimensional binary vector. Each binary feature denotes the presence or absence of a different concept in the input and is computed by extracting local patterns, or subparts, of the input (see Figure 1). The output is computed from this binary representation, allowing a quick interpretation of the decision. (In our case, the classifier is a linear model without bias, so the final score is a weighted sum of the concepts appearing in the input.)

Concept Identification: Since the training dataset does not contain any mid-level labels, the extraction of meaningful concepts is unsupervised, but constrained through the concept-identification criterion, which ensures that all the patterns extracted for a given concept carry common semantics, thus allowing an easier interpretation.
These two principles are captured through three main components learned simultaneously: (i) the concept extractor identifies if and where a concept appears in the input, (ii) the final classifier computes the final prediction based on the absence/presence of the concepts, and (iii) the concept classifier ensures that concepts are homogeneous and distinguishable from each other.
2.2 Concept extraction and final prediction
Let us consider a set of concepts. The concept extractor builds an intermediate binary representation of the input, where each value denotes the presence or absence of one concept. The concept extractor thus replaces the first layers of classical DNN architectures, with the difference that its output is low-dimensional and discrete (binary in our case). We build this representation through a stochastic process in two steps: i) first, for each concept, the pattern most likely to correspond to that concept is identified (step 1 in Figure 1), and ii) each extracted pattern is used to decide on the absence or presence of its concept (step 2 in Figure 1), giving the binary representation (shown in step 3 in Figure 1).
Let us define the set of all local patterns in the input, for example the set of patches in an image. For each concept, we define a first parameterized distribution giving the probability that a given pattern is the most relevant one for that concept, and a second parameterized distribution giving the probability that the extracted pattern triggers the presence of the concept. The intermediate representation of the input is obtained by two consecutive sampling steps:
(1)
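The two sampling steps above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' released implementation: the array names, the linear scoring of patterns, and the shapes are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def extract_concepts(patterns, W_select, W_presence):
    """Stochastic two-step extraction of the binary concept vector z.

    patterns:   (n_patterns, d) feature vectors of all local patterns in x
                (hypothetical featurization).
    W_select:   (n_concepts, d) per-concept weights for pattern selection.
    W_presence: (n_concepts, d) per-concept weights for presence detection.
    """
    n_concepts = W_select.shape[0]
    z = np.zeros(n_concepts, dtype=int)
    chosen = np.zeros(n_concepts, dtype=int)
    for c in range(n_concepts):
        # Step 1: sample the pattern most likely to instantiate concept c
        # from a categorical distribution over all local patterns.
        probs = softmax(patterns @ W_select[c])
        i = rng.choice(len(probs), p=probs)
        chosen[c] = i
        # Step 2: Bernoulli draw on whether that pattern triggers concept c.
        z[c] = rng.random() < sigmoid(patterns[i] @ W_presence[c])
    return z, chosen
```

At test time one would take the arg-max of each distribution instead of sampling.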
The final decision is computed solely from the intermediate representation. We use a linear classifier without bias so that we can rely on its weights for interpretation: for each category, each concept is associated with a weight. The final score of a category is computed by summing the weights of the concepts identified in the input. The probability of each category is then obtained through a softmax function over the scores of all possible categories.
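A minimal sketch of this bias-free linear classifier over the binary concept vector (names and shapes are illustrative assumptions, not the authors' code):

```python
import numpy as np

def predict_proba(z, W):
    """Final classifier: a linear map without bias over the binary vector z.

    z: (n_concepts,) binary concept-presence vector.
    W: (n_categories, n_concepts) one weight per (category, concept) pair.
    The score of a category is the sum of the weights of the concepts
    present in the input (z_c = 1); a softmax yields probabilities.
    """
    scores = W @ z
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()
```

Because each score is just a sum of per-concept weights, the contribution of every detected concept to every category is directly readable off W.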
2.3 Concepts identification
Since the intermediate representation is a binary encoding of the input, our method is very close to sparse-coding techniques (Mairal:2009) and, as such, has no incentive to extract meaningful information. Without any additional constraint, it would be difficult or even impossible to interpret the concepts discovered by the model. Indeed, due to the combinatorial nature of the mid-level representation, the model can easily find combinations of patterns that yield good classification accuracy without the patterns themselves being meaningful. Considering the patterns extracted by the concept extractor for each concept, it is necessary to ensure that, for any given concept, all extracted patterns share common semantics, and that the semantics carried by the patterns of one concept differ from those carried by the patterns of any other concept.
This constraint is enforced in EDUCE by jointly learning a multi-class concept classifier that classifies each extracted pattern as belonging to its concept, thus defining a parameterized categorical distribution. This classifier is learned on the patterns responsible for each concept's appearance in the input. The concept classification loss is therefore a cross-entropy loss computed only over the concepts detected as present in the input:
(2) 
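The loss described above can be sketched as follows; this is a hedged reconstruction from the text (a cross-entropy restricted to present concepts), with illustrative names, not the authors' exact loss code:

```python
import numpy as np

def concept_loss(z, concept_logits):
    """Cross-entropy of the concept classifier, restricted to concepts
    detected as present (z_c = 1).

    z:              (n_concepts,) binary presence vector.
    concept_logits: (n_concepts, n_concepts) row c holds the classifier's
                    logits for the pattern extracted for concept c.
    """
    loss, n_present = 0.0, 0
    for c, present in enumerate(z):
        if not present:
            continue
        logits = concept_logits[c]
        # stable log-softmax: logits - logsumexp(logits)
        log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
        loss -= log_probs[c]  # the pattern should be recognized as concept c
        n_present += 1
    return loss / max(n_present, 1)
```

Note that absent concepts contribute nothing, which is one reason the loss naturally discourages spurious concept activations.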
Another way to obtain consistent patterns would be to add a sparsity constraint on the number of concepts present in any input, preventing the model from using large combinations of patterns to reach good classification accuracy. We therefore consider adding a norm constraint on the number of concepts present for a given input example. Nonetheless, we experimentally demonstrate that this constraint is not sufficient and can harm final performance by making the representation coarser. By contrast, our concept classifier is both necessary and sufficient: its loss depends on the number of concepts present, hence sparsity is encouraged; yet, if the discovered concepts are consistent and easy to identify, this loss can be low without harming task performance.
2.4 Objective function and learning algorithm
Our objective function combines the final classification cross-entropy and the concept-classifier loss:
(3) 
where the binary representation and the extracted patterns are sampled as in Equation 1, one hyperparameter controls the strength of the concept identification w.r.t. the final prediction, and another controls the strength of the sparsity constraint, expressed as a norm penalty. The learning algorithm optimizes the parameters of the three distributions. As the explicit computation of the expectation involves expensive summations over all possible values of the sampled variables, we resort to Monte-Carlo approximations of the gradient, a classic method in reinforcement learning (Sutton:1998:IRL:551283). The resulting learning algorithm is given in Algorithm 1 and the gradient derivation is provided in the Supplementary Material. We use the average loss as a control variate. Note that learning can be efficiently implemented for a large variety of architectures over batches, using one GPU per run. Our code for the text and image experiments will be released upon acceptance.
(4)
(5) 
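The Monte-Carlo gradient estimate with an average-loss control variate can be illustrated for a single categorical sampling step. This is a generic REINFORCE (score-function) sketch under assumed names and shapes, not the paper's Algorithm 1:

```python
import numpy as np

def score_function_grad(logits, sampled, loss, baseline):
    """Monte-Carlo (REINFORCE) gradient estimate for one categorical
    sampling step, with a control variate subtracted from the loss.

    logits:   (k,) unnormalized log-probabilities of the k discrete choices.
    sampled:  index that was actually sampled.
    loss:     loss obtained with that sample.
    baseline: control variate (e.g., a running average of past losses).
    Returns the estimated gradient of the expected loss w.r.t. the logits.
    """
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    # gradient of log p(sampled) w.r.t. logits: one_hot(sampled) - probs
    grad_log_p = -probs
    grad_log_p[sampled] += 1.0
    return (loss - baseline) * grad_log_p
```

Descending this gradient lowers the probability of samples whose loss exceeds the baseline, and raises it for samples that do better than the baseline; the baseline leaves the estimator unbiased while reducing its variance.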
3 Text classification experiments
Setting We experiment on the DBpedia ontology classification dataset and the AGNews topic classification dataset (Zhang:LeCun:2015). The DBpedia ontology classification dataset was constructed by picking non-overlapping categories from DBpedia 2014 (Lehmann:etal:2014). We subsample examples of the train dataset for training and for validation. For testing, we subsample part of the examples in the test dataset (using stratified sampling). The AGNews topic classification dataset was constructed from the AG dataset's 4 largest categories. We divide its training set into a training and a validation split, and test on the full test dataset. We use pretrained word vectors trained on Common Crawl (Grave:etal:2018) and keep them fixed. We consider as patterns all sets of consecutive words of a fixed length. (We considered a flexible number of words per pattern, but performance was poorer; we leave this as a direction for future research.)
Therefore, sampling a pattern is equivalent to sampling its start word. For comparison, we train a non-interpretable "Classic" model that uses a Bidirectional LSTM, while EDUCE is based on convolutional layers, as we want a general architecture that works on multiple data types. We monitor final prediction accuracy on the validation set and report results on the test set. For each set of hyperparameters, we run several different random seeds, and we explore three different numbers of concepts. Details on the range of hyperparameters, the training procedure and the size of the architecture are given in the supplementary material (Section 8).

Quantitative analysis Table 1 reports the performance on the DBPedia and AGNews datasets. We report the final accuracy over the task (Final Acc.) and the accuracy of the concept classifier on the test data (Concept Acc.). Naturally, Concept Acc. should be low for models trained without the concept-identification loss. Therefore, we also compute an a posteriori concept accuracy: after training, for each model, we gather the concept patterns it detects on the test data. We separate these patterns into two sets (training and testing; note that both are generated from the test data). For each model, we train a new concept classifier a posteriori on the model's patterns and report its performance (A Posteriori Concept Acc.). We also report the average number of concepts detected as present per input (Sparsity). We tried different values of the two loss weights, alone and in combination; we show here the results most relevant to our analysis, and complete results are available in the supplementary Section 8.3.1. For all metrics we report the mean and standard error of the mean (SEM) over the training random seeds.
Model  Final Acc. (%)  Concept Acc. (%)  A Posteriori Concept Acc. (%)  Sparsity  
DBPedia  
Baseline  
Baseline  
Baseline  
EDUCE  
Classic  N/A  N/A  N/A  
AGNews  
Baseline  
Baseline  
Baseline  
EDUCE  
Classic  N/A  N/A  N/A 
First, looking at the performance of the "Classic" model, we see that encoding the input into a low-dimensional binary vector only reduces accuracy by a few percent on both the DBPedia and AGNews datasets. This means that classifying by identifying relevant patterns is an efficient approach. As expected, without the concept classifier the patterns extracted for each concept are not homogeneous, as the a posteriori concept accuracy is low on both DBPedia and AGNews. Adding our concept classifier greatly improves the concept accuracy without significant loss of final accuracy, on both datasets.
Using only a sparsity constraint results in a much lower concept accuracy, meaning that patterns are less consistent within a concept. The only exception occurs with the strongest constraint, but this is achieved at the expense of the final classifier's performance, which drops significantly on AGNews. To explain this, note that on the AGNews dataset the number of concepts is larger than the number of categories (4 categories), so a simple solution to obtain high concept accuracy is to map one concept to each category. Indeed, the model using only the norm constraint, without our concept classifier, has close to one concept present per input on average, and supplementary Figure LABEL:fig:matcatag shows that this corresponds to mapping one concept per class. This makes the final performance drop, as the representation of the input becomes coarser. By contrast, our model does not suffer from this: it reaches both a high final performance and a high a posteriori concept accuracy, showing that concepts are consistent, while maintaining several concepts present per input on average. Note that adding the sparsity constraint to our method does not improve the relevance and consistency of the discovered concepts as measured by the a posteriori accuracy, and can hurt final performance (see Table LABEL:tab:detailedres and Figure A.4 in supplementary Section 8.3.1).
Figure 1(a) compares the effect of using different numbers of concepts (left is DBPedia, right is AGNews). Using a smaller number of concepts results in higher concept accuracy, at the expense of final classification performance. Conversely, a larger number of concepts gives higher final classification performance, which is expected as the binary representation is larger, but poorer a posteriori concept accuracy. Still, with the largest number of concepts we achieve good concept accuracy on DBPedia and a higher final classification performance than with fewer concepts.


Interpreting EDUCE
We now show how EDUCE's category predictions can easily be interpreted. The following results were generated with the concept classifier enabled and no sparsity constraint. Table 1(a) shows a document from the DBPedia dataset labeled as Natural Place, where the underlined words correspond to the patterns extracted for different concepts. Separately, Table 1(b) shows, for each concept detected in the example of Table 1(a), some patterns extracted from other test documents (each set of 3 words is a pattern; patterns are comma-separated). This allows us to interpret the concepts' meaning: concept 0 maps to the notion of geographical information, concept 7 to the idea of nature, and concept 3 to the notion of municipality. We also see that the patterns extracted in the Natural Place example of Table 1(a) are consistent with these interpretations. Importantly, note that in Table 1(b) the patterns are consistent yet come from multiple categories: for the four concepts shown, each extracted pattern belongs to a different category. To corroborate this, Figure 1(b) shows the empirical frequency of presence of each concept, per category. We see that multiple concepts appear per category, and that concepts are shared among categories. In this setting (see supplementary Table LABEL:tab:detailedres), each text input triggers several concepts on average. These results show how easily the categorization of any text can be explained by the detection of multiple, relevant, and intelligible concepts. More qualitative examples are given in the supplementary Section LABEL:sec:expesupp2.
4 Image classification experiments
Having assessed the relevance of our model on text data, we now turn to image data and explore whether the EDUCE model is also able to extract meaningful concepts.
Setting We first evaluate our approach on MNIST image classification (Lecun:etal:98); results on that dataset are given in the supplementary material. To further test the relevance of the detected patterns, we build a dataset where each image contains two randomly located digits with different labels. Considering a restricted set of digit labels, there are 10 possible resulting categories, each defined by the combination of the two digit labels. We train on generated images and test on different, held-out images, reaching good final test classification performance. Figure 2 shows extracted patterns (not cherry-picked) for the concepts, together with the categories associated with the appearance of each concept. As in our experiments with text data, we can explain the model's behavior: the model learns to extract single digits as patterns for different concepts, which are then combined to predict the final category.
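The two-digit dataset construction described above could look roughly as follows. This is an illustrative reconstruction: the canvas size, placement rule, and blending are assumptions, not the authors' generation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def compose_two_digits(img_a, img_b, canvas=64):
    """Place two 28x28 digit images at random locations on a larger
    black canvas (hypothetical canvas size)."""
    out = np.zeros((canvas, canvas), dtype=img_a.dtype)
    for img in (img_a, img_b):
        y = rng.integers(0, canvas - 28)
        x = rng.integers(0, canvas - 28)
        # combine with max so overlapping strokes stay visible
        out[y:y+28, x:x+28] = np.maximum(out[y:y+28, x:x+28], img)
    return out
```

Pairing digits of different labels while drawing locations at random gives images whose category depends only on which two digits are present, not where they are.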
We also conduct experiments on a dataset of RGB images split into three categories: dogs, cats and birds. (We construct this dataset by combining random images from the Caltech Birds 200-2011 dataset (WahCUB_200_2011) with images of the cats-and-dogs Kaggle dataset (KaggleDogVCat) in equal proportions.) We build our model on top of a pretrained VGG11 model (Simonyan14c). Figure 2 shows extracted patterns and the associated categories. In Figure 3 we plot extracted patterns for the concepts and report in Figure 3(a) the weights of the final classifier. From these two figures, we can interpret the model's behavior: concepts 8 and 9 show what differentiates a dog from a cat or a bird, and support the classifier's prediction of the dog category. Figure 3(b) shows the extracted patterns on random images. We can see that our model focuses on relevant parts of the images, similarly to attention models.
5 Related work
A posteriori explanations A first family of methods interprets an already-trained model, typically using perturbation- and gradient-based approaches. The best-known method is LIME (Ribeiro:etal:2016), but other methods exist (Bach:etal:2015; Shrikumar:etal:2017; Simonyan:etal:2014; Sundararajan:etal:2017). (Melis:Jaakola:2017) design a model that detects input-output pairs that are causally related. (Kim:etal:2018) propose to explain a model's prediction by learning concept-activation vectors. However, in their setting the classifier is fixed and the concepts are predefined, requiring human annotations, while we learn both jointly, in an unsupervised and end-to-end manner.
Self-interpretable models Contrary to the previous line of work, our work falls into the domain of self-interpretable models. Several existing methods propose interpretable models for NLP tasks. Such methods are specific to text data and select rationales, i.e., parts of the text on which the model bases its prediction; see (Lei:etal:2016; Yu:etal:2019) and, very recently, (Bastings:etal:2019). Moreover, they do not encourage the selected rationales to match dataset-wide instances of concepts. (Goyal:etal:2019) propose visual explanations of a classifier's decision, while (Alaniz:Akata:2019) use an architecture composed of an observer and a classifier, in which the classifier's prediction can be exposed as a binary tree structure. However, unlike ours, their model does not provide a local explanation of the decision based on parts of the input. Closer to our work, (Melis:Jaakkola:2018) learn a self-explaining classifier that takes as input a set of concepts extracted from the original input and a relevance score for each concept. While they define a set of desiderata for what makes a concept interpretable, they simply represent the set of extracted concepts as an encoding of the input and learn it with an autoencoding loss. Their work can be seen as a generalization of (Li:etal:2018). (Quint:etal:2018)
extend a classic variational autoencoder architecture with a differentiable decision-tree classifier that takes as input the encoding of the data sample. The classification is hence based on a binary representation of the data, as in our model. However, their methodology is different, and they only experiment on image data.
Other works Although not directly aimed at building an interpretable classifier, (KenyonDean:etal:2019) propose an attractive-repulsive loss that clusters the data into the different categories. (Mordatch:2018) proposes a model that learns to define a concept by a combination of events in an environment. Our work is also close to Latent Dirichlet Allocation (LDA) for topic models (Blei:etal:2013), yet the methodology is different: LDA learns the parameters of a probabilistic graphical model of text generation with approximate inference.
6 Discussion and perspectives
We propose a new neural-network-based model, EDUCE, that is self-interpretable thanks to a two-step approach. First, it computes a low-dimensional binary representation of its inputs that indicates the presence of automatically discovered concepts. Each positive feature in this representation is associated with a particular pattern in the input, and the patterns extracted for a particular concept are enforced to be identifiable by an external classifier. We experimentally demonstrate the relevance of our approach on text categorization and image classification, using very similar architectures for both types of data. The EDUCE model extracts meaningful information and provides understandable explanations to the final user. We contemplate multiple directions for future research. First, if supervision at the concept level were available, we could use it to ground the discovered concepts in human notions, while still letting the model discover extra concepts to avoid any bias. Another direction would be to make EDUCE output a compact representation of the classification process, e.g., using natural language generation on top of our approach.
References
7 Details on the learning algorithm
Algorithm 1 of the main paper shows how we compute the gradient for each set of parameters. Specifically, when computing the loss to be backpropagated, we tune the weight of each term. That is, we backpropagate
(6)  
(7)  
(8)  
(9)  
(10)  
(11) 
where one hyperparameter controls the strength of the reinforcement-learning terms w.r.t. the gradients over the other parameters, and another controls the relative strength of the concept-classifier gradient.
8 Details on text experiments
8.1 Detailed setting
For our experiments on text data, we use the DBpedia ontology classification dataset and the AGNews topic classification dataset, both created by [Zhang:LeCun:2015]. The DBpedia ontology classification dataset was constructed by picking non-overlapping categories from DBpedia 2014, a crowd-sourced community effort to extract structured content from Wikipedia [Lehmann:etal:2014]. From the train dataset, we subsample examples for training and for validation. For testing, we subsample part of the examples in the test dataset (using stratified sampling).
The AGNews topic classification dataset was constructed from the AG dataset, a collection of more than 1 million news articles, by choosing the 4 largest categories of the original corpus. We divide the training samples of each category into a training and a validation set, and test on the full test dataset. In both datasets, the title of the abstract or article is available, but we do not use it. We use pretrained word vectors trained on Common Crawl [Grave:etal:2018] and keep them fixed. For both datasets, the vocabulary is built using only the most frequent words of the training and validation datasets. Code for preprocessing the datasets will be released along with the code for our model.
We consider as patterns all sets of consecutive words of a fixed length. Therefore, sampling a pattern is equivalent to sampling its start word, resulting in an efficient sampling model. The "Classic" model is a Bidirectional LSTM, while our model is based on convolutional layers. We monitor final prediction accuracy on the validation set and report results on the test set. For each set of hyperparameters, we run several random seeds and cross-validate hyperparameters on the average performance across seeds. We explore three different numbers of concepts, and we naturally consider each value separately, as the number of concepts directly affects the concept classifier's base performance. The full range of hyperparameters explored and the size of the architecture are listed below.
8.2 Architectures
Every word in the input is represented by a pretrained word-embedding vector. We use pretrained word vectors trained on Common Crawl [Grave:etal:2018] and keep them fixed. The Bidirectional LSTM (BiLSTM) we use for the "Classic" model has 1 layer. The BiLSTM processes each text input up to its maximum length; we then concatenate the final forward and backward hidden states. The concatenated vector is fed to a linear layer that returns the scores over all possible categories.
For our model, we consider patterns of a fixed size of 3 words: for an input text, each pattern is a combination of 3 consecutive words. We feed each pattern to a linear layer, followed by a softmax nonlinearity over the possible patterns, for each concept. We then sample one pattern per concept during training (at test time we take the most probable one). Finally, we take the dot product of a per-concept weight vector with the selected pattern, followed by a sigmoid activation function, to obtain the probability of that concept being present.
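Putting these pieces together, a sketch of the text concept extractor might look as follows. This is a NumPy illustration with invented names and shapes; the real model uses convolutional layers and learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def text_concept_extractor(emb, W_select, W_presence):
    """Sketch of the text concept extractor (illustrative names).

    emb:        (n_words, d) fixed word embeddings of the input text.
    W_select:   (n_concepts, 3*d) linear layer scoring each 3-word window.
    W_presence: (n_concepts, 3*d) per-concept weights for the presence
                probability.
    """
    n_words, d = emb.shape
    # every pattern is 3 consecutive words flattened to a 3*d vector,
    # so sampling a pattern amounts to sampling its start word
    windows = np.stack([emb[i:i+3].ravel() for i in range(n_words - 2)])
    z = np.zeros(len(W_select), dtype=int)
    for c in range(len(W_select)):
        scores = windows @ W_select[c]
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        i = rng.choice(len(probs), p=probs)   # arg-max at test time
        p = 1.0 / (1.0 + np.exp(-windows[i] @ W_presence[c]))
        z[c] = rng.random() < p
    return z
```

The enumeration of all start positions is what makes this extractor a (one-layer) convolution over the text.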
8.3 Hyperparameters considered
We try ranges of values for each hyperparameter. We use the Adam optimizer [Kingma:and:Ba:2014] with a fixed learning rate and batch size.
8.3.1 Detailed results
Table LABEL:tab:detailedres details the test performance reported in Table 1 of the main paper, for all hyperparameter values we tried. Figure A.4 shows final classification performance (y-axis) w.r.t. a posteriori concept accuracy (x-axis) on the two text datasets considered (left is DBPedia, right is AGNews). Each marker on a line corresponds to an increasing value of the concept-identification weight, and each line corresponds to a different value of the sparsity-constraint parameter. The horizontal orange lines denote the "Classic" model's classification performance (its concept classification performance is not computable, as it does not rely on the binary representation). Shaded areas denote the standard error of the mean (SEM) over the random training seeds. Both the table and the figure illustrate the clear tradeoff between final classification performance and concept consistency: the highest concept accuracies come with a much lower final accuracy. We also see that adding the sparsity constraint to our method does not improve much in terms of obtaining meaningful, consistent concepts (as measured by a posteriori concept accuracy), and can even hurt final performance.