Currently, Deep Learning (DL) models constitute the state of the art in many problems lecun2015deep; hinton2006reducing; xu2015show; vaswani2017attention; karpathy2014large. These models are opaque, complex and hard to debug, which makes their use unsafe in critical applications such as healthcare and high-risk scenarios. Furthermore, DL often requires a large amount of training data with over-simplified annotations that obviate an important part of centuries-long knowledge from domain experts. At the same time, DL models generally use correlation shortcuts to produce their outputs, which makes them finicky and difficult to correct. On the contrary, most classical symbolic AI approaches are interpretable but reach neither similar levels of performance nor scalability.
Among the potential solutions to clarify the decision process of a DL model, the topic of eXplainable AI (XAI) emerges. Given an audience, an XAI system produces details or reasons to make its functioning clear or easy to understand arrieta2020explainable; guidotti2018survey. To make black box Deep Learning methods more interpretable, a large number of works exposed their vulnerabilities and sensitivity and came up with visual interpretation techniques, such as attribution or saliency maps zeiler2014visualizing; selvaraju2017grad; olah2017feature. However, the explanations provided by these methods, often in the form of heatmaps, are not always enough, i.e., they are not easy to quantify, correct, or convey to non-technical audiences Jain2019AttentionIN; wiegreffe-pinter-2019-attention; viviano2021saliency; maguolo2020critic; he2020sample; adebayo2018sanity; kindermans2019reliability.
Having both specific and broad audiences of AI models contributes towards inclusiveness and accessibility, both part of the principles for responsible arrieta2020explainable and human-centric AI pisoni2021human. Furthermore, as advocated in diaz2020accessible, broadening the inclusion of different minorities and audiences can facilitate the evaluation of AI models when the objective is deploying human-centred AI systems.
A very critical challenge is thus to blend DL representations with domain expert knowledge. This leads us to draw inspiration from Neural-Symbolic (NeSy) learning dAvilaGarcez19NeSy; besold2017neural, a learning paradigm composed of both neural (or sub-symbolic) and symbolic AI components. An interesting challenge consists of bringing explainability to this fusion through the alignment of such learned and symbolic representations Bennetot19. In order to pursue this idea further, we approach this quest by considering the expert knowledge to be in the form of a knowledge graph (KG).
Since our ultimate objective is fusing DL representations and domain expert representations, to fill this gap we propose the eXplainable Neural-symbolic (X-NeSyL) learning methodology, to bring explainability in the process. The X-NeSyL methodology aims to make neural-symbolic models explainable, while providing more universal explanations for both end-users and domain experts. It is designed to enhance both the performance and the explainability of DL, in particular, of a convolutional neural network (CNN) classification model. The X-NeSyL methodology consists of three main components:
A symbolic processing component to process symbolic representations, in our case we model explicit knowledge from domain experts with knowledge graphs.
A neural processing component to learn neural representations, EXPLANet: eXplainable Part-based cLAssifying NETwork architecture. EXPLANet is a compositional deep architecture that classifies an object by its detected parts.
An XAI-informed training procedure, able to guide the model to align its outputs with the symbolic explanation and penalize it accordingly when this is not the case. We propose SHAP-backprop to align the representations of a deep CNN with the symbolic one from a knowledge graph, thanks to a SHAP Attribution Graph (SAG) and a misattribution function.
The choice of these components is designed to enhance a DL model by endowing its output with explanations at two levels:
Enhancement of the explanation at inference time: We extend the classifier inference procedure to not only classify, but also detect what will serve as basis for the explanation. It should be possible to specify these components through the symbolic component, e.g., a knowledge graph that acts as gold standard explanation from the expert. EXPLANet is proposed here to classify an object based on the detected object-parts, and thus has the role of facilitating the mapping of neural representations to symbols.
Enhancement of the explanation at training time: We penalize the original model at this second training phase, aimed towards improving the original classifier, thanks to an XAI technique called Shapley analysis lundberg2017unified that assesses the contribution of each feature to a model output. SHAP-backprop training procedure is presented to adjust the model using a misattribution function that quantifies the error coming from the contribution of features (object-parts) attributed to the output (expressed in a SHAP Attribution Graph, SAG) not in agreement with the theoretical contribution expressed by the expert knowledge graph.
Together with the X-NeSyL methodology, this paper contributes an explainability metric to evaluate the interpretability of the model, SHAP GED (SHAP Graph Edit Distance), that measures the degree of alignment between the symbolic (expert) and neural (machine) representations. The objective of this metric is to gauge the alignment between the explanation from the model and the explanation from the human target audience that validates it.
We illustrate the use of X-NeSyL methodology through a guiding use case on monument architectural style classification and its dataset named MonuMAI lamas2020monumai. We selected this dataset because it includes object-part-based annotations which make it suitable for assessing our proposal.
The pipeline components of the X-NeSyL methodology are summarized in Fig. 1. They are meant to complete a versatile template architecture with pluggable modular components to make possible the fusion of representations of different nature. X-NeSyL methodology can be adapted to the needs of the use case, and allows the model to train in a continual learning lesort2020continual setting.
The experiments to validate the X-NeSyL methodology make evident the well-known interpretability-performance trade-off with respect to traditional training, with an improvement of 3.6% with respect to the state of the art (MonuNet lamas2020monumai) on the MonuMAI dataset. In terms of explainability, our contributed interpretability metric, SHAP GED, reports a gain of up to 0.38 (from 0.93 to 0.55). The experimental study shows that the X-NeSyL methodology makes it possible for CNNs to gain explainability and performance.
The rest of this paper is organized as follows. First, we present the literature around XAI and compositional, part-based classifiers in Section 2. We present a set of frameworks on Neural-Symbolic integration as a basis and promising body of research to attain XAI in Section 3. We describe the X-NeSyL methodology in Section 4. Its core components are presented therein: Section 4.1.1 presents the symbolic component, i.e., how KGs can be used to represent symbolic expert knowledge to be leveraged by a DL model, Section 4.2 presents the neural representation component describing the EXPLANet architecture, and Section 4.3 presents the XAI-guided training method SHAP-Backprop. The X-NeSyL methodology is evaluated through the proposed explainability metric SHAP GED, presented and illustrated in Section 5. The complete methodology pipeline is illustrated through a driving use case on the MonuMAI cultural heritage application in Section 6. In Section 7 we discuss results, alternative perspectives, and open research avenues for the future. Finally, the Appendix includes additional experiments with an extra dataset, PASCAL-Part.
2 Related work: Explainable deep learning and compositional part-based classification
The Explainable AI literature is blooming in parallel with the advances of DL models, and so is the set of surveys doing a great job at classifying the various methods arrieta2020explainable; buhrmester2019analysis; guidotti2018survey. We particularly focus on attribution methods, i.e., XAI methods that relate a particular output of a DL model to its input variables. They can be model agnostic, but can also aim at improving the quality of the visualization, such as heatmaps, saliency maps or class activation methods. In the latter case, attribution studies what part of an input example is responsible for the network activating in a particular way olah2017feature; arrieta2020explainable.
This section reviews three types of XAI attribution methods, 1) local explanations, 2) saliency maps and 3) compositional part-based classification models.
2.1 Local explanations
Methods categorized as local explanations have one really important property: they are (most of the time) model agnostic and do not need access to the neural architecture. The idea consists of starting from any specific point in the input space and exploring its neighborhood to understand what has caused the prediction.
LIME ribeiro2016should is a technique that explains the prediction by learning a local interpretable model around the prediction. The proposed approach limits itself to linear models as local interpretable models. This method can be seen as an attribution method, as it tries to gauge the importance of each input feature for the final prediction.
When applied to images, LIME divides the input image into superpixels (such as SLIC achanta2012slic) as they tend to have more meaningful information than singular pixels for a human audience. The output consists of the superpixels that contribute the most to a specific class (a factual explanation). This idea can be expanded to superpixels that contribute negatively to the output. Furthermore, LIME proposes a framework (SP-LIME) to find representative examples and their explanations while avoiding redundant information. This approach is highly interesting for data augmentation purposes but it was not illustrated in the case of images.
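The local surrogate idea behind LIME can be illustrated with a minimal sketch. It is a simplification under stated assumptions and not the reference implementation: the black box is a hypothetical callable `predict` that scores a binary superpixel mask, the proximity kernel and its width are arbitrary choices, and plain weighted least squares replaces the sparse linear model used in practice.

```python
import numpy as np

def lime_superpixel_weights(predict, n_superpixels, n_samples=500,
                            kernel_width=0.25, seed=0):
    """Fit a locally weighted linear surrogate over binary superpixel masks.

    predict: hypothetical callable mapping a binary mask (1 = superpixel kept,
             0 = superpixel hidden) to the class score of the black-box model.
    Returns one coefficient per superpixel.
    """
    rng = np.random.default_rng(seed)
    # Random perturbations around the original instance (the all-ones mask).
    masks = rng.integers(0, 2, size=(n_samples, n_superpixels))
    masks[0] = 1  # include the unperturbed instance itself
    scores = np.array([predict(m) for m in masks], dtype=float)
    # Proximity kernel (assumed form): closer perturbations weigh more.
    distance = 1.0 - masks.mean(axis=1)  # fraction of hidden superpixels
    sample_weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # Weighted least squares with an intercept column.
    design = np.hstack([masks, np.ones((n_samples, 1))])
    sw = np.sqrt(sample_weights)[:, None]
    coef, *_ = np.linalg.lstsq(design * sw, scores * sw[:, 0], rcond=None)
    return coef[:-1]  # drop the intercept

# Toy black box: superpixel 0 supports the class, superpixel 2 opposes it.
def toy_model(mask):
    return 0.7 * mask[0] - 0.3 * mask[2] + 0.1

weights = lime_superpixel_weights(toy_model, n_superpixels=4)
```

Since the toy model is itself linear in the mask, the surrogate recovers its coefficients; for a real CNN the coefficients are only a local approximation around the explained image.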
Anchors ribeiro2018anchors is built on the idea of LIME, i.e., giving a local explanation thanks to a locally interpretable model, but aims at giving if-then rules called anchors. An anchor explanation is a rule that sufficiently anchors the prediction locally – an explanation such that changes to the rest of the feature values of the instance do not matter.
Similarly to LIME, anchors are not restricted to use with tabular data and can be extended to images thanks to the use of superpixels. Basically, the method selects a few superpixels that seem to be of most importance (anchors) and alters the rest of the pixels to verify whether the prediction stays the same.
Minimal input deformation fong2017interpretable proposes to study the neighborhood of a prediction by using a meaningful perturbation, mainly focusing on blur and noise, which are a kind of "natural" way to delete parts of the image. Having introduced the notion of meaningful deletion for images, the problem of finding the information is reduced to a deletion game where the smallest mask causing the confidence to drop by a certain amount (defined beforehand as a hyperparameter) is sought.
SHAP (SHapley Additive exPlanation) values lundberg2017unified is a method to explain individual predictions. SHAP is based on coalitional game theory and theoretically optimal Shapley Values. The goal of SHAP is to explain the prediction of an instance by computing the contribution of each feature to the prediction. The feature values of a data instance act as players in a coalition. The computed Shapley values tell us how to fairly distribute the "payout" (i.e., the prediction) among the features. A player can be an individual feature value (e.g., for tabular data), or a group of feature values. For example, to explain an image, pixels can be grouped into superpixels, and the prediction distributed among these superpixels. One innovation that SHAP brings to the table is that the Shapley value explanation is represented as an additive feature attribution method, a linear model molnar2020interpretable, and the desirable properties of local accuracy, missingness and consistency make it the most consistent with explanations from humans that understand the model. In this sense, SHAP analyses the additiveness of different coalitions of subfeatures in a more generic framework, while LIME can be seen as a particular case of SHAP assessing all features. It is also considered a global explanation method arrieta2020explainable where summary plots provide the average impact of a single or a set of samples on model output magnitude, but using all classes and model predictions.
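To make the coalition view concrete, the following minimal sketch computes exact Shapley values by enumerating all coalitions. It only illustrates the underlying game-theoretic definition, not the approximations implemented in SHAP; the payout function `toy_payout` and its three players (standing in for superpixels) are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_players):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    players = range(n_players)
    phi = [0.0] * n_players
    for i in players:
        others = [p for p in players if p != i]
        for size in range(n_players):
            # Probability weight of a coalition of this size preceding player i.
            weight = (factorial(size) * factorial(n_players - size - 1)
                      / factorial(n_players))
            for coalition in combinations(others, size):
                with_i = value_fn(frozenset(coalition) | {i})
                without_i = value_fn(frozenset(coalition))
                # Marginal contribution of player i to this coalition.
                phi[i] += weight * (with_i - without_i)
    return phi

# Hypothetical payout: player 0 acts alone, players 1 and 2 only count together.
def toy_payout(coalition):
    score = 0.0
    if 0 in coalition:
        score += 2.0
    if 1 in coalition and 2 in coalition:
        score += 1.0
    return score

phi = shapley_values(toy_payout, 3)
total = toy_payout(frozenset({0, 1, 2})) - toy_payout(frozenset())
```

In this toy game the interaction payout is split evenly between players 1 and 2, and the attributions sum to the difference between the full and the empty coalition, which is the local accuracy property mentioned above.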
2.2 Explanations as saliency maps
Saliency maps have been a very powerful tool in explaining deep CNNs, as they propose an easily interpretable map. Indeed, a saliency map is going to be, most of the time, a heatmap superimposed over an image that gives information about what parts of the image were useful for the prediction. These are, like SHAP and LIME, attribution methods, and they are pretty standard in terms of usage in post-hoc explanations of Computer Vision black box models.
Guided Backpropagation & DeconvNet springenberg2015striving; zeiler2014visualizing are probably the most famous and oldest methods. DeconvNets are based on the idea of basically running the model backwards to map activities from intermediate convolution layers back to the input pixel space. This is done thanks to deconvNets, as presented in zeiler2011adaptive, without the learning part. Basically, in order to examine a given ConvNet activation, all other activations in the layer are set to zero, and the feature maps are passed as input to the attached deconvNet layer. Then it is unpooled, rectified and filtered to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until the input pixel space is reached. Guided backpropagation is a variant of the standard deconvolution approach that is meant to work on every type of CNN, even if no pooling layers are present. The main difference lies in the way ReLU functions are handled in the different cases.
The main specificity of DeepLIFT shrikumar2017learning is that it computes importance scores based on differences with a reference (in the case of images, a black image). There is a connection between DeepLIFT and Shapley values. The Shapley values measure the average marginal effect of including an input over all possible orderings in which inputs can be included. If we define "including" an input as setting it to its actual value instead of its reference value, DeepLIFT can be thought of as a fast approximation of the Shapley values. It can as well be seen as an extension of SHAP for images.
LRP (layer-wise relevance propagation): bach2015pixel; binder2016layer introduce a novel way to consider the operation done inside a neural architecture with the concept of relevance. Given their definition of relevance and adding certain properties, relevance intuitively corresponds to the local contribution to the prediction function f(x). The idea of LRP is to compute feature relevance thanks to a backward pass and thus, it yields a pixel-wise heat map.
PatternNet & PatternAttribution kindermans2018learning take the previous work a step further by applying a proper statistical framework to the intuition behind it. More precisely, they build on slightly more recent work called DTD (deep Taylor decomposition) as introduced in montavon2017explaining. The key idea of DTD is to decompose the activation of a neuron in terms of contributions from its inputs. This is achieved using a first-order Taylor expansion around a root point. The difficulty in the application of DTD is the choice of the root point, for which many options are available. PatternAttribution is a DTD extension that learns from data how to set the root point. This way the function extracts the signal from the data, and it maps attribution back to the input space, which is the same idea of relevance. PatternNet yields a layer-wise back-projection of the estimated signal to the input space.
CAM (Class Activation Mapping) zhou2016learning aims to leverage the effect of Global Average Pooling layers to localize deep representations in CNNs. A class activation map for a particular category indicates the discriminative image regions used by the CNN to identify that category. The process for the basic CAM approach is to put a Global Average Pooling layer on top of the convolution network and then perform classification. The layer right before the Global Average Pooling is then visualized.
Grad-CAM & Grad-CAM++ selvaraju2017grad; chattopadhay2018grad emerged from the need to go faster than CAM and to avoid an additional training procedure. The idea of class activation mapping is kept, but the weights on the feature maps (convolution layers) are built from the backpropagated gradient of the score given to a specific class. The advantage is that no architectural changes or re-training are needed, contrary to the architectural constraints of the CAM approach. With a simple change, Grad-CAM can similarly provide counterfactual activations for a specific class. Grad-CAM++ expands Grad-CAM with an improved way to process the weights of the feature maps. Grad-CAM is widely used for interpreting CNNs.
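As a minimal sketch of the Grad-CAM weighting, the function below combines feature maps using their spatially averaged gradients. We assume that the activations of the last convolutional layer and the gradients of the class score with respect to them have already been extracted from a trained CNN; the toy arrays are hypothetical, and upsampling the map to the input resolution is omitted.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap for one image.

    feature_maps: array (K, H, W), activations of the last convolutional layer.
    gradients:    array (K, H, W), gradient of the class score w.r.t. feature_maps.
    Both are assumed to be extracted beforehand from the trained CNN.
    """
    # One weight per feature map: global average pooling of the gradients.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps over the K channels, shape (H, W).
    cam = np.tensordot(weights, feature_maps, axes=1)
    # ReLU keeps only regions with a positive influence on the class score.
    return np.maximum(cam, 0.0)

# Toy example with two 2x2 feature maps and constant gradients.
feature_maps = np.array([[[1.0, 0.0], [0.0, 0.0]],
                         [[0.0, 0.0], [0.0, 1.0]]])
gradients = np.stack([np.ones((2, 2)), -np.ones((2, 2))])

heatmap = grad_cam(feature_maps, gradients)
# Negating the gradients highlights regions that lower the class score.
counterfactual = grad_cam(feature_maps, -gradients)
```

Here the first feature map supports the class and the second opposes it, so the heatmap highlights the top-left location while the counterfactual map highlights the bottom-right one.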
Score-CAM wang2020score is a novel approach in between CAM-like approaches and local explanations. The main idea behind this new approach is that it does not need to backpropagate any signal inside the architecture and as such, only a forward pass is needed.
Integrated gradient sundararajan2017axiomatic proposes a new way to look at the issue. The underlying idea relies on the complexity of ensuring that a visualisation is correct besides being visually appealing and making sense. Here it introduces two axioms that attribution methods should follow, called sensitivity and implementation invariance.
Sensitivity: An attribution method satisfies Sensitivity if for every input and baseline that differ in one feature but have different predictions, the differing feature is given a non-zero attribution.
Implementation invariance: Two networks are functionally equivalent if their outputs are equal for all inputs, despite having very different implementations. Attribution methods should satisfy Implementation Invariance, i.e., attributions should always be identical for two functionally equivalent networks.
Some easily applicable sanity checks can be done to verify that a method is dependent on the parameters and the training set adebayo2018sanity. For instance, integrated gradients stem from path methods and satisfy both preceding axioms. We consider the straight line path (in $\mathbb{R}^n$) from the baseline $x'$ to the input $x$, and compute the gradients at all points along the path. Integrated gradients are obtained by accumulating these gradients. Specifically, integrated gradients are defined as the path integral of the gradients along the straight line path from the baseline $x'$ to the input $x$.
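The path integral can be approximated numerically by averaging gradients at points sampled along the straight line between baseline and input. The sketch below assumes access to a gradient function `grad_fn`; the toy function `f`, its analytic gradient, and the number of steps are illustrative choices rather than part of the method.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Riemann (midpoint) approximation of integrated gradients."""
    # Interpolation coefficients in (0, 1), one per point on the path.
    alphas = (np.arange(steps) + 0.5) / steps
    # Points on the straight line from the baseline to the input.
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([grad_fn(p) for p in path])
    # Average gradient along the path, scaled by the input difference.
    return (x - baseline) * grads.mean(axis=0)

# Toy differentiable "model" and its analytic gradient.
def f(x):
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):
    return np.array([2.0 * x[0], 3.0])

x = np.array([2.0, 1.0])
baseline = np.zeros(2)
attributions = integrated_gradients(grad_f, x, baseline)
```

For this toy function the attributions sum to the difference between the output at the input and at the baseline, a property known as completeness in sundararajan2017axiomatic.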
All these methods seem visually appealing, but most of them rely on heuristics about what we want to look at, more or less well defined. Some works have been proposed to check the validity of such methods as tools to understand the underlying process inside neural architectures.
It has also been highlighted by the research community that saliency methods must be used with caution, and not blindly trusted, given their sensitivity to data and training procedures viviano2021saliency; kindermans2019reliability.
2.3 Compositional Part-based Classification Models
Compositionality andreas2019measuring in computer vision refers to the notion or capacity to represent complex concepts (from objects to procedures to beliefs) by combining simple parts fodor2002compositionality; andreas2019measuring. Although CNNs are not inherently compositional, compositionality is a desirable property for them to learn stone2017teaching. For instance, hand-written symbols can be learned from only a few examples using a compositional representation of the strokes lake2015human. The compositionality of neural networks has also been regarded as key to integrate symbolism and connectionism hupkes2019compositionality; mao2019neuro.
Part-based object recognition is an example of semantic compositionality and a classic paradigm, where the idea is to gather local level information to make a global classification. In de1999object, the authors propose a pipeline that first groups pixels into superpixels, then does segmentation at the superpixel-level, transforming this segmentation into a feature vector and finally classifying the global image thanks to this feature vector. Similar work is proposed by huber2004parts, where they extend it to 3D data. Here the idea is to classify part of the image into a predefined class, and then use those intermediate predictions to provide a classification of the whole image. The authors of bernstein2005part also define mid-level features that capture local structure such as vertical or horizontal edges, Haar filters and so on. However, they are closer to dictionary learning than to the work we propose in this paper.
One of the best-known object part detection models is felzenszwalb2009object. It provides object detection based on mixtures of multiscale deformable part models, based on data mining of hard negative examples with partially labelled data to train a latent SVM. Evaluation is done on the PASCAL object detection challenge (PASCAL VOC benchmark).
Finally, more recently, semi-supervised processes were developed such as ge2019weakly. They propose a two-step neural architecture for fine-grained image classification aided by local detections. The idea is that positive proposal regions highlight varied complementary information, and that all this information should be used. In order to do that, first an unsupervised detection model is built by alternately applying a CRF and Mask-RCNN (given an initial approximation with CAM). Then, the positive region proposals given by the detection model are fed to a Bi-Directional LSTM that produces a meaningful feature vector accumulating information across all regions, which is then used to classify the image. It can be seen as unsupervised part-based classification.
Except for the last one, these models predate the era of the democratization of deep learning. With deep learning, end-to-end pipelines were prioritized, with the expectation that abstract machine level representations become stronger than handcrafted features. In our work we propose to still use an intermediate representation, since machine level deep representations are hard to interpret, yet they represent the most performant models to date. However, we probe this intermediate deep representation and compare it with domain knowledge. The objective is aligning the model output explanations with those of the expert, during training.
3 Neural-Symbolic (NeSy) Integration models
One approach to merge deep representations with symbolic knowledge representation and/or add explainability to deep neural networks such as CNNs is through Neural-Symbolic (NeSy) integration. NeSy integration aims at joining standard symbolic reasoning with neural networks in order to achieve the best of both fields and soften their limitations. A complete survey of this method is provided in dAvilaGarcez19NeSy; GarcezLG2009. Indeed, symbolic reasoning is able to work in the presence of little data as it constrains entities through relations. However, it has limited computational properties and needs background knowledge. On the other hand, neural networks are fast and able to infer knowledge. However, they require a lot of data and have limited reasoning properties. Their integration overcomes these limitations and, as stated by BianchiPHS19; TownsendCM20, improves the explainability of the learned models. In the following, we present the main NeSy frameworks.
Many NeSy frameworks treat logical rules as constraints to be embedded in a vector space. In most cases these constraints are encoded into the regularization term of the loss function in order to maximize their own satisfiability. Logic Tensor Networks SerafiniG16 and Semantic Based Regularization DiligentiGS16 perform the embedding of First-Order Fuzzy logic constraints. The idea is to jointly maximize the fit to the training data and the satisfaction of the constraints. Both methods are able to learn in the presence of constraints and perform logical reasoning over data. In Semantic Based Regularization the representations of logical predicates are learnt by kernel machines, whereas Logic Tensor Networks learn the predicates with tensor networks. Other differences regard the handling of the existential quantifier: skolemization for Logic Tensor Networks, or conjunction of all possible groundings for Semantic Based Regularization. Logic Tensor Networks have been applied to semantic image interpretation by Donadello17 and to zero-shot learning by DonadelloS19. Semantic Based Regularization has been applied, for example, to the prediction of protein interactions by SaccaTDP14 and to image classification DiligentiGS16. Both Logic Tensor Networks and Semantic Based Regularization show how background knowledge is able to i) improve the results and ii) counterbalance the effect of noisy or scarce training data. Minervini018 proposed a regularization method for the loss function that leverages adversarial examples: the method first generates samples that maximize the unsatisfaction of the constraints, then the neural network is optimized to increase their satisfaction. KriekenAH19 proposed another regularization technique applied to Semi-Supervised Learning where the regularization term is calculated from the unlabelled data. In the work of XuZFLB18, propositional knowledge is injected into a neural network by maximizing the probability of the knowledge to be true.
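The general idea of encoding a logical constraint as a regularization term can be sketched as follows. This is a generic illustration under our own assumptions and not the formulation of any of the cited frameworks: the rule has the form "for all x, A(x) implies B(x)", the implication uses the Reichenbach fuzzy operator, the universal quantifier is approximated by the mean over samples, and the trade-off weight `lam` is a hypothetical hyperparameter.

```python
import numpy as np

def implies(a, b):
    """Reichenbach fuzzy implication 1 - a + a*b, for truth degrees in [0, 1]."""
    return 1.0 - a + a * b

def constraint_penalty(p_antecedent, p_consequent):
    """One minus the satisfaction of 'forall x: A(x) -> B(x)' (forall as mean)."""
    return 1.0 - implies(p_antecedent, p_consequent).mean()

def regularized_loss(data_loss, p_antecedent, p_consequent, lam=0.5):
    """Data-fitting loss plus a weighted penalty for violating the rule."""
    return data_loss + lam * constraint_penalty(p_antecedent, p_consequent)

# Hypothetical network outputs for two samples: truth degrees of A(x) and B(x).
p_a = np.array([1.0, 0.0])
p_b = np.array([0.0, 0.3])
loss = regularized_loss(data_loss=1.0, p_antecedent=p_a, p_consequent=p_b)
```

The first sample violates the rule (A holds but B does not) and is penalized, while the second satisfies it vacuously because A is false.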
Other works use different techniques but keep the idea of defining logical operators in terms of differentiable functions (e.g., TowellS94; GarcezZ99). Relational Neural Machines is a framework developed by MarraDGGM20 that integrates neural networks with a First-Order Logic reasoner. In the first stage, a neural network computes the initial predictions for the atomic formulas, whereas, in the second stage, a graphical model represents a probability distribution over the set of atomic formulas. Another strategy is to directly inject background knowledge into the neural network structure as done by danieleKENN. Here, the knowledge is injected in the model by adding new layers to the neural network that encode the fuzzy-logic operator in a differentiable way. Then, the background knowledge is enforced both at inference and training time. In addition, weights are assigned to rules as learnable parameters. This allows for dealing with situations where the given knowledge contains errors or is softly satisfied by the data without a priori knowledge about the degree of satisfaction.
The combination of logic programming with neural networks is another exploited NeSy technique. Neural Theorem Prover RocktaschelR16 is an extension of the logic programming language Prolog where the crisp atom unification is softened by using a similarity function of the atoms projected in an embedding space. Neural Theorem Prover defines a differentiable version of the backward chaining method (used by Prolog) with the result of learning a latent predicate representation through an optimisation of their distributed representations. DeepProbLog manhaeve2018deepproblog integrates probabilistic logic programming (ProbLog by RaedtKT07) with (deep) neural networks. In this manner, the explicit expressiveness of logical reasoning is combined with the abilities of deep nets.
In terms of the latest graph attention mechanisms, an application for NeSy scene detection using graph neural networks is in sharifzadeh2020classification. They show how triple-based schema representations of non-expert but instead relational knowledge can be used as inductive bias to learn better representations in the task of scene representation graph, predicate and object classification. The inductive bias of encoding relational prior knowledge enables its propagation and model fine-tuning with external triple data.
When it comes to learning from both experts and data, theoretical and empirical studies show that it is always more efficient to learn from both than using the best of two models. besson2019learning confirm this by combining expert knowledge (in the form of marginal probabilities and rules) with empirical data and apply it to learning the probability distribution of the different combinations of symptoms of a given disease. This approach is useful in cases when there is not enough data to learn without experts, but enough to correct them if needed. In the X-NeSyL methodology we will assume this expert ground truth to be the most predominant one, even if not the only one, and thus, the one considered as gold standard. We will assume the latter matches the domain experts' knowledge represented in another modality with a symbolic representation, alternative to the traditional use of DL dataset labels.
Finally, knowledge distillation, used by hu2016harnessing, is also used as a NeSy technique. Here, symbolic knowledge is extracted from a trained "teacher" network. This knowledge is used as a regularization term for training a "student" network. The latter emulates the teacher network whereas the teacher is trained by reducing the KL-Divergence with the student network.
Other systems use external knowledge in the form of linked data or ontologies to link inputs and outputs to background knowledge by using a symbolic learning system to generate an explanatory theory tiddi2020knowledge; sarker2020wikipedia; ebrahimi2021towards.
4 EXplainable Neural-Symbolic (X-NeSyL) learning methodology
One challenge of the latest DL models today is producing not only accurate but also reliable outputs, i.e., outputs whose explanations agree with the ground truth, and even better, agree with a human expert on the subject. The X-NeSyL methodology is aimed at filling this gap, and getting model outputs and experts' explanations to coincide. In order to tackle the concrete problem of fusing DL representations with domain expert knowledge in the form of knowledge graphs, in this section we present the three main ingredients that compose the X-NeSyL methodology: 1) the symbolic knowledge representation component, 2) the neural representation learning component, and 3) the alignment mechanism for both representations to align, i.e., correct the model during training or penalize it when disagreeing with the expert knowledge.
First, in Section 4.1 we present the symbolic component that serves to endow the model with interpretability (which will be in the form of knowledge graphs), then in Section 4.2 the neural representation learning component (that will serve to reach the best performance) and finally, in Section 4.3 the XAI-guided training procedure that makes both components align with SHAP-Backprop during training of the DL model.
4.1 Symbolic knowledge representation for including human experts in the loop
Symbolic AI methods are interpretable and intuitive (e.g., they use rules, language, ontologies, fuzzy logics, etc.). They are normally used for knowledge representation. Since we advocate for leveraging the best of both symbolic and neural representation learning currents, in order to make the latter more explainable, here we choose a simple form of representing expert knowledge, with knowledge graphs. Right after, in order to demonstrate the practical usage of the X-NeSyL methodology, we present the running use case using knowledge graphs that will illustrate this methodology throughout the paper.
4.1.1 Knowledge Graphs
Different options exist to leverage a KG as a versatile element to convey explanations lecue2020role. We draw inspiration from NeSy frameworks for XAI using ontologies and KGs bollacker2019extending; Bennetot19; confalonieri2021using, from explanations of image and tabular data-based models and, more broadly, from the XAI literature guidotti2018survey. We focus more precisely on attribution methods, which try to measure the importance of the different parts of the input toward the output. We provide a formalization of the domain expert data into a semantic OWL2-based KG that is actually leveraged by the detector and classifier DL model.
In this work we present a new training procedure to enhance the interpretability of part-based classifiers, given an appropriate KG. It is based on Shapley values (or SHAP) lundberg2017unified, which output the feature attribution of the various part elements toward the final classification, which we then compare with the KG. We use the SHAP information to weight the loss that we backpropagate le1989handwritten at training time.
Alongside standard images and annotations we have in our various datasets, we also have expert knowledge information. This information is usually encoded in knowledge graphs (KGs), such as the one in Figure 7.
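A knowledge graph $\mathcal{G}$ is formalized as a subset of triples from $\mathcal{E} \times \mathcal{R} \times \mathcal{E}$, with $\mathcal{E}$ the set of entities and $\mathcal{R}$ the set of relations. A single triple $(e_i, r, e_j)$ means that entity $e_i$ is related to $e_j$ through relation $r$. In the context of part-based classification, such a graph encodes the part-of relationship (that is, $\mathcal{R} = \{\text{part-of}\}$) between elements (parts) and the (whole) object they belong to.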
The attribution graph encodes whether an element contributes positively or negatively towards a prediction. This way the can be rewritten as (one entry for each macro label) with (one entry for each element that is part-of the object), .
If a link between an element and a macro (object-level) label exists in the theoretical KG, then it means such element is typical of that label and should count positively toward this prediction, thus, its entry in the matrix representing the KG is equal to . If no such link exists, then it means it is not typical of the macro label and should contribute negatively, thus its entry in the matrix is equal to . In our case we choose values of the KG edges to be binary, and since we set = 1, . Seeing a KG as a feature attribution graph is not the only way to model a KG; we can also encode KGs as ontologies. It is worth mentioning that ontologies can be seen as a set of triples with the format (subject, predicate, object) or (subject, property, value) where edges can have varying semantic meaning following constraints from Description Logics Baader07.
Modeling the graph as an adjacency matrix is not appropriate since architectural style nodes and architectural elements nodes are playing two very different roles. Instead, we model the graph as a directed graph, with edges from the architectural element toward the architectural styles.
4.1.2 A driving use case on cultural heritage: MonuMAI architectural style facade image classification
The latest deep learning models have focused on (whole) object classification. We choose part-based datasets as a straightforward way to leverage extra label information to produce explanations that are compositional and very close to human reasoning, i.e., explaining a concept or object based on its parts.
In this work, we focus on the MonuMAI (Monument with Mathematics and Artificial Intelligence) lamas2020monumai citizen science application and the corresponding dataset collected through the application, because it provides the required compositional labels in an object detection task based on object parts. At the same time, facade classification by pointing out relevant architectonic elements is an interesting use case for XAI. We use this example throughout the article as a guiding use case that serves to demonstrate the usage of our part-based model and pipeline for explainability.
The MonuMAI project has been developed at the University of Granada (Spain) and has involved citizens in creating and increasing the size of the training dataset through a smartphone app (the mobile app is available on the project website: monumai.ugr.es).
The MonuMAI dataset
The MonuMAI dataset supports architectural style classification from facade images; it includes high quality photographs, where the monument facade is centered and fills most of the image. Most images were taken with smartphone cameras through the MonuMAI app. The rest of the images were selected from the Internet. The dataset was annotated by art experts for two tasks, image classification and object detection, as shown in Figure 6. All images show facades of historical buildings that are labelled with one out of four different styles (detailed in Table 2 and Table 1): Renaissance, Gothic, Baroque and Hispanic-Muslim. Besides this image-level label, every image is labeled with key architectural elements, each belonging to one of fourteen categories (the element counts are detailed in Table 1). Each element is supposed to be typical of one or two styles, and should rarely appear in facades of the other styles. Examples for each style and each element are in Fig. 2 and 3, while the MonuMAI dataset labels used are shown in Figs. 4 and 5.
Table 1 (columns): Architectural element | Count | Element rate (%) | Architectural style
Table 2 (columns): Architectural style | #Images | Ratio (%)
Apart from the MonuMAI dataset, and in order to draw more general conclusions from our work, we used a dataset with a hierarchy similar to that of MonuMAI. Additional results for the PASCAL-Part chen2014detect dataset are in the Appendix.
MonuMAI’s Knowledge Graph
The original design of the MonuMAI dataset and the MonuNet baseline architecture lamas2020monumai uses the KG exclusively as a design tool to visualize the architectural style of a monument facade based on the identified parts, but the KG is not explicitly used in the model. In contrast, we go further and use the KG explicitly, in order to guarantee a reproducible and explainable decision process that aligns with the expert knowledge. We will see in Section 4.3.3 how KGs can be used during training in a detection + classification architecture, since EXPLANet is designed to incorporate the knowledge in the KG. Besides the gain in trust, we aim at easing the understanding of the flaws and limitations of the model, along with its failure cases. This way, requests for new data from experts would be backed up by proper explanations, and targeting new and relevant data collection would be easier.
The KG corresponding to MonuMAI dataset has only fourteen object classes and four architectural styles. Each architectural element is linked to at least one style. Each link between the two sets symbolizes that an element is typical and expected in the style it is linked to.
Renaissance: rounded arch, triangular pediment, segmental pediment, porthole, lintelled doorway, serliana.
Baroque: rounded arch, lintelled doorway, porthole, broken pediment, solomonic column.
Hispanic-Muslim: flat arch, horseshoe arch, lobed arch.
Gothic: trefoil arch, ogee arch, pointed arch.
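As an illustration, the element-style links listed above can be turned into a signed attribution matrix of the kind described in Section 4.1.1. The following is a minimal Python sketch, not the authors' implementation: the names `KG_LINKS` and `kg_matrix`, the row and column ordering, and the encoding of typical and non-typical links as +1 and -1 are assumptions made for this example.

```python
# Style -> typical architectural elements, transcribed from the list above.
KG_LINKS = {
    "Renaissance": ["rounded arch", "triangular pediment", "segmental pediment",
                    "porthole", "lintelled doorway", "serliana"],
    "Baroque": ["rounded arch", "lintelled doorway", "porthole",
                "broken pediment", "solomonic column"],
    "Hispanic-Muslim": ["flat arch", "horseshoe arch", "lobed arch"],
    "Gothic": ["trefoil arch", "ogee arch", "pointed arch"],
}

STYLES = list(KG_LINKS)  # rows (macro labels)
ELEMENTS = sorted({e for els in KG_LINKS.values() for e in els})  # columns (parts)


def kg_matrix(links, styles, elements):
    """Return a len(styles) x len(elements) matrix with +1 for a link
    present in the KG and -1 otherwise (assumed binary encoding)."""
    return [[1 if e in links[s] else -1 for e in elements] for s in styles]


KG = kg_matrix(KG_LINKS, STYLES, ELEMENTS)
```

With this encoding, each row can be compared entry by entry with the sign of the SHAP values computed for the corresponding style.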
MonuMAI’s KG is depicted in Figure 7, where the root is the Architectural Style class (which inherits from the Thing top-most class in OWL). Note there is one more dimension in the KG, the leaf level of the original MonuMAI graph in lamas2020monumai that represents some characteristics of the architectural elements, but it is not used in the current work.
We also explored the possibility of rewriting the looser structure captured in the KG as an ontology, using the OWL2 format. We did not limit ourselves to copying the hierarchy of the original KG, but rather added some categories to keep the ontology flexible and allow further expansions in the future. Three main classes are modelled in this ontology. A Facade represents an input image as a concept. A facade is linked to one and only one ArchitecturalStyle (in this study, as in MonuMAI, we represent the predominant one; future work could consider the blend of more than one present style) through the relation exhibitsArchStyle, for which four styles can be used (others could be added by defining new classes). A facade can be linked to any number of ArchitecturalElement instances identified on it through the relation (i.e., OWL object property) hasArchElement.
ArchitecturalElement represents the class of architectural elements identified before, and is divided into subcategories based on the type of element, such as "Arch" or "Window". This subcategorization, which does not exist in the original KG, was designed with the possibility of adding constraints between subcategories, e.g., that an "arch" is probably higher in space than a "column", or at least that its lowest point is higher than a column's lowest point. Such geometrical or spatial constraints were not explored further, as they would require extra modelling expertise from architecture experts, but could be added in future work.
Finally, the concept ArchitecturalElement is linked to an ArchitecturalStyle object through the object property isTypicalOf.
This ontology formulation allows us to see the problem of style classification as a problem of KG edge detection between a facade instance and a style instance. This approach was unsuccessful (discussed in Section 6.4).
The KG formulation presented in Section 4.1.1 can be seen as a semantic restriction of the ontology we propose, where we kept only the triples including isTypicalOf relation and expanded the KG with a virtual relation isNotTypicalOf, to link together all elements with all the styles. This way the KG is a directed graph with edges going from the architectural element toward the architectural style. Because we restrict ourselves to only one relational object property and its inverse, the edges bear either positive or negative information, which motivates our modeling choice of having value for formulated edges.
4.2 EXPLANet: Expert-aligned eXplainable Part-based cLAssifier NETwork Architecture
The previous section detailed the symbolic representation mechanism within the X-NeSyL methodology. While KGs serve the purpose of interpretable knowledge, in this section we present the neural representation learning component, mainly responsible for high performance in today's AI systems.
Our ultimate goal in this work is making DL models more trustworthy when it comes to the level of their explanations, and their agreement with domain experts. We will thus follow a human-in-the-loop holzinger2019interactive approach.
Typically, to identify the class of a given object, e.g., an aeroplane, a human first identifies the key parts of that object, e.g., left wing, right wing, tail; then, based on the combination of these elements and the importance of each single element, he/she concludes the final object class.
We focus on compositional part-based classification because it provides a common framework to assess part-based and whole-object-based explanations. To achieve this, we want to force the model to align with a priori expert knowledge. In particular, we built a new model called EXPLANet (Expert-aligned eXplainable Part-based cLAssifier NETwork Architecture), whose design is inspired by the way humans identify the class of an object.
EXPLANet is a two-stage classification model, as depicted in Figure 8. The first stage detects the object-parts present in the input image and outputs an embedding vector that encodes the importance, quantity and combinations of the detected object-parts. This information is used by the second stage to predict the class of the whole object present in the input image. More precisely:
The first stage is a detection module, which can be a detector such as Faster R-CNN ren2015faster or RetinaNet lin2017focal. Let us consider that there are $n$ object-part classes. This module is trained to detect the key object-part classes present in the input image, and outputs $M$ predicted regions. Each one is represented by its bounding box coordinates and a vector of size $n$ representing the probabilities of the $n$ object-part classes. Let us denote $p_m \in \mathbb{R}^n$ (with $m \in [1, M]$) the probability vector of detected object-part $m$. First we process every $p_m$ by setting its non-maximal probabilities to zero, and denote this new score $z_m$, which is also a vector of size $n$. Let us denote vector $v \in \mathbb{R}^n$ the final descriptor of the image. We build $v$ by accumulating the probabilities of the $z_m$ such that:
$$v = \sum_{m=1}^{M} z_m$$
Vector $v$ aggregates the confidence of each predicted object-part. Large values in $v$ mean that the input image contains a large number of object-parts with high confidence predictions, whereas a low value means that predictions had low confidence. Intermediate values are harder to interpret, as they could stem from a small number of high confidence predictions or a large number of low confidence predictions, but the idea is that there are probably some objects of these kinds in the image. This object-parts vector can be seen as tabular data where each object part can be considered a feature (to be explained later by an XAI method). We will see in the next section how a SHAP analysis can study the contribution of each object part present in the image to the final object classification. Note that this aggregation scheme is used for Faster R-CNN. For RetinaNet we aggregate by summing all probabilities (instead of keeping only the maximum probability of each detected object), as we found this to be more stable for training the RetinaNet framework.
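The two aggregation schemes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `aggregate_max` and `aggregate_sum`, the representation of detections as plain lists of per-class probabilities, and the toy probabilities are hypothetical and not taken from the original implementation.

```python
def aggregate_max(box_probs, num_classes):
    """Faster R-CNN-style scheme: for each predicted box keep only its
    highest class probability (others set to zero), then sum over boxes."""
    descriptor = [0.0] * num_classes
    for probs in box_probs:
        best = max(range(num_classes), key=lambda c: probs[c])
        descriptor[best] += probs[best]
    return descriptor


def aggregate_sum(box_probs, num_classes):
    """RetinaNet-style scheme: sum all class probabilities over all boxes."""
    descriptor = [0.0] * num_classes
    for probs in box_probs:
        for c in range(num_classes):
            descriptor[c] += probs[c]
    return descriptor


# Toy example: three predicted boxes, three object-part classes.
boxes = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
v_max = aggregate_max(boxes, 3)  # class 0 accumulates 0.7 + 0.6, class 2 gets 0.8
v_sum = aggregate_sum(boxes, 3)  # column-wise sums of all probabilities
```

Either descriptor has one entry per object-part class and is what the second stage and the SHAP analysis receive as input.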
The second stage of EXPLANet is a classification network, which is actually a two-layer multi-layer perceptron (MLP), that uses the embedding information (i.e., takes the previous detector output as input) to perform the final classification. This stage outputs the final object class based on the importance of the present key object parts detected in the input image.
The goal of such a design is to facilitate the reproduction of the thought process of an expert, which is to first localize and identify key elements (e.g., in the case of architectural style classification of a facade, various types of arches or columns) and then use this information to deduce the final class (e.g., its overall style). However, the EXPLANet architecture alone does not control for expert knowledge alignment. The next section introduces the next step of the pipeline, an XAI-based training procedure and loss function to actually verify that this alignment happens and, when this is not the case, correct the learning.
4.3 SHAP-Backprop: An XAI-informed training procedure and XAI loss based on the SHAP attribution graph (SAG)
After having presented the symbolic and neural knowledge processing components of X-NeSyL, we proceed to detail the XAI-informed training procedure to make the most of the best of both worlds, interpretable representations, and deep representations.
More concretely, this section presents how to use a model agnostic XAI technique to make a DL (CNN-based) model more explainable by aligning the test-set feature attribution with the expert theoretical attribution. Both knowledge bases will be encoded in KGs.
4.3.1 SHAP values for explainable AI feature contribution analysis
SHAP is a local explanation method lundberg2017unified; molnar2020interpretable that, for every single prediction, assigns to each feature an importance value regarding that prediction. It tells whether a feature contributed to the current prediction and how strongly it contributed. These are the Shapley values of a conditional expectation function of the original model. In our case we computed them with Kernel SHAP lundberg2017unified.
Similarly to LIME ribeiro2016should, Kernel SHAP is a model agnostic algorithm to compute SHAP values. In LIME, the loss function, weighting kernel and regularization term are chosen heuristically, while in SHAP they are chosen in a way that they satisfy the SHAP properties. See details in lundberg2017unified.
The idea of computing SHAP values is to check whether object parts have the expected importance for the object class prediction (e.g., whether the presence of a horseshoe arch contributes to the Hispanic-Muslim class). SHAP computation always happens on a per-class basis, as the computation concerns a binary classification (belonging to the class vs. not).
In our part-based pipeline, we apply SHAP at the tabular data level, i.e., after the aggregation function. As such, SHAP’s only input is the feature vector that contains the information about parts detected before. Throughout this section, when we refer to feature value, we refer to this feature vector, and a feature value means one entry of this vector. As such, each feature value encodes the information about one element (either an architectural element for MonuMAI or an object part for PASCAL-Part) from our knowledge model. The final class prediction performed afterwards is done by the classification module of our part-based model, given such a feature vector.
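To make the notion of a per-feature attribution concrete, the sketch below computes exact Shapley values for a tiny feature vector by enumerating all feature subsets. It is not Kernel SHAP, which approximates these values by sampling and weighted regression; here, absent features are simply replaced by a fixed baseline value, which is a simplifying assumption. The function `toy_score` and its weights are hypothetical stand-ins for the score of one class.

```python
from itertools import combinations
from math import factorial


def exact_shapley(f, x, baseline):
    """Exact Shapley values of f at x, where a feature outside the
    coalition takes its baseline value (assumed value function)."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight of a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_score(z):
    # Hypothetical linear class score over three part features.
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[2]


x = [1.0, 1.0, 0.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_score, x, baseline)
```

For a linear score such as this one, each value reduces to the feature weight times the difference between the feature and its baseline, and the values sum to the difference between the score at `x` and at the baseline.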
In Figs. 9 and 10 we performed the SHAP analysis over the whole validation set. In practice, this means that SHAP values were computed for each element of the validation set and plotted on the same graph. Then, for each feature of the feature vector (in our case, for each architectural element), we plot all SHAP values found in the dataset for this specific element, and we color them based on the feature value. They are plotted line-wise and each dot represents the feature value of a specific datapoint, i.e., image. High feature values (relative to the range they can take) are colored pink and low feature values blue. Here, if an element is detected several times or with high detection confidence, it will be pink (blue for lower detection confidence or lower frequency). The SHAP values are shown horizontally, where high (absolute) values have a high impact on the prediction (see the tutorial at https://christophm.github.io/interpretable-ml-book/shap.html and the SHAP source code at https://github.com/slundberg/shap).
If we compare the SHAP plots with the KG, here we do not observe any large amount of outliers or datapoints not coinciding with the domain expert KG acting as ground truth (in Fig. 12 right). We now need to be able to use this information automatically. Pink and blue (high and low) values of datapoint features can appear both in right and left sides of the plots, meaning its value can contribute towards predicting the considered class or not, respectively. However, in our case, only pink datapoints being on the positive (right) side of SHAP plot represent the correct behaviour if such element is also present in the KG. In that case, their feature value loss will not be penalized during training, as they match the expert KG (considered as GT). The rest of datapoints’ SHAP values (blue in right side, pink and blue in left side) will be used by SHAP-Backprop to correct the predicted object class.
An example of computation of SHAP values on a single feature vector is in Table 3. On the right there is the feature vector, and on the left the SHAP values for each object class for each object part. We highlighted in green positive values and in red negative values.
4.3.2 SAG: SHAP Attribution Graph to compute an XAI loss and explainability metric
By measuring how interpretable our model is, in the form of a KG, we want to be able to tell whether the decision process of our model is similar to how an expert mentally organizes their knowledge. As highlighted in the previous section, thanks to SHAP we can see how each feature value impacts the predicted macro label and thus how each part of an object class impacts the predicted label. Based on this, we can create a SHAP attribution graph (SAG). In this graph, the nodes are the object (macro) labels and the parts, and a part is linked to a macro label if, according to the SHAP algorithm, it played a contributing role toward predicting this label.
Building the SAG is a two step process. First we extract the feature vector representing the attributes detected (float values). Thanks to the detection model we get the predicted label from it. Feature vectors are the output of the aggregation function that are fed to the classification module.
Using as hyperparameter a threshold on each feature value (the default detection threshold used in our case was 0.05 for both Faster R-CNN and RetinaNet, as it proved to work best for numerical stability), we identify which architectural elements we have truly detected in the image. Then, using the SHAP values computed for this feature vector, we create a SAG per image in the test set, and thus we link together feature values and predicted label probabilities inside the SAG. This way we use SHAP to analyse the output for all classes, not only the predicted one:
Having a positive SHAP value means the detected feature contributes to predicting this label, given a trained classifier and an image. We thus add to the SAG such edge representing a present feature contribution.
Having a negative SHAP value and a feature value below the threshold means that this element is considered typical of this label and its absence is detrimental to the prediction. As such, we can link the object label and the part label in the SAG, as a lacking feature contribution.
An example of a SAG for the architectural style classification problem is in Fig. 12, where M, R, G and B stand for Hispanic-Muslim, Renaissance, Gothic and Baroque, respectively. The pseudocode to generate the SAG can be found in Algorithm 1.
In practice, this allows us to have an empirical attribution graph, the SAG (built at inference time), and a theoretical attribution graph, the KG (representing prior knowledge). We can then compare both of them.
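The two edge rules above can be sketched as follows. This is one possible reading of the text and not a transcription of Algorithm 1: the function name `build_sag`, the edge tags "present" and "lacking", the requirement that a positive SHAP value be paired with a detected feature, and all numeric values in the example are assumptions made for illustration.

```python
def build_sag(features, shap_values, part_names, label_names, threshold=0.05):
    """Build the SAG edges for one image.

    features[l]       : aggregated detection score of part l
    shap_values[k][l] : SHAP value of part l for macro label k
    """
    edges = []
    for k, label in enumerate(label_names):
        for l, part in enumerate(part_names):
            detected = features[l] > threshold
            phi = shap_values[k][l]
            if detected and phi > 0:
                # A detected part that pushes toward this label.
                edges.append((part, label, "present"))
            elif not detected and phi < 0:
                # An absent part whose absence hurts this label.
                edges.append((part, label, "lacking"))
    return edges


# Made-up values for a two-part, two-label toy case.
parts = ["horseshoe arch", "pointed arch"]
labels = ["Hispanic-Muslim", "Gothic"]
features = [1.3, 0.0]
shap_values = [[0.4, 0.05], [-0.3, -0.2]]
sag = build_sag(features, shap_values, parts, labels)
```

In this toy case the horseshoe arch is linked to Hispanic-Muslim as a present contribution, and the pointed arch is linked to Gothic as a lacking contribution.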
4.3.3 SHAP-Backprop to penalize misalignment with an expert KG
In order to improve performance and interpretability of the model, we hypothesize that incorporating the SHAP analysis during the training process can be useful to fuse explainable information and improve interpretability.
The underlying idea is that SHAP helps us understand on a local level what features are contributing toward what class. In our case, SHAP links elements to a label by telling us if it contributed toward or against this label, and how strongly. Besides this analysis, we have the KG that is embedding basically the same information. We can see the KG as a set of expected attributions, i.e., if an element is linked to the object label in the KG, it should contribute to it, otherwise it should contribute against.
Given these two facts, we can compare real attribution via SHAP analysis, that gives us the empirical attribution graph, with the theoretical attribution found in the KG. If there is a discrepancy, then we want to penalize this feature, either in the classification process or in the detection process.
Misattribution, which happens when an attribution is unexpected or absent, can stem from various origins. One would be a recurrent misdetection inside the dataset. As such, penalizing misattribution at detection time could help us correct those. Penalizing the classification process could be considered as well, but has not been done here yet.
A schema of this approach is presented in Fig. 11. Yellow blocks are the input data, the green blocks are the ground truth elements, used for training of the original model, the blue blocks are trainable models, the red blocks are the output of the various algorithms and the gray blocks are the untrainable elements. The purple block is the proposed added procedure. Thin arrows link together the outputs of the trainable model along with the reference elements used for backpropagation. Thus, initially two thin arrows exist, one between the macro label and the output of the classification model, and one between elements (parts of objects) and the output of the detection module. We add a third arrow between the SHAP misattribution and the output of the detection model as the new SHAP-penalized loss is leveraging this misattribution.
This new loss requires an intertwined training of the classification model with the detection model in order to compute the SHAP analysis; however, the extra training time required is not a big issue in practice. Indeed, in the initial training protocol, one would fully train the detection model and then the classification model. Here, we have to train the classification model at each detection epoch. We expect the explainability metric, i.e., the SHAP GED between the KG and the SAG, to improve (that is, to decrease, since it is a distance) thanks to this SHAP signal backpropagation.
4.3.4 LShap: A new loss function based on SHAP values and a misattribution function
Let be the number of training examples and let be the training image examples.
Let be the detector function such as:
where are the bounding boxes detected by D, the confidence associated to each predicted box, and the predicted class of each box. The associated ground truth label is used for standard backpropagation, but we will not need it for the weighting process.
Faster R-CNN ren2015faster uses a two-term loss:
$$\mathcal{L} = \mathcal{L}_{RPN} + \mathcal{L}_{ROI}$$
where $\mathcal{L}_{ROI}$ is the loss corresponding to predicting the region of interest and is dependent on the class predicted for each BB. It is computed at the output level, whereas $\mathcal{L}_{RPN}$ is the loss function from the region proposal network, and it is computed at the anchor level (anchors are a set of predefined bounding boxes of a certain height and width, defined to capture the scale and aspect ratio of specific objects; the height and width of anchors are hyperparameters chosen when initializing the network). We use a weighted $\mathcal{L}_{ROI}$, since the SHAP information is computed at the output level and not the anchor level. We can write the loss as the sum of the losses for each image and, within an image, for each BB:
$$\mathcal{L}_{ROI} = \sum_{i} \sum_{j} \mathcal{L}_{ROI}^{(i,j)}$$
where $i$ is the index of the considered image and $j$ the index of a BB predicted within that image.
We now introduce the SHAP values, which are used as a constraining mechanism of our classifier model to be aligned with expert knowledge provided in the KG. SHAP values are computed after training the classification model.
Let with , with the number of different macro labels, and let be the SHAP values for each training example , where is the macro object label, and is the detected part. Each SHAP value is thus of size , with being the number of parts in the model. Furthermore, due to the nature of the output of the classification model, which are probabilities, and the way SHAP values are computed, they are bounded to be a real number in .
The KG was already modeled as an attribution graph and corresponding matrix (in order to compute the embedding out of the KG) in Section 4.1.1, and we will be using the same notation.
Introducing the misattribution function
To introduce SHAP-Backprop into our training, we first need to be able to compare the SHAP values with a ground truth, which here is represented by the expert KG. We thus introduce the misattribution function to assess the level of alignment of the feature attribution SHAP values with the expert KG.
The goal of the misattribution function is to quantitatively compare the SHAP values computed for the training examples with the KG. For that we assume the SHAP values are computed for all feature vectors. A misattribution value is then computed for each feature value of each feature vector. Before considering the definition of misattribution function, we can distinguish two cases when comparing these two elements, depending on the feature values observed:
A) The feature value considered is higher than a given hyperparameter , i.e., the positive case. symbolizes the value above which we consider a part is detected in our sample image. In our case .
B) The feature value is lower or equal to : in this case we assume there is no detected part, i.e., the negative case.
Case A: In the first case, given the KG, for a SHAP attribution to be coherent, it should have the same sign as the KG. If it is the case, the misattribution is 0, i.e., there is no correction to be made and backpropagate. Otherwise, if it has opposite sign, the misattribution will depend on the SHAP value. In particular, it will be proportional to the absolute value of the SHAP attribution. We thus propose the following misattribution function:
Where is the index of the considered image, is the index of a given object (macro) label and is the index of a given (object) part. This way correspond to the edge value between the macro label and the part , where the positive part of a real number. , and thus due to the bounding of the SHAP values. The detector output feature values are bounded due to the nature of the classification output which is in [0,1]. However, SHAP values are naturally bounded in [-1,1] lundberg2017unified.
Case B: Since we choose to be and has only real values, if , we therefore should not backpropagate any error through the loss function, since no BB is detected for the object part.
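The two cases can be summarised in a short function. This is a hedged sketch derived from the verbal description above (zero misattribution when the SHAP value and the KG entry share the same sign, the absolute SHAP value otherwise, and no penalty when the part is not detected). The function name `misattribution`, the parameter name `eps` for the detection threshold, and the +1/-1 encoding of KG entries are assumptions.

```python
def misattribution(feature_value, shap_value, kg_entry, eps):
    """Misattribution of one (image, macro label, part) triple.

    feature_value : aggregated detection score of the part
    shap_value    : SHAP value of the part for the macro label
    kg_entry      : +1 if the part is typical of the label in the KG, else -1
    eps           : threshold above which the part counts as detected
    """
    if feature_value <= eps:
        # Case B: no part detected, so nothing is backpropagated.
        return 0.0
    # Case A: positive part of -(KG entry * SHAP value); this is zero when
    # the signs agree and |shap_value| when they disagree.
    return max(0.0, -kg_entry * shap_value)
```

For instance, a detected part that the KG marks as typical of a style but whose SHAP value for that style is -0.3 receives a misattribution of 0.3, whereas the same part with a SHAP value of +0.3 receives 0.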
Given the prior information in the KG, the posterior information (SHAP values post-training), and a way to compare them (attribution function ), we suggest two new versions of a weighted loss, , that will substitute the former ROI loss.
- Bounding Box-level weighting of the loss
This first weighted loss is at the bounding box (BB) level, meaning each BB will be weighted individually based on its label and the associated SHAP value. We propose the following loss:
where is the number of BBs predicted in image , and the ground truth (GT) labels for instance images . We propose two possible loss weighting options, depending on , a balancing hyperparameter (equal to in our experiments), that can be linear:
with the index of the considered image, its associated class, the considered part class, the KG and the SHAP values. Either way, if is equal to when the misattribution is in order to maintain the value of the original loss function. Thus, : the larger the misattribution, the larger the penalization.
- Instance-level weighting of the loss
This second weighted loss is at the instance level, meaning we are weighting all the BBs for a given dataset instance with the same value:
i.e., the instance level weighting of the loss function considers the max BBox misattribution function value. Just as the BB level weighting, the aggregation of terms in the misattribution function can either be linear or exponential.
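The difference between the two weighting levels can be illustrated with the sketch below. The exact weighting formulas are not reproduced here: the linear form `1 + h * beta` and the exponential form `exp(h * beta)` are assumed forms, chosen only because they satisfy the property stated above that a zero misattribution leaves the original loss unchanged. The names `loss_weight` and `weighted_roi_loss`, and the example value of `h`, are hypothetical.

```python
import math


def loss_weight(beta, h, mode="linear"):
    """Weight applied to a loss term given a misattribution beta >= 0.
    Both forms equal 1 when beta == 0 (assumed functional forms)."""
    if mode == "linear":
        return 1.0 + h * beta
    if mode == "exponential":
        return math.exp(h * beta)
    raise ValueError("mode must be 'linear' or 'exponential'")


def weighted_roi_loss(bb_losses, bb_betas, h, mode="linear", level="bb"):
    """Sum per-box losses over all images with SHAP-based weights.

    bb_losses[i][j] : loss of box j in image i
    bb_betas[i][j]  : misattribution associated with box j in image i
    level           : 'bb' weights each box individually; 'instance'
                      weights all boxes of an image by its max misattribution
    """
    total = 0.0
    for losses, betas in zip(bb_losses, bb_betas):
        if level == "instance":
            w = loss_weight(max(betas, default=0.0), h, mode)
            total += sum(w * loss for loss in losses)
        else:
            total += sum(loss_weight(b, h, mode) * loss
                         for loss, b in zip(losses, betas))
    return total


# One image with two boxes; only the second box is misattributed.
losses = [[1.0, 2.0]]
betas = [[0.0, 0.5]]
bb_level = weighted_roi_loss(losses, betas, h=1.0, level="bb")
instance_level = weighted_roi_loss(losses, betas, h=1.0, level="instance")
```

In this example the instance-level loss is larger than the BB-level one, because the correctly attributed box is also penalized by the maximum misattribution found in the image.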
5 X-NeSyL methodology Evaluation: SHAP GED metric to report model explainability for end-user and domain expert audiences
The detection and classification modules of EXPLANet use mAP and accuracy, respectively, as standard evaluation metrics. In order to evaluate the explainability of the model in terms of alignment with the KG, we propose the use of the SHAP Graph Edit Distance (SHAP GED) at test time. This metric has a well-defined target audience: the end-user (in our case, of a citizen science application) and domain experts (art historians), i.e., users who do not necessarily have a technical background.
Even if the SAG above can be computed for any set of theoretical and empirical feature attribution sets, we are interested in using the GT KG in order to compute an explainability score on a test set.
The simplest way to compare two graphs is to apply the GED sanfeliu1983distance. Directly using the GED between a KG and the SAG does not work very well, since the number of object parts (architectural elements in our case) detected varies too much from one image to another. Instead, we compare the SAG to the projection of the KG given the nodes present in the SAG. More precisely, given a SAG, we compute a new graph from the KG by taking the subgraph of the KG that only contains the nodes in the SAG. As such, they will have the same nodes, but with the potential addition of new edges.
An example of such projection can be seen in Figure 12 (right). This way, the projection serves to only compute the relevant information given a specific image.
Once the SHAP-Backprop procedure penalizes the misalignment of object parts with those of the KG (detailed in next section), we will use the SAG to compute the SHAP GED between the SAG and its projection in the KG. This procedure basically translates into counting the number of "wrong" edges in the SAG given the reference KG, i.e., the object parts that should not be present in this data point, given the predicted object label.
After detailing all necessary components to run the full pipeline of the X-NeSyL methodology, together with an evaluation metric that facilitates the evaluation of the full process, we are in a position to set up an experimental study to validate each component. It is worth noting that each component can be adapted to each use case. The experiments in the next section will demonstrate, with a real-life dataset, how the X-NeSyL methodology can facilitate the learning of explainable features by fusing the information from deep and symbolic representations.
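A simplified version of this comparison can be sketched as follows. The sketch assumes that nodes are matched by name and that every edge insertion or deletion has unit cost, in which case the edit distance between the SAG and the projected KG reduces to counting the edges that appear in one graph but not in the other. The function names `project_kg` and `shap_ged` and the toy graphs are hypothetical; a general-purpose GED routine could be used instead.

```python
def project_kg(kg_edges, sag_edges):
    """Keep only the KG edges whose two endpoints are nodes of the SAG."""
    nodes = {node for edge in sag_edges for node in edge}
    return {(part, label) for (part, label) in kg_edges
            if part in nodes and label in nodes}


def shap_ged(kg_edges, sag_edges):
    """Number of edge edits between the SAG and the projected KG,
    assuming identical, name-matched node sets and unit edit costs."""
    projection = project_kg(kg_edges, sag_edges)
    return len(set(sag_edges) ^ projection)


# Toy graphs: edges go from a part to a macro label.
kg = {
    ("horseshoe arch", "Hispanic-Muslim"),
    ("lobed arch", "Hispanic-Muslim"),
    ("pointed arch", "Gothic"),
}
sag = {
    ("horseshoe arch", "Hispanic-Muslim"),
    ("pointed arch", "Hispanic-Muslim"),  # not supported by the KG
}
distance = shap_ged(kg, sag)
```

Here the only discrepancy is the edge from the pointed arch to Hispanic-Muslim, so the distance is 1; a SAG that matches the projected KG exactly would yield 0.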
6 MonuMAI Case Study: Classifying monument facades architectonic styles
In order to evaluate the X-NeSyL methodology and all its components, including the detection and classification modules within the proposed EXPLANet architecture as well as the XAI training procedure, we perform two main studies. In the first study, we evaluate the full SHAP-Backprop training mechanism by testing the detection module followed by the classification one. In other words, we test and assess the full EXPLANet architecture. In the second study, an ablation study to assess the influence of the detector's accuracy on the overall part-based classifier, we evaluate the detection module of the EXPLANet model with two different detection models, Faster R-CNN ren2015faster and RetinaNet lin2017focal.
In both evaluation studies, we used two datasets, MonuMAI and PASCAL-Part. For simplicity, we focus on MonuMAI dataset mainly in this section, while additional results for PASCAL-Part can be seen in the Appendix. For the remainder of the paper, we will use elements and object parts interchangeably, and macro labels will be used to refer to the classification labels, i.e., the style of the macro object.
6.1 Experimental setup
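To evaluate the classification performance, we use the standard accuracy metric (Eq. 11):

$$\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}} \qquad (11)$$

where $N_{\text{correct}}$ and $N_{\text{total}}$ represent the number of correct and total predictions, respectively. To evaluate the detection performance, we use the standard mean average precision (mAP) metric (Eq. 12):

$$\text{mAP} = \frac{1}{K} \sum_{i=1}^{K} AP_i \qquad (12)$$

where, given $K$ categories of elements, precision $p$ and recall $r$ define $AP_i$ as the area under the interpolated precision-recall curve for class $i$.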
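As an illustration of these two metrics, the following sketch computes accuracy and an all-point interpolated average precision in plain Python. It is not the evaluation code used in the paper: the function names are hypothetical, and the matching of detections to ground-truth boxes (which yields the true/false positive flags) is assumed to have been done beforehand.

```python
def accuracy(predicted, target):
    """Fraction of predictions that match the target labels."""
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(target)


def average_precision(scores, is_true_positive, n_ground_truth):
    """Area under the interpolated precision-recall curve for one class."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_ground_truth)
        precisions.append(tp / (tp + fp))
    # Interpolation: precision at each recall level becomes the maximum
    # precision reached at any higher recall level.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    area, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        area += (r - previous_recall) * p
        previous_recall = r
    return area


def mean_average_precision(ap_per_class):
    """Mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)


# Toy example: three detections of one class, two ground-truth boxes.
ap = average_precision([0.9, 0.8, 0.7], [True, False, True], n_ground_truth=2)
```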
We initialized Faster R-CNN ren2015faster and RetinaNet with weights pre-trained on MS-COCO lin2014microsoft, then fine-tuned both detection architectures on the target datasets, i.e., MonuMAI or PASCAL-Part. The last two layers of Faster R-CNN were fine-tuned on the target dataset. As optimization method, we used Stochastic Gradient Descent (SGD) with a learning rate of and a momentum of . We use the Faster R-CNN implementation provided by PyTorch.
For the classification module, we also fine-tuned the two-layer MLP with 11 intermediate neurons. We used the Adam kingma2014adam optimizer provided by Keras.
To perform an ablation study on the element or part-based detector, the original dataset is split into three categories (train, validation and test), following a 60/20/20 split. Reported results are computed on the test set.
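A 60/20/20 split of this kind can be sketched as below. The source does not state whether the split was random or stratified, so the seeded random shuffle, the function name and its parameters are assumptions made for illustration only.

```python
import random


def split_dataset(items, seed=0, train_pct=60, val_pct=20):
    """Shuffle the items and cut them into train, validation and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, here 20%
    return train, val, test


# Hypothetical image identifiers standing in for the dataset.
train, val, test = split_dataset(range(100))
```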
The compositional part-based object classification with RetinaNet is trained in two phases. First, the detection module is trained by fine-tuning a RetinaNet-50 pre-trained on MS COCO. We use the Adam optimizer with a starting learning rate (LR) of and a reduce-on-plateau LR scheduler with a patience of 4 (patience is the number of epochs the scheduler takes into account to decide that the network has converged; here, the last four). We train this way for 50 epochs. Then we freeze all the detection weights and train only the classification module. We use the Adam optimizer with a starting LR= and a reduce-on-plateau LR scheduler with a patience of 4. We train this way for 25 epochs.
Although our objective was fully end-to-end training, the need for quite different LRs in the detection and classification modules led us, for the moment, to train them separately for convenience.
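The reduce-on-plateau schedule mentioned above can be illustrated with a small framework-free sketch. The class name, the starting LR of 1.0 and the reduction factor of 0.5 are placeholders chosen for readability, not values from the paper; the rule of reducing only once the number of non-improving epochs exceeds the patience follows the common convention and is an assumption about the exact behaviour used.

```python
class ReduceOnPlateau:
    """Lower the learning rate when the monitored loss stops improving."""

    def __init__(self, lr, patience=4, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor          # hypothetical reduction factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current LR."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# The loss stalls at 0.9: the LR is kept during four stalled epochs
# and reduced on the fifth.
scheduler = ReduceOnPlateau(lr=1.0, patience=4, factor=0.5)
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
lrs = [scheduler.step(loss) for loss in losses]
```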
6.2 EXPLANet model analysis
To assess the advantages of EXPLANet, we consider two baselines: 1) MonuNet lamas2020monumai, the architecture proposed with the MonuMAI dataset, designed as a compressed architecture able to run on embedded devices such as smartphones (since the MonuMAIKET detector and the MonuNet classifier are not connected, MonuNet does not provide object detection); 2) a simple object classifier based on vanilla ResNet-101 he2016deep. MonuNet is a different classification architecture from ResNet: it uses residual and inception blocks, but its architecture is more static than that of EXPLANet, i.e., it does not allow modifications and is not meant to be scalable.
The results of EXPLANet classification model based on Faster R-CNN and RetinaNet detector backbones, together with these two baseline classification networks are shown in Table 4 for MonuMAI dataset. EXPLANet with Faster R-CNN outperforms the ResNet-101 baseline.
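| Model | mAP (detection) | Accuracy (classification) | SHAP GED (interpretability) |
|---|---|---|---|
| EXPLANet using Faster R-CNN backbone detector | | | |
| EXPLANet using RetinaNet backbone detector | 49.5 | 90.4 | 0.86 |
| ResNet-101 baseline classifier | N/A | | N/A |
| MonuNet baseline classifier | N/A | | N/A |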
MonuNet lamas2020monumai, the baseline provided by the authors of the MonuMAI dataset, is an architecture designed to be used on mobile devices in real time. Because of its compressed design targeting embedded systems, its performance is not fully comparable with that of EXPLANet. However, we report it for reference since, to the best of our knowledge, it is the only previous model trained on the novel MonuMAI dataset to date.
The results of the ablation study assessing the impact of the object detector on EXPLANet are in Table 4. We obtain essentially the same accuracy. Even though, taken independently, the RetinaNet model is slightly superior to Faster R-CNN, the worse default results of EXPLANet with RetinaNet compared to EXPLANet with Faster R-CNN seem to be due to 1) hyperparameter choice, since Faster R-CNN uses pretraining on MS-COCO while RetinaNet uses pretraining on ImageNet, and 2) the different nature of the coarse-grained MonuMAI dataset and the fine-grained PASCAL-Part dataset in terms of the overlap among part classes.
Because RetinaNet is architecturally simpler, it is faster to train than Faster R-CNN (we use the RetinaNet implementation from https://github.com/yhenon/pytorch-retinanet). Its ease of use stems from the fact that modifying the aggregation function, whether its analytical form or the point at the end of the detector at which the classifier is attached, would be much simpler.
The confusion matrices computed on MonuMAI for EXPLANet, using both Faster R-CNN and RetinaNet object detectors as backbones, are shown in Fig. 13.
Overall, both part-based models outperform the regular classification for MonuMAI, which indicates that the more accurate the classification model is, the more interpretable (lower SHAP GED) it becomes. Although one might intuitively expect a better detector (higher mAP) to yield a better GED, this is not necessarily the case, because the mAP evaluates both the spatial location and the presence or absence of a descriptor (object part), while the GED evaluates only the presence. Moreover, the absence of correlation between mAP and GED is reasonable, especially because mAP evaluates only one part of the model (detection); it thus makes more sense that accuracy correlates with SHAP GED, as our results show.
The object part detector module of the MonuNet baseline, i.e., the MonuMAIKET detector (based on a Faster R-CNN detector with ResNet-101 as backbone), reaches slightly higher performance. We attribute this minor difference to the different default implementations, in TensorFlow and PyTorch, of the ResNet module inside Faster R-CNN used in MonuNet and EXPLANet, respectively. Furthermore, EXPLANet with RetinaNet outperforms EXPLANet with Faster R-CNN interpretability-wise. This probably stems from the slightly different aggregation functions applied to the object parts.
In the Faster R-CNN version of EXPLANet, only the probability for the highest scoring label is kept, whereas in the case of using RetinaNet as part detector, the latter aggregates over all the scores for each example, here with the sum function. This way RetinaNet is probably more robust to low score features as it always observes several for each example.
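The difference between the two aggregation strategies can be sketched as follows. This is only an illustration of the idea stated above: each detected box is assumed to carry one score per part class, and both variants are assumed to accumulate boxes by summation into one image-level feature vector; the function names and the accumulation in the top-label variant are assumptions, not details given in the text.

```python
def aggregate_top_label(box_scores, n_parts):
    """Per box, keep only the score of the highest-scoring part label."""
    features = [0.0] * n_parts
    for scores in box_scores:
        best = max(range(n_parts), key=scores.__getitem__)
        features[best] += scores[best]
    return features


def aggregate_all_scores(box_scores, n_parts):
    """Per box, add the scores of every part label (sum aggregation)."""
    features = [0.0] * n_parts
    for scores in box_scores:
        for k in range(n_parts):
            features[k] += scores[k]
    return features


# Two hypothetical boxes scored over three part classes.
boxes = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]]
top_only = aggregate_top_label(boxes, n_parts=3)
summed = aggregate_all_scores(boxes, n_parts=3)
```

In the top-label variant the third part class receives no evidence at all, whereas the summed variant retains the low scores, which is consistent with the robustness argument made above.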
6.3 SHAP-Backprop training analysis
Having assessed the EXPLANet architecture as a whole and performed an ablation study on its dependency on the object detector, we now assess the different ways of weighting the penalization in the training procedure.
Table 5 displays the computed results of X-NeSyL methodology SHAP-Backprop method on MonuMAI lamas2020monumai dataset (see additional results for PASCAL-Part everingham2010pascal in Appendix). In bold are the best results for each metric, and in italics the second best. We tested on what we call Standard procedure, which is the typical pipeline of training first the detection module and then the classification module in two different steps, sequentially, without SHAP-Backprop nor any other interference.
The four other cases are methods computed with SHAP-Backprop. At each epoch in the detection phase, we train a classifier and use it to compute the SHAP values. These are then used to weight the detection loss with the misattribution function presented in the previous subsection.
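| Training procedure | mAP (detection) | Accuracy (classification) | SHAP GED (interpretability) |
|---|---|---|---|
| Standard procedure (baseline, no SHAP-Backprop) | 42.5 | | |
| Linear BBox level weighting | | | 0.69 |
| Exponential BBox level weighting | | 88.6 | |
| Linear Instance level weighting | | | 0.55 |
| Exponential Instance level weighting | 44.2 | 88.6 | |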
As Table 5 shows, applying SHAP-Backprop has little effect on accuracy and mAP. Nonetheless, all but the linear BBox-level weighting increase the classifier accuracy by around 1-2%. These instabilities could probably be stabilized with domain-specific fine-tuning. Furthermore, we have to take into account the stochastic nature of the training process, since the SHAP value computation is approximated using a random subset of reference examples to reduce computation time.
On the other hand, in terms of interpretability, we do obtain a noticeable improvement (reducing SHAP GED from 0.93 to 0.55) in the case of linear instance-level weighting. The gain obtained in both dimensions is large enough to conclude that the X-NeSyL methodology helped improve interpretability in terms of SHAP GED with respect to the expert KG.
6.4 Lessons learned
To create a NeSy deep learning model that is explainable, we chose the expert as the target audience of the model output explanation. We then proposed a training procedure and a metric to qualitatively assess domain-expert-based explanations. Although further explainability metrics beyond SHAP GED could be studied depending on the audience of the explanation, we showed that explainability, in terms of conceptual alignment with the domain expert KG, is increased.
The X-NeSyL methodology, with its pluggable components, is meant to be generic and versatile, i.e., a template architecture that can be adapted and customized to each use case: the symbolic component for knowledge representation, the neural component based on a compositional architecture such as EXPLANet, and the XAI-informed misattribution procedure to be applied during training. Furthermore, our experiments verified our hypotheses:
Overall, X-NeSyL methodology brings explainability in the fusion of deep learning representations with domain expert knowledge. X-NeSyL did have the expected effect on the MonuMAI dataset, and SHAP-Backprop improves the explainability metric (SHAP GED) on the model.
The intermediate learned representation of EXPLANet allows noisy information to be removed.
The more accurate EXPLANet model is, the more interpretable (lower SHAP GED) it becomes.
Even if some weighting schemes confirmed the interpretability-performance trade-off (some improve SHAP GED while worsening accuracy, and vice versa), the linear instance-level weighting scheme can improve both interpretability and performance.
As there is no consensus on how to measure explainability, especially because methods are achieving different goals, the lack of unification moved us to develop and contribute our own metric, SHAP GED, because as far as we know no other work explicitly incorporates the use of expert domain KGs in an image classification process to produce explanations for end-users and domain experts. We encourage researchers to put effort to further explore expert knowledge alignment models and develop richer metrics beyond this scope.
When it comes to the semantic modelling of the KG, we explored the possibility of using the OWL ontology format, but compared to standard ontologies such as ArCo carriero2019arco or the Google Knowledge Graph, our domain knowledge on architectonic styles is rather flat in terms of hierarchies of triples. We therefore did not need, in this case, the additional semantic modelling power of the Web Ontology Language (OWL) and limited ourselves to explaining partOf relationships. This allowed us to simplify the explanations and edge semantics. Future work should consider the more complex semantic constraints native to the OWL format.
Regarding the reproducibility or scalability of the X-NeSyL methodology, manually constructing KGs for a given dataset was not hard in our case, given the scale of the MonuMAI and PASCAL-Part datasets. For larger datasets where no domain expert KG is available, one weakness of the X-NeSyL methodology, and specifically of using KGs as the symbolic knowledge representation, is the need for a domain expert to design the KG. If experts are scarce, or data is dispersed and sparse, this may first require resorting to knowledge engineering tasks such as automatic knowledge base construction and datatype learning huitzil2018datil; diaz2017couch, relation learning maruhashi2018learning, link prediction getoor2005link; suchanek2019knowledge, concept induction sarker2019efficient or entity alignment Zhao20.
7 Conclusions and Future work
With the presented work we open up different research horizons and future avenues of work that we detail in this section.
We extensively considered what is one of the most crucial points to be addressed while developing XAI methods. Within the general needs for producing more trustworthy outputs, we tackled the challenge of fusion and alignment of deep learning representations with domain expert knowledge. To achieve this we proposed a new methodology, X-NeSyL, to fuse deep and symbolic representations thanks to an explainability feedback mechanism that facilitates the alignment of both deep and symbolic features. The part-based detection and classification, EXPLANet, and XAI-informed training procedure SHAP-Backprop leverage expert information in form of a knowledge graph. X-NeSyL could be seen as one way to attain explainable and theory-driven data science arrieta2020explainable.
We demonstrated the full pipeline of X-NeSyL methodology on MonuMAI and PASCAL-Part datasets, and the EXPLANet model with two variants of object detectors. The fusion of learned representations of different nature through the addition of an XAI technique component facilitates the model to learn with a human expert in the loop.
The X-NeSyL methodology was also validated through a contributed audience-specific explainability metric, SHAP GED, that quantifies the alignment of the neural model of the X-NeSyL methodology (EXPLANet) with the symbolic representation of the expert knowledge. All models, datasets, training pipeline and metric of the showcased X-NeSyL methodology are available online (github.com/JulesSanchez/architectural_style_classification). This approach targeted compositional object recognition based on explaining the whole through the object parts on deep architectures. However, other non-compositional semantic properties of description logics could be further modelled in order to assess, and further constrain, the level of alignment of a DL model with symbolic knowledge representing the expert.
Given the diverse contributions of this work, there is a broad set of options that can follow up to improve X-NeSyL methodology. In terms of evaluation of our work, the assessment was limited by the number of available datasets that contain part-based data, which is not large, since they must include a corresponding KG as well.
The explainability metric may be refined, since the proposed vanilla version of SHAP GED might not take into account all explainable factors an expert would like to see reflected in a black box model explanation. Future work includes assessing the SHAP GED metric itself, as the most suitable graph comparison metric, and including more elaborated datasets with finer grained object-part labels.
The ontology alignment with the deep model predictions can be refined in many ways. For instance, instead of using a simple KG, representing the expert knowledge in a rich ontology that incorporates extra axiomatic restrictions between elements, such as spatial relations or geometric constraints, could be useful to further improve SHAP-Backprop.
One way to improve the model along these lines may be inducing spatial structure in the embedding space (e.g. with approaches such as ConvE dettmers2018conve, which uses CNNs on embeddings for link prediction). Furthermore, the exploration of graph attention mechanisms sharifzadeh2020classification could be studied to learn locality relations in parallel with conceptual KG relations.
An actionable future work that could be very valuable for the XAI field is providing textual explanations of the output, since even limiting the model to describe the SAG could help build trust in the model output.
To conclude, we invite researchers and domain experts to be part of the XAI debate and contribute to democratize XAI, and to collaboratively design quantitative metrics and assessment methods aimed at developers, domain experts and end-users, as target audiences of DL model explanations.
This research was funded by the French ANRT (Association Nationale Recherche Technologie - ANRT) industrial Cifre PhD contract with SEGULA Technologies. The paper has been partially supported by the Andalusian Excellence project P18-FR-4961. S. Tabik was supported by the Ramon y Cajal Programme (RYC-2015-18136).
In this appendix we extend the results obtained with MonuMAI dataset to a second dataset, PASCAL-Part, to further validate our results. We detail such datasets and the results obtained in the following sections.
8.1 Additional results: PASCAL-Part Dataset
In order to validate results with more than one part-based dataset, we extended the experiments to an adapted version of PASCAL-Part [chen2014detect], which provides two levels of annotations: element annotations at the detection level (object parts), and image-level labels at the macro (whole) level.
PASCAL VOC 2010 dataset is a popular dataset for the task of object detection. It is organized into 20 object classes [everingham2010pascal]. PASCAL-Part dataset extends PASCAL VOC-2010 dataset with additional annotations by providing segmentation masks for each object part [chen2014detect].
In this work, we use a curated version of PASCAL-Part provided by [donadello2016integration] (available online at github.com/ivanDonadello/semantic-PASCAL-Part/). The idea is to reduce the number of elements of the original PASCAL-Part by collapsing categories such as "upper arm" and "lower arm" into a single category "arm".
We created a second curated version of the dataset because in PASCAL-Part several "macro" objects can be labelled within a single image, whereas, to ease evaluation, we want to consider only images with one image-level label; we therefore discarded all images containing more than one macro object. The remaining 1448 images include 20 macro categories and 44 different parts, whose distribution and some samples are shown in previous sections of the Appendix.
The classes and parts of the PASCAL-Part dataset are given in the following list. The first element represents each class, and it is followed by its corresponding part classes (in OWL, the latter would be placed in the range of the hasPart object property):
Bird: Torso, Tail, Neck, Eye, Leg, Beak, Animal Wing, Head
Aeroplane: Stern, Engine, Wheel, Artifact Wing, Body
Cat: Torso, Tail, Neck, Eye, Leg, Ear, Head
Dog: Torso, Muzzle, Nose, Tail, Neck, Eye, Leg, Ear, Head
Sheep: Torso, Tail, Muzzle, Neck, Eye, Horn, Leg, Ear, Head
Train: Locomotive, Coach, Headlight
Bicycle: Chain Wheel, Saddle, Wheel, Handlebar
Horse: Hoof, Torso, Muzzle, Tail, Neck, Eye, Leg, Ear, Head
Bottle: Cap, Body
Person: Ebrow, Foot, Arm, Torso, Nose, Hair, Hand, Neck, Eye, Leg, Ear, Head, Mouth
Car: License plate, Door, Wheel, Headlight, Bodywork, Mirror, Window
Pottedplant: Pot, Plant
Motorbike: Wheel, Headlight, Saddle, Handlebar
Cow: Torso, Muzzle, Tail, Horn, Eye, Neck, Leg, Ear, Head
Bus: License plate, Door, Wheel, Headlight, Bodywork, Mirror, Window
TvMonitor: Screen, TvMonitor
Since explicitly using the ontology built for MonuMAI yielded no significant advantage for object classification, we did not pursue this direction further for the PASCAL-Part dataset. A thorough effort to convert PASCAL-Part into an ontology could be made, but the variety of elements inside it could make it difficult to 1) group them into meaningful categories, and 2) extend the additional data and object properties of such a richer ontology to a KG that can be compared with an attribution graph. Since such an extension to an ontology can complicate the ontology-misattribution matching process, we leave it to future work.
PASCAL-Part’s Knowledge Graph
In analogy to the previous MonuMAI application, where architectural elements play the role of object parts and macro labels correspond to the architectural styles, we also used the KG provided by the PASCAL-Part dataset [donadello2016integration] (the curated PASCAL-Part dataset and KG are available at github.com/ivanDonadello/semantic-PASCAL-Part/; we do not provide a visualization of this KG as it would be unreadable).
8.2 Results for PASCAL-Part Dataset
Results for the PASCAL-Part dataset are compiled in Table 7. Applying SHAP-Backprop has almost no effect on accuracy and interpretability, but it had a detrimental effect on the detector mAP. This could be explained by this particular KG being very sparsely populated, and by the fact that object parts overlap heavily in the object labels they theoretically contribute to. For instance, consider the following concrete example: we have a person in an image, for which we detect the legs, and let us assume the background is such that sheep legs are detected. According to our expert KG data, detecting legs makes sense toward predicting a person, and thus the detection of sheep legs in the background would not be discouraged. Future work should consider the distinction between parts of objects that are syntactically equal (e.g., in image captioning tasks, the word leg) but semantically different (an animal leg vs. a human leg). In other words, the isPartOf relationship could be further specialized in our KG to a) have as range the class Leg, with subclasses AnimalLeg and HumanLeg (instead of just Leg, as the PASCAL-Part dataset has it now), or b) have isPartOfAnimal and isPartOfHuman as extra specialized object properties in our ontology (right now PASCAL-Part and MonuMAI have only one kind of object property, hasPart).
The overall lower score in mAP we obtain for PASCAL-Part stems from the fact that this dataset makes it harder for smaller objects to be detected, and the Faster R-CNN model we used was not fine-tuned to be fairly comparable in both settings.
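| Model | mAP (detection) | Accuracy (classification) | SHAP GED (interpretability) |
|---|---|---|---|
| EXPLANet using Faster R-CNN backbone detector | | | 0.45 |
| EXPLANet using RetinaNet backbone detector | 39.3 | | |
| ResNet-101 baseline classifier | N/A | 87.2 | N/A |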
When considering the different weighting schemes of SHAP-Backprop for PASCAL-Part, the vanilla ResNet classifier baseline performs better than EXPLANet, which means that the part-based classifier EXPLANet is underperforming in this case. This can be explained by several factors, but the predominant one is probably that images contain valuable information that the part-label model does not capture. This becomes quite clear when studying the PASCAL-Part KG, as we do in Section 19, since several labels are made of the same part names but represent distinct things, i.e., parts of different object provenance (e.g., the leg of a person and the leg of an animal; both car and bus have the same object parts). The part-based model thus has trouble differentiating such categories, whereas a purely image-based (rather than attribute-based) model would have no issue with them.
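The overlap argument can be checked directly on the class-to-parts list given earlier in this appendix. The sketch below copies the part sets of five classes from that list and reports the pairs of classes whose part sets are identical; the helper name is hypothetical and the check is only an illustration, not part of the original pipeline.

```python
from itertools import combinations

# Part sets copied from the PASCAL-Part class list above (subset of classes).
parts_by_class = {
    "Car": {"License plate", "Door", "Wheel", "Headlight", "Bodywork",
            "Mirror", "Window"},
    "Bus": {"License plate", "Door", "Wheel", "Headlight", "Bodywork",
            "Mirror", "Window"},
    "Sheep": {"Torso", "Tail", "Muzzle", "Neck", "Eye", "Horn", "Leg",
              "Ear", "Head"},
    "Cow": {"Torso", "Muzzle", "Tail", "Horn", "Eye", "Neck", "Leg",
            "Ear", "Head"},
    "Horse": {"Hoof", "Torso", "Muzzle", "Tail", "Neck", "Eye", "Leg",
              "Ear", "Head"},
}


def indistinguishable_pairs(parts):
    """Return the class pairs that share exactly the same set of part names."""
    return [(a, b) for a, b in combinations(sorted(parts), 2)
            if parts[a] == parts[b]]


pairs = indistinguishable_pairs(parts_by_class)
```

For these five classes, Bus and Car share one part set and Cow and Sheep share another, so a classifier that sees only part names cannot separate the classes within each pair.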
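| Training procedure | mAP (detection) | Accuracy (classification) | SHAP GED (interpretability) |
|---|---|---|---|
| Standard procedure (baseline, no SHAP-Backprop) | 36.5 | 82.4 | 0.45 |
| Linear BBox level weighting | | | |
| Exponential BBox level weighting | | | |
| Linear Instance level weighting | | | 0.42 |
| Exponential Instance level weighting | 34 | 82.7 | |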
While the linear weighting appears to have a more positive effect on improving explainability of the model, it may not be significant, given that it does not always improve interpretability when applied on the more specific PASCAL-Part dataset.
The discordance in performance (mAP and accuracy in the detection task) in Tables 5 and 7, with RetinaNet being superior to Faster R-CNN only on MonuMAI but not on PASCAL-Part, can be explained by the PASCAL-Part labelling procedure, which joins elements that are not unifiable, i.e., elements with the same identifier, such as leg or wheel, that belong to very different types of objects (sheep and cows have the same parts but different object labels). It is therefore worth highlighting the differences in the labelling process of the two datasets: PASCAL-Part, whose parts share names but differ greatly in semantics and visual appearance, was not designed for a neural network that takes only attributes as input to learn to classify objects based on their parts. Thus, as the PASCAL-Part KG lacks highly discriminative features, accuracy and SHAP GED are neither obviously nor directly connected, especially for RetinaNet where, owing to its aggregation function, all probabilities are used to perform a prediction (the aggregation function sums them), not just the highest one. As the micro and macro labels are not appropriate (unlike those designed for the MonuMAI dataset), the interpretability metric fails to reflect reality, independently of the quality of the detector.
In conclusion, the X-NeSyL methodology showed slightly different results on datasets designed for different purposes. This was mainly due to the lack of discriminative dataset labels for EXPLANet to leverage. In particular, the PASCAL-Part dataset, whose design does not allow a full evaluation of SHAP-Backprop's effect on interpretability, led to decreased mAP and accuracy compared to our baseline. This can be explained by its non-discriminative nature, which was not designed for part-based object detection as performed by the EXPLANet architecture.