## 1 Introduction

In the last few years, explaining the outputs returned by Artificial Intelligence (AI) algorithms has become increasingly important [RGPD, gunning2017explainable]. This echoes the dominance of deep neural networks, which reach very high performance in several visual recognition tasks but lack explainability [marcus, GARNELO201917]. Explaining decisions returned by intelligent systems is not only helpful for understanding their reasoning process, it is also essential for gaining acceptance and becoming trustworthy to humans [ribeiro2016should]. In human-centered fields like medical image analysis [LITJENS201760], decisions cannot be made by relying blindly on a model, since the consequences could be disastrous.

While several definitions of interpretability and explainability exist in the literature [miller, gilpin, Lipton, DoshiKim2017Interpretability], there is no consensus among them and the two notions are sometimes used interchangeably. Overall, it emerges that interpretability is the ability to present insight into how a system works in understandable terms, whereas explainability is the ability to describe how a system works in an accurate and logical way. In this paper, we focus on rendering the reasoning process of our model to explain its decisions.

To obtain explanations, a first family of methods consists in learning a local interpretable approximation model around the prediction returned by a black-box model [shap, ribeiro2016should]. Those approaches can deal with any model, so they are well suited to deep neural networks. However, although they aim at extracting the key characteristics that led to the output, they cannot exactly replicate the reasoning the black-box model performed. The second possibility is to use models that lend themselves to generating explanations, such as decision trees or decision rules, or to distil an unexplainable model into an explainable one [distillation]. Their main advantage is that the reasoning leading to a specific output is easy to track, so it can be used for generating an explanation. However, those models may not be as effective as black-box models, since explainability usually comes at a cost. Indeed, there is a well-known trade-off between accuracy and explainability [gunning2017explainable].

In this paper, we propose to rely on this second family of approaches and to counterbalance this trade-off while requiring very little labelled data, whose acquisition is costly. Our approach is based on two conclusions from human image interpretation studies: (1) the importance of contextual and spatial relations in object and scene recognition [biederman], and (2) the ability of humans to learn from few examples [thorpe1996speed, Li]. Several approaches focus on learning from few data [zeroShot, oneShot], but they need side information. We propose to mix statistical and symbolic learning to train a model that learns to manipulate spatial relations from few examples.

Our goal is to build a novel approach that can learn to reason and generate both annotations and explanations from just a few examples. In our experiments, the organs to annotate all have properties and are all linked by spatial relations. Thus, learning these relations and properties should help us to recognize them. Our approach relies on fuzzy relations that take into account both quantitative and qualitative information, which provides a linguistic and thus understandable description of each relation. Learning fuzzy relations has already been proposed in [donadello] and in [gonzalez2012efficient] to achieve higher classification performance, but not for explaining the reasoning as we propose. Given an unknown example, the system looks for the set of objects that best satisfies the relations between the objects of interest. We model this as a constraint satisfaction problem.

In Section 3, we describe the whole pipeline, which consists of three main steps: assessing relations, extracting the most relevant ones, and generating constraints for solving a constraint satisfaction problem and producing explanations. In Section 4, this approach is demonstrated on a task of multiple organ recognition in medical images. This task is a good example of spatial reasoning since the spatial arrangement of the organs plays an important role in their recognition. In addition, working on medical images presents several challenges, including a need for explainability and the fact that datasets are usually small. We tested our model, compared it to the state of the art, and showed that our approach is able to achieve high accuracy and generate explanations in spite of a small number of training examples.

## 2 Background

The approach we present in the next section relies on learning relevant fuzzy relations between objects for defining a constraint satisfaction problem. This section recalls the notions involved.

### 2.1 Fuzzy Logic

Fuzzy logic and fuzzy set theory [ZADEH1965338] can be seen as an extension of Boolean logic that makes it possible to handle imprecision. In a universe $U$, a fuzzy set $A$ is characterized by a mapping $\mu_A : U \rightarrow [0, 1]$. This mapping specifies to what extent each $x \in U$ belongs to $A$ and is called *the membership function of* $A$. If $A$ is a non-fuzzy set, $\mu_A(x)$ is either 0, i.e. $x$ is not a member of $A$, or 1, i.e. $x$ is a member of $A$. This range of degrees is useful for dealing with vagueness.

The fuzzy logic framework is also convenient for expressing relations between two sets. Given two universes $U$ and $V$, a binary fuzzy relation $R$ is characterized by a mapping $\mu_R : U \times V \rightarrow [0, 1]$. It assigns a degree of relationship to any pair $(x, y) \in U \times V$. $n$-ary fuzzy relations are defined identically. Another advantage is that fuzzy logic allows using words instead of mathematical symbols.
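As a minimal illustration of these definitions, the sketch below implements a membership function with values in [0, 1], a crisp set as the special case restricted to {0, 1}, and a binary fuzzy relation stored as a mapping from pairs to degrees. The function names, the breakpoints of `close_to` and the degrees in `relation` are hypothetical and only serve the example.

```python
def close_to(distance_mm: float, full: float = 10.0, zero: float = 50.0) -> float:
    """Membership degree of the fuzzy set 'close' for a distance.

    The breakpoints `full` and `zero` are illustrative assumptions:
    the degree is 1 up to `full`, 0 from `zero`, and linear in between.
    """
    if distance_mm <= full:
        return 1.0
    if distance_mm >= zero:
        return 0.0
    return (zero - distance_mm) / (zero - full)


def crisp_member(x, crisp_set) -> float:
    """A non-fuzzy set is the special case where the degree is 0 or 1."""
    return 1.0 if x in crisp_set else 0.0


# A binary fuzzy relation on U x V, stored as pair -> degree (made-up values).
relation = {("liver", "right kidney"): 0.9, ("liver", "spleen"): 0.2}


def relation_degree(rel, x, y) -> float:
    """Degree to which x is related to y; unlisted pairs have degree 0."""
    return rel.get((x, y), 0.0)


print(close_to(30.0))                                    # partial membership
print(crisp_member(3, {1, 2}))                           # crisp: 0 or 1 only
print(relation_degree(relation, "liver", "right kidney"))
```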

### 2.2 Fuzzy Constraint Satisfaction Problem

A constraint satisfaction problem (CSP) consists in assigning values to a set of variables so that a set of constraints is respected; scheduling problems [MINTON1992161] are a typical example. [Dubois1996] presents an extension of CSPs to the fuzzy logic framework to deal with imprecise parameters and flexible constraints. This is called a fuzzy constraint satisfaction problem (FCSP). An FCSP is defined by a set of variables $X$, a set of domains $D$ and a set of flexible constraints $C$. It is an appealing framework in the context of explainable annotation since it makes it possible both to solve the annotation task (getting each variable assignment) and to generate explanations using the constraints.

To solve an FCSP, the FAC-3 algorithm [Dubois1996, VANEGAS20161] is usually applied to prune the search space. Then, a backtracking algorithm explores every possible solution. Finally, we get the best solution by picking the one that is the most consistent with the set of constraints $C$.
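The search described above can be sketched as follows. This is a simplified illustration, not the FAC-3 algorithm: the pruning step is replaced by a plain bound check, and the consistency of an assignment is assumed to be the minimum of the satisfaction degrees of its constraints. The names `solve_fcsp`, `x1`, `x2` and the degree functions are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A flexible constraint: the variables it involves and a function returning
# a satisfaction degree in [0, 1] for the values given to those variables.
Constraint = Tuple[Tuple[str, ...], Callable[..., float]]


def solve_fcsp(variables: Sequence[str],
               domains: Dict[str, Sequence[str]],
               constraints: List[Constraint]) -> Tuple[Dict[str, str], float]:
    """Return the assignment with the highest consistency degree.

    An empty assignment is returned when no assignment has a positive degree.
    """
    best: Dict[str, str] = {}
    best_degree = 0.0

    def consistency(assignment: Dict[str, str]) -> float:
        # Minimum over the constraints whose variables are all assigned.
        degree = 1.0
        for scope, fn in constraints:
            if all(v in assignment for v in scope):
                degree = min(degree, fn(*(assignment[v] for v in scope)))
        return degree

    def backtrack(index: int, assignment: Dict[str, str]) -> None:
        nonlocal best, best_degree
        degree = consistency(assignment)
        if degree <= best_degree:
            return  # the minimum can only decrease: this branch cannot win
        if index == len(variables):
            best, best_degree = dict(assignment), degree
            return
        var = variables[index]
        for value in domains[var]:
            assignment[var] = value
            backtrack(index + 1, assignment)
            del assignment[var]

    backtrack(0, {})
    return best, best_degree


constraints = [
    (("x1",), lambda v: 0.9 if v == "a" else 0.3),
    (("x1", "x2"), lambda u, v: 0.0 if u == v else 0.8),
]
solution, degree = solve_fcsp(["x1", "x2"],
                              {"x1": ["a", "b"], "x2": ["a", "b"]},
                              constraints)
print(solution, degree)
```

The returned degree is the satisfaction of the least satisfied constraint, which is also the quantity a certainty factor can be derived from.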

## 3 Proposed Approach

In this section, we describe our new approach, which aims at annotating regions of interest in images and at providing an explanation for each annotation. It consists of three steps: the assessment of fuzzy relations from a given vocabulary between the organs we are looking for (Sec. 3.1), the learning of the most relevant relations between the organs (Sec. 3.2), and the solving of an FCSP to find the regions that are the most consistent with the relevant relations and to explain the reasoning behind the result (Sec. 3.3). An overview of the whole approach is illustrated in Figure 1.

### 3.1 Step 1: Assessing Relations

This step aims at evaluating several relations between the regions of interest (the organs) so that we can later (in the following step) find the most relevant of them.

Let us consider a training set $\mathcal{T}$ that contains $N$ images and a set of labels $\mathcal{Y}$ that contains $K$ labels, such that each image is divided into $K$ regions of interest that are mapped to labels by the following function:

$$f : \{r_1, \dots, r_K\} \rightarrow \mathcal{Y}, \quad r_k \mapsto f(r_k) \tag{1}$$

Let us consider a set $\mathcal{V}$ of relations. We call this set a vocabulary. It is set by an expert in the target task and it is composed of potentially relevant relations. For example, one relation can be a directional relation like *to the left of* or a distance relation like *close to*. The richer the vocabulary, the more expressive the system, which should help to produce better annotations and explanations. Relations in $\mathcal{V}$ are automatically evaluated on the regions of interest of each image in $\mathcal{T}$. The way they are computed depends on the definition of the relation, as shown in Sec. 4.2.2.

For any relation $R \in \mathcal{V}$, let $a_R$ denote its arity. $R$ is evaluated for each possible $a_R$-tuple of regions of interest. It is important to distinguish $R$ from its evaluations on the different regions. The number of evaluations to perform on each image is:

$$\sum_{R \in \mathcal{V}} \frac{K!}{(K - a_R)!} \tag{2}$$

At the end of this step, we have a set of evaluated relations between organs that can be seen as features.
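The assessment step can be sketched as a loop over the vocabulary and over the ordered tuples of distinct regions. In the sketch below, the regions, the crisp stand-in for *to the left of* and the constant `stretched` function are hypothetical placeholders for the fuzzy evaluations; only the enumeration itself is the point.

```python
from itertools import permutations


def evaluate_vocabulary(regions, vocabulary):
    """Evaluate every relation on every ordered tuple of distinct regions.

    regions:    dict mapping a region name to its data
    vocabulary: dict mapping a relation name to (arity, evaluation function)
    Returns a dict mapping (relation, region, ...) to a degree in [0, 1].
    """
    features = {}
    for name, (arity, fn) in vocabulary.items():
        for names in permutations(regions, arity):
            features[(name, *names)] = fn(*(regions[n] for n in names))
    return features


# Toy regions described only by the x coordinate of their centroid.
regions = {"liver": {"x": 40}, "spleen": {"x": 160}, "bladder": {"x": 100}}
vocabulary = {
    # Crisp stand-in for a fuzzy directional relation.
    "to the left of": (2, lambda a, b: 1.0 if a["x"] < b["x"] else 0.0),
    # Placeholder for a unary property.
    "stretched": (1, lambda a: 0.0),
}
features = evaluate_vocabulary(regions, vocabulary)
print(len(features))  # 3 * 2 ordered pairs + 3 single regions
```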

### 3.2 Step 2: Learning Relevant Fuzzy Relations

In this step, the objective is to extract the most relevant relations among those previously assessed. For a label $y$, our postulate is that the relevant relations involving the regions labelled as $y$ are the most frequent ones, since they should be verified by most, if not all, examples of these regions. Thus, learning the relevant relations is performed by mining the most frequent ones. It is done in a one-vs-all way since the relevant relations for one class of organs are not the same as for a different class. As the examples from one class should be correlated with one another, we use a fuzzy mining algorithm that takes advantage of this correlation [close].
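To make the idea of frequent sets of relations concrete, here is a brute-force sketch. It is not the closure-based algorithm of [close]: it enumerates all subsets, and it assumes that the frequency of a set of relations is the mean, over the examples of a class, of the minimum degree of the relations in the set. The relation names, degrees and threshold are made up.

```python
from itertools import combinations


def frequent_relation_sets(examples, threshold):
    """Return the sets of relations whose fuzzy frequency reaches `threshold`.

    examples: list of dicts, one per example of a class, mapping a relation
              name to its evaluated degree in [0, 1] (same keys everywhere).
    """
    names = sorted(examples[0])
    frequent = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            # Assumed frequency: average of the least satisfied relation.
            support = sum(min(ex[r] for r in subset) for ex in examples)
            support /= len(examples)
            if support >= threshold:
                frequent[subset] = support
    return frequent


# Two hypothetical examples of the same organ class.
examples = [
    {"left_of_spleen": 0.9, "above_bladder": 1.0, "close_to_spleen": 0.2},
    {"left_of_spleen": 1.0, "above_bladder": 0.8, "close_to_spleen": 0.4},
]
frequent = frequent_relation_sets(examples, threshold=0.8)
for subset, support in frequent.items():
    print(subset, round(support, 2))
```

With a threshold of 0.8, the weakly satisfied `close_to_spleen` relation is discarded together with every set that contains it; raising the threshold further would also remove the pair of remaining relations.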

Let $\mathcal{E}$ be the set of all the evaluations of relations from the vocabulary $\mathcal{V}$ on the labeled regions of interest. A subset of relations is an element of the power set $\mathcal{P}(\mathcal{E})$. The mining algorithm we use is based on a fuzzy closure operator that makes it possible to find all the closed sets of relations [close]. All the frequent closed sets of relations are computed and the frequent sets of relations can be derived from them. A set of relations is said to be frequent when its frequency in the dataset is larger than a given threshold. Since this step is performed in a one-vs-all way, each class has its own threshold, whose value is a hyperparameter determined during a validation phase. The value of this threshold has a direct impact on the number of frequent subsets of relations that are extracted. If it is too high, it is likely that few or no subsets of relations are considered frequent, which may not be enough for discriminating classes. This would be a case of underfitting. On the other hand, if the threshold is too low, some irrelevant features will be kept, which would lead to overfitting.

At the end of this step, for each label $y$, we have a set $\mathcal{F}_y$ of frequent subsets of evaluated relations such that $\mathcal{F}_y \subseteq \mathcal{P}(\mathcal{E})$.

### 3.3 Step 3: Solving the FCSP and Generating Explanations

Given a test example i, we can obtain a set of potential regions of interest by segmentation. The goal of this step is to find the labels of the regions that best satisfy the relations between organs that were learnt in the previous step. This can be modelled as an FCSP. In addition, since these relations are associated with a linguistic description, we can generate an explanation for each annotation.

For each label $y$, we obtained at the end of the previous step a set $\mathcal{F}_y$ of frequent subsets of relations. Let us define $\mathcal{F}_y^{\max}$ such that:

$$\mathcal{F}_y^{\max} = \{ S \in \mathcal{F}_y \mid |S| = \max_{S' \in \mathcal{F}_y} |S'| \} \tag{3}$$

This set corresponds to the set of the frequent subsets of relations of maximal size. Each evaluated relation in the subsets of relations in $\mathcal{F}_y^{\max}$ is directly translated into a constraint, which yields a set of constraints $C_y$. We can now build a model that is defined by the constraints that have been learned and by its frequency thresholds. No iterative optimization process is needed, which makes it well suited to small training sets.

The test example i is divided into $n$ regions of interest that we want to annotate. The FCSP we get is the following:

$$X = \{x_1, \dots, x_n\} \tag{4}$$

$$D = \{D_1, \dots, D_n\} \tag{5}$$

$$C = \bigcup_{y} C_y \tag{6}$$

Then, each constraint in $C$ is evaluated, the FCSP is solved and the first part of the output, the labels, is returned. We obtain a new mapping $g$ from the regions of the test example to the set of labels $\mathcal{Y}$ such that:

$$g : \{x_1, \dots, x_n\} \rightarrow \mathcal{Y}, \quad x_i \mapsto g(x_i) \tag{7}$$

Then, for each variable $x_i$, an explanation is generated using the constraints in $C$. This is possible because the relations (and so the constraints) that we use are associated with a linguistic description. For instance, a constraint represented as the tuple $(\text{left}, A, B)$ leads to: “$A$ is to the left of $B$”. Thus, using the constraints generated from $\mathcal{F}_y^{\max}$ makes it possible to express an explanation in the form of “*output* BECAUSE *cause*,…,*cause*”. For a given label $y$, all the constraints related to $y$ are extracted. The least satisfied constraint gives us a certainty factor to moderate the explanation [Budescu2012], e.g. “*This organ is likely to be annotated as the liver…*”. The constraints and the certainty factor are then sent to a surface realiser like simpleNLG [simplenlg] to aggregate them into a syntactically correct sentence.
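The explanation template can be sketched with plain string formatting. This sketch does not use a surface realiser, and the mapping from the certainty factor to a hedging phrase (thresholds 0.9 and 0.6) is an assumption made for the example; the function name `explain` and the satisfaction degrees are hypothetical.

```python
def explain(label, constraints):
    """Build an explanation of the form 'output BECAUSE cause, ..., cause'.

    constraints: list of (relation phrase, other organ, satisfaction degree)
                 for the constraints related to `label`.
    """
    # The least satisfied constraint moderates the whole explanation.
    certainty = min(degree for _, _, degree in constraints)
    if certainty >= 0.9:        # illustrative thresholds
        hedge = "is almost certainly"
    elif certainty >= 0.6:
        hedge = "is likely to be"
    else:
        hedge = "may be"
    causes = ", ".join(f"it is {phrase} the {other}"
                       for phrase, other, _ in constraints)
    return f"This organ {hedge} annotated as the {label} BECAUSE {causes}."


sentence = explain("liver", [("to the left of", "spleen", 0.95),
                             ("above", "urinary bladder", 0.7)])
print(sentence)
```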

## 4 Case study

In this section, we detail the experiments we have performed on a dataset of medical images. The task is to perform explained multiple organ annotation by learning a model from few data. While multiple organ detection has been regularly tackled in the literature [hooChangShin, criminisi, pauly, lee], multiple organ annotation has only been tackled in [xue2017automatic]. The principle of this method is to find images in the dataset that share visual characteristics with the image under study, and then to label it based on the labels of the visually similar images. However, it cannot provide any explanation. In [lee], abdominal organ detection is performed using fuzzy spatial rules, but these rules are not suited to other datasets and they have to be set by an expert before learning. Organ classification has been addressed in [roth2015anatomy] using data augmentation to circumvent the problem of having a small training set.

### 4.1 Dataset

It is important to note that the field of XAI currently lacks a dataset that mainly focuses on explanations. This is why we carried out our experiments on a segmentation dataset, which we used for assessing the accuracy of our model and the reliability of the explanations it produces. This dataset is named *Anatomy3* and has been presented in [jimenez2016cloud]. It contains 391 CT and MR images and their corresponding segmented organs. Images can be scans of the whole body (referred to as CTwb and MRwb) or enhanced images of the abdomen (referred to as CTce and MRce). Those are all 3D images, each of which is a stack of 2D slices. As we work on 2D images, we consider only slices in the following. We selected the slices to build a 2D image dataset. Figure 2 displays one example for each type of scan.

The set of organs (labels) we study is composed of the *liver*, the *spleen*, the *urinary bladder*, the *left* and *right kidneys*, the *left* and *right lungs* and the *left* and *right psoas major muscles*. We kept all the images that contain these 9 organs (and their corresponding segments), for a total of 35 examples and 315 segments in our dataset.

### 4.2 Experimental Settings

#### 4.2.1 Model Training

The model we build with our approach consists of the frequent subsets of relations that are extracted. There are as many hyperparameters as labels and they correspond to the thresholds used for assessing the frequency of a subset of relations. Model selection is necessary to get optimized thresholds, which is why we used *nested cross-validation* [cawley2010over]: (1) an outer cross-validation is performed in which we get a training set and a test set for each iteration, (2) an inner cross-validation is performed on the training set of the outer cross-validation to get an inner training set and a validation set for tuning hyperparameters. This yields an unbiased estimate of the prediction error.

In the inner cross-validation, hyperparameter tuning is performed using Bayesian optimization over 20 iterations with a Gaussian process prior. The acquisition function is the expected improvement.
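The structure of the nested cross-validation can be sketched with index splits only. The sketch below covers the splitting, not the Bayesian optimization, and it assumes contiguous, unshuffled folds. It uses the 35 examples of the dataset and the fold counts reported with the results (3 outer folds, 4 inner folds); the function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_cv(n_samples, outer_k, inner_k):
    """Yield (train, test, inner_splits) for each outer iteration.

    inner_splits is a list of (inner_train, validation) index lists built
    from the outer training set only, so the test set is never used for
    tuning the thresholds.
    """
    for test in k_fold_indices(n_samples, outer_k):
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        inner_splits = []
        for fold in k_fold_indices(len(train), inner_k):
            validation = [train[j] for j in fold]
            val_set = set(validation)
            inner_train = [i for i in train if i not in val_set]
            inner_splits.append((inner_train, validation))
        yield train, test, inner_splits


sizes = [(len(train), len(test)) for train, test, _ in nested_cv(35, 3, 4)]
print(sizes)
```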

#### 4.2.2 Relations

Many fuzzy spatial relations have been studied in the literature [bloch2005fuzzy]. In our experiments, we use directional, distance and symmetry relations. Directional and distance relations [bloch1999fuzzy, bloch1999fuzzy2] are computed as a fuzzy landscape and assessed using a fuzzy pattern matching approach [cayrol1982fuzzy]. As shown in Figure 3, the fuzzy landscape is generated by computing the fuzzy morphological dilation of a reference object by a structuring element whose shape determines the kind of relation. Let $\mathcal{S}$ be the space of the images. Let $A$ be a reference object in $\mathcal{S}$ and $\mu_{R(A)}$ the membership function associated with the fuzzy landscape representing the relation $R$ whose reference object is $A$. Let $\mu_B$ be the membership function corresponding to an object $B$ in $\mathcal{S}$. The relation between $A$ and $B$ is the result of the fuzzy degree of intersection between $\mu_{R(A)}$ and $\mu_B$, such that [bloch2005fuzzy]:

(8)

For instance, in Figure 3, the relation is *to the left of*, the reference object is the red organ and the object is the blue organ.

Figure 3: Evaluation of the relation *blue organ to the left of the red organ*. Given an input (a), a segmented organ is considered as the reference object (b). This organ is used to compute a fuzzy landscape (c) that represents the degree to which each pixel verifies the relation *to the left of the red organ*. Finally, the relation is assessed by evaluating the degree of intersection between this fuzzy landscape and the blue organ.

To get a finite catalogue of relations, we constrained the parameters of these relations to express only relations such as *above* or *close to*.
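The evaluation of a directional relation can be sketched on a small pixel grid. The landscape below uses an angle-based formulation, in which the degree of a pixel decreases linearly with the smallest angle between the target direction and the vectors joining the reference object to that pixel; this is one known way of defining such landscapes and is assumed here rather than taken from the text. The degree of intersection is replaced by a simple normalised overlap, which is also an assumption. All names and coordinates are hypothetical.

```python
import math


def directional_landscape(reference, shape, direction):
    """Fuzzy landscape of 'in direction `direction` of the reference object'.

    reference: set of (x, y) pixels of the reference object
    shape:     (width, height) of the image
    direction: unit vector, e.g. (-1, 0) for 'to the left of'
    """
    width, height = shape
    ux, uy = direction
    landscape = {}
    for x in range(width):
        for y in range(height):
            if (x, y) in reference:
                landscape[(x, y)] = 1.0
                continue
            smallest = math.pi
            for qx, qy in reference:
                dx, dy = x - qx, y - qy
                cos = (dx * ux + dy * uy) / math.hypot(dx, dy)
                smallest = min(smallest, math.acos(max(-1.0, min(1.0, cos))))
            landscape[(x, y)] = max(0.0, 1.0 - 2.0 * smallest / math.pi)
    return landscape


def overlap_degree(landscape, obj):
    """Stand-in for the degree of intersection: overlap normalised by the object.

    obj: dict mapping each pixel of the object to its membership degree.
    """
    overlap = sum(min(landscape[p], m) for p, m in obj.items())
    return overlap / sum(obj.values())


red = {(3, 1)}                                   # reference object
left_of_red = directional_landscape(red, (5, 3), (-1, 0))
blue = {(0, 1): 1.0, (1, 1): 1.0}                # object lying to the left
other = {(4, 1): 1.0}                            # object lying to the right
print(overlap_degree(left_of_red, blue), overlap_degree(left_of_red, other))
```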

The symmetry relation [colliot] we use consists in finding the line that maximizes a symmetry measure between two organs. Since this measure is not differentiable, this optimization problem is solved with a direct search method, such as the downhill simplex method.

We also use one property that can be seen as a unary relation since it characterizes just one organ. It evaluates how stretched an organ is. Given a segmented organ, a PCA is performed to get its two principal axes. Then, the organ is projected on both axes and the ratio of these projections is used to compute the degree corresponding to this property. However, this does not handle concave shapes well.
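A minimal sketch of this property for a 2D segment is given below. The principal axes are obtained from the closed-form orientation of a 2x2 covariance matrix. How the ratio of the projections is turned into a degree is not specified above, so the mapping used here (one minus the ratio of the shorter extent to the longer one) is an assumption, as is the function name.

```python
import math


def stretch_degree(pixels):
    """Degree in [0, 1] to which a segmented organ is stretched.

    pixels: list of (x, y) coordinates of the organ.
    """
    n = len(pixels)
    mx = sum(x for x, _ in pixels) / n
    my = sum(y for _, y in pixels) / n
    sxx = sum((x - mx) ** 2 for x, _ in pixels) / n
    syy = sum((y - my) ** 2 for _, y in pixels) / n
    sxy = sum((x - mx) * (y - my) for x, y in pixels) / n
    # Orientation of the first principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Projections of the centred pixels on the two principal axes.
    first = [(x - mx) * c + (y - my) * s for x, y in pixels]
    second = [-(x - mx) * s + (y - my) * c for x, y in pixels]
    extents = sorted([max(first) - min(first), max(second) - min(second)])
    if extents[1] == 0:
        return 0.0  # a single pixel has no elongation
    # Assumed mapping from the ratio of the extents to a degree.
    return 1.0 - extents[0] / extents[1]


bar = [(x, y) for x in range(10) for y in range(2)]     # elongated shape
square = [(x, y) for x in range(3) for y in range(3)]   # compact shape
print(stretch_degree(bar), stretch_degree(square))
```

Because only the extents of the projections are used, a concave shape such as a crescent can obtain the same degree as a filled shape with the same bounding extents, which illustrates the limitation mentioned above.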

Our vocabulary of relations contains: *to the left of*, *to the right of*, *below*, *above*, *close to*, *symmetrical to* and *stretched*. That makes six binary relations and one unary relation. As we consider 9 organs, the number of relation evaluations for one image is equal to 441, which contributes to making our model expressive. There is however a trade-off between the expressivity of the system and the computation time needed for assessing all these relations.
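The figure of 441 can be checked by counting, for each relation, the ordered tuples of distinct organs on which it is evaluated (an interpretation assumed here because it reproduces the reported number): each binary relation is evaluated on $9 \times 8$ ordered pairs and the unary property on 9 organs.

$$6 \times \frac{9!}{(9-2)!} + 1 \times \frac{9!}{(9-1)!} = 6 \times 72 + 9 = 441$$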

### 4.3 Problem initialization

As stated in Sec. 3, the whole process consists of three main steps. The inputs we deal with are the segments provided in the dataset. They are not fuzzy, but the process is exactly the same whether we deal with fuzzy or crisp objects.

The intermediate goal is to generate constraints for defining an FCSP. Once solved, the FCSP returns the labels, and the constraints are used for generating explanations.

The variables are the segments provided in the dataset. Each of them corresponds to an organ. We have the following FCSP:

$$X = \{x_1, \dots, x_n\} \tag{9}$$

$$D = \{D_1, \dots, D_n\} \tag{10}$$

where $n$ is equal to 9. For each organ $y$, the flexible constraints are generated from the set of the frequent subsets of relations of maximal size to build a set of constraints $C_y$. Furthermore, since every organ is unique, there cannot be identical annotations in this problem. That means the set of constraints $C$ has to be extended with constraints stating that two variables cannot take the same value, which is the *AllDifferent* global constraint.

The definition of the FCSP is thus made automatically. Then, once the FCSP is defined for a given example, it can be solved as described in Sec. 2.2.
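The effect of the *AllDifferent* constraint can be illustrated with a brute-force sketch that only considers assignments in which every segment receives a distinct label. It is an illustration under simplifying assumptions, not the solving procedure of Sec. 2.2: each segment has a single made-up satisfaction degree per label, and the quality of an assignment is the degree of its least satisfied segment. With 9 organs, this enumeration visits 9! = 362880 assignments.

```python
from itertools import permutations


def annotate_all_different(segments, labels, degree_of):
    """Give each segment a distinct label, maximising the weakest degree.

    degree_of(segment, label) returns a satisfaction degree in [0, 1].
    """
    best, best_degree = None, -1.0
    # Permutations of labels enforce AllDifferent by construction.
    for perm in permutations(labels, len(segments)):
        degree = min(degree_of(seg, lab) for seg, lab in zip(segments, perm))
        if degree > best_degree:
            best, best_degree = dict(zip(segments, perm)), degree
    return best, best_degree


# Hypothetical degrees: taken separately, both segments look like the liver.
scores = {("s1", "liver"): 0.9, ("s1", "spleen"): 0.8,
          ("s2", "liver"): 0.85, ("s2", "spleen"): 0.3}
annotation, degree = annotate_all_different(
    ["s1", "s2"], ["liver", "spleen"], lambda seg, lab: scores[(seg, lab)])
print(annotation, degree)
```

In this toy case, labelling each segment independently would annotate both as the liver; forbidding identical annotations yields the assignment in which `s1` is the spleen and `s2` is the liver.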

### 4.4 Results

Fig. 4 shows an example of output for an input image with 9 organs to annotate and thus 9 explanations to provide.

Table 1: Frequency threshold selected for each organ.

Organ | Value of the corresponding threshold |
---|---|
Liver | 0.96 |
Spleen | 0.86 |
Bladder | 0.80 |
Right kidney | 0.92 |
Left kidney | 0.89 |
Right lung | 0.98 |
Left lung | 0.97 |
Right psoas muscle | 0.92 |
Left psoas muscle | 0.88 |

We evaluate our model using the accuracy, which is the ratio for all organs of the number of correct annotations over the total number of annotations. We got an accuracy of for a model containing only directional relations. The outer cross-validation is actually a 3-fold cross-validation (23/24 training examples for 12/11 test examples in each iteration) and the inner one is a 4-fold cross-validation. As there are 9 organs to annotate, there are 9 hyperparameters that need to be set for extracting frequent relations (Table 1). Constraints could be added to the hyperparameter optimization process to make explanations longer or shorter.

We observe that the explanations correctly rely on the relations that have been extracted and later turned into constraints. For example in Fig. 4, the set of constraints associated with the right kidney is:

.
Some of these constraints may seem redundant, like the last two constraints in this set. That can happen because fuzzy morphological dilations depend on the shape of the reference object. As two different organs are never exactly the same, there are slight differences between those two constraints. Each organ is linked to such a set of constraints. The final set of constraints is the union of all these sets.

Assessing the quality of the explanations is tricky. What makes a good explanation ultimately depends on the knowledge and expectation of the end-user. Criteria like the coherence, the simplicity and the relevancy of the explanation are good indicators [miller, baaj], but they may not be easy to assess. Three evaluation methods are proposed in [DoshiKim2017Interpretability]: asking an expert, asking simple questions to a group of non-expert people or using a proxy model that has been proved to be explainable to assess the model under study.

We also investigated the number of training examples that our model requires to perform well. We get an accuracy of at worst for a couple of training images (so 33 test examples). Actually, when dealing with just one training example, since our model looks for frequent relations to set the constraints, it will extract the relations whose evaluation is larger than the thresholds discussed in Sec. 3.2. Any example that is not an outlier should then allow the model to perform well. Thus, we show that our approach can perform spatial reasoning and achieve high accuracy from just a pair of training examples.

We observe that our model outperforms the CNN classifier presented in [roth2015anatomy], which does not achieve perfect accuracy. That model was trained on a bigger training set and does not provide any kind of explanation. The closest method to ours, which was presented in [xue2017automatic], does not report any accuracy that could serve as a baseline. Its drawback is that it can miss labels, which happens at least once every five examples. In our approach, a label cannot be missing since every variable of the FCSP has to be associated with a domain.

On a side note, the generalisability of our approach depends on how well images are segmented (although fuzzy logic helps to deal with imprecision), how expressive the vocabulary is and how many outliers are in the dataset. Applications in which one of these conditions is not met may suffer a drop in performance regarding both annotations and explanations.

## 5 Conclusion and Prospects

In this article, we present a novel visual learning and reasoning framework whose goal is to explain and annotate relevant objects in images. The problem is formalized as a fuzzy constraint satisfaction problem. It is based on fuzzy spatial relations, which are learned on a set of annotated objects in images and then translated into constraints. We demonstrated our approach on a medical image dataset and showed that our method takes advantage of symbolic learning and reasoning, so that it explains its results and only needs a couple of training examples to achieve high accuracy.

In the future, we would like to work on a strategy that makes the first step of the process faster. A first idea is to determine a hierarchical structure of the spatial relations in order to apply a topological sort. Moreover, since fuzzy logic makes it possible to handle imprecise segments, the goal is to insert an unsupervised segmentation model before the model we presented here. This would make it possible to adapt to different kinds of images.

Finally, this is a first step in mixing statistical machine learning (especially deep learning) for perception with symbolic learning and reasoning for higher level intelligence in order to create an explainable artificial intelligence.