In Multiple Instance Learning (MIL), data is organised into bags of instances, and each bag is given a single label. This reduces the burden of labelling, as each instance does not have to be assigned a label, making it useful in applications where labelling is expensive, such as healthcare (Carbonneau et al., 2018). However, there are often only a few key instances within a bag that determine the bag label, so it is necessary to identify these key instances in order to interpret a model’s decision-making (Liu et al., 2012)
. Directly interpreting the decision-making process of machine learning methods is difficult due to the complexity of the models and the scale of the data on which they trained, so there is a need for methods that allow insights into the decision-making processes(Gilpin et al., 2018).
In MIL problems, only some of the instances in each bag will be discriminatory, i.e., a significant number of the instances in a bag will be non-discriminatory or even related to other bag classes (Amores, 2013). Identifying the important instances and presenting those as the interpretation of model decision-making filters out the non-discriminatory information and provides the explainee with a reduced and relevant interpretation, which will avoid overloading the user (Li et al., 2015). Our work provides the following novel contributions:
MIL interpretability definition We identify two questions that interpretability methods for MIL should be able to answer: 1) Which are the key instances for a bag? 2) What outcome (class/value) does each key instance support? In the rest of this work, we will refer to these two questions as the which and the what questions respectively. As we explore in Section 2, existing interpretability methods are able to answer which questions with varying degrees of accuracy, but can only answer what questions under certain assumptions.
MIL model-agnostic interpretability methods Existing interpretability methods are often model-specific, i.e., they can only be applied to certain types of MIL models. To this end, we build upon the state-of-the-art MIL inherent interpretability methods by developing model-agnostic interpretability methods that are up to 30% more more accurate. The model-agnostic interpretability methods that we propose can be applied to any MIL model, and are able to answer both which and what questions.
MIL interpretability method comparison We compare existing model-specific methods with our model-agnostic methods on several MIL datasets. Our experiments are also carried out on four types of MIL model, giving a comprehensive comparison of interpretability performance. To the best of the authors’ knowledge, this is one of the first studies to compare different interpretability methods for MIL.111Source code for this project is available at https://github.com/JAEarly/MILLI.
The remainder of this work is laid out as follows. Section 2 provides relevant background knowledge and reviews existing MIL interpretability methods. Section 3 outlines the requirements for MIL interpretability and details our approaches that meet these requirements. Next, Section 4 provides our results and experiments. Section 5 discusses our findings, and Section 6 concludes.
2 Background and related work
The standard MIL assumption (SMIL) is a binary problem with positive and negative bags (Dietterich et al., 1997; Maron and Lozano-Pérez, 1998). A bag is positive if any of its instances are positive, otherwise it is negative. The assumptions of SMIL can be relaxed to allow more generalised versions of MIL, e.g., through extensions that include additional positive classes (Scott et al., 2005; Weidmann et al., 2003). SMIL also assumes there is no interaction between instances. However, recent work has highlighted that modelling relationships between instances is beneficial for performance (Tu et al., 2019; Zhou et al., 2009). Existing methods for interpreting MIL models often rely on the SMIL assumption, so cannot generalise to other MIL problems. In addition, existing methods are often model-specific, i.e., they only work for certain types of MIL models, which constrains the choice of model (Ribeiro et al., 2016b). Identifying the key instances in MIL bags is a form of local interpretability (Molnar, 2020), as the key instances are detected for a particular input to the model. Two of the key motivators for interpreting the decision-making process of a MIL system are reliability and trust — identifying the key instances allows an evaluation of the reliability of the system, which increases trust as the decision-making process becomes more transparent. In this work, we use the term interpretability rather than explainability to convey that the analysis remains tied to the models, i.e., these methods do not provide non-technical explanations in human terms.
Under the SMIL assumption, which and what questions are equivalent — there are only two classes (positive and negative), and the only key instances are the instances that are positive, therefore once the key instances are identified, it is also known what outcome they support. However, when there are multiple positive classes, if instances from different classes co-occur in the same bags, answering which and what questions becomes two distinct problems. For example, some key instances will support one positive class, and some key instances will support another. Therefore, solely identifying which are the key instances does not answer the second question of what class they support. Existing methods, such as key instance detection (Liu et al., 2012), MIL attention (Ilse et al., 2018)
, and MIL graph neural networks (GNNs;Tu et al. (2019)) do not condition their output on a particular class, so it is not apparent what class each instance supports, i.e., they can only answer which questions. One existing method that can answer what questions is mi-Net (Wang et al., 2018), as it produces instance-level predictions as part of its processing. However, these instance-level predictions do not take account of interactions between the instances, so are often inaccurate. A related piece of work on MIL interpretability is Tibo et al. (2020), which considers interpretability within the scope of multi-multi-instance learning (MMIL; Tibo et al. (2017); Fuster et al. (2021)). In MMIL, the instances within a bag are arranged into into further bags, giving a hierarchical bags-of-bags structure. The interpretability techniques presented by Tibo et al. (2020) are model-specific as they are only designed for MMIL networks. In this work, we aim to overcome the limitations of existing methods by developing model-agnostic methods that can answer both which and what questions.
In single instance supervised learning, model-agnostic techniques have been developed to interpret models. Post-hoc local interpretability methods, such as Local Interpretable Model-agnostic Explanations (LIME;Ribeiro et al. (2016a)) and SHapley Additive exPlanations (SHAP; Lundberg and Lee (2017)), work by approximating the original predictive model with a locally faithful surrogate model that is inherently interpretable. The surrogate model learns from simplified inputs that represent perturbations of the original input that is being analysed. In this work, one of our proposed methods is a MIL-specific version of this approach.
At the start of this section, we outline the requirements for MIL interpretability (Section 3.1). In Section 3.2, we propose three methods that meet these requirements under the assumption that there are no interactions between instances (independent-instance methods). In Section 3.3 we remove this assumption and propose our local surrogate model-agnostic interpretability method for MIL.
3.1 MIL interpretability requirements
A general MIL classification problem has possible classes, with one class being negative and the rest being positive. This is a generalisation of the SMIL assumption, which is a special case when . An interpretability method that can only provide the general importance of an instance (without associating it with a particular class) can only answer which questions. In order to answer what questions, the method needs to state which classes each instance supports and refutes. Formally, for a bag of instances and a bag classification function , we want to assign a value to each instance that represents whether it supports or refutes a class :
where is the interpretability function and is the interpretability value for instance with respect to class . Here, we assume that means instance supports class , implies instance refutes class , and the greater , the greater the importance of instance . For some existing methods, such as attention, is the same for all classes, and in all cases, so these requirements are not met. We later demonstrate these limitations in Section 5. In the next two sections we propose several model-agnostic methods that satisfy these requirements.
3.2 Determining instance attributions
In this section, we propose three model-agnostic methods for interpreting MIL models under the assumption that the instances are independent. This means we can observe the effects of each instance in isolation without worrying about interactions between the instances. We exploit a property of MIL models: the ability to deal with different sized bags. As MIL models are able to process bags of different sizes, it is possible to remove instances from the bags and observe any changes in prediction, allowing us to understand what instances are responsible for the model’s prediction. Below, we propose three methods that use this property to interpret a model’s decision making.
Given a bag of instances and a bag classification function , we can take each instance in turn and form a single instance bag: for . We then observe the model’s prediction on each single instance bag , where is the output of for class (i.e., the
entry in the output vector of). A large value for suggests instance supports , and a value close to zero suggest it has no effect with respect to class . If we repeat this over all instances and all classes, we can build a picture of the classes that each instance supports, allowing us to answer what questions. However, this method cannot refute classes (i.e., in all cases). It should be noted that this method gives the same outputs as the inherent interpretability of mi-Net (Wang et al., 2018), but here it is a model-agnostic rather than a model-specific method.
A natural counterpart to the Single method is the One Removed method, where each instance is removed from the complete bag in turn, i.e., we form bags . For a particular class , we can then observe the change in the model’s prediction caused by removing from the bag: . If the prediction decreases, supports , and if it increases, refutes , i.e., this method is able to both support and refute different classes. However, if there are other instances in the bag that support or refute class , we may not observe a change in prediction when is removed, even if is a key instance.
In order to access the benefits of both the Single and One Removed methods, we can combine their outputs. A simple approach is to take the mean, i.e., . This method can identify the important instances revealed by the Single method, and also refute outcomes as revealed by the One Removed method.
With these three methods, it is assumed that there are no interactions between the instances. This is not true for all datasets, therefore, in the next section, we remove this assumption and propose a further method that is able to deal with the interactions between instances.
3.3 Dealing with instance interactions
In order to calculate instance attributions whilst accounting for interactions between instances, we have to consider the effect of each instance within the context of the bag. With instance interactions, the co-occurrence of two (or more) instances changes the bag label from what it would be if the two instances were observed independently. The instances have different meanings depending on the context of the bag, i.e., on their own they mean something different to what they mean when observed together. One way to uncover these instance interactions is to perturb the original input to the model and observe the outcome. In the case of MIL, these perturbations can take the form of removing instances from the original bag. By sampling coalitions of instances, and fitting a weighted linear regression model against the coalitions and their respective model predictions, it is possible to construct a surrogate locally faithful interpretable model that accounts for the instance interactions in the original bag. A coalition is a binary vectorthat represents a subset , so the number of ones in the coalition is equal to the length of the . The surrogate model takes the form
where each coefficient is the importance attribution for each instance with respect to class . Given a collection of coalitions
, minimising the loss function
means is a locally faithful approximation of the original model for class . The loss function is weighted by a kernel , which determines how important it is for to be faithful for each individual coalition. Here, only approximates for class , i.e., to produce interpretations for a bag with respect to all classes, a surrogate model needs to be fit for each class. Similar approaches have been applied in single instance supervised learning in methods such as LIME and KernelSHAP. Both use different choices for : LIME employs or cosine distance, and KernelSHAP uses a weighting scheme that approximates Shapley values (Shapley, 1953). For single instance supervised learning models, when perturbing the inputs, it is not possible to simply ‘remove’ a feature from an input as the models expect fixed-size inputs — either a new model has to be re-trained without that particular feature, or appropriate sampling from other data has to be undertaken, which can lead to unrealistic synthetic data. In MIL, as models are able to deal with different size bags, instances can be removed from the bags without the need for re-training or sampling from other data, meaning these issues do not occur in our setting.
While it is possible utilise the weight kernels from LIME and KernelSHAP for MIL, we identify a significant drawback with both methods. Their choice for weights all coalitions of the same size equally, i.e., they do not consider the content of the coalitions, only their size. Just because two coalitions are of the same size does not mean it is equally important that the surrogate model is faithful to both of them. Furthermore, the sampling strategies of both approaches lead to very large ( close to 1) or very small coalitions ( close to 0). Unless the number of samples is very large, samples of average size ( close to 0.5) will not be chosen. As we see in Section 4.4, this approach to sampling is appropriate for some datasets, but not for others.
We aim to overcome both of these drawbacks with our own MIL-specific choice of weight kernel and sampling approach. Below, we propose Multiple Instance Learning Local Interpretations (MILLI). At the core of our approach is a new method for weighting coalitions based on an initial ranking of instance importances . For a coalition and ranking of instance importances , we define our weight kernel as:
is a function that weights instances based on their order in the ranking. Its two hyperparameters,and define the shape of the function, and ultimately determine the kernel . The value of dictates whether the sampling should be biased towards instances that are highly ranked or not: means is biased towards instances higher in ordering, and means is biased towards values lower in the ordering. The value of is discussed below, as it becomes important when sampling coalitions. We provide an illustration of in Figure 1.
In , the coalition is weighted on its content rather than its length, i.e., the distance between a subset and a bag is no longer determined simply by the number of instances removed. This is beneficial, as removing many non-discriminatory instances from the bag likely means the label of the bag remains the same, despite being different in size. Conversely, removing one discriminatory instance could change the bag label, even though the bag is only one instance smaller. As well as using in the weight kernel, we also utilise it to sample coalitions. We can consider the problem of sampling as a repeated coin toss, where and . In equal random sampling, every value is equal to 0.5, meaning . However, it is possible to improve upon equal random sampling by changing the value of for each instance in bag, i.e., we can change the likelihood of each instance being involved in a coalition. To sample more informative coalitions, we set , meaning :
If , the sampling is biased towards smaller coalitions, and if , the sampling is biased towards larger coalitions. The maximum and minimum is controlled by . When or , every value is equal to 0.5, meaning we have equal random sampling as described above. We provide an illustration of how changes in Figure 2.
The final part of MILLI is how to determine the initial ranking of importances. For this, we can use the Single method from Section 3.2, which produces values , and we convert this to an instance ranking: the new values represent the position for instance in (e.g., the instance with the greatest value for has ). MILLI is more expensive to compute that the methods outlined in Section 3.2: compared with . However, when there are interactions between the instances, this extra complexity is required in order to build a better understanding of the classes that each instance supports and refutes, allowing more accurate answering of what questions. It is important to note that MILLI is indeed model agnostic — it only requires access to the bag classification function , and makes no assumptions about the underlying MIL model. As we demonstrate in the next section, this means we can apply it to any MIL model.
We apply our model-agnostic methods to seven MIL datasets. In this section, we detail the evaluation strategy (Section 4.1), the datasets (Section 4.2), models (Section 4.3), and results (Section 4.4). For implementation details, see Appendix A.1.
4.1 Evaluation strategy
For our evaluation, we do not assume that the interpretability methods have consistent output domains; we only assume a larger value implies larger support. Therefore, the best approach for evaluating the interpretability methods is to use ranking metrics — rather than looking at the absolute values produced by the methods, we compare the relative orderings they produce. As noted by Carbonneau et al. (2018)
, although a large number of benchmark MIL datasets exist, many do not have instance labels. Without instance labels, evaluating interpretability is challenging, as we do not have a ground truth instance ordering to compare to. We identify two appropriate evaluation metrics: for datasets with instance labels, we propose the use of normalised discounted cumulative gain at n (NDCG@n), and for datasets without instance labels, we propose the use of area under the perturbation curve compared with random orderings (AOPC-R;Samek et al. (2016)). While AOPC-R does not require instance labels, it is very expensive to compute. A significant difference between the two metrics is the way they weight the importance ordering: NDCG@n prioritises performance at the start of the ordering, whereas AOPC-R equally prioritises performance across the entire ordering. We are the first to propose both of these metrics for use in evaluating MIL interpretability, and further discuss the advantages of both metrics in Appendix A.3.
Below we detail the three main datasets on which we can evaluate our interpretability methods. These datasets were selected as they have instance labels, so we can evaluate them using both NDCG@n and AOPC-R. However, we provide further results on the classical MIL datasets Musk, Tiger, Elephant and Fox in Appendix A.4 (Dietterich et al., 1997; Andrews et al., 2002).
The Spatially Independent, Variable Area, and Lighting (SIVAL; Rahmani et al. (2005)) dataset consists of 25 classes of complex objects photographed in different environments, where each class contains 60 images. Each image has been segmented into approximately 30 segments, and each segment is represented by a 30-dimensional feature vector that encodes information such as the segment’s colour and texture. The segments are labelled as containing the object or containing background. We selected 12 of the 25 classes to be positive classes, and randomly sampled from the other 13 classes to form the negative class. For additional details see Appendix A.6.
The SIVAL dataset has no co-occurrence of instances from different classes, i.e., each bag only contains one class of positive instances. Therefore, the which and what
questions are the same. In order to explore what happens when there are different classes of positive instances in the same bag, we propose an extension of the MNIST-Bags dataset introduced byIlse et al. (2018). Our extension, four-class MNIST-Bags (4-MNIST-Bags) is setup as follows: class 1 if 8 in bag, class 2 if 9 in bag, class 3 in 8 and 9 in bag, and class 0 otherwise.222We use n to refer to an image from the MNIST dataset that represents the number , even though that is not the assigned class label in the 4-MNIST-Bags dataset.In this dataset, answering what questions goes beyond answering which questions, e.g., the existence of an 8 supports classes one and three, but refutes classes zero and two. For further details see Appendix A.7.
To test the applicability of the interpretability methods on larger bag sizes, we apply it to colorectal cancer tissue classification. The ColoRectal Cancer (CRC) dataset (Sirinukunwattana et al., 2016) is a collection of microscopy images with annotated nuclei. We follow the same setup as Ilse et al. (2018), in which a bag is positive if it contains one or more nuclei from the epithelial class. This means the problem conforms to the SMIL assumption, and there are no interactions between instances. Each microscopy image is 500 x 500 pixels, and was split into 27 x 27 pixel patches to give a maximum of 324 patches per slide (patches were discarded if they contained mostly slide background). Not all of the instances are labelled, so we tailor our assessment using NDCG@n to only consider labelled instances. For additional details see Appendix A.8.
4.3 Models and Methods
To highlight the model-agnostic abilities of the proposed methods, for each dataset, we trained four different types of multi-class MIL model: an embedding-based multiple instance neural network (MI-Net; Wang et al. (2018)), an instance-based multiple instance network (mi-Net; Wang et al. (2018)), an attention-based model (MI-Attn; Ilse et al. (2018)), and a graph neural network model (MI-GNN; Tu et al. (2019)). Aside from MI-Net, each of these models provide their own inherent interpretability method that we can compare our methods against. Additional details on the models are given in Appendix A.2. As well as evaluating our independent-instance methods and MILLI, we also compare against LIME and SHAP using both random and guided sampling (choosing coalitions that maximise the weight kernel). This means that in total we are evaluating nine different interpretability methods: inherent interpretability, the three independent-instance methods (Section 3.2), two LIME methods, two SHAP methods, and MILLI (Section 3.3). Note that, aside from the inherent interpretability methods, these methods are either novel (independent-instance methods and MILLI) or are applied to MIL for the time in this study (LIME and SHAP).
For each dataset, we measured the performance of each interpretability method on each of the four MIL models. For the SIVAL dataset we measured the performance on only the negative class and the bag’s true class, but for all the other datasets we evaluated the interpretability over every class. This distinction was made as we know that each SIVAL bag can only contain instances from one positive class, therefore it is not necessary to evaluate over all possible classes. We present the interpretability results run against the test set for the SIVAL, 4-MNIST-Bags, and CRC datasets in Tables 1, 2 and 3 respectively. The results are averaged over ten repeat trainings of each model, and we also give the test accuracy of each of the underlying MIL models. We discuss our choice of hyperparameters in Appendix A.5.
|Inherent||N/A||0.813 / 0.265||0.717 / 0.005||0.586 / 0.023||0.705 / 0.098|
|Single||0.825 / 0.280||0.813 / 0.266||0.801 / 0.302||0.734 / 0.194||0.793 / 0.261|
|One Removed||0.778 / 0.256||0.837 / 0.308||0.736 / 0.293||0.776 / 0.231||0.782 / 0.272|
|Combined||0.828 / 0.291||0.828 / 0.294||0.803 / 0.316||0.762 / 0.227||0.805 / 0.282|
|RandomSHAP||0.801 / 0.284||0.807 / 0.300||0.766 / 0.322||0.784 / 0.250||0.789 / 0.289|
|GuidedSHAP||0.826 / 0.291||0.819 / 0.290||0.790 / 0.313||0.765 / 0.235||0.800 / 0.282|
|RandomLIME||0.809 / 0.295||0.815 / 0.310||0.776 / 0.335||0.793 / 0.258||0.798 / 0.299|
|GuidedLIME||0.780 / 0.259||0.830 / 0.310||0.742 / 0.296||0.776 / 0.233||0.782 / 0.274|
|MILLI||0.823 / 0.283||0.827 / 0.307||0.794 / 0.307||0.790 / 0.239||0.808 / 0.284|
SIVAL interpretability NDCG@n / AOPC-R results. For all of the interpretability methods, the standard error of the mean was 0.01 or less.
|Inherent||N/A / N/A||0.723 / 0.136||0.750 / 0.002||0.419 / N/A||0.630 / 0.069|
|Single||0.722 / 0.138||0.723 / 0.137||0.778 / 0.164||0.761 / N/A||0.746 / 0.146|
|One Removed||0.811 / 0.187||0.810 / 0.186||0.809 / 0.143||0.786 / N/A||0.804 / 0.172|
|Combined||0.775 / 0.184||0.775 / 0.185||0.816 / 0.183||0.804 / N/A||0.792 / 0.184|
|RandomSHAP||0.813 / 0.186||0.809 / 0.185||0.825 / 0.178||0.828 / N/A||0.819 / 0.183|
|GuidedSHAP||0.773 / 0.187||0.773 / 0.188||0.816 / 0.183||0.805 / N/A||0.792 / 0.186|
|RandomLIME||0.828 / 0.189||0.825 / 0.189||0.841 / 0.179||0.838 / N/A||0.833 / 0.186|
|GuidedLIME||0.760 / 0.189||0.756 / 0.187||0.785 / 0.145||0.776 / N/A||0.769 / 0.174|
|MILLI||0.947 / 0.190||0.943 / 0.189||0.917 / 0.181||0.959 / N/A||0.942 / 0.186|
By analysing the NDCG@n interpretability results across these three datasets, we find that MILLI performs best with an average of 0.85, followed by the GuidedSHAP and Combined methods (both with an average of 0.81). For the average AOPC-R results (including the results on the classical MIL datasets, see Appendix A.4), we find that Combined, GuidedSHAP, RandomLIME, and MILLI are the best performing methods, however the overall the difference in performance between the methods is much less than what we observe for the NDCG@n metric. For the 4-MNIST-Bags dataset, the difference in performance of MILLI on NDCG@n vs AOPC-R is due to the difference in weighting between the metrics: MILLI achieves a better ordering for the most important instances (outperforms other methods on NDCG@n), but gives the same ordering as other methods for less important instances (equal performance on AOPC-R). In the majority of cases, all of our proposed model-agnostic methods outperform the inherent interpretability methods, and are relatively consistent in performance across all models. For the SIVAL dataset, the independent-instance methods perform well, which is expected as the instances are independent. However, on the 4-MNIST-Bags dataset, MILLI excels as it samples informative coalitions that capture the instance interactions. On the CRC dataset, methods that are able to isolate individual instances, (i.e., the Single, Combined, and GuidedSHAP methods) perform well due to instance independence in this dataset. Furthermore, if we consider the witness rate (WR; the proportion of key instances in each bag; Carbonneau et al. (2018)) of the datasets, we find that the CRC dataset has a higher WR (27.47%) than SIVAL (15.28%) and 4-MNIST-Bags (8.04%). This means, with larger coalitions, it becomes more difficult to isolate the contributions of individual instances, which is why the One Removed and Random sampling methods struggle.
To demonstrate how MILLI captures instance interactions, and to show the limitations of existing methods that cannot condition their output for a particular class, we compare the MILLI interpretations with the attention interpretations on the 4-MNIST-Bags dataset (Figure 3). As this bag contains an 8 and a 9, the correct label for it is class three. Both the MILLI and attention methods have identified the 8 and 9 instances as key instances, i.e., they have both answered the which question. However, only MILLI correctly identifies that the 8 refutes class two and that the 9 refutes class one, answering the what question, something that the attention values do not do. We provide further examples, including interpretability outputs for the SIVAL and CRC datasets, in Appendix A.10.
LIME, SHAP, and MILLI all require the number of sampled coalitions to be selected. A greater number of coalitions means there is more data to fit the surrogate model on, therefore it is more likely to faithful to the underlying model. However, the more samples that are taken, the more expensive the computation. As shown in Figure 4, MILLI is the most consistent in terms of performance and efficiency. For all methods, it is a case of diminishing returns, where additional samples only lead to a small increase in performance.
The different sampling approaches used in this work have distinct advantages. When there are interactions between instances, RandomSHAP and RandomLIME are better than GuidedSHAP and GuidedLIME as they form larger coalitions that are more likely to capture the instance interactions. However, for the CRC dataset, where there are a large number of independent instances and a high WR, RandomSHAP, RandomLIME and GuidedLIME struggle as they cannot form small coalitions. MILLI performs relatively well across all datasets as the size of its sampled coalitions can be adapted depending on the dataset, demonstrating its generalisability. However, it is still outperformed by GuidedSHAP on the CRC dataset. One possible explanation for this is that MILLI is limited to sampling only smaller or larger coalitions, whereas GuidedSHAP samples both large and small coalitions. One approach for improving MILLLI would be to incorporate paired sampling (Covert and Lee, 2020), where for every sampled coalition , we also sample its complement coalition . Following the advice of Carbonneau et al. (2018)
, further MIL studies on more complex datasets, such as Pascal VOC(Everingham et al., 2010), could also be insightful for evaluating MIL interpretability methods. It would also be beneficial to examine if these techniques are applicable to MIL domains beyond classification, e.g. MIL regression as in Wang et al. (2020).
In this work, we have discussed the process of model-agnostic interpretability for MIL. Along with defining the requirements for MIL interpretability, we have presented our own approaches and compared them to existing inherently interpretable MIL models. By analysing the methods across several datasets, we have shown that independent-instance methods can be effective, but local surrogate methods are required when there are interactions between the instances. All of our proposed methods are more effective than existing inherently interpretable models, and are able to not only identify which are the key instances, but also say what classes they support and refute.
For our work, the main details for the methodology are detailed in Section 3. In addition, we provide details that will aid reproducibility in the Appendix. The dataset sources and a general overview of our implementation is given in Appendix A.1. Further information on the MIL models used in this work can be found in Appendix A.2, and specific details on their architectures and training can be found in Appendices A.6, A.7, and A.8 for the SIVAL, 4-MNIST-Bags, and CRC datasets respectively. The codebase for this work can be found on GitHub: https://github.com/JAEarly/MILLI.
The authors acknowledge the use of the IRIDIS High Performance Computing Facility and associated support services at the University of Southampton in the completion of this work. IRIDIS-5 GPU-enabled compute nodes were used for the long running experiments in this work.
We also acknowledge the use of SciencePlots (Garrett, 2021) for formatting our Matplotlib figures.
This work was funded by AXA Research Fund and the UKRI Trustworthy Autonomous Systems Hub (EP/V00784X/1). We would also like to thank the University of Southampton and the Alan Turing Institute for their support.
- Multiple instance classification: review, taxonomy and comparative study. Artificial intelligence 201, pp. 81–105. Cited by: §1.
- Support vector machines for multiple-instance learning.. In NIPS, Vol. 2, pp. 7. Cited by: §4.2.
- Multiple instance learning: a survey of problem characteristics and applications. Pattern Recognition 77, pp. 329–353. Cited by: §1, §4.1, §4.4, §5.
Improving kernelshap: practical shapley value estimation via linear regression. arXiv preprint arXiv:2012.01536. Cited by: §5.
- Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence 89 (1-2), pp. 31–71. Cited by: §2, §4.2.
The pascal visual object classes (voc) challenge.
International journal of computer vision88 (2), pp. 303–338. Cited by: §5.
- Nested multiple instance learning with attention mechanisms. arXiv preprint arXiv:2111.00947. Cited by: §2.
- garrettj403/SciencePlots. External Links: Cited by: §6.
Explaining explanations: an overview of interpretability of machine learning.
2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), pp. 80–89. Cited by: §1.
- Attention-based deep multiple instance learning. In International conference on machine learning, pp. 2127–2136. Cited by: §A.2, §A.8, §2, §4.2, §4.2, §4.3.
- Communication in human-agent teams for tasks with joint action. In International Workshop on Coordination, Organizations, Institutions, and Norms in Agent Systems, pp. 224–241. Cited by: §1.
- Key instance detection in multi-instance learning. In Asian Conference on Machine Learning, pp. 253–268. Cited by: §1, §2.
- A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874. Cited by: §2.
- A framework for multiple-instance learning. Advances in neural information processing systems, pp. 570–576. Cited by: §2.
- Interpretable machine learning. Lulu. com. Cited by: §2.
Localized content based image retrieval. In Proceedings of the 7th ACM SIGMM international workshop on Multimedia information retrieval, pp. 227–236. Cited by: §4.2.
- ”Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §2.
- Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386. Cited by: §2.
- Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems 28 (11), pp. 2660–2673. Cited by: §4.1.
- On generalized multiple-instance learning. International Journal of Computational Intelligence and Applications 5 (01), pp. 21–35. Cited by: §2.
- A value for n-person games. Princeton University Press. Cited by: §3.3.
Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE transactions on medical imaging 35 (5), pp. 1196–1206. Cited by: §4.2.
- A network architecture for multi-multi-instance learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 737–752. Cited by: §2.
- Learning and interpreting multi-multi-instance learning networks.. J. Mach. Learn. Res. 21, pp. 193–1. Cited by: §2.
- Multiple instance learning with graph neural networks. arXiv preprint arXiv:1906.04881. Cited by: §A.2, §2, §2, §4.3.
- In defense of lstms for addressing multiple instance learning problems. In Proceedings of the Asian Conference on Computer Vision, Cited by: §5.
- Revisiting multiple instance neural networks. Pattern Recognition 74, pp. 15–24. Cited by: §2, §3.2, §4.3.
- A two-level learning method for generalized multi-instance problems. In European Conference on Machine Learning, pp. 468–479. Cited by: §2.
- Multi-instance learning by treating instances as non-iid samples. In Proceedings of the 26th annual international conference on machine learning, pp. 1249–1256. Cited by: §2.
Appendix A Appendix
a.1 Implementation details
All code for this work was implemented in Python 3.8, using the PyTorch library for the machine learning functionality. Hyperparameter tuning was carried out using the Optuna libary. Some local experiments were carried out on a Dell XPS Windows laptop, utilising a GeForce GTX 1650 graphics card with 4GB of VRAM. GPU support for machine learning was enabled through CUDA v11.0. Other longer running experiments, such as hyperparameter tuning, were carried out on a remote GPU node utilising a Volta V100 Enterprise Compute GPU with 16GB of VRAM. The longest model to train was the GNN for the CRC dataset, which took just under half an hour. The longest running experiment was the hyperparameter analysis for the CRC dataset at just under 40 hours. This was due to high number of samples for the local surrogate methods and the fact that the results were average over 10 different models. The randomisation of data splits was fixed using seeding; the seed values are provided in the code for each experiment. We use the following sources for our data:
The annotated SIVAL dataset was downloaded from the publicly accessible page:
The MNIST dataset was access directly from the PyTorch Python library:
The CRC dataset was downloaded from the publicly accessible page:
The Musk dataset was downloaded from the publicly accessible page:
The Tiger, Elephant and Fox datasets were downloaded from the publicly accessible page:
In this work, we trained four different models for each dataset. In this section, we provide further details on each of the models. Each model was tuned independently for each dataset, so in the later sections we provide details of the specific architectures for each dataset. We also tuned the learning rate, weight decay, and dropout for each of the models independently for every dataset. Again, these are given in the later sections for each specific dataset.
Each instance is embedded to a fixed size, and then these embeddings are aggregated to give a single bag embedding. This aggregation can either take the mean (mil-mean) or max (mil-max) of the instance embeddings. The bag embedding is then classified to give an overall bag prediction. For this model, we tuned the number and size of fully connected (FC) layers, as well as the choice of aggregation function.
A prediction is made for each individual instance, and then these are aggregated to a single outcome for the bag as a whole. Similar to the MI-Net model, we tuned the FC layers and the aggregation function.
Each instance is embedded to a fixed size, and then the mil-attn block produces an attention value for each instance embedding. The instance embeddings are then aggregated to a single bag embedding by performing a weighted sum based on the attention values. The bag embedding is then classified to give an overall bag prediction. The mil-attn block is the same as per the MIL attention mechanism proposed by Ilse et al. (2018), i.e., using a single hidden layer. We tuned the the number and size of the FC layers, as well as the size of the hidden attention layer.
The GNN model treats the bag as a fully connected graph and uses graph convolutions to propagate information between instances. Initially, the original instances are embedded to a fixed size. These instance embeddings are then passed onto a GNNembed block and a GNNcluster block, the output of which is used to reduce the graph representation down into a single embedding using differentiable pooling (gnn-pool). The final bag representation is then classified to give an overall bag prediction. For more details see Tu et al. (2019). We tuned the number and size of the embedding, GNN, and classifier layers.
a.3 Evaluation metrics
In this section, we provide further details on how we use normalised discounted cumulative gain at n (NDCG@n) to evaluate the interpretability methods for datasets with instance labels, and how we use area under the perturbation curve with random orderings (AOPC-R) to evaluate the interpretability methods for datasets without instance labels.
To use NDCG@n, we first need a ground truth ordering to compare to. For a particular class, we can place each instance into one of three groups based on their ground truth instance labels: supporting instances, neutral instances, and refuting instances. The ideal instance ordering for that class would then have all of the supporting instances at the beginning, followed by all the neutral instances, and then all of the refuting instances at the end. Then, given interpretability outputs for a particular class, we rank the outputs from highest to lowest, i.e., in order of how much they support that class. This importance ordering is then compared to the ground truth ordering using the follow metric:
IDCG is the ideal discounted cumulative gain that normalises the scores across different values of . The relevance function is as follows:
Although originally designed for single instance supervised learning, here we adapt AOPC-R for MIL. Given an importance ordering, AOPC successively removes the most relevant instances (i.e., those at the start of the given importance ordering), and measures the change in prediction. A better ordering will show a more rapid decrease in prediction, as instances that are more supportive will be removed first. This rate of decrease in prediction with respect to class for a classifier and bag is measured as follows: given the ordered bag (which is just ordered by instance importance, with the most important instances at the start),
To normalise the results, one approach is to measure the average difference in AOPC for the given ordering to the AOPC of several random orderings:
where is the number of random orderings, and is the random ordering of . The repeated calls to make AOPC-R very expensive to compute. This can be reduced by examining the first perturbations rather than all possible perturbations, but that was not something we investigated in this work (we kept and
). Furthermore, due to the use of random orderings, this evaluation metric is inherently stochastic, meaning not only do we have variance in the orderings produced in the methods (e.g., from random sampling), we also have variance due to measurement. NDCG@n does not have either of the issues (i.e., its cheap to compute and deterministic), however it requires (at least some) instance labels.
a.4 Additional results
In this section, we detail our additional results on the Musk and Tiger, Elephant & Fox (TEF) classical MIL datasets. Although these datasets do not have instance labels, since they are widely used MIL datasets, it is useful to see how our interpretability methods perform on them. Each of these datasets are instance independent, and as they have a low number of instances per bag, we adapted our local surrogate methods to allow sampling with repeats (otherwise we cannot sample enough coalitions). In our previous experiments, we restricted the local surrogate methods to sampling without repeats, which was not an issue with the larger bag sizes in the SIVAL, 4-MNIST-Bags, and CRC datasets. For each of these classic datasets, we used the same model hyperparameters that we used for SIVAL (see Appendix A.6), i.e., we didn’t retune the training parameters or model architectures. However, we did tune the parameters for the interpretability methods, which we discuss in Appendix A.5. We give the interpretability results as well as the model performance for Musk (Table A1), Tiger (Table A2), Elephant (Table A3), and Fox (Table A4) below.
|Model Acc||0.871 0.024||0.836 0.027||0.793 0.029||0.793 0.028||0.823 0.016|
|Inherent||N/A||0.108 0.010||0.002 0.015||0.003 0.006||0.036 0.011|
|Single||0.123 0.010||0.108 0.010||0.113 0.010||0.090 0.010||0.108 0.010|
|One Removed||0.108 0.009||0.107 0.010||0.088 0.008||0.076 0.009||0.095 0.009|
|Combined||0.124 0.010||0.107 0.010||0.116 0.010||0.088 0.010||0.109 0.010|
|RandomSHAP||0.115 0.009||0.104 0.010||0.110 0.009||0.077 0.009||0.102 0.009|
|GuidedSHAP||0.122 0.009||0.106 0.010||0.116 0.010||0.084 0.009||0.107 0.009|
|RandomLIME||0.113 0.009||0.107 0.010||0.110 0.009||0.080 0.009||0.102 0.009|
|GuidedLIME||0.117 0.009||0.106 0.010||0.102 0.008||0.077 0.009||0.101 0.009|
|MILLI||0.117 0.009||0.105 0.010||0.109 0.009||0.084 0.010||0.104 0.009|
|Model Acc||0.827 0.024||0.807 0.029||0.807 0.019||0.800 0.028||0.810 0.005|
|Inherent||N/A||0.124 0.005||0.001 0.007||0.000 0.003||0.042 0.005|
|Single||0.132 0.005||0.125 0.005||0.120 0.005||0.082 0.003||0.115 0.004|
|One Removed||0.123 0.005||0.127 0.005||0.114 0.004||0.081 0.003||0.111 0.004|
|Combined||0.130 0.005||0.127 0.005||0.124 0.005||0.082 0.003||0.116 0.004|
|RandomSHAP||0.124 0.004||0.122 0.005||0.121 0.005||0.080 0.003||0.112 0.004|
|GuidedSHAP||0.130 0.005||0.125 0.005||0.121 0.005||0.082 0.003||0.115 0.004|
|RandomLIME||0.128 0.005||0.125 0.005||0.119 0.005||0.081 0.003||0.113 0.004|
|GuidedLIME||0.129 0.005||0.125 0.005||0.122 0.005||0.082 0.003||0.115 0.004|
|MILLI||0.128 0.005||0.125 0.005||0.120 0.004||0.081 0.003||0.114 0.004|
|Model Acc||0.857 0.017||0.863 0.017||0.867 0.016||0.853 0.020||0.860 0.003|
|Inherent||N/A||0.130 0.006||0.000 0.006||0.000 0.003||0.043 0.005|
|Single||0.127 0.004||0.131 0.006||0.127 0.005||0.105 0.004||0.122 0.005|
|One Removed||0.117 0.004||0.129 0.006||0.105 0.005||0.103 0.004||0.114 0.005|
|Combined||0.126 0.004||0.130 0.006||0.127 0.005||0.106 0.004||0.122 0.005|
|RandomSHAP||0.124 0.004||0.126 0.006||0.119 0.005||0.102 0.004||0.118 0.005|
|GuidedSHAP||0.127 0.004||0.130 0.006||0.124 0.005||0.105 0.004||0.122 0.005|
|RandomLIME||0.124 0.004||0.128 0.006||0.121 0.005||0.104 0.004||0.119 0.005|
|GuidedLIME||0.123 0.004||0.130 0.006||0.118 0.005||0.104 0.004||0.119 0.005|
|MILLI||0.124 0.005||0.128 0.006||0.121 0.005||0.103 0.005||0.119 0.005|
|Model Acc||0.600 0.011||0.610 0.017||0.620 0.023||0.580 0.013||0.603 0.007|
|Inherent||N/A||0.050 0.002||0.001 0.002||0.000 0.001||0.017 0.002|
|Single||0.044 0.002||0.049 0.002||0.041 0.002||0.021 0.001||0.039 0.002|
|One Removed||0.043 0.002||0.049 0.002||0.040 0.002||0.021 0.001||0.038 0.002|
|Combined||0.044 0.002||0.050 0.002||0.041 0.002||0.021 0.001||0.039 0.002|
|RandomSHAP||0.044 0.002||0.049 0.002||0.041 0.002||0.021 0.001||0.039 0.002|
|GuidedSHAP||0.045 0.002||0.050 0.002||0.042 0.002||0.021 0.001||0.040 0.002|
|RandomLIME||0.044 0.002||0.050 0.002||0.041 0.002||0.021 0.001||0.039 0.002|
|GuidedLIME||0.044 0.002||0.050 0.002||0.041 0.002||0.021 0.001||0.039 0.002|
|MILLI||0.043 0.002||0.049 0.002||0.041 0.002||0.022 0.001||0.039 0.002|
We find that there is very little difference in interpretability performance for each of our proposed methods across these four datasets. This is to be expected, as the instances are independent, therefore the independent-instance methods as well as the local surrogate methods are able to identify the important instances, i.e., there is little to be gained by sampling coalitions when each instance can be understood in isolation. However, this reinforces the applicability of all of our proposed methods. We note that the attention and GNN inherent interpretability methods perform poorly in these experiments — this is because they are unable to condition their outputs on a specific class (i.e., they can only answer which questions, not what questions). We also note that the performance is much worse on the Fox dataset. Here, the underlying MIL models perform poorly, so it is unsurprising that the interpretability methods also perform poorly.
a.5 Interpretability method hyperparameter selection
In this section we discuss our method for hyperparameter selection in the interpretability methods, and detail the hyperparameters that we found to be most effective. An advantage of the inherent interpretability and independent-instance methods is that they do not have hyperparameters, i.e., we only had to select hyperparameters for the local surrogate interpretability methods. First, we discuss our choice of hyperparameters for LIME, and then our choice of hyperparameters for MILLI.
When using the LIME weight kernel, there are two hyperparameters to tune. Firstly, the distance measures that are commonly used are L2 and cosine distance. In our experiments, we found very little difference between each of these distance measures, therefore we arbitrarily chose to use L2 distance in all of our experiments. The second hyperparameter is the kernel width, which determines the weighting of coalitions, i.e., a large kernel width means all coalitions are weighted more evenly, and a small kernel width prioritises larger coalitions. We chose to use a kernel width for all of our experiments that is determined by the average bag size, such that the half coalition (i.e., ) is weighted at 0.5.
For MILLI, there were three hyperparameters to tune: the number of samples, , and . For the number of samples, we generated sample size plots such as Figure 4 (also see Appendix A.9), and chose the number of samples to be at the point where all the methods had reasonably converged. For and we ran a grid search over the possible values, and chose the best performing pair of parameters. The hyperparameters were tuned for each dataset, except for the TEF datasets, in which we only tuned on Tiger and then used the same hyperparameters across all three datasets. In Table A5, we provide the chosen hyperparameters for each dataset, as well as the expected coalition size determined by and . For the CRC, Musk and TEF datasets, the chosen parameters produce small values for , i.e., the sampling is heavily biased towards smaller coalitions, which is expected as these are instance independent datasets. Conversely, the parameters for the 4-MNIST-Bags focus on sampling larger coalitions that are able to capture the instance interactions in the dataset. Although the SIVAL dataset is instance independent, the parameters also focus on sampling larger coalitions. This could be because, although the instances are independent, it may be difficult to classify an object from just one instance, i.e., several instances are needed to make the correct decision. We also note that, in all cases, , i.e., the sampling is biased towards instances ranked lower in the initial importance ordering used by MILLI. Our explanation for this is that it is easy to understand the contribution of discriminatory instances, as they have a large effect on the model prediction, but it is more difficult to understand the contribution of non-discriminatory instances, as they have much less of an effect. Therefore, more samples containing non-discriminatory instances are required to properly understand their effect, hence biasing the sampling towards them.
a.6 SIVAL experiment details
For the SIVAL dataset, each instance is represented by a 30-dimensional feature vector, and there are around 30 instances per bag. We chose 12 of the 25 original classes to be the positive classes, and randomly selected 30 images from each of the other 13 classes to form the negative class, meaning overall we had 13 classes (12 positive and one negative). In total, we had 60 bags for each of the 12 positive classes, and 390 bags for the single negative class, meaning the class distribution was for each positive class and for the negative class. The arbitrarily chosen positive classes were: apple, banana, checkeredscarf, cokecan, dataminingbook, goldmedal, largespoon, rapbook, smileyfacedoll, spritecan, translucentbowl, and wd40can
. We normalised each instance according to the dataset mean and standard deviation. No other data augmentation was used. The dataset was separated into train, validation, and test data using an 80/10/10 split. This was done with stratified sampling in order to maintain the same data distribution across all splits.
When training models against the SIVAL dataset, we used a batch size of one, i.e., a single bag of, on average, 30 instances. We trained the models to minimise cross entropy loss using the Adam optimiser; the hyperparamater details for learning rate (LR), weight decay (WD) and dropout (DO) are given in Table A7
. We utilised early stopping based on validation loss — if the validation loss had not decreased for 10 epochs then we terminated the training procedure and reset the model to the point at which it caused the last decrease in validation loss. Otherwise, the maximum number of epochs was 100. The tuned architectures for each model are given in TablesA7 to A10, and the results for each model are comapred in Table A11.
FC + ReLU + DO
|2||FC + ReLU + DO||128||256|
|1||FC + ReLU + DO||30||512|
|2||FC + ReLU + DO||512||256|
|3||FC + ReLU + DO||256||64|
|1||FC + ReLU + DO||30||128|
|2||FC + ReLU + DO||128||256|
|3||FC + ReLU + DO||256||128|
|4||mil-attn(256) + DO||128||128|
|1||FC + ReLU + DO||30||128|
|2a GNNembed||SAGEConv + ReLU + DO||128||128|
|SAGEConv + ReLU + DO||128||256|
|SAGEConv + ReLU + DO||256||64|
|2b GNNcluster||SAGEConv + Softmax||128||1|
|4||FC + ReLU + DO||64||128|
|Model||Train Accuracy||Val Accuracy||Test Accuracy|
|MI-Net||0.984 0.004||0.850 0.009||0.819 0.012|
|mi-Net||0.967 0.005||0.835 0.010||0.808 0.011|
|MI-Attn||0.972 0.007||0.830 0.011||0.813 0.012|
|MI-GNN||0.932 0.014||0.803 0.014||0.781 0.019|
a.7 4-MNIST-Bags experiment details
In the 4-MNIST-Bags experiments, the bag sizes were draw from a normal distribution, with a mean of 30 and a variance of 2. We used 2500 training bags, 1000 validation bags, and 1000 test bags. The instances in the training bags were only drawn from the original MNIST training split, and the instances in the validation and test bags were only drawn from the original MNIST test split, i.e., there was no overlap between training, validation, and test instances. The classes were balanced, so, on average, there were 625 bags per class in the training data, and 250 bags per class in the validation and test data. We normalised the MNIST images using the PyTorch normalise transformation, with a mean of 0.1307 and a stand deviation of 0.3081. No other data augmentation was carried out.
The training procedure was the same as for the SIVAL experiments: a batch size of one, early stopping with a patience of ten, and a maximum of 100 epochs. The training hyperparamater details are given in Table A13. For each model, we first passed the MNIST instances through a convolutional architecture to produce initial instance embeddings. This encoder was not tuned for each model (i.e., the architecture was fixed, but the weights were learnt). The architecture for this encoder is given in Table A13, and then the model architectures are given in Tables A15 to A17. The encoder produces features vectors with 800 features, therefore the input size to each of the models is 800. The model results are given in Table A18.
|1||Conv2d(5, 1, 0) + ReLU||1||20|
|2||MaxPool2d(2, 2, 0) + DO||20||20|
|3||Conv2d(5, 1, 0) + ReLU||20||50|
|4||MaxPool2d(2, 2, 0) + DO||50||50|
|1||FC + ReLU + DO||800||128|
|2||FC + ReLU + DO||128||512|
|1||FC + ReLU + DO||800||512|
|2||FC + ReLU + DO||512||128|
|3||FC + ReLU + DO||128||64|
|Layer||Type||Input size||Output size|
|1||FC + ReLU + Dropout||800||64|
|2||FC + ReLU + Dropout||64||256|
|3||mil-attn(64) + Dropout||256||256|
|4||FC + ReLU + Dropout||256||64|
|Layer||Type||Input size||Output size|
|1||FC + ReLU + Dropout||800||64|
|2||FC + ReLU + Dropout||64||64|
|3a GNNembed||SAGEConv + ReLU + Dropout||64||128|
|SAGEConv + ReLU + Dropout||128||128|
|SAGEConv + ReLU + Dropout||128||128|
|3b GNNcluster||SAGEConv + Softmax||64||1|
|5||FC + ReLU + Dropout||128||64|
|Model||Train Accuracy||Val Accuracy||Test Accuracy|
|MI-Net||0.995 0.001||0.972 0.002||0.971 0.002|
|mi-Net||0.993 0.001||0.973 0.002||0.974 0.002|
|MI-Attn||0.991 0.002||0.967 0.003||0.967 0.003|
|MI-GNN||0.984 0.002||0.968 0.002||0.966 0.002|
a.8 CRC experiment details
In the CRC dataset, there are 100 microscopy images, and each image is annotated with four classes of nuclei: epithelial, inflammatory, fibroblast, and miscellaneous. Of these 100 images, 50 are in the negative class (no epithelial nuclei), and 50 are in the positive class (at least one epithelial nuclei). Each image is 500 x 500 pixels, and was split into 27 x 27 pixel patches to give 324 patches per slide. The patches were created by applying a non-overlapping grid over the image, where each cell of the grid was a 27 x 27 pixel region. The alternative would be to extract patches centred on each marked nuclei (as done by Ilse et al. (2018)), however this method then requires the nuclei to be marked for unseen data. This accounts for the difference between our results and the results of Ilse et al. (2018) — extracting the patches using a fixed grid will include patches that do not contain any marked nuclei, increasing the amount of non-discriminatory data in each bag and thus making the problem harder to learn. We removed slide background patches by using a brightness threshold — any patch with an average pixel value above 230 (using pixel values 0 to 255) was discarded. This left an average of 264 patches per image, and one image was discarded as it had zero foreground patches (this was also manually verified). The images were normalised using the dataset mean (0.8035, 0.6499, 0.8348) and standard deviation (0.0858, 0.1079, 0.0731). During training, we also applied three transformations to patches at random: horizontal flips, vertical flips, and 90 degree rotations. The dataset was separated into train, validation, and test data using a 60/20/20 split. This was done with stratified sampling in order to maintain the same data distribution across all splits.
The training procedure was the same as for the SIVAL and 4-MNIST-Bags experiments: a batch size of one, early stopping with a patience of ten, and a maximum of 100 epochs. The training hyperparamater details are given in Table A20. Following the same procedure as the 4-MNIST-Bags experiments, for each model, we first passed the patches through a convolutional architecture to produce initial instance embeddings. The architecture for this encoder is given in Table A20, and then the model architectures are given in Tables A22 to A24. The encoder produces features vectors with 1200 features, therefore the input size to each of the models is 1200. The model results are given in Table A25.
|1||Conv2d(4, 1, 0) + ReLU||3||36|
|2||MaxPool2d(2, 2, 0) + DO||36||36|
|3||Conv2d(3, 1, 0) + ReLU||36||48|
|4||MaxPool2d(2, 2, 0) + DO||48||48|
|1||FC + ReLU + DO||1200||64|
|2||FC + ReLU + DO||64||512|
|4||FC + ReLU + DO||512||128|
|4||FC + ReLU + DO||128||64|
|4||FC + ReLU + DO||64||2|
|1||FC + ReLU + DO||1200||64|
|2||FC + ReLU + DO||64||64|
|3||FC + ReLU + DO||64||64|
|Layer||Type||Input size||Output size|
|1||FC + ReLU + Dropout||1200||64|
|2||FC + ReLU + Dropout||64||64|
|3||FC + ReLU + Dropout||64||256|
|4||mil-attn(128) + Dropout||256||256|
|Layer||Type||Input size||Output size|
|1||FC + ReLU + Dropout||1200||64|
|2||FC + ReLU + Dropout||64||128|
|3a GNNembed||SAGEConv + ReLU + Dropout||128||128|
|3b GNNcluster||SAGEConv + Softmax||128||1|
|5||FC + ReLU + Dropout||128||128|
|Model||Train Accuracy||Val Accuracy||Test Accuracy|
|MI-Net||0.880 0.041||0.805 0.034||0.795 0.042|
|mi-Net||0.856 0.041||0.810 0.043||0.795 0.049|
|MI-Attn||0.870 0.014||0.860 0.017||0.830 0.028|
|MI-GNN||0.791 0.039||0.765 0.031||0.770 0.032|
a.9 Additional experiments
In this section, we provide additional sample size experiments for the SIVAL, 4-MNIST-Bags, and CRC datasets for MI-Net (Figure A1), mi-Net (Figure A2), and MI-GNN (Figure A3). We find the trends are relatively consistent across all the models. GuidedSHAP and MILLI are the most consistent and well performing methods, and MILLI is particularly effective on the 4-MNIST-Bags dataset. One considerable difference is the performance of RandomLIME and GuidedLIME on the CRC dataset for the MI-GNN model; for all other models, both LIME methods perform poorly on the CRC dataset. Further investigation is required to understand why this happens. However, the LIME methods are still outperformed by MILLI and GuidedSHAP on the CRC dataset, even for the MI-GNN model.
a.10 Additional outputs
In this section, we provide additional interpretability outputs from our studies, similar to the one in Section 5. First, we provide additional outputs for the 4-MNIST-Bags dataset. Then, we show the interpretability outputs for the SIVAL and CRC datasets. The SIVAL outputs are created by ranking the instances in a bag according to some interpretability function (i.e., by using MILLI), and then selecting the top , where is the known number of key instances in the bag. Then, the corresponding segment for each instance is weighted by the function , where is the instance’s position in the ranking (this is the same scaling used in NDCG@n). The other instances all received a weighting of zero. The brightness of each segment in the output image then corresponds with its relative importance according to the interpretability method. For the CRC interpretability outputs, the important patches are found by using an interpretability method to output the top patches that support a particular class, where is the number of known important instances for that class. We then highlight these patches while dimming the other patches to produce an interpretation of the model’s decision-making.