Machine learning models commonly exhibit unexpected failures post-deployment due to either data shifts or uncommon situations in the training environment. Domain experts typically go through the tedious process of inspecting the failure cases manually, identifying failure modes and then attempting to fix the model. In this work, we aim to standardise and bring principles to this process by answering two critical questions: (i) how do we know that we have identified meaningful and distinct failure types?; (ii) how can we validate that a model has, indeed, been repaired? We suggest that the quality of the identified failure types can be validated by measuring the intra- and inter-type generalisation after fine-tuning, and introduce metrics to compare different subtyping methods. Furthermore, we argue that a model can be considered repaired if it achieves high accuracy on the failure types while retaining performance on the previously correct data. We combine these two ideas into a principled framework for evaluating the quality of both the identified failure subtypes and model repairment. We evaluate its utility on a classification and an object detection task. Our code is available at https://github.com/Rokken-lab6/Failure-Analysis-and-Model-Repairment
It is common for new failures to be discovered once a model has been deployed in the “wild”. While recent lines of research in medical imaging have shown promising results in designing robust machine learning (ML) models [1, 2, 3, 4, 5], it may not be realistic to achieve perfect generalisation to every relevant environment.
Consequently, recently published guidelines for the reliable application of ML systems in healthcare [6, 7] and recent work from Luke et al. stress the importance of analyzing and reporting clinically relevant failure cases. However, there is a lack of standardised protocols to identify, validate and analyze those failure types. Typically, domain experts manually inspect the failure cases and make sense of them by identifying a set of failure modes. But this approach can be both expensive and biased by the human expertise. For example, a critical yet rare subgroup could be missed with such an approach and go unreported. A notable recent work recognises this issue and makes a first step towards data-driven approaches to failure subtyping by clustering learned features based on whether their presence or absence is predictive of poor performance. However, to date, little attention has been paid to evaluation metrics for the identified failure types, hampering the development of new methods in this direction. Furthermore, even if a set of meaningful failure types could be identified, methods for fixing them and evaluating their success remain undeveloped.
In an attempt to bring principles to the process of failure analysis and model repairment, we introduce a framework for not only deriving subtypes of failure cases and measuring their quality, but also repairing the models and verifying their generalisation. We put forward a set of desirable properties that a meaningful set of failure subtypes should meet and design surrogate metrics. Moreover, we propose a data-driven method for identifying failure types based on clustering in feature or gradient spaces. This method was able not only to identify failure types with the highest quality according to our metrics but also to identify clinically important failures like undetected catheters close to the ultrasound probe in intracardiac echocardiography. Finally, we argue that model repairment should not only aim to fix each failure type in a generalizable manner but also to ensure the performance on previously successful cases is retained.
We elaborate below the two phases (see Fig. 1) of our approach to model repair.
It is not evident how to optimally separate failure cases into a set of distinct failure types. Domain experts could split the failure cases according to visual appearance, or consider the importance of different failures from a clinical perspective, for example according to stages or clinical signs of a disease. However, we suggest that clinical relevance does not necessarily reflect the way the model distinctly fails on each failure type. Failure types should be specific not only to the model but also to the repairment methods that can be used to fix them. Moreover, objective metrics to quantify the quality of failure types for the purpose of model repairment are still lacking.
We postulate that a set of failure types should satisfy two desirable properties: Independence and Learnability. In addition, we propose two novel surrogate metrics to assess those properties in practice. They are measured by fine-tuning the model on a subset of the given failure type under a Compatibility constraint, and then calculating the performance on each type. We assume a sufficiently large set of failures with no label noise or corrupted images.
(i) Independence: The subtypes should be as independent as possible from each other in the way the model fails. In other words, they can be fixed by comparatively distinct alterations to the decision boundary. If two types are independent, a model fine-tuned on one type should not be useful for the other. In practice, we suggest a continuous measure of Independence: the average difference between the test performance of each fine-tuned model on its own failure type and its performance on the remaining types, using a performance metric such as accuracy.
(ii) Learnability: Each subtype should be homogeneous and consist of examples for which the model has failed in a similar way. In other words, such a failure type contains failure cases that can be fixed via a similar modification to the model’s decision boundary. If a subtype is heterogeneous, fixing it would require more modifications to the model and would be more challenging. We thus posit that a more homogeneous failure type would be easier to learn, and measure Learnability as the generalisation of the fine-tuned model on the chosen failure type.
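Both metrics can be read off a cross-evaluation matrix A, where A[i][j] is the test performance on failure type j of the model fine-tuned on type i: Learnability is the mean of the diagonal, and Independence the average gap between the diagonal and the off-diagonal entries. A minimal sketch; the function names and matrix values below are illustrative, not taken from the paper:

```python
import numpy as np

def learnability(A: np.ndarray) -> float:
    """Mean test performance of each fine-tuned model on its own failure type."""
    return float(np.mean(np.diag(A)))

def independence(A: np.ndarray) -> float:
    """Average gap between on-type performance and performance on the other types."""
    k = A.shape[0]
    off_diag_mean = (A.sum(axis=1) - np.diag(A)) / (k - 1)
    return float(np.mean(np.diag(A) - off_diag_mean))

# Illustrative 3-type cross-evaluation matrix
# (row i: fine-tuned on type i; column j: tested on type j).
A = np.array([
    [0.95, 0.30, 0.25],
    [0.20, 0.90, 0.35],
    [0.30, 0.25, 0.85],
])
print(learnability(A))   # ≈ 0.90
print(independence(A))   # ≈ 0.625
```

A strong diagonal with weak off-diagonal entries, as in this toy matrix, is exactly the signature of subtypes that are both learnable and mutually independent.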
Compatibility constraint: We argue in addition that the above surrogate metrics should be measured under the constraint of maintaining performance on correct data. This is necessary to avoid learning pathological discriminative rules just to solve a specific failure type. For example, given a failure type only containing images of a single class, the model could simply learn to ignore the input images and predict the same class everywhere, in which case the failure is fixed in a meaningless way. Compatibility ensures “locality” of the failure types by ensuring that the required changes in the discriminative rules do not considerably influence the previously correct cases. Compatibility is achieved by fine-tuning on both a failure type and previously correct cases in equal proportions, and early stopping when validation performance drops below 0.9.
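In code, the Compatibility constraint amounts to interleaving batches from the failure type and from the previously correct cases in equal proportions, and stopping as soon as validation accuracy on the correct cases falls below the threshold. A framework-agnostic sketch; all function and argument names here are hypothetical, not from the authors' code:

```python
def finetune_with_compatibility(model, failure_batches, correct_batches,
                                train_step, correct_val_accuracy, threshold=0.9):
    """Fine-tune on a failure type and correct cases in equal proportions,
    early-stopping when validation accuracy on the correct set drops
    below `threshold` (the Compatibility constraint)."""
    for failure_batch, correct_batch in zip(failure_batches, correct_batches):
        # 50/50 sampling: one failure batch, then one correct batch.
        train_step(model, failure_batch)
        train_step(model, correct_batch)
        if correct_val_accuracy(model) < threshold:
            break  # further updates would degrade previously correct cases
    return model
```

Here `train_step` performs one optimisation step and `correct_val_accuracy` evaluates the model on held-out correct cases; both are left abstract so the sketch applies to classification and detection alike.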
Manual analysis can be not only time-consuming but also suboptimal, potentially overlooking meaningful failure modes. Therefore, we wish to automatically uncover a set of failure types with good Independence and Learnability scores. We thus also explore two methods as specific instantiations of our framework; in particular, we experiment with clustering the failure cases in the feature space (Feature clustering) and in the gradient space of a differentiable model (Gradient clustering). The gradients of the loss with respect to the parameters are used (for object detection: the loss specific to the object). We expect that similar data will be close in the feature space, and that failures whose correction requires similar changes to the model parameters will be close in the gradient space. Furthermore, features are averaged over spatial dimensions, and both features and gradients are reduced to 10 dimensions by Gaussian random projection followed by UMAP. Finally, the data is clustered with k-means, choosing the number of clusters between 3 and 10 by the highest Silhouette score.
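The subtyping pipeline on per-example feature or gradient vectors can be sketched with scikit-learn as below; the intermediate UMAP step used in the paper is omitted to keep the example dependency-free, and the function name is ours:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def subtype_failures(vectors, k_min=3, k_max=10, dim=10, seed=0):
    """Cluster per-example feature or gradient vectors into failure subtypes.
    The paper additionally applies UMAP after the random projection before
    clustering; that step is omitted in this sketch."""
    # Reduce the (possibly very high-dimensional) vectors to `dim` dimensions.
    reduced = GaussianRandomProjection(n_components=dim,
                                       random_state=seed).fit_transform(vectors)
    best_labels, best_score = None, -1.0
    # Pick the number of clusters by the highest Silhouette score.
    for k in range(k_min, min(k_max, len(vectors) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(reduced)
        score = silhouette_score(reduced, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score
```

For Feature clustering, `vectors` would hold spatially averaged activations of the failure cases; for Gradient clustering, the flattened per-example gradients of the loss with respect to the parameters.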
Once a set of target failure types has been identified, the model needs to be repaired. We argue that a successful repairment leads to a model that generalises on unseen cases of the target failure types while maintaining performance on the cases where the model previously performed well. First, a target set of failure types is selected by the end-users; each type is then split into two sets, one used to repair the model and the other used to evaluate generalisation. As repairment procedures, we experiment with fine-tuning on the failure types (with or without the correct cases) and elastic weight consolidation (EWC), a popular continual learning approach. We note, however, that other model adaptation approaches are also applicable in our framework, such as more recent variants of continual learning [12, 13], meta-learning and domain adaptation.
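EWC discourages changes to parameters that were important for the original task by adding a quadratic penalty weighted by an estimate of the Fisher information. A minimal numpy sketch of the penalty term; the parameter values, Fisher estimates and λ below are illustrative only:

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """Elastic weight consolidation penalty: (lam/2) * sum_i F_i (theta_i - theta*_i)^2.
    `fisher` approximates each parameter's importance to the previously learned task."""
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_star) ** 2))

theta_star = np.array([1.0, -2.0, 0.5])   # parameters after original training
theta      = np.array([1.5, -2.0, 1.5])   # parameters during repairment
fisher     = np.array([4.0,  1.0, 0.1])   # diagonal Fisher information estimate
print(ewc_penalty(theta, theta_star, fisher, lam=2.0))  # ≈ 1.1
```

During repairment, this penalty is added to the task loss, so fine-tuning on a failure type is pulled back towards the solution that worked on the previously correct cases; movement is cheap along low-Fisher directions and expensive along high-Fisher ones.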
We evaluate the efficacy of the proposed model repairment framework on two medical imaging datasets. The details of the respective datasets along with the specification of models/optimisation are provided below.
Binary PathMNIST (BP-MNIST)
is a publicly available classification dataset consisting of colorectal cancer histology slide patches [16, 17], derived originally from the NCT-CRC-HE-100K dataset and resized to 3x28x28 as part of the MedMNIST benchmark. We simplify the original 9-way classification task into a binary task of discriminating benign and malignant classes (cancer-associated stroma and colorectal adenocarcinoma epithelium) and use the original classes of granular tissue types as metadata to interpret the discovered failure types. Moreover, 40% of the dataset was put into the test set to increase the sample size for the evaluation of both subtyping and model repairment. Finally, the model was trained with Adam in combination with early stopping on the validation accuracy. The architecture is a version of VGG with 6 convolutional layers starting at 16 channels, with the fully connected layer replaced by a 1x1 convolution to two output channels and a spatial average.
ICE Catheter Detection (ICE-CD)
is a private real-world object detection dataset comprised of ultrasound images of intra-cardiac catheters, acquired with an Intracardiac Echocardiography (ICE) device in pigs. For the purpose of evaluating the performance of catheter detection models, each catheter image was classified into different types representing known difficult situations based on catheter appearance or position. In addition, information about the rough anatomical location of the probe is available as metadata. The architecture is composed of 5 residual blocks of two convolutional layers (starting at 8 channels and doubling up to 128 channels), followed by two 1x1 convolution branches: a classification branch and a center position regression branch. This dataset has been acquired in accordance with animal experiment regulations.
Baselines and Experiments: we aim to quantify the quality of the proposed automatic subtyping methods (see Table 1 and Fig. 2), which we compare against several baselines: Random (random clusters), False positives and negatives (FP/FN) (two clusters) and Metadata (BP-MNIST: the original 9 tissue types; ICE-CD: image types as identified by the clinicians, and in addition anatomical locations of the images). For each failure type, the model was fine-tuned on both that type and the correct cases (to satisfy Compatibility). Early stopping is performed with the best validation score on the failure type (accuracy on BP-MNIST) while maintaining validation accuracy on the correct cases. Table 1 displays the average metrics for the respective methods, while Fig. 2 shows granular results, i.e., the matrix whose entries denote the test accuracy on each failure type of the model fine-tuned on each type.
Analysis: first of all, Gradient clustering reached better scores than any other method, with the exception of Independence for FP/FN clustering on BP-MNIST. However, it had an 18% higher Learnability score and is more informative, with more identified failure types. Remarkably, Gradient clustering was better than using Metadata, including the ICE-CD metadata produced through very time-consuming visual inspection. In the case of BP-MNIST, the lack of independence of Metadata subtyping is clearly visible in Fig. 2(a). This implies that Gradient clustering might be able to identify independent failure types which are not obvious to the human eye but are relevant to the model. On the other hand, Feature clustering achieved lower scores than Metadata. Furthermore, Gradient clustering might achieve higher Learnability and Independence because it is more aligned with the repairment method. Finally, Random resulted in by far the lowest Independence scores, as is apparent in Fig. 2(c) where all types had the same score. This shows that the Independence metric is effective in detecting when failure types are mixed together. Moreover, Random and FP/FN clustering showed lower Learnability, which may be explained by the diversity of tasks to be learned within each cluster. However, FP/FN had the highest Independence, likely because its two clusters align with the two classes.
| Subtyping method | BP-MNIST Learn. (ACC) | BP-MNIST Indep. (ACC) | Catheter Detection Learn. (ACC) | Catheter Detection Indep. (ACC) |
|---|---|---|---|---|
| False positives and negatives | 0.74 ± 0.22 | 0.74 ± 0.22 | 0.69 ± 0.14 | 0.14 ± 0.37 |
| Metadata (BP-MNIST: tissue type) | 0.92 ± 0.05 | 0.58 ± 0.16 | - | - |
| Metadata (ICE-CD: image type) | - | - | 0.77 ± 0.17 | 0.46 ± 0.16 |
| Metadata (ICE-CD: anatomical location) | - | - | 0.75 ± 0.22 | 0.37 ± 0.20 |
We now inspect the subtypes automatically discovered on BP-MNIST and ICE-CD by Gradient clustering, which achieved the best scores.
Binary PathMNIST (BP-MNIST): first, we observe that false positives and false negatives were mostly separated into two sets of clusters (i.e., each cluster contains mostly either circles or crosses, as shown in Fig. 3(a)). Moreover, the two malignant tissue types were recovered separately: Clusters 2 and 4 for cancer-associated stroma and Cluster 1 for colorectal adenocarcinoma epithelium. Secondly, even within one tissue type, Gradient clustering was able to discover independent failure types. Clusters 2 and 4 both focused on cancer-associated stroma, but were relatively independent and differed when evaluated on each other (see Fig. 2). In addition, Clusters 2 and 4 were visually different, as seen in Fig. 3, with Cluster 4 corresponding to darker, less textured images. Finally, only Cluster 8 seemed to contain normal colon mucosa (in addition to debris), and it contains darker, more textured images than the other false positive clusters.
Catheter detection (ICE-CD): Gradient clustering was able to recover a known and important but under-represented failure type: Cluster 4 (red cluster in Fig. 4(a)) focused on Near-probe catheters which are close to the ultrasound probe. Indeed, these catheters are hard to detect due to noise in this region of the images. Secondly, Gradient clustering was able to automatically discover some of the anatomical locations. Indeed, Cluster 6 (brown cluster in Fig. 4(a)) focused on the LAA and Cluster 3 (green cluster in Fig. 4(a)) focused mostly on the SVC/IVC. Finally, Gradient clustering was able to separate false positives (see the orange cluster in Fig. 4(a)) from false negatives (the others).
Experiments: we aim to evaluate how much performance on the failure types can be improved while retaining performance on the correct cases (see Table 2). We compare several repairment approaches on both datasets based on fine-tuning and EWC. Fine-tuning is performed on either a single failure type or all failure types at once; we also compare including the previously correct cases with a 50% sample ratio. For all methods, early stopping selects the best accuracy on the failure types. Models were fine-tuned with weight decay and a dataset-specific learning rate. Table 2 reports the accuracy on the test set of each failure type, on the previously correct cases and on the overall test set.
Analysis: first, fine-tuning on a single failure type generalised better on that specific failure type than fine-tuning on all incorrect cases at once (see Table 2). This may indicate that the failure types conflict during fine-tuning, and that it is more difficult to learn a diverse set of cases simultaneously than a homogeneous one. Therefore, if learning unimportant failures conflicts with learning critical ones, it makes sense to first repair a carefully selected subset of the failures. Secondly, fine-tuning on the failures alone could not maintain performance on the correct cases, while including the correct cases helped to preserve it: on ICE-CD, fine-tuning with the correct cases included dropped correct-case accuracy to 0.73, still far higher than the 0.32 obtained when fine-tuning on the failures only. Finally, while the performance on correct cases did not drop as quickly with EWC as with simple fine-tuning, EWC was ultimately unable to maintain correct-case accuracy above 0.27 and 0.51.
| Repairment method | Type 1 | Type 2 | Type 3 | Type 4 | Type 5 | Type 6 | Type 7 | Type 8 | Correct | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| BP-MNIST: fine-tuning on a single failure type | 0.92 | 0.94 | 0.97 | 1.00 | 0.99 | 0.94 | 0.96 | 0.92 | 0.81 | 0.77 |
| BP-MNIST: EWC | 0.34 | 0.65 | 0.79 | 0.79 | 0.85 | 0.64 | 0.79 | 0.76 | 0.27 | 0.30 |
| ICE-CD: fine-tuning on a single failure type | 0.85 | 0.99 | 0.77 | 0.69 | 0.64 | 0.96 | - | - | 0.32 | 0.36 |
| ICE-CD: EWC | 0.41 | 0.89 | 0.59 | 0.26 | 0.47 | 0.94 | - | - | 0.51 | 0.56 |
We have introduced a principled framework to address the problems of failure identification, analysis and model repairment. Firstly, we put forward a set of desirable properties for meaningful failure types and novel surrogate metrics to assess those properties in practice. Secondly, we argued that model repairment should not only aim to fix the failures but also to retain performance on the previously correct data. Finally, we showed specific instantiations of our framework and demonstrated that clustering in feature and gradient space can automatically identify clinically important failures and outperform manual inspection.
Improving robustness of deep learning based knee MRI segmentation: mixup and adversarial domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
Reporting of artificial intelligence prediction models. The Lancet, 393(10181):1577–1579, 2019.
Understanding failures of deep networks via robust feature extraction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
UMAP: Uniform Manifold Approximation and Projection. The Journal of Open Source Software, 3(29):861, 2018.
Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.