Practical Insights of Repairing Model Problems on Image Classification

by Akihito Yoshii, et al.

Additional training of a deep learning model can negatively affect its results, turning an initially correctly classified sample into a misclassified one (degradation). Such degradation is possible in real-world use cases because of the diversity of sample characteristics: a dataset is a mixture of critical samples that must not be missed and less important ones. Accuracy alone therefore cannot capture the performance. While existing research aims to prevent model degradation, insights into the related methods are needed to grasp their benefits and limitations. In this talk, we present implications derived from a comparison of methods for reducing degradation. In particular, we formulated use cases for industrial settings in terms of dataset arrangements. The results imply that, because of a trade-off between accuracy and preventing degradation, a practitioner should continuously reconsider which method is best, taking dataset availability and the life cycle of the AI system into account.






1. Degradation in Deep Learning Tasks

A deep learning model can be trained multiple times (retrained) with additional datasets. In a classification task, the model's accuracy is expected to increase after retraining.

However, some samples can change from positive (classified correctly) to negative (classified incorrectly) even if overall accuracy increases. We call such a situation "degradation" in this talk. Especially in a real-world setting, the available additional datasets vary in size, quality, and importance; thus, a model can be susceptible to degradation-causing samples.

In a use case where misclassifying certain samples leads to serious accidents, those samples should be kept positive. However, such a situation cannot always be measured by accuracy: even a highly accurate model that classifies almost all samples correctly is not acceptable if most of the "important" samples belong to the misclassified ones.
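As a minimal illustration (assuming per-sample label predictions from the models before and after retraining; the function name is ours, not from the paper), degradation can be detected by comparing both sets of predictions against the ground truth:

```python
import numpy as np

def find_degraded(y_true, old_pred, new_pred):
    """Return indices of samples that degrade after retraining:
    correct under the old model, incorrect under the new one."""
    y_true, old_pred, new_pred = map(np.asarray, (y_true, old_pred, new_pred))
    was_correct = old_pred == y_true
    now_wrong = new_pred != y_true
    return np.flatnonzero(was_correct & now_wrong)

# Overall accuracy can rise while individual samples degrade:
y_true   = np.array([0, 1, 2, 2, 1])
old_pred = np.array([0, 1, 0, 0, 0])  # accuracy 2/5
new_pred = np.array([0, 0, 2, 2, 1])  # accuracy 4/5
print(find_degraded(y_true, old_pred, new_pred))  # → [1]
```

If the sample at index 1 is an "important" one, the update is problematic despite the accuracy gain.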

2. Existing Approaches and Our Motivation

To reduce degradation, some works improve the retraining procedure whereas others propose non-retraining approaches. Our existing work, NeuRecover (Tokui et al., 2022), is a non-retraining method that enables a developer to repair part of a DNN model's weights with a small number of data samples. On the other hand, crafting the retraining itself is another approach to reduce degradation. For example, Backward Compatibility ML (Bansal et al., 2019)(Srivastava et al., 2020) introduces a penalty term into the loss function. It is designed to prevent new errors that did not exist before retraining, since such errors affect user expectations (Bansal et al., 2019).
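As a rough sketch of this idea (not the library's actual formulation; `compatible_loss`, the penalty term, and `penalty_weight` are illustrative), a penalty can up-weight the loss on samples where retraining introduces a new error:

```python
import numpy as np

def compatible_loss(ce_loss, y_true, old_pred, new_pred, penalty_weight=1.0):
    """Per-batch cross-entropy plus a penalty on newly introduced errors:
    samples the old model got right but the new model gets wrong."""
    y_true, old_pred, new_pred = map(np.asarray, (y_true, old_pred, new_pred))
    new_error = (old_pred == y_true) & (new_pred != y_true)
    # Extra loss only where a previously correct sample breaks.
    return np.mean(ce_loss + penalty_weight * new_error.astype(float) * ce_loss)
```

Raising `penalty_weight` pushes the optimizer toward updates that break fewer previously correct samples, at some cost to raw accuracy.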

We need to grasp the advantages and limitations of the different methods: they have similar characteristics with respect to reducing degradation, yet were developed with different focuses.

In this talk, we present implications on suitable use cases for each method, derived from a comparison of methods for reducing degradation. This discussion is important because maintaining model quality leads to reliable AI-powered software. Although accuracy is one metric of model performance, it trades off against certain other metrics. Hypothesizing real use cases provides a clue for making decisions under this trade-off and achieving the intended task.

Figure 1. Overall processes


3. Experiment and Results

We formulated three use cases from the aspect of dataset arrangement: exclusive (operation-oriented), exclusive (development-oriented), and inclusive.

In the operation-oriented use case, the update set is larger than the training set, while the development-oriented case is the opposite. "Exclusive" assumes no intersection between the training and update sets; under the "Inclusive" condition, the training set is contained in the larger update set.
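The three arrangements can be sketched at the index level as follows; the split ratios and the function name `arrange` are illustrative, not the paper's actual sizes:

```python
import numpy as np

def arrange(n_samples, condition, seed=0):
    """Split sample indices into (train, update) sets for one condition."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    if condition == "exclusive-operation":      # disjoint, update set larger
        train, update = idx[: n_samples // 4], idx[n_samples // 4 :]
    elif condition == "exclusive-development":  # disjoint, training set larger
        train, update = idx[: 3 * n_samples // 4], idx[3 * n_samples // 4 :]
    elif condition == "inclusive":              # training set inside update set
        train, update = idx[: n_samples // 4], idx
    else:
        raise ValueError(condition)
    return train, update

train, update = arrange(1000, "inclusive")
assert set(train) <= set(update)
```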

Figure 1 shows the overall process of the experiment. We executed two retraining methods (categorical cross-entropy and Backward Compatibility ML (Bansal et al., 2019)(Srivastava et al., 2020); the latter's loss function was imported from the library at version 1.4.2, with the reduction parameter set to SUM_OVER_BATCH_SIZE) and one non-retraining method (NeuRecover (Tokui et al., 2022)) with datasets arranged under the aforementioned conditions. We used CIFAR-10 (Krizhevsky, 2009)

and Fashion-MNIST (Xiao et al., 2017), and then split each into a training set and an update set.

We evaluated the results with a test set disjoint from the training and update sets of each dataset, using metrics proposed in related work. In particular, we compared BR (Break Rate) (Tokui et al., 2022), BEC (Backward Error Compatibility) (Srivastava et al., 2020), and the accuracy improvement rate. BR is the proportion of previously correct samples that become incorrect after the update (Tokui et al., 2022); BEC is the proportion of the updated model's errors that were already errors of the original model, so a high BEC indicates few newly introduced errors (Srivastava et al., 2020); the accuracy improvement rate is the improvement in accuracy of the updated model relative to the original one.
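A minimal sketch of these metrics, assuming the paraphrased definitions above (per-sample label predictions for the old and new models; the helper name is ours):

```python
import numpy as np

def metrics(y_true, old_pred, new_pred):
    """Compute break rate (BR), backward error compatibility (BEC),
    and accuracy improvement rate from per-sample predictions."""
    y_true, old_pred, new_pred = map(np.asarray, (y_true, old_pred, new_pred))
    old_ok, new_ok = old_pred == y_true, new_pred == y_true
    br = np.mean(~new_ok[old_ok])    # previously correct, now broken
    bec = np.mean(~old_ok[~new_ok])  # new-model errors shared with old model
    acc_impr = (np.mean(new_ok) - np.mean(old_ok)) / np.mean(old_ok)
    return br, bec, acc_impr
```

With `y_true=[0,1,2,2]`, `old_pred=[0,1,0,0]`, and `new_pred=[0,0,2,2]`, accuracy rises from 0.5 to 0.75, yet BR is 0.5 and BEC is 0.0, illustrating how the metrics surface degradation that accuracy hides.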

The results suggest that NeuRecover has relatively high BEC and BR, while its accuracy improvement rate is lower than that of the other methods. The difference in accuracy improvement rate among methods is smaller under some dataset arrangement conditions than others. On the other hand, Backward Compatibility ML keeps BEC stably high even when BR is higher and the accuracy improvement is small.

4. Lessons Learned

We observed the following lessons from the results of our experiment.

Methods depend on the purpose and the dataset arrangement:

A suitable method can be chosen in light of the metrics stated in the previous section and each experimental condition.

A possible scenario for the "exclusive" situation is that only a pre-trained model is available to a developer, and the pre-trained model is updated with a dataset prepared by the developer at development or operation time. On the other hand, the "inclusive" situation assumes that the developer has the dataset available from the beginning without limitation; the developer can therefore include the initial training data in the dataset for future updates.

If environment changes at operation time are so significant that the model needs to be retrained, the "exclusive (operation-oriented)" case can be assumed, while the "exclusive (development-oriented)" case fits situations where the environment is relatively static.

Accuracy does not always have the highest priority:

A practitioner should cope with the trade-off between accuracy and preventing degradation. For example, Bansal et al. point out the discrepancy between human expectations of an AI system and model updates (Bansal et al., 2019). Thus, higher accuracy together with stably higher BEC can be what matters. On the other hand, even if its accuracy is lower than other methods', a method with higher BEC and lower BR can be the choice when improvements on a subset of important cases are needed.

5. Conclusion

We compared methods that aim to reduce degradation of classification results. The degradation aspect is important because the diversity of data samples in real use cases can cause misclassification and thereby decrease classification performance on important data samples. Practitioners should be aware of the trade-off between accuracy and non-accuracy metrics and dynamically choose a method depending on their purpose.


This work was partly supported by JST-Mirai Program Grant Number JPMJMI20B8, Japan.