1. Degradation in Deep Learning Tasks
A deep learning model can be trained multiple times (retrained) with additional datasets. Considering a classification task, the model accuracy is expected to increase after retraining.
However, some samples can change from positive (classified correctly) to negative (classified incorrectly) even if the overall accuracy increases. We call this situation "degradation" in this talk. Especially in a real-world setting, the available additional datasets vary in size, quality, and importance; a model can therefore be susceptible to degradation-causing samples.
In a use case where misclassifying certain samples leads to a serious accident, those samples should remain positive after retraining. However, this property cannot always be measured with accuracy: even a highly accurate model that classifies almost all samples correctly is not acceptable if most of the "important" samples belong to the misclassified ones.
2. Existing Approaches and Our Motivation
In order to reduce degradation, some works improve the retraining procedure whereas others propose non-retraining approaches. Our existing work, NeuRecover (Tokui et al., 2022), is a non-retraining method that enables a developer to repair part of a DNN model's weights with smaller data samples. Crafting the retraining itself is another approach to reducing degradation. For example, Backward Compatibility ML (Bansal et al., 2019; Srivastava et al., 2020) introduces a penalty term into the loss function, designed to prevent new errors that did not exist before retraining and that would violate user expectations (Bansal et al., 2019).
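As a rough illustration of this penalty-term idea (not the library's actual implementation), the following sketch weights the loss extra on samples the old model classified correctly, discouraging "new errors" on them; the function name, the weighting scheme, and the `lam` parameter are assumptions:

```python
import numpy as np

def bc_new_error_loss(probs_new, labels, preds_old, lam=1.0):
    """Sketch of a backward-compatibility loss: standard cross entropy
    plus a penalty on potential 'new errors' -- samples the old model
    already classified correctly.
    probs_new: (n, k) softmax outputs of the model being retrained
    labels:    (n,)   true class indices
    preds_old: (n,)   predicted class indices of the old model
    lam:       weight of the compatibility penalty (hypothetical)
    """
    n = len(labels)
    # Per-sample cross entropy on the new model's predicted probabilities.
    ce = -np.log(probs_new[np.arange(n), labels] + 1e-12)
    # Apply an extra loss weight only where the old model was correct,
    # discouraging regressions on those samples.
    old_correct = (preds_old == labels).astype(float)
    return float(np.mean(ce + lam * old_correct * ce))
```

With `lam = 0` this reduces to plain cross entropy; increasing `lam` trades raw accuracy for compatibility with the old model's correct predictions.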
We need to grasp the advantages and limitations of these different methods: they share the goal of reducing degradation but were developed with different focuses.
In this talk, we will present implications about suitable use cases for each method, derived from a comparison of methods for reducing degradation. This discussion is important because maintaining model quality leads to reliable AI-powered software. Although accuracy is one metric of model performance, it trades off against certain other metrics. Hypothesizing real use cases provides a clue for making decisions under this trade-off and achieving the intended task.
3. Experiment and Results
We formulated three use cases from the aspect of dataset arrangement: exclusive (operation-oriented), exclusive (development-oriented), and inclusive.
In the operation-oriented use case, the update set is larger than the training set, while the development-oriented case is the opposite. "Exclusive" assumes no intersection between the training and update sets; under the "Inclusive" condition, the training set is included in the larger update set.
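The three arrangements can be sketched as follows; the split ratios and names are illustrative assumptions, not the actual experimental sizes:

```python
import numpy as np

def arrange(indices, cond, rng):
    """Split a pool of sample indices into a training set and an update set
    under the three dataset-arrangement conditions described above.
    cond: 'excl_op'  -> exclusive, update set larger (operation-oriented)
          'excl_dev' -> exclusive, training set larger (development-oriented)
          'incl'     -> training set contained in the update set
    """
    idx = rng.permutation(indices)
    n = len(idx)
    if cond == "excl_op":   # disjoint sets, |update| > |training|
        return idx[: n // 4], idx[n // 4 :]
    if cond == "excl_dev":  # disjoint sets, |training| > |update|
        return idx[: 3 * n // 4], idx[3 * n // 4 :]
    if cond == "incl":      # training set is a subset of the update set
        train = idx[: n // 2]
        return train, idx   # update set includes all training samples
    raise ValueError(cond)
```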
Figure 1 shows the overall process of the experiment. We executed two retraining methods, categorical cross entropy and Backward Compatibility ML (Bansal et al., 2019; Srivastava et al., 2020) (the loss function was imported from https://github.com/microsoft/BackwardCompatibilityML, version 1.4.2, with the reduction parameter set to SUM_OVER_BATCH_SIZE), and one non-retraining method, NeuRecover (Tokui et al., 2022), with datasets arranged under the aforementioned conditions. We used CIFAR10 (Krizhevsky, 2009) and Fashion MNIST (Xiao et al., 2017), splitting each into a training set and an update set.
We evaluated the results with a test set separate from the training and update sets in each dataset, using metrics proposed in related work. Specifically, we compared BR (Break Rate) (Tokui et al., 2022), BEC (Backward Error Compatibility) (Srivastava et al., 2020), and the accuracy improvement rate. BR is the rate of samples that the model classified correctly before retraining but misclassifies afterward (Tokui et al., 2022); BEC is the fraction of the updated model's errors that the original model also made (Srivastava et al., 2020); and the accuracy improvement rate is the relative change in accuracy from the original model to the updated one.
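Following these verbal definitions, the three metrics can be sketched as below; where a normalization is not stated explicitly, the denominator used here is an assumption:

```python
import numpy as np

def degradation_metrics(y, pred_old, pred_new):
    """Compute the three comparison metrics from label vectors.
    BR:  share of samples broken by the update (correct -> incorrect),
         relative to the samples the old model classified correctly.
    BEC: share of the new model's errors that the old model also made.
    acc_improve: relative accuracy change from old to new model.
    """
    old_ok, new_ok = (pred_old == y), (pred_new == y)
    br = np.sum(old_ok & ~new_ok) / max(np.sum(old_ok), 1)
    bec = np.sum(~old_ok & ~new_ok) / max(np.sum(~new_ok), 1)
    acc_old, acc_new = old_ok.mean(), new_ok.mean()
    acc_improve = (acc_new - acc_old) / acc_old
    return br, bec, acc_improve
```

A model that only breaks previously correct samples drives BR up and BEC down; a perfectly backward-compatible update has BR = 0 and BEC = 1.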
The results suggest that NeuRecover achieves relatively high BEC and low BR, while its accuracy improvement rate is lower than the others'. The difference in accuracy improvement rate among methods is smaller under some dataset-arrangement conditions than others. On the other hand, the Backward Compatibility method (BC(ne)) keeps BEC stably high even when its BR is higher and its accuracy improvement is small.
4. Lessons Learned
We observed the following lessons from the results of our experiment.
- Methods depend on the purpose and the dataset arrangement:
A suitable method can be chosen based on the metrics stated in the previous section and on each experiment condition.
A possible scenario for the "exclusive" situation is that only a pre-trained model is available to the developer, and it is updated with a dataset the developer prepares at development or operation time. The "inclusive" situation, on the other hand, assumes that the developer has a dataset available from the beginning without limitation; the developer can therefore include the initial training data in datasets for future updates.
If an environment change at operation time is so significant that the model needs to be retrained, the "exclusive (operation-oriented)" case can be assumed, whereas the "exclusive (development-oriented)" case suits situations where the environment is relatively static.
- Accuracy does not always have the highest priority:
A practitioner should cope with the trade-off between accuracy and preventing degradation. For example, Bansal et al. point out the discrepancy between human expectations of an AI system and model updates (Bansal et al., 2019). Thus, higher accuracy together with stably high BEC can be what matters. On the other hand, even if its accuracy is lower than other methods', a method with higher BEC and lower BR can be the choice when improvements on a subset of important cases are needed.
We compared methods that aim to reduce degradation of classification results. The degradation aspect is important because the diversity of data samples in real use cases can cause misclassification and thus decrease classification performance on important data samples. Practitioners should be aware of the trade-off between accuracy and other, non-accuracy metrics, and choose a method depending on their purpose.
This work was partly supported by JST-Mirai Program Grant Number JPMJMI20B8, Japan.
References
- Bansal et al. (2019) Gagan Bansal, Besmira Nushi, Ece Kamar, Dan Weld, Walter Lasecki, and Eric Horvitz. 2019. Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff. In AAAI Conference on Artificial Intelligence. AAAI. https://www.microsoft.com/en-us/research/publication/updates-in-human-ai-teams-understanding-and-addressing-the-performance-compatibility-tradeoff/
- Krizhevsky (2009) Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical Report.
- Srivastava et al. (2020) Megha Srivastava, Besmira Nushi, Ece Kamar, Shital Shah, and Eric Horvitz. 2020. An Empirical Analysis of Backward Compatibility in Machine Learning Systems. In KDD. https://www.microsoft.com/en-us/research/publication/an-empirical-analysis-of-backward-compatibility-in-machine-learning-systems/
- Tokui et al. (2022) Shogo Tokui, Susumu Tokumoto, Akihito Yoshii, Fuyuki Ishikawa, Takao Nakagawa, Kazuki Munakata, and Shinji Kikuchi. 2022. NeuRecover: Regression-Controlled Repair of Deep Neural Networks with Training History. In Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER).
- Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. CoRR abs/1708.07747 (2017). arXiv:1708.07747 http://arxiv.org/abs/1708.07747