GroupBC
Machine learning models are updated as new data is acquired or new architectures are developed. These updates usually increase model performance, but may introduce backward compatibility errors, where individual users or groups of users see their performance on the updated model adversely affected. This problem can also be present when training datasets do not accurately reflect overall population demographics, with some groups having overall lower participation in the data collection process, posing a significant fairness concern. We analyze how ideas from distributional robustness and minimax fairness can aid backward compatibility in this scenario, and propose two methods to directly address this issue. Our theoretical analysis is backed by experimental results on CIFAR-10, CelebA, and Waterbirds, three standard image classification datasets. Code available at github.com/natalialmg/GroupBC
Machine learning (ML) leverages increasing amounts of data to improve model performance. Consequently, continual data collection and model improvement are an integral part of the ML life-cycle Chen and Liu (2016); De Lange et al. (2019); Parisi et al. (2019). Although model updates are often designed to improve average performance, some segments of the population may actually see their performance degraded by the model update. Backward Compatibility (BC) was discussed in Bansal et al. (2019); Srivastava et al. (2020)
to monitor this undesirable phenomenon where, in a classification setting, samples that were accurately classified by a previous model are now incorrectly classified under the updated model, reducing the perceived reliability of the system. The current notion of BC may have shortcomings in measuring how compatibility errors arise in different populations; this is especially relevant since some population segments may be misrepresented in the data collection process. Even when this is not the case, datasets and populations evolve over time, and potential compatibility mismatches that were previously tolerable may be exacerbated.
Here we provide a general analysis of BC where we formulate the BC definitions in Srivastava et al. (2020); Yan et al. (2021)
as a statistical property of the data distribution and the models under comparison. We first show how BC measures depend on the error rates of the models. We then analyze how the variance of the learning algorithm affects the compatibility of independently trained models, providing a consistent explanation of why ensembles have empirically proven to improve model compatibility at no accuracy cost
Yan et al. (2021). Moreover, we show that classifier confidence is directly related to per-sample BC. We analyze group distributional robustness of BC, where we consider model updates under data shifts. We assume data samples come from a mixture of distributions corresponding to different groups (e.g., domains, demographics, classes); as new samples are acquired, the overall group composition of the training set may drift over time. We wish to update our model over the newly augmented dataset in a way that guarantees a desired BC measure across all groups. We evaluate how existing ideas from group robust approaches Hashimoto et al. (2018); Sagawa et al. (2019); Martinez et al. (2020); Diana et al. (2021); Martinez et al. (2021) can help in this scenario. We additionally propose two methods, Sample-Referenced Model (SRM) and Group-Referenced Model (GRM), to specifically tackle backward-compatible model updating. We use three common image classification datasets, CelebA, Waterbirds, and CIFAR-10 Liu et al. (2018); Sagawa et al. (2019); Wah et al. (2011); Krizhevsky et al. (2021), for empirical validation.
We consider the supervised classification scenario where we have an input variable $X \in \mathcal{X}$ and a target variable $Y \in \mathcal{Y}$ related by the joint distribution $P_{X,Y}$. Given a dataset $\mathcal{D}$, a learning kernel $\mathcal{K}(\cdot \mid \mathcal{D})$ defines a conditional distribution over functions in a hypothesis class $\mathcal{H}$. A function $h \in \mathcal{H}$ outputs a score over the possible values of $Y$ given $X = x$. We distinguish between $h$ and the decision rule $\hat{y}_h$, where $\hat{y}_h(x)$ outputs a member of the target space based on the score provided by model $h$ (e.g., an arg-max over class scores or a thresholding of the score). We denote the accuracy and error of a model $h$ on an arbitrary distribution $P$ as $A_P(h)$ and $E_P(h) = 1 - A_P(h)$. Backward compatibility (BC) was introduced in ML to evaluate the prediction consistency between two models trained to perform the same classification task.
Let us consider the general case where we are given models $h_1 \sim \mathcal{K}(\cdot \mid \mathcal{D}_1)$ and $h_2 \sim \mathcal{K}(\cdot \mid \mathcal{D}_2)$, where $\mathcal{D}_1, \mathcal{D}_2$ are datasets drawn over distributions that share support. For a given test distribution $P$ on the same support, we can define the total compatibility (TC) and the (directed) negative flip rate (NFR) between models $h_1$ and $h_2$ on distribution $P$ as
$$\mathrm{TC}(h_1, h_2; P) = P\big(\hat{y}_{h_1} = Y,\ \hat{y}_{h_2} = Y\big) + P\big(\hat{y}_{h_1} \neq Y,\ \hat{y}_{h_2} \neq Y\big), \qquad \mathrm{NFR}(h_1 \to h_2; P) = P\big(\hat{y}_{h_1} = Y,\ \hat{y}_{h_2} \neq Y\big). \tag{1}$$
Note that TC can be decomposed as the sum of the probabilities of two disjoint events: positive compatibility (PC), the probability of both models producing the correct answer, and negative compatibility (NC), the probability of a double error. NFR measures the likelihood of errors occurring on samples that the reference model
had previously classified correctly. These quantities are the probabilistic version of metrics that were first defined empirically in Srivastava et al. (2020); Yan et al. (2021). It is straightforward to show that they can be bounded using the error rates and accuracies of the respective models. Given two models $h_1, h_2$ defined over the same input and output support and a joint distribution $P$, we have the following:
$$\mathrm{NFR}(h_1 \to h_2; P) \le \min\{E_P(h_2),\, A_P(h_1)\}, \qquad \mathrm{TC}(h_1, h_2; P) \ge 1 - E_P(h_1) - E_P(h_2). \tag{2}$$
Since $\mathrm{NFR}(h_1 \to h_2; P)$ is upper bounded by the error rate of model $h_2$ on $P$, it is partially addressed by reducing the empirical error on samples from $P$. The proof is provided in Appendix A.1. We explore how properties of the learning kernel $\mathcal{K}$ can affect these metrics in Appendix A.2, which also formally addresses the impact of ensemble learning on BC.
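As a concrete illustration, all four quantities can be estimated from the hard predictions of the two models on a held-out set. The sketch below (NumPy, with function and variable names of our choosing rather than from the released code) computes the empirical TC, PC, NC, and directed NFR:

```python
import numpy as np

def compatibility_metrics(y_true, y_pred_old, y_pred_new):
    """Empirical TC, PC, NC and directed NFR (old -> new) from hard predictions."""
    old_correct = y_pred_old == y_true
    new_correct = y_pred_new == y_true
    pc = np.mean(old_correct & new_correct)       # both models correct
    nc = np.mean(~old_correct & ~new_correct)     # both models wrong
    nfr = np.mean(old_correct & ~new_correct)     # old model right, updated model wrong
    return {"TC": pc + nc, "PC": pc, "NC": nc, "NFR": nfr}

# toy usage: a synthetic 10-class problem with ~90% and ~92% accurate models
rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)
pred_old = np.where(rng.random(1000) < 0.90, y, (y + 1) % 10)
pred_new = np.where(rng.random(1000) < 0.92, y, (y + 2) % 10)
print(compatibility_metrics(y, pred_old, pred_new))
```

On real data, `y_pred_old` and `y_pred_new` would be the arg-max decisions of the two models on the same evaluation samples.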
Consider a model $h_0$ trained on a dataset $\mathcal{D}_0$ containing samples from a variety of groups $g \in \mathcal{G}$, that is, $\mathcal{D}_0 = \bigcup_{g \in \mathcal{G}} \mathcal{D}_0^g$. This model achieves a given per-group performance, empirically measured using a held-out dataset. We further assume we expand our training dataset to $\mathcal{D}_1$ such that $\mathcal{D}_0 \subseteq \mathcal{D}_1$ by potentially changing the fraction of samples of each group (e.g., $|\mathcal{D}_1^g|/|\mathcal{D}_1| \neq |\mathcal{D}_0^g|/|\mathcal{D}_0|$ for some $g \in \mathcal{G}$), without modifying the group-conditional data distributions $P_{X,Y \mid g}$; this models non-uniform data acquisition.
We now analyze how BC is affected at the group level when a model is updated due to new data being acquired. We evaluate how per-group accuracy and NFR are affected when updated models are obtained by empirical risk minimization (ERM) or by worst group risk minimization (e.g., minimax group fairness Martinez et al. (2020); Diana et al. (2021) or group distributional robustness Sagawa et al. (2019)). Lastly, we propose a method to account for the worst NFR at group or at sample level, and show how this can improve the updated model in terms of group BC.
We identify two objectives to learn a model $h_1$: expected risk minimization (ERM) and group minimax fairness (GMMF), a relaxed version of Martinez et al. (2020); Diana et al. (2021) (we note that they share their objective with group distributional robustness Sagawa et al. (2019)), which is already robust to group shifts. We present their statistical formulations, and refer the reader to Appendix A.3 for implementation details,
$$h_{\mathrm{ERM}} \in \arg\min_{h \in \mathcal{H}} \ \sum_{g \in \mathcal{G}} p_g\, \mathbb{E}_{(x,y) \sim P_g}\big[\ell(h(x), y)\big], \qquad h_{\mathrm{GMMF}} \in \arg\min_{h \in \mathcal{H}} \ \max_{\mu \in \Delta_{|\mathcal{G}|}} \ \sum_{g \in \mathcal{G}} \mu_g\, \mathbb{E}_{(x,y) \sim P_g}\big[\ell(h(x), y)\big], \tag{3}$$
where $\ell(\cdot,\cdot)$ is a trainable loss function (e.g., cross-entropy or Brier score), $P_g$ denotes the group-conditional distribution, $\Delta_{|\mathcal{G}|}$ is the probability simplex over groups, and $p_g$ is a baseline group probability. In scenarios where maintaining or improving NFR is of interest, we note that the updated model $h_1$ can also be made to explicitly depend on the previous model $h_0$. We first note that the loss of a sample is a strong predictor of NFR. In particular, for binary classification, any model $h$ satisfying $\ell(h(x), y) \le \ell(h_0(x), y)$ is automatically NFR-compatible on that sample. With this in mind, we can define the sample-referenced model (SRM) as
$$h_{\mathrm{SRM}} \in \arg\min_{h \in \mathcal{H}} \ \max_{q :\, q(x,y) \ge \gamma P'(x,y)} \ \mathbb{E}_{(x,y) \sim q}\big[\ell(h(x), y) - \ell(h_0(x), y)\big]. \tag{4}$$
Here $P'$ denotes the distribution from which the updated training dataset is drawn, and $\gamma$ controls the minimum importance weight of each sample in the support. Note that, from the model's perspective, the SRM objective reduces to minimizing a positively-weighted per-sample loss. However, the distribution $q$ is updated based on the per-sample excess loss the current model has over the previous model. Similarly to Martinez et al. (2021), we use an empirical estimate of this objective and maximize it via projected gradient ascent (PGA) on the empirical distribution, with SGD for model optimization; Algorithm A.2 in Appendix A.3 describes the optimization procedure. While SRM directly tackles NFR, it may suffer from generalization issues in practice, where NFR compatibility on training samples may be a poor predictor of NFR compatibility on new samples from a similar data distribution. To address this, we also propose a group-referenced model (GRM), where all per-group losses of the updated model are expected to improve w.r.t. the baseline; a compromise objective between SRM and ERM,
$$h_{\mathrm{GRM}} \in \arg\min_{h \in \mathcal{H}} \ \max_{\mu \in \Delta_{|\mathcal{G}|}} \ \sum_{g \in \mathcal{G}} \mu_g \Big(\mathbb{E}_{(x,y) \sim P_g}\big[\ell(h(x), y)\big] - \mathbb{E}_{(x,y) \sim P_g}\big[\ell(h_0(x), y)\big]\Big). \tag{5}$$
Here we minimize the worst, over groups, average excess loss with respect to the previous model $h_0$; Algorithm 1 shows the procedure.
Algorithm 1 (GRM / GMMF update). Input: dataset $\mathcal{D}$, reference model $h_0$, model learning rate $\eta_h$, adversary learning rate $\eta_\mu$, batch size $B$, aggregation size $N$. Output: the updated model $h$.
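To make the procedure concrete, the following PyTorch sketch implements one pass of the GRM update as we read Eq. (5) and Algorithm 1; the function and argument names (`grm_epoch`, `eta_mu`, the `(x, y, g)` batch format) are ours and not taken from the released code. Replacing the reference losses by zero recovers the GMMF update.

```python
import torch
import torch.nn.functional as F

def grm_epoch(model, ref_model, loader, mu, opt, eta_mu=0.1):
    """One pass of the group-referenced minimax update (a sketch of Algorithm 1).

    mu:     tensor [num_groups], adversarial distribution over groups.
    loader: yields batches (x, y, g) with g the integer group id of each sample.
    """
    excess_sum = torch.zeros_like(mu)       # running per-group excess loss vs. h_0
    count = torch.zeros_like(mu)
    for x, y, g in loader:
        losses = F.cross_entropy(model(x), y, reduction="none")
        with torch.no_grad():
            ref_losses = F.cross_entropy(ref_model(x), y, reduction="none")
        excess = losses - ref_losses                    # per-sample loss gap w.r.t. h_0
        weights = (mu[g] / mu[g].sum()).detach()        # reweight the batch by group
        opt.zero_grad()
        (weights * excess).sum().backward()             # SGD step on the model
        opt.step()
        for k in range(mu.numel()):                     # accumulate adversary statistics
            mask = g == k
            excess_sum[k] += excess[mask].detach().sum()
            count[k] += mask.sum()
    # gradient ascent on mu, followed by a crude renormalization back to the simplex;
    # Algorithm 1 uses a proper projected-gradient-ascent step here.
    mu = (mu + eta_mu * excess_sum / count.clamp(min=1)).clamp(min=1e-6)
    return mu / mu.sum()
```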
We perform experiments on the CIFAR-10, CelebA Liu et al. (2018), and Waterbirds Sagawa et al. (2019) datasets. CIFAR-10 is a standard benchmark for multi-class classification with 10 classes, which we consider as both the target and the group variable. CelebA contains over 200 thousand facial images with various attribute annotations; similarly to Sagawa et al. (2019), we use CelebA as a binary classification dataset by predicting the “blond” label. We use the Cartesian product of “blond” and “binary gender” as the group label (note that we use the “gender” binary label as a group descriptor for testing purposes, without attempting to ascribe the broader diversity of gender identities to this limiting notion), giving 4 groups in total. The Waterbirds dataset is a subset of the CUB image dataset Wah et al. (2011), where the goal is to predict the type of bird (“waterbird” or “landbird”). As in Sagawa et al. (2019), we use the Cartesian product of the bird class and the image background (“land” or “water”), leading to a total of 4 groups. We use ResNet-18 and ResNet-34 architectures He et al. (2016) as our base classifiers in all cases, and train with SGD and a cosine annealing learning rate schedule Loshchilov and Hutter (2016). Details are provided in Appendix A.4.
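For reference, the base training setup described above maps onto standard PyTorch components as in the sketch below; the learning rate, weight decay, momentum, and epoch count shown here are placeholders of ours rather than the values reported in Table 3.

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)            # ResNet-34 is used for CelebA / Waterbirds
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)    # placeholder values
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80)

for epoch in range(80):
    # ... one pass over the training set with the chosen update objective ...
    scheduler.step()
```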
We evaluate the effects that different training and model updating techniques have on NFR in the demographic shift scenario outlined in Section 3. We train a baseline model $h_0$ using either ERM or GMMF on a dataset $\mathcal{D}_0$ comprising a fraction of the original training samples from either CIFAR-10, CelebA, or Waterbirds. To model extreme demographic shift and its effect on NFR, we train a successor model $h_1$ on an expanded dataset $\mathcal{D}_1$ that comprises all of the samples in $\mathcal{D}_0$, plus all remaining training samples from a given group that were left out of $\mathcal{D}_0$. Since $h_1$ has access to both an expanded dataset and a reference model $h_0$, we evaluate all options for model updating (ERM, GMMF, SRM, GRM).
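The group-expanded dataset can be built directly from the group labels; a minimal sketch is shown below, where the base fraction of 0.5 and the group ids are illustrative placeholders (the exact fraction used for $\mathcal{D}_0$ is not restated here).

```python
import numpy as np

def make_shifted_split(groups, base_fraction=0.5, boosted_group=0, seed=0):
    """Indices of a base set D0 (a random fraction of every group) and of the
    expanded set D1 = D0 plus all remaining samples of one chosen group."""
    rng = np.random.default_rng(seed)
    d0_parts, extra = [], []
    for g in np.unique(groups):
        idx = rng.permutation(np.flatnonzero(groups == g))
        cut = int(base_fraction * len(idx))
        d0_parts.append(idx[:cut])
        if g == boosted_group:
            extra.append(idx[cut:])         # held-back samples of the boosted group
    d0 = np.concatenate(d0_parts)
    d1 = np.concatenate([d0] + extra) if extra else d0
    return d0, d1

# e.g. d0_idx, d1_idx = make_shifted_split(group_labels, boosted_group=3)
```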
Experiments include all possible combinations of base training methods (ERM, GMMF), model update algorithms (ERM, GMMF, SRM, GRM), and extension groups (i.e., the experiments are repeated for every possible choice of group extension). Table 1 summarizes the average performance over these extensions for each dataset; detailed results are available in Appendix A.4. We note that GRM has the smallest negative impact on accuracy, and correspondingly better worst-case NFR performance, as expected by design. When transitioning from ERM to GMMF, we see large accuracy improvements on the worst group, but take a significant accuracy penalty on at least one of the other groups as a result. We note that SRM generally obtains results similar to ERM as an update method, showing that per-sample generalization to the test set is a hard-to-achieve objective, which suggests GRM may be a more suitable objective in practice. More work is warranted to tackle this important scenario.
Table 1: Change in per-group accuracy (ΔAcc), 1-NFR, and NFR of the updated model (min/max taken over groups), averaged over all group extensions, for each dataset, baseline training method for $h_0$, and update method.

|  |  | Baseline $h_0$: ERM |  |  |  |  |  | Baseline $h_0$: GMMF |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dataset | Update | ΔAcc min | ΔAcc max | 1-NFR min | 1-NFR max | NFR min | NFR max | ΔAcc min | ΔAcc max | 1-NFR min | 1-NFR max | NFR min | NFR max |
| CelebA | ERM | -4.7% | 7.9% | 0.932 | 0.996 | 0.003 | 0.315 | -48.3% | 5.5% | 0.516 | 1.000 | 0.006 | 0.100 |
|  | GRM | -1.5% | 24.2% | 0.953 | 0.987 | 0.004 | 0.285 | -1.7% | 1.3% | 0.958 | 0.987 | 0.037 | 0.074 |
|  | GMMF | -7.9% | 53.1% | 0.921 | 0.994 | 0.006 | 0.090 | -3.4% | 1.6% | 0.953 | 0.983 | 0.041 | 0.063 |
|  | SRM | -2.9% | 6.3% | 0.917 | 0.997 | 0.003 | 0.294 | -41.1% | 5.2% | 0.589 | 1.000 | 0.008 | 0.102 |
| Waterbirds | ERM | -3.7% | 7.7% | 0.940 | 0.996 | 0.004 | 0.327 | -15.1% | 5.5% | 0.833 | 0.996 | 0.009 | 0.204 |
|  | GRM | -3.6% | 10.3% | 0.953 | 0.991 | 0.006 | 0.305 | -3.8% | 1.9% | 0.944 | 0.986 | 0.039 | 0.189 |
|  | GMMF | -5.8% | 18.3% | 0.931 | 0.996 | 0.007 | 0.238 | -2.7% | 1.5% | 0.946 | 0.985 | 0.038 | 0.188 |
|  | SRM | -5.3% | 5.4% | 0.932 | 0.995 | 0.005 | 0.353 | -17.2% | 4.0% | 0.817 | 0.996 | 0.015 | 0.209 |
| CIFAR-10 | ERM | -7.7% | 5.0% | 0.865 | 0.981 | 0.040 | 0.189 | -11.1% | 5.9% | 0.842 | 0.981 | 0.030 | 0.164 |
|  | GRM | -5.0% | 6.9% | 0.902 | 0.986 | 0.035 | 0.178 | -5.8% | 5.2% | 0.888 | 0.986 | 0.030 | 0.159 |
|  | GMMF | -6.5% | 6.0% | 0.896 | 0.982 | 0.033 | 0.166 | -6.1% | 5.6% | 0.893 | 0.979 | 0.031 | 0.152 |
|  | SRM | -4.3% | 6.9% | 0.898 | 0.988 | 0.031 | 0.175 | -24.7% | 7.4% | 0.714 | 0.979 | 0.024 | 0.139 |
In this work we analyzed BC in the context of machine learning and how it can be controlled. We analyzed group BC in the context of iterative model updating when our training dataset is expanded without respecting prior demographic or group proportions. We empirically showed how per-group performance can be affected when updating ERM or GMMF models. Moreover, we proposed and analyzed two approaches to update the previous model with the goal of minimizing the NFR and the decrease in accuracy of each group on held-out data. Our experimental results on common image classification datasets serve as empirical validation for the ideas advanced in this work. Future work should focus on analyzing the generalization properties of these approaches.
G. Bansal, B. Nushi, E. Kamar, D. S. Weld, W. S. Lasecki, and E. Horvitz (2019). Updates in human-AI teams: understanding and addressing the performance/compatibility tradeoff. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 2429–2437.
K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
I. Loshchilov and F. Hutter (2016). SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019). Continual lifelong learning with neural networks: a review. Neural Networks 113, pp. 54–71.
Given the definitions of total compatibility (TC) and negative flip rate (NFR) from Equation 1, TC can be further decomposed as the sum of positive compatibility (PC) and negative compatibility (NC) as follows:
$$\mathrm{TC}(h_1, h_2; P) = \underbrace{P\big(\hat{y}_{h_1} = Y,\ \hat{y}_{h_2} = Y\big)}_{\mathrm{PC}(h_1, h_2; P)} + \underbrace{P\big(\hat{y}_{h_1} \neq Y,\ \hat{y}_{h_2} \neq Y\big)}_{\mathrm{NC}(h_1, h_2; P)}. \tag{6}$$
We next proceed to prove Proposition 2.1, which is extended to include lower bounds on PC and NC that arise naturally from the proof.
Given two models $h_1, h_2$ defined over the same input and output support and a joint distribution $P$, we have the following:
$$\begin{aligned}
\mathrm{NFR}(h_1 \to h_2; P) &\le \min\{E_P(h_2),\, A_P(h_1)\},\\
\mathrm{PC}(h_1, h_2; P) &\ge \max\{A_P(h_1) - E_P(h_2),\, 0\},\\
\mathrm{NC}(h_1, h_2; P) &\ge \max\{E_P(h_1) - A_P(h_2),\, 0\},\\
\mathrm{TC}(h_1, h_2; P) &\ge 1 - E_P(h_1) - E_P(h_2).
\end{aligned} \tag{7}$$
From the definition of NFR, and using the fact that the likelihood of two events occurring simultaneously is smaller than the likelihood of either of the events happening on its own ($P(A \cap B) \le \min\{P(A), P(B)\}$), it follows that
$$\mathrm{NFR}(h_1 \to h_2; P) = P\big(\hat{y}_{h_1} = Y,\ \hat{y}_{h_2} \neq Y\big) \le \min\big\{P(\hat{y}_{h_2} \neq Y),\, P(\hat{y}_{h_1} = Y)\big\} = \min\{E_P(h_2),\, A_P(h_1)\}. \tag{8}$$
From the definition of PC and the upper bound of NFR it follows that
$$\mathrm{PC}(h_1, h_2; P) = P(\hat{y}_{h_1} = Y) - \mathrm{NFR}(h_1 \to h_2; P) \ge A_P(h_1) - \min\{E_P(h_2),\, A_P(h_1)\} = \max\{A_P(h_1) - E_P(h_2),\, 0\}. \tag{9}$$
A symmetrical analysis can be done to obtain the lower bound of NC. Combining the bounds for NC and PC we obtain the lower bound for TC.
∎
It is common in deep learning to train models to reach near-zero training loss, meaning that models are perfectly compatible on the training dataset. However, due to the overparametrized nature of deep neural networks, the models are unlikely to make identical decisions on a different set of data points. The models may achieve the same average performance on unseen data, but will distribute the errors differently.
We analyze the compatibility between models $h_1, h_2 \sim \mathcal{K}(\cdot \mid \mathcal{D})$ that are independently obtained with the same learning algorithm applied to the same dataset $\mathcal{D}$. We further consider the binary classification problem where $y \in \{0, 1\}$ and $h(x) \in [0, 1]$, and where the decision is obtained via thresholding, $\hat{y}_h(x) = \mathbb{1}[h(x) > \tau]$, with a predefined threshold $\tau$. We can express the model output as the sum of the expected model output given the dataset and learning algorithm plus an input-dependent error $\epsilon(x)$,
$$h(x) = \bar{h}(x) + \epsilon(x), \qquad \bar{h}(x) = \mathbb{E}_{h \sim \mathcal{K}(\cdot \mid \mathcal{D})}\big[h(x)\big], \tag{10}$$
with a corresponding decision probability $p(x) = P\big(\bar{h}(x) + \epsilon(x) > \tau\big)$.
Given an input $x$, the errors $\epsilon_1(x)$ and $\epsilon_2(x)$ of the two models are i.i.d.; in this setting, TC can be expressed as the probability that the two decisions agree (in binary classification, two models that agree are either both correct or both incorrect),
$$\mathrm{TC}(h_1, h_2; P) = \mathbb{E}_{X}\big[p(X)^2 + (1 - p(X))^2\big]. \tag{11}$$
For every input $x$, TC reaches its maximum value if $p(x)$ takes value $0$ or $1$. We can identify two ways of accomplishing this. One is choosing a lopsided threshold (e.g., $\tau$ close to $0$ or to $1$); since $h(x) \in [0,1]$, this reduces the chances of two models making different decisions, but makes the classifiers non-informative, since the decision is only weakly affected by our learned classifier. The other option is to reduce the variance of the noise variable $\epsilon(x)$; we can use Hoeffding's inequality for this by observing that
$$P\big(\hat{y}_h(x) \neq \hat{y}_{\bar{h}}(x)\big) \le P\big(|\epsilon(x)| \ge |\bar{h}(x) - \tau|\big) \le 2\exp\big(-2\,|\bar{h}(x) - \tau|^2\big), \tag{12}$$
which provides an upper bound on the probability of the error being capable of flipping the decision of the current model w.r.t. the idealized expected model $\bar{h}$. One effective way of addressing noise variance is by augmenting the learning kernel with ensembles, since the variance of the ensemble model decays by a factor of $N$ w.r.t. the model without ensembles, with $N$ the number of elements in the ensemble.
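The variance argument can be visualized with a toy simulation; the Gaussian choices below for the score margin $\bar{h}(x) - \tau$ and for the model-to-model error are our own stand-ins and not part of the analysis above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_models = 10_000, 20
margin = rng.normal(0.0, 0.3, size=n_points)              # stand-in for h_bar(x) - tau
noise = rng.normal(0.0, 0.2, size=(n_models, n_points))   # i.i.d. error of each trained model

def flip_rate(eps):
    """Fraction of inputs whose decision differs from that of the expected model."""
    return np.mean(np.sign(margin + eps) != np.sign(margin))

print("single model      :", flip_rate(noise[0]))            # one independently trained model
print("20-model ensemble :", flip_rate(noise.mean(axis=0)))  # averaged error has variance / N
```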
Ensembles can have a major impact on model compatibility; this has been empirically reported in other works such as Yan et al. [2021], and this simple analysis provides an explanation for their effectiveness. We empirically showcase the effect ensembles have on several compatibility metrics on CelebA Liu et al. [2018] and CIFAR-10 Krizhevsky et al. [2021], where we train several independent ResNet networks He et al. [2016] (ResNet-18 for CIFAR-10 and ResNet-34 for CelebA) on the same training dataset using SGD (effectively taking independent samples from the SGD learning kernel). We use a simple ensemble technique where the average output of the independent networks constitutes our ensemble model. We show how total compatibility and positive compatibility are effectively managed via ensembling in Table 2.
Table 2: Average accuracy, total compatibility (TC), and positive compatibility (PC) between independently trained models as a function of ensemble size. TC-LB and PC-LB report the gap to the corresponding accuracy-based lower bounds of Proposition 2.1.

| CIFAR-10 |  |  |  |  |  | CelebA |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| model | Acc | TC | TC-LB | PC | PC-LB | model | Acc | TC | TC-LB | PC | PC-LB |
| base | 0.936 | 0.952 | 0.079 | 0.912 | 0.04 | base | 0.958 | 0.982 | 0.066 | 0.949 | 0.033 |
| 2ens | 0.943 | 0.967 | 0.08 | 0.927 | 0.04 | 2ens | 0.96 | 0.987 | 0.067 | 0.953 | 0.034 |
| 5ens | 0.947 | 0.979 | 0.084 | 0.937 | 0.042 | 5ens | 0.961 | 0.992 | 0.07 | 0.957 | 0.035 |
| 10ens | 0.95 | 0.985 | 0.086 | 0.942 | 0.043 | 10ens | 0.96 | 0.994 | 0.073 | 0.957 | 0.037 |
| 15ens | 0.95 | 0.989 | 0.088 | 0.945 | 0.044 | 15ens | 0.961 | 0.995 | 0.074 | 0.958 | 0.037 |
| 20ens | 0.951 | 0.991 | 0.09 | 0.946 | 0.045 | 20ens | 0.961 | 0.996 | 0.074 | 0.959 | 0.037 |
We observe that the average accuracy increases for larger ensemble sizes; this effect is well known, and is one of the main reasons why ensembles are a popular choice for increasing model performance when feasible. This accuracy gain has diminishing returns with ensemble size. Most importantly for our analysis, we note that both total compatibility and positive compatibility also increase significantly as a function of ensemble size, as predicted by the analysis of Section A.2. Interestingly, this increase outpaces the natural growth of TC and PC predicted by the simple increase in accuracy, as shown by the TC-LB and PC-LB columns in the same table, further suggesting that the excess increase is due to the reduction in output variance that ensembles provide as a learning algorithm.
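The ensembling used above is plain output averaging; a minimal PyTorch wrapper (we average softmax outputs here, averaging raw scores is an equally simple alternative) is:

```python
import torch

class AveragingEnsemble(torch.nn.Module):
    """Ensemble whose prediction is the average of its members' softmax outputs."""
    def __init__(self, members):
        super().__init__()
        self.members = torch.nn.ModuleList(members)

    def forward(self, x):
        outputs = [m(x).softmax(dim=-1) for m in self.members]
        return torch.stack(outputs, dim=0).mean(dim=0)
```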
Decision confidence (how far the model score is from the decision threshold) indicates how close a datapoint is to the decision boundary of the model. A large confidence value also indicates robustness to perturbations caused by differences between learned models, since larger error values would be required to observe different decisions between two models. We verify this observation empirically in the same setting as before. Figure 1 shows how TC and PC vary as a function of confidence level, and the improvements introduced by the use of ensembles.
[Figure 1: TC and PC as a function of confidence level, with and without ensembles, on CIFAR-10 and CelebA.]
From Figure 1 we observe that confidence is indeed a strong predictor of both total and positive compatibility; with lower confidence values associated with a smaller likelihood of samples being compatible across other models. We note that ensembles also share this effect, but the compatibility across models is uniformly improved for all confidence levels.
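The curves in Figure 1 amount to binning per-sample compatibility by the confidence of the reference model; a sketch assuming softmax probability arrays (the function and variable names are ours) follows.

```python
import numpy as np

def compatibility_by_confidence(probs_old, probs_new, y_true, n_bins=10):
    """Per-bin total compatibility, binned by the confidence of the reference model."""
    pred_old, pred_new = probs_old.argmax(1), probs_new.argmax(1)
    conf = probs_old.max(1)                                      # decision confidence
    compatible = (pred_old == y_true) == (pred_new == y_true)    # both right or both wrong
    edges = np.quantile(conf, np.linspace(0, 1, n_bins + 1))     # ~equally populated bins
    which = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    return np.array([compatible[which == b].mean() for b in range(n_bins)])
```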
Here we provide additional details for all algorithm implementations. Model updating in all cases is done via SGD, so here we provide the empirical objective for all update algorithms and discuss how the adversarial distribution is modelled and updated.
To start with, ERM has the following statistical and empirical formulation:
$$h_{\mathrm{ERM}} \in \arg\min_{h \in \mathcal{H}} \ \mathbb{E}_{(x,y) \sim P}\big[\ell(h(x), y)\big], \qquad \hat{h}_{\mathrm{ERM}} \in \arg\min_{h \in \mathcal{H}} \ \frac{1}{n} \sum_{(x, y) \in \mathcal{D}} \ell(h(x), y). \tag{13}$$
GMMF and GRM can both be likewise expressed in both statistical and empirical formulation as:
$$h \in \arg\min_{h \in \mathcal{H}} \ \max_{\mu \in \Delta_{|\mathcal{G}|}} \ \sum_{g \in \mathcal{G}} \mu_g \Big(\mathbb{E}_{(x,y) \sim P_g}\big[\ell(h(x), y)\big] - r_g\Big), \tag{14}$$
$$\hat{h} \in \arg\min_{h \in \mathcal{H}} \ \max_{\mu \in \Delta_{|\mathcal{G}|}} \ \sum_{g \in \mathcal{G}} \mu_g \Big(\frac{1}{n_g} \sum_{(x, y) \in \mathcal{D}^g} \ell(h(x), y) - r_g\Big). \tag{15}$$
Here we use the vector $\mu$ to denote the empirical estimate of distributions over groups satisfying $\mu \in \Delta_{|\mathcal{G}|}$, and $n_g$ in the empirical formulation represents the number of samples of group $g$. For GRM, the per-group reference values are $r_g = \mathbb{E}_{(x,y) \sim P_g}[\ell(h_0(x), y)]$, estimated on $\mathcal{D}^g$. We note that the distribution $\mu$ represents a potential shift in group composition w.r.t. the training distribution, whose group proportions are denoted by $p_g$ and which is the underlying distribution from which the current training dataset is sampled. The training algorithm for GRM is presented in Algorithm 1, and the algorithm for GMMF is identical if we set all per-group reference values $r_g = 0$. Finally, SRM is also presented in both formulations:
$$h_{\mathrm{SRM}} \in \arg\min_{h \in \mathcal{H}} \ \max_{q :\, q \ge \gamma P'} \ \mathbb{E}_{(x,y) \sim q}\big[\ell(h(x), y) - \ell(h_0(x), y)\big], \qquad \hat{h}_{\mathrm{SRM}} \in \arg\min_{h \in \mathcal{H}} \ \max_{w :\, w_i \ge \gamma,\ \frac{1}{n}\sum_i w_i = 1} \ \frac{1}{n} \sum_{i=1}^{n} w_i \big(\ell(h(x_i), y_i) - \ell(h_0(x_i), y_i)\big). \tag{16}$$
Here the vector $w$ already represents the importance weight ratio $q/P'$ on the training samples. The algorithm, shown in Algorithm 2, uses PGA on the importance weights $w$.
Algorithm 2 (SRM update). Input: dataset $\mathcal{D}$, reference model $h_0$, parametric model $h$, model learning rate $\eta_h$, adversary learning rate $\eta_w$, batch size $B$, aggregation size $N$. Output: the updated model $h$.
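Concretely, the adversary step of Algorithm 2 can be sketched as gradient ascent on the per-sample weights followed by a projection back onto the feasible set $\{w : w_i \ge \gamma,\ \tfrac{1}{n}\sum_i w_i = 1\}$; the bisection-based projection and the names below are our own choices (assuming $\gamma < 1$), not the released implementation.

```python
import numpy as np

def project_weights(v, gamma, iters=50):
    """Euclidean projection of v onto {w : w >= gamma, mean(w) = 1} (requires gamma < 1)."""
    lo, hi = v.min() - 1.0, v.max() + 1.0          # bracket for the dual shift theta
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if np.maximum(v - theta, gamma).mean() > 1.0:
            lo = theta                             # shift too small: weights still too large
        else:
            hi = theta
    return np.maximum(v - 0.5 * (lo + hi), gamma)

def srm_weight_step(weights, excess_loss, eta=0.1, gamma=0.1):
    """One PGA step on the importance weights; excess_loss[i] = loss(h, i) - loss(h_0, i)."""
    return project_weights(weights + eta * excess_loss, gamma)

# toy usage: samples where the updated model is worse than h_0 get up-weighted
w = srm_weight_step(np.ones(5), np.array([0.3, -0.1, 0.0, 0.5, -0.2]))
print(w, w.mean())
```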
Table 3 outlines all key architectural choices and hyperparameters used for each dataset.
| Dataset | Network | Optimization | Learning rate | Weight decay | Batch size | N |  |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Waterbirds | ResNet-34 | SGD |  |  | 64 | 80 | 0.1 |
| CelebA | ResNet-34 | SGD |  |  | 64 | 160 | 0.1 |
| CIFAR-10 | ResNet-18 | SGD |  |  | 128 | 80 | 0.1 |
Table 3: Summary of architecture and hyperparameters for each dataset. We also use momentum and cosine annealing of the learning rate.
Other notable changes include reducing the kernel size of the initial convolution layer for CIFAR-10 (as is standard for this dataset, given its smaller image size). Images for Waterbirds and CelebA are scaled, normalized, and cropped for faster training. We use the default train and test partitions for all datasets.
Here we provide the full experimental results. The initial model (baseline $h_0$) is trained on a fraction of the original train partition; we measured its worst- and best-group accuracies on CelebA, Waterbirds, and CIFAR-10 for both the ERM and GMMF baselines. We next report results for each updated dataset $\mathcal{D}_1$, where we augmented the original training dataset exclusively with the remaining training samples of one of the groups. Tables 4, 5, and 6 show these results for the CelebA, Waterbirds, and CIFAR-10 datasets, respectively.
Table 4: Full results on CelebA for each augmented group and update method; columns as in Table 1. Each group label spans the first two rows of its block (e.g., “non blond” / “non male” denotes the non-blond, non-male group).

|  |  | Baseline $h_0$: ERM |  |  |  |  |  | Baseline $h_0$: GMMF |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Augmented Group | Update Method | ΔAcc min | ΔAcc max | 1-NFR min | 1-NFR max | NFR min | NFR max | ΔAcc min | ΔAcc max | 1-NFR min | 1-NFR max | NFR min | NFR max |
non blond | ERM | -8.6% | 1.8% | 0.900 | 0.998 | 0.003 | 0.288 | -54.8% | 6.1% | 0.452 | 1.000 | 0.005 | 0.102 |
non male | GRM | -1.9% | 36.2% | 0.977 | 0.981 | 0.005 | 0.243 | -2.8% | 1.6% | 0.949 | 0.991 | 0.036 | 0.079 |
GMMF | -8.0% | 55.9% | 0.920 | 0.994 | 0.006 | 0.062 | -3.4% | 2.3% | 0.950 | 0.982 | 0.043 | 0.057 | |
SRM | -2.9% | 4.0% | 0.927 | 0.997 | 0.004 | 0.299 | -45.8% | 5.3% | 0.542 | 1.000 | 0.008 | 0.102 | |
non blond | ERM | -6.8% | 0.4% | 0.904 | 1.000 | 0.001 | 0.277 | -61.0% | 5.9% | 0.390 | 1.000 | 0.002 | 0.102 |
male | GRM | -1.9% | 27.7% | 0.959 | 0.979 | 0.004 | 0.316 | -1.7% | 1.6% | 0.955 | 0.991 | 0.036 | 0.073 |
GMMF | -8.0% | 52.5% | 0.920 | 0.989 | 0.006 | 0.090 | -3.4% | 1.1% | 0.958 | 0.984 | 0.034 | 0.068 | |
SRM | -4.0% | 2.4% | 0.904 | 0.999 | 0.001 | 0.277 | -55.9% | 5.9% | 0.441 | 1.000 | 0.002 | 0.102 |
blond | ERM | -2.6% | 9.6% | 0.966 | 0.996 | 0.003 | 0.339 | -43.5% | 5.4% | 0.565 | 0.999 | 0.007 | 0.102 |
non male | GRM | -1.7% | 31.6% | 0.961 | 0.988 | 0.004 | 0.294 | -1.7% | 1.4% | 0.960 | 0.988 | 0.034 | 0.079 |
GMMF | -8.1% | 52.5% | 0.918 | 0.994 | 0.006 | 0.096 | -3.4% | 1.7% | 0.959 | 0.982 | 0.041 | 0.069 | |
SRM | -4.5% | 1.8% | 0.876 | 0.999 | 0.002 | 0.249 | -39.0% | 5.5% | 0.610 | 1.000 | 0.006 | 0.102 | |
blond | ERM | -0.8% | 19.8% | 0.958 | 0.991 | 0.004 | 0.356 | -33.9% | 4.8% | 0.655 | 0.999 | 0.012 | 0.096 |
male | GRM | -0.4% | 1.1% | 0.915 | 0.998 | 0.003 | 0.288 | -0.5% | 0.6% | 0.966 | 0.981 | 0.043 | 0.063 |
GMMF | -7.4% | 51.4% | 0.926 | 1.000 | 0.006 | 0.113 | -3.2% | 1.1% | 0.944 | 0.985 | 0.047 | 0.056 | |
SRM | -0.3% | 16.9% | 0.961 | 0.995 | 0.004 | 0.350 | -23.7% | 4.4% | 0.763 | 0.999 | 0.017 | 0.102 | |
Avg | ERM | -4.7% | 7.9% | 0.932 | 0.996 | 0.003 | 0.315 | -48.3% | 5.5% | 0.516 | 1.000 | 0.006 | 0.100 |
GRM | -1.5% | 24.2% | 0.953 | 0.987 | 0.004 | 0.285 | -1.7% | 1.3% | 0.958 | 0.987 | 0.037 | 0.074 | |
GMMF | -7.9% | 53.1% | 0.921 | 0.994 | 0.006 | 0.090 | -3.4% | 1.6% | 0.953 | 0.983 | 0.041 | 0.063 | |
SRM | -2.9% | 6.3% | 0.917 | 0.997 | 0.003 | 0.294 | -41.1% | 5.2% | 0.589 | 1.000 | 0.008 | 0.102 |
Table 5: Full results on Waterbirds for each augmented group and update method; columns as in Table 1. Each group label spans the first two rows of its block (bird type on the first row, background on the second).

|  |  | Baseline $h_0$: ERM |  |  |  |  |  | Baseline $h_0$: GMMF |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Augmented Group | Update Method | ΔAcc min | ΔAcc max | 1-NFR min | 1-NFR max | NFR min | NFR max | ΔAcc min | ΔAcc max | 1-NFR min | 1-NFR max | NFR min | NFR max |
landbird | ERM | -5.0% | 0.3% | 0.914 | 0.999 | 0.004 | 0.389 | -25.7% | 5.5% | 0.740 | 0.999 | 0.003 | 0.217 |
landback | GRM | -4.9% | 3.0% | 0.937 | 0.992 | 0.006 | 0.366 | -8.7% | 2.4% | 0.900 | 0.991 | 0.026 | 0.207 |
GMMF | -6.4% | 17.0% | 0.923 | 0.995 | 0.007 | 0.251 | -3.6% | 0.1% | 0.939 | 0.983 | 0.041 | 0.195 | |
SRM | -8.4% | 0.2% | 0.897 | 0.998 | 0.003 | 0.407 | -27.4% | 5.2% | 0.726 | 0.999 | 0.006 | 0.220 | |
landbird | ERM | -5.0% | 6.7% | 0.939 | 0.996 | 0.004 | 0.352 | -16.5% | 9.0% | 0.822 | 0.999 | 0.006 | 0.207 |
waterback | GRM | -4.1% | 17.8% | 0.958 | 0.995 | 0.006 | 0.243 | -1.3% | 1.6% | 0.973 | 0.979 | 0.051 | 0.179 |
GMMF | -5.6% | 21.0% | 0.943 | 0.997 | 0.006 | 0.212 | -0.7% | 2.1% | 0.969 | 0.980 | 0.047 | 0.178 | |
SRM | -2.3% | 7.3% | 0.963 | 0.993 | 0.004 | 0.332 | -13.1% | 3.1% | 0.854 | 0.992 | 0.020 | 0.204 | |
waterbird | ERM | -1.3% | 15.6% | 0.975 | 1.000 | 0.006 | 0.269 | -5.1% | 3.9% | 0.925 | 0.996 | 0.016 | 0.196 |
landback | GRM | -2.0% | 14.3% | 0.975 | 0.992 | 0.006 | 0.274 | -1.9% | 2.2% | 0.961 | 0.989 | 0.040 | 0.174 |
GMMF | -3.9% | 19.9% | 0.949 | 0.998 | 0.007 | 0.224 | -1.1% | 1.6% | 0.953 | 0.989 | 0.035 | 0.184 | |
SRM | -1.1% | 9.2% | 0.969 | 0.991 | 0.006 | 0.324 | -10.1% | 3.7% | 0.882 | 0.998 | 0.020 | 0.202 | |
waterbird | ERM | -3.6% | 8.3% | 0.930 | 0.988 | 0.003 | 0.298 | -13.2% | 3.7% | 0.844 | 0.988 | 0.010 | 0.196 |
waterback | GRM | -3.3% | 6.1% | 0.941 | 0.984 | 0.005 | 0.336 | -3.4% | 1.4% | 0.941 | 0.984 | 0.038 | 0.195 |
GMMF | -7.4% | 15.4% | 0.907 | 0.992 | 0.007 | 0.263 | -5.3% | 2.2% | 0.922 | 0.988 | 0.031 | 0.195 | |
SRM | -9.4% | 4.8% | 0.900 | 0.998 | 0.006 | 0.350 | -18.2% | 4.0% | 0.808 | 0.996 | 0.015 | 0.210 | |
Avg | ERM | -3.7% | 7.7% | 0.940 | 0.996 | 0.004 | 0.327 | -15.1% | 5.5% | 0.833 | 0.996 | 0.009 | 0.204 |
GRM | -3.6% | 10.3% | 0.953 | 0.991 | 0.006 | 0.305 | -3.8% | 1.9% | 0.944 | 0.986 | 0.039 | 0.189 | |
GMMF | -5.8% | 18.3% | 0.931 | 0.996 | 0.007 | 0.238 | -2.7% | 1.5% | 0.946 | 0.985 | 0.038 | 0.188 | |
SRM | -5.3% | 5.4% | 0.932 | 0.995 | 0.005 | 0.353 | -17.2% | 4.0% | 0.817 | 0.996 | 0.015 | 0.209
Table 6: Full results on CIFAR-10 for each augmented group and update method; columns as in Table 1. Each augmented group is a pair of classes whose name spans two rows (e.g., “airplane &” / “automobile”).

|  |  | Baseline $h_0$: ERM |  |  |  |  |  | Baseline $h_0$: GMMF |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Augmented Group | Update Method | ΔAcc min | ΔAcc max | 1-NFR min | 1-NFR max | NFR min | NFR max | ΔAcc min | ΔAcc max | 1-NFR min | 1-NFR max | NFR min | NFR max |
airplane & | ERM | -3.9% | 4.2% | 0.892 | 0.997 | 0.032 | 0.195 | -5.7% | 5.4% | 0.889 | 0.991 | 0.019 | 0.168
automobile | GRM | -2.5% | 4.7% | 0.918 | 0.997 | 0.026 | 0.177 | -2.3% | 5.8% | 0.908 | 0.992 | 0.021 | 0.153 |
GMMF | -2.2% | 5.6% | 0.922 | 0.997 | 0.022 | 0.165 | -2.5% | 6.0% | 0.914 | 0.993 | 0.020 | 0.161 | |
SRM | -1.8% | 5.9% | 0.923 | 0.996 | 0.025 | 0.167 | -1.7% | 6.5% | 0.915 | 0.996 | 0.016 | 0.154 | |
bird & | ERM | -9.5% | 10.1% | 0.876 | 0.977 | 0.044 | 0.150 | -8.7% | 10.6% | 0.885 | 0.973 | 0.029 | 0.136 |
cat | GRM | -4.5% | 15.0% | 0.914 | 0.983 | 0.035 | 0.129 | -4.1% | 5.8% | 0.916 | 0.985 | 0.030 | 0.133 |
GMMF | -6.8% | 13.5% | 0.899 | 0.975 | 0.031 | 0.127 | -3.6% | 11.1% | 0.918 | 0.977 | 0.036 | 0.118 | |
SRM | -8.0% | 12.5% | 0.889 | 0.984 | 0.040 | 0.148 | -8.4% | 14.0% | 0.883 | 0.984 | 0.037 | 0.131 | |
deer & | ERM | -8.8% | 5.5% | 0.850 | 0.971 | 0.047 | 0.202 | -17.4% | 5.3% | 0.789 | 0.972 | 0.036 | 0.185 |
dog | GRM | -7.0% | 7.7% | 0.873 | 0.979 | 0.045 | 0.207 | -9.7% | 3.2% | 0.862 | 0.971 | 0.036 | 0.176 |
GMMF | -13.3% | 4.3% | 0.853 | 0.968 | 0.040 | 0.179 | -12.3% | 5.1% | 0.857 | 0.972 | 0.032 | 0.156 | |
SRM | -4.4% | 6.8% | 0.885 | 0.981 | 0.033 | 0.193 | -10.5% | 8.0% | 0.847 | 0.991 | 0.032 | 0.174 | |
frog & | ERM | -10.6% | 2.8% | 0.834 | 0.976 | 0.047 | 0.204 | -13.5% | 4.9% | 0.813 | 0.979 | 0.037 | 0.170 |
horse | GRM | -6.7% | 4.5% | 0.894 | 0.985 | 0.039 | 0.192 | -9.1% | 8.3% | 0.865 | 0.996 | 0.031 | 0.178 |
GMMF | -7.5% | 4.4% | 0.884 | 0.984 | 0.040 | 0.189 | -8.6% | 1.7% | 0.876 | 0.964 | 0.039 | 0.172 | |
SRM | -3.6% | 5.9% | 0.889 | 0.991 | 0.031 | 0.183 | -9.2% | 7.2% | 0.859 | 0.988 | 0.034 | 0.173 | |
ship & | ERM | -5.9% | 2.5% | 0.875 | 0.984 | 0.031 | 0.195 | -10.1% | 3.5% | 0.836 | 0.990 | 0.029 | 0.159 |
truck | GRM | -4.2% | 2.6% | 0.911 | 0.987 | 0.029 | 0.184 | -3.9% | 3.0% | 0.890 | 0.986 | 0.034 | 0.157 |
GMMF | -2.8% | 2.2% | 0.922 | 0.984 | 0.034 | 0.172 | -3.4% | 4.0% | 0.899 | 0.991 | 0.030 | 0.155 | |
SRM | -3.8% | 3.4% | 0.905 | 0.990 | 0.028 | 0.186 | -93.6% | 1.1% | 0.064 | 0.936 | 0.000 | 0.064 | |
Avg | ERM | -7.7% | 5.0% | 0.865 | 0.981 | 0.040 | 0.189 | -11.1% | 5.9% | 0.842 | 0.981 | 0.030 | 0.164 |
GRM | -5.0% | 6.9% | 0.902 | 0.986 | 0.035 | 0.178 | -5.8% | 5.2% | 0.888 | 0.986 | 0.030 | 0.159 | |
GMMF | -6.5% | 6.0% | 0.896 | 0.982 | 0.033 | 0.166 | -6.1% | 5.6% | 0.893 | 0.979 | 0.031 | 0.152 | |
SRM | -4.3% | 6.9% | 0.898 | 0.988 | 0.031 | 0.175 | -24.7% | 7.4% | 0.714 | 0.979 | 0.024 | 0.139