
OccamNets: Mitigating Dataset Bias by Favoring Simpler Hypotheses

Dataset bias and spurious correlations can significantly impair generalization in deep neural networks. Many prior efforts have addressed this problem using either alternative loss functions or sampling strategies that focus on rare patterns. We propose a new direction: modifying the network architecture to impose inductive biases that make the network robust to dataset bias. Specifically, we propose OccamNets, which are biased to favor simpler solutions by design. OccamNets have two inductive biases. First, they are biased to use as little network depth as needed for an individual example. Second, they are biased toward using fewer image locations for prediction. While OccamNets are biased toward simpler hypotheses, they can learn more complex hypotheses if necessary. In experiments, OccamNets outperform or rival state-of-the-art methods run on architectures that do not incorporate these inductive biases. Furthermore, we demonstrate that when state-of-the-art debiasing methods are combined with OccamNets, results improve further.


1 Introduction

Frustra fit per plura quod potest fieri per pauciora

William of Occam, Summa Totius Logicae (1323 CE)

Spurious correlations and dataset bias greatly impair generalization in deep neural networks [2, 6, 24, 63]. This problem has been heavily studied. The most common approaches are re-sampling strategies [8, 15, 23, 58], altering optimization to mitigate bias [57], adversarial unlearning [1, 21, 54, 78], learning invariant representations [5, 11, 68], and ensembling with bias-amplified models [7, 12, 48]. Here, we propose a new approach: incorporating architectural inductive biases that combat dataset bias.

Figure 1: OccamNets focus on architectural inductive biases, a direction orthogonal to existing approaches for tackling dataset bias.

In a typical feedforward network, each layer can be considered as computing a function of the previous layer, with each additional layer making the hypothesis more complex. Given a system trained to predict multiple categories, some of which are highly biased, this means the network uses the same level of complexity across all of the examples, even when some examples should be classified with simpler hypotheses (e.g., less depth). Likewise, pooling in networks is typically uniform in nature, so every location is used for prediction, rather than only the minimum amount of information. In other words, typical networks violate Occam’s razor. Consider the BiasedMNIST dataset [63], where the task is to recognize a digit while remaining invariant to multiple spuriously correlated factors, which include colors, textures, and contextual biases. The most complex hypothesis would exploit every factor during classification, including the digit’s color, texture, or background context. A simple hypothesis would instead be to focus on the digit’s shape and to ignore these spuriously correlated factors that work very well during training but do not generalize. We argue that a network should be capable of adapting its hypothesis space for each example, rather than always resorting to the most complex hypothesis, which would help it to ignore extraneous variables that hinder generalization.

Here, we propose convolutional OccamNets, which have architectural inductive biases that favor using the minimal amount of network depth and the minimal number of image locations during inference for a given example. The first inductive bias is implemented using early exiting, which has been previously studied for speeding up inference. The network is trained such that later layers focus on examples earlier layers find hard, with a bias toward exiting early. The second inductive bias replaces global average pooling before a classification layer with a function that is regularized to favor pooling with fewer image locations. We hypothesize this would be especially useful for combating background and contextual biases [3, 64]. OccamNets are complementary to existing approaches and can be combined with them.

In this paper, we demonstrate that architectural inductive biases are effective at mitigating dataset bias. Our specific contributions are:
  • We introduce the OccamNet architecture, which has architectural inductive biases for favoring simpler solutions to help overcome dataset biases. OccamNets do not require the biases to be explicitly specified during training, unlike many state-of-the-art debiasing algorithms.

  • In experiments using biased vision datasets, we demonstrate that OccamNets greatly outperform architectures that do not use the proposed inductive biases. Moreover, we show that OccamNets outperform or rival existing debiasing methods that use conventional network architectures.

  • We combine OccamNets with four recent debiasing methods, which all show improved results compared to using them with conventional architectures.

2 Related Work

Dataset Bias and Bias Mitigation. When trained with empirical risk minimization (ERM), deep networks have been shown to exploit dataset biases in the training data resulting in poor test generalization [24, 46, 63, 71]. Existing works for mitigating this problem have focused on these approaches: 1) focusing on rare data patterns through re-sampling [8, 41], 2) loss re-weighting [15, 58], 3) adversarial debiasing [21, 35], 4) model ensembling [7, 12], 5) minority/counterfactual sample generation [8, 9, 36] and 6) invariant/robust risk minimization [5, 37, 56]. Most of these methods require bias variables, e.g., sub-groups within a category, to be annotated [21, 35, 41, 58, 63]. Some recent methods have also attempted to detect and mitigate biases without these variables by training separate bias-amplified models for de-biasing the main model  [13, 48, 59, 73]. This paper is the first to explore architectural inductive biases for combating dataset bias.

Early Exit Networks. OccamNet is a multi-exit architecture designed to encourage later layers to focus on samples that earlier layers find difficult. Multi-exit networks have been studied in past work to speed up average inference time by minimizing the amount of compute needed for individual examples [10, 32, 67, 75], but their impact on bias-resilience has not been studied. In [60], a unified framework for studying early exit mechanisms was proposed, which included commonly used training paradigms [27, 39, 65, 74] and biological plausibility [47, 50, 51]. During inference, multi-exit networks choose the earliest exit based on either a learned criterion [10] or a heuristic, e.g., exit if the confidence score is sufficiently high [20], exit if there is low entropy [67], or exit if there is agreement among multiple exits [80]. Recently, [20] proposed early exit networks for long-tailed datasets; however, they used a class-balanced loss and did not study robustness to hidden covariates, whereas OccamNets generalize to these hidden variables without oracle bias labels during training.

Exit Modules and Spatial Maps. OccamNets are biased toward using fewer spatial locations for prediction, which we enable by using spatial activation maps [25, 45, 55]. While most recent convolutional neural networks (CNNs) use global average pooling followed by a linear classification layer [26, 29, 33], alternative pooling methods have been proposed, including spatial attention [4, 22, 31, 76] and dynamic pooling [31, 34, 38]. However, these methods have not been explored for their ability to mitigate bias, with existing bias mitigation methods adopting conventional architectures that use global average pooling instead. For OccamNets, each exit produces a class activation map, which is biased toward using fewer visual locations.

3 OccamNets

Figure 2: OccamNets are multi-exit architectures capable of exiting early through the exit decision gates. The exits yield class activation maps that are trained to use a constrained set of visual regions.

3.1 OccamNet Architecture for Image Classification

OccamNets have two inductive biases: a) they prefer exiting as early as possible, and b) they prefer using fewer visual regions for predictions. We implement these ideas in a CNN. Recent CNN architectures, such as ResNets [26] and DenseNets [33], consist of multiple blocks of convolutional layers. As shown in Fig. 2, these inductive biases are enabled by attaching an exit module $E_j$ to each block $B_j$ of the CNN, as the blocks serve as natural endpoints for attaching them. Below, we describe how we implement these two inductive biases in OccamNets.

In an OccamNet, each exit module $E_j$ takes in feature maps produced by the backbone network and processes them with $F_j$, which consists of two convolutional layers, producing feature maps used by the following components:

Suppressed CAM Predictors ($C_j$). Each $C_j$ consists of a single convolutional layer, taking in the feature maps from $F_j$ to yield class activation maps, $c_j$. The maps provide location-wise predictions of classes. Following Occam’s principles, the usage of visual regions in these CAMs is suppressed through the CAM suppression loss $\mathcal{L}_{CS}$ described in Sec. 3.2.

Output Predictors ($O_j$). The output predictor applies global average pooling on the suppressed CAMs predicted by $C_j$ to obtain the output prediction vector, $\hat{y}_j \in \mathbb{R}^{n_Y}$, where $n_Y$ is the total number of classes. The entire network is trained with the output prediction loss $\mathcal{L}_O$, which is a weighted sum of cross entropy losses between the ground truth and the predictions from each of the exits. Specifically, the weighting scheme is formulated to encourage the deeper layers to focus on the samples that the shallower layers find difficult. The detailed training procedure is described in Sec. 3.3.

Exit Decision Gates ($G_j$). During inference, OccamNet needs to decide whether or not to terminate the execution at $E_j$ on a per-sample basis. For this, each $E_j$ contains an exit decision gate $G_j$ that yields an exit decision score $\hat{g}_j$, which is interpreted as the probability that the sample can exit from $E_j$. $G_j$ is realized via a ReLU layer followed by a sigmoid layer, taking in representations from $F_j$. The gates are trained via the exit decision gate loss $\mathcal{L}_G$, which is based on whether or not $O_j$ made correct predictions. The loss and the training procedure are elaborated further in Sec. 3.4.
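The per-sample exit decision at inference time can be sketched as below. This is an illustrative sketch rather than the authors' implementation: the function name, the input format (an iterable of per-exit class scores and gate scores, shallowest exit first), and the 0.5 threshold are assumptions made for the example. The only facts taken from the text are that the gate score is read as the probability that a sample can exit, and that the first, bias-amplified exit is skipped during inference.

```python
from typing import Iterable, List, Tuple


def early_exit_predict(
    exit_outputs: Iterable[Tuple[List[float], float]],
    threshold: float = 0.5,      # assumed cut-off on the gate score
    first_usable_exit: int = 1,  # exit 0 is the bias-amplified exit, unused at inference
) -> Tuple[int, int]:
    """Return (exit_index, predicted_class) for one sample.

    exit_outputs yields (class_scores, gate_score) per exit, shallowest first.
    If it is a generator, deeper blocks are never evaluated once a gate fires.
    """
    chosen = None
    for j, (scores, gate) in enumerate(exit_outputs):
        if j < first_usable_exit:
            continue  # skip the bias-amplified exit
        chosen = (j, scores)
        if gate >= threshold:
            break  # the gate says this sample can exit here
    if chosen is None:
        raise ValueError("no usable exit was provided")
    j, scores = chosen
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return j, predicted


# Hypothetical outputs for a 4-exit network and a 2-class problem.
sample = [([0.9, 0.1], 0.99), ([0.2, 0.8], 0.3), ([0.1, 0.9], 0.7), ([0.6, 0.4], 0.9)]
exit_index, predicted_class = early_exit_predict(sample)
```

If no gate clears the threshold, the sketch falls back to the deepest exit, which matches the idea that the full depth remains available when needed.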

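The total loss used to train OccamNets combines the output prediction loss $\mathcal{L}_O$, the CAM suppression loss $\mathcal{L}_{CS}$, and the exit decision gate loss $\mathcal{L}_G$, with loss weights $\lambda_{CS}$ and $\lambda_G$:

$$\mathcal{L} = \mathcal{L}_O + \lambda_{CS}\,\mathcal{L}_{CS} + \lambda_G\,\mathcal{L}_G.$$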

3.2 Training the Suppressed CAMs

To constrain the usage of visual regions, OccamNets regularize the CAMs so that only some of the cells exhibit confidence towards the ground truth class, whereas the rest of the cells exhibit low confidence, i.e., have uniform prediction scores for all the classes. Specifically, let $c_y \in \mathbb{R}^{h \times w}$ be the CAM where each cell encodes the score for the ground truth class. Then, we apply regularization on the locations that obtain softmax scores lower than the average softmax score for the ground truth class. That is, let $\bar{c}_y$ be the softmax score averaged over all the cells in $c_y$; then the cells at location $l$, $c^l \in \mathbb{R}^{n_Y}$, are regularized if the softmax score for the ground truth class, $c^l_y$, is less than $\bar{c}_y$. The CAM suppression loss is:

$$\mathcal{L}_{CS} = \sum_{l=1}^{hw} \mathbb{1}\left(c^l_y < \bar{c}_y\right)\, \mathrm{KLD}\left(c^l, \tfrac{1}{n_Y}\mathbf{1}\right),$$

where $\mathrm{KLD}(c^l, \tfrac{1}{n_Y}\mathbf{1})$ is the KL-divergence loss with respect to a uniform class distribution and $\mathbb{1}(c^l_y < \bar{c}_y)$ ensures that the loss is applied only if the ground truth class scores lower than $\bar{c}_y$. The loss weight for $\mathcal{L}_{CS}$ is $\lambda_{CS}$, which is set to 0.1 for all the experiments.
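A minimal sketch of the CAM suppression idea for a single image follows. It assumes the CAM is given as a flat list of cells, each holding one logit per class, and that the softmax is taken over classes within each cell. The direction of the KL term (cell distribution against uniform) and the use of a plain sum over cells are assumptions made for illustration, and the helper names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one cell's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_uniform(probs: List[float]) -> float:
    """KL(probs || uniform); zero exactly when probs is uniform. Direction is an assumption."""
    n = len(probs)
    return sum(p * math.log(p * n) for p in probs if p > 0.0)


def cam_suppression_loss(cam_logits: List[List[float]], true_class: int) -> float:
    """Penalize non-uniform predictions in cells that score below the mean for the true class."""
    cells = [softmax(cell) for cell in cam_logits]
    mean_true = sum(cell[true_class] for cell in cells) / len(cells)
    # Only cells whose ground-truth score is below the CAM-wide average are regularized.
    return sum(kl_to_uniform(cell) for cell in cells if cell[true_class] < mean_true)


# One confident cell for class 0 plus one cell that is confident about the wrong class.
loss_with_distractor = cam_suppression_loss([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]], true_class=0)
# Same confident cell, but the other cell is already uniform, so nothing is penalized.
loss_with_uniform = cam_suppression_loss([[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]], true_class=0)
```

Cells that score above the average for the ground truth class are left untouched, so the network can still be confident in a small set of locations.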

3.3 Training the Output Predictors

The prediction vectors obtained by performing global average pooling on the suppressed CAMs are used to compute the output prediction losses. Specifically, we train a bias-amplified first exit $E_0$ using a loss weight of $W_0 = p_0^{\gamma_0}$, where $p_0$ is the softmax score for the ground truth class. Here, $W_0$ encourages $E_0$ to amplify biases, i.e., it provides higher loss weights for the samples that already have high scores for the ground truth class. This encourages $E_0$ to focus on the samples that it already finds easy to classify correctly. For all the experiments, $\gamma_0$ is held at a fixed value chosen to sufficiently amplify the biases. The subsequent exits are then encouraged to focus on samples that the preceding exits find difficult. For this, the loss weights are defined as:

$$W_j = 1 - \hat{g}_{j-1} + \epsilon, \quad \text{if } j > 0,$$

where $\hat{g}_{j-1}$ is the exit decision score predicted by the $(j-1)$-th exit decision gate and $\epsilon$ is a small offset to ensure that all the samples receive a minimal, non-zero loss weight. For the samples where $\hat{g}_{j-1}$ is low, the loss weight for the $j$-th exit, $W_j$, becomes high. The total output loss is then:

$$\mathcal{L}_O = \sum_{j=0}^{n_E - 1} W_j \, \mathrm{CE}(\hat{y}_j, y),$$

where $\mathrm{CE}$ is the cross-entropy loss and $n_E$ is the total number of exits. Note that the $E_j$’s are 0-indexed and the first bias-amplified exit $E_0$ is not used during inference. Furthermore, during training, we prevent the gradients of $E_0$ from passing through $B_0$ to avoid degrading the representations available for the deeper blocks and exits.
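The weighting scheme can be illustrated for a single sample as follows. This sketch assumes the first exit's weight is the ground-truth softmax score raised to a power, and that each later exit's weight is one minus the previous gate's score plus a small offset. The default values for `gamma0` and `eps` are placeholders chosen for the example, not values confirmed by the text, and all function names are hypothetical.

```python
import math
from typing import List


def exit_loss_weights(
    p_true_first_exit: float,
    prev_gate_scores: List[float],
    gamma0: float = 3.0,  # placeholder exponent for the bias-amplified exit (assumption)
    eps: float = 0.1,     # placeholder offset keeping every weight non-zero (assumption)
) -> List[float]:
    """Return [W_0, W_1, ...] for one sample.

    prev_gate_scores[j - 1] is the gate score of exit j - 1, used to weight exit j.
    """
    w0 = p_true_first_exit ** gamma0  # easy samples get large weights at the first exit
    later = [1.0 - g + eps for g in prev_gate_scores]  # hard samples get large weights later
    return [w0] + later


def output_loss(
    exit_probs: List[List[float]],
    true_class: int,
    gate_scores: List[float],
    gamma0: float = 3.0,
    eps: float = 0.1,
) -> float:
    """Weighted sum of per-exit cross-entropy losses for one sample.

    gate_scores has one entry per exit; the last gate's score is not needed for weighting.
    """
    weights = exit_loss_weights(exit_probs[0][true_class], gate_scores[:-1], gamma0, eps)
    return sum(-w * math.log(p[true_class]) for w, p in zip(weights, exit_probs))


weights = exit_loss_weights(0.5, [0.9, 0.2], gamma0=2.0, eps=0.1)
```

In this example the second exit receives a small weight because the previous gate was confident the sample could leave, while the third exit receives a large weight because its preceding gate was not.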

3.4 Training the Exit Decision Gates

Each exit decision gate $G_j$ yields an exit probability score $\hat{g}_j$. During inference, samples whose $\hat{g}_j$ clears the exit threshold exit from $E_j$, and the remaining samples continue to the next block $B_{j+1}$, if available. During training, all the samples use the entire network depth and $\hat{g}_j$ is used to weigh losses as described in Sec. 3.3. Now, we specify the exit decision gate loss used to train $G_j$:

$$\mathcal{L}_G = \sum_{k \in \{0, 1\}} \frac{\sum \mathbb{1}(g_j = k)\, \mathrm{BCE}(g_j, \hat{g}_j)}{\sum \mathbb{1}(g_j = k)},$$

where the inner sums run over the samples and $g_j$ is the ground truth value for the $j$-th gate, which is set to 1 if the predicted class is the same as the ground truth class and 0 otherwise. That is, $G_j$ is trained to exit if the sample is correctly predicted at depth $j$; otherwise it is trained to continue onto the next block. Furthermore, the denominator $\sum \mathbb{1}(g_j = k)$ balances out the contributions from the samples with $g_j = 1$ and $g_j = 0$ to avoid biasing one decision over the other. With this setup, sufficiently parameterized models that obtain 100% training accuracy will result in a trivial solution where $g_j$ is always set to 1, i.e., the exit will learn that all the samples can exit. To avoid this issue, we stop computing $g_j$ once $E_j$’s mean-per-class training accuracy reaches a predefined threshold $\tau_{acc,j}$. During training, we stop the gradients from $G_j$ from passing through $F_j$ and $B_j$, since this improved the training stability and overall accuracy in the preliminary experiments. The loss weight $\lambda_G$ is set to 1 in all the experiments.
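The balancing effect of the denominator can be seen in a small sketch for one gate and one batch. The function names are hypothetical, and the accuracy-threshold rule and the gradient stopping described in the text are deliberately left out. The sketch only shows the class-balanced binary cross-entropy, with the gate target set to 1 when the exit's prediction is correct and 0 otherwise.

```python
import math
from typing import List


def bce(target: int, prob: float, tiny: float = 1e-12) -> float:
    """Binary cross-entropy for one sample, clamped to avoid log(0)."""
    prob = min(max(prob, tiny), 1.0 - tiny)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def gate_loss(gate_probs: List[float], predicted: List[int], labels: List[int]) -> float:
    """Class-balanced BCE: average the loss separately over g=1 and g=0 samples, then add."""
    targets = [1 if p == y else 0 for p, y in zip(predicted, labels)]
    total = 0.0
    for k in (0, 1):
        losses = [bce(t, g) for t, g in zip(targets, gate_probs) if t == k]
        if losses:  # a group with no samples contributes nothing
            total += sum(losses) / len(losses)
    return total


# Three correctly classified samples and one error: each group still contributes equally.
balanced = gate_loss([0.5, 0.5, 0.5, 0.5], predicted=[0, 0, 0, 1], labels=[0, 0, 0, 0])
```

Without the per-group averaging, the three correct samples would dominate the single incorrect one, which would push the gate toward always exiting.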

4 Experimental Setup

4.1 Datasets

Figure 3: Samples from (a) BiasedMNIST, (b) COCO, and (c) BAR. For each dataset, the first two columns show bias-aligned (majority) samples, and the last column shows bias-conflicting (minority) samples. For BAR, the train set does not contain any bias-conflicting samples.

BiasedMNIST [63]. As shown in Fig. 3(a), BiasedMNIST requires classifying MNIST digits while remaining robust to multiple sources of biases, including color, texture, scale, and contextual biases. This is more challenging than the widely used Colored MNIST dataset [5, 35, 41], where the only source of bias is the spuriously correlated color. In our work, we build on the version created in [63]. We use images divided into grids of cells, where the target digit is placed in one of the grid cells and is spuriously correlated with: a) digit size/scale (number of cells a digit occupies), b) digit color, c) type of background texture, d) background texture color, e) co-occurring letters, and f) colors of the co-occurring letters. Following [63], we denote the probability with which each digit co-occurs with its biased property in the training set by $p_{bias}$. For instance, if $p_{bias} = 0.95$, then 95% of the digit 1s are red, 95% of digit 1s co-occur with letter ‘a’ (not necessarily colored red), and so on. We set $p_{bias}$ to 0.95 for all the experiments. The validation and test sets are unbiased. BiasedMNIST has 10 classes and 50K train, 10K validation, and 10K test samples.

COCO [3]. As shown in Fig. 3(b), COCO puts COCO objects [42] on spuriously correlated Places backgrounds [79]. For instance, buses mostly appear in front of balloons and birds in front of trees. The dataset provides three different test sets: a) biased backgrounds (in-distribution), which reflects the object-background correlations present in the train set, b) unseen backgrounds (non-systematic shift), where the objects are placed on backgrounds that are absent from the train set, and c) seen, but unbiased backgrounds (systematic shift), where the objects are placed on backgrounds that were not spuriously correlated with the objects in the train set. Results in [3] show it is difficult to maintain high accuracy on both the in-distribution and shifted-distribution test sets. In addition, COCO includes an anomaly detection task, where anomalous samples from an unseen object class need to be distinguished from the in-distribution samples. COCO has 9 classes with 7200 train, 900 validation, and 900 test images.

Biased Action Recognition (BAR) [48]. BAR reflects real world challenges where bias attributes are not explicitly labeled for debiasing algorithms, with the test set containing additional correlations not seen during training. The dataset consists of correlated action-background pairs, where the train set consists of selected action-background pairs, e.g., climbing on a rock, whereas the evaluation set consists of differently correlated action-background pairs, e.g., climbing on snowy slopes (see Fig. 3(c)). The background is not labeled for the debiasing algorithms, making it a challenging benchmark. BAR has 6 classes with 1941 train and 654 test samples.

4.2 Comparison Methods

We compare OccamNets with four state-of-the-art bias mitigation methods, apart from the vanilla empirical risk minimization procedure:

  • Empirical Risk Minimization (ERM) is the default method used by most deep learning models and it often leads to dataset bias exploitation since it minimizes the train loss without any debiasing procedure.

  • Spectral Decoupling (SD) [52] applies regularization to model outputs to help decouple features. This can help the model focus more on the signal.

  • Group Upweighting (UpWt) balances the loss contributions from the majority and the minority groups by multiplying the loss by $1/n_g^{\gamma}$, where $n_g$ is the number of samples in group $g$ and $\gamma$ is a hyper-parameter.

  • Group DRO (gDRO) [57] is an instance of a broader family of distributionally robust optimization techniques [19, 49, 53] that optimizes for the difficult groups in the dataset.

  • Predictive Group Invariance (PGI) [3] is another grouping method that encourages matched predictive distributions across easy and hard groups within each class. It penalizes the KL-divergence between predictive distributions from within-class groups.
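Of the comparison methods above, group upweighting is the simplest to illustrate. The sketch below assumes the per-sample loss multiplier is the inverse of the group size raised to a hyper-parameter power; the function name and the example group labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def group_upweights(group_ids: Sequence[Hashable], gamma: float = 1.0) -> List[float]:
    """Per-sample loss multipliers of the assumed form 1 / n_g ** gamma.

    n_g is the number of samples that share the sample's group label.
    """
    counts = Counter(group_ids)
    return [1.0 / (counts[g] ** gamma) for g in group_ids]


# Three majority samples and one minority sample.
weights = group_upweights(["maj", "maj", "maj", "min"], gamma=1.0)
```

With `gamma=1.0`, each group's weights sum to one, so the majority and minority groups contribute equally to the total loss. Smaller values of `gamma` soften the correction.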

Dataset Sub-groups. For debiasing, UpWt, gDRO, and PGI require additional labels for covariates (sub-group labels). Past work has focused on these labels being supplied by an oracle; however, having access to all relevant sub-group labels is often impractical for large datasets. Some recent efforts have attempted to infer these sub-groups. Just train twice (JTT) [43] trains a bias-prone ERM model for a few epochs to identify the difficult groups. Environment inference for invariant learning (EIIL) [14] learns sub-group assignments that maximize the invariant risk minimization objective [5]. Unfortunately, inferred sub-groups perform worse in general than when they are supplied by an oracle [3, 43]. For the methods that require them, which excludes OccamNets, we use oracle group labels where they exist (i.e., for BiasedMNIST and COCO). Inferred group labels are used for BAR, as oracle labels are not available.

For BiasedMNIST, all the samples having the same class and the same value for all of the spurious factors are placed in a single group. For COCO, objects placed on spuriously correlated backgrounds form the majority group, while the rest form the minority group. BAR does not specify oracle group labels, so we adopt the JTT method. Specifically, we train an ERM model for a single epoch, reserving the 20% of the samples with the highest losses as the difficult group and the rest as the easy group. We chose JTT over EIIL for its simplicity. OccamNets, of course, do not require such group labels to be specified.
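The group inference step used for BAR can be sketched as a simple split on per-sample losses. The sketch assumes the losses from the briefly trained ERM model are already available as a list; the function name, the group labels, and the example loss values are hypothetical.

```python
from typing import List, Sequence


def split_by_loss(losses: Sequence[float], hard_fraction: float = 0.2) -> List[str]:
    """Label the highest-loss fraction of samples as 'hard' and the rest as 'easy'."""
    n_hard = int(len(losses) * hard_fraction)
    # Indices sorted from highest loss to lowest loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    hard = set(order[:n_hard])
    return ["hard" if i in hard else "easy" for i in range(len(losses))]


# Made-up per-sample losses after one epoch of ERM training.
example_losses = [0.1, 2.0, 0.3, 0.2, 1.5, 0.05, 0.4, 0.6, 0.7, 0.8]
groups = split_by_loss(example_losses)
```

The resulting labels can then stand in for oracle group labels in methods such as UpWt, gDRO, and PGI.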

Architectures. ResNet-18 is used as the standard baseline architecture for our studies. We compare it with an OccamNet version of ResNet-18, i.e., OccamResNet-18. To create this architecture, we add early exit modules to each of ResNet-18’s convolutional blocks. To keep the number of parameters in OccamResNet-18 comparable to ResNet-18, we reduce the feature map width from 64 to 48. Assuming 1000 output classes, ResNet-18 has 12M parameters compared to 8M in OccamResNet-18. Further details are provided in Sec. A.7.

4.3 Metrics and Model Selection

We report the means and standard deviations of test set accuracies computed across five different runs for all the datasets. For BiasedMNIST, we report the unbiased test set accuracy alongside the majority and minority group accuracies for each bias variable. For COCO, unless otherwise specified, we report accuracy on the most challenging test split: with seen, but unbiased backgrounds. We also report the average precision score to measure the ability to distinguish 100 anomalous samples from the in-distribution samples for the anomaly detection task of COCO. For BAR, we report the overall test accuracies. We use the unbiased validation set of BiasedMNIST and the validation set with unbiased backgrounds for COCO for hyperparameter tuning. The hyperparameter search grid and selected values are specified in Sec. 


5 Results and Analysis

5.1 Overall Results

| Architecture | Method | BiasedMNIST | COCO | BAR |
|---|---|---|---|---|
| ResNet-18 | ResNet+ERM | 36.8 ± 0.7 | 35.6 ± 1.0 | 51.3 ± 1.9 |
| ResNet-18 | ResNet+SD [52] | 37.1 ± 1.0 | 35.4 ± 0.5 | 51.3 ± 2.3 |
| ResNet-18 | ResNet+UpWt | 37.7 ± 1.6 | 35.2 ± 0.4 | 51.1 ± 1.9 |
| ResNet-18 | ResNet+gDRO [56] | 19.2 ± 0.9 | 35.3 | 38.7 ± 2.2 |
| ResNet-18 | ResNet+PGI [3] | 48.6 ± 0.7 | 42.7 ± 0.6 | 53.6 ± 0.9 |
| OccamResNet-18 | OccamResNet | 65.0 | 43.4 ± 1.0 | 52.6 ± 1.9 |

Table 1: Unbiased test set accuracies (mean ± standard deviation) comparing OccamResNet to the more conventional ResNet architectures without early exits and constrained class activation maps.

OccamNets vs. ERM and Recent Bias Mitigation Methods. To examine how OccamNets fare against ERM and state-of-the-art bias mitigation methods, we run the comparison methods on ResNet and compare the results with OccamResNet. Results are given in Table 1. OccamResNet outperforms state-of-the-art methods on BiasedMNIST and COCO and rivals PGI on BAR, demonstrating that architectural inductive biases alone can help mitigate dataset bias. The gap between OccamResNet and other methods is large on BiasedMNIST (16.4 - 46.0% absolute difference). For COCO, PGI rivals OccamResNet and clearly outperforms all other methods in terms of accuracy on the test split with seen, but unbiased backgrounds. OccamResNet’s results are impressive considering that UpWt, gDRO, and PGI all had access to the bias group variables, unlike OccamNet, ERM, and SD.

Combining OccamNets with Recent Bias Mitigation Methods. Because OccamNets are a new network architecture, we used OccamResNet-18 with each of the baseline methods instead of ResNet-18. These results are shown in Table 2, where we provide unbiased accuracy along with any improvement or impairment of performance when OccamResNet-18 is used instead of ResNet-18. All methods benefit from using the OccamResNet architecture compared to ResNet-18, with gains of 10.6% - 28.2% for BiasedMNIST, 0.9% - 7.8% for COCO, and 1.0% - 14.2% for BAR.

| Method | BiasedMNIST | COCO | BAR |
|---|---|---|---|
| OccamResNet | 65.0 (+28.2) | 43.4 (+7.8) | 52.6 (+1.3) |
| OccamResNet+SD [52] | 55.2 (+18.1) | 39.4 (+4.0) | 52.3 (+1.0) |
| OccamResNet+UpWt | 65.7 (+28.0) | 42.9 (+7.7) | 52.2 (+1.1) |
| OccamResNet+gDRO [56] | 29.8 (+10.6) | 40.7 (+5.4) | 52.9 (+14.2) |
| OccamResNet+PGI [3] | 69.6 (+21.0) | 43.6 (+0.9) | 55.9 (+2.3) |

Table 2: Unbiased accuracies alongside the improvement/impairment (in parentheses) when the comparison methods are run on OccamResNet instead of ResNet.

5.2 Analysis of the Proposed Inductive Biases

Figure 4: Percentage of samples exited (Exit%) from each exit (barring $E_0$) for (a) BiasedMNIST, (b) COCO, and (c) BAR.

In this section, we analyze the impacts of each of the proposed modifications in OccamNets and their success in achieving the desired behavior.

Analysis of Early Exits. OccamResNet has four exits; the first exit is used for bias amplification and the rest are used to potentially exit early during inference. To analyze the usage of the earlier exits, we plot the percentage of samples that exited from each exit in Fig. 4. For the BiasedMNIST dataset, a large portion of the samples, i.e., 59.8%, exit from the shallowest exit $E_1$ and only 13.3% exit from the final exit $E_3$. For COCO and BAR, 50.4% and 44.1% of the samples exit before $E_3$, with 49.6% and 55.9% of the samples using the full depth, respectively. These results show that OccamNets favor exiting early, but that they do use the full network depth if necessary.

Figure 5: Original image, and Grad-CAM visualizations for ERM and PGI on ResNet, and CAM visualizations on OccamResNet. The visualizations are for the ground truth.

CAM Visualizations. To compare the localization capabilities of OccamResNets to ResNets, we present CAM visualizations in Fig. 5. For ResNets that were run with ERM and PGI, we show Grad-CAM visualizations [61], whereas for OccamResNets, we directly visualize the CAM heatmaps obtained from the earliest exit used for each sample. As shown in the figure, OccamResNet generally prefers smaller regions that include the target object. On the other hand, comparison methods tend to focus on larger visual regions that include irrelevant object/background cues leading to lower accuracies.

| Ablation Description | BiasedMNIST | COCO |
|---|---|---|
| Using only one exit at the end | 35.9 | 35.0 |
| Weighing all the samples equally, i.e., $W_j = 1$ | 66.3 | 40.8 |
| Without the CAM suppression loss, i.e., $\lambda_{CS} = 0$ | 48.2 | 37.1 |
| Full OccamResNet | 65.0 | 43.4 |

Table 3: Ablation studies on OccamResNet.

Ablations. To study the importance of the introduced inductive biases, we perform ablations on BiasedMNIST and COCO. First, to examine if the multi-exit setup is helpful, we train networks with a single exit attached to the end of the network. This caused accuracy drops of 29.1% on BiasedMNIST and 8.4% on COCO, indicating that the multi-exit setup is critical. To examine if the weighted output prediction losses are helpful or harmful, we set all the loss weights ($W_j$) to 1. This resulted in an accuracy drop of 2.6% on COCO. However, it improved accuracy on BiasedMNIST by 1.3%. We hypothesize that since the earlier exits suffice for a large number of samples in BiasedMNIST (as indicated in Fig. 4(a)), the later exits may not receive sufficient training signal with the weighted output prediction losses. Finally, we ran experiments without the CAM suppression loss by setting $\lambda_{CS}$ to 0. This caused large accuracy drops of 16.8% on BiasedMNIST and 6.3% on COCO. These experiments demonstrate that both inductive biases are vital for OccamNets.

5.3 Robustness to Different Types of Shifts

A robust system must handle different types of bias shifts. To test this ability, we examine the robustness to each bias variable in BiasedMNIST and we also compare the methods on the differently shifted test splits of COCO.

In Table 4, we compare how ResNet and OccamResNet are affected by the different bias variables in BiasedMNIST. For this, we present majority and minority group accuracies for each variable. Bias variables with large differences between the majority and the minority groups, i.e., a large majority/minority group discrepancy (MMD), are the most challenging spurious factors. OccamResNets, with and without PGI, improve on both majority and minority group accuracies across all the bias variables. OccamResNets are especially good at ignoring the distracting letters and their colors, obtaining MMD values between 0-1%. ResNets, on the other hand, are susceptible to those spurious factors, obtaining MMD values between 7.9-20.7%. Among all the variables, digit scale and digit color are the most challenging ones and OccamResNets mitigate their exploitation to some extent.
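The MMD values quoted above can be recomputed directly from the majority/minority pairs reported in Table 4. In the sketch below, the dictionary layout and function name are illustrative choices. It also assumes that the last two columns of the table correspond to the co-occurring letters and their colors, which is consistent with the ranges quoted in the text.

```python
from typing import Dict, List, Tuple

# (majority, minority) accuracies per bias variable, copied from Table 4.
TABLE_4: Dict[str, List[Tuple[float, float]]] = {
    "ResNet+ERM": [(87.2, 31.3), (78.5, 32.1), (76.1, 32.4), (41.9, 36.3), (46.7, 35.7), (45.7, 35.9)],
    "ResNet+PGI": [(91.9, 43.8), (84.8, 44.6), (79.5, 45.1), (51.3, 48.3), (67.2, 46.5), (55.8, 47.9)],
    "OccamResNet": [(94.6, 61.7), (96.3, 61.5), (81.6, 63.1), (66.8, 64.8), (64.7, 65.1), (64.7, 65.1)],
    "OccamResNet+PGI": [(95.4, 66.7), (97.0, 66.5), (88.6, 67.4), (71.4, 69.4), (69.6, 69.6), (70.5, 69.5)],
}


def mmd(pairs: List[Tuple[float, float]]) -> List[float]:
    """Majority minus minority accuracy for each bias variable, rounded to one decimal."""
    return [round(majority - minority, 1) for majority, minority in pairs]


# Discrepancies for the last two columns (assumed to be letters and letter colors).
letter_mmd = {name: mmd(pairs)[-2:] for name, pairs in TABLE_4.items()}
```

A negative value, as for OccamResNet on the last two columns, means the minority group is classified slightly better than the majority group.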

| Method | Test Acc. | Digit Scale (maj/min) | Digit Color (maj/min) | Texture (maj/min) | Texture Color (maj/min) | Letter (maj/min) | Letter Color (maj/min) |
|---|---|---|---|---|---|---|---|
| ResNet+ERM | 36.8 | 87.2/31.3 | 78.5/32.1 | 76.1/32.4 | 41.9/36.3 | 46.7/35.7 | 45.7/35.9 |
| ResNet+PGI | 48.6 | 91.9/43.8 | 84.8/44.6 | 79.5/45.1 | 51.3/48.3 | 67.2/46.5 | 55.8/47.9 |
| OccamResNet | 65.0 | 94.6/61.7 | 96.3/61.5 | 81.6/63.1 | 66.8/64.8 | 64.7/65.1 | 64.7/65.1 |
| OccamResNet+PGI | 69.6 | 95.4/66.7 | 97.0/66.5 | 88.6/67.4 | 71.4/69.4 | 69.6/69.6 | 70.5/69.5 |

Table 4: Accuracies on majority (maj)/minority (min) groups for each bias variable in BiasedMNIST.

| Architecture | Method | Biased Backgrounds | Unseen Backgrounds | Seen, but Non-Spurious Backgrounds | Anomaly Detection |
|---|---|---|---|---|---|
| ResNet-18 | ResNet+ERM | 84.9 ± 0.5 | 53.2 ± 0.7 | 35.6 ± 1.0 | 20.1 ± 1.5 |
| ResNet-18 | ResNet+PGI [3] | 77.5 ± 0.6 | 52.8 ± 0.7 | 42.7 ± 0.6 | 20.6 ± 2.1 |
| OccamResNet-18 | OccamResNet | 84.0 ± 1.0 | 55.8 ± 1.2 | 43.4 ± 1.0 | 22.3 ± 2.8 |
| OccamResNet-18 | OccamResNet+PGI [3] | 82.8 ± 0.6 | 55.3 ± 1.3 | 43.6 ± 0.6 | 21.6 ± 1.6 |

Table 5: Accuracies on all three test splits of COCO, alongside mean average precision for the anomaly detection task.

Next, we show accuracies for the base method and PGI on both ResNet-18 and OccamResNet-18 on all of the test splits of COCO in Table 5. The different test splits have different kinds of object-background combinations, and ideally the method should work well on all three test splits. PGI run on ResNet-18 improves on the split with seen, but non-spurious backgrounds but incurs a large accuracy drop of 7.4% on the in-distribution test set with biased backgrounds. On the other hand, OccamResNet-18 shows only a 0.9% drop on the biased backgrounds, while showing 2.6% accuracy gains on the split with unseen backgrounds and 7.8% accuracy gains on the split with seen, but non-spurious backgrounds. It further obtains 2.2% gains on the average precision metric for the anomaly detection task. PGI run on OccamResNet exhibits a lower drop of 2.1% on the in-distribution split compared to PGI run on ResNet, while obtaining larger gains on the rest of the splits. These results show that OccamNets obtain high in-distribution and shifted-distribution accuracies.

5.4 Evaluation on Other Architectures

To examine if the proposed inductive biases improve bias-resilience in other architectures too, we created OccamEfficientNet-B2 and OccamMobileNet-v3 by modifying EfficientNet-B2 [66] and MobileNet-v3 [29, 30]. OccamNet variants outperform standard architectures on both BiasedMNIST (OccamEfficientNet-B2: 59.2 vs. EfficientNet-B2: 34.4 and OccamMobileNet-v3: 49.9 vs. MobileNet-v3: 40.4) and COCO (OccamEfficientNet-B2: 39.2 vs. EfficientNet-B2: 34.2 and OccamMobileNet-v3: 40.1 vs. MobileNet-v3: 34.9). The gains show that the proposed modifications improve robustness in other commonly used architectures too. We provide the detailed results in Sec. A.6.

5.5 Do OccamNets Work on Less Biased Datasets?

To examine if OccamNets also work well on datasets with less bias, we train ResNet-18 and OccamResNet-18 on 100 classes of the ImageNet dataset [16]. OccamResNet-18 obtains competitive numbers compared to the standard ResNet-18 (OccamResNet-18: 92.1 vs. ResNet-18: 92.6, top-5 accuracies). However, as described in the rest of this paper, OccamResNet-18 achieves this with improved resistance to bias (e.g., if the test distributions were to change in the future). Additionally, it reduces computation, with 47.6% of the samples leaving through one early exit and 13.3% through another. As such, OccamNets have the potential to be the de facto network choice for visual recognition tasks regardless of the degree of bias.

6 Discussion

Relation to Mixed Capacity Models. Recent studies have shown that sufficiently simple models, e.g., models with a limited number of parameters [13, 28] or models trained for a few epochs [43, 73] amplify biases. Specifically, [28] shows that model compression disproportionately hampers minority samples. Seemingly, this is an argument against smaller models (simpler hypotheses), i.e., against Occam’s principles. However, [28] does not study network depth, unlike our work. Our paper suggests that using only the necessary capacity for each example yields greater robustness.

Relation to Multi-Hypothesis Models. Some recent works generate multiple plausible hypotheses [40, 69] and use extra information at test time to choose the best hypothesis. The techniques include training a set of models with dissimilar input gradients [69] and training multiple prediction heads that disagree on a target distribution [40]. An interesting extension of OccamNets could be making diverse predictions through the multiple exits and through CAMs that focus on different visual regions. This could help avoid discarding complex features in favor of simpler ones [62]. This may also help with under-specified tasks, where there are equally viable ways of making the predictions [40].

Other Architectures and Tasks. We tested Occam’s razor for image classification with CNNs; however, it would be interesting to test the general principles on other architectures. The ability to exit dynamically could be used in many non-vision tasks where feed-forward networks are used for classification (e.g., graphs, text, audio graphs, etc.). There are promising signs: for instance, early exit mechanisms showed improved efficiency and robustness on natural language inference tasks [80] with BERT architectures [17]. While the spatial bias is more vision specific, it could be readily integrated into recent non-CNN approaches for image classification [18, 44, 70, 77].

7 Conclusion

In summary, the proposed OccamNets have architectural inductive biases favoring simpler solutions. The experiments show improvements over state-of-the-art bias mitigation techniques. Furthermore, existing methods tend to do better with OccamNets than with the standard architectures.


This work was supported in part by NSF awards #1909696 and #2047556. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements of any sponsor.


  • [1] E. Adeli, Q. Zhao, A. Pfefferbaum, E. Sullivan, L. Fei-Fei, J. C. Niebles, and K. Pohl (2019) Bias-resilient neural network. ArXiv abs/1910.03676. Cited by: §1.
  • [2] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi (2018) Don’t just assume; look and answer: overcoming priors for visual question answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 4971–4980. External Links: Document Cited by: §1.
  • [3] F. Ahmed, Y. Bengio, H. van Seijen, and A. Courville (2020) Systematic generalisation with group invariant predictions. In International Conference on Learning Representations, Cited by: §1, 5th item, §4.1, §4.2, Table 1, Table 2, Table 5, §A.10, Table A10, Table A12, Table A6, Table A7, Table A8.
  • [4] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 6077–6086. External Links: Document Cited by: §2.
  • [5] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893. Cited by: §1, §2, §4.1, §4.2.
  • [6] T. Bolukbasi, K. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai (2016) Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4349–4357. Cited by: §1.
  • [7] R. Cadène, C. Dancette, H. Ben-younes, M. Cord, and D. Parikh (2019) RUBi: Reducing Unimodal Biases for Visual Question Answering. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 839–850. Cited by: §1, §2.
  • [8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16, pp. 321–357. Cited by: §1, §2.
  • [9] L. Chen, X. Yan, J. Xiao, H. Zhang, S. Pu, and Y. Zhuang (2020) Counterfactual samples synthesizing for robust visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10800–10809. Cited by: §2.
  • [10] X. Chen, H. Dai, Y. Li, X. Gao, and L. Song (2020) Learning to stop while learning to predict. In International Conference on Machine Learning, pp. 1520–1530. Cited by: §2.
  • [11] Y. J. Choe, J. Ham, and K. Park (2020) An empirical study of invariant risk minimization. ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning. Cited by: §1.
  • [12] C. Clark, M. Yatskar, and L. Zettlemoyer (2019) Don’t take the easy way out: ensemble based methods for avoiding known dataset biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4069–4082. External Links: Document Cited by: §1, §2.
  • [13] C. Clark, M. Yatskar, and L. Zettlemoyer (2020) Learning to model and ignore dataset bias with mixed capacity ensembles. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 3031–3045. External Links: Document Cited by: §2, §6.
  • [14] E. Creager, J. Jacobsen, and R. Zemel (2021) Environment inference for invariant learning. In International Conference on Machine Learning, pp. 2189–2200. Cited by: §4.2.
  • [15] Y. Cui, M. Jia, T. Lin, Y. Song, and S. J. Belongie (2019) Class-balanced loss based on effective number of samples. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 9268–9277. External Links: Document Cited by: §1, §2.
  • [16] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §5.5.
  • [17] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §6.
  • [18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §6.
  • [19] J. C. Duchi, T. Hashimoto, and H. Namkoong (2019) Distributionally robust losses against mixture covariate shifts. Under review. Cited by: 4th item.
  • [20] R. Duggal, S. Freitas, S. Dhamnani, D. Horng, J. Sun, et al. (2020) ELF: an early-exiting framework for long-tailed classification. arXiv preprint arXiv:2006.11979. Cited by: §2.
  • [21] G. Grand and Y. Belinkov (2019) Adversarial regularization for visual question answering: strengths, shortcomings, and side effects. In Proceedings of the Second Workshop on Shortcomings in Vision and Language, Minneapolis, Minnesota, pp. 1–13. External Links: Document Cited by: §1, §2.
  • [22] M. Guo, T. Xu, J. Liu, Z. Liu, P. Jiang, T. Mu, S. Zhang, R. R. Martin, M. Cheng, and S. Hu (2021) Attention mechanisms in computer vision: a survey. arXiv preprint arXiv:2111.07624. Cited by: §2.
  • [23] H. He, Y. Bai, E. A. Garcia, and S. Li (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), pp. 1322–1328. Cited by: §1.
  • [24] H. He and E. A. Garcia (2009) Learning from imbalanced data. IEEE Transactions on knowledge and data engineering 21 (9), pp. 1263–1284. Cited by: §1, §2.
  • [25] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §2.
  • [26] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2, §3.1.
  • [27] C. Hettinger, T. Christensen, B. Ehlert, J. Humpherys, T. Jarvis, and S. Wade (2017) Forward thinking: building and training neural networks one layer at a time. arXiv preprint arXiv:1706.02480. Cited by: §2.
  • [28] S. Hooker, N. Moorosi, G. Clark, S. Bengio, and E. Denton (2020) Characterising bias in compressed models. arXiv preprint arXiv:2010.03058. Cited by: §6.
  • [29] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2, §5.4.
  • [30] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324. Cited by: §5.4.
  • [31] J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi (2018) Gather-excite: exploiting feature context in convolutional neural networks. Advances in neural information processing systems 31. Cited by: §2.
  • [32] T. Hu, T. Chen, H. Wang, and Z. Wang (2020) Triple wins: boosting accuracy, robustness and efficiency together by enabling input-adaptive inference. arXiv preprint arXiv:2002.10025. Cited by: §2.
  • [33] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2, §3.1.
  • [34] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. Advances in neural information processing systems 28. Cited by: §2.
  • [35] B. Kim, H. Kim, K. Kim, S. Kim, and J. Kim (2019) Learning not to learn: training deep neural networks with biased data. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 9012–9020. External Links: Document Cited by: §2, §4.1.
  • [36] E. Kim, J. Lee, and J. Choo (2021) Biaswap: removing dataset bias with bias-tailored swapping augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14992–15001. Cited by: §2.
  • [37] D. Krueger, E. Caballero, J. Jacobsen, A. Zhang, J. Binas, D. Zhang, R. Le Priol, and A. Courville (2021) Out-of-distribution generalization via risk extrapolation (rex). In International Conference on Machine Learning, pp. 5815–5826. Cited by: §2.
  • [38] C. Lee, P. W. Gallagher, and Z. Tu (2016) Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree. In Artificial intelligence and statistics, pp. 464–472. Cited by: §2.
  • [39] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu (2015) Deeply-supervised nets. In Artificial intelligence and statistics, pp. 562–570. Cited by: §2.
  • [40] Y. Lee, H. Yao, and C. Finn (2022) Diversify and disambiguate: learning from underspecified data. arXiv preprint arXiv:2202.03418. Cited by: §6.
  • [41] Y. Li and N. Vasconcelos (2019) REPAIR: removing representation bias by dataset resampling. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 9572–9581. External Links: Document Cited by: §2, §4.1.
  • [42] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.1.
  • [43] E. Z. Liu, B. Haghgoo, A. S. Chen, A. Raghunathan, P. W. Koh, S. Sagawa, P. Liang, and C. Finn (2021) Just train twice: improving group robustness without training group information. In International Conference on Machine Learning, pp. 6781–6792. Cited by: §4.2, §6.
  • [44] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022. Cited by: §6.
  • [45] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §2.
  • [46] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan (2021) A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54 (6), pp. 1–35. Cited by: §2.
  • [47] H. Mostafa, V. Ramesh, and G. Cauwenberghs (2018) Deep supervised learning using local errors. Frontiers in neuroscience, pp. 608. Cited by: §2.
  • [48] J. Nam, H. Cha, S. Ahn, J. Lee, and J. Shin (2020) Learning from failure: training debiased classifier from biased classifier. In Advances in Neural Information Processing Systems, Cited by: §1, §2, §4.1.
  • [49] H. Namkoong and J. C. Duchi (2016) Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2208–2216. Cited by: 4th item.
  • [50] A. Nøkland and L. H. Eidnes (2019) Training neural networks with local error signals. In International conference on machine learning, pp. 4839–4850. Cited by: §2.
  • [51] A. Nøkland (2016) Direct feedback alignment provides learning in deep neural networks. Advances in neural information processing systems 29. Cited by: §2.
  • [52] M. Pezeshki, S. Kaba, Y. Bengio, A. Courville, D. Precup, and G. Lajoie (2020) Gradient starvation: a learning proclivity in neural networks. arXiv preprint arXiv:2011.09468. Cited by: 2nd item, Table 1, Table 2, Table A10, Table A12, Table A6, Table A7, Table A8.
  • [53] H. Rahimian and S. Mehrotra (2019) Distributionally robust optimization: a review. arXiv preprint arXiv:1908.05659. Cited by: 4th item.
  • [54] S. Ramakrishnan, A. Agrawal, and S. Lee (2018) Overcoming language priors in visual question answering with adversarial regularization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 1548–1558. Cited by: §1.
  • [55] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.
  • [56] S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang (2019) Distributionally robust neural networks for group shifts: on the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731. Cited by: §2, Table 1, Table 2, §A.9, Table A10, Table A12, Table A6, Table A7, Table A8.
  • [57] S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang (2019) Distributionally robust neural networks for group shifts: on the importance of regularization for worst-case generalization. CoRR abs/1911.08731. External Links: 1911.08731, Link Cited by: §1, 4th item.
  • [58] S. Sagawa, A. Raghunathan, P. W. Koh, and P. Liang (2020) An investigation of why overparameterization exacerbates spurious correlations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 8346–8356. Cited by: §1, §2.
  • [59] V. Sanh, T. Wolf, Y. Belinkov, and A. M. Rush (2020) Learning from others’ mistakes: avoiding dataset biases without modeling them. arXiv preprint arXiv:2012.01300. Cited by: §2.
  • [60] S. Scardapane, M. Scarpiniti, E. Baccarelli, and A. Uncini (2020) Why should we add early exits to neural networks?. Cognitive Computation 12 (5), pp. 954–966. Cited by: §2.
  • [61] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §5.2.
  • [62] H. Shah, K. Tamuly, A. Raghunathan, P. Jain, and P. Netrapalli (2020) The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems 33. Cited by: §6.
  • [63] R. Shrestha, K. Kafle, and C. Kanan (2022) An investigation of critical issues in bias mitigation techniques. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1943–1954. Cited by: §1, §1, §2, §4.1.
  • [64] K. K. Singh, D. Mahajan, K. Grauman, Y. J. Lee, M. Feiszli, and D. Ghadiyaram (2020) Don’t judge an object by its context: learning to overcome contextual bias. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 11067–11075. External Links: Document Cited by: §1.
  • [65] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §2.
  • [66] M. Tan and Q. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. Cited by: §5.4.
  • [67] S. Teerapittayanon, B. McDanel, and H. Kung (2016) Branchynet: fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464–2469. Cited by: §2.
  • [68] D. Teney, E. Abbasnejad, and A. v. d. Hengel (2020) Unshuffling data for improved generalization. arXiv preprint arXiv:2002.11894. Cited by: §1.
  • [69] D. Teney, E. Abbasnejad, S. Lucey, and A. van den Hengel (2021) Evading the simplicity bias: training a diverse set of models discovers solutions with superior ood generalization. arXiv preprint arXiv:2105.05612. Cited by: §6.
  • [70] I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, et al. (2021) Mlp-mixer: an all-mlp architecture for vision. Advances in Neural Information Processing Systems 34. Cited by: §6.
  • [71] A. Torralba and A. A. Efros (2011) Unbiased look at dataset bias. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pp. 1521–1528. External Links: Document Cited by: §2.
  • [72] P. A. Utama, N. S. Moosavi, and I. Gurevych (2020) Towards debiasing NLU models from unknown biases. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 7597–7610. External Links: Document, Link Cited by: §A.4.
  • [73] P. A. Utama, N. S. Moosavi, and I. Gurevych (2020) Towards debiasing NLU models from unknown biases. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 7597–7610. External Links: Document, Link Cited by: §2, §6.
  • [74] S. Venkataramani, A. Raghunathan, J. Liu, and M. Shoaib (2015) Scalable-effort classifiers for energy-efficient machine learning. In Proceedings of the 52nd Annual Design Automation Conference, pp. 1–6. Cited by: §2.
  • [75] M. Wołczyk, B. Wójcik, K. Bałazy, I. Podolak, J. Tabor, M. Śmieja, and T. Trzcinski (2021) Zero time waste: recycling predictions in early exit neural networks. Advances in Neural Information Processing Systems 34. Cited by: §2.
  • [76] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.
  • [77] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan (2021) Metaformer is actually what you need for vision. arXiv preprint arXiv:2111.11418. Cited by: §6.
  • [78] B. H. Zhang, B. Lemoine, and M. Mitchell (2018) Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340. Cited by: §1.
  • [79] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1452–1464. Cited by: §4.1.
  • [80] W. Zhou, C. Xu, T. Ge, J. McAuley, K. Xu, and F. Wei (2020) Bert loses patience: fast and robust inference with early exit. Advances in Neural Information Processing Systems 33, pp. 18330–18341. Cited by: §2, §6.

A Appendix

A.1 Detailed Results on Each Dataset

*BiasedMNIST. In Table A6, we present the unbiased test accuracies and majority/minority group accuracies for each bias variable. Methods run on OccamResNet-18 lower the majority/minority discrepancy (MMD) compared to the methods run on ResNet-18 for all of the variables, indicating that OccamNets reduce the tendency to latch onto any of the spurious factors. OccamResNet-18 is especially robust to letter, letter color and texture color biases, as shown by its low MMD values compared to the larger MMD values obtained by ResNet-18.

Method | Unbiased acc. | maj/min | maj/min | maj/min | maj/min | maj/min | maj/min
Results on ResNet-18
ResNet+ERM | 36.8 | 87.2/31.3 | 78.5/32.1 | 76.1/32.4 | 41.9/36.3 | 46.7/35.7 | 45.7/35.9
ResNet+SD [52] | 37.1 | 83.4/32.0 | 76.9/32.7 | 76.7/32.7 | 42.3/36.6 | 48.3/35.8 | 48.9/35.9
ResNet+UpWt | 37.7 | 88.0/32.1 | 80.4/32.9 | 75.6/33.4 | 41.9/37.2 | 46.7/36.6 | 46.9/36.7
ResNet+gDRO [56] | 19.2 | 55.0/15.2 | 50.2/15.7 | 63.4/14.2 | 24.8/18.6 | 26.7/18.3 | 29.5/18.1
ResNet+PGI [3] | 48.6 | 91.9/43.8 | 84.8/44.6 | 79.5/45.1 | 51.3/48.3 | 67.2/46.5 | 55.8/47.9
Results on OccamResNet-18
OccamResNet | 65.0 | 94.6/61.7 | 96.3/61.5 | 81.6/63.1 | 66.8/64.8 | 64.7/65.1 | 64.7/65.1
OccamResNet+SD [52] | 55.2 | 92.3/51.1 | 92.9/50.9 | 78.9/52.5 | 57.4/54.9 | 55.9/55.1 | 55.3/55.2
OccamResNet+UpWt | 65.7 | 95.1/62.5 | 96.3/62.3 | 82.4/63.9 | 68.3/65.5 | 65.3/65.8 | 65.3/65.8
OccamResNet+gDRO [56] | 29.8 | 72.8/25.0 | 69.5/25.3 | 45.8/28.0 | 39.4/28.8 | 29.7/29.8 | 36.1/29.1
OccamResNet+PGI [3] | 69.6 | 95.4/66.7 | 97.0/66.5 | 88.6/67.4 | 71.4/69.4 | 69.6/69.6 | 70.5/69.5
Table A6: Accuracies on majority (maj)/minority (min) groups for each bias variable in *BiasedMNIST ().
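The majority/minority discrepancy discussed above can be recomputed from the table entries. The following is a minimal sketch, assuming MMD is simply the majority-group accuracy minus the minority-group accuracy for each bias variable (the exact definition is not restated in this appendix). The function name `mmd` and the dictionary layout are illustrative, and since the bias-variable names are not available here, columns are referred to by position only.

```python
# (majority acc., minority acc.) pairs copied from two rows of Table A6.
# Column positions 0..5 stand in for the six bias variables.
table_a6 = {
    "ResNet+ERM": [(87.2, 31.3), (78.5, 32.1), (76.1, 32.4),
                   (41.9, 36.3), (46.7, 35.7), (45.7, 35.9)],
    "OccamResNet": [(94.6, 61.7), (96.3, 61.5), (81.6, 63.1),
                    (66.8, 64.8), (64.7, 65.1), (64.7, 65.1)],
}


def mmd(pairs):
    """Assumed definition: majority accuracy minus minority accuracy."""
    return [round(majority - minority, 1) for majority, minority in pairs]


mmd_by_method = {name: mmd(pairs) for name, pairs in table_a6.items()}

for name, gaps in mmd_by_method.items():
    print(name, gaps)
```

Under this assumed definition, the OccamResNet gap is smaller than the ResNet+ERM gap in every column, which is consistent with the claim in the text.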

*COCO. In Table A7, we present the accuracies on each of the test splits of *COCO, alongside the average precision for the anomaly detection task. As discussed in Sec. 5.1, methods run on OccamResNet-18 show improvements over the methods run on ResNet-18 on the shifted test splits and the anomaly detection task. Furthermore, while PGI run on ResNet-18 shows a large drop of 7.4% on the in-distribution test split, methods (barring gDRO) run on OccamResNet-18 show smaller drops, indicating robustness to distributions consisting of the same or different biases as compared to the train distribution.

Method | In-distribution (biased backgrounds) | Unseen backgrounds | Seen, but non-spurious backgrounds | Anomaly detection (AP)
Results on ResNet-18
ResNet+ERM | 84.9 ± 0.5 | 53.2 ± 0.7 | 35.6 ± 1.0 | 20.1 ± 1.5
ResNet+SD [52] | 85.3 ± 0.3 | 52.8 ± 0.9 | 35.4 ± 0.5 | 19.9 ± 1.4
ResNet+UpWt | 84.9 ± 0.6 | 52.3 ± 0.7 | 35.2 ± 0.4 | 20.4 ± 1.9
ResNet+gDRO [56] | 81.6 ± 0.7 | 49.3 ± 1.3 | 35.3 ± 0.1 | 19.6 ± 1.7
ResNet+PGI [3] | 77.5 ± 0.6 | 52.8 ± 0.7 | 42.7 ± 0.6 | 20.6 ± 2.1
Results on OccamResNet-18
OccamResNet | 84.0 ± 1.0 | 55.8 ± 1.2 | 43.4 ± 1.0 | 22.3 ± 2.8
OccamResNet+SD [52] | 84.8 ± 0.4 | 55.3 ± 0.5 | 39.4 ± 0.6 | 20.3 ± 1.0
OccamResNet+UpWt | 82.9 ± 0.5 | 56.6 ± 1.0 | 42.9 ± 0.8 | 21.0 ± 0.9
OccamResNet+gDRO [56] | 78.6 ± 0.7 | 50.7 ± 2.0 | 40.7 ± 1.5 | 19.3 ± 2.3
OccamResNet+PGI [3] | 82.8 ± 0.6 | 55.3 ± 1.3 | 43.6 ± 0.6 | 21.6 ± 1.6
Table A7: Accuracy on the three splits of *COCO, alongside average precision for the anomaly detection task.
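The drops and gains quoted in the main text can be reproduced from the mean values in Table A7. The sketch below assumes the four value columns are, in order, the in-distribution split, the unseen-background split, the seen but non-spurious split, and the anomaly detection average precision; this ordering is inferred from the differences quoted in the text rather than from a surviving table header. The helper name `delta` is illustrative.

```python
# Mean values copied from Table A7, in the assumed column order:
# (in-distribution, unseen backgrounds, seen but non-spurious, anomaly AP)
table_a7 = {
    "ResNet+ERM": (84.9, 53.2, 35.6, 20.1),
    "ResNet+PGI": (77.5, 52.8, 42.7, 20.6),
    "OccamResNet": (84.0, 55.8, 43.4, 22.3),
    "OccamResNet+PGI": (82.8, 55.3, 43.6, 21.6),
}


def delta(method, baseline="ResNet+ERM"):
    """Per-column difference (method minus baseline), in percentage points."""
    return tuple(
        round(m - b, 1) for m, b in zip(table_a7[method], table_a7[baseline])
    )


for method in ("ResNet+PGI", "OccamResNet", "OccamResNet+PGI"):
    print(method, delta(method))
```

With this ordering, the differences for OccamResNet come out as a 0.9 point drop in-distribution and gains of 2.6, 7.8 and 2.2 points on the other three columns, matching the numbers stated in the text.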

*BAR. BAR consists of only 1941 samples, so we pre-trained ResNet-18 and OccamResNet-18 on 100 classes of ImageNet (obtaining 92.6% and 92.1% top-5 accuracies, respectively) before training on BAR. Without the pre-trained weights, both ResNet and OccamResNet obtain 15-20% lower test set accuracies on BAR than with pre-trained weights. As shown in Table A8, methods run on OccamResNet show gains in overall test set accuracy over the methods run on ResNet. The per-class standard deviations are larger (1.8-16.2%) than the standard deviations for the overall test set accuracies (0.7-2.4%). That is, across the five experiments run with different random seeds, the same methods run on the same architectures end up favoring different classes. We hypothesize that despite starting from the same initial conditions, i.e., the same pre-trained parameters, the randomness in the mini-batches drives the models to favor certain classes over others. Tuning the optimizer, e.g., switching to SGD, lowering the learning rate, or increasing the weight decay, can potentially help mitigate this unstable behavior.

Method | Overall | Per-class accuracies (six columns)
Results on ResNet-18
ResNet+ERM | 51.3 | 69.5 | 29.2 | 39.9 | 55.5 | 75.6 | 31.8
ResNet+SD [52] | 51.3 | 62.1 | 35.8 | 51.2 | 62.4 | 71.6 | 18.5
ResNet+UpWt | 51.1 | 61.7 | 43.9 | 42.3 | 52.3 | 67.9 | 28.2
ResNet+gDRO [56] | 38.7 | 49.5 | 40.3 | 44.0 | 39.9 | 41.7 | 13.5
ResNet+PGI [3] | 53.6 | 61.2 | 38.4 | 42.9 | 73.3 | 68.9 | 23.5
Results on OccamResNet-18
OccamResNet | 52.6 | 59.3 | 42.3 | 44.6 | 60.5 | 74.1 | 22.1
OccamResNet+SD [52] | 52.3 | 56.4 | 34.3 | 55.4 | 69.1 | 72.9 | 21.8
OccamResNet+UpWt | 52.2 | 57.9 | 35.7 | 51.8 | 64.3 | 71.8 | 27.4
OccamResNet+gDRO [56] | 52.9 | 51.2 | 42.8 | 52.3 | 63.5 | 74.2 | 25.3
OccamResNet+PGI [3] | 55.9 | 64.2 | 52.3 | 51.4 | 64.4 | 70.9 | 18.6
Table A8: Overall and per-class accuracies on *BAR

A.2 Early Exit Statistics

To examine the efficiency and robustness of each exit for all of the datasets, we present the exit %, accuracy on the exited samples and overall exit-wise accuracies on all the samples for OccamResNet-18 in Table A9. For *BiasedMNIST, the earliest exits and have high exit percentages of and respectively, alongside high accuracies on the exited samples: and respectively. These results show that OccamResNet has learned to identify and trigger earlier exits whenever appropriate. For *COCO, we observe large accuracies of 50.8% and 50.2% on the 13.9% and 49.6% samples exited from and respectively. The large percentage of samples exiting from shows that OccamResNet is capable of using the full network depth whenever needed. The accuracy on the samples that exited from is however low: 31.3%, even though the overall accuracy is 42.4%, indicating need for improvement in terms of training the earlier exit gates. We believe that tuning the training thresholds more comprehensively can potentially close this gap. Finally, for *BAR, more than half i.e., 55.9% of the samples exit from . The accuracies on the samples exited from and : 55.0% and 65.3% are higher than the overall accuracies computed on all the samples i.e., 47.4% and 52.3% respectively. This again shows the ability to exit early whenever appropriate and the ability to utilize the full network depth only for the remaining samples.

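Dataset | Metric | 1st exit (earliest) | 2nd exit | 3rd exit | 4th exit (final)
*BiasedMNIST | Exit % | 0.0 | 59.8 | 26.9 | 13.3
*BiasedMNIST | Acc. (exited) | N/A | 68.1 | 64.8 | 52.1
*BiasedMNIST | Acc. (all) | 12.7 | 65.1 | 65.5 | 65.5
*COCO | Exit % | 0.0 | 36.5 | 13.9 | 49.6
*COCO | Acc. (exited) | N/A | 31.3 | 50.8 | 50.2
*COCO | Acc. (all) | 10.0 | 42.4 | 43.4 | 41.4
*BAR | Exit % | 0.0 | 23.7 | 20.3 | 55.9
*BAR | Acc. (exited) | N/A | 55.0 | 65.3 | 46.6
*BAR | Acc. (all) | 26.5 | 47.4 | 52.3 | 52.5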
Table A9: Percentage samples exited: (Exit %), accuracy (Acc.) on exited samples and accuracy on all the samples for each exit ().
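As a rough consistency check on these statistics, the exit percentages and the accuracies on exited samples can be combined into a single accuracy figure. The sketch below assumes that each sample's final prediction is the one produced at the exit it left through, so the combined accuracy is the exit-percentage-weighted mean of the per-exit accuracies; the grouping of the twelve table values into three datasets of four exits each follows the order in which the text discusses them. The function name `combined_accuracy` is illustrative.

```python
# Values copied from Table A9, grouped per dataset, exits ordered by depth.
# None marks an exit that no sample left through (reported as N/A).
table_a9 = {
    "*BiasedMNIST": {"exit_pct": [0.0, 59.8, 26.9, 13.3],
                     "acc_exited": [None, 68.1, 64.8, 52.1]},
    "*COCO": {"exit_pct": [0.0, 36.5, 13.9, 49.6],
              "acc_exited": [None, 31.3, 50.8, 50.2]},
    "*BAR": {"exit_pct": [0.0, 23.7, 20.3, 55.9],
             "acc_exited": [None, 55.0, 65.3, 46.6]},
}


def combined_accuracy(exit_pct, acc_exited):
    """Weighted mean of per-exit accuracies, weights = fraction exited."""
    total = 0.0
    for pct, acc in zip(exit_pct, acc_exited):
        if acc is not None:
            total += (pct / 100.0) * acc
    return round(total, 1)


combined = {name: combined_accuracy(**cols) for name, cols in table_a9.items()}
print(combined)
```

The combined values land close to the OccamResNet accuracies reported in the other appendix tables, with small differences that are plausibly due to rounding of the table entries.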

A.3 Using a Comparable Number of Parameters

In the main paper, we compared OccamResNet-18 with 8M parameters (feature width = 48) and ResNet-18 with 12M parameters (feature width = 64). To examine if the lower number of parameters is helping, e.g., due to implicit regularization, we test an OccamResNet-18 with 12M parameters by setting the feature width to 58. As shown in Table A10, OccamResNet-18 with 12M parameters shows small improvements over OccamResNet-18 with 8M parameters on all the datasets. A more thorough analysis of model sizes and their impact on accuracy is an interesting study that we leave to future work.

Method | *BiasedMNIST | *COCO | *BAR
Results on standard ResNet-18 (12M parameters, feature width = 64)
ResNet+ERM | 36.8 ± 0.7 | 35.6 ± 1.0 | 51.3 ± 1.9
ResNet+SD [52] | 37.1 ± 1.0 | 35.4 ± 0.5 | 51.3 ± 2.3
ResNet+UpWt | 37.7 ± 1.6 | 35.2 ± 0.4 | 51.1 ± 1.9
ResNet+gDRO [56] | 19.2 ± 0.9 | 35.3 | 38.7 ± 2.2
ResNet+PGI [3] | 48.6 ± 0.7 | 42.7 ± 0.6 | 53.6 ± 0.9
Results on OccamResNet-18 (8M parameters, feature width = 48)
OccamResNet | 65.0 | 43.4 ± 1.0 | 52.6 ± 1.9
Results on OccamResNet-18-width-58 (12M parameters, feature width = 58)
OccamResNet | 65.9 | 43.8 ± 1.1 | 53.5 ± 2.2
Table A10: We train OccamResNet-18-width-58 with 12M parameters (feature width set to 58) to make the number of parameters comparable to ResNet-18 (12M parameters, feature width = 64).

A.4 Sample Complexity

It is desirable to have models that generalize despite being trained with a limited number of samples i.e., with reduced sample complexity.

Figure A6: Unbiased accuracies obtained when trained with the indicated percent of training data.

This is especially true for biased datasets, where reducing the train set size can amplify biases [72]. To study the ability to generalize when only a subset of the training data is available, we train ResNet and OccamResNet on subsets of *BiasedMNIST’s train set. As shown in Fig. A6, OccamResNet (without PGI) trained on only 25% of the data outperforms ResNet+ERM and ResNet+PGI trained on 100% of the data, showing lower sample complexity. When trained on only 10% of the training set, OccamResNet+PGI outperforms the rest of the methods by large margins, showing that OccamResNet with group labels shows the greatest efficacy in the low-shot data regime. When only 1% of the training data is available, all the methods obtain chance-level accuracies (i.e., near 10%), indicating a lack of sufficient training samples for classification. For the remaining subset sizes, methods run on OccamResNet-18 outperform the methods run on ResNet-18, showing improved sample complexity.
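One way to build such reduced training sets is to draw a fixed-seed, class-stratified subset of the training indices. The sketch below is illustrative only: the text does not say whether the subsets were stratified by class, and the function name `stratified_subset`, the seed, and the toy label list are all assumptions introduced here.

```python
import random
from collections import defaultdict


def stratified_subset(labels, fraction, seed=0):
    """Return sorted indices covering roughly `fraction` of each class.

    Assumption: per-class sampling with a fixed seed; the sampling scheme
    used for the experiments in the text is not specified.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep at least one sample per class so no class disappears.
        count = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, count))
    return sorted(chosen)


# Toy example: 1000 samples, 10 balanced classes (hypothetical labels).
toy_labels = [i % 10 for i in range(1000)]
quarter = stratified_subset(toy_labels, 0.25)
one_percent = stratified_subset(toy_labels, 0.01)
print(len(quarter), len(one_percent))
```

Fixing the seed keeps the subset identical across methods, so that differences in accuracy reflect the method and architecture rather than the particular samples drawn.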

A.5 Robustness to Varying Levels of Bias in *BiasedMNIST

Figure A7: Unbiased accuracies at varying bias levels () in *BiasedMNIST.

To gauge the robustness of models, it is important to examine their behavior across varying levels of bias. For this, we present the unbiased accuracies obtained by training separate models on training sets with . As shown in Fig. A7, all of the methods obtain similar accuracies at , where bias is not severe. OccamResNet+PGI outperforms the rest of the methods at . The gap between OccamResNet+PGI and the other methods is especially drastic for , indicating that when OccamResNet is trained to have similar prediction distributions across groups, it is capable of tackling highly biased training distributions too.

A.6 Evaluation on Other Architectures

Apart from ResNet, we also tested the proposed inductive biases on EfficientNet and MobileNet. The results are presented in Table A11. For both *BiasedMNIST and *COCO, Occam variants outperform the standard architectures, showing the efficacy of the proposed modifications.

Architecture            Number of parameters    *BiasedMNIST    *COCO
ResNet-18               12M                     36.8            35.6
OccamResNet-18          12M                     65.9            43.8
EfficientNet-B2         9M                      34.4            34.2
OccamEfficientNet-B2    9M                      59.2            39.2
MobileNet-v3            5.5M                    40.4            34.9
OccamMobileNet-v3       5.5M                    49.9            40.1
Table A11: Unbiased test set accuracies comparing the standard and Occam variants of ResNet-18, EfficientNet-B2 and MobileNet-v3 architectures, run without additional debiasing procedures.

a.7 OccamNet Implementation Details

In OccamNet, each exit module takes in feature maps produced by the corresponding block of the backbone network. Each exit begins with two convolutional layers for the initial pre-processing of the feature maps. The number of channels in these convolutional layers is determined by the number of channels in the feature maps produced by the backbone block and by a constant that is set to 32 for OccamResNet and OccamMobileNet and 16 for OccamEfficientNet. The pre-processed feature maps are fed into the CAM predictor and the exit gate. The CAM predictor is a convolutional layer with the number of output channels set to the number of classes. The exit gate consists of a 16-dimensional hidden ReLU layer followed by a sigmoid layer that predicts the exit probability.
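As a rough illustration of the exit gate described above, the following pure-Python sketch computes an exit probability with a 16-unit hidden ReLU layer followed by a sigmoid. The pooled input vector, the random weight initialization, and the names `make_gate` and `exit_probability` are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math
import random


def make_gate(in_dim, hidden_dim=16, seed=0):
    """Create randomly initialised gate weights (illustrative values only)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
    b2 = 0.0
    return w1, b1, w2, b2


def exit_probability(features, gate):
    """Hidden ReLU layer followed by a sigmoid that outputs a value in (0, 1)."""
    w1, b1, w2, b2 = gate
    # Hidden layer: affine map followed by ReLU.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    # Output layer: single logit squashed by a sigmoid.
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))
```

Here `features` stands for a fixed-length vector derived from the exit's feature maps (for example by spatial pooling, which is an assumption of this sketch).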

Exit Details.

For convenience, we specify the exit locations with reference to the PyTorch 1.7.1 implementations of the architectures. For ResNet, the residual layers that yield the same number of output channels are grouped together, and we refer to each of those groups as a ‘block’. ResNet-18 consists of 4 blocks, and we attach an exit to each of them. OccamResNet-18 uses a feature width of 58. Similarly, EfficientNet-B2 consists of 9 blocks, and we attach exits to a subset of them. We decrease the width multiplier from 1.1 in the standard architecture to 0.88 in OccamEfficientNet-B2 so that both models have a comparable number of parameters (9M). Finally, MobileNet-v3-large consists of 17 blocks, and we again attach exits to a subset of them. We decrease the width multiplier to 0.95 in OccamMobileNet-v3-large, so that both models have 5.5M parameters.

Modifications for *COCO. For *COCO, the images are small, so for ResNet-18 and OccamResNet-18, we replace the first convolutional layer with a smaller layer (different kernel size, padding and stride) and also remove the initial max pooling layer. For the standard and Occam variants of EfficientNet-B2 and MobileNet-v3, we scale up the image size, which improved the accuracy.

a.8 Hyperparameters and Other Settings

In Table A12, we present the details about optimizers, training epochs and other hyperparameters for each method on each dataset. The hyperparameter search grids for OccamResNet-18 and all of the comparison methods are shown below. For each dataset, we tune ResNet-18 and OccamResNet-18 separately.

Spectral Decoupling (SD). The output decay term weights a regularizer that penalizes the model predictions. We tune the output decay term via grid search.
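To make the role of the output decay term concrete, here is a minimal sketch that assumes the regularizer takes the commonly used form of a squared L2 penalty on the logits scaled by half the output decay. The function names and the per-sample formulation are illustrative assumptions, not the authors' code.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def sd_loss(logits, label, output_decay):
    """Cross-entropy plus (output_decay / 2) * squared L2 norm of the logits (assumed form)."""
    penalty = 0.5 * output_decay * sum(z * z for z in logits)
    return cross_entropy(logits, label) + penalty
```

With `output_decay` set to zero the loss reduces to plain cross-entropy, so larger values penalize large-magnitude predictions more strongly.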

Group Upweighting (UpWt). The group adjustment hyperparameter, i.e., the exponentiation factor, is used to balance the group-wise contributions as a function of the number of samples in each group. We tune this factor via grid search.
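One plausible reading of this weighting scheme is sketched below, assuming each group's weight is proportional to the inverse group size raised to the exponentiation factor. The normalization to a mean of one and the name `group_weights` are assumptions added for illustration.

```python
def group_weights(group_sizes, gamma):
    """Per-group loss weights proportional to (1 / n_g) ** gamma, rescaled to have mean 1."""
    raw = [(1.0 / n) ** gamma for n in group_sizes]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]
```

Under this assumption, a factor of zero weights all groups equally, while larger factors shift more weight toward rare groups.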

Group DRO (gDRO). Again, we search for the group adjustment hyperparameter. The group weight step size, which controls the group-wise loss weights, is also selected via grid search.
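The role of the group weight step size can be sketched with the exponentiated update commonly described for group DRO, in which groups with higher loss receive larger weights. This is a simplified illustration with hypothetical function names; it omits the group adjustment term and is not the authors' training loop.

```python
import math


def update_group_weights(weights, group_losses, step_size):
    """Scale each weight by exp(step_size * loss_g), then renormalise to sum to 1."""
    scaled = [q * math.exp(step_size * loss) for q, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def robust_loss(weights, group_losses):
    """Weighted sum of the group-wise losses, used as the training objective."""
    return sum(q * loss for q, loss in zip(weights, group_losses))
```

A larger step size makes the weights concentrate more quickly on the worst-performing groups.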

Predictive Group Invariance (PGI). The search range for the weight of the invariance penalty loss, i.e., the KLD loss between different groups from the same class, is: {1, 10, 50, 100, 500, 1000}.
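A minimal sketch of such a penalty follows, assuming it is computed as the KL divergence between the mean predicted class distributions of two groups that share a class label. The direction of the divergence, the averaging step, and the function names are assumptions made for illustration.

```python
import math


def mean_distribution(probs):
    """Average a list of per-sample class-probability vectors."""
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, with a small epsilon for stability."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def invariance_penalty(group_a_probs, group_b_probs, weight):
    """Weighted KL divergence between the mean predictions of two groups of one class."""
    p = mean_distribution(group_a_probs)
    q = mean_distribution(group_b_probs)
    return weight * kl_divergence(p, q)
```

The `weight` argument corresponds to the searched hyperparameter, for example one of the values in the grid above.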

OccamNets. We fix the CAM suppression loss weight to 0.1 and the bias-amplification factor to 3. We use an arithmetic progression for the mean-per-class accuracy thresholds, with the difference set to 0.1, i.e., the threshold is increased by 0.1 at every subsequent exit. We search for the initial training threshold.
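The arithmetic progression of thresholds can be written out as a short helper. The name `exit_thresholds` is hypothetical, and the rounding is added only to avoid floating-point noise in the printed values.

```python
def exit_thresholds(initial_threshold, num_exits, difference=0.1):
    """Return one threshold per exit, increasing by `difference` at each subsequent exit."""
    return [round(initial_threshold + j * difference, 10) for j in range(num_exits)]
```

For instance, with an initial threshold of 0.5 and four exits, `exit_thresholds(0.5, 4)` returns `[0.5, 0.6, 0.7, 0.8]`.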

Common to all the methods
Optimizer: Adam, SGD, Adam
Learning Rate (LR): 0.1, 5e-4
LR Decay Milestones: [50,70], [100,120,140], -
LR Decay Gamma: 0.1, 0.1, -
Weight Decay
Momentum: -, 0.9, -
Batch Size: 128, 64, 128
Epochs: 90, 150, 150

Spectral Decoupling (SD) on ResNet-18 [52]
Output Decay = 0.1

Spectral Decoupling (SD) on OccamResNet-18 [52]
Output Decay
Factor: 2, 1, 1

Group DRO (gDRO) on ResNet-18 [56]
Step size
Factor

Group DRO (gDRO) on OccamResNet-18 [56]
Step size
Factor

Predictive Group Invariance (PGI) on ResNet-18 [3]
Invariance Loss: 100, 50, 10

Predictive Group Invariance (PGI) on OccamResNet-18 [3]
Invariance Loss: 50, 1, 50

Initial Training Threshold: 0.5, 0.5, 0.1
CAM Suppression Loss Weight: 0.1, 0.1, 0.1

Table A12: Hyperparameters and other settings used for each method on all of the datasets.
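The LR decay milestones and gamma in Table A12 describe a step-wise schedule. A minimal sketch follows, assuming the learning rate is multiplied by gamma once at each milestone epoch; the function name `lr_at_epoch` and the base learning rate passed to it are placeholders rather than values from the table.

```python
def lr_at_epoch(base_lr, epoch, milestones, gamma=0.1):
    """Multiply the base learning rate by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed
```

With milestones `[50, 70]` and gamma 0.1, the rate is unchanged before epoch 50, ten times smaller from epoch 50, and a hundred times smaller from epoch 70.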

a.9 Issues Training with GroupDRO (gDRO)

We find that gDRO on *BiasedMNIST and ResNet+gDRO on *BAR obtain accuracies lower than ResNet+ERM. To alleviate this issue, we tried tuning the hyperparameters by lowering the learning rates and increasing the weight decays, as suggested in [56], yet gDRO still obtained low accuracies. We believe the challenge stems from the large number of dataset groups in *BiasedMNIST and the small training set size of *BAR. While optimizing gDRO under such conditions remains a challenge, gDRO run on OccamResNet showed accuracy gains of 10.6% on *BiasedMNIST and 14.2% on *BAR over gDRO run on ResNet, indicating that OccamNets also offer a better training process.

a.10 Augmentations

*BiasedMNIST. We do not perform any augmentation.

*COCO. Following [3], we apply random cropping by padding the original images by 8 pixels on all the sides (reflection padding) and taking random crops. We also apply random horizontal flips.
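The pad-then-crop augmentation can be sketched in pure Python on a 2D list of pixel values. This sketch assumes the crop has the same size as the original image and that reflection padding does not repeat the edge pixel; the function names are hypothetical, and horizontal flipping is omitted.

```python
import random


def reflect_pad(image, pad):
    """Reflect-pad a 2D list by `pad` pixels on every side (edge pixel not repeated)."""
    def pad_row(row):
        left = row[1:pad + 1][::-1]
        right = row[-pad - 1:-1][::-1]
        return left + row + right

    rows = [pad_row(r) for r in image]
    top = rows[1:pad + 1][::-1]
    bottom = rows[-pad - 1:-1][::-1]
    return top + rows + bottom


def random_crop(image, size, rng=random):
    """Take a random size x size crop from a 2D list."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]
```

Calling `random_crop(reflect_pad(image, 8), len(image))` then yields a randomly shifted view of a square image at its original resolution; `reflect_pad` requires each image side to be longer than `pad`.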

*BAR. We apply random resized crops using a scale range of 0.7 to 1.0 and aspect ratios starting from 1.0. We also apply random horizontal flips.