Concept Bottleneck Models

by Pang Wei Koh, et al.

We seek to learn models that we can interact with using high-level concepts: if the model did not think there was a bone spur in the x-ray, would it still predict severe arthritis? State-of-the-art models today do not typically support the manipulation of concepts like "the existence of bone spurs", as they are trained end-to-end to go directly from raw input (e.g., pixels) to output (e.g., arthritis severity). We revisit the classic idea of first predicting concepts that are provided at training time, and then using these concepts to predict the label. By construction, we can intervene on these concept bottleneck models by editing their predicted concept values and propagating these changes to the final prediction. On x-ray grading and bird identification, concept bottleneck models achieve competitive accuracy with standard end-to-end models, while enabling interpretation in terms of high-level clinical concepts ("bone spurs") or bird attributes ("wing color"). These models also allow for richer human-model interaction: accuracy improves significantly if we can correct model mistakes on concepts at test time.




1 Introduction

Suppose that a radiologist is collaborating with a machine learning model to grade the severity of knee osteoarthritis. She might ask why the model made its prediction—did it deem the space between the knee joints too narrow? Or she might seek to intervene on the model—if she told it that the x-ray showed a bone spur, would its prediction change?

State-of-the-art models today do not typically support such queries: they are end-to-end models that go directly from raw input (e.g., pixels) to target (e.g., arthritis severity), and we cannot easily interact with them using the same high-level concepts that practitioners reason with, like “joint space narrowing” or “bone spurs”.

We approach this problem by revisiting the simple, classic idea of first predicting an intermediate set of human-specified concepts $c$ like “joint space narrowing” and “bone spurs”, then using $c$ to predict the target $y$. In this paper, we refer to such models as concept bottleneck models. These models are trained on data points $(x, c, y)$, where the input $x$ is annotated with both concepts $c$ and target $y$. At test time, they take in an input $x$, predict concepts $\hat{c}$, and then use those concepts to predict the target $\hat{y}$ (Figure 1).

Figure 1: We study concept bottleneck models that first predict an intermediate set of human-specified concepts $c$, then use $c$ to predict the final output $y$. We illustrate the two applications we consider: knee x-ray grading and bird identification.

Earlier versions of concept bottleneck models were overtaken in predictive accuracy by end-to-end neural networks (e.g., kumar2009attribute for face recognition and lampert2009learning for animal identification), leading to a perceived tradeoff between accuracy and interpretability in terms of concepts. In this paper, we propose a straightforward method for turning any end-to-end neural network into a concept bottleneck model, given concept annotations at training time: we simply resize one of the layers to match the number of concepts provided, and add an intermediate loss that encourages the neurons in that layer to align component-wise to the provided concepts. We show that concept bottleneck models trained in this manner can achieve task accuracies competitive with or even higher than standard models. We emphasize that concept annotations are not needed at test time; the model predicts the concepts, then uses the predicted concepts to make a final prediction.

Importantly, and unlike standard end-to-end models, these bottleneck models support interventions on concepts: we can edit the concept predictions $\hat{c}$ and propagate those changes to the target prediction $\hat{y}$. Interventions enable richer human-model interaction: e.g., if the radiologist realizes that what the model thinks is a bone spur is actually an artifact, she can update the model’s prediction by directly changing the corresponding value of $\hat{c}$. When we simulate this injection of human knowledge by partially correcting concept mistakes that the model makes at test time, we find that accuracy improves substantially beyond that of a standard model.

Interventions also make concept bottleneck models interpretable in terms of high-level concepts: by manipulating concepts and observing the model’s response, we can obtain counterfactual explanations like “if the model did not think the joint space was too narrow for this patient, then it would not have predicted severe arthritis”. In contrast, prior work on explaining end-to-end models in terms of high-level concepts has been restricted to post-hoc interpretation of already-trained end-to-end models: for example, predicting concepts from hidden layers (kim2018interpretability) or measuring the correlation of individual neurons with concepts (bau2017network).

The validity of interventions on a model depends on the alignment between its predicted concepts $\hat{c}$ and the true concepts $c$. We can estimate this alignment by measuring the model’s concept accuracy on a held-out validation set. (Footnote 1: With the usual caveats of measuring accuracy: in practice, the validation set might be skewed such that models that learn spurious correlations can still achieve high concept accuracy.) A model with perfect concept accuracy across all possible inputs makes predictions $\hat{c}$ that align with the true concepts $c$. Conversely, if a model has low concept accuracy, then the model’s predictions $\hat{c}$ need not match the true concepts, and we would not expect interventions to lead to meaningful results.

Contributions. We systematically study variants of concept bottleneck models and contrast them with standard end-to-end models in different settings, with a focus on the previously-unexplored ability of concept bottleneck models to support concept interventions. Our goal is to characterize concept bottleneck models more fully: Is there a tradeoff between task accuracy and concept interpretability? Do interventions at test time help model accuracy, and is concept accuracy a good indicator of the ability to effectively intervene? Do different ways of training bottleneck models lead to significantly different outcomes in intervention?

We evaluate concept bottleneck models on the two applications in Figure 1: the osteoarthritis grading task (nevitt2006osteoarthritis) and a fine-grained bird species identification task (wah2011cub). On these tasks, we show that bottleneck models are competitive with standard end-to-end models while also attaining high concept accuracies. In contrast, the concepts cannot be predicted with high accuracy from linear combinations of neurons in a standard black-box model, making it difficult to do post-hoc interpretation in terms of concepts like in kim2018interpretability. We demonstrate that we can substantially improve model accuracy by intervening on these bottleneck models at test time to correct model mistakes on concepts. Finally, we show that bottleneck models guided to learn the right concepts can also be more robust to covariate shifts.

2 Related work

Concept bottleneck models. Models that bottleneck on human-specified concepts—where the model first predicts the concepts, then uses only those predicted concepts to make a final prediction—have been previously used for specific applications (kumar2009attribute; lampert2009learning). Early versions did not use end-to-end neural networks, which soon overtook them in predictive accuracy. Consequently, bottleneck models have historically been more popular for few-shot learning settings, where shared concepts might allow generalization to unseen contexts, rather than the standard supervised setting we consider here.

More recently, deep neural networks with concept bottlenecks have re-emerged as targeted tools for solving particular tasks, e.g., de2018clinically for retinal disease diagnosis, yi2018neural for visual question-answering, and bucher2018semantic for content-based image retrieval. losch2019interpretability and chen2020concept also explore learning concept-based models via auxiliary datasets.

Concept bottlenecks differ from traditional feature engineering: we learn mappings from raw input to high-level concepts, whereas feature engineering constructs low-level features that can be computed by handwritten functions.

Concepts as auxiliary losses or features. Non-bottleneck models that use human-specified concepts commonly use them in auxiliary objectives in a multi-task setup, or as auxiliary features; examples include using object parts (huang2016part; zhou2018interpretable), parse trees (zelenko2003kernel; bunescu2005shortest), or natural language explanations (murty2020expbert). However, these models do not support intervention on concepts. For instance, consider a multi-task model $x \to (c, y)$, with the concepts $c$ used in an auxiliary loss; simply intervening on $\hat{c}$ at test time will not affect the model’s prediction of $y$. Interventions do affect models that use $c$ as auxiliary features by first predicting $x \to c$ and then predicting $(x, c) \to y$ (e.g., sutton2005joint), but we cannot intervene in isolation on a single concept because of the side channel from $x \to y$.

Causal models. There has been extensive work on learning models that capture causal relationships in the world (pearl2000causality). While concept bottleneck models can represent causal relationships between $(x, c, y)$ if the set of concepts $c$ is chosen appropriately, they are flexible and do not require $c$ to cause $y$. This is an advantage in settings where it is difficult or impossible to specify the causal graph. For example, imagine that arthritis grade ($y$) is highly correlated with swelling ($c$). In this case, $c$ does not cause $y$ (hypothetically, if one could directly induce swelling in the patient, it would not affect whether they had osteoarthritis). However, concept bottleneck models can still exploit the fact that $c$ is highly predictive for $y$, and intervening on the model by replacing the predicted concept value $\hat{c}$ with the true value $c$ could still improve accuracy.

A central claim of this paper is that we can intervene on concept bottleneck models. Intervention is a standard notion in causal inference, and we emphasize that we intervene on the value of a predicted concept within the model, not on that concept in reality. In other words, we are interested in how changing the model’s predicted concept values $\hat{c}$ would affect its final prediction $\hat{y}$, and not whether intervening on the true concept value $c$ in reality (e.g., by inducing swelling) would affect the true label $y$. See, e.g., goyal2019explaining; o2020generative for discussions on causality in the context of interpreting models with concepts.

Post-hoc concept analysis. Many methods have been developed to interpret models post-hoc, including recent work on using human-specified concepts to generate explanations (bau2017network; kim2018interpretability; zhou2018interpretable; ghorbani2019towards). These techniques rely on models automatically learning those concepts despite not having explicit knowledge of them, and can be particularly useful when paired with models that attempt to learn more interpretable representations (bengio2013representation; chen2016infogan; higgins2017beta; melis2018towards). However, post-hoc methods can fail when the models do not learn these concepts, and also do not admit straightforward interventions on concepts. In this work, we instead directly guide models to learn these concepts at training time.

3 Setup

Consider predicting a target $y \in \mathbb{R}$ from input $x \in \mathbb{R}^d$; for simplicity, we present regression first and discuss classification later. We observe training points $\{(x^{(i)}, y^{(i)}, c^{(i)})\}_{i=1}^{n}$, where $c \in \mathbb{R}^k$ is a vector of $k$ concepts. We consider bottleneck models of the form $f(g(x))$, where $g: \mathbb{R}^d \to \mathbb{R}^k$ maps an input $x$ into the concept space (“bone spurs”, etc.), and $f: \mathbb{R}^k \to \mathbb{R}$ maps concepts into a final prediction (“arthritis severity”). We call these concept bottleneck models because their prediction $\hat{y} = f(g(x))$ relies on the input $x$ entirely through the bottleneck $\hat{c} = g(x)$, which we train to align component-wise to the concepts $c$. We define task accuracy as how accurately $f(g(x))$ predicts $y$, and concept accuracy as how accurately $g(x)$ predicts $c$ (averaged over each concept). We will refer to $g(\cdot)$ as predicting $x \to c$, and to $f(\cdot)$ as predicting $c \to y$.

In our work, we systematically study different ways of learning concept bottleneck models. Let $L_{C_j}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_{+}$ be a loss function that measures the discrepancy between the predicted and true $j$-th concept, and let $L_Y: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_{+}$ measure the discrepancy between predicted and true targets. We consider the following ways to learn a model $(\hat{f}, \hat{g})$:

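  1. The independent bottleneck learns $\hat{f}$ and $\hat{g}$ independently: $\hat{f} = \arg\min_{f} \sum_{i} L_Y(f(c^{(i)}); y^{(i)})$, and $\hat{g} = \arg\min_{g} \sum_{i,j} L_{C_j}(g_j(x^{(i)}); c_j^{(i)})$. While $\hat{f}$ is trained using the true $c$, at test time it still takes $\hat{g}(x)$ as input.

  2. The sequential bottleneck first learns $\hat{g}$ in the same way as above. It then uses the concept predictions $\hat{g}(x)$ to learn $\hat{f} = \arg\min_{f} \sum_{i} L_Y(f(\hat{g}(x^{(i)})); y^{(i)})$.

  3. The joint bottleneck minimizes the weighted sum $\hat{f}, \hat{g} = \arg\min_{f,g} \sum_{i} \left[ L_Y(f(g(x^{(i)})); y^{(i)}) + \sum_{j} \lambda L_{C_j}(g_j(x^{(i)}); c_j^{(i)}) \right]$ for some $\lambda > 0$.

  4. The standard model ignores concepts and directly minimizes $\hat{f}, \hat{g} = \arg\min_{f,g} \sum_{i} L_Y(f(g(x^{(i)})); y^{(i)})$.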

The hyperparameter $\lambda$ in the joint bottleneck controls the tradeoff between concept vs. task loss. The standard model is equivalent to taking $\lambda \to 0$, while the sequential bottleneck can be viewed as taking $\lambda \to \infty$.
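As a rough illustration of how the concept weight enters the joint objective, the sketch below evaluates a task loss plus a weighted sum of per-concept losses on one toy data point. It assumes squared error for both losses; the names `joint_objective`, `toy_g`, `toy_f`, and the toy data are hypothetical and are not taken from the paper.

```python
def squared_error(pred, true):
    """Squared-error loss, assumed here for both the task and every concept."""
    return (pred - true) ** 2


def joint_objective(f, g, data, lam):
    """Sum over points of L_Y(f(g(x)), y) + lam * sum_j L_Cj(g_j(x), c_j).

    `data` is a list of (x, c, y) triples; `lam` weights the concept loss.
    """
    total = 0.0
    for x, c, y in data:
        c_hat = g(x)  # predicted concepts (the bottleneck)
        task_loss = squared_error(f(c_hat), y)
        concept_loss = sum(squared_error(p, t) for p, t in zip(c_hat, c))
        total += task_loss + lam * concept_loss
    return total


# Hypothetical toy models: g maps a 2-d input to 2 concepts, f maps concepts to y.
def toy_g(x):
    return [x[0], x[1]]


def toy_f(c):
    return 2 * c[0] + c[1]


toy_data = [((1.0, 2.0), (3.0, -1.0), 5.0)]

# With lam = 0 only the task loss remains, as in the standard model.
standard_loss = joint_objective(toy_f, toy_g, toy_data, lam=0.0)
joint_loss = joint_objective(toy_f, toy_g, toy_data, lam=0.5)
print(standard_loss, joint_loss)
```

Increasing `lam` makes concept errors dominate the total, which is the sense in which a large weight pushes the bottleneck toward the annotated concepts.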

We propose a simple scheme to turn an end-to-end neural network into a concept bottleneck model: simply resize one of its layers to have $k$ neurons to match the number of concepts $k$, then choose one of the training schemes above.

Classification. In classification, $f$ and $g$ compute real-valued scores (e.g., concept logits $\hat{\ell} = \hat{g}(x) \in \mathbb{R}^k$) that we then turn into a probabilistic prediction (e.g., $P(\hat{c}_j = 1) = \sigma(\hat{\ell}_j)$ for logistic regression). This does not change the independent bottleneck, since $f$ is directly trained on the binary-valued $c$. For the sequential and joint bottlenecks, we connect $f$ to the logits $\hat{\ell}$, i.e., we compute $P(\hat{c}_j = 1) = \sigma(\hat{g}_j(x))$ and $P(\hat{y}) = f(\hat{g}(x))$.

Model                      RMSE (OAI)       Error (CUB)
Independent                0.435 ± 0.024    0.240 ± 0.012
Sequential                 0.418 ± 0.004    0.243 ± 0.006
Joint                      0.418 ± 0.004    0.199 ± 0.006
Standard                   0.441 ± 0.006    0.175 ± 0.008
Standard, no bottleneck    0.443 ± 0.008    0.173 ± 0.004
Multitask                  0.425 ± 0.010    0.162 ± 0.002
Table 1: Task errors with ±2SD over random seeds. Overall, concept bottleneck models are competitive with standard models.

4 Benchmarking bottleneck model accuracy

We start by showing that concept bottleneck models achieve both competitive task accuracy and high concept accuracy. While this is necessary for bottleneck models to be viable in practice, their strength is that we can interpret and intervene on them; we explore those aspects in Sections 5 and 6.

4.1 Applications

We consider an x-ray grading and a bird identification task. Their corresponding datasets are annotated with high-level concepts that practitioners (radiologists/birders) use to reason about their decisions. (Dataset details in Appendix A.)

X-ray grading (OAI). We use knee x-rays from the Osteoarthritis Initiative (OAI) (nevitt2006osteoarthritis), which compiles radiological and clinical data on patients at risk of knee osteoarthritis (Figure 1-Top; data points). Given an x-ray, the task is to predict the Kellgren-Lawrence grade (KLG), a 4-level ordinal variable assessed by radiologists that measures the severity of osteoarthritis, with higher scores denoting more severe disease. (Footnote 2: Due to technicalities in the data collection protocol, we use a modified version of KLG where the first two grades are combined.) As concepts, we use ordinal variables describing joint space narrowing, bone spurs, calcification, etc.; these clinical concepts are also assessed by radiologists and used directly in the assessment of KLG (kellgren1957radiological).

Bird identification (CUB). We use the Caltech-UCSD Birds-200-2011 (CUB) dataset (wah2011cub), which comprises bird photographs (Figure 1-Bot). The task is to classify the correct bird species out of 200 possible options. As concepts, we use binary bird attributes representing wing color, beak shape, etc. Because the provided concepts are noisy (see Appendix A), we denoise them by majority voting, e.g., if more than 50% of crows have black wings in the data, then we set all crows to have black wings. In other words, we use class-level concepts and assume that all birds of the same species in the training data share the same concept annotations. In contrast, the OAI dataset uses instance-level concepts: examples with the same $y$ can have different concept annotations $c$.
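The majority-vote denoising described above can be sketched as follows. This is an illustrative reading only: the function name `class_level_concepts`, the 0.5 threshold (what "majority" suggests), and the toy annotations are assumptions, not the authors' preprocessing code.

```python
from collections import defaultdict


def class_level_concepts(labels, concepts, threshold=0.5):
    """Return {label: denoised concept vector} via per-class majority vote.

    A concept is set to 1 for a class if more than `threshold` of that
    class's examples have it (threshold=0.5 is an assumption here).
    """
    sums = {}
    counts = defaultdict(int)
    for label, vec in zip(labels, concepts):
        if label not in sums:
            sums[label] = [0] * len(vec)
        for j, value in enumerate(vec):
            sums[label][j] += value
        counts[label] += 1
    return {
        label: [1 if s / counts[label] > threshold else 0 for s in sums[label]]
        for label in sums
    }


# Hypothetical toy annotations: concept 0 = "black wings", concept 1 = "long beak".
labels = ["crow", "crow", "crow", "gull"]
noisy = [[1, 0], [1, 1], [0, 0], [0, 1]]

by_class = class_level_concepts(labels, noisy)
# Every example receives the concept vector of its class.
denoised = [by_class[label] for label in labels]
print(denoised)
```

After this step, two examples with the same species label always carry identical concept annotations, which is what makes the concepts class-level rather than instance-level.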

Models. For each task, we construct concept bottleneck models by adopting model architectures and hyperparameters from previous high-performing approaches; see Appendix B for experimental details. For the joint bottleneck model, we search over the task-concept tradeoff hyperparameter $\lambda$ and report results for the model that has the highest task accuracy while maintaining high concept accuracy on the validation set ( for OAI and for CUB). We model x-ray grading as a regression problem (minimizing mean squared error) on both the KLG target $y$ and concepts $c$, following pierson2019using; we fine-tune a pretrained ResNet-18 model to predict $x \to c$ (he2016resnet), and use a small 3-layer MLP for $c \to y$. We model bird identification as multi-class classification for the species $y$ and binary classification for the concepts $c$. Following cui2018large, we fine-tune an Inception-v3 network (szegedy2016rethinking) to predict $x \to c$, and use a single linear layer (logistic regression) to predict $c \to y$.

4.2 Task and concept accuracies

Figure 2: Left: The shaded regions show the optimal frontier between task vs. concept error. We find little trade-off; models can do well on both task and concept prediction. For standard models, we plot the concept error of the mean predictor (OAI) or random predictor (CUB). Mid: Histograms of how accurate individual concepts are, averaged over multiple random seeds. In our tasks, each individual concept can be accurately predicted by bottleneck models. Right: Data efficiency curves. Especially on OAI, bottleneck models can achieve the same task accuracy as standard models with many fewer training points.

Model               RMSE (OAI)       Error (CUB)
Independent         0.529 ± 0.004    0.034 ± 0.002
Sequential          0.527 ± 0.004    0.034 ± 0.002
Joint               0.543 ± 0.014    0.031 ± 0.000
Standard [probe]    0.680 ± 0.038    0.093 ± 0.003
SENN [probe]        0.676 ± 0.026    -
Table 2: Average concept errors. Bottleneck models have lower error than linear probes on standard and SENN models.

Table 1 shows that concept bottleneck models achieve competitive task accuracy with standard black-box models on both tasks, despite the bottleneck constraint (all numbers reported are on a held-out test set). On OAI, joint and sequential bottlenecks are actually slightly better in root mean square error (RMSE) than the standard model, and on CUB, sequential and independent bottlenecks are slightly worse in 0-1 error; all other models perform similarly. (Footnote 3: To contextualize RMSE, our modified KLG ranges from 0-3, and average Pearson correlations between each predicted and true concept are for all bottleneck models.) From Table 1, joint bottlenecks can do slightly better than sequential bottlenecks, which in turn can do better than independent bottlenecks, though this difference is not consistent. Compared to independent bottlenecks, sequential bottlenecks allow the $c \to y$ part of the model to adapt to how well it can predict $x \to c$; and joint bottlenecks further allow the model’s version of the concepts to be refined to improve predictive performance.

At the same time, the bottleneck models predict each individual concept accurately (Figure 2), and they achieve low average error across all concepts (Table 2). As discussed in Section 1, low concept error suggests that the model’s concepts are aligned with the true concepts, which in turn suggests that we might intervene effectively on them; we will explore this in Section 6.

Overall, we do not observe a tradeoff between high task accuracy and high concept accuracy: pulling the bottleneck layer towards the concepts $c$ does not substantially affect the model’s ability to predict $y$ in our tasks, even when the bottleneck is trained jointly. We illustrate this in Figure 2-Left, which plots the task vs. concept errors of each model.

Additional baselines. We ran two further baselines to determine if the bottleneck architecture impacted model performance. First, standard models in the literature do not use concept bottlenecks, so we trained a variant of the standard model without the bottleneck layer (directly using a ResNet-18 or Inception-v3 model to predict $x \to y$); this performed similarly to the standard bottleneck model (“Standard, no bottleneck” in Table 1). Second, we tested a typical multi-task setup using an auxiliary loss to encourage the activations of the last layer to be predictive of the concepts $c$, hyperparameter searching across different weightings of this auxiliary loss. These models also performed comparably (“Multitask” in Table 1), but since they do not support concept interventions, we focus on comparing standard vs. concept bottleneck models in the rest of the paper.

Data efficiency. Another way to benchmark different models is by measuring data efficiency, i.e., how many training points they need for a desired level of accuracy. To study this, we subsampled the training and validation data and retrained each model (details in Appendix B.4). Concept bottleneck models are particularly effective on OAI: the sequential bottleneck model with of the full dataset performs similarly to the standard model. On CUB, the joint bottleneck and standard models are more accurate throughout, with the joint model slightly more accurate in lower data regimes (Figure 2-Right).

A drawback of concept bottleneck models is that they require annotated concepts at training time. However, if the set of concepts is good enough, then fewer training examples might be required to achieve a desired accuracy level (as in OAI). This allows model developers to trade off the cost of acquiring more detailed annotations against the cost of acquiring new training examples. The trade is helpful when new training examples are expensive to acquire, e.g., in medical settings where adding training points might entail invasive or expensive procedures on patients, while the incremental cost of asking a doctor to add annotations to data points that they already need to look at might be low.

5 Benchmarking post-hoc concept analysis

Concept bottleneck models are trained to have a bottleneck layer that aligns component-wise with the human-specified concepts $c$. For any test input $x$, we can read out predicted concepts directly from the bottleneck layer, as well as intervene on concepts by manipulating the predicted concepts $\hat{c}$ and inspecting how the final prediction $\hat{y}$ changes. This enables explanations like “if the model did not think the joint space was too narrow for this patient, then it would not have predicted severe arthritis”. An alternative approach to interpreting models in terms of concepts is post-hoc analysis: take an existing model trained to directly predict $x \to y$ without any concepts, and use a probe to recover the known concepts from the model’s activations. For example, bau2017network measure the correlation of individual neurons with concepts, while kim2018interpretability use a linear probe to predict concepts with linear combinations of neurons.

Post-hoc analysis does not enable interventions on concepts: even if we find a linear combination of neurons that predicts a concept well, it is unclear how to modify the model’s activations to change what it thinks of that concept alone. Without this ability to intervene, interpretation in terms of concepts is suggestive but fraught: even if we can say that “the model thinks the joint space is narrow”, it is hard to test if that actually affects its final prediction. This is an important limitation of post-hoc interpretation. Nonetheless, setting this point aside for a moment, post-hoc interpretations require high concept accuracy. We therefore evaluate how accurately probes can predict concepts post-hoc.

Following kim2018interpretability, we trained a linear probe to predict each concept from the layers of the standard model (see Appendix B). We found that these linear probes have lower concept accuracy compared to simply reading concepts out from a bottleneck model (Table 2). On OAI, the best-performing linear probe achieved an average concept RMSE of 0.680, vs. 0.527 to 0.543 in the bottleneck models (Table 2); average Pearson correlation dropped to from . On CUB, the linear probe achieved an average concept error of 0.093 instead of 0.031 to 0.034 (Table 2); average F1 score dropped to from .
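For readers who want to reproduce this kind of comparison, the sketch below shows one plausible way to compute an average concept RMSE and an average Pearson correlation from per-concept predictions. It assumes that each metric is computed per concept and then averaged over concepts; the helper names and toy values are hypothetical, and the exact averaging used in the paper may differ.

```python
import math


def rmse(pred, true):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def pearson(pred, true):
    """Pearson correlation between two equal-length sequences."""
    n = len(true)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)


def average_over_concepts(metric, pred_by_concept, true_by_concept):
    """Apply `metric` to each concept separately, then average the scores."""
    scores = [metric(p, t) for p, t in zip(pred_by_concept, true_by_concept)]
    return sum(scores) / len(scores)


# Hypothetical toy data: 2 concepts, 4 validation examples each.
true_by_concept = [[0, 1, 2, 3], [1, 1, 0, 0]]
pred_by_concept = [[1, 2, 3, 4], [1, 1, 0, 0]]  # concept 0 is off by a constant

avg_rmse = average_over_concepts(rmse, pred_by_concept, true_by_concept)
avg_corr = average_over_concepts(pearson, pred_by_concept, true_by_concept)
print(avg_rmse, avg_corr)
```

The toy values illustrate why both metrics are worth reporting: a constant offset raises RMSE while leaving the correlation at 1.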

We also tested if we could predict concepts post-hoc from models designed to learn an interpretable mapping from $x \to y$. Specifically, we evaluated self-explaining neural networks (SENN) (melis2018towards). As with standard models, SENN does not use any pre-specified concepts; it learns an input representation encouraged to be interpretable through diversity and smoothness constraints. However, linear probes on SENN also had lower concept accuracy on OAI (0.676 concept RMSE in Table 2; see Appendix B). (Footnote 4: We were unable to run SENN on CUB because the default implementation was too memory-intensive; CUB has many more classes/concepts than the tasks SENN was originally used for.)

The comparative difficulty in predicting concepts post-hoc suggests that if we have prior knowledge of what concepts practitioners would use, then it helps to directly train models with these concepts instead of hoping to recover them from a model trained without knowledge of these concepts. See chen2020concept for a related discussion.

6 Test-time intervention

The ability to intervene on concept bottleneck models enables human users to have richer interactions with them. For example, if a radiologist disagrees with a model’s prediction, she would not only be able to inspect the predicted concepts, but also simulate how the model would respond to changes in those predicted concepts. This kind of test-time intervention can be particularly useful in high-stakes settings like medicine, or in other settings where it is easier for users to identify the concepts (e.g., wing color) than the target (exact species of bird).

We envision that in practice, domain experts interacting with the model could intervene to “fix” potentially incorrect concepts. To study this setting, we use an oracle that can query the true value of any concept for a test input. Figure 3 shows several examples of interventions that lead to the model making a correct prediction.

Figure 3: Successful examples of test-time intervention, where intervening on a single concept corrects the model prediction. Here, we show examples from independent bottleneck models. Right: For CUB, we intervene on concept groups instead of individual binary concepts. The sample birds on the right illustrate how the intervened concept distinguishes between the original and new predictions.

Figure 4: Test-time intervention results. Left: Intervention substantially improves task accuracy, except for the control model, which is a joint model that heavily prioritizes label accuracy over concept accuracy. Mid: Replacing $c \to y$ with a linear model degrades effectiveness. Right: Intervention improves task accuracy except for the joint model. Connecting $c \to y$ to probabilities rescues intervention but degrades normal accuracy.

6.1 Intervening on OAI

On OAI, we intervene on a concept by simply replacing the model’s corresponding predicted concept $\hat{c}_j$ with its true value $c_j$ (Figure 3-Left). To simplify testing multiple interventions, we use an input-independent ordering over concepts computed from the held-out validation set (i.e., we always intervene on some concept $c_{i_1}$ first, followed by $c_{i_2}$, etc.; see Appendix B).
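The replacement procedure can be sketched in a few lines. The example below is illustrative only: `intervene`, the linear stand-in `toy_f` for the concepts-to-target model (the paper uses a 3-layer MLP), the concept values, and the ordering are all hypothetical.

```python
def intervene(c_hat, c_true, order, num_queries):
    """Replace the first `num_queries` concepts in `order` with their true values."""
    c_new = list(c_hat)
    for j in order[:num_queries]:
        c_new[j] = c_true[j]
    return c_new


# Hypothetical stand-in for the concepts-to-target model.
def toy_f(c):
    return 0.5 * c[0] + 1.0 * c[1] + 0.25 * c[2]


c_hat = [2.0, 0.0, 1.0]   # predicted concepts for one test x-ray
c_true = [2.0, 1.0, 3.0]  # oracle values
order = [1, 2, 0]         # fixed, input-independent ordering (assumed given)

# Re-run the target prediction after each additional oracle query.
predictions = []
for num_queries in range(len(order) + 1):
    c_new = intervene(c_hat, c_true, order, num_queries)
    predictions.append(toy_f(c_new))
print(predictions)
```

Only the concepts-to-target part is re-evaluated after each edit; the input-to-concepts part is untouched, which is what makes the intervention cheap at test time.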

Test-time intervention on OAI significantly improved task accuracy: e.g., querying for just 2 concepts reduces task RMSE from to (Figure 4-Left). Neural networks similar to ours have been previously noted to be comparable with individual radiologist performance on grading KLG (compared to the consensus grade, which we use as ground truth; see tiulpin2018automatic; pierson2019using). As the concept values used for intervention mostly come from a single radiologist instead of a consensus reading (see Appendix A), these results hint that a single radiologist collaborating with bottleneck models might be able to outperform either the radiologist or model alone, though more careful human studies would be needed to evaluate that.

The independent bottleneck achieved better test error when all concepts are replaced than the sequential or joint bottlenecks (Figure 4-Left). This is expected; when all concepts are replaced, the $x \to c$ part of the model is irrelevant, and all that matters is the $c \to y$ part. Recall that in the independent bottleneck, $c \to y$ is trained using the true $c$, which is what we replace the predicted concepts $\hat{c}$ with. In contrast, in the sequential and joint models, $c \to y$ is trained using the predicted $\hat{c}$, which in general will have a different distribution from the true $c$. This example illustrates a trade-off between intervenability and task accuracy: the independent bottleneck performs worse without interventions (Table 1), but better with interventions.

To better understand what influences intervention effectiveness, we ran two ablations. First, we found that intervention can fail in joint models when $\lambda$ is too small (recall that the smaller $\lambda$ is, the more we prioritize fitting $y$ over $c$ in training). Specifically, the joint model with learned a concept representation that was not as well-aligned with the true concepts, and replacing $\hat{c}$ with the true $c$ at test time slightly increased test error (“control” model in Figure 4-Left). Second, we changed the $c \to y$ model from the 3-layer MLP used throughout the paper to a single linear layer. Test-time intervention was less effective here compared to the non-linear counterparts (Figure 4-Mid), even though task and concept accuracies were similar before intervention (concept RMSEs of the sequential and independent models are not even affected by the change in $c \to y$).

Altogether, these results suggest that task and concept accuracies alone are insufficient for determining how effective test-time intervention will be on a model. Different inductive biases in different models control how effectively they can handle distribution shifts from $\hat{c}$ (pre-intervention) to $c$ (post-intervention). Even without this distribution shift, as in the case of the linear vs. non-linear independent bottlenecks, the expressivity of $c \to y$ has a large effect on intervention effectiveness. Moreover, it is possible that the average concept accuracy masks differences in individual concept accuracies that influence these results.

6.2 Intervening on CUB

Intervention on CUB is complicated by the fact that it is classification instead of regression. Recall from Section 3 that for sequential and joint bottleneck classifiers, we connect $c \to y$ to the logits $\hat{\ell}$. To intervene on a concept $\hat{c}_j$, we therefore cannot directly copy over the true $c_j$. Instead, we need to alter the logits $\hat{\ell}_j$ such that $P(\hat{c}_j = 1) = \sigma(\hat{\ell}_j)$ is close to the true $c_j$. Concretely, we intervene on $\hat{c}_j$ by setting $\hat{\ell}_j$ to the 5th (if $c_j = 0$) or 95th (if $c_j = 1$) percentile of $\hat{\ell}_j$ over the training distribution.
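A minimal sketch of this percentile-based logit intervention is given below. The percentile convention (linear interpolation between sorted values), the helper names, and the toy training logits are assumptions made for illustration; the paper does not specify these details here.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def percentile(values, q):
    """q-th percentile with linear interpolation between sorted values (assumed convention)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] * (1.0 - frac) + s[hi] * frac


def intervene_on_logit(train_logits_j, true_value):
    """Return the logit to use for concept j once an oracle reports `true_value`."""
    q = 95 if true_value == 1 else 5
    return percentile(train_logits_j, q)


# Hypothetical logits for one concept over the training set.
train_logits_j = [float(v) for v in range(-10, 11)]

low = intervene_on_logit(train_logits_j, 0)   # concept reported absent
high = intervene_on_logit(train_logits_j, 1)  # concept reported present
print(low, sigmoid(low), high, sigmoid(high))
```

Using training-set percentiles keeps the edited logit within the range the concepts-to-target model saw during training, while still pushing the implied probability close to 0 or 1.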

Another difference is that for CUB, we group related concepts and intervene on them together. This is because many of the concepts encode the same underlying property, e.g., one concept indicates whether the wing is red, another whether the wing is black, etc. We assume that the human (oracle) returns the true wing color in a single query, instead of only answering yes/no questions about the wing color; see Figure 4-Right.

An important caveat is that we use denoised class-level concepts in the CUB dataset (Section 4.1). To avoid unrealistic scenarios where a bird part is not visible in the image but we still ‘intervene’ on it, we only replace a concept value with the true concept value if that concept is actually visible in the image (visibility information is included in the dataset). The results here are nonetheless still optimistic, because they assume that human experts do not make mistakes in identifying concepts and that birds of the same species always share the same concept values.

Test-time intervention substantially improved accuracy on CUB bottleneck models (Figure 4-Right), though it took intervention on several concept groups to see a large gain. For simplicity, we queried concept groups in random order, which means that many queries were probably irrelevant for any given test example.

Test-time intervention was more effective on independent bottleneck models than on the sequential and joint models (Figure 4-Right). We hypothesize that this is partially due to the ad hoc fashion in which we set logits to the 5th or 95th percentiles for the latter models. To study this, we trained a joint bottleneck with the same task-concept tradeoff but with the $c \to y$ model connected to the probabilities $\sigma(\hat{\ell})$ instead of the logits $\hat{\ell}$. This model had a higher task error than the normal joint model; we suspect that the squashing from the sigmoid makes optimization harder. However, test-time intervention worked better (“Joint, from sigmoid” vs. “Joint” in Figure 4-Right), and it is more straightforward as we can directly edit the predicted probabilities. This poses the question of how to effectively intervene in the classification setting while maintaining the computational advantages of avoiding the sigmoid in the $c \to y$ connection.

7 Robustness to background shifts

Finally, we investigate if concept bottleneck models can be more robust than standard models to spurious correlations (e.g., the background) that hold in the training distribution but not the test distribution. Whether bottleneck models are more robust depends on the choice of the set of concepts and the shifts considered; a priori, we do not expect that an arbitrary set of concepts will lead to a more robust model.

Figure 5: We change the image backgrounds associated with each class from train to test time (illustrated above for a single class).
| Model | Task error | Concept error |
|---|---|---|
| Standard | 0.627 ± 0.012 | - |
| Joint | 0.483 ± 0.022 | 0.069 ± 0.002 |
| Sequential | 0.493 ± 0.004 | 0.072 ± 0.002 |
| Independent | 0.482 ± 0.008 | 0.072 ± 0.002 |
Table 3: Task and concept error with background shifts. Bottleneck models have substantially lower task error than the standard model.

We constructed a variant of the CUB dataset where the target is spuriously correlated with image background in the training set. Specifically, we cropped each bird out of its original background (using segmentation masks from the original dataset) and pasted it onto a new background from the Places dataset (zhou2017places), with each bird class (species) assigned to a unique and randomly-selected category of places. At test time, we shuffle this mapping, so each class is associated with a different category of places. For example, at training time, all robins might be pictured against the sky, but at test time they might all be on grassy plains (Figure 5).
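One way to build such a train/test background assignment is sketched below. The paper only states that the mapping is shuffled; the cyclic shift used here is a swapped-in technique chosen because it guarantees that every class receives a different category at test time. The function name and the example class and category names are hypothetical.

```python
import random

def assign_backgrounds(classes, place_categories, seed=0):
    """Return (train_map, test_map) from class to place category.

    Assumes the categories are distinct, that there are at least as many
    categories as classes, and that there are at least two classes.
    """
    rng = random.Random(seed)
    chosen = rng.sample(place_categories, len(classes))  # unique category per class
    train_map = dict(zip(classes, chosen))
    # A cyclic shift by a non-zero offset moves every class to a new category.
    offset = rng.randrange(1, len(classes))
    test_map = {
        c: chosen[(i + offset) % len(classes)] for i, c in enumerate(classes)
    }
    return train_map, test_map

classes = ["robin", "crow", "gull", "wren"]
places = ["sky", "grassy plain", "forest", "beach", "desert"]
train_map, test_map = assign_backgrounds(classes, places)
```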

As images from each class now share common background features, standard models leverage this spurious correlation and consequently fail on the shifted test set (Table 3). Concept models do better as they rely less on background features, since each concept is shared among multiple bird classes and thus appears in training data points that span multiple background types, reducing the correlation between the concept and the background. This toy experiment shows that concept bottleneck models can be more robust to spurious correlations when the target $y$ is more correlated with training data artifacts than the concepts $c$ are.

8 Discussion

Concept bottleneck models can compete on task accuracy while supporting intervention and interpretation, allowing practitioners to reason about these models in terms of high-level concepts they are familiar with, and enabling more effective human-model collaboration through test-time intervention. We believe that these models can be promising in settings like medicine, where the high stakes incentivize human experts to collaborate with models at test time, and where the tasks are often normatively defined with respect to a set of standard concepts (e.g., “osteoarthritis is marked by the presence of bone spurs”). A flurry of recent papers have used similar human concepts for post-hoc interpretation of medical and other scientific ML models, e.g., graziani2018regression for breast cancer histopathology; clough2019global for cardiac MRIs; and sprague2019interpretable for meteorology (storm prediction). We expect that concept bottleneck models can be applied directly to similar settings. Below, we discuss several directions for future work.

Learning concepts. In tasks that are not normatively defined, we can learn the right concepts by interactively querying humans. For example, cheng2015flock asked crowdworkers to generate concepts to differentiate between adaptively-chosen pairs of examples, and used those concepts to train models to recognize the artist of a painting, tell honest from deceptive reviews, and identify popular jokes. Similar methods can also be used to refine existing concepts and make them more discriminative (duan2012discovering).

Analyzing concept bottlenecks. A better understanding of when and why concept bottlenecks improve task accuracy can inform how we collect concepts or design the architecture of bottleneck models. As an example of what this could entail, we sketch an analysis of a simple well-specified linear regression setting, where we assume that the input $x$ is normally distributed, and that the concepts $c$ and the target $y$ are noisy linear transformations of $x$ and $c$, respectively. We compared an independent bottleneck model (two linear regression problems for $x \to c$ and $c \to y$) to a standard model (a single linear regression problem) by deriving the ratio of their excess mean-squared-errors as the number of training points goes to infinity. This ratio depends on the number of concepts $k$, the input dimension $d$, and the variances $\sigma_C^2$ and $\sigma_Y^2$ of the noise in the concepts $c$ and target $y$, respectively. See Appendix C for a formal statement and proof. Note that the asymptotic relative excess error is small when $k/d$ is small and $\sigma_C^2 \ll \sigma_Y^2$, suggesting that concept bottleneck models can be particularly effective when the number of concepts is much smaller than the input dimension and when the concepts have relatively low noise compared to the target.

Intervention effectiveness. Our exploration of the design space of concept bottleneck models showed that the training method (independent, sequential, joint) and choice of architecture have a large influence not just on task and concept accuracies, but also on how effective interventions are. This poses several open questions, for example: What factors drive the effectiveness of test-time interventions? Does concept accuracy suffice for comparing the interpretability of concept bottleneck models, or is intervention effectiveness more important? Could adaptive strategies (e.g., that query for the concepts that maximize expected information gain on the test example) make interventions more effective? Finally, how might we have models learn from interventions to avoid making similar mistakes in the future?


The code for replicating our experiments is available on GitHub. An executable version of the CUB experiments in this paper is on CodaLab, where the post-processed CUB+Places dataset can also be downloaded. While we are unable to release the OAI dataset publicly, an application can be made to access the data.


We are grateful to Jesse Mu, Justin Cheng, Michael Bernstein, Rui Shu, Sendhil Mullainathan, Shyamal Buch, Ziad Obermeyer, and our anonymous reviewers for helpful advice. PWK was supported by a Facebook PhD Fellowship. YST was supported by an IMDA Singapore Digital Scholarship. SM was supported by an NSF Graduate Fellowship. EP was supported by a Hertz Fellowship. Other funding came from the PECASE Award. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.


Appendix A Datasets

A.1 Osteoarthritis Initiative (OAI)

Description and statistics. The source of the knee x-ray dataset is the Osteoarthritis Initiative, which compiles radiological and clinical data on patients who have or are at high risk of developing knee osteoarthritis. We follow the dataset processing procedure used by pierson2019using in their previous analysis. They analyzed data from the baseline visit and four follow-up timepoints (12-, 24-, 36-, and 48-month follow-ups). Two types of data from this dataset were used in our analysis: the knee x-rays, which served as the input to the neural network, and the clinical concepts associated with osteoarthritis, which were annotated by radiologists for each knee x-ray.

After filtering for observations which contain basic demographic and clinical data, the dataset contains 4,172 patients and 36,369 observations, where an observation is one knee for one patient at one timepoint. We randomly divided patients into training, validation, and test sets, with no overlap in the patient groups. Specifically, we have 21,340 observations from 2,456 people in the training set; 3,709 observations from 421 people in the validation set; and 11,320 observations from 1,295 people in the test set.

Image processing. To process the knee x-rays, each x-ray was downsampled to 512 × 512 pixels and normalized by dividing pixel values by the maximum pixel value (so all pixel values were in the range 0-1) and then z-scoring. Images were removed if they did not pass OAI x-ray image metadata quality control filters.

Clinical concept assessment and KLG merging. The primary clinical image feature used in analysis is Kellgren-Lawrence grade (KLG), a 5-level categorical variable (0 to 4) which is assessed by radiologists and used as a standard measure of radiographic osteoarthritis severity, with higher scores denoting more severe disease. In addition to KLG, each knee image is also assessed for 18 other clinical concepts (features) of osteoarthritis in various knee compartments, describing joint space narrowing (JSN), osteophytes, chondrocalcinosis, subchondral sclerosis, cysts, and attrition.

The Osteoarthritis Initiative only assessed these additional 18 clinical concepts (besides joint space narrowing, which is available for all participants) for participants with KLG ≥ 2 (a standard threshold for radiographic osteoarthritis) in at least one knee at any time point. Therefore, in their analysis (and in this paper), pierson2019using set these clinical concepts to zero for other participants. This corresponds to assuming that participants who were never assessed to have osteoarthritis, and thus were not assessed for other clinical concepts, did not display those features. This procedure also means it is impossible to use the clinical concepts to distinguish most x-rays with KLG 0 from those with KLG 1 in the dataset. To evaluate concept bottleneck models on this dataset, we therefore merged the KLG 0 and KLG 1 classes into a single level and translated the other KLG levels downwards by 1, leading to a 4-level categorical variable (0 to 3).

Concept processing. Some of the clinical concepts are very sparse, with almost all x-rays in the dataset showing an absence of the associated radiographic feature. We found that there were insufficient positive training examples to be able to accurately predict these concepts; moreover, including these sparse concepts in the bottleneck models lowered the accuracy of KLG prediction. We therefore filtered out the clinical concepts for which the dominant class (corresponding to an absence of the feature) accounts for almost all of the training data.

This procedure kept 10 clinical concepts: “osteophytes femur medial”, “sclerosis femur medial”, “joint space narrowing medial”, “osteophytes tibia medial”, “sclerosis tibia medial”, “osteophytes femur lateral”, “sclerosis femur lateral”, “joint space narrowing lateral”, “osteophytes tibia lateral”, and “sclerosis tibia lateral”. It filtered 8 concepts: “cysts femur medial”, “chondrocalcinosis medial”, “cysts tibia medial”, “attrition tibia medial”, “cysts femur lateral”, “chondrocalcinosis lateral”, “cysts tibia lateral”, “attrition tibia lateral”.

After filtering, we z-scored the remaining clinical concepts using the training set to bring them onto the same scale.
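The filtering and z-scoring steps above might look like the following sketch. The 0.95 cutoff is a hypothetical placeholder for the sparsity threshold, and encoding "absence of the feature" as the value 0 is likewise an assumption of this example, as are the function and variable names.

```python
import numpy as np

def filter_and_zscore(train_c, other_c, max_dominant_frac=0.95):
    """Drop sparse concepts, then z-score the rest with training-set statistics.

    max_dominant_frac is a hypothetical threshold; absence is assumed to be coded as 0.
    Assumes every kept concept has non-zero variance on the training set.
    """
    absent_frac = (train_c == 0).mean(axis=0)  # fraction of rows with the feature absent
    keep = absent_frac < max_dominant_frac
    mean = train_c[:, keep].mean(axis=0)
    std = train_c[:, keep].std(axis=0)
    z_train = (train_c[:, keep] - mean) / std
    z_other = (other_c[:, keep] - mean) / std  # validation/test reuse training statistics
    return keep, z_train, z_other

# Toy data: concept 0 is present in half the rows, concept 1 in only 2% of rows.
train_c = np.zeros((100, 2))
train_c[:50, 0] = 1.0
train_c[:2, 1] = 1.0
val_c = np.array([[1.0, 0.0], [0.0, 1.0]])
keep, z_train, z_val = filter_and_zscore(train_c, val_c)
```

Reusing the training mean and standard deviation for the held-out splits keeps all splits on the same scale without leaking held-out statistics.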

Some of the clinical concepts, such as joint space narrowing, are annotated with fractional grades (e.g., 1.2, 1.4, 1.6 etc.) in the dataset. These partial grades represent temporal progression and cannot be deduced by looking at a single timepoint, and they explicitly do not reflect fractional grades (e.g., 1.2 on one patient does not mean it is worse than 1.0 on another patient); we therefore truncate these fractional grades.

Reader disagreements and adjudication procedures. KLG was read by two expert readers (i.e., radiologists) for each x-ray. Discrepancies in these readings, if they met the adjudication criteria described below, were adjudicated by a third reader: if the third reading agreed with either of the existing readings, then that reading was taken to be final, and otherwise, the three readers attended an adjudication session to form a consensus reading. If discrepancies were not adjudicated, the final reading was taken to be the one from the more senior reader. KLG readings were adjudicated when they disagreed on whether KLG was within 0-1 or 2-4, or when there was a difference in the direction of change of KLG between time points.

JSN was also read by two readers, with similar adjudication procedures. Discrepancies were adjudicated if the readers did not agree on the direction of change between time points.

All other clinical concepts in our dataset were read by a single reader. For more information on the adjudication procedures, please refer to the OAI documentation on Project 15.

A.2 Caltech-UCSD Birds-200-2011 (CUB)

Description and statistics. The Caltech-UCSD Birds-200-2011 (CUB) dataset (wah2011cub) comprises 11,788 photographs of birds from 200 species, with each image additionally annotated with 312 binary concepts (before processing) corresponding to bird attributes like wing color, beak shape, etc. Visibility information on each concept is also provided for each image (e.g., is the beak visible in this image?); we use this information to make our test-time intervention experiments more realistic, but not at training time. Since the original dataset only has train and test sets, we randomly split 20% of the data from the official train set to make a validation set.

Concept processing. The individual concept annotations are noisy: each annotation was provided by a single crowdworker (not a birding expert), and the concepts can be quite similar to each other, e.g., some crowdworkers might indicate that birds from some species have a red belly, while others might say that the belly is rufous (reddish-brown) instead.

To deal with this issue, we aggregate instance-level concept annotations into class-level concepts via majority voting: e.g., if more than 50% of crows have black wings in the data, then we set all crows to have black wings. This makes the approximation that all birds of the same species in the training data should share the same concept annotations. While this approximation is mostly true for this dataset, there are some exceptions due to visual occlusion, as well as sexual and age dimorphism.

After majority voting, we further filter out concepts that are too sparse, keeping only concepts (binary attributes) that are present after majority voting in at least 10 classes. After this filtering, we are left with 112 concepts.
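The majority-voting and sparsity-filtering steps can be illustrated with a short sketch. The function name and toy arrays are hypothetical, and the example uses a small `min_classes` only so that it fits a two-class toy dataset (the text above uses 10).

```python
import numpy as np

def class_level_concepts(instance_c, labels, n_classes, min_classes=10):
    """Majority-vote binary concepts per class, then drop concepts present in few classes.

    Assumes every class has at least one annotated example.
    """
    class_c = np.zeros((n_classes, instance_c.shape[1]), dtype=int)
    for k in range(n_classes):
        # A concept is set for the class if more than half of its instances have it.
        class_c[k] = instance_c[labels == k].mean(axis=0) > 0.5
    keep = class_c.sum(axis=0) >= min_classes
    return class_c[:, keep], keep

# Toy data: 2 classes with 3 images each, 2 binary concepts.
instance_c = np.array([[1, 0], [1, 1], [1, 0],
                       [1, 1], [1, 1], [0, 1]])
labels = np.array([0, 0, 0, 1, 1, 1])
class_c, keep = class_level_concepts(instance_c, labels, n_classes=2, min_classes=2)
denoised = class_c[labels]  # every image inherits its class-level concepts
```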

Appendix B Experimental details

B.1 OAI model architecture and training

The models we use to predict KLG from knee x-rays follow the hyperparameters and model setup used by pierson2019using, except for the learning rate and learning rate schedule, which we tune separately. Our models use a ResNet-18 (he2016resnet) pretrained on ImageNet, with the last 12 convolutional layers fine-tuned on the OAI dataset.

For the bottleneck models, the ResNet-18 network extracts high-level features from the image that are used to regress to the concepts with a single fully-connected layer. Subsequently, a 3-layer MLP, with a dimensionality of 50 for the first two layers, is used to regress to the final KLG. The standard model is similar, except without any loss term that encourages the bottleneck layer to align with the concepts.

For fine-tuning, we use a batch size of 8, with random horizontal and vertical translations as data augmentation. Network weights are optimized with Adam, with beta parameters of 0.9 and 0.999 and an initial learning rate determined by grid search over [0.00005, 0.0005, 0.005], which decays by a factor of 2 every 10 epochs. The network is trained for 30 epochs with early stopping; model weights are set at the conclusion of training to those after the epoch with lowest RMSE for KLG on the validation set.
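The step-decay schedule and the choice of final weights described above can be expressed in a few lines. This is a sketch: the helper names are hypothetical, and it assumes zero-indexed epochs with the first halving applied at epoch 10.

```python
def learning_rate(initial_lr, epoch, decay_factor=2.0, step=10):
    """Learning rate at a zero-indexed epoch, divided by decay_factor every `step` epochs."""
    return initial_lr / decay_factor ** (epoch // step)

def best_epoch(val_rmse_per_epoch):
    """Index of the epoch with the lowest validation RMSE (first one on ties)."""
    return min(range(len(val_rmse_per_epoch)), key=val_rmse_per_epoch.__getitem__)

schedule = [learning_rate(0.0005, e) for e in range(30)]
chosen = best_epoch([0.50, 0.40, 0.45])  # weights from this epoch would be restored
```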

B.2 CUB model architecture and training

The main architecture for fine-grained bird classification is Inception V3, pretrained on ImageNet (except for the fully-connected layers) and then finetuned end-to-end on the CUB dataset. We follow the preprocessing practices described in cui2018large. Each image used for training is augmented with random color jittering, random horizontal flip and random cropping with a resolution of 299. During inference, the original image is center-cropped and resized to 299.

For each model, we run a hyperparameter search on the validation set over a range of learning rates ([0.001, 0.01]), learning rate schedules (keeping learning rate constant or reducing learning rate by 10 times after every [10, 15, 20] epochs until it reaches 0.0001), and regularization strengths ([0.0004, 0.00004]), to find a good hyperparameter configuration. The best model is decided based on task accuracy (or concept accuracy for the $x \to c$ part of sequential models) on the validation set. Once we have found the best-performing hyperparameter configuration, we then retrain the model on both the train and validation sets until convergence, following cui2018large.

All training is done with a batch size of 64, and SGD with momentum of 0.9 as the optimizer. For bottleneck models, we weight each concept’s contribution to the overall concept loss equally (the weight of the overall concept loss is in turn determined by $\lambda$ for joint bottleneck models). However, the binary cross-entropy loss used for each individual concept prediction task is weighted by the ratio of class imbalance for that individual concept (which is about 1 : 9 on average) and normalized accordingly. This encourages the model to learn to predict positive concept labels, which are rarer, instead of mostly predicting negative labels.
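A class-imbalance-weighted binary cross-entropy for a single concept can be sketched as below. The exact weighting and normalization used in the experiments are not spelled out in the text, so this example assumes that positive examples are up-weighted by the negative-to-positive ratio and that the loss is divided by the sum of the weights; the function name is hypothetical.

```python
import numpy as np

def imbalance_weighted_bce(logits, targets, pos_frac):
    """Weighted BCE for one concept; pos_frac is its fraction of positive training labels."""
    pos_weight = (1.0 - pos_frac) / pos_frac  # 9 when positives : negatives is 1 : 9
    # Numerically stable log-sigmoid: log(sigmoid(z)) = -log(1 + exp(-z)).
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    weights = np.where(targets == 1, pos_weight, 1.0)
    losses = -(targets * log_p + (1 - targets) * log_not_p)
    return float((weights * losses).sum() / weights.sum())

logits = np.array([-3.0, -3.0])   # the model predicts "absent" for both examples
targets = np.array([1, 0])        # but the first example is actually positive
loss_weighted = imbalance_weighted_bce(logits, targets, pos_frac=0.1)
loss_balanced = imbalance_weighted_bce(logits, targets, pos_frac=0.5)
```

With the imbalance weighting, the missed positive dominates the loss, which is the intended effect of discouraging an all-negative predictor.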

B.3 Test-time intervention

OAI. For OAI, we use the held-out validation set to determine an input-independent ordering for concept intervention. Specifically, we use the concept labels in the validation set to intervene separately on each concept, replacing a single value in our original concept predictions with that ground truth concept. We obtain the intervention ordering by sorting the concepts in descending order of the improvement in KLG accuracy gained from intervening separately on each concept.
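The ordering procedure for OAI can be sketched as follows. The sketch measures improvement as the reduction in validation RMSE, which is an assumption of this example, and `predict_y` is a hypothetical stand-in for the trained concept-to-label model.

```python
import numpy as np

def intervention_order(predict_y, pred_c, true_c, y):
    """Rank concepts by how much replacing each one alone reduces validation RMSE."""
    def rmse(c):
        return float(np.sqrt(np.mean((predict_y(c) - y) ** 2)))

    base = rmse(pred_c)
    gains = []
    for j in range(pred_c.shape[1]):
        c = pred_c.copy()
        c[:, j] = true_c[:, j]  # intervene on concept j only
        gains.append(base - rmse(c))
    order = [int(j) for j in np.argsort(gains)[::-1]]  # largest improvement first
    return order, gains

# Toy setup: a linear concept-to-label model in which concept 0 matters most.
w = np.array([2.0, 0.1, 1.0])
true_c = np.zeros((4, 3))
pred_c = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.0, 0.0, 0.0]])
y = true_c @ w
order, gains = intervention_order(lambda c: c @ w, pred_c, true_c, y)
```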

CUB. For CUB, the concept groups are determined by having a common prefix in the list of concept names. For example, “has_back_pattern::solid”, “has_back_pattern::spotted”, “has_back_pattern::striped”, “has_back_pattern::multi-colored” all describe the same group that concerns back-pattern. Since all models are retrained on both train and validation sets, as described above, we do not follow the OAI procedure of determining a fixed ordering. Instead, we randomly select concept groups to intervene on at test time, using the class-level labels for all concepts within that group to replace the predicted logits. To avoid intervening on concepts that are not even visible in the image, we use the concept visibility information that comes with the official CUB dataset: for all concepts that are not visible in a given test image, their corrected values are set to 0 regardless of what the corresponding class-level labels may be.
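The grouping by common prefix and the visibility rule can be sketched as below. The helper names and the toy labels are hypothetical; the concept names follow the `prefix::value` pattern quoted above.

```python
from collections import defaultdict

def group_concepts(concept_names):
    """Map each prefix (the text before '::') to the indices of its concepts."""
    groups = defaultdict(list)
    for idx, name in enumerate(concept_names):
        groups[name.split("::")[0]].append(idx)
    return dict(groups)

def corrected_values(class_labels, visible, group_indices):
    """Class-level labels for one group, set to 0 where the concept is not visible."""
    return {j: (class_labels[j] if visible[j] else 0) for j in group_indices}

names = ["has_back_pattern::solid", "has_back_pattern::spotted",
         "has_wing_color::red", "has_wing_color::black"]
groups = group_concepts(names)
class_labels = [1, 0, 0, 1]             # class-level concepts of the test image's species
visible = [True, True, False, False]    # the wing is not visible in this image
back = corrected_values(class_labels, visible, groups["has_back_pattern"])
wing = corrected_values(class_labels, visible, groups["has_wing_color"])
```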

B.4 Data efficiency

For OAI, we subsampled the training and validation data uniformly at random. For CUB, to ensure that each of the 200 classes had similar numbers of examples, we subsampled the images from each class uniformly at random. To avoid the computational load of hyperparameter searching for each model and degree of subsampling, we adopted the hyperparameters chosen for the best-performing models on the full dataset but did early stopping on the subsampled validation datasets.
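Per-class subsampling of the kind used for CUB can be sketched as follows. The function name, the rounding rule, and the choice to keep at least one example per class are assumptions of this example.

```python
import random
from collections import defaultdict

def subsample_per_class(labels, fraction, seed=0):
    """Sample the same fraction of example indices from every class, uniformly at random."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    keep = []
    for y in sorted(by_class):
        idx = by_class[y]
        n_keep = max(1, round(fraction * len(idx)))  # keep at least one example per class
        keep.extend(rng.sample(idx, n_keep))
    return sorted(keep)

labels = [0] * 10 + [1] * 10
kept = subsample_per_class(labels, fraction=0.5)
```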

B.5 Linear probes

Standard (end-to-end) models. For OAI, we separately trained linear probes on the outputs after every ResNet block and the fully-connected layers of the MLP of the standard model. The best-performing linear probe was the one trained on the output of the final ResNet block. For CUB, we ran a linear probe on the fully-connected layer of the standard model, since the $c \to y$ part of the bottleneck models is linear.

SENN. To evaluate self-explaining neural networks (SENNs) (melis2018towards), we first trained a SENN model to predict KLG on the OAI dataset and then trained linear probes on the concept layer in the SENN model. We used the open-source implementation from the authors of SENN, and therefore used a classification objective for KLG prediction. To match the expressiveness of our bottleneck models, we swapped the small CNNs of the SENN concept encoder and relevance parameterizer with our ResNet-18 models. Similarly, for the decoder network in SENN, we used a more expressive decoder comprising 2 fully-connected layers with batch normalization, followed by 5 transposed convolutional layers with upsampling. The decoder was obtained by adapting a public auto-encoder implementation, changing the dimensionalities of the fully-connected and transposed convolutional layers, and increasing upsampling layers to match our input image size. We set the number of concepts for SENN to 10, corresponding to the number of clinical features in OAI. The learning rate was set to 0.0005 and the batch size was set to 4, which was the maximum possible given the memory constraints. With the above settings, the experiments were run with two different seeds.

Appendix C Excess errors of independent vs. standard models

We present an analysis of the independent bottleneck model, which uses concepts at training time, versus the standard model, which does not. For simplicity, we consider a well-specified linear regression setting with normally-distributed inputs $X$, concepts $C$, and target $Y$:


where and . In contrast to the main text, we use capital letters for $X$, $C$, and $Y$ here to emphasize the fact that the input, concepts, and target are random variables. In words, the input $X$ is normally distributed with dimension $d$; the concepts $C$ of dimension $k$ are a linear transformation of $X$ with additive Gaussian noise; and the output $Y$ is a scalar-valued linear transformation of $C$ with additive Gaussian noise. For analytical simplicity, we require and .

Independent bottleneck model. In this setting, the independent bottleneck model comprises two linear regression models: the first estimates the matrix that takes , and the second estimates the vector that takes . For ease of analysis, we assume that each linear regression is fit using least squares on a separate dataset: the first dataset has training points in data matrices and , and the second dataset has points in data matrices and . Concretely, we estimate


and then compose these estimators into the final prediction .

Standard model. In contrast, the standard model does not use concepts, and uses only one dataset with points in and . Concretely, we can express directly in terms of as , where and . This gives the least squares estimate


and the resulting prediction .

Excess errors. We compare these two models using their asymptotic excess error as the number of training points $n$ goes to infinity, where a model’s excess error is defined as how much higher its mean-squared-error is compared to that of the optimal estimator.

Proposition 1 (Relative excess error of independent bottleneck models vs. standard models in linear regression).

Let $n$ tend to infinity. Then the ratio of excess errors of the independent bottleneck model to the standard model in the well-specified linear regression setting above is

Note that the asymptotic relative excess error is small (i.e., the independent bottleneck has lower excess error than the standard model) when $k/d$ is small and $\sigma_C^2 \ll \sigma_Y^2$, where $\sigma_C^2$ and $\sigma_Y^2$ denote the noise variances of the concepts and the output. This corresponds to low-dimensional concepts (relative to the input dimension) and concepts with low noise (relative to the noise in the output).
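The qualitative claim can be checked numerically with a small simulation. This is a sketch under assumed settings rather than a reproduction of the analysis: it takes standard normal inputs, a concept matrix with orthonormal rows, and a unit-norm coefficient vector, all of which are illustrative choices, and every name in the code is hypothetical.

```python
import numpy as np

def excess_errors(d=50, k=2, n=500, sigma_c=0.1, sigma_y=1.0, trials=20, seed=0):
    """Average excess errors of the independent bottleneck and the standard model."""
    rng = np.random.default_rng(seed)
    # Assumed ground truth: W (k x d) has orthonormal rows, beta has unit norm.
    W = np.linalg.qr(rng.normal(size=(d, k)))[0].T
    beta = rng.normal(size=k)
    beta /= np.linalg.norm(beta)
    theta = W.T @ beta  # coefficients of the optimal input-to-target predictor

    def sample():
        X = rng.normal(size=(n, d))
        C = X @ W.T + sigma_c * rng.normal(size=(n, k))
        Y = C @ beta + sigma_y * rng.normal(size=n)
        return X, C, Y

    ind, std = 0.0, 0.0
    for _ in range(trials):
        X1, C1, _ = sample()   # dataset for the input-to-concept regression
        _, C2, Y2 = sample()   # separate dataset for the concept-to-target regression
        X3, _, Y3 = sample()   # dataset for the standard model
        W_hat = np.linalg.lstsq(X1, C1, rcond=None)[0]      # shape (d, k)
        beta_hat = np.linalg.lstsq(C2, Y2, rcond=None)[0]   # shape (k,)
        theta_hat = np.linalg.lstsq(X3, Y3, rcond=None)[0]  # shape (d,)
        # For standard normal inputs, excess error is the squared distance to theta.
        ind += float(np.sum((W_hat @ beta_hat - theta) ** 2))
        std += float(np.sum((theta_hat - theta) ** 2))
    return ind / trials, std / trials

ind_err, std_err = excess_errors()
ratio = ind_err / std_err
```

With few concepts relative to the input dimension and low concept noise, the independent bottleneck's excess error in this simulation should come out well below that of the standard model.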

To prove this proposition, we first derive the expected errors of the independent bottleneck model and the standard model.

Lemma 1 (Risk of the independent bottleneck model).

A direct calculation gives




We need to evaluate the expectation of this expression multiplied with itself. Note that the cross terms will cancel since the noise terms are independent of the other random variables and have mean zero. This leaves three remaining direct (squared) terms, which we can evaluate separately since tr and $\mathbb{E}$ are linear operators.

The first term is


The expression within the above expectation is distributed as an inverse Wishart distribution, and therefore


where the last equality comes from .

The second term follows a similar calculation:


where the second equality follows because is normally distributed with mean and covariance .

The third term is


Putting the three terms together,


so the expected squared error is


Lemma 2 (Risk of the standard model).

A direct calculation gives




we have


Plugging this back into the expression for yields


Proof of Proposition 1.

Note that the optimal estimator has risk


Thus, from Lemmas 1 and 2, the ratio of excess errors is


Taking the limit as goes to infinity and letting gives the desired result