CheXtransfer: Performance and Parameter Efficiency of ImageNet Models for Chest X-Ray Interpretation

by Alexander Ke, et al.
Stanford University

Deep learning methods for chest X-ray interpretation typically rely on pretrained models developed for ImageNet. This paradigm assumes that better ImageNet architectures perform better on chest X-ray tasks and that ImageNet-pretrained weights provide a performance boost over random initialization. In this work, we compare the transfer performance and parameter efficiency of 16 popular convolutional architectures on a large chest X-ray dataset (CheXpert) to investigate these assumptions. First, we find no relationship between ImageNet performance and CheXpert performance for both models without pretraining and models with pretraining. Second, we find that, for models without pretraining, the choice of model family influences performance more than size within a family for medical imaging tasks. Third, we observe that ImageNet pretraining yields a statistically significant boost in performance across architectures, with a higher boost for smaller architectures. Fourth, we examine whether ImageNet architectures are unnecessarily large for CheXpert by truncating final blocks from pretrained models, and find that we can make models 3.25x more parameter-efficient on average without a statistically significant drop in performance. Our work contributes new experimental evidence about the relation of ImageNet to chest x-ray interpretation performance.


1. Introduction

Deep learning models for chest X-ray interpretation have high potential for social impact by aiding clinicians in their workflow and increasing access to radiology expertise worldwide. Transfer learning using pretrained ImageNet (Deng et al., 2009) models has been the standard approach for developing models not only on chest X-rays (Wang et al., 2017; Rajpurkar et al., 2017; Apostolopoulos and Mpesiana, 2020) but also for many other medical imaging modalities (Mitani et al., 2020; Zhang et al., 2020; Li et al., 2019; De Fauw et al., 2018; Esteva et al., 2017). This transfer assumes that better ImageNet architectures perform better and pretrained weights boost performance on their target medical tasks. However, there has not been a systematic investigation of how ImageNet architectures and weights both relate to performance on downstream medical tasks.

In this work, we systematically investigate how ImageNet architectures and weights both relate to performance on chest X-ray tasks. Our primary contributions are:

  1. For models without pretraining and models with pretraining, we find no relationship between ImageNet performance and CheXpert performance (Spearman , respectively). This finding suggests that architecture improvements on ImageNet may not lead to improvements on medical imaging tasks.

  2. For models without pretraining, we find that within an architecture family, the largest and smallest models have small differences (ResNet 0.005, DenseNet 0.003, EfficientNet 0.004) in CheXpert AUC, but different model families have larger differences in AUC (). This finding suggests that the choice of model family influences performance more than size within a family for medical imaging tasks.

  3. We observe that ImageNet pretraining yields a statistically significant boost in performance (average boost of 0.016 AUC) across architectures, with a higher boost for smaller architectures (Spearman with number of parameters). This finding supports the ImageNet pretraining paradigm for medical imaging tasks, especially for smaller models.

  4. We find that by truncating final blocks of pretrained models, we can make models 3.25x more parameter-efficient on average without a statistically significant drop in performance. This finding suggests model truncation may be a simple method to yield lighter pretrained models by preserving architecture design features while reducing model size.

Our study, to the best of our knowledge, contributes the first systematic investigation of the performance and efficiency of ImageNet architectures and weights for chest X-ray interpretation. Our investigation and findings may be further validated on other datasets and medical imaging tasks.

2. Related Work

2.1. ImageNet Transfer

Kornblith et al. (2018) examined the performance of 16 convolutional neural networks (CNNs) on 12 image classification datasets. They found that using these ImageNet pretrained architectures either as feature extractors for logistic regression or fine tuning them on the target dataset yielded a Spearman and between ImageNet accuracy and transfer accuracy respectively. However, they showed ImageNet performance was less correlated with transfer accuracy for some fine-grained tasks, corroborating He et al. (2018). They found that without ImageNet pretraining, ImageNet accuracy and transfer accuracy had a weaker Spearman . We extend Kornblith et al. (2018) to the medical setting by studying the relationship between ImageNet and CheXpert performance.

Raghu et al. (2019) explored properties of transfer learning onto retinal fundus images and chest X-rays. They studied ResNet50 and InceptionV3 and showed pretraining offers little performance improvement. Architectures composed of just four to five sequential convolution and pooling layers achieved performance comparable to ResNet50 on these tasks with less than 40% of the parameters. In our work, we find pretraining does not boost performance for ResNet50, InceptionV3, InceptionV4, and MNASNet but does boost performance for the remaining 12 architectures. Thus, we were able to replicate the results of Raghu et al. (2019), but upon studying a broader set of newer and more popular models, we reached the opposite conclusion that ImageNet pretraining yields a statistically significant boost in performance.

2.2. Medical Task Architectures

Irvin et al. (2019) compared the performance of ResNet152, DenseNet121, InceptionV4, and SEResNeXt101 on CheXpert, finding that DenseNet121 performed best. In a recent analysis, all but one of the top ten CheXpert competition models used DenseNets as part of their ensemble, even though they have been surpassed on ImageNet (Rajpurkar et al., 2020). Few groups design their own networks from scratch, preferring to use established ResNet and DenseNet architectures for CheXpert (Bressem et al., 2020). This trend extends to retinal fundus and skin cancer tasks as well, where Inception architectures remain popular (Mitani et al., 2020; Zhang et al., 2020; Li et al., 2019; De Fauw et al., 2018). The popularity of these older ImageNet architectures hints that there may be a disconnect between ImageNet performance and medical task performance for newer architectures generated through architecture search. We verify that these newer architectures generated through search (EfficientNet, MobileNet, MNASNet) underperform older architectures (DenseNet, ResNet) on CheXpert, suggesting that search has overfit to ImageNet and explaining the popularity of these older architectures in the medical imaging literature.

Bressem et al. (2020) postulated that deep CNNs that can represent more complex relationships for ImageNet may not be necessary for CheXpert, which has greyscale inputs and fewer image classes. They studied ResNet, DenseNet, VGG, SqueezeNet, and AlexNet performance on CheXpert and found that ResNet152, DenseNet161, and ResNet50 performed best on CheXpert AUC. In terms of AUPRC, they showed that smaller architectures like AlexNet and VGG can perform similarly to deeper architectures on CheXpert. Models such as AlexNet, VGG, and SqueezeNet are no longer popular in the medical setting, so in our work, we systematically investigate the performance and efficiency of 16 more contemporary ImageNet architectures with and without pretraining. Additionally, we extend Bressem et al. (2020) by studying the effects of pretraining, characterizing the relationship between ImageNet and CheXpert performance, and drawing conclusions about architecture design.

2.3. Truncated Architectures

The more complex a convolutional architecture becomes, the more computational and memory resources are needed for its training and deployment. Model complexity thus may impede the deployment of CNNs to clinical settings with fewer resources. Therefore, efficiency, often reported in terms of the number of parameters in a model, the number of FLOPs in the forward pass, or the latency of the forward pass, has become increasingly important in model design. Low-rank factorization (Jaderberg et al., 2014; Chollet, 2016), transferred/compact convolutional filters (Cheng et al., 2017), knowledge distillation (Hinton et al., 2015), and parameter pruning (Srinivas and Babu, 2015) have all been proposed to make CNNs more efficient.

Layer-wise pruning is a type of parameter pruning that locates and removes layers that are not as useful to the target task (Ro and Choi, 2020). Through feature diagnosis, a linear classifier is trained using the feature maps at intermediate layers to quantify how much a particular layer contributes to performance on the target task (Chen and Zhao, 2019). In this work, we propose model truncation as a simple method for layer-wise pruning where the final pretrained layers after a given point are pruned off, a classification layer is appended, and this whole model is finetuned on the target task.

3. Methods

3.1. Training and Evaluation Procedure

We train chest X-ray classification models with different architectures with and without pretraining. The task of interest is to predict the probability of different pathologies from one or more chest X-rays. We use the CheXpert dataset consisting of 224,316 chest X-rays of 65,240 patients (Irvin et al., 2019) labeled for the presence or absence of 14 radiological observations. We evaluate models using the average of their AUROC metrics (AUC) on the five CheXpert-defined competition tasks (Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural Effusion) as well as the No Finding task to balance clinical importance and prevalence in the validation set.
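The evaluation metric described above can be sketched in a few lines. This is a minimal, framework-free illustration, assuming binary 0/1 labels per task; the function names `auroc` and `average_auc` and the dictionary layout are hypothetical and are not taken from the authors' code.

```python
# Hypothetical sketch: AUROC per task via the rank-sum (Mann-Whitney U)
# identity, then the unweighted mean over the six evaluation tasks.
TASKS = ["Atelectasis", "Cardiomegaly", "Consolidation",
         "Edema", "Pleural Effusion", "No Finding"]


def auroc(labels, scores):
    """AUROC for binary labels (0/1); tied scores receive average ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_auc(labels_by_task, scores_by_task):
    """Mean AUROC over the six tasks; inputs are dicts keyed by task name."""
    return sum(auroc(labels_by_task[t], scores_by_task[t]) for t in TASKS) / len(TASKS)
```

Averaging per-task AUROCs with equal weight mirrors the description in the text; any task weighting would be a different metric.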

We select 16 models pretrained on ImageNet from public checkpoints implemented in PyTorch 1.4.0: DenseNet (121, 169, 201) and ResNet (18, 34, 50, 101) from Paszke et al. (2019), Inception (V3, V4) and MNASNet from Cadene (2018), and EfficientNet (B0, B1, B2, B3) and MobileNet (V2, V3) from Wightman (2020). We finetune and evaluate these architectures with and without ImageNet pretraining.

For each model, we finetune all parameters on the CheXpert training set. If a model is pretrained, inputs are normalized using mean and standard deviation learned from ImageNet. If a model is not pretrained, inputs are normalized with mean and standard deviation learned from CheXpert. We use the Adam optimizer (, ) with learning rate of , a batch size of 16, and a cross-entropy loss function. We train on up to four Nvidia GTX 1080 with CUDA 10.1 and Intel Xeon CPU E5-2609 running Ubuntu 16.04. For one run of an architecture, we train for three epochs and evaluate each model every 8192 gradient steps. We train each model and create a final ensemble model from the ten checkpoints that achieved the best average CheXpert AUC across the six tasks on the validation set. We report all our results on the CheXpert test set.
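The checkpoint selection step can be illustrated as follows. This is a rough sketch under stated assumptions: the text does not say how the ten checkpoints are combined, so averaging their predicted probabilities is an assumption here, and the names `select_top_checkpoints` and `ensemble_probabilities` are hypothetical.

```python
# Hypothetical sketch: keep the k checkpoints with the best validation AUC
# and average their predicted probabilities (the averaging rule is assumed).
def select_top_checkpoints(val_auc_by_checkpoint, k=10):
    """Return the names of the k checkpoints with the highest validation AUC."""
    ranked = sorted(val_auc_by_checkpoint,
                    key=val_auc_by_checkpoint.get, reverse=True)
    return ranked[:k]


def ensemble_probabilities(probs_by_checkpoint, selected):
    """Average per-example probabilities over the selected checkpoints."""
    n_examples = len(probs_by_checkpoint[selected[0]])
    return [
        sum(probs_by_checkpoint[name][i] for name in selected) / len(selected)
        for i in range(n_examples)
    ]
```

With `k=10` this matches the number of checkpoints named in the text; a smaller `k` is convenient for quick checks.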

We use the nonparametric bootstrap to estimate 95% confidence intervals for each statistic. 1,000 replicates are drawn from the test set, and the statistic is calculated on each replicate. This procedure produces a distribution for each statistic, and we report the 2.5 and 97.5 percentiles as a confidence interval. Significance is assessed at the
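A minimal sketch of the percentile bootstrap described in this paragraph is given below. The function name `bootstrap_ci`, the fixed seed, and the simple index-based percentile lookup are illustrative assumptions, not the authors' implementation.

```python
import random


def bootstrap_ci(examples, statistic, n_replicates=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(examples).

    Each replicate resamples the examples with replacement at the original
    size. The interval endpoints are the alpha/2 and 1 - alpha/2 percentiles
    of the replicate statistics (2.5 and 97.5 for alpha = 0.05).
    """
    rng = random.Random(seed)
    n = len(examples)
    values = []
    for _ in range(n_replicates):
        replicate = [examples[rng.randrange(n)] for _ in range(n)]
        values.append(statistic(replicate))
    values.sort()
    lower = values[int((alpha / 2) * n_replicates)]
    upper = values[min(n_replicates - 1, int((1 - alpha / 2) * n_replicates))]
    return lower, upper
```

For an AUC statistic, `examples` would be (label, score) pairs from the test set; a replicate that happens to contain a single class would need separate handling, which this sketch leaves out.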


3.2. Truncated Architectures

We study truncated versions of DenseNet121, MNASNet, ResNet18, and EfficientNetB0. DenseNet121 and MNASNet were chosen because we found they have the greatest efficiency (by AUC per parameter) on CheXpert of the models we profile, ResNet18 was chosen because of its popularity as a compact model for medical tasks, and EfficientNetB0 was chosen because it is the smallest current-generation model of the 16 we study. DenseNet121 contains four dense blocks separated by transition blocks before the classification layer. By pruning the final dense block and associated transition block, the model now only contains three dense blocks, yielding DenseNet121Minus1. Similarly, pruning two dense blocks and associated transition blocks yields DenseNet121Minus2, and pruning three dense blocks and associated transition blocks yields DenseNet121Minus3. For MNASNet, we remove up to four of the final MBConv blocks to produce MNASNetMinus1 through MNASNetMinus4. For ResNet18, we remove up to three of the final residual blocks with a similar method to produce ResNet18Minus1 through ResNet18Minus3. For EfficientNet, we remove up to two of the final MBConv6 blocks to produce EfficientNetB0Minus1 and EfficientNetB0Minus2.

After truncating a model, we append a classification block containing a global average pooling layer followed by a fully connected layer to yield outputs of the correct shape. We initialize the model with ImageNet pretrained weights, except the randomly initialized classification block, and finetune using the same training procedure as the 16 ImageNet models.
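The bookkeeping behind truncation can be sketched without a deep learning framework. In the sketch below a model is just an ordered list of blocks; the block names, parameter counts, and channel widths are hypothetical toy values rather than real DenseNet or ResNet figures, and `truncate` is an illustrative helper, not the authors' code.

```python
# Hypothetical sketch: drop the last blocks, append a fresh classification
# head (global average pooling + fully connected layer), and count parameters.
def head_params(in_channels, num_classes):
    """Parameters of the fully connected layer (weights plus biases).

    Global average pooling has no learnable parameters.
    """
    return in_channels * num_classes + num_classes


def truncate(blocks, n_removed, num_classes):
    """Return the kept blocks plus a new head.

    blocks: list of (name, param_count, out_channels) in forward order.
    """
    if not 0 <= n_removed < len(blocks):
        raise ValueError("n_removed must leave at least one block")
    kept = blocks[: len(blocks) - n_removed]
    head = ("classifier", head_params(kept[-1][2], num_classes), num_classes)
    return kept + [head]


def total_params(blocks):
    return sum(params for _, params, _ in blocks)


# Toy three-block backbone (all numbers are made up for illustration).
toy_backbone = [("block1", 100, 8), ("block2", 400, 16), ("block3", 1600, 32)]
full = truncate(toy_backbone, 0, num_classes=14)
minus1 = truncate(toy_backbone, 1, num_classes=14)
ratio = total_params(full) / total_params(minus1)
```

In a real framework the kept blocks would retain their ImageNet weights while the head is randomly initialized, as the text describes; only the head's input width changes with the truncation point.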

3.3. Class Activation Maps

We compare the class activation maps (CAMs) among a truncated DenseNet121 family to visualize their higher resolution CAMs. We generate CAMs using the Grad-CAM method (Selvaraju et al., 2016), using a weighted combination of the model’s final convolutional feature maps, with weights based on the positive partial derivatives with respect to class score. This averaged map is scaled by the outputted probability so more confident predictions appear brighter. Finally, the map is upsampled to the input image resolution and overlain onto the input image, highlighting image regions that had the greatest influence on a model’s decision.
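The map construction can be sketched on plain nested lists. This follows the standard Grad-CAM formulation (channel weights from spatially averaged gradients, a weighted sum of feature maps, then a ReLU), which is an assumption: the wording above about positive partial derivatives may correspond to a slightly different weighting. The function name `grad_cam` is hypothetical, and upsampling and overlay are omitted.

```python
# Hypothetical sketch of a Grad-CAM style map on nested Python lists.
def grad_cam(feature_maps, gradients, probability=1.0):
    """Return an H x W map.

    feature_maps, gradients: lists of K channels, each an H x W nested list;
    gradients holds d(class score) / d(feature map) for the same locations.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # Channel weight: spatial mean of that channel's gradient.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for channel, alpha in zip(feature_maps, weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * channel[i][j]
    # Keep positive evidence only, then scale by the predicted probability
    # so that more confident predictions appear brighter.
    return [[max(0.0, value) * probability for value in row] for row in cam]
```

Because the map has the spatial size of the final convolutional feature maps, removing blocks that downsample leaves a larger H x W grid, which is why truncated models give finer maps.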

Figure 2. Average CheXpert AUC vs. ImageNet Top-1 Accuracy. The left plot shows results obtained without pretraining, while the right plot shows results with pretraining. There is no monotonic relationship between ImageNet and CheXpert performance without pretraining (Spearman ) or with pretraining (Spearman ).

Two scatterplots illustrating average CheXpert AUC against the ImageNet top-1 accuracy of each model.

4. Experiments

Figure 3. Average CheXpert AUC vs. Model Size. The left plot shows results obtained without pretraining, while the right plot shows results with pretraining. The logarithm of the model size has a near linear relationship with CheXpert performance when we omit pretraining (Spearman ). However once we incorporate pretraining, the monotonic relationship is weaker (Spearman ).

Two scatterplots illustrating average CheXpert AUC against the base-10 logarithm of the number of parameters for each model.

4.1. ImageNet Transfer Performance

We investigate whether higher performance on natural image classification translates to higher performance on chest X-ray classification. We display the relationship between the CheXpert AUC, with and without ImageNet pretraining, and ImageNet top-1 accuracy in Figure 2.

When models are trained without pretraining, we find no monotonic relationship between ImageNet top-1 accuracy and average CheXpert AUC, with Spearman at . Model performance without pretraining would describe how a given architecture would perform on the target task, independent of any pretrained weights. When models are trained with pretraining, we again find no monotonic relationship between ImageNet top-1 accuracy and average CheXpert AUC with Spearman at .

Overall, we find no relationship between ImageNet and CheXpert performance, so models that succeed on ImageNet do not necessarily succeed on CheXpert. These relationships between ImageNet performance and CheXpert performance are much weaker than the relationships between ImageNet performance and performance on various natural image tasks reported by Kornblith et al. (2018).

We compare the CheXpert performance within and across architecture families. Without pretraining, we find that ResNet101 performs only 0.005 AUC greater than ResNet18, which is well within the confidence interval of this metric (Figure 2). Similarly, DenseNet201 performs 0.004 AUC greater than DenseNet121 and EfficientNetB3 performs 0.003 AUC greater than EfficientNetB0. With pretraining, we continue to find minor performance differences between the largest model and smallest model that we test in each family. We find AUC increases of 0.002 for ResNet, 0.004 for DenseNet and -0.006 for EfficientNet. Thus, increasing complexity within a model family does not yield increases in CheXpert performance as meaningful as the corresponding increases in ImageNet performance.

Without pretraining, we find that the best model studied performs significantly better than the worst model studied. Among models trained without pretraining, we find that InceptionV3 performs best with 0.866 (0.851, 0.880) AUC, while MobileNetV2 performs worst with 0.814 (0.796, 0.832) AUC. Their difference in performance is 0.052 (0.043, 0.063) AUC. InceptionV3 is also the third largest architecture studied and MobileNetV2 the smallest. We find a significant difference in the CheXpert performance of these models. This difference again hints at the importance of architecture design.

4.2. CheXpert Performance and Efficiency

We examine whether larger architectures perform better than smaller architectures on chest X-ray interpretation, where architecture size is measured by number of parameters. We display these relationships in Figure 3 and Table 1.

Model            CheXpert AUC (95% CI)    #Params (M)
DenseNet121      0.859 (0.846, 0.871)      6.968
DenseNet169      0.860 (0.848, 0.873)     12.508
DenseNet201      0.864 (0.850, 0.876)     18.120
EfficientNetB0   0.859 (0.846, 0.871)      4.025
EfficientNetB1   0.858 (0.844, 0.872)      6.531
EfficientNetB2   0.866 (0.853, 0.880)      7.721
EfficientNetB3   0.853 (0.837, 0.867)     10.718
InceptionV3      0.862 (0.848, 0.876)     27.161
InceptionV4      0.861 (0.846, 0.873)     42.680
MNASNet          0.858 (0.845, 0.871)      5.290
MobileNetV2      0.854 (0.839, 0.869)      2.242
MobileNetV3      0.859 (0.847, 0.872)      4.220
ResNet101        0.863 (0.848, 0.876)     44.549
ResNet18         0.862 (0.847, 0.875)     11.690
ResNet34         0.863 (0.849, 0.875)     21.798
ResNet50         0.859 (0.843, 0.871)     25.557
Table 1. CheXpert AUC (with 95% Confidence Intervals) and Number of Parameters for 16 ImageNet-Pretrained Models.

Without ImageNet pretraining, we find a positive monotonic relationship between the number of parameters of an architecture and CheXpert performance, with Spearman significant at . With ImageNet pretraining, there is a weaker positive monotonic relationship between the number of parameters and average CheXpert AUC, with Spearman at .
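The rank correlation used throughout this section can be sketched as below, applied to the parameter counts and pretrained AUC point estimates from Table 1. The helper names are hypothetical, ties are handled with average ranks, and the result is not claimed to match the coefficient reported in the text, which may have been computed from unrounded values.

```python
# Hypothetical sketch: Spearman correlation as the Pearson correlation of
# average ranks, applied to the Table 1 values (with pretraining).
def average_ranks(values):
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Table 1, in row order: parameters in millions and CheXpert AUC.
params_m = [6.968, 12.508, 18.120, 4.025, 6.531, 7.721, 10.718, 27.161,
            42.680, 5.290, 2.242, 4.220, 44.549, 11.690, 21.798, 25.557]
auc = [0.859, 0.860, 0.864, 0.859, 0.858, 0.866, 0.853, 0.862,
       0.861, 0.858, 0.854, 0.859, 0.863, 0.862, 0.863, 0.859]
rho = spearman(params_m, auc)
```

Since Spearman depends only on ranks, using the raw parameter count or its logarithm gives the same coefficient.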

Although there exists a positive monotonic relationship between the number of parameters of an architecture and average CheXpert AUC, the Spearman does not highlight the increase in parameters necessary to realize marginal increases in CheXpert AUC. For example, ResNet101 is 11.1x larger than EfficientNetB0, but yields an increase of only 0.005 in CheXpert AUC with pretraining.

Within a model family, increasing the number of parameters does not lead to meaningful gains in CheXpert AUC. We see this relationship in all families studied without pretraining (EfficientNet, DenseNet, and ResNet) in Figure 3. For example, DenseNet201 has an AUC 0.003 greater than DenseNet121, but is 2.6x larger. EfficientNetB3 has an AUC 0.004 greater than EfficientNetB0, but is 1.9x larger. Despite the positive relationship between model size and CheXpert performance across all models, bigger does not necessarily mean better within a model family.

Since within a model family there is a weaker relationship between model size and CheXpert performance than across all models, we find that CheXpert performance is influenced more by the macro architecture design than by its size. Models within a family have similar architecture design choices but different sizes, so they perform similarly on CheXpert. We observe large discrepancies in performance between architecture families. For example, DenseNet, ResNet, and Inception typically outperform EfficientNet and MobileNet architectures, regardless of their size. EfficientNet, MobileNet, and MNASNet were all generated through neural architecture search to some degree, a process that optimized for performance on ImageNet. Our findings suggest that this search could have overfit to the natural image objective to the detriment of chest X-ray tasks.

Figure 4. Pretraining Boost vs. Model Size. We define pretraining boost as the increase in the average CheXpert AUCs achieved with pretraining vs. without pretraining. Most models benefit significantly from ImageNet pretraining. Smaller models tend to benefit more than larger models (Spearman ).

Pretraining boost with 95-percent confidence intervals plotted against the base-10 logarithm of the number of parameters in a model.

4.3. ImageNet Pretraining Boost

We study the effects of ImageNet pretraining on CheXpert performance by defining the pretraining boost as the CheXpert AUC of a model initialized with ImageNet pretraining minus the CheXpert AUC of its counterpart without pretraining. The pretraining boosts of our architectures are reported in Figure 4.
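As a small worked example of this definition, the sketch below uses point estimates that appear elsewhere in this paper: the pretrained AUCs from Table 1 and the without-pretraining AUCs for MobileNetV2 and InceptionV3 from Section 4.1. The function name is hypothetical, and the bootstrap confidence intervals on the boost are not reproduced.

```python
# Hypothetical sketch: pretraining boost from AUC point estimates.
def pretraining_boost(auc_pretrained, auc_without_pretraining):
    """CheXpert AUC with ImageNet pretraining minus AUC without it."""
    return auc_pretrained - auc_without_pretraining


# MobileNetV2: 0.854 with pretraining (Table 1), 0.814 without (Section 4.1).
print(round(pretraining_boost(0.854, 0.814), 3))  # 0.04
# InceptionV3: 0.862 with pretraining (Table 1), 0.866 without (Section 4.1).
print(round(pretraining_boost(0.862, 0.866), 3))  # -0.004
```

A positive value means pretraining helped; the slightly negative InceptionV3 value is in line with the observation that pretraining does not boost that architecture.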

We find that ImageNet pretraining provides a meaningful boost for most architectures (on average 0.015 AUC). We find a Spearman at between the number of parameters of a given model and the pretraining boost. Therefore, this boost tends to be larger for smaller architectures such as EfficientNetB0 (0.023), MobileNetV2 (0.040) and MobileNetV3 (0.033) and smaller for larger architectures such as InceptionV4 () and ResNet101 (0.013). Further work is required to explain this relationship.

Within a model family, the pretraining boost also does not meaningfully increase as model size increases. For example, DenseNet201 has a pretraining boost only 0.002 AUC greater than DenseNet121 does. This finding supports our earlier conclusion that model families perform similarly on CheXpert regardless of their size.

4.4. Truncated Architectures

We truncate the final blocks of DenseNet121, MNASNet, ResNet18, and EfficientNetB0 with pretrained weights and study their CheXpert performance to understand whether ImageNet models are unnecessarily large for the chest X-ray task. We express efficiency gains in terms of Times-Smaller, or the number of parameters of the original architecture divided by the number of parameters of the truncated architecture: intuitively, how many times larger the original architecture is compared to the truncated architecture. The efficiency gains and AUC changes of model truncation on DenseNet121, MNASNet, ResNet18, and EfficientNetB0 are displayed in Table 2.

Model                   AUC Change    Times-Smaller
EfficientNetB0           0.00%          1x
EfficientNetB0Minus1     0.15%          1.4x
EfficientNetB0Minus2    -0.45%          4.7x
MNASNet                  0.00%          1x
MNASNetMinus1           -0.07%          2.5x
MNASNetMinus2*          -2.30%         11.2x
MNASNetMinus3*          -2.51%         20.0x
MNASNetMinus4*          -6.40%        112.9x
DenseNet121              0.00%          1x
DenseNet121Minus1       -0.04%          1.6x
DenseNet121Minus2*      -1.33%          5.3x
DenseNet121Minus3*      -4.73%         20.0x
ResNet18                 0.00%          1x
ResNet18Minus1           0.24%          4.2x
ResNet18Minus2*         -3.70%         17.1x
ResNet18Minus3*         -8.33%         73.8x
Table 2. Efficiency Trade-Off of Truncated Models. Pretrained models can be truncated without significant decrease in CheXpert AUC. Truncated models with significantly different AUC from the base model are denoted with an asterisk.

For all four model families, we find that truncating the final block leads to no significant decrease in CheXpert AUC while yielding models that are 1.4x to 4.2x smaller. Notably, truncating the final block of ResNet18 yields a model that is not significantly different (difference -0.002 (-0.008, 0.004)) in CheXpert AUC, but is 4.2x smaller. Truncating the final two blocks of an EfficientNetB0 yields a model that is not significantly different (difference 0.004 (-0.003, 0.009)) in CheXpert AUC, but is 4.7x smaller. However, truncating two or more of the final blocks in each of MNASNet, DenseNet121, and ResNet18 yields models that have statistically significant drops in CheXpert performance.
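The Times-Smaller arithmetic can be checked with a short sketch. The averaging rule below, taking the most aggressive truncation per family that shows no significant AUC change in Table 2, is an assumption about how the 3.25x average is obtained; it does reproduce that figure. The names `times_smaller` and `deepest_non_significant` are hypothetical.

```python
# Hypothetical sketch: Times-Smaller and one way to arrive at the 3.25x average.
def times_smaller(original_params, truncated_params):
    """Parameters of the original model divided by those of the truncated one."""
    return original_params / truncated_params


# Table 2: the deepest truncation per family without an asterisk.
deepest_non_significant = {
    "EfficientNetB0Minus2": 4.7,
    "MNASNetMinus1": 2.5,
    "DenseNet121Minus1": 1.6,
    "ResNet18Minus1": 4.2,
}
average_gain = sum(deepest_non_significant.values()) / len(deepest_non_significant)
print(round(average_gain, 2))  # 3.25

# Implied size of DenseNet121Minus1, from 6.968M parameters (Table 1) and 1.6x.
implied_params_m = 6.968 / 1.6
```

The implied size is only as precise as the rounded 1.6x ratio, so it should be read as an estimate of roughly 4.4M parameters rather than an exact count.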

Figure 5. Comparison of Class Activation Maps Among Truncated Model Family. CAMs yielded by models, from left to right, DenseNet121, DenseNet121Minus1, and DenseNet121Minus2. Displays frontal chest X-ray demonstrating Atelectasis (top) and Edema (bottom). Further truncated models more effectively localize the Atelectasis, as well as tracing the hila and vessel branching for Edema.

Six class activation maps of two frontal chest X-ray images showing finer detail from left to right.

Model truncation effectively compresses models performant on CheXpert, making them more parameter efficient while still using pretrained weights to capture the pretraining boost. Parameter efficient models are able to lighten the computational and memory burdens for deployment to low-resource environments such as portable devices. In the clinical setting, the simplicity of our model truncation method encourages its adoption for model compression.

This finding corroborates Raghu et al. (2019) and Bressem et al. (2020), which show simpler models can achieve performance comparable to more complex models on CheXpert. Our truncated models can use readily-available pretrained weights, which may allow these models to capture the pretraining boost and speed up training. However, we do not study the performance of these truncated models without their pretrained weights.

As an additional benefit, architectures that truncate pooling layers will also produce higher-resolution class activation maps, as shown in Figure 5. The higher-resolution class activation maps (CAMs) may more effectively localize pathologies with little to no decrease in classification performance. In clinical settings, improved explainability through better CAMs may be useful for validating predictions and diagnosing mispredictions. As a result, clinicians may have more trust in models that provide these higher-resolution CAMs.

5. Discussion

In this work, we study the performance and efficiency of ImageNet architectures for chest X-ray interpretation.

Is ImageNet performance correlated with CheXpert?

No. We show no statistically significant relationship between ImageNet and CheXpert performance. This finding extends Kornblith et al. (2018), which found a significant correlation between ImageNet performance and transfer performance on typical image classification datasets, to the medical setting of chest X-ray interpretation. This difference could be attributed to unique aspects of the chest X-ray interpretation task and its data attributes. First, the chest X-ray interpretation task differs from natural image classification in that (1) disease classification may depend on abnormalities in a small number of pixels, (2) chest X-ray interpretation is a multi-task classification setup, and (3) there are far fewer classes than in many natural image classification datasets. Second, the data attributes for chest X-rays differ from those of natural images in that X-rays are greyscale and have similar spatial structures across images (always either anterior-posterior, posterior-anterior, or lateral).

Does model architecture matter?

Yes. For models without pretraining, we find that the choice of architecture family may influence performance more than model size. Our findings extend Raghu et al. (2019) beyond the effect of ImageNet weights, since we show that architectures that succeed on ImageNet do not necessarily succeed on medical imaging tasks. A notable finding of our work is that newer architectures generated through search on ImageNet (EfficientNet, MobileNet, MNASNet) underperform older architectures (DenseNet, ResNet) on CheXpert. This finding suggests that search may have overfit to ImageNet to the detriment of medical task performance, and ImageNet may not be an appropriate benchmark for selecting architectures for medical imaging tasks. Instead, medical imaging architectures could be benchmarked on CheXpert or other large medical datasets. Architectures derived from selection and search on CheXpert and other large medical datasets may be applicable to similar medical imaging modalities including other x-ray studies, or CT scans. Thus architecture search directly on CheXpert or other large medical datasets may allow us to unlock next generation performance for medical imaging tasks.

Does ImageNet pretraining help?

Yes. We find that ImageNet pretraining yields a statistically significant boost in performance for chest X-ray classification. Our findings are consistent with Raghu et al. (2019), who find no pretraining boost on ResNet50 and InceptionV3, but we find pretraining does boost performance for 12 out of 16 architectures. Our findings extend He et al. (2018), who find that models without pretraining had comparable performance to models pretrained on ImageNet for object detection and image segmentation of natural images, to the medical imaging setting. Future work may investigate the relationship between network architectures and the impact of self-supervised pretraining for chest X-ray interpretation, as recently developed by Sowrirajan et al. (2020), Azizi et al. (2021), and Sriram et al. (2021).

Can models be smaller?

Yes. We find that by truncating final blocks of ImageNet-pretrained architectures, we can make models 3.25x more parameter-efficient on average without a statistically significant drop in performance. This method preserves the critical components of architecture design while cutting its size. This observation suggests model truncation may be a simple method to yield lighter models, using ImageNet pretrained weights to boost CheXpert performance. In the clinical setting, truncated models may provide value through improved parameter-efficiency and higher resolution CAMs. This change may enable deployment to low-resource clinical environments and further develop model trust through improved explainability.

In closing, our work contributes to the understanding of the transfer performance and parameter efficiency of ImageNet models for chest X-ray interpretation. We hope that our new experimental evidence about the relation of ImageNet to medical task performance will shed light on potential future directions for progress.


  • I. D. Apostolopoulos and T. A. Mpesiana (2020) Covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks. Physical and Engineering Sciences in Medicine, pp. 1. Cited by: §1.
  • S. Azizi, B. Mustafa, F. Ryan, Z. Beaver, J. Freyberg, J. Deaton, A. Loh, A. Karthikesalingam, S. Kornblith, T. Chen, V. Natarajan, and M. Norouzi (2021) Big self-supervised models advance medical image classification. External Links: 2101.05224 Cited by: §5.
  • K. K. Bressem, L. Adams, C. Erxleben, B. Hamm, S. Niehues, and J. Vahldiek (2020) Comparing different deep learning architectures for classification of chest radiographs. External Links: 2002.08991 Cited by: §2.2, §2.2, §4.4.
  • R. Cadene (2018) Pretrainedmodels 0.7.4. Cited by: §3.1.
  • S. Chen and Q. Zhao (2019) Shallowing deep networks: layer-wise pruning based on feature representations. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (12), pp. 3048–3056. Cited by: §2.3.
  • Y. Cheng, D. Wang, P. Zhou, and T. Zhang (2017) A survey of model compression and acceleration for deep neural networks. CoRR abs/1710.09282. External Links: Link, 1710.09282 Cited by: §2.3.
  • F. Chollet (2016) Xception: deep learning with depthwise separable convolutions. CoRR abs/1610.02357. External Links: Link, 1610.02357 Cited by: §2.3.
  • J. De Fauw, J. R. Ledsam, B. Romera-Paredes, S. Nikolov, N. Tomasev, S. Blackwell, H. Askham, X. Glorot, B. O’Donoghue, D. Visentin, G. van den Driessche, B. Lakshminarayanan, C. Meyer, F. Mackinder, S. Bouton, K. Ayoub, R. Chopra, D. King, A. Karthikesalingam, C. O. Hughes, R. Raine, J. Hughes, D. A. Sim, C. Egan, A. Tufail, H. Montgomery, D. Hassabis, G. Rees, T. Back, P. T. Khaw, M. Suleyman, J. Cornebise, P. A. Keane, and O. Ronneberger (2018) Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24 (9), pp. 1342–1350. External Links: ISSN 1546-170X, Document, Link Cited by: §1, §2.2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §1.
  • A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639), pp. 115–118. External Links: Document Cited by: §1.
  • K. He, R. B. Girshick, and P. Dollár (2018) Rethinking imagenet pre-training. CoRR abs/1811.08883. External Links: Link, 1811.08883 Cited by: §2.1, §5.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. External Links: 1503.02531 Cited by: §2.3.
  • J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. L. Ball, K. S. Shpanskaya, J. Seekins, D. A. Mong, S. S. Halabi, J. K. Sandberg, R. Jones, D. B. Larson, C. P. Langlotz, B. N. Patel, M. P. Lungren, and A. Y. Ng (2019) CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. CoRR abs/1901.07031. External Links: Link, 1901.07031 Cited by: §2.2, §3.1.
  • M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. CoRR abs/1405.3866. External Links: Link, 1405.3866 Cited by: §2.3.
  • S. Kornblith, J. Shlens, and Q. V. Le (2018) Do better imagenet models transfer better?. CoRR abs/1805.08974. External Links: Link, 1805.08974 Cited by: §2.1, §4.1, §5.
  • F. Li, Z. Liu, H. Chen, M. Jiang, X. Zhang, and Z. Wu (2019) Automatic Detection of Diabetic Retinopathy in Retinal Fundus Photographs Based on Deep Learning Algorithm. Translational Vision Science & Technology 8 (6), pp. 4–4. External Links: ISSN 2164-2591, Document, Link Cited by: §1, §2.2.
  • A. Mitani, A. Huang, S. Venugopalan, G. S. Corrado, L. Peng, D. R. Webster, N. Hammel, Y. Liu, and A. V. Varadarajan (2020) Detection of anaemia from retinal fundus images via deep learning. Nature Biomedical Engineering 4 (1), pp. 18–27. External Links: ISSN 2157-846X, Document, Link Cited by: §1, §2.2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §3.1.
  • M. Raghu, C. Zhang, J. M. Kleinberg, and S. Bengio (2019) Transfusion: understanding transfer learning with applications to medical imaging. CoRR abs/1902.07208. External Links: Link, 1902.07208 Cited by: §2.1, §4.4, §5, §5.
  • P. Rajpurkar, A. Joshi, A. Pareek, P. Chen, A. Kiani, J. A. Irvin, A. Ng, and M. Lungren (2020) CheXpedition: investigating generalization challenges for translation of chest x-ray algorithms to the clinical setting. ArXiv abs/2002.11379. Cited by: §2.2.
  • P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Y. Ding, A. Bagul, C. Langlotz, K. S. Shpanskaya, M. P. Lungren, and A. Y. Ng (2017) CheXNet: radiologist-level pneumonia detection on chest x-rays with deep learning. CoRR abs/1711.05225. External Links: Link, 1711.05225 Cited by: §1.
  • Y. Ro and J. Y. Choi (2020) Layer-wise pruning and auto-tuning of layer-wise learning rates in fine-tuning of deep networks. External Links: 2002.06048 Cited by: §2.3.
  • R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra (2016) Grad-cam: why did you say that? visual explanations from deep networks via gradient-based localization. CoRR abs/1610.02391. External Links: Link, 1610.02391 Cited by: §3.3.
  • H. Sowrirajan, J. Yang, A. Y. Ng, and P. Rajpurkar (2020) MoCo pretraining improves representation and transferability of chest x-ray models. External Links: 2010.05352 Cited by: §5.
  • S. Srinivas and R. V. Babu (2015) Data-free parameter pruning for deep neural networks. CoRR abs/1507.06149. External Links: Link, 1507.06149 Cited by: §2.3.
  • A. Sriram, M. Muckley, K. Sinha, F. Shamout, J. Pineau, K. J. Geras, L. Azour, Y. Aphinyanaphongs, N. Yakubova, and W. Moore (2021) COVID-19 deterioration prediction via self-supervised representation learning and multi-image prediction. External Links: 2101.04909 Cited by: §5.
  • X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. CoRR abs/1705.02315. External Links: Link, 1705.02315 Cited by: §1.
  • R. Wightman (2020) Timm 0.2.1. Cited by: §3.1.
  • L. Zhang, M. Yuan, Z. An, X. Zhao, H. Wu, H. Li, Y. Wang, B. Sun, H. Li, S. Ding, X. Zeng, L. Chao, P. Li, and W. Wu (2020) Prediction of hypertension, hyperglycemia and dyslipidemia from retinal fundus photographs via deep learning: a cross-sectional study of chronic diseases in central china. PLOS ONE 15 (5), pp. 1–11. External Links: Link, Document Cited by: §1, §2.2.