One Hyper-Initializer for All Network Architectures in Medical Image Analysis

by Fangxin Shang, et al.

Pre-training is essential to deep learning model performance, especially in medical image analysis tasks where limited training data are available. However, existing pre-training methods are inflexible, as the pre-trained weights of one model cannot be reused by other network architectures. In this paper, we propose an architecture-irrelevant hyper-initializer, which can initialize any given network architecture well after being pre-trained only once. The proposed initializer is a hypernetwork which takes a downstream architecture as an input graph and outputs the initialization parameters of that architecture. We show the effectiveness and efficiency of the hyper-initializer through extensive experimental results on multiple medical imaging modalities, especially in data-limited settings. Moreover, we show that the proposed algorithm can be reused as a favorable plug-and-play initializer for any downstream architecture and task (both classification and segmentation) of the same modality.



1 Introduction

Deep learning algorithms have made great progress in various medical imaging modalities, such as fundus photography [13, 33], computed tomography (CT) [32, 34], X-Ray [15], and magnetic resonance imaging (MRI) [31]. However, deep models often require large-scale annotated data, while labeled data for medical imaging tasks are limited due to the time-consuming and expensive annotation process.

To achieve better performance with limited data, there are mainly two kinds of weight initialization approaches for the cold-start of medical imaging tasks. One is supervised pre-training, where pre-designed deep networks are trained on large-scale labeled source data (e.g., ImageNet [7]) before being fine-tuned for downstream tasks [27, 24, 2]. The other is self-supervised pre-training, where a model is trained on a large amount of unlabeled data of the same modality as the downstream task in a self-supervised manner [3, 10, 4].

Previous works on weight initialization indicate the following conclusions:


  • Pre-training can enhance the performance of the downstream task in the small-data regime. Chen et al. [4] show that self-supervised pre-training can learn useful semantic features for multiple medical imaging modalities including CT, MR, and ultrasound images. Peng et al. [24] verify on a COVID-19 classification task that ImageNet pre-training benefits data-limited learning.

  • Pre-training can speed up convergence in training. Ghesu et al. [10] report that a well-designed self-supervised initializer converges 72% faster than no pre-training in pneumothorax classification. Although pre-training with ImageNet does not necessarily improve accuracy, due to the fundamental mismatch between natural and medical images, it helps convergence compared to random initialization [16, 26].

Although initializing models by supervised or self-supervised pre-training is a favorable paradigm for medical image analysis tasks, current pre-training methods are inflexible: it is time- and energy-consuming to repeat the pre-training procedure for every new architecture. Strubell et al. [30] report that training deep learning models on abundant data incurs striking financial and environmental costs; the carbon emission of training the popular BERT model [8] on NVIDIA V100 GPUs for 72 hours is roughly equivalent to a trans-American flight. ImageNet pre-training for ResNet-50 [17] takes approximately 256 Tesla P100 GPU-hours with Caffe2 [12] and 225 Tesla P40 GPU-hours with TensorFlow [19]. As ever more capable network architectures are proposed in the community, it is not a good bargain to pre-train every new model for initialization on different downstream tasks.

To maintain the merits of pre-training while avoiding retraining, we propose an architecture-irrelevant initializer named hyper-initializer, which only needs to be pre-trained once to provide effective, efficient, and scalable initial parameters for any unseen network.

Inspired by [14] and [20], the proposed initializer is designed as a hypernetwork which mainly consists of three modules: 1) a node embedding module, which embeds the nodes of the input network (i.e., basic operations such as convolution, batch normalization, and activation) into a feature space; 2) a gated graph neural network (GatedGNN), which learns the feature representation of the input architecture; 3) a decoder module with multiple linear and convolutional layers, which transforms the GatedGNN features into the parameters of the input network. Our hyper-initializer is trained on unlabeled medical images in a self-supervised way, and then predicts the initial parameters of any downstream model for the same image modality.

Different from the related hypernetwork in [20], which is pre-trained on a large-scale labeled dataset, we optimize the proposed hyper-initializer in a self-supervised way. Therefore, we can leverage feature representations learned from unlabeled data of a similar modality to the target tasks. To the best of our knowledge, this is the first work that can initialize any unseen medical imaging model with a hypernetwork. The main benefits of this paper include:


  • Effectiveness and efficiency: By evaluating the method on multiple medical image modalities, we show that initializing downstream models with the hyper-initializer enhances target-task performance in terms of both accuracy and convergence. Additionally, by examining the downstream models through interpretability analysis, we find that the hyper-initializer can transfer valuable feature attention from the pre-training source to the target domain.

  • Scalability: Unlike the conventional initialization paradigm, in which every new architecture needs to be pre-trained, the proposed hyper-initializer only needs to be pre-trained once per image modality, and can then generate effective and efficient initial parameters for any unseen architecture and task. The proposed algorithm is therefore a convenient and scalable paradigm for the development of cold-start tasks.

2 Method

Preliminary: Given an objective function $\mathcal{L}$ and an $N$-sample training dataset $\mathcal{D} = \{x_i\}_{i=1}^{N}$, the conventional optimization procedure of a given neural network $f$ can be formalized as Eq. (1):

$$w^{*} = \arg\min_{w} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x_i; w)\big), \qquad (1)$$

where $w$ denotes the weights of $f$, and $f(x_i; w)$ represents a forward pass of $f$ with $w$. As a neural network can be described by a directed acyclic graph (DAG) [14], the operations (e.g., convolutions, batch normalizations, activations) are represented by initial node features $V$, where each node feature is a one-hot vector over the operation categories. The connections between the operations (nodes) are described by a binary adjacency matrix $A$, so an architecture is represented as $a = (V, A)$.

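To make this graph encoding concrete, the following minimal sketch encodes a small conv → batch-norm → ReLU chain as one-hot node features $V$ and a binary adjacency matrix $A$. The operation vocabulary and the chain topology here are illustrative, not the paper's exact design space.

```python
# Encode a tiny conv -> batch-norm -> ReLU network as a DAG:
# one-hot node features V over an illustrative operation vocabulary,
# and a binary adjacency matrix A (A[i][j] = 1 iff node i feeds node j).

OPS = ["conv", "batch_norm", "relu", "linear"]  # illustrative categories

def one_hot(op):
    v = [0] * len(OPS)
    v[OPS.index(op)] = 1
    return v

nodes = ["conv", "batch_norm", "relu"]
V = [one_hot(op) for op in nodes]

n = len(nodes)
A = [[0] * n for _ in range(n)]
for i in range(n - 1):          # a simple chain: node i -> node i+1
    A[i][i + 1] = 1

print(V)  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
print(A)  # [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
```

Real design spaces contain branching graphs (residual connections, multi-input nodes), which the adjacency matrix represents in exactly the same way.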
Figure 1: Training procedure of the hyper-initializer. First, an architecture and a batch of images are randomly sampled from the architecture collection and the unlabeled image dataset; second, the hyper-initializer generates parameters for the input architecture, and the self-supervised samples generated by the transformation function are fed into the initialized network; finally, the self-supervised objective function is optimized to update the hyper-initializer by gradient descent.

Construction of the hyper-initializer: In this paper, we propose a hyper-initializer that directly predicts modality-specific initialization parameters for any neural architecture in a single forward pass. As illustrated in the gray area of Fig. 1, the proposed hyper-initializer mainly consists of three modules. 1) The node embedding module embeds the initial node features into $d$-dimensional features through an embedding layer, which is a lookup table with learnable parameters. 2) A $K$-layer gated graph neural network (GatedGNN), a variant from the recent work [20], learns the features and associations implicit in the computational graph. The forward pass of the GatedGNN updates the feature $h_v^{(k)}$ of node $v$ at the $k$-th graph layer as

$$h_v^{(k)} = \mathrm{GRU}\Big(h_v^{(k-1)},\; \sum_{u \in \mathcal{N}_v} \mathrm{MLP}\big(h_u^{(k-1)}\big) + \sum_{u \in \tilde{\mathcal{N}}_v} \mathrm{MLP}_{sp}\big(h_u^{(k-1)}\big) / s_{vu}\Big), \qquad (2)$$

where $\mathrm{MLP}$ and $\mathrm{MLP}_{sp}$ are two multi-layer perceptrons; $s_{vu}$ is the shortest-path length between nodes $v$ and $u$; $\mathcal{N}_v$ is the set of incoming neighbors of node $v$; $\tilde{\mathcal{N}}_v$ is the set of distant nodes satisfying $1 < s_{vu} \leq \tau$, where $\tau$ is a hyperparameter set to 50; and $\mathrm{GRU}$ is the update function of the Gated Recurrent Unit [5]. 3) A decoder module with multiple linear and convolutional layers transforms the GatedGNN output $h_v^{(K)}$ into the parameters for node (operation) $v$. The detailed architecture of the hyper-initializer is presented in the supplementary materials due to length limitations.
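The message-passing flow of such a gated graph layer can be sketched as follows. This is a toy illustration with scalar node features and hand-set weights, not the GatedGNN of [20]: real layers use vector features, learned MLPs, and a full GRU cell.

```python
import math

# Toy gated message-passing layer: each node aggregates messages from its
# incoming neighbours and updates its state with a GRU-style gate.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp(h, w=0.5, b=0.1):
    """Stand-in for the learned message MLP: ReLU(w*h + b)."""
    return max(0.0, w * h + b)

def gated_update(h, m):
    z = sigmoid(m)               # update gate driven by the message
    h_tilde = math.tanh(h + m)   # candidate state
    return (1 - z) * h + z * h_tilde

def layer(H, A):
    """One propagation step over adjacency A (A[u][v] = 1 means u -> v)."""
    n = len(H)
    out = []
    for v in range(n):
        m = sum(mlp(H[u]) for u in range(n) if A[u][v])  # incoming messages
        out.append(gated_update(H[v], m))
    return out

H = [1.0, 0.5, -0.2]
A = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]   # chain: node 0 -> 1 -> 2
H1 = layer(H, A)
```

Stacking $K$ such layers lets information propagate $K$ hops along the computational graph; the virtual shortest-path edges in Eq. (2) shortcut this propagation for deep architectures.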

Hyper-initializer training: The training procedure of the hyper-initializer is shown in Fig. 1. Firstly, we construct the architecture collection by sampling architectures from the network design space. The space covers a high variety of stems, including VGG [28], ResNets [17], MobileNet [18], and other recently proposed architectures. We follow the hypothesis in [20] that increased training diversity improves the hyper-initializer's generalization to unseen architectures. Secondly, images and architectures are randomly sampled from the unlabeled medical image dataset and the architecture collection, and the transformation function generates self-supervised sample pairs; in this paper, the transformation is implemented as the rotation-angle classification task proposed by [11]. Finally, following the conventional optimization procedure in Eq. (1), the parameters of the hyper-initializer are obtained by minimizing the bi-level self-supervised optimization problem:


$$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\Big(f\big(T(x_i);\; H_{\theta}(a)\big)\Big), \qquad (3)$$

where $T$ is the self-supervised transformation function, $H_{\theta}$ denotes the hyper-initializer, and the parameter set $\theta = \{\theta_e, \theta_g, \theta_d\}$ represents the learnable parameters of the embedding encoder, GatedGNN, and decoder, respectively.

Once the hyper-initializer converges, a downstream task with any unseen architecture can fine-tune from the parameters generated by the hyper-initializer in seconds, which is convenient and time-saving.
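The rotation pretext task of [11], used above as the transformation function, can be sketched on a toy nested-list "image". Real implementations rotate image tensors with library ops and train a 4-way classifier on the resulting labels; this only shows how the self-supervised pairs are formed.

```python
# Toy rotation pretext task: rotate an image by k * 90 degrees and use k
# as the self-supervised label, giving four (image, label) pairs per sample.

def rot90(img):
    """Rotate a 2D nested-list image 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]

def make_rotation_pairs(img):
    """Return the four (rotated_image, label) self-supervised pairs."""
    pairs = []
    cur = img
    for k in range(4):
        pairs.append((cur, k))
        cur = rot90(cur)
    return pairs

img = [[1, 2],
       [3, 4]]
pairs = make_rotation_pairs(img)
```

Predicting the rotation label forces the initialized network to extract orientation-sensitive semantic features without any manual annotation.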

3 Experiments

Datasets: To evaluate the proposed algorithm thoroughly, we use six publicly available medical image datasets over three modalities: fundus photography, computed tomography (CT), and X-Ray radiography. Concretely, EyePACS [6], APTOS2019 [1], DRIVE [29], and Refuge-2 [21] contain fundus images for classification and segmentation tasks. CT and X-Ray images are from the RSNA Intracranial Hemorrhage Detection (RSNA-IHD) [9] and RSNA Pneumonia Detection Challenge (RSNA-CXR) [23] datasets, respectively. The evaluation metric for diabetic retinopathy grading on EyePACS and APTOS is Kappa, a commonly used metric for multi-grade classification. For binary classification on RSNA-IHD and RSNA-CXR, we use the AUC (area under the receiver operating characteristic curve) metric. Additionally, the Dice metric is used to evaluate performance on Refuge-2 for optic cup/disk segmentation and on DRIVE for vessel segmentation. The details of the datasets are listed in Table 1.


Modality | Dataset        | Task | Metric | Training Set | Evaluation Set
Fundus   | EyePACS [6]    | Cls  | Kappa  | 35125        | 42670
Fundus   | APTOS [1]      | Cls  | Kappa  | 2929         | 733
Fundus   | Refuge-2 [21]  | Seg  | Dice   | 800          | 400
Fundus   | DRIVE [29]     | Seg  | Dice   | 20           | 20
CT       | RSNA-IHD [9]   | Cls  | AUC    | 60224        | 15056
X-Ray    | RSNA-CXR [23]  | Cls  | AUC    | 24181        | 6046
Table 1: Details of the datasets used in the experiments. The datasets cover multiple modalities and tasks; Cls and Seg are short for classification and segmentation.
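The Kappa metric for multi-grade classification can be computed as below. This is a minimal implementation of quadratically weighted Cohen's kappa, the variant conventionally used for diabetic retinopathy grading; the paper does not spell out its exact weighting here, so the quadratic choice is an assumption.

```python
# Quadratic-weighted Cohen's kappa for integer grades 0..num_classes-1.
# 1.0 = perfect agreement, 0.0 = chance-level, negative = worse than chance.

def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    n = len(y_true)
    O = [[0.0] * num_classes for _ in range(num_classes)]   # observed counts
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1
    hist_t = [sum(1 for t in y_true if t == i) for i in range(num_classes)]
    hist_p = [sum(1 for p in y_pred if p == i) for i in range(num_classes)]
    num = den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2 / (num_classes - 1) ** 2       # quadratic weight
            num += w * O[i][j]                              # observed disagreement
            den += w * hist_t[i] * hist_p[j] / n            # expected by chance
    return 1.0 - num / den

k = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 5)  # perfect -> 1.0
```

The quadratic weight penalizes predictions more the further they fall from the true grade, which matters for ordinal labels such as retinopathy severity.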

Implementation details: All experiments are implemented with PyTorch 1.10 on NVIDIA P40 GPUs. The hyper-initializer training phase lasts 50 epochs, with resized input images augmented by RandomResizedCrop. For each modality, the unsupervised dataset consists of all available images. In the downstream fine-tuning phase, all samples are resized to the same resolution; besides the data augmentations used for hyper-initializer training, RandomRotation and RandomFlip are also applied. Cosine learning rate annealing [22] is used in all training procedures.
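The cosine annealing schedule [22] used above can be sketched as follows (no warm restarts; the initial rate lr_max=0.01 is illustrative, since the paper's starting value is not recoverable from this text):

```python
import math

# Cosine learning-rate annealing: the rate decays smoothly from lr_max
# at epoch 0 to lr_min at total_epochs, following half a cosine period.

def cosine_lr(epoch, total_epochs, lr_max=0.01, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing."""
    t = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

schedule = [cosine_lr(e, 50) for e in range(51)]  # 50-epoch run
```

In practice this corresponds to PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` attached to the optimizer.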

3.1 Effectiveness Evaluation

To verify the effectiveness of the proposed algorithm, we compare downstream models initialized by the hyper-initializer with random initialization (training from scratch) on three modalities. To observe the sensitivity of the initializers to the number of training samples, we compare the two initializers on subsets of the training data at different scales.

As shown in Fig. 2, the models achieve better performance on all modalities when initialized with the hyper-initializer, especially when the number of training samples is small, which reveals that the proposed algorithm is most beneficial in the small-data regime.

Figure 2: Performance improvement versus the number of training samples. The horizontal axis represents the retained ratio of the complete training set; the vertical axis represents the best performance achieved by the model, showing the benefit of the hyper-initializer at different data scales.
Figure 3: Evolution of class activation maps (CAM) under different initialization methods. Higher heatmap values indicate larger activation and attention on the corresponding regions. The hyper-initialized models quickly focus on the lesion regions in all three modalities, while the activation regions of the randomly initialized models tend to drift in the first several fine-tuning epochs.

Additionally, we explore the class activation maps (CAM) [35] of the downstream models under the two initialization methods, visualizing the checkpoints at epochs 1, 3, 5, 10, and 20. The CAM heatmaps in Fig. 3 highlight the regions the classification models consider relevant. Almost all the salient regions indicated by the hyper-initialized models cover the class-related lesions in the early fine-tuning epochs, whereas the attention of the randomly initialized models varies over the first several epochs. Models initialized with the proposed algorithm therefore quickly achieve stable and accurate class-related attention, which reveals that the hyper-initializer transfers valuable modality-specific features from the pre-training source to the downstream tasks.
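The CAM computation itself is simple: the heatmap for a class is the classifier-weighted sum of the last convolutional layer's feature maps. A minimal sketch with toy inputs (real CAMs use the learned weights connecting globally averaged channels to the class logit, then upsample the heatmap to image size):

```python
# Class activation map (CAM): heat[y][x] = sum_k w[k] * feature_maps[k][y][x],
# where w[k] is the classifier weight from channel k to the target class.

def cam(feature_maps, class_weights):
    """feature_maps: K x H x W nested lists; class_weights: length-K list."""
    K = len(feature_maps)
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    heat = [[0.0] * W for _ in range(H)]
    for k in range(K):
        for y in range(H):
            for x in range(W):
                heat[y][x] += class_weights[k] * feature_maps[k][y][x]
    return heat

fmaps = [[[1.0, 0.0], [0.0, 0.0]],     # channel 0 activates top-left
         [[0.0, 0.0], [0.0, 2.0]]]     # channel 1 activates bottom-right
w_c = [1.0, 0.5]                        # toy class weights
heat = cam(fmaps, w_c)
```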

3.2 Efficiency Evaluation

Figure 4: Convergence curves of the models on three modalities initialized by the hyper-initializer and by random initialization.

To verify that the hyper-initializer speeds up the convergence of downstream tasks, we record the convergence curves of downstream models initialized by the hyper-initializer and by random initialization during fine-tuning. As seen in Fig. 4, the horizontal axis is the training epoch and the vertical axis is the corresponding performance on the validation set. The hyper-initialized models converge faster than the randomly initialized ones on all three modalities. Note that the speed-up from the hyper-initializer is less conspicuous on RSNA-IHD than on the other datasets; the reason is that the RSNA-IHD training set (over 60 thousand images) is much larger than the other two, which is in line with the aforementioned conclusion that the hyper-initializer is most suitable for data-limited tasks.

3.3 Scalability Evaluation

We have shown above that the hyper-initializer can be used for multiple image modalities. In this section, we verify its scalability by applying it to more neural architectures and downstream tasks. Theoretically, since the hyper-initializer is a type of hypernetwork, it can be applied to any network architecture by the intrinsic property of hypernetworks [20]. As seen in Table 2, the proposed hyper-initializer generates effective initial parameters not only for various ResNets [17] but also for the RegNet family [25], a recent state-of-the-art line of architectures derived by searching network design spaces.

Besides the classification tasks, we also evaluate our method on vessel and optic cup/disk segmentation tasks, which are common in fundus image analysis. For U-Net family models, our hyper-initializer generates initial parameters only for the encoder and leaves the decoder randomly initialized. The proposed algorithm also generates favorable initial parameters for the segmentation tasks.

Modality | Dataset  | Architecture  | Random Initialization | Hyper-Initialization
Fundus   | EyePACS  | ResNet-18     | 0.7433 | 0.7835
Fundus   | EyePACS  | ResNet-50     | 0.7870 | 0.7940
Fundus   | EyePACS  | ResNet-101    | 0.7311 | 0.7734
Fundus   | EyePACS  | RegNetY-1.6GF | 0.6867 | 0.7981
Fundus   | APTOS    | RegNetY-1.6GF | 0.8472 | 0.8931
Fundus   | Refuge-2 | UNet          | 0.7086 | 0.7816
Fundus   | DRIVE    | UNet          | 0.6853 | 0.7079
CT       | RSNA-IHD | ResNet-18     | 0.9584 | 0.9639
CT       | RSNA-IHD | RegNetY-1.6GF | 0.9589 | 0.9605
X-Ray    | RSNA-CXR | ResNet-18     | 0.8844 | 0.9006
X-Ray    | RSNA-CXR | RegNetY-1.6GF | 0.8976 | 0.9020
Table 2: Evaluation metrics for classification and segmentation tasks on fundus, CT, and X-Ray images. The results illustrate that the proposed algorithm scales to various architectures and tasks.

4 Conclusion

In this paper, we presented an architecture-irrelevant hyper-initializer for initializing medical imaging models. Designed as a type of hypernetwork, the proposed initializer can generate initial parameters for any input network architecture. We optimize it with a simple self-supervised learning approach, and extensive experimental evaluations reveal that the hyper-initializer can enhance accuracy and speed up convergence across multiple medical image modalities, especially in the small-data regime. To the best of our knowledge, this is the first hyper-initializer for medical imaging tasks, and we expect the idea of hyper-initialization to be extended by future researchers. In future work, we will try more architectures and self-supervised methods for the hyper-initializer, and will extend the architecture collection from 2D to 3D operations, enabling initial parameter prediction for 3D models.


  • [1] Aptos 2019 blindness detection,
  • [2] Alzubaidi, L., Al-Amidie, M., Al-Asadi, A., Humaidi, A.J., Al-Shamma, O., Fadhel, M.A., Zhang, J., Santamaría, J., Duan, Y.: Novel transfer learning approach for medical imaging with limited labeled data. Cancers 13, 1590 (2021)
  • [3] Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., Deaton, J., Loh, A., Karthikesalingam, A., Kornblith, S., Chen, T., Natarajan, V., Norouzi, M.: Big self-supervised models advance medical image classification. In: International Conference on Computer Vision (ICCV) (2021)

  • [4] Chen, L., Bentley, P., Mori, K., Misawa, K., Fujiwara, M., Rueckert, D.: Self-supervised learning for medical image analysis using image context restoration. Medical Image Analysis 58, 101539–101539 (2019)
  • [5] Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
  • [6] Cuadros, J., Bresnick, G.: Eyepacs: an adaptable telemedicine system for diabetic retinopathy screening. Journal of diabetes science and technology 3(3), 509–516 (2009)
  • [7] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR (2009)
  • [8] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding (2019)
  • [9] Flanders, A.E., Prevedello, L.M., Shih, G., Halabi, S.S., Kalpathy-Cramer, J., Ball, R., Mongan, J.T., Stein, A., Kitamura, F.C., Lungren, M.P., et al.: Construction of a machine learning dataset through collaboration: the rsna 2019 brain ct hemorrhage challenge. Radiology: Artificial Intelligence 2(3), e190211 (2020)
  • [10] Ghesu, F.C., Georgescu, B., Mansoor, A., Yoo, Y., Neumann, D., Patel, P., Vishwanath, R.S., Balter, J.M., Cao, Y., Grbic, S., Comaniciu, D.: Self-supervised learning from 100 million medical images (2022)
  • [11] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018)
  • [12] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch sgd: Training imagenet in 1 hour (2018)
  • [13] Gulshan, V., Peng, L., Coram, M., Stumpe, M., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., Kim, R., Raman, R., Nelson, P., Mega, J., Webster, D.: Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316(22), 2402–2410 (2016)
  • [14] Ha, D., Dai, A., Le, Q.V.: Hypernetworks (2016)
  • [15] Hammoudi, K., Benhabiles, H., Melkemi, M., Dornaika, F., Arganda-Carreras, I., Collard, D., Scherpereel, A.: Deep learning on chest x-ray images to detect and evaluate pneumonia cases at the era of covid-19. Journal of Medical Systems 45, 75–75 (2021)
  • [16] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: International Conference on Computer Vision (2019)
  • [17] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition (2016)

  • [18] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
  • [19] Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., Xie, L., Guo, Z., Yang, Y., Yu, L., Chen, T., Hu, G., Shi, S., Chu, X.: Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes (2018)
  • [20] Knyazev, B., Drozdzal, M., Taylor, G.W., Romero Soriano, A.: Parameter prediction for unseen deep architectures. Advances in Neural Information Processing Systems 34 (2021)
  • [21] Li, F., Song, D., Chen, H., Xiong, J., Li, X., Zhong, H., Tang, G., Fan, S., Lam, D.S., Pan, W., et al.: Development and clinical deployment of a smartphone-based visual field deep learning system for glaucoma detection. NPJ digital medicine 3(1),  1–8 (2020)
  • [22] Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
  • [23] Pan, I., Cadrin-Chênevert, A., Cheng, P.M.: Tackling the radiological society of north america pneumonia detection challenge. American Journal of Roentgenology 213(3), 568–574 (2019)
  • [24] Peng, L., Liang, H., Li, T., Sun, J.: Rethink transfer learning in medical image classification. arXiv: Image and Video Processing (2021)
  • [25] Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P.: Designing network design spaces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10428–10436 (2020)
  • [26] Raghu, M., Zhang, C., Kleinberg, J., Bengio, S.: Transfusion: Understanding transfer learning for medical imaging. In: Neural Information Processing Systems (2019)
  • [27] Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C., Shpanskaya, K., Lungren, M.P., Ng, A.Y.: Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning (2017)
  • [28] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [29] Staal, J., Abràmoff, M.D., Niemeijer, M., Viergever, M.A., Van Ginneken, B.: Ridge-based vessel segmentation in color images of the retina. IEEE transactions on medical imaging 23(4), 501–509 (2004)
  • [30] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for deep learning in nlp. In: Meeting of the Association for Computational Linguistics (2019)
  • [31] Venugopalan, J., Tong, L., Hassanzadeh, H.R., Wang, M.D.: Multimodal deep learning models for early detection of alzheimer’s disease stage. Scientific Reports 11, 3254–3254 (2021)
  • [32] Wang, D., Wang, C., Masters, L., Barnett, M.H.: Masked multi-task network for case-level intracranial hemorrhage classification in brain ct volumes. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2020)
  • [33] Yang, Y., Fangxin, S., Binghong, W., Dalu, Y., Lei, W., Xu, Y., Zhang, W., Zhang, T.: Robust collaborative learning of patch-level and image-level annotations for diabetic retinopathy grading from fundus image. IEEE Transactions on Cybernetics pp. 1–11 (2021)
  • [34] Zhang, H., Gu, Y., Qin, Y., Yao, F., Yang, G.Z.: Learning with sure data for nodule-level lung cancer prediction. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2020)
  • [35] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2921–2929 (2016)