In transfer learning, the source problem may be seemingly unrelated to the target problem that is being solved. For example, Imagenet (Russakovsky et al., 2015), a large-scale dataset for object recognition, has been successfully used as source data for many medical imaging target tasks, with Schlegl et al. (2014), Bar et al. (2015) and Ciompi et al. (2015) among the earliest examples. Using other medical datasets as source data is less frequent, possibly because pretrained models are not as conveniently available as models trained on Imagenet, which are included in various toolboxes. It is therefore unclear whether pretraining on Imagenet is indeed the best strategy for transfer learning in medical imaging.
In this paper we review a number of papers which have used multiple source and/or target datasets, where the target datasets are from the medical imaging domain. Our goal is to get insights into what type of considerations should be made when choosing a source dataset for transfer learning. We first review the papers that compare different source data (Section 2) and provide a summary of publicly available source datasets (Section 2.1). We then discuss several gaps in current literature and opportunities for future research in Section 3.
2 Comparisons of source datasets
Schlegl et al. (2014) address five-class classification of abnormalities in 2D slices of chest CT images. They pretrain an unsupervised convolutional restricted Boltzmann machine on different source datasets with 20K patches, and fine-tune an entire CNN with varying sizes of lung patches. The target data is from 380 chest CT scans of the LTRC dataset (Bartholmai et al., 2006). The source data includes chest CT scans from LTRC, chest CT scans from a private dataset, brain CT scans from a private dataset, and natural images from the STL-10 dataset (Coates et al., 2011), a subset of ImageNet. Pretraining on natural images performed comparably to, or even slightly better than, pretraining on lung images only. Brain images were less effective, possibly due to large homogeneous areas present in the scans, which are not present in more texture-rich lung scans.
Tajbakhsh et al. (2016) address four different applications: polyp detection in colonoscopy, image quality assessment in colonoscopy, pulmonary embolism detection in CT, and intima-media boundary segmentation in ultrasonography. They investigate full training and fine-tuning in a layerwise manner with Alexnet pretrained on ImageNet. Overall they observe that fine-tuning only the last layers performs worse than full training, but that fine-tuning more layers is comparable to, or outperforms, full training. Fine-tuning more layers was especially important for polyp detection and intima-media boundary segmentation, which the authors hypothesize are less similar to ImageNet than the other applications they examined.
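The idea of layerwise fine-tuning can be illustrated with a minimal, framework-free sketch. It only decides which layers of a pretrained network would be updated; the layer names and the helper `trainable_layers` are illustrative assumptions, not taken from Tajbakhsh et al. (2016).

```python
from typing import Dict, List

# Illustrative layer names for an AlexNet-like network, ordered from input to
# output. These names are an assumption made for this sketch.
ALEXNET_LIKE_LAYERS: List[str] = [
    "conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8",
]


def trainable_layers(layers: List[str], n_finetuned: int) -> Dict[str, bool]:
    """Mark the last `n_finetuned` layers as trainable and freeze the rest.

    With n_finetuned == 1 only the final layer is updated. With
    n_finetuned == len(layers) every layer is updated, but training still
    starts from the pretrained weights, which is what distinguishes it from
    full training with random initialization.
    """
    if not 1 <= n_finetuned <= len(layers):
        raise ValueError("n_finetuned must be between 1 and the number of layers")
    first_trainable = len(layers) - n_finetuned
    return {name: index >= first_trainable for index, name in enumerate(layers)}


# Fine-tune only the last layer, or all layers.
last_layer_only = trainable_layers(ALEXNET_LIKE_LAYERS, 1)
all_layers = trainable_layers(ALEXNET_LIKE_LAYERS, len(ALEXNET_LIKE_LAYERS))
```

In a deep learning framework, the resulting mapping would typically be used to switch gradient updates on or off per layer before training on the target data.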
Shin et al. (2016) address two tasks: thoraco-abdominal lymph node detection and interstitial lung disease (ILD) classification. CIFAR-10 (Krizhevsky and Hinton, 2009) and Imagenet are used as source data. They compare training from scratch, off-the-shelf and fine-tuning strategies for different networks: Cifarnet (trained on CIFAR-10), Alexnet (trained on Imagenet) and GoogLeNet. Cifarnet is used only with the off-the-shelf strategy, Alexnet with all three, and GoogLeNet only with from-scratch and fine-tuning. For lymph node detection, the off-the-shelf strategy gives the worst results, but Cifarnet outperforms Alexnet. Full training and fine-tuning lead to the best results, with fine-tuning being most beneficial for GoogLeNet. For ILD classification, Alexnet achieves similar performance with all three strategies, and for GoogLeNet fine-tuning is the most beneficial.
Zhang et al. (2017) address detection and classification of colorectal polyps in endoscopy images. They pretrain an eight-layer CNN and use the lower layers to extract features from the target data, which are then classified with an SVM. They use Imagenet and Places (Zhou et al., 2017) as source datasets. As target datasets, they use a private endoscopy dataset with 2K images in three classes, and a public endoscopy dataset (Mesejo et al., 2016) with videos from which they extract 332 images in three classes. They hypothesize that Places has higher similarity between classes than Imagenet, which would help distinguish small differences in polyps. Pretraining on Places indeed leads to higher recognition rates, also when other parameters of the classifier are varied.
Cha et al. (2017) predict the response to cancer treatment in the bladder of 82 patients using a five-layer CNN. They compare a network trained without transfer learning to networks pretrained on two source datasets: 60K natural images from CIFAR-10, and 160K bladder ROIs from 81 patients from a previous study. They find no statistically significant differences between the AUC values obtained with these strategies in two-fold cross-validation.
Christodoulidis et al. (2017) address classification of interstitial lung disease in patches of CT images. They use six public texture datasets as the source data, training a seven-layer network on each dataset and combining the networks in an ensemble. Individually, the source datasets result in networks with comparable performance, but the performance varies considerably depending on the number of layers transferred. The ensemble outperforms the individual networks. The ensemble also outperforms a network trained on the union of the datasets.
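How the networks are combined is not specified here, so the following is only a sketch of one common option, averaging the class probabilities predicted by each network. The function names and the averaging rule are assumptions for illustration and are not necessarily the method of Christodoulidis et al. (2017).

```python
from typing import List, Sequence


def average_probabilities(per_network: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors predicted by several networks for one patch."""
    if not per_network:
        raise ValueError("need the output of at least one network")
    n_classes = len(per_network[0])
    if any(len(probs) != n_classes for probs in per_network):
        raise ValueError("all networks must predict the same number of classes")
    return [
        sum(probs[c] for probs in per_network) / len(per_network)
        for c in range(n_classes)
    ]


def predict_class(per_network: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged probability."""
    averaged = average_probabilities(per_network)
    return max(range(len(averaged)), key=averaged.__getitem__)
```

Each inner sequence would hold the softmax output of one network, pretrained on one source dataset, for the same target patch.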
Menegola et al. (2017) address melanoma screening, with Imagenet and KaggleDR (Graham, 2015) as source data. They compare off-the-shelf, full training and fine-tuning strategies for a VGG network. They also investigate “double transfer”: fine-tuning the pretrained Imagenet model on KaggleDR and only then on the target task. Fine-tuning outperforms off-the-shelf features when transferring from both sources. When transferring from Imagenet, off-the-shelf features outperform full training, but when transferring from KaggleDR, off-the-shelf features perform comparably with full training. Double transfer performs worse than transfer from Imagenet alone. This is in contrast to the hypothesis of the authors that KaggleDR would lead to the best results because of the visual similarity of the data.
Ribeiro et al. (2017) investigate pretraining and fine-tuning with nine different source datasets (natural images, texture images and endoscopy images) for classification of polyps in endoscopy images. Unlike most other papers, they extract datasets with the same number of classes and images from the available types of data for the pretraining. They find that texture datasets perform best as source data, but if the size of the source dataset is small, it is better to select a larger unrelated source dataset.
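Controlling for the size of the source data can be sketched as drawing a fixed number of classes and a fixed number of images per class from each source dataset. The function below is a hypothetical illustration under that assumption; the name `subsample_source`, its parameters and the random selection rule are not taken from Ribeiro et al. (2017).

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, str]  # (image identifier, class label)


def subsample_source(
    samples: Sequence[Sample], n_classes: int, n_per_class: int, seed: int = 0
) -> List[Sample]:
    """Draw `n_classes` classes with `n_per_class` images each from a source dataset."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_class: Dict[str, List[str]] = defaultdict(list)
    for image_id, label in samples:
        by_class[label].append(image_id)
    # Only classes with enough images can contribute a full quota.
    eligible = sorted(label for label, ids in by_class.items() if len(ids) >= n_per_class)
    if len(eligible) < n_classes:
        raise ValueError("not enough classes with the requested number of images")
    subset: List[Sample] = []
    for label in rng.sample(eligible, n_classes):
        for image_id in rng.sample(by_class[label], n_per_class):
            subset.append((image_id, label))
    return subset
```

Applying the same `n_classes` and `n_per_class` to every source dataset makes the type of source data the main difference between the pretraining runs.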
Shi et al. (2018) address prediction of occult invasive disease in ductal carcinoma in situ (DCIS) in mammography images of 140 patients. They use three public datasets as the source data: Imagenet, the texture dataset DTD (Cimpoi et al., 2014) and the mammography dataset INbreast (Moreira et al., 2012). They pretrain a 16-layer VGG network, extract off-the-shelf features from the target data using different network layers, and train a logistic regression classifier. They hypothesize that INbreast is most similar to the target data and will lead to the best results (and conversely, that the least similar Imagenet will lead to the worst results), and report that the average AUCs are consistent with this hypothesis.
Du et al. (2018) address classification of 15K epithelium and stroma ROIs in 158 digital pathology images. Imagenet and Places are used as the source data. They extract off-the-shelf features from different layers of several architectures, of which only AlexNet is trained on both sources. Comparing the AUCs of AlexNet trained on Imagenet and on Places, the layer used to extract the features (lower layers are better) has more influence than the data used for pretraining.
Mormont et al. (2018) focus on tissue classification. They argue that experiments are often carried out on a single dataset, and therefore use eight tissue classification datasets, with 1K to 30K images and 2 to 10 classes, as target data. They compare seven architectures which are all trained on Imagenet. They extract features off the shelf or after fine-tuning, and train a supervised classifier. They show that fine-tuning usually outperforms the other methods for any network, especially for multi-class datasets. They also find that the last layer is never the best to extract features from, possibly because the features are too specific to natural images.
Lei et al. (2018) address HEp-2 cell classification in the ICPR 2016 challenge as the target task (Lovell et al., 2016). Among other models, they compare a Resnet pretrained on Imagenet to a Resnet pretrained on data from the earlier edition of the challenge, ICPR 2012 (Foggia et al., 2013). They hypothesize that pretraining on ICPR 2012 will lead to similar feature representations both in the lower and higher layers, and show that the network pretrained on ICPR 2012 data outperforms the Imagenet network.
Wong et al. (2018) focus on two tasks: three-class classification of brain tumors in 3D MR images and nine-class classification in 2D cardiac CTA images. They argue that pretrained Imagenet models are not suitable for medical target tasks because of unnecessary resizing of images, an excessively large number of classes, and the absence of 3D information. They use a modified U-Net which is first trained on a segmentation task on the same data, using either manual segmentations or segmentations generated with a simple thresholding method. In tumor classification, where Imagenet is not tested due to the 3D nature of the images, pretraining with either manual or thresholded segmentations outperforms training a network from scratch. In cardiac image classification, pretraining with manual segmentations gives the best results. Pretraining on Imagenet outperforms pretraining on thresholded segmentations. Pretraining on Imagenet also outperforms training from scratch, but only for small training set sizes.
2.1 Public source datasets
A list of publicly available source datasets used in papers comparing multiple sources, but focusing on medical target tasks, is presented in Table 1. Imagenet is a popular choice, although some papers use other object recognition datasets such as CIFAR-10. Several papers use texture datasets, of which a variety is available. Only a few medical source datasets are listed, often because a private medical dataset is used.
Table 1: Publicly available source datasets used in papers comparing multiple sources for medical target tasks.

| Dataset | Task | Images | Classes | Used by |
|---|---|---|---|---|
| Imagenet (Russakovsky et al., 2015) | object recognition | 1.2M | 1K | Tajbakhsh et al. (2016), Shin et al. (2016), Menegola et al. (2017), Zhang et al. (2017), Du et al. (2018), Mormont et al. (2018), Lei et al. (2018), Wong et al. (2018) |
| STL-10 (Coates et al., 2011) | object recognition | 100K | 10 | Schlegl et al. (2014) |
| Places (Zhou et al., 2017) | scene recognition | 2.5M | 205 | Zhang et al. (2017), Du et al. (2018) |
| DTD (Cimpoi et al., 2014) | texture classification | 5.6K | 47 | Ribeiro et al. (2017), Shi et al. (2018), Christodoulidis et al. (2017) |
| ALOT (Burghouts and Geusebroek, 2009) | texture classification | 28K | 250 | Ribeiro et al. (2017), Christodoulidis et al. (2017) |
| KTH-TIPS (Dana et al., 1999) | texture classification | 810 | 10 | Ribeiro et al. (2017), Christodoulidis et al. (2017) |
| CIFAR-10 (Krizhevsky and Hinton, 2009) | object classification | 60K | 10 | Shin et al. (2016), Cha et al. (2017) |
| FMD (Sharan et al., 2009) | texture classification | 1K | 10 | Christodoulidis et al. (2017) |
| KTB (Kylberg, 2011) | texture classification | 4.5K | 27 | Christodoulidis et al. (2017) |
| UIUC (Lazebnik et al., 2005) | texture classification | 1K | 25 | Christodoulidis et al. (2017) |
| CALTECH-101 (Fei-Fei et al., 2006) | object recognition | 9K | 101 | Ribeiro et al. (2017) |
| COREL-1000 | scene recognition | 1K | 10 | Ribeiro et al. (2017) |
| Kaggle-DR (Graham, 2015) | retinal image classification | 35K | 2 | Menegola et al. (2017) |
| INbreast (Moreira et al., 2012) | breast lesion classification | 410 | 2 | Shi et al. (2018) |
| ICPR 2012 (Foggia et al., 2013) | HEp-2 cell classification | 1.5K | 6 | Lei et al. (2018) |
We have summarized several papers which use medical or non-medical data as source data and apply the classifier to medical target data. A limitation is that such papers are difficult to discover: other than “transfer learning”, which returns over 2K results when combined with “medical imaging” on Google Scholar, we have not been able to find keywords that identify papers in which different source datasets have been used. We encourage readers to notify us of any papers that were not included, and also to investigate this phenomenon.
The results of the comparisons point in different directions. Schlegl et al. (2014) and Menegola et al. (2017) find natural images more effective than medical images. Ribeiro et al. (2017) have most success with texture images, compared to natural or medical images. Shi et al. (2018), Lei et al. (2018) and Wong et al. (2018) get better results using medical images as source data, and Cha et al. (2017) find no differences between using different sources.
It is difficult to compare these results directly because of differences in how transfer learning is implemented. Examples of variation include the subset of the source data that is used, the architecture of the network, and how the transfer was implemented, both in terms of strategy (off-the-shelf or fine-tuning) and which layers were used for the transfer.
Another issue is that the target datasets in medical imaging can be very small, and it is not clear whether the results would generalize to another, similar dataset. Methods are sometimes compared by looking only at a single run of each method, or at an average over multiple runs, without considering the variability in performance. A recent paper comparing medical image challenges (Maier-Hein et al., 2018) shows that in such conditions, rankings of algorithms can easily change, for example if a slightly different metric is used. Most papers we surveyed performed no statistical significance tests; had such tests been performed, perhaps the conclusions would be different.
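As an illustration of taking variability into account, two source datasets can be compared on paired results, for example the AUC per cross-validation fold, with a paired sign-flip permutation test. This is a minimal sketch of one possible test, not a procedure used in the surveyed papers; the function name is an assumption, and the exact enumeration below is only practical for a small number of paired runs.

```python
from itertools import product
from typing import Sequence


def paired_permutation_test(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Exact two-sided paired sign-flip permutation test on the summed difference.

    Under the null hypothesis that both strategies perform equally, the sign of
    each paired difference is arbitrary, so all 2**n sign assignments are
    enumerated. The cost is exponential in the number of pairs n.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        permuted = abs(sum(sign * diff for sign, diff in zip(signs, diffs)))
        # Small tolerance guards against floating-point rounding in the sums.
        if permuted >= observed - 1e-12:
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total
```

With n paired runs, the smallest p-value this test can return is 2/2^n. Even if one source wins every fold, five folds give at best 0.0625 and two folds at best 0.5, which shows how little a comparison based on a few runs can establish.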
There are opportunities in doing more systematic comparisons. One direction is to use more of the available datasets, both from the non-medical and medical domains. It would be informative to vary the number of images and number of classes in the data, similar to Ribeiro et al. (2017). Also of interest would be comparing different tasks, such as segmentation and classification, involving the same images, similar to Wong et al. (2018).
The number of public medical source datasets is rather low. A strategy that could help to counteract this, but seems underexplored, is unsupervised pretraining. This would allow the use of larger medical datasets that are unlabeled or only weakly labeled. Another way to increase the number of source datasets would be to share pretrained models, which would also allow transfer learning from private datasets, without sharing the data itself.
Similarity of datasets is often used to hypothesize about which source data will be best, but definitions of similarity differ. For example, Menegola et al. (2017) discuss similarity in terms of visual similarities of the images, while Lei et al. (2018) discuss similarity in terms of feature representations. In computer vision, other definitions may be used: for example, Azizpour et al. (2015) investigate transfer from Imagenet and Places to 15 other datasets, and define similarity in terms of the number and variety of the classes. Given a definition of similarity, it remains a question which datasets would be best to use for pretraining. Arguably, the dataset most similar to the target dataset is the target dataset itself, which might not add any additional information.
Instead of considering only the similarity of the source data, perhaps the diversity of the source data is also an important factor. Instead of selecting one dataset as the source, it might be a good strategy to use an ensemble, similar to Christodoulidis et al. (2017). In this case, the answer to the question posed by the title is simply “both”.
Conflict of Interest
The authors declare no conflict of interest.
References
- Azizpour et al.  H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson. From generic to specific deep representations for visual recognition. In Computer Vision and Pattern Recognition Workshops (CVPR-W), pages 36–45, 2015.
- Bar et al.  Y. Bar, I. Diamant, L. Wolf, S. Lieberman, E. Konen, and H. Greenspan. Chest pathology detection using deep learning with non-medical training. In International Symposium on Biomedical Imaging (ISBI), pages 294–297. IEEE, 2015.
- Bartholmai et al.  B. Bartholmai, R. Karwoski, V. Zavaletta, R. Robb, and D. R. I. Holmes. The lung tissue research consortium: an extensive open database containing histological, clinical, and radiological data to study chronic lung disease. Insight Journal: 2006 MICCAI Open Science Workshop, 2006.
- Burghouts and Geusebroek  G. J. Burghouts and J.-M. Geusebroek. Material-specific adaptation of color invariant features. Pattern Recognition Letters, 30(3):306–313, 2009.
- Cha et al.  K. H. Cha, L. M. Hadjiiski, H.-P. Chan, R. K. Samala, R. H. Cohan, E. M. Caoili, C. Paramagul, A. Alva, and A. Z. Weizer. Bladder cancer treatment response assessment using deep learning in CT with transfer learning. In SPIE Medical Imaging, pages 1013404–1013404. International Society for Optics and Photonics, 2017.
- Cheplygina et al.  V. Cheplygina, M. de Bruijne, and J. P. Pluim. Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. arXiv preprint arXiv:1804.06353, 2018.
- Christodoulidis et al.  S. Christodoulidis, M. Anthimopoulos, L. Ebner, A. Christe, and S. Mougiakakou. Multisource transfer learning with convolutional neural networks for lung pattern analysis. IEEE Journal of Biomedical and Health Informatics, 21(1):76–84, 2017.
- Cimpoi et al.  M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
- Ciompi et al.  F. Ciompi, B. de Hoop, S. J. van Riel, K. Chung, E. T. Scholten, M. Oudkerk, P. A. de Jong, M. Prokop, and B. van Ginneken. Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2D views and a convolutional neural network out-of-the-box. Medical image analysis, 26(1):195–202, 2015.
- Coates et al.  A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223, 2011.
- Dana et al.  K. J. Dana, B. Van Ginneken, S. K. Nayar, and J. J. Koenderink. Reflectance and texture of real-world surfaces. ACM Transactions On Graphics (TOG), 18(1):1–34, 1999.
- Du et al.  Y. Du, R. Zhang, A. Zargari, T. C. Thai, C. C. Gunderson, K. M. Moxley, H. Liu, B. Zheng, and Y. Qiu. A performance comparison of low-and high-level features learned by deep convolutional neural networks in epithelium and stroma classification. In Medical Imaging 2018: Digital Pathology, volume 10581, page 1058116. International Society for Optics and Photonics, 2018.
- Fei-Fei et al.  L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
- Foggia et al.  P. Foggia, G. Percannella, P. Soda, and M. Vento. Benchmarking HEp-2 cells classification methods. IEEE Transactions on Medical Imaging, 32(10):1878–1889, 2013.
- Graham  B. Graham. Kaggle diabetic retinopathy detection competition report. University of Warwick, 2015.
- Greenspan et al.  H. Greenspan, B. Van Ginneken, and R. M. Summers. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging, 35(5):1153–1159, 2016.
- Krizhevsky and Hinton  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Kylberg  G. Kylberg. Kylberg Texture Dataset v. 1.0. Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University, 2011.
- Lazebnik et al.  S. Lazebnik, C. Schmid, and J. Ponce. A sparse texture representation using local affine regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1265–1278, 2005.
- Lei et al.  H. Lei, T. Han, F. Zhou, Z. Yu, J. Qin, A. Elazab, and B. Lei. A deeply supervised residual network for hep-2 cell classification via cross-modal transfer learning. Pattern Recognition, 79:290–302, 2018.
- Litjens et al.  G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. Van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
- Lovell et al.  B. C. Lovell, G. Percannella, A. Saggese, M. Vento, and A. Wiliem. International contest on pattern recognition techniques for indirect immunofluorescence images analysis. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 74–76. IEEE, 2016.
- Maier-Hein et al.  L. Maier-Hein, M. Eisenmann, A. Reinke, S. Onogur, M. Stankovic, P. Scholz, T. Arbel, H. Bogunovic, A. P. Bradley, A. Carass, et al. Is the winner really the best? a critical analysis of common research practice in biomedical image analysis competitions. Nat. Commun., 2018.
- Menegola et al.  A. Menegola, M. Fornaciali, R. Pires, F. V. Bittencourt, S. Avila, and E. Valle. Knowledge transfer for melanoma screening with deep learning. In International Symposium on Biomedical Imaging (ISBI), pages 297–300. IEEE, 2017.
- Mesejo et al.  P. Mesejo, D. Pizarro, A. Abergel, O. Rouquette, S. Beorchia, L. Poincloux, and A. Bartoli. Computer-aided classification of gastrointestinal lesions in regular colonoscopy. IEEE transactions on medical imaging, 35(9):2051–2063, 2016.
- Moreira et al.  I. C. Moreira, I. Amaral, I. Domingues, A. Cardoso, M. J. Cardoso, and J. S. Cardoso. Inbreast: toward a full-field digital mammographic database. Academic radiology, 19(2):236–248, 2012.
- Mormont et al.  R. Mormont, P. Geurts, and R. Marée. Comparison of deep transfer learning strategies for digital pathology. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2262–2271, 2018.
- Ribeiro et al.  E. Ribeiro, M. Häfner, G. Wimmer, T. Tamaki, J. Tischendorf, S. Yoshida, S. Tanaka, and A. Uhl. Exploring texture transfer learning for colonic polyp classification via convolutional neural networks. In International Symposium on Biomedical Imaging (ISBI), pages 1044–1048. IEEE, 2017.
- Russakovsky et al.  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- Schlegl et al.  T. Schlegl, J. Ofner, and G. Langs. Unsupervised pre-training across image domains improves lung tissue classification. In Medical Computer Vision: Algorithms for Big Data (MICCAI MCV), pages 82–93. Springer, 2014.
- Sharan et al.  L. Sharan, R. Rosenholtz, and E. Adelson. Material perception: What can you see in a brief glance? Journal of Vision, 9(8):784–784, 2009.
- Shi et al.  B. Shi, R. Hou, M. A. Mazurowski, L. J. Grimm, Y. Ren, J. R. Marks, L. M. King, C. C. Maley, E. S. Hwang, and J. Y. Lo. Learning better deep features for the prediction of occult invasive disease in ductal carcinoma in situ through transfer learning. In Medical Imaging 2018: Computer-Aided Diagnosis, volume 10575, page 105752R. International Society for Optics and Photonics, 2018.
- Shin et al.  H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5):1285–1298, 2016.
- Tajbakhsh et al.  N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang. Convolutional neural networks for medical image analysis: full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299–1312, 2016.
- Wong et al.  K. C. Wong, T. Syeda-Mahmood, and M. Moradi. Building medical image classifiers with very limited data using segmentation networks. Medical image analysis, 49:105–116, 2018.
- Zhang et al.  R. Zhang, Y. Zheng, T. W. C. Mak, R. Yu, S. H. Wong, J. Y. Lau, and C. C. Poon. Automatic detection and classification of colorectal polyps by transferring low-level CNN features from nonmedical domain. IEEE Journal of Biomedical and Health Informatics, 21(1):41–47, 2017.
- Zhou et al.  B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.