In medical image analysis, tissue characterization and classification are among the most important components of a Computer Aided Diagnosis (CAD) system, and an accurate, robust tissue classifier is one of the ultimate goals of many radiology applications. In recent years, deep learning-based models have been reported to deliver impressive improvements over traditional machine learning and statistical methods on various computer vision problems. In particular, Convolutional Neural Networks (CNNs) have shown superior capabilities in extracting the low- to high-level image features needed for classification. These successes have motivated the increasing application of deep learning to medical image analysis [1, 2, 3, 4]. However, it has also been recognized that deep learning models (and indeed most learning-based methods) are far more difficult to apply successfully to medical images than to natural images. One of the main challenges is the limited number of labeled samples available for training, as annotating and labeling medical images is highly time-consuming and labor-intensive. In addition, medical images are highly heterogeneous at both the individual and the population level. Together, these two limitations severely degrade the robustness of the models and the reproducibility of the learning results. Finally, the privacy-sensitive nature of medical images makes cross-institutional data sharing difficult, which further limits the availability of case material.
On the other hand, the number and size of publicly available medical image collections have been increasing rapidly over the past few years. Most of these images are not annotated, however, because the databases are usually provided for general-purpose use and annotation is extremely costly. Consequently, there is a huge discrepancy between the large number of datasets to be analyzed and the very limited number of annotations available as training data. In response to this lack of training samples, we propose the self-paced Convolutional Neural Network (spCNN) framework, which identifies unlabeled image patches that can serve as "virtual" training samples. These virtual samples are then mixed with the original manually-labeled samples to retrain a new network. Given sufficient unlabeled images, introducing virtual samples lets us obtain practically any number of training samples by trading (much more expensive) human labor for additional computation. As the database used in this work contains imaging data from 10,000 subjects, the potential number of virtual samples is large enough to support highly complicated learning.
We tested both the raw CNN trained only on manually-labeled samples and the CNN retrained on the mixed training samples by applying them to a separate benchmark testing dataset labeled by a different group of experts. The classification results show that, with the help of the virtual samples, the spCNN framework improves accuracy by over 10%. The improvement is consistent across different network architectures and parameter settings. These results suggest that the proposed spCNN framework could provide a new perspective on improving learning-based methods in medical image analysis, as well as in other applications with limited training samples. That is, we can use a small training set to train an initial network, then adaptively select virtual samples and add them back into the training set to improve the performance in a "snow-balling" fashion.
2 Related Works
As the name implies, our spCNN framework is motivated by self-paced learning [6, 7, 8] and curriculum learning methods, which adaptively choose a subset of the labeled instances for training. Instances considered easy, such as those with a large margin or high confidence, are selected first. Difficult instances are learned in later stages or even dropped eventually. Most self-paced learning methods focus on identifying the optimal order of learning while learning the model simultaneously. In our work, CNNs are trained on an initial set of data, followed by a bootstrapping scheme that evaluates the unlabeled instances. A new CNN is then trained on the selected instances. Another scheme related to our framework is learning from positive and unlabeled examples (PU learning), where only some of the positive instances are labeled. PU learning therefore needs to identify a set of reliable negative instances from the unlabeled data and use them for further training, which is similar to our problem but focuses more on estimating the labels without using a learning-based method as we do.
It is also worth noting several semi-supervised learning techniques, including ladder networks, which incorporate an auto-encoder into the supervised model through skip connections, and stacked what-where auto-encoders, which use both convolutional and deconvolutional networks to integrate supervised and unsupervised learning. While our spCNN model does not incorporate an unsupervised component, it has the advantage over these semi-supervised approaches that it yields an explicit evaluation of the new data and selects samples based on simple rules, which could be highly useful for clinical practice and decision-making.
From an application perspective, several studies have applied CNN models to medical images of lung diseases [15, 16, 17, 18] and obtained encouraging results. As discussed previously, the lack of training data is especially challenging in medical image analysis. Current solutions include traditional data augmentation based on affine transformations to expand the labeled dataset [18, 19, 20]. Researchers have also tried to exploit the vast number of labeled natural images, either to help train the network in a transfer learning fashion [4, 21] or to initialize the network parameters (i.e. weights) by pre-training on non-medical images and then fine-tuning with the target samples [22, 23]. Our work focuses on unlabeled medical data and is therefore conceptually different from these works, which focus on labeled non-medical data. The two approaches are not mutually exclusive, however, and could be combined into an integrated and more effective framework.
3 Materials and Methods
3.1 Method overview
In this work, we propose the self-paced Convolutional Neural Network (spCNN) framework to improve the accuracy and robustness of the learned model beyond what the initially limited training data can provide. The major contribution and novelty of the framework is that it leverages the large amount of new, unlabeled data as potential training samples for retraining. These new training samples are called "virtual samples", in contrast to the original manually-labeled samples. Specifically, the class labels of the samples in the new dataset, and the distribution of their prediction probabilities, are estimated by bootstrapping CNNs. Image patches whose prediction probabilities differ significantly across labels are then pooled with the original training data to retrain a new CNN. A conceptual diagram of the framework design is shown in Fig. 1.
3.2 Architecture of the CNN applied in the framework
The CNN model used in the proposed framework is implemented in Caffe, and its architecture is shown in Fig. 2. Image patches of size $36 \times 36$ are passed through 4 convolutional layers, each with a kernel size of 3. Following the previously introduced principle that the number of kernels in each layer should be proportional to the area of its receptive field (in this work, from $3 \times 3$ in the first layer to $6 \times 6$ in the fourth layer), we set the number of kernels in the four layers to 45, 80, 125 and 180, respectively. Each convolutional layer is followed by a max pooling layer. The extracted features are then fed into three fully connected layers with 1080, 360 and 3 neurons. The first two sizes are multiples (6 and 2 times) of the number of features (180), based on previously reported empirical rules. The first two fully connected layers are equipped with dropout layers with a probability of 50%. Both the convolutional layers and the fully connected layers use the LeakyReLU activation function.
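The sizing rules above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the proportionality constant of 5 kernels per unit of receptive-field area is inferred from the reported counts (45 = 5 × 3², 180 = 5 × 6²) and is not stated in the text, and the function names are hypothetical.

```python
# Sizing rule from the text: kernels per conv layer scale with receptive-field area.
# KERNELS_PER_UNIT_AREA = 5 is inferred from the reported counts (45 / 3**2),
# it is an assumption and not a value given by the authors.
KERNELS_PER_UNIT_AREA = 5
RECEPTIVE_FIELD_SIDES = [3, 4, 5, 6]  # 3x3 in the first layer up to 6x6 in the fourth


def conv_kernel_counts(sides, per_unit_area=KERNELS_PER_UNIT_AREA):
    """Number of kernels per convolutional layer, proportional to receptive-field area."""
    return [per_unit_area * side * side for side in sides]


def fc_layer_sizes(num_features, multipliers=(6, 2), num_classes=3):
    """Fully connected widths: multiples of the conv feature count, then one unit per class."""
    return [m * num_features for m in multipliers] + [num_classes]


conv_counts = conv_kernel_counts(RECEPTIVE_FIELD_SIDES)
fc_sizes = fc_layer_sizes(conv_counts[-1])
print(conv_counts)  # [45, 80, 125, 180]
print(fc_sizes)     # [1080, 360, 3]
```

Both lists reproduce the layer sizes quoted in the architecture description.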
3.3 Bootstrapping module for virtual sample selection
The key challenge of the spCNN framework is how to correctly select image patches from the new dataset as virtual samples: the initially trained CNN may assign wrong labels to some patches, which is clearly undesirable. Specifically, it has been observed both in our experiments and in previous work that model errors can accumulate quickly during retraining and eventually degrade performance. On the other hand, we also want to push the retraining dataset beyond the original manual annotations, which often involves image patches for which the network's predictions are more uncertain. In other words, the framework needs to strike a balance between the original manually-labeled samples and the later-identified virtual samples throughout the learning process.
Thus, in this work we apply a 10-fold bootstrapping scheme to estimate the empirical distribution of the prediction probabilities of the new samples, and automatically select the most suitable samples by statistical testing. Specifically, we perform random subsampling 10 times to obtain 10 sets, each containing 90% of the original training data, i.e. 360 patches per class for a total of 1080 patches. These 10 sets are then used to train the bootstrapping networks. Each patch in the new dataset is then classified by each of the 10 networks, resulting in 10 sets of class labels and prediction probabilities per patch, as illustrated in Fig. 3.
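As a rough illustration of the subsampling step, the sketch below draws 10 class-balanced subsets of 90% of the training patches. It is a minimal sketch under assumptions: patches are represented only by hypothetical integer indices, sampling is done without replacement within each subset, and the function name and seed are not from the original work.

```python
import random

NUM_CLASSES = 3
TRAIN_PER_CLASS = 400      # manually labeled training patches per class
NUM_BOOTSTRAPS = 10
SUBSAMPLE_FRACTION = 0.9


def draw_bootstrap_sets(indices_by_class, num_sets, fraction, seed=0):
    """Draw `num_sets` class-balanced subsamples (no repeats within a subsample)."""
    rng = random.Random(seed)
    sets = []
    for _ in range(num_sets):
        subset = {}
        for label, indices in indices_by_class.items():
            k = round(fraction * len(indices))
            subset[label] = rng.sample(indices, k)
        sets.append(subset)
    return sets


# Hypothetical patch indices: class c owns indices [c * 400, (c + 1) * 400).
indices_by_class = {
    c: list(range(c * TRAIN_PER_CLASS, (c + 1) * TRAIN_PER_CLASS))
    for c in range(NUM_CLASSES)
}
bootstrap_sets = draw_bootstrap_sets(indices_by_class, NUM_BOOTSTRAPS, SUBSAMPLE_FRACTION)
per_class = len(bootstrap_sets[0][0])
total = sum(len(v) for v in bootstrap_sets[0].values())
print(per_class, total)  # 360 1080
```

Each of the 10 subsets would then be used to train one bootstrapping network.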
While certain patches in the new dataset (e.g. the first two patches in Fig. 3) are easily classified with high prediction probabilities and low variability by all 10 bootstrapping networks, there are cases where the classification uncertainty is much higher (e.g. the third patch). Visual inspection indicates, however, that the third image patch lies on the boundary between normal lung tissue and the region outside the lung, which makes it a good candidate virtual sample for the retraining process because: 1) the bootstrapping networks unanimously assign the correct label to it, and 2) such patches are much rarer in the manual annotations and underrepresented in the training set. As reported previously, the sheer number of training samples does not help much in training the network, especially if they come from a homogeneous population; it is the samples that have not been encountered before that actually lead to better performance. Based on these observations, we also infer that virtual samples can help expand the solution space that the training process can explore, and thus improve it.
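To make the distinction between "easy" and "uncertain but unanimous" patches concrete, the following sketch summarizes the 10 × 3 probability matrix of a single patch. The probabilities are made-up values for illustration only, the class index 2 is a hypothetical stand-in for "other lung tissue", and `summarize_patch` is not a function from the original implementation.

```python
from statistics import mean, stdev


def summarize_patch(prob_matrix):
    """Summarize a 10 x 3 matrix: one row of class probabilities per bootstrapping network."""
    num_classes = len(prob_matrix[0])
    # Label voted by each network (argmax of its row).
    votes = [max(range(num_classes), key=lambda c: row[c]) for row in prob_matrix]
    columns = [[row[c] for row in prob_matrix] for c in range(num_classes)]
    means = [mean(col) for col in columns]
    top = max(range(num_classes), key=lambda c: means[c])
    return {
        "label": top,
        "unanimous": all(v == top for v in votes),
        "mean": means,
        "std": [stdev(col) for col in columns],
    }


# Made-up examples: an easy patch and a boundary-like patch with more spread.
easy_patch = [[0.98, 0.01, 0.01]] * 10
boundary_patch = [[0.20, 0.15, 0.65], [0.10, 0.30, 0.60]] * 5

easy = summarize_patch(easy_patch)
boundary = summarize_patch(boundary_patch)
print(easy["label"], easy["unanimous"])
print(boundary["label"], boundary["unanimous"], boundary["std"][2])
```

In this toy case the boundary-like patch has a lower mean probability and higher variability than the easy one, yet all ten networks still vote for the same label, which is the situation described for the third patch.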
Thus, in this work, for each $10 \times 3$ prediction probability matrix that the bootstrapping CNNs estimate for a given image patch, we perform two two-sample t-tests (as there are 3 labels in total) to determine whether the label with the highest average prediction probability is significantly higher than both of the other two labels. The two p-values produced by the t-tests for each patch are then aggregated and each analyzed with false discovery rate (FDR) control. We employ FDR control to limit the chance that the large number of tests performed leads to increased false positives (i.e. unfitting patches). Patches with significantly different prediction probabilities across the 3 labels are selected as virtual samples for retraining the new CNN.
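The selection step can be sketched as follows, taking the per-patch t-test p-values as given inputs. This is an illustration under assumptions rather than the authors' code: the text only says "FDR control", so the Benjamini-Hochberg procedure used here is an assumed choice, the two p-value lists are treated as two separate test families, and all p-values below are made up.

```python
def benjamini_hochberg(p_values, q):
    """Return booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank whose p-value falls under the BH line q * rank / m.
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def select_virtual_samples(p_first, p_second, q):
    """Keep patches whose top label beats both other labels after FDR control."""
    keep_first = benjamini_hochberg(p_first, q)
    keep_second = benjamini_hochberg(p_second, q)
    return [i for i, (a, b) in enumerate(zip(keep_first, keep_second)) if a and b]


# Made-up p-values for 5 patches: top label vs. each of the other two labels.
p_first = [0.001, 0.20, 0.012, 0.04, 0.6]
p_second = [0.002, 0.01, 0.030, 0.50, 0.7]
selected = select_virtual_samples(p_first, p_second, q=0.1)
print(selected)  # [0, 2]
```

Only patches 0 and 2 pass both tests at the 0.1 level in this toy example; a stricter level would select fewer patches, which matches the trade-off discussed in the results.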
4 Experimental Results
4.1 Data acquisition and preprocessing
In this work, we use data from the NIH-sponsored COPDGene database, which aims to investigate the CT phenotypes of Chronic Obstructive Pulmonary Disease (COPD) and other lung diseases. To test and validate the proposed model, we mainly focus on pulmonary emphysema, defined as the permanent enlargement of airspaces distal to the terminal bronchioles together with the destruction of the alveolar walls. In the COPDGene database, 3-D volumetric images are acquired with 64-slice CT scanners during full inspiration and then reconstructed at sub-millimeter slice thickness with smoothing and edge-enhancing filters. From the 10,000 subjects in the database, drawn from both the normal and the COPD population, 500 image slices from 150 subjects were manually annotated by a group of experts on our team into three classes: airway (Class I), emphysema (Class II), and other lung tissue (Class III). For each of the 3 classes, 600 non-overlapping image patches of size $36 \times 36$ were extracted from the annotations, constituting the samples for training (400 patches) and verification (200 patches). An example set of image patches from the three classes is shown in Fig. 4.
In addition, 9600 patches were extracted from the new unlabeled dataset and analyzed by the bootstrapping CNNs, constituting the candidates for virtual sample selection. Finally, 161 image slices from another 59 subjects were manually annotated by a second group of experts. In total, 887 image patches (Class I: 203, Class II: 192, Class III: 255) were extracted from the annotated regions and used as the benchmark testing inputs for evaluating model performance.
4.2 Performance comparisons
By applying the bootstrapping module of the proposed spCNN framework to the 9600 image patches in the new dataset, we select virtual samples at different significance levels. We then retrain new CNN models on the mixture of the virtual samples and the original manually-labeled samples, and apply them to classify the benchmark testing dataset. The model performance and the details of the virtual samples are summarized in Table 1. The results show that, at a significance level of 0.05 or 0.1 for the FDR-controlled statistical testing, spCNN obtains an accuracy increase of as much as 10% over the raw CNN model trained solely on the original manually-labeled samples. The resulting classification accuracy of nearly 90% is on par with the results reported for a similar CNN-based lung CT image study. With a more conservative significance level (0.025), fewer samples are selected, and the accuracy is similar to that of the raw CNN, with no decrease in performance.
| Model | Number of virtual samples | Accuracy |
|---|---|---|
| Original samples only | N/A | 79.0% |
As the virtual sample selection in spCNN is determined empirically using the bootstrapping scheme, one important question is whether the framework is robust enough to guide the selection under different models and/or different datasets. Limited by the size and scope of this manuscript, we focus only on the COPDGene dataset, but test the spCNN performance using the same virtual sample selection method with different CNN architectures. Specifically, we replaced the CNN architecture introduced in Section 3.2 with the following designs: 1) reducing the number of kernels in the convolutional layers and the number of neurons in the fully connected layers by half; 2) increasing the number of kernels in the convolutional layers and the number of neurons in the fully connected layers by 50%; 3) adding an extra fully connected layer with 180 neurons before the last layer; 4) removing the first $3 \times 3$ convolutional layer. The results show that, while the classification performance of the spCNN framework varies across these 4 network architectures, a significance level of 0.1 outperforms the other configurations, as well as the original CNN method, in all cases.
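The four architecture variants can be written down as simple configuration dictionaries, as in the sketch below. Several details are assumptions not given in the text: fractional widths are floored to integers, the 3-neuron output layer is left unchanged when widths are scaled, and the dictionary keys and variant names are hypothetical.

```python
# Baseline layer sizes from the architecture description.
BASE = {"conv_kernels": [45, 80, 125, 180], "fc_units": [1080, 360], "num_classes": 3}


def scale_widths(config, factor):
    """Scale conv kernel counts and hidden FC widths; flooring to int is an assumption."""
    return {
        "conv_kernels": [int(n * factor) for n in config["conv_kernels"]],
        "fc_units": [int(n * factor) for n in config["fc_units"]],
        "num_classes": config["num_classes"],  # output layer kept at one unit per class
    }


variants = {
    "half_width": scale_widths(BASE, 0.5),
    "wider_50pct": scale_widths(BASE, 1.5),
    "extra_fc_180": {**BASE, "fc_units": BASE["fc_units"] + [180]},
    "no_first_conv": {**BASE, "conv_kernels": BASE["conv_kernels"][1:]},
}

for name, cfg in variants.items():
    print(name, cfg["conv_kernels"], cfg["fc_units"])
```

Writing the variants this way makes it explicit that only the network widths and depths change, while the virtual sample selection procedure stays the same across all four designs.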
4.3 Time cost for the virtual sample selection
As discussed previously, the self-paced learning framework essentially exchanges human labor for computational cost through the bootstrapping process. The time needed to train the bootstrapping CNNs is therefore an important factor for the proposed spCNN framework, especially for larger datasets and/or more complicated network architectures. Currently we deploy the framework on two platforms: an in-house server with two NVIDIA Tesla P100 GPUs, and the NVIDIA DGX-1 deep learning system with eight P100 GPUs interconnected by NVIDIA NVLink. The time costs of training one bootstrapping CNN in the bootstrapping module, and of performing one forward-backward propagation, are listed in Table 2 for the different hardware configurations.
| Configuration | per Session (s) | per Iteration (ms) |
|---|---|---|
| In-house, single GPU | 50.63 | 5.01 |
| In-house, 2 GPUs | 50.92 | 5.00 |
| DGX-1, single GPU | 44.46 | 4.41 |
| DGX-1, 2 GPUs | 33.80 | 3.32 |
| DGX-1, 4 GPUs | 27.84 | 2.66 |
| DGX-1, 8 GPUs | 27.98 | 2.54 |
It can be seen that with the most advanced accelerator, the NVIDIA DGX-1, we achieve a nearly 2-fold speed increase. Note that the in-house server runs slower with two GPUs than with a single GPU, because running Caffe in multi-GPU mode requires P2P DMA access between devices. When P2P access is not supported (as on our in-house server), data must be copied through the host, which severely affects performance. In contrast, the DGX-1 system is much better optimized for parallelizing the computational load across multiple GPUs, showing the importance of P2P DMA access and the NVIDIA NVLink technology. We also observe that using 8 GPUs on the DGX-1 does not yield better performance than 4 GPUs even though the iteration time is reduced, most likely because the parallelization overhead dominates the time cost with 8 GPUs. Considering that the current CNN architecture is relatively simple, we expect 8 GPUs to outperform the other configurations in more complicated cases.
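The speed-ups discussed above follow directly from the timings in Table 2. The short sketch below recomputes them relative to the in-house single-GPU configuration; the choice of baseline and the function name are assumptions made for illustration, and the iteration column is read in milliseconds as stated in the table header.

```python
# (seconds per training session, milliseconds per forward-backward iteration), from Table 2
TIMINGS = {
    "In-house, single GPU": (50.63, 5.01),
    "In-house, 2 GPUs": (50.92, 5.00),
    "DGX-1, single GPU": (44.46, 4.41),
    "DGX-1, 2 GPUs": (33.80, 3.32),
    "DGX-1, 4 GPUs": (27.84, 2.66),
    "DGX-1, 8 GPUs": (27.98, 2.54),
}
BASELINE = "In-house, single GPU"


def speedups(timings, baseline):
    """Session-level and iteration-level speed-up of each configuration over the baseline."""
    base_session, base_iter = timings[baseline]
    return {
        name: (base_session / session, base_iter / iteration)
        for name, (session, iteration) in timings.items()
    }


speedup = speedups(TIMINGS, BASELINE)
for name, (per_session, per_iteration) in speedup.items():
    print(f"{name:22s} session x{per_session:.2f}  iteration x{per_iteration:.2f}")
```

With these numbers the 4-GPU DGX-1 run is roughly 1.8 times faster per session than the baseline, and the 8-GPU run is faster per iteration but marginally slower per session, which is consistent with the overhead explanation given above.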
5 Conclusion and Discussion
In this work, we develop a self-paced scheme that identifies virtual samples from unlabeled data and uses them to retrain a new CNN, in order to overcome the lack of training samples. The FDR-controlled statistical testing for virtual sample selection, based on the bootstrapping scheme, shows that the current optimal threshold is around 0.1. A similar threshold could be used for analyzing the rest of the COPDGene dataset, as such inference is affected by neither the number of classes nor the number of samples tested. In fact, we propose that tuning this threshold is essentially an empirical characterization of the relationship between the distribution of the network outputs and the quality/confidence of the corresponding samples, which is highly dataset-dependent. Based on our experience with the self-paced learning procedure, we envision that the performance of spCNN could be further improved by selecting the virtual samples gradually over multiple rounds, similar to the majority of curriculum learning methods. In other words, after selecting the virtual samples and retraining the new CNN, we can update the bootstrapping CNNs based on the mixed data (with subsampling), and then perform the same virtual sample selection again on the remaining dataset. The multi-round application of spCNN thus forms a closed loop from the selected virtual samples back to the selection process.
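The multi-round procedure outlined above can be sketched as a simple loop. This is a schematic under assumptions, not an implementation of the proposed extension: `train_ensemble` and `select` are hypothetical callables standing in for training the bootstrapping CNNs and for the t-test/FDR selection, and the toy functions below use plain numbers in place of image patches.

```python
def self_paced_rounds(labeled, unlabeled, train_ensemble, select, num_rounds):
    """Repeatedly train bootstrapping models and move selected virtual samples into training."""
    training = list(labeled)
    pool = list(unlabeled)
    for _ in range(num_rounds):
        ensemble = train_ensemble(training)
        chosen = select(ensemble, pool)  # list of (sample, predicted_label)
        if not chosen:
            break  # nothing passed the selection; stop early
        training.extend(chosen)
        chosen_samples = {sample for sample, _ in chosen}
        pool = [s for s in pool if s not in chosen_samples]
    return training, pool


def toy_train(training):
    # Stand-in for training the bootstrapping CNNs: the "model" is just a reach radius.
    return 2.0 * max(abs(x) for x, _ in training)


def toy_select(reach, pool):
    # Stand-in for the statistical selection: accept samples within reach, label by sign.
    return [(x, int(x > 0)) for x in pool if abs(x) <= reach]


labeled = [(-1.0, 0), (1.0, 1)]
unlabeled = [1.5, -2.0, 3.5, -7.0, 20.0]
final_training, remaining = self_paced_rounds(
    labeled, unlabeled, toy_train, toy_select, num_rounds=3
)
print(len(final_training), remaining)  # 6 [20.0]
```

In the toy run, each round extends the reach of the "model" so that samples rejected earlier become selectable later, while the most distant sample is never admitted; this mirrors the intended easy-to-hard progression.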
-  M. Anthimopoulos, S. Christodoulidis, L. Ebner, A. Christe, and S. Mougiakakou. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Transactions on Medical Imaging, 35(5):1207–1216, 2016.
-  Dan C. Cireşan, Alessandro Giusti, Luca M. Gambardella, and Jürgen Schmidhuber. Mitosis detection in breast cancer histology images with deep neural networks. Medical Image Computing and Computer-Assisted Intervention – MICCAI, 2013.
-  Dinggang Shen, Guorong Wu, and Heung-Il Suk. Deep learning in medical image analysis. Annual Review of Biomedical Engineering, 2016.
-  Hoo-Chang Shin, Holger R. Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M. Summers. Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. arXiv:1602.03409, 2016.
-  H. Greenspan, B. van Ginneken, and R. M. Summers. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging, 35(5):1153–1159, 2016.
-  Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander G. Hauptmann. Self-paced learning with diversity. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14, pages 2078–2086, Cambridge, MA, USA, 2014. MIT Press.
-  M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Proceedings of the 23rd International Conference on Neural Information Processing Systems, NIPS’10, pages 1189–1197, USA, 2010. Curran Associates Inc.
-  Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G. Hauptmann. Self-paced curriculum learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 2694–2700. AAAI Press, 2015.
-  Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY, USA, 2009. ACM.
-  James Steven Supancic III and Deva Ramanan. Self-paced learning for long-term tracking. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’13, pages 2379–2386, Washington, DC, USA, 2013. IEEE Computer Society.
-  Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, pages 213–220, New York, NY, USA, 2008. ACM.
-  Xiao-Li Li, Bing Liu, and See-Kiong Ng. Negative training data can be harmful to text classification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 218–228, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
-  Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, and Tapani Raiko. Semi-supervised learning with ladder networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 3546–3554, Cambridge, MA, USA, 2015. MIT Press.
-  Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, et al. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, December 2010.
-  Mingchen Gao, Ulas Bagci, Le Lu, Aaron Wu, Mario Buty, Hoo-Chang Shin, Holger Roth, Georgios Z. Papadakis, Adrien Depeursinge, Ronald M. Summers, Ziyue Xu, and Daniel J. Mollura. Holistic classification of ct attenuation patterns for interstitial lung diseases via deep convolutional neural networks. Computer Methods in Biomechanics and Biomedical Engineering: Imaging and Visualization, pages 1–6, 2016.
-  Kai-Lung Hua, Che-Hao Hsu, Shintami Chusnul Hidayati, Wen-Huang Cheng, and Yu-Jen Chen. Computer-aided classification of lung nodules on computed tomography images via deep learning technique. OncoTargets and therapy, 8:2015–2022, 2015.
-  Mingchen Gao, Ziyue Xu, Le Lu, Adam P. Harrison, Ronald M. Summers, and Daniel J. Mollura. Multi-label Deep Regression and Unordered Pooling for Holistic Interstitial Lung Disease Pattern Detection, pages 147–155. Springer International Publishing, Cham, 2016.
-  Wei Shen, Mu Zhou, Feng Yang, Caiyun Yang, and Jie Tian. Multi-scale Convolutional Neural Networks for Lung Nodule Classification, pages 588–599. Springer International Publishing, Cham, 2015.
-  H. R. Roth, L. Lu, J. Liu, J. Yao, A. Seff, K. Cherry, L. Kim, and R. M. Summers. Improving computer-aided detection using convolutional neural networks and random view aggregation. IEEE Transactions on Medical Imaging, 35(5):1170–1181, May 2016.
-  A. A. A. Setio, F. Ciompi, G. Litjens, P. Gerke, C. Jacobs, S. J. van Riel, M. M. W. Wille, M. Naqibullah, C. I. Sánchez, and B. van Ginneken. Pulmonary nodule detection in CT images: False positive reduction using multi-view convolutional networks. IEEE Transactions on Medical Imaging, 35(5):1160–1169, May 2016.
-  Francesco Ciompi, Bartjan de Hoop, Sarah J. van Riel, Kaman Chung, Ernst Th Scholten, Matthijs Oudkerk, Pim A. de Jong, Mathias Prokop, and Bram van Ginneken. Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2d views and a convolutional neural network out-of-the-box. Medical Image Analysis, 26(1):195–202, 2015.
-  Ashish Gupta, Murat Seçkin Ayhan, and Anthony S. Maida. Natural image bases to represent neuroimaging data. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, pages III–987–III–994. JMLR.org, 2013.
-  Tom Brosch and Roger Tam. Manifold Learning of Brain MRIs by Deep Learning, pages 633–640. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, MM ’14, pages 675–678, New York, NY, USA, 2014. ACM.
-  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
-  A.L. Maas, A.Y. Hannun, and A.Y. Ng. Rectifier nonlinearities improve neural network acoustic models. ICML Workshop on Deep Learning for Audio, Speech, and Language Processing (WDLASL), 2013.
-  X. Zhu, C. Vondrick, C. Fowlkes, and D. Ramanan. Do we need more training data? British Machine Vision Conference, 2012.
-  George R. Washko, Gary M. Hunninghake, Isis E. Fernandez, Mizuki Nishino, Yuka Okajima, Tsuneo Yamashiro, James C. Ross, Raúl San José Estépar, David A. Lynch, John M. Brehm, Katherine P. Andriole, Alejandro A. Diaz, Ramin Khorasani, Katherine D’Aco, Frank C. Sciurba, Edwin K. Silverman, Hiroto Hatabu, and Ivan O. Rosas. Lung volumes and emphysema in smokers with interstitial lung abnormalities. New England Journal of Medicine, 364(10):897–906, 2011. PMID: 21388308.
-  http://www.nvidia.com/object/nvlink.html.