Medical image classification and segmentation are essential building blocks of computer-aided diagnosis systems, where deep learning (DL) approaches have led to state-of-the-art performance. Robust DL approaches need large labeled datasets, which are difficult to obtain for medical images because of: 1) limited expert availability; and 2) the intensive manual effort required for curation. Active learning (AL) approaches overcome data scarcity by using existing models to incrementally select the most informative unlabeled samples, query their labels, and add them to the labeled set. AL in a DL framework poses the following challenges: 1) the labeled samples generated by current AL approaches are too few to train or fine-tune convolutional neural networks (CNNs); 2) AL methods select informative samples using hand-crafted features, whereas feature learning and model training are jointly optimized in CNNs.
Recent approaches to using AL in a DL setting include Bayesian deep neural networks, leveraging separate pools of unlabeled data with high classification uncertainty and high confidence for computer vision applications, and fully convolutional networks (FCNs) for segmenting histopathology images. We propose to generate synthetic data by training a conditional generative adversarial network (cGAN) that learns to generate realistic images from input masks of a specific anatomy. Our model is applied to chest X-ray images to generate realistic images from input lung masks. This approach overcomes the limitations of small training datasets by generating truly informative samples. We test the proposed AL approach on the key tasks of image classification and segmentation, demonstrating its ability to yield models with high accuracy while reducing the number of training samples.
Our proposed AL approach identifies informative unlabeled samples to improve model performance. Most conventional AL approaches identify informative samples using uncertainty, which can introduce bias since uncertainty values depend on the model. We propose a novel approach to generate diverse samples that contribute meaningful information when training the model. Our framework has three components for: 1) sample generation; 2) the classification/segmentation model; and 3) sample informativeness calculation. An initial small labeled set is used to fine-tune a pre-trained model (or any other classification/segmentation model) using standard data augmentation (DA) through rotation and translation. The sample generator takes a test image and a manually segmented mask (and its variations) as input and generates realistic-looking images (details in Sec. 2.1). A Bayesian neural network (BNN) calculates the generated images' informativeness, and highly informative samples are added to the labeled image set. The new training images are used to fine-tune the previously trained classifier. The above steps are repeated until there is no change in classifier performance.
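The steps above can be sketched as a simple loop. This is a minimal illustration, not the paper's code: the fine-tuning, cGAN generation, BNN-uncertainty, and evaluation routines are passed in as placeholder callables (their real implementations are the models of Sec. 2), and the plateau threshold is an assumed value.

```python
# Hypothetical sketch of the iterative AL framework; component functions
# are injected as callables because their real implementations are the
# cGAN, classifier and BNN described in the paper.

def active_learning_loop(fine_tune, generate, uncertainty, evaluate,
                         model, labeled, unlabeled, top_k=2, max_rounds=10):
    """Add the top-k most informative generated samples per round until
    classifier performance stops changing (assumed plateau threshold 1e-3)."""
    prev_score = None
    for _ in range(max_rounds):
        model = fine_tune(model, labeled)        # standard DA happens inside
        candidates = []
        for image, mask in unlabeled:            # cGAN: realistic images from
            candidates += generate(image, mask)  # (image, modified-mask) pairs
        ranked = sorted(candidates, key=uncertainty, reverse=True)
        labeled = labeled + ranked[:top_k]       # add most informative samples
        score = evaluate(model)
        if prev_score is not None and abs(score - prev_score) < 1e-3:
            break                                # performance has plateaued
        prev_score = score
    return model, labeled
```

The callables make the skeleton testable in isolation while keeping the control flow of the framework explicit.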
2.1 Conditional Generative Adversarial Networks
Generative adversarial networks (GANs) learn a mapping from a random noise vector $z$ to an output image $y$, $G: z \rightarrow y$. In contrast, conditional GANs (cGANs) learn a mapping from an observed image $x$ and a random noise vector $z$ to $y$, $G: \{x, z\} \rightarrow y$. The generator $G$ is trained to produce outputs that cannot be distinguished from "real" images by an adversarially trained discriminator $D$. The cGAN objective function is:

$$\mathcal{L}_{cGAN}(G,D) = \mathbb{E}_{x,y}\left[\log D(x,y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x,z))\right)\right], \quad (1)$$

where $G$ tries to minimize this objective against an adversarial $D$ that tries to maximize it, i.e., $G^{*} = \arg\min_{G}\max_{D} \mathcal{L}_{cGAN}(G,D)$. Previous approaches have used an additional loss to encourage the generator output to be close to the ground truth in an $L_2$ sense. We use an $L_1$ loss instead, as it encourages less blurring; it is defined as:

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\left\| y - G(x,z) \right\|_{1}\right].$$

Thus the final objective function is:

$$G^{*} = \arg\min_{G}\max_{D}\, \mathcal{L}_{cGAN}(G,D) + \lambda \mathcal{L}_{L1}(G), \quad (2)$$

where $\lambda$, set empirically, balances the contributions of the two components.
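A sketch of how the combined objective might be evaluated for a single sample, assuming the discriminator outputs `d_real` = D(x, y) and `d_fake` = D(x, G(x, z)) are already available; the default weight `lam` is illustrative, since the paper sets λ empirically.

```python
import numpy as np

# Illustrative sketch (not the paper's code): the cGAN objective of Eqn. 1
# plus the weighted L1 term, evaluated for one (x, y, z) triple.
# d_real and d_fake stand in for discriminator outputs D(x, y) and D(x, G(x, z)).

def cgan_objective(d_real, d_fake, y, g_out, lam=1.0):
    """L_cGAN + lambda * L_L1 for one sample; lam is a hypothetical default."""
    adv = np.log(d_real) + np.log(1.0 - d_fake)   # adversarial term (Eqn. 1)
    l1 = np.abs(y - g_out).mean()                  # L1 reconstruction term
    return adv + lam * l1
```

In training, the generator minimizes and the discriminator maximizes the adversarial part, so a real implementation would alternate gradient updates rather than evaluate a single scalar.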
2.1.1 Synthetic Image Generation:
The parameters of the generator $G$, $\theta_G$, are given by:

$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N} l\left(G_{\theta_G}(x^{n}, z^{n}), x^{n}, y^{n}\right),$$

where $N$ is the number of images. The loss function $l$ combines the content loss and the adversarial loss (Eqn. 1). The content loss $l_{content}$ encourages the output image to have a different appearance from the input $x$. $z$ is the latent vector encoding (obtained from a pre-trained autoencoder) of the segmentation mask. $l_{content}$ is:

$$l_{content}(x, y) = NMI(x, y) - VGG(x, y) - MSE(x, y),$$

where $NMI(x,y)$ denotes the normalized mutual information (NMI) between $x$ and $y$, and is used to determine the similarity of multimodal images. $VGG(x,y)$ is the distance between the two images computed using all the feature maps of a ReLU layer of a pre-trained VGG16 network. The VGG loss improves robustness by capturing information at different scales from multiple feature maps. $MSE(x,y)$ is the intensity mean square error. For similar images, $NMI$ gives a higher value, while $VGG$ and $MSE$ give lower values. In practice $l_{content}$ measures the similarity (instead of the dissimilarity in traditional loss functions) between two images, and takes higher values for similar images. Since we are minimizing the total loss function, $l_{content}$ encourages the generated image to be different from the input $x$.
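The content loss can be sketched as follows. This is an illustration under assumptions: the NMI here is the Studholme formulation computed from a joint intensity histogram, and the VGG feature distance is passed in as a callable (defaulting to zero) since it requires a pre-trained network.

```python
import numpy as np

# Sketch of l_content = NMI - VGG - MSE. The VGG term needs a pre-trained
# network, so it is injected as a callable (an assumption of this sketch).

def nmi(a, b, bins=32):
    """Normalized mutual information (H(A)+H(B))/H(A,B); equals 2 for
    identical images and approaches 1 for independent ones."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = joint / joint.sum()
    pa, pb = p.sum(axis=1), p.sum(axis=0)
    ent = lambda q: -np.sum(q[q > 0] * np.log(q[q > 0]))
    return (ent(pa) + ent(pb)) / ent(p)

def content_loss(x, y, vgg_dist=lambda a, b: 0.0):
    """Higher for similar images; minimizing it pushes G(x, z) away from x."""
    mse = np.mean((x - y) ** 2)
    return nmi(x, y) - vgg_dist(x, y) - mse
```

With this sign convention, a pair of identical images scores higher than a dissimilar pair, which is exactly why minimizing the total loss promotes diverse generated samples.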
The generator (Fig. 1(a)) employs residual blocks, each having two convolution layers, followed by batch normalization and ReLU activation. It takes as input the test X-ray image and the latent vector encoding of a mask (either original or altered), and outputs a realistic X-ray image whose label class is the same as that of the original image. The discriminator $D$ (Fig. 1(b)) has eight convolution layers, with the number of kernels increasing progressively. Leaky ReLU is used, and strided convolutions reduce the image dimension whenever the number of features is doubled. The resulting feature maps are followed by two dense layers and a final sigmoid activation to obtain a probability map. $D$ evaluates the similarity between $x$ and $y$. To generate images with a wide variety of information, we modify the segmentation masks of the test images by adopting one or more of the following steps:
Boundary Displacement: The boundary contours of the mask are displaced to change its shape. We select continuous points at multiple boundary locations, randomly displace each of them by a few pixels, and fit a b-spline to change the boundary shape. The intensity of pixels outside the original mask is assigned by linear interpolation, or by generating intensity values from a distribution identical to that of the original mask.
Other conventional augmentation techniques like flipping, rotation and translation are also used.
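The boundary-displacement step might look like the following dependency-free sketch. Note that it approximates the paper's b-spline fit with a periodic linear interpolation of displaced radial control points, and assumes a roughly star-convex mask; the names and parameters are illustrative.

```python
import numpy as np

# Simplified boundary-displacement sketch. The paper displaces contour points
# and fits a b-spline; to stay dependency-free this version perturbs radial
# control points of a star-convex mask and smooths them with periodic linear
# interpolation (an approximation, not the actual b-spline fit).

def displace_boundary(mask, n_ctrl=8, max_disp=3, seed=0):
    h, w = mask.shape
    cy, cx = np.argwhere(mask).mean(axis=0)            # mask centroid
    yy, xx = np.mgrid[0:h, 0:w]
    ang = np.arctan2(yy - cy, xx - cx)
    rad = np.hypot(yy - cy, xx - cx)
    # radius of the original boundary in each of 360 angular bins
    bins = ((ang + np.pi) / (2 * np.pi) * 360).astype(int) % 360
    r_max = np.zeros(360)
    for b, r in zip(bins[mask > 0], rad[mask > 0]):
        r_max[b] = max(r_max[b], r)
    rng = np.random.default_rng(seed)
    ctrl = np.linspace(0, 360, n_ctrl, endpoint=False)
    disp = rng.uniform(-max_disp, max_disp, n_ctrl)
    # smooth periodic displacement field over all angles
    field = np.interp(np.arange(360), np.r_[ctrl, 360], np.r_[disp, disp[0]])
    new_mask = (rad <= (r_max + field)[bins]).astype(mask.dtype)
    return new_mask
```

With zero displacement the routine reproduces the input mask exactly, so any change in the output comes purely from the randomized boundary perturbation.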
For every test image we obtain multiple synthetic images from its modified masks. Figure 2(a) shows an original normal image (bottom row) and its mask (top row), and Figs. 2(b,c) show generated 'normal' images. Figure 2(d) shows the corresponding image and mask for an image with nodules, and Figs. 2(e,f) show generated 'nodule' images. Although the nodules are very difficult to observe with the naked eye, we highlight their positions using yellow boxes. The generated images are clearly realistic and suitable for training.
2.2 Sample Informativeness Using Uncertainty from Bayesian Neural Networks
Each generated image's uncertainty is calculated using the method of Kendall and Gal. Two types of uncertainty measure can be calculated from a Bayesian neural network (BNN): aleatoric uncertainty models the noise in the observations, while epistemic uncertainty models the uncertainty in the model parameters. We calculate uncertainty by combining the above two types. A brief description is given below; we refer the reader to the original work for details. For a BNN model mapping an input image $x$ to a unary output $\hat{y}$, the predictive uncertainty for a pixel $y$ is approximated using:

$$Var(y) \approx \frac{1}{T}\sum_{t=1}^{T} \hat{y}_t^{2} - \left(\frac{1}{T}\sum_{t=1}^{T} \hat{y}_t\right)^{2} + \frac{1}{T}\sum_{t=1}^{T} \hat{\sigma}_t^{2},$$

where $\hat{\sigma}_t^{2}$ is the BNN output for the predicted variance of pixel $\hat{y}_t$, and $\{\hat{y}_t, \hat{\sigma}_t^{2}\}_{t=1}^{T}$ is a set of $T$ sampled outputs.
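The Monte-Carlo estimate above can be written directly in a few lines; this is a generic sketch of the Kendall-and-Gal-style estimator, not the paper's implementation.

```python
import numpy as np

# Monte-Carlo predictive uncertainty: the first two terms are the epistemic
# part (variance of the sampled means), the last is the mean predicted
# (aleatoric) variance, matching the formula in Sec. 2.2.

def predictive_uncertainty(y_hat, sigma2_hat):
    """y_hat, sigma2_hat: length-T arrays of sampled outputs {y_t, sigma_t^2}."""
    y_hat = np.asarray(y_hat, dtype=float)
    sigma2_hat = np.asarray(sigma2_hat, dtype=float)
    epistemic = np.mean(y_hat ** 2) - np.mean(y_hat) ** 2
    aleatoric = np.mean(sigma2_hat)
    return epistemic + aleatoric
```

For a whole image, the same computation would be applied per pixel and the result aggregated into a single informativeness score.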
2.3 Implementation Details
Our classification model is pre-trained on the ImageNet dataset. Our entire dataset had normal images and nodule images. We chose an initial labeled set of (chosen empirically) images from each class, augmented it using standard data augmentation (rotation and translation), and used it to fine-tune the last classification layer of the network. The remaining test images and their masks were used to generate multiple images with our proposed cGAN approach (synthetic images for every test image, as described earlier), and each generated image's uncertainty was calculated as described in Section 2.2. We ranked the images by uncertainty score, and the top images from each class were augmented (rotation and translation) and used to further fine-tune the classifier. This ensures equal representation of normal and diseased samples among those added to the training data. This sequence of steps is repeated until there is no further improvement in classifier accuracy when tested on a separate test set (with images from each of the nodule and normal classes). Our knowledge of the image labels allows quantitative analysis of model performance.
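The class-balanced selection described here might be sketched as follows; this is an illustrative helper, not code from the paper.

```python
import numpy as np

# Class-balanced sample selection for the labeled-set update: the top-k most
# uncertain generated images are taken *per class*, so normal and nodule
# samples are equally represented in each round.

def select_balanced(scores, labels, k):
    """Return indices of the k highest-uncertainty samples from each class."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    chosen = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        order = idx[np.argsort(scores[idx])[::-1]]   # descending uncertainty
        chosen.extend(order[:k].tolist())
    return chosen
```

Selecting per class rather than globally avoids the update being dominated by whichever class happens to produce higher uncertainty scores.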
Our algorithm is trained on the SCR chest X-ray database, which has X-rays of patients (normal and nodule images, resized) along with manual segmentations of the clavicles, lungs and heart. The dataset is augmented using rotation, translation, scaling and flipping. We take a separate test set of images from the NIH dataset, with normal images and images with nodules.
3.1 Classification Results
Here we show results for classifying different images using different amounts of labeled data, and demonstrate our method's ability to optimize the amount of labeled data necessary to attain a given performance, as compared to conventional approaches where no sample selection is performed. In one set of experiments we used the entire training set with augmentation to fine-tune the classifier, and tested it on the separate test set. We call this the fully supervised learning (FSL) setting. Subsequently, in other experiments for AL, we used different numbers of initial training samples in each update of the training data. The batch size is the same as the initial number of samples.
The results are summarized in Table 1, where classification performance in terms of sensitivity (Sens.), specificity (Spec.) and area under the curve (AUC) is reported for different settings using the VGG16 and ResNet18 classifiers. Under FSL, 'fold' indicates normal fold cross-validation, and the other setting is the scenario where a fixed fraction of the training data was randomly chosen to train the classifier, with performance measured on the test data (averaged over multiple such runs). We ensure that all samples were part of the training and test sets at least once. In all cases, AL classification performance reaches almost the same level as FSL when the number of training samples is a fraction of the dataset. Subsequently increasing the number of samples does not lead to significant performance gains. This trend is observed for both classifiers, indicating it is not dependent upon classifier choice.
[Table 1: Classification performance (sensitivity, specificity, AUC) with the VGG16 and ResNet18 classifiers, and segmentation performance, under the active learning and FSL settings.]
3.2 Segmentation Performance
Using the labeled datasets at different stages, we train a UNet for segmenting both lungs. The trained model is then evaluated on the separate set of images from the NIH database, on which we manually segmented both lungs. The segmentation performance for FSL and the different AL settings is summarized in Table 1 in terms of the Dice Metric (DM) and Hausdorff Distance (HD). We observe that the segmentation performance reaches the same level as FSL at a fraction of the full dataset, similar to classification. Figure 3 shows the segmentation results for the different training models. When the number of training samples is below this fraction, segmentation performance is poor in the most challenging cases. However, the performance improves steadily until it stabilizes at that threshold.
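For reference, the Dice Metric reported in Table 1 can be computed as follows for binary masks; this is a plain sketch (the Hausdorff Distance would additionally require boundary extraction and is omitted).

```python
import numpy as np

# Dice Metric for binary segmentation masks: 2|A∩B| / (|A| + |B|),
# where 1.0 indicates perfect overlap with the ground truth.

def dice(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    # two empty masks count as a perfect match by convention
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0
```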
3.3 Savings in Annotation Effort
The segmentation and classification results demonstrate that, with the most informative samples, optimal performance can be achieved using a fraction of the dataset. This translates into significant savings in annotation cost, as it reduces the number of images and pixels that need to be annotated by an expert. We calculate the number of pixels in the images that were part of the AL-based training set at different stages for both classification and segmentation. At the point of optimal performance, only a fraction of the pixels in the training images have been annotated. These numbers clearly suggest that our AL framework can lead to substantial savings in the time and effort put in by the experts.
We have proposed a method to generate chest X-ray images for active-learning-based model training by modifying the original masks of the associated images. A generated image's informativeness is calculated using a Bayesian neural network, and the most informative samples are added to the training set. This sequence of steps continues until the labeled samples provide no additional information. Our experiments demonstrate that with a fraction of the labeled samples we can achieve almost the same classification and segmentation performance as obtained when using the full dataset. This is made possible by selecting the most informative samples for training: the model sees all the informative samples first, and achieves optimal performance in fewer iterations. The performance of the proposed AL-based model translates into significant savings in annotation effort and clinicians' time. In future work we aim to further investigate the realism of the generated images.
The authors acknowledge the support from SNSF project grant number .
-  Gal, Y., Islam, R., Ghahramani, Z.: Deep Bayesian Active Learning with Image Data. In: Proc. International Conference on Machine Learning (2017)
-  van Ginneken, B., Stegmann, M., Loog, M.: Segmentation of anatomical structures in chest radiographs using supervised methods: a comparative study on a public database. Med. Imag. Anal. 10(1), 19–40 (2006)
-  Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Proc. NIPS. pp. 2672–2680 (2014)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. CVPR (2016)
-  Kendall, A., Gal, Y.: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In: Advances in Neural Information Processing Systems. (2017)
-  Li, X., Guo, Y.: Adaptive active learning for image classification. In: Proc. CVPR (2013)
-  Mahapatra, D., Bozorgtabar, B., Hewavitharanage, S., Garnavi, R.: Image super resolution using generative adversarial networks and local saliency maps for retinal image analysis. In: Proc. MICCAI. pp. 382–390 (2017)
-  Mahapatra, D., Schüffler, P., Tielbeek, J., Vos, F., Buhmann, J.: Semi-supervised and active learning for automatic segmentation of crohn's disease. In: Proc. MICCAI, Part 2. pp. 214–221 (2013)
-  Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.: Context encoders: Feature learning by inpainting. In: Proc. CVPR. pp. 2536–2544 (2016)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Proc. MICCAI. pp. 234–241 (2015)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
-  Tajbakhsh, N., Shin, J., Gurudu, S., Hurst, R.T., Kendall, C., Gotway, M., Liang, J.: Convolutional neural networks for medical image analysis: Full training or fine tuning?. IEEE Trans. Med. Imag. 35(5), 1299–1312 (2016)
-  Wang, K., Zhang, D., Li, Y., Zhang, R., Lin, L.: Cost-effective active learning for deep image classification. IEEE Trans. CSVT 27(12), 2591–2600 (2017)
-  Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.: Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proc. CVPR (2017)
-  Yang, L., Zhang, Y., Chen, J., Zhang, S., Chen, D.: Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation. In: Proc. MICCAI. pp. 399–407 (2017)