Osteosarcoma is the most common bone cancer in adolescents, with a second, smaller peak in older adults. Pre-operative chemotherapy followed by surgery is a standard treatment for osteosarcoma. The ratio of necrotic tumor to overall tumor after neoadjuvant chemotherapy is a well-known prognostic factor and correlates with patient survival [13, 19]. In particular, for patients with localized disease who have undergone complete resection, a necrosis ratio greater than 90% is associated with a 5-year survival higher than 80%. Currently, pathologists estimate the necrosis ratio manually by microscopic review of multiple glass slides from resected specimens.
Computational pathology has provided automated and reproducible techniques to analyze digitized histopathology images, especially with convolutional neural networks (CNNs). Arunachalam et al. showed that a patch-level classification CNN composed of three convolutional layers and two fully-connected layers could be used to identify viable tumor, necrotic tumor, and non-tumor in osteosarcoma. For more accurate analysis, fully convolutional networks were developed for pixel-wise classification, also known as semantic segmentation. U-Net has also been described for segmenting subcellular structures in microscopy images. More recently, the Deep Multi-Magnification Network (DMMN) was introduced for multi-class tissue segmentation of histopathology images by looking at patches in multiple magnifications, and it has shown outstanding segmentation performance in breast cancer.
The performance of these supervised machine learning methods depends heavily on the amount and quality of annotations. Public datasets with annotations are generally available for common cancer types such as breast cancer [7, 2] and have been widely used for training CNNs [24, 14]. For rare cancers such as osteosarcoma, fresh manual annotations by pathologists with specialized expertise are required. Such annotations take considerable time from busy professionals, so reducing the annotation burden is paramount. Interactive learning has been developed to reduce annotation time. It allows annotators to “interact” with a machine learning model by correcting its predictions, improving its performance until the predictions are satisfactory [8, 20]. An interactive segmentation toolkit for biomedical images, known as ilastik, was introduced in [21, 4], where random forest classifiers [6] were used for segmentation. QuPath was developed to interactively analyze giga-pixel whole slide images, where segmentation was also based on random forest classifiers.
In this paper, we propose Deep Interactive Learning (DIaL), which integrates the concept of interactive learning into a deep learning framework for multi-class tissue segmentation of histopathology images and treatment response assessment for osteosarcoma. To evaluate our segmentation model, we estimate the necrosis ratio at the case level by counting the number of pixels predicted as viable tumor and necrotic tumor by the segmentation model, and we compare it with the ratio from pathology reports. We observe that our CNN model can estimate the necrosis ratio within the expected inter-observer variation rate for a non-standardized manual surgical pathology task. Note that the total labeling time with DIaL was approximately 7 hours.
2 Proposed Method
Osteosarcoma whole slide images (WSIs) must be manually labeled to supervise a segmentation convolutional neural network (CNN) for automated treatment response assessment. Labeling WSIs exhaustively would be ideal, but it requires a tremendous amount of labeling time. Partial labeling approaches have been introduced to reduce labeling time [5, 12], but challenging or rare morphological features can be missed. We propose Deep Interactive Learning (DIaL) to efficiently annotate both characteristic and challenging features on WSIs and thereby achieve strong segmentation performance. Our block diagram is shown in Figure 1. First, initial annotation is done partially, mainly on characteristic features of each class. During DIaL, CNN training, segmentation prediction, and correction of mislabeled regions are repeated to improve segmentation performance until the annotators are satisfied with the segmentation predictions on the training images. Note that challenging or rare features are labeled during the correction step. Once training is finalized, the CNN is used to segment viable tumor and necrotic tumor on testing cases to assess treatment response.
2.1 Initial Annotation
Initial annotation of the characteristic features of each class is done to train an initial CNN model. In this work, annotators label 7 morphologically distinct classes, shown in Figure 2: viable tumor, necrosis with bone, necrosis without bone, normal bone, normal tissue, cartilage, and blank. Note that the training images are only partially annotated at this stage.
2.2 Deep Interactive Learning
During initial annotation, challenging or rare features may not be included in the training set, which can lead to mislabeled predictions. These challenging features can be added to the training set through Deep Interactive Learning (DIaL) by repeating three steps: training/finetuning, segmentation, and correction. The steps are repeated until the annotators are satisfied with the segmentation predictions on the training images.
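The repeated train, segment, and correct cycle can be sketched as a simple loop. This is a minimal illustration rather than the authors' implementation: `train`, `segment`, and `review` are hypothetical callables standing in for CNN training or finetuning, segmentation of the training images, and correction by the annotators, and `max_rounds` is an assumed safety limit.

```python
from typing import Callable, List, Optional, Tuple


def deep_interactive_learning(
    initial_annotations: List[object],
    train: Callable[[Optional[object], List[object]], object],
    segment: Callable[[object], List[object]],
    review: Callable[[List[object]], List[object]],
    max_rounds: int = 10,
) -> Tuple[object, int]:
    """Repeat train -> segment -> correct until no corrections are made.

    All callables are hypothetical placeholders:
      train(previous_model, annotations) -> model (initial training or finetuning)
      segment(model) -> predictions on the training images
      review(predictions) -> corrections (empty when the annotators are satisfied)
    Returns the final model and the number of correction rounds performed.
    """
    annotations = list(initial_annotations)
    model = train(None, annotations)  # initial training on partial annotations
    for round_index in range(max_rounds):
        predictions = segment(model)
        corrections = review(predictions)
        if not corrections:  # annotators are satisfied with the predictions
            return model, round_index
        annotations.extend(corrections)  # add challenging or rare features
        model = train(model, annotations)  # finetune from the previous model
    return model, max_rounds
```

Passing the previous model back into `train` mirrors the idea of finetuning instead of retraining from scratch after each correction round.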
2.2.1 Initial Training
We need an initially trained model to annotate mislabeled regions with challenging features. WSIs are too large to be processed at once, so the labeled regions are extracted into fixed-size patches, and a patch is kept only when more than 1% of its pixels are annotated. To balance the number of pixels between classes, patches containing rare classes are deformed by elastic deformation [18, 9] to produce additional patches. Here, we define a class as rare if its number of pixels is less than 70% of the maximum number of pixels among classes. After patch extraction and deformation, some cases are set aside for validating the CNN model, such that approximately 20% of the pixels in each class are used for validation. We use a Deep Multi-Magnification Network (DMMN) for multi-class tissue segmentation, where the model looks at patches in multiple magnifications for accurate predictions. Specifically, the DMMN is composed of three half-channeled U-Nets, one per magnification, whose input patches have the same pixel size and are centered at the same location. Intermediate feature maps in the decoders of two of the U-Nets are center-cropped and concatenated to the decoder of the third to enrich its feature maps. The DMMN generates its final prediction patch at a single magnification. To train our model from its initial weights, we use weighted cross entropy as our loss function, where the weight of each class is determined by the total number of classes and the number of pixels in that class. Note that unlabeled regions do not contribute to the training process. During training, random rotation, vertical and horizontal flips, and color jittering are used as data augmentation. A stochastic gradient descent (SGD) optimizer with a momentum of 0.99 and weight decay is used for 30 epochs. In each epoch, the model is validated by mean Intersection-over-Union (mIoU), and the model with the highest mIoU is selected as the output model.
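The two thresholds used during patch extraction (more than 1% of pixels annotated, and rare classes below 70% of the largest class) can be illustrated with a small sketch. It is not the authors' code: label patches are represented here as nested lists of integer class indices, and `UNLABELED = -1` is a hypothetical marker for pixels without annotation.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Set

UNLABELED = -1  # hypothetical marker for pixels that carry no annotation

LabelPatch = Sequence[Sequence[int]]


def annotated_fraction(patch: LabelPatch) -> float:
    """Fraction of pixels in a label patch that carry an annotation."""
    total = sum(len(row) for row in patch)
    labeled = sum(1 for row in patch for label in row if label != UNLABELED)
    return labeled / total if total else 0.0


def keep_patch(patch: LabelPatch, min_fraction: float = 0.01) -> bool:
    """Keep a patch only when more than 1% of its pixels are annotated."""
    return annotated_fraction(patch) > min_fraction


def count_pixels(patches: Iterable[LabelPatch]) -> Counter:
    """Count annotated pixels per class over a collection of label patches."""
    counts: Counter = Counter()
    for patch in patches:
        for row in patch:
            counts.update(label for label in row if label != UNLABELED)
    return counts


def rare_classes(pixel_counts: Dict[int, int], threshold: float = 0.70) -> Set[int]:
    """Classes whose pixel count is below 70% of the largest class."""
    largest = max(pixel_counts.values())
    return {cls for cls, n in pixel_counts.items() if n < threshold * largest}
```

Patches containing any class returned by `rare_classes` would then be candidates for elastic deformation to generate additional training patches.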
After a model is trained, all training WSIs are processed to evaluate unlabeled regions. A set of patches at the three magnifications, centered at the same location, is processed using the DMMN. Note that zero-padding is applied on the boundary of WSIs. Patch-wise segmentation is repeated in the horizontal and vertical directions with a stride of 256 pixels until the entire WSI is processed.
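The patch-wise traversal can be sketched as a generator of tile origins. This is a simplified illustration under stated assumptions: the output tile size is taken to equal the 256-pixel stride, padding only rounds the image up to a whole number of tiles, and the additional context that lower-magnification patches need around each tile is not modeled. The function names are hypothetical.

```python
import math
from typing import Iterator, Tuple


def tile_origins(width: int, height: int, stride: int = 256) -> Iterator[Tuple[int, int]]:
    """Yield top-left (x, y) corners of tiles covering the image, row by row."""
    for y in range(0, height, stride):
        for x in range(0, width, stride):
            yield x, y


def padded_size(width: int, height: int, stride: int = 256) -> Tuple[int, int]:
    """Image size after zero-padding up to a whole number of tiles."""
    return (
        math.ceil(width / stride) * stride,
        math.ceil(height / stride) * stride,
    )
```

Under these assumptions, a WSI of 61022 by 41518 pixels is covered by 239 by 163 tiles, that is, 38957 forward passes per slide.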
Characteristic features are annotated during initial annotation, but challenging or rare features may not be included. During the correction step, the challenging features that the model could not predict correctly are annotated and added to the training set to improve the model. In this step, the annotators look at the segmentation predictions and correct any mislabeled regions. If the predictions are satisfactory across the training images, the model is finalized.
Assuming the previous CNN model has already learned most features of the classes, we finetune it to improve segmentation performance. Corrected regions are extracted into patches and included in the training set. Additional patches are generated by deforming the extracted patches, which gives a higher weight to challenging or rare features so that they are emphasized during finetuning. The SGD optimizer and weighted cross entropy with updated weights are used during finetuning, and we reduce the learning rate and the number of epochs to 10 so that the parameters of the CNN model are not perturbed too much from the previous model. Note that validation cases can be selected again so that the majority of corrected cases are used for optimization.
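How correction patches enter the finetuning set can be sketched as follows. This is an illustrative assumption about the bookkeeping, not the authors' code: `deform` is a hypothetical placeholder for elastic deformation, and the `double_weighted` flag selects whether deformed copies of the correction patches are added alongside the extracted ones.

```python
from typing import Callable, Iterable, List, TypeVar

Patch = TypeVar("Patch")


def build_finetuning_set(
    previous_patches: Iterable[Patch],
    correction_patches: Iterable[Patch],
    deform: Callable[[Patch], Patch],
    double_weighted: bool = True,
) -> List[Patch]:
    """Combine existing training patches with patches from corrected regions.

    With double_weighted=True, each correction patch is also added in a
    deformed copy, so corrected features appear twice in the training set.
    `deform` is a hypothetical stand-in for elastic deformation.
    """
    corrections = list(correction_patches)  # materialize so it can be reused
    patches = list(previous_patches) + corrections
    if double_weighted:
        patches.extend(deform(patch) for patch in corrections)
    return patches
```

Setting `double_weighted=False` keeps only the extracted correction patches, which gives corrected regions the same weight as the original annotations.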
2.3 Treatment Response Assessment
The final CNN model segments viable tumor and necrotic tumor on testing WSIs. Note that necrotic tumor is a combination of necrosis with bone and necrosis without bone. The case-level ratio of necrotic tumor to overall tumor estimated by a deep learning model, $R_{\mathrm{DL}}$, is defined as
$$R_{\mathrm{DL}} = \frac{N_{\mathrm{NT}}}{N_{\mathrm{VT}} + N_{\mathrm{NT}}},$$
where $N_{\mathrm{VT}}$ and $N_{\mathrm{NT}}$ are the numbers of pixels of viable tumor and necrotic tumor in a case, respectively.
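The case-level necrosis ratio follows directly from per-class pixel counts. The sketch below assumes the counts have already been accumulated over all WSIs of a case into a mapping keyed by class name; the function name and the key strings are illustrative choices based on the class names listed earlier.

```python
from typing import Mapping

VIABLE_CLASS = "viable tumor"
NECROTIC_CLASSES = ("necrosis with bone", "necrosis without bone")


def necrosis_ratio(pixel_counts: Mapping[str, int]) -> float:
    """Ratio of necrotic tumor pixels to all tumor pixels in one case."""
    viable = pixel_counts.get(VIABLE_CLASS, 0)
    # Necrotic tumor combines necrosis with bone and necrosis without bone.
    necrotic = sum(pixel_counts.get(name, 0) for name in NECROTIC_CLASSES)
    tumor = viable + necrotic
    if tumor == 0:
        raise ValueError("no tumor pixels predicted for this case")
    return necrotic / tumor
```

Non-tumor classes such as normal bone or cartilage do not enter the ratio, so they can be present in the mapping without affecting the result.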
3 Experimental Results
Our hematoxylin and eosin (H&E) stained osteosarcoma dataset was digitized by two Aperio AT2 scanners at Memorial Sloan Kettering Cancer Center, where the microns per pixel (MPP) is 0.5025 for one scanner and 0.5031 for the other. The dataset contains 55 cases with 1578 whole slide images (WSIs); the number of WSIs per case ranges from 1 to 109, with a mean of 28.7 and a median of 22, and the average width and height of the WSIs are 61022 pixels and 41518 pixels, respectively. We used 13 cases for training and the other 42 cases for testing. Note that 8 testing cases do not report the necrosis ratio in their pathology reports, so they were excluded from evaluation. Two annotators (N.P.A. and M.R.H.) selected 49 WSIs from the 13 training cases and independently annotated them without case-level overlap. The pixel-wise annotation was performed on an in-house WSI viewer, which allowed us to measure the time taken for annotation. The annotators labeled in three iterations using Deep Interactive Learning (DIaL): initial annotation, first correction, and second correction. They annotated 49 WSIs in 4 hours, 37 WSIs in 3 hours, and 13 WSIs in 1 hour during the initial annotation, the first correction, and the second correction, respectively. The annotators also exhaustively labeled an entire WSI, which took approximately 1.5 hours. An example of exhaustive annotation and annotation with DIaL is shown in Figure 3. In the same amount of time, the annotators would be able to exhaustively annotate only about 5 WSIs without DIaL, so DIaL lets them annotate more diverse cases. The numbers of pixels annotated and deformed are shown in Figure 4(a). The implementation was done using PyTorch, and an Nvidia Tesla V100 GPU was used for training and segmentation. Initial training and finetuning took approximately 5 days and 2 days, respectively. Segmentation of one WSI took on the order of minutes.
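The comparison between DIaL and exhaustive labeling is simple arithmetic on the reported times. The sketch below assumes that the 7-hour budget corresponds to the initial annotation plus the first correction; the variable names are illustrative.

```python
# Reported annotation times, in hours.
initial_annotation_hours = 4.0   # 49 WSIs, initial annotation
first_correction_hours = 3.0     # 37 WSIs, first correction
exhaustive_hours_per_wsi = 1.5   # one exhaustively labeled WSI

# Assumed budget: initial annotation plus first correction.
dial_budget_hours = initial_annotation_hours + first_correction_hours

# Number of WSIs that could be labeled exhaustively in the same time.
equivalent_exhaustive_wsis = dial_budget_hours / exhaustive_hours_per_wsi

print(f"DIaL budget: {dial_budget_hours:.1f} h")
print(f"Exhaustive WSIs in the same time: {equivalent_exhaustive_wsis:.2f}")
```

Rounded to the nearest whole slide, this matches the figure of about 5 exhaustively annotated WSIs, compared with 49 partially annotated WSIs under DIaL.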
To evaluate our segmentation model, 1044 WSIs from 34 cases were segmented to estimate the necrosis ratio. Note that all WSIs were segmented, just as pathologists look at all glass slides under the microscope to assess the necrosis ratio. To numerically evaluate the estimated necrosis ratio, we compared it with the ratio from pathology reports written by experts. Here, the error rate, $E$, is defined as
$$E = \frac{1}{N} \sum_{i=1}^{N} \left| R_{\mathrm{PR}}^{(i)} - R_{\mathrm{DL}}^{(i)} \right|,$$
where $R_{\mathrm{PR}}^{(i)}$ is the ratio from a pathology report and $R_{\mathrm{DL}}^{(i)}$ is the ratio estimated by a deep learning model for the $i$-th case, and where $N$ is the number of testing cases. Figure 4(b) shows the error rates for our models. Model1, Model2a, Model2b, and Model3 denote an initially trained model, a model finetuned from Model1 with single-weighted first correction, a model finetuned from Model1 with double-weighted first correction, and a model finetuned from Model2b with double-weighted second correction, respectively. Note that we tried both single-weighted correction, which includes only the extracted correction patches, and double-weighted correction, which includes both the extracted correction patches and their corresponding deformed patches, during the finetuning step. We observed that the error rate decreases after the first correction, especially with a higher weight on correction patches to emphasize challenging features. We selected Model2b as our final model because the error rate stopped decreasing after the second correction. Our final model, trained with only 7 hours of annotation done by DIaL, achieved an error rate of 20%. An inter-observer error rate of 20% is generally considered acceptable for non-standardized tasks in surgical pathology; for example, the percentage of tumor cells has been overestimated by pathologists by up to 20% in certain instances. While this cannot be directly transferred to necrosis estimation, we use this figure as a reference to show that the model reaches a comparable error rate.
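The case-level comparison can be sketched in a few lines. The sketch assumes that the error rate is the mean absolute difference between the reported and estimated necrosis ratios over the testing cases; the function name and the example values are illustrative and are not data from the paper.

```python
from typing import Sequence


def mean_absolute_error_rate(
    reported: Sequence[float], estimated: Sequence[float]
) -> float:
    """Mean absolute difference between reported and estimated ratios.

    Both sequences hold one necrosis ratio per testing case, in the same
    order and in the same unit (for example, percent).
    """
    if not reported or len(reported) != len(estimated):
        raise ValueError("need one reported and one estimated ratio per case")
    total = sum(abs(r - e) for r, e in zip(reported, estimated))
    return total / len(reported)


# Illustrative values only, in percent.
example_reported = [90, 50, 10]
example_estimated = [70, 60, 40]
example_error = mean_absolute_error_rate(example_reported, example_estimated)
```

Because the measure averages absolute differences, overestimates and underestimates of the necrosis ratio do not cancel each other out.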
Manual quantification of the necrosis ratio by pathologists is challenging because one must make an estimate across multiple glass slides that may differ substantially in their ratio of necrosis. We are convinced that our objective and reproducible deep learning model, which estimates the necrosis ratio within the expected inter-observer variation rate, can be superior to manual interpretation.
We presented Deep Interactive Learning (DIaL) for efficient annotation to train a segmentation CNN. With 7 hours of labeling, we obtained a CNN that segments viable tumor and necrotic tumor on osteosarcoma whole slide images. Our experiments showed that the CNN model can estimate the necrosis ratio, a known prognostic factor for survival of patients with osteosarcoma, in an objective and reproducible way. In the future, we plan to study patient stratification based on patients’ survival data using our deep learning model.
This work was supported by the Warren Alpert Foundation Center for Digital and Computational Pathology at Memorial Sloan Kettering Cancer Center and the NIH/NCI Cancer Center Support Grant P30 CA008748. T.J.F. is the Chief Scientific Officer, co-founder and equity holder of Paige.AI. P.J.S. is a lead machine learning scientist, co-founder and equity holder of Paige.AI. C.M.V. is a consultant for Paige.AI. D.J.H. and T.J.F. have intellectual property interests relevant to the work that is the subject of this paper. MSK has financial interests in Paige.AI and intellectual property interests relevant to the work that is the subject of this paper.
-  Arunachalam, H. B., et al.: Viable and necrotic tumor assessment from whole slide images of osteosarcoma using machine-learning and deep-learning models. PLoS ONE 14(4), e0210706 (2019)
-  Bandi, P., et al.: From Detection of Individual Metastases to Classification of Lymph Node Status at the Patient Level: The CAMELYON17 Challenge. IEEE Transactions on Medical Imaging 38(2), 550–560 (2019)
-  Bankhead, P., et al.: QuPath: Open source software for digital pathology image analysis. Scientific Reports 7, 16878 (2017)
-  Berg, S., et al.: ilastik: interactive machine learning for (bio)image analysis. Nature Methods 16, 1226–1232 (2019)
-  Bokhorst, J. M., et al.: Learning from sparsely annotated data for semantic segmentation in histopathology images. In: Proceedings of the International Conference on Medical Imaging with Deep Learning, pp. 84–91 (2019)
-  Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001)
-  Ehteshami Bejnordi, B., et al.: Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA 318(22), 2199–2210 (2017)
-  Fails, J. A., Olsen, D. R.: Interactive machine learning. In: Proceedings of the International Conference on Intelligent User Interfaces, pp. 39–45 (2003)
-  Fu, C., et al.: Nuclei segmentation of fluorescence microscopy images using convolutional neural networks. In: Proceedings of the IEEE International Symposium on Biomedical Imaging, pp. 704–708 (2017)
-  Fuchs, T. J., Buhmann, J. M.: Computational pathology: Challenges and promises for tissue analysis. Computerized Medical Imaging and Graphics 35(7), 515–530 (2011)
-  Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)
-  Ho, D. J., et al.: Deep Multi-Magnification Networks for Multi-Class Breast Cancer Image Segmentation. arXiv preprint, arXiv:1910.13042 (2019)
-  Huvos, A. G., Rosen, G., Marcove, R. C.: Primary osteogenic sarcoma: pathologic aspects in 20 patients after treatment with chemotherapy en bloc resection, and prosthetic bone replacement. Archives of Pathology & Laboratory Medicine 101(1), 14–18 (1977)
-  Lee, B., Paeng, K.: A Robust and Effective Approach Towards Accurate Metastasis Detection and pN-stage Classification in Breast Cancer. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 841–850 (2018)
-  Ottaviani, G., Jaffe, N.: The Epidemiology of Osteosarcoma. Pediatric and Adolescent Osteosarcoma, 3–13 (2009)
-  Paszke, A., et al.: PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: Proceedings of the Neural Information Processing Systems, pp. 8024–8035 (2019)
-  Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241 (2015)
-  Rosen, G., et al.: Preoperative chemotherapy for osteogenic sarcoma: Selection of postoperative adjuvant chemotherapy based on the response of the primary tumor to preoperative chemotherapy. Cancer 49(6), 1221–1230 (1982)
-  Schüffler, P. J., Fuchs, T. J., Ong, C. S., Wild, P., Buhmann, J. M.: TMARKER: A Free Software Toolkit for Histopathological Cell Counting and Staining Estimation. Journal of Pathology Informatics 4(2) (2013)
-  Sommer, C., Straehle, C., Koethe, U., Hamprecht, F. A.: ilastik: Interactive learning and segmentation toolkit. In: Proceedings of the IEEE International Symposium on Biomedical Imaging, pp. 230–233 (2011)
-  Srinidhi, C. L., Ciga, O., Martel, A. L.: Deep neural network models for computational histopathology: A survey. arXiv preprint, arXiv:1912.12378 (2019)
-  Viray, H., et al.: A Prospective, Multi-Institutional Diagnostic Trial to Determine Pathologist Accuracy in Estimation of Percentage of Malignant Cells. Archives of Pathology & Laboratory Medicine 137(11), 1545–1549 (2013)
-  Wang, D., Khosla, A., Gargeya, R., Irshad, H., Beck, A. H.: Deep Learning for Identifying Metastatic Breast Cancer. arXiv preprint, arXiv:1606.05718 (2016)