Gastric cancer remains an important cancer worldwide. According to Bray et al. (2018), it accounted for roughly 1 in every 12 cancer deaths globally in 2018, making it the fifth most frequently diagnosed cancer and the third leading cause of cancer death. Biopsy of the gastric mucosa is one of the most effective methods for early detection of gastric cancer Bray et al. (2018). It is estimated that hundreds of millions of gastric biopsy slides need to be examined in China each year, while the number of certified pathologists is far smaller, which imposes excessive workloads on these pathologists.
In recent years, deep neural network techniques have achieved remarkable performance on a wide range of computer vision tasks, such as image classification Krizhevsky et al. (2012); He et al. (2016); Szegedy et al. (2016), object detection He et al. (2017); Lin et al. (2017); Redmon and Farhadi (2018), and semantic segmentation Long et al. (2015); Chen et al. (2018). These techniques have been applied to automated pathology image analysis in the past few years. Unlike natural images, digital pathology images, known as whole-slide images (WSIs), are extremely large, often at gigapixel resolution. On the other hand, histological diagnosis requires high accuracy since it is commonly considered the gold standard. As a result, some studies focus on selected regions of interest (ROIs) Su et al. (2015); Chen et al. (2017); Li et al. (2019b, a), while there are several attempts at analyzing whole WSIs Coudray et al. (2018); Lin et al. (2018); Zhang et al. (2019); Barker et al. (2016).
Besides the difficulties in applying deep neural networks to gigapixel resolution images, the main challenge in examining WSIs is that the diagnostic results labeled by pathologists are usually at the slide level in most publicly available datasets, while the lesion regions that draw the pathologists' attention are extremely small compared with the size of the WSI. It is difficult to train a deep neural network to locate those regions and make the correct decision using only slide-level labels such as "positive/negative". Therefore, we collect a large dataset that carries not only slide-level annotation but also lesion region annotation, and design a framework that leverages this detailed supervision.
To the best of our knowledge, there have been no studies on automated pathology image analysis with lesion region annotation in clinical settings for gastric cancer. We propose an automated screening framework that could not only provide the screening results, i.e., positive/negative, but also show the suspicious areas to pathologists for further reference.
2 Related Works
2.1 Pathology Image Analysis for Gastric Cancer
Only a few existing works focus on analyzing pathology images from biopsies of the gastric mucosa for diagnosis.
Cosatto et al. Cosatto et al. (2013) designed a semi-supervised multiple instance learning framework that takes ROIs segmented from tissue units at two magnification levels as input and analyzes their color features.
Oikawa et al. Oikawa et al. (2017) proposed a computerized analysis system which first analyzes color and texture features on the entire H&E-stained section at low resolution to search suspicious areas for cancer (except signet ring cell which is detected by CNN-based method), and then analyzes contour and pixel features at high resolution on selected area and uses a trained SVM to confirm the initial suspicion.
Li et al. Li et al. (2018a) proposed GastricNet, which uses different structures for shallow and deep layers, and applied it to patches cropped from ROIs at a fixed magnification factor.
Yoshida et al. Yoshida et al. (2018) compared the classification results of human pathologists and of the e-Pathologist. The e-Pathologist analyzes high-magnification features that characterized the nuclear morphology and texture as well as low-magnification features that characterized the global H&E stain distribution within an ROI and the appearance of blood cells and gland formation. The classification was modeled as a multi-instance learning problem and solved by training a multi-layer neural network.
2.2 Whole-Slide Image Analysis
As WSIs serve as the gold standard for the diagnosis of various cancers, WSI analysis techniques have been well studied in recent years.
Liu et al. Liu et al. (2017) presented a CNN framework to aid breast cancer metastasis detection in lymph nodes. The model is based on the Inception architecture with careful image patch sampling and data augmentation. A random forest classifier with engineered features was used for the whole-slide classification procedure.
Hou et al. Hou et al. (2016) proposed to train a decision fusion model to aggregate patch-level predictions given by patch-level CNNs. In addition, the authors formulated an Expectation-Maximization-based method that robustly locates discriminative patches by utilizing the spatial relationships of patches.
Zhu et al. Zhu et al. (2017) proposed a survival analysis framework that first extracts hundreds of patches by adaptive sampling and then groups these images into different clusters. An aggregation model was trained to make patient-level predictions based on cluster-level results.
Mercan et al. Mercan et al. (2017) developed a framework to analyze breast histopathology images. Candidate ROIs were extracted from the logs of pathologists' image screenings based on different behaviors. Class labels were extracted from the pathology forms, and slides were modeled as bags of instances represented by the candidate ROIs.
Localizing lesion regions in WSIs via image segmentation techniques is also a crucial direction for helping pathologists make correct decisions efficiently.
Qaiser et al. Qaiser et al. (2019) proposed a tumor segmentation framework based on the concept of persistent homology profiles, which models the atypical characteristics of tumor nuclei to localize malignant tumor regions. Dong et al. Dong et al. (2018) proposed a reinforcement learning based framework, motivated by the zoom-in operation of a pathologist, which learns a policy network to decide whether zooming is required in a given ROI.
Our main contributions are as follows:
We collect a large-scale dataset for gastric cancer screening and develop a semi-automated annotation system111https://path-anno.sensetime.com/ to help obtain the detailed lesion region annotation.
We take advantage of the region annotation by proposing a multi-task network structure which could provide the classification label (screening result) as well as the segmentation mask (suspicious region) simultaneously.
We design a practical framework consisting of three networks to process the high-resolution WSIs, and employ the deformable convolution operation based on observed characteristics of pathology images.
3 Data Collection
3.1 Data Acquisition
All the annotated slides are automatically scanned using the digital pathology scanner Leica Aperio AT2 at high magnification and labeled by pathologists under the supervision of experts at Shanghai General Hospital.
The slide-level annotation is either "positive" (referring to malignant samples such as low-grade intraepithelial neoplasia, high-grade intraepithelial neoplasia, adenocarcinoma, signet ring cell carcinoma, and poorly cohesive carcinoma) or "negative" (referring to benign samples such as chronic atrophic gastritis, chronic non-atrophic gastritis, intestinal metaplasia, gastric polyps, gastric mucosal erosion, etc.). All the malignant regions are labeled along with their contours and converted to a mask (shown in Fig. 1).
3.2 Semi-automated Annotation System
The manual annotation procedures designed by the experts of Shanghai General Hospital are:
Coarse labeling. The pathologist finds the suspicious areas in the high-resolution gigapixel image and selects them with bounding boxes.
Fine labeling. Within the selected bounding box, the pathologist further confirms whether it is malignant, draws a closed curve along the contour of the malignant region, and marks the lesion type of the region.
Double checking. Finally, the experts go through all the annotated WSIs and verify the correctness of the labels.
Manually labeling a WSI usually takes hours, and the contour may not align perfectly with the tissue border due to the enormous image size and the highly unsmooth edges (shown in Fig. 3).
To address this issue, we introduce a semi-automated annotation system (shown in Fig. 2). An annotation refinement module with a series of image processing techniques is employed to eliminate the background areas within the manually labeled regions (shown in the right part of Fig. 2). The color deconvolution technique Van der Laak et al. (2000); Ruifrok et al. (2001) is used to extract Hematoxylin & Eosin-stained regions (image II). Based on the output of color deconvolution, we obtain the foreground by applying Otsu thresholding Otsu (1979) (image III). The final segmentation annotations are the intersection between the foreground and the pathologist-annotated areas (image IV). These procedures fix empty regions and loose boundaries without modifying the true malignant regions.
As shown in the left part of Fig. 2, we further propose to initialize a preliminary annotation with the output of a segmentation network trained on a few labeled WSIs. Then, pathologists only need to modify the masked regions discovered by the segmentation network, which could effectively speed up the labeling procedure.
Overall, our semi-automated annotation system generates more accurate region contours and reduces the labeling time from hours to minutes.
3.3 Large-scale Screening Dataset from Multiple Medical Centers
Moreover, in order to test our proposed framework in a real-world scenario, we collected WSIs from four hospitals in East China: Sijing Hospital of Shanghai Songjiang District (SHSSD), Songjiang District Center Hospital (SDCH), Zhejiang Provincial People's Hospital (ZPPH), and Shanghai General Hospital (SGH).
4 Proposed Framework

The main challenge in analyzing WSIs is that the images are extremely large, while only a small part of each image contains the lesion region. Hence, we propose a gastric cancer screening and localization framework (shown in Fig. 4), which consists of 1) a lite segmentation network for extracting the tissue region and eliminating the background area, 2) a multi-task network for generating classification and segmentation results for every cropped patch, and 3) a simple feedforward network for providing the slide-level screening result.
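The three-stage flow above can be sketched as a small orchestration function. This is a hypothetical illustration: the callables `foreground_net`, `multitask_net`, `screening_net`, and the `crop_patches` helper are placeholders standing in for the actual networks, and `k` (the number of suspicious patches forwarded to the screening network) is an assumed parameter.

```python
import numpy as np

def screen_slide(wsi_thumb, foreground_net, crop_patches, multitask_net,
                 screening_net, k=5):
    """Hypothetical sketch of the three-stage inference flow."""
    fg_mask = foreground_net(wsi_thumb)        # 1) tissue mask from the thumbnail
    feats, scores, heatmaps = [], [], []
    for patch in crop_patches(fg_mask):        # sliding windows over tissue only
        f, s, h = multitask_net(patch)         # 2) patch features, score, seg map
        feats.append(f)
        scores.append(s)
        heatmaps.append(h)
    top = np.argsort(scores)[::-1][:k]         # k most suspicious patches
    slide_score = screening_net(np.stack([feats[i] for i in top]))  # 3) slide level
    return slide_score, heatmaps
```

Any stand-in callables with matching shapes can exercise the flow, which is how the stages are decoupled in the framework.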
4.1 Network Structure
Multi-task Network. We employ deep layer aggregation (DLA) structure Yu et al. (2018) as the backbone since it is designed to aggregate layers to fuse semantic and spatial information for better recognition and localization. In our case, we utilize the dense prediction network for lesion region segmentation and change it to a multi-task structure by adding a classification branch on the output of the encoder. In the training phase, we use the patches cropped from ROIs with their annotated malignant region masks as the positive inputs and the patches randomly cropped from the benign region with the background (all-zero) masks as the negative inputs, and train the network in an end-to-end manner by combining the classification loss and the segmentation loss (Eq. 1). In the inference phase, patches from each sliding window are sent to the network to generate the classification score and the segmentation heatmap.
L_total = L_cls + λ · L_seg,    (1)

where L_cls denotes the loss of the classification branch, L_seg is the binary cross-entropy loss of the segmentation branch, and λ is a hyper-parameter balancing the weights of these two losses.
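A minimal numpy sketch of the combined loss in Eq. 1, assuming both branches output probabilities (post-sigmoid) and that both losses are binary cross entropy; `lam` plays the role of λ.

```python
import numpy as np

def binary_cross_entropy(prob, target, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and targets."""
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

def multi_task_loss(cls_prob, cls_label, seg_probs, seg_mask, lam=1.0):
    """L_total = L_cls + lambda * L_seg, combining both branches (Eq. 1)."""
    l_cls = binary_cross_entropy([cls_prob], [cls_label])
    l_seg = binary_cross_entropy(seg_probs, seg_mask)
    return l_cls + lam * l_seg
```

Accurate predictions on both branches drive the combined loss toward zero, while `lam` shifts the balance between the two tasks.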
Feedforward Screening Network. The input of this network is a matrix consisting of the mid-layer features (i.e., the output of the encoder) of the dense-prediction DLA structure, taken from the patches with the highest probability of being malignant. These features are stacked into a tensor and fed to a fully connected layer followed by a sigmoid function. The ground-truth label of this network is defined in a multi-instance learning (MIL) manner, i.e., whether the WSI contains at least one malignant region.
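The input selection and the MIL label can be sketched as follows. This is an illustrative sketch, not the authors' code: `k` (how many top-scoring patches are kept) is an assumed parameter.

```python
import numpy as np

def screening_input(patch_feats, patch_scores, k=5):
    """Stack encoder features of the k patches most likely to be malignant."""
    top = np.argsort(patch_scores)[::-1][:k]   # indices of highest scores first
    return patch_feats[top]

def mil_label(patch_labels):
    """MIL slide label: positive iff at least one patch is malignant."""
    return int(np.any(patch_labels))
```

This mirrors the MIL formulation: the screening network only ever sees the most suspicious patches, and the slide-level target is the disjunction of patch-level labels.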
Since lesion regions account for only a small fraction of the tissue, it is difficult for a network targeting recognition tasks such as Krizhevsky et al. (2012); He et al. (2016); Szegedy et al. (2016) to find where to pay attention. In this project, we take advantage of the detailed lesion region annotation. The multi-task network is designed to produce patch-level results. The segmentation task not only generates the lesion region mask but also helps the whole network locate the regions that truly require attention, which implicitly supports the classification branch in achieving higher performance. The feedforward screening network is proposed for generating the slide-level results; the basic idea behind this simple feedforward network follows the concept of MIL.
Because gastric WSIs usually carry many blank areas, we further design a foreground extraction network to extract all tissue regions and reduce the running time in the inference phase. It is a lightweight segmentation network composed of convolution layers and an upsampling layer. The inputs of this network are WSIs resized from the lowest magnification level, together with masks generated by color deconvolution Van der Laak et al. (2000); Ruifrok et al. (2001) and Otsu thresholding Otsu (1979) with manual modification. The output of this network is utilized in the inference stage for selecting the sliding-window area.
4.2 Deformable Convolution
Another issue with WSIs is that the lesion regions often come in irregular shapes and various sizes, while the standard convolution operation fails to handle large geometric transformations due to its fixed geometric structure. Therefore, instead of using the standard convolution operation, we employ deformable convolution layers Dai et al. (2017) in the decoder.
Deformable convolution is proposed based on the idea of augmenting the spatial sampling locations in the modules with additional offsets:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n),

where x denotes the input feature map, y denotes the output feature map, p_0 is a location on the output feature map, R is the regular sampling grid (e.g., {(−1, −1), (−1, 0), ..., (1, 1)} for a 3×3 kernel), w are the kernel weights, and Δp_n are the learned offsets. This type of convolution operation adds 2D offsets to the regular grid sampling locations of standard convolution.
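The sampling equation can be made concrete for a single output location. This is a didactic numpy sketch (single channel, 3×3 kernel, no framework ops), assuming fractional offsets are resolved by bilinear interpolation as in Dai et al. (2017).

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinear interpolation of 2D map x at fractional location (py, px)."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    out = 0.0
    for yy, xx in [(y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)]:
        if 0 <= yy < h and 0 <= xx < w:
            out += x[yy, xx] * (1 - abs(py - yy)) * (1 - abs(px - xx))
    return out

def deform_conv_point(x, w, offsets, p0):
    """y(p0) = sum_n w(p_n) * x(p0 + p_n + dp_n) for a 3x3 kernel.

    offsets is a (9, 2) array of (dy, dx) displacements, one per grid point.
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular grid R
    y = 0.0
    for n, (dy, dx) in enumerate(grid):
        oy, ox = offsets[n]
        y += w[dy + 1, dx + 1] * bilinear(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return y
```

With all offsets zero the operation reduces exactly to a standard 3×3 convolution; non-zero offsets shift each sampling point independently, which is what lets the receptive field adapt to irregular lesion shapes.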
As demonstrated in Fig. 5, we could easily benefit from the adaptive receptive field when handling the lesion regions with irregular shapes. The deformable convolution operation enables free form deformation of the sampling grid by adding additional offsets (the arrows in the right image of Fig. 5). The offsets are learned from the segmentation task simultaneously with convolutional kernels during training. Hence, the deformation is conditioned on the input features in a local, dense, and adaptive way.
As a consequence, we could capture the feature of the tissue regions with various shapes by utilizing deformable convolution operation, which leads to better performance on the segmentation task.
5.1 Implementation Details
We choose the basic DLA-34 with the dense prediction part as the backbone of the multi-task network, replace the first convolution layer with three convolution layers, and replace several standard convolutions in the decoder with deformable convolutions. The output of the classification branch of this network is a feature vector. We pick the patches with the highest malignancy probabilities to form the matrix used as the input of the feedforward screening network. The λ in Eq. 1 used for balancing the classification loss and segmentation loss is tuned as a hyper-parameter. Besides, in-place activated batch normalization Rota Bulò et al. (2018) is employed to reduce the memory requirement when training deep networks.
We use the Adam optimizer to train the aforementioned networks separately. The initial learning rate is reduced by a fixed factor when the validation loss stagnates for several epochs. The multi-task network is trained on Nvidia GTX 1080 Ti GPUs, and the feedforward screening network is trained on a single GPU.
5.2 Evaluation of Segmentation
Positive patches are cropped from the annotated ROIs, and negative patches are collected from WSIs diagnosed as benign. Data augmentation techniques such as horizontal/vertical flipping, rotation, random cropping and resizing, changing the aspect ratio and image contrast, and adding Gaussian noise are applied during training. We use a repeated random sub-sampling strategy for evaluation: the patches of a subset of WSIs are selected as the testing set by patient ID, and the whole procedure is repeated several times for stable results.
For the evaluation metric, we use the Dice coefficient (DSC) defined over image pixels.
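For reference, the pixel-level Dice coefficient can be computed directly from two binary masks; a small epsilon (an implementation convenience assumed here) keeps the ratio defined when both masks are empty.

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice similarity coefficient (DSC) over binary pixel masks:
    2 * |pred ∩ target| / (|pred| + |target|)."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```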
We compare our proposed method with commonly used segmentation networks, i.e., FCN Long et al. (2015), U-Net Ronneberger et al. (2015), and the standard DLA Yu et al. (2018) structure without any modification.
FCN and U-Net serve as baselines, giving a sense of the performance of commonly used methods on our dataset. A possible reason for FCN performing better than U-Net is that, without U-Net's skip connections, it benefits more from the classification branch. DLA performs better still because it benefits from fusing information by aggregating layers. Our proposed model achieves the highest DSC, as we adapt the standard DLA structure to our problem setting by replacing the standard convolution with the deformable convolution operator, adding a classification branch, etc.
To better demonstrate the effectiveness of our proposed method, we show some examples in Fig. 7. FCN and U-Net often generate unclear and unsmooth boundaries (1st and 2nd rows). As can be seen from the 3rd to the 5th columns, the three comparison methods produce many false positives compared to our method. In the 2nd row, U-Net even includes a red blood cell area in its result. In the 5th row, our model is the only one that successfully identifies the tissue at the bottom as negative.
5.3 Evaluation of Screening
Compared with the segmentation evaluation, which requires a testing set with lesion region annotation, evaluating the screening results of our proposed model can be done on a larger testing set with screening annotation only, containing both positive and negative samples.
In Table 1, we report the screening results of our proposed framework and comparison methods on multiple evaluation metrics, i.e., sensitivity, specificity, and Area Under the ROC Curve (AUC). The comparison methods are a single-task classification network with the same backbone as our proposed network structure (W/O seg) and multi-task network structures with different backbones (FCN and U-Net).
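The three metrics have direct definitions from the confusion matrix and the score ranking. A numpy sketch, with AUC computed via the Mann-Whitney U statistic (one standard formulation; an assumption that the evaluation used the usual definitions):

```python
import numpy as np

def sens_spec(pred, label):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    pred = np.asarray(pred, dtype=bool)
    label = np.asarray(label, dtype=bool)
    tp = np.sum(pred & label)
    fn = np.sum(~pred & label)
    tn = np.sum(~pred & ~label)
    fp = np.sum(pred & ~label)
    return tp / (tp + fn), tn / (tn + fp)

def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative (ties counted half), i.e., the Mann-Whitney U statistic."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```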
The primary goal of our proposed framework is gastric cancer screening; in this setting, the number of negative samples is much larger than the number of positive ones. Sensitivity is therefore the most critical evaluation metric when we take clinical background knowledge into consideration, since we would not like to miss any of the malignant cases. Specificity is the second priority because we do not want too many false alarms, either. AUC is a metric that reflects both sensitivity and specificity. Although the single-task classification network (W/O seg) achieves the highest specificity, it is not a suitable choice, as it generates too many false negatives. The proposed method shows the highest sensitivity, the second-best specificity, and the leading AUC, outperforming the other methods by a large margin.
5.4 Evaluation in Real-world Scenario
Also, we test our best model on our large-scale real-world set collected from four medical centers. Table 2 shows the numbers of images in the collected data.
All the training images with lesion region annotation are from SGH in the year 2018. Besides those training images, we further collected additional images from that year, more recent samples from 2019, and older samples from 2015, a period in which automated devices for fixation, sectioning, and staining were not yet widely used. Moreover, to test the generalization ability of our proposed model, we collected images from other hospitals, i.e., SHSSD, SDCH, and ZPPH. The devices and procedures for making histology slides differ among these hospitals, which may affect the final WSIs. Overall, we have WSIs from four hospitals across multiple years, and the positive ratio is low.
We apply our best model to these data and present the sensitivity and specificity in Table 3. In the real-world scenario, the data distribution differs from our training and validation datasets, i.e., there are fewer positive samples and more outliers such as out-of-focus samples. The differences in devices and slide preparation procedures also limit the performance of our model.
6 Conclusion and Future Work
In conclusion, under the assumption that detailed lesion region annotation is necessary for the network to locate the crucial parts of gigapixel-resolution WSIs and achieve better performance, we designed a semi-automated annotation system and collected a large dataset consisting of WSIs with lesion region masks and additional samples with screening results. To exploit our dataset, we propose a gastric slide screening framework to reduce the workload of pathologists and lower inter-observer variability. The whole framework consists of three networks: a multi-task network for patch-level classification and segmentation, a 3-layer feedforward network for the slide-level screening result, and a simple segmentation network for foreground extraction.
One possible future direction of this project is cancer subtype diagnosis. Instead of just providing the "positive/negative" screening result, an automated lesion type diagnosis system could present results like "adenocarcinoma", "signet ring cell carcinoma", or "chronic atrophic gastritis". Furthermore, since a single WSI may contain multiple lesion types, displaying the labels on each segmented region/ROI could be more helpful, which requires an instance-level segmentation structure.
- Automated classification of brain tumor type in whole-slide digital pathology images using local representative tiles. Medical image analysis 30, pp. 60–71. Cited by: §1.
- Global cancer statistics 2018: globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians 68 (6), pp. 394–424. Cited by: §1.
- DCAN: deep contour-aware networks for object instance segmentation from histology images. Medical image analysis 36, pp. 135–146. Cited by: §1.
- Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §1.
- Automated gastric cancer diagnosis on H&E-stained sections; training a classifier on a large scale with multiple instance machine learning. In Medical Imaging 2013: Digital Pathology, Vol. 8676, pp. 867605. Cited by: §2.1.
- Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nature medicine 24 (10), pp. 1559. Cited by: §1.
- Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: §4.2.
- Reinforced auto-zoom net: towards accurate and fast breast cancer segmentation in whole-slide images. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 317–325. Cited by: §2.2.
- Residual deconvolutional networks for brain electron microscopy image segmentation. IEEE transactions on medical imaging 36 (2), pp. 447–456. Cited by: §1.
- Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.1.
- Patch-based convolutional neural network for whole slide tissue image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2424–2433. Cited by: §2.2.
- Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1, §4.1.
- Accurate nuclear segmentation with center vector encoding. In Information Processing in Medical Imaging, A. C. S. Chung, J. C. Gee, P. A. Yushkevich, and S. Bao (Eds.), Cham, pp. 394–404. Cited by: §1.
- Signet ring cell detection with a semi-supervised learning framework. In Information Processing in Medical Imaging. Cited by: §1.
- Deep learning based gastric cancer identification. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 182–185. Cited by: §2.1.
- Large-scale retrieval for medical image analytics: a comprehensive review. Medical image analysis 43, pp. 66–84. Cited by: §1.
- Scannet: a fast and dense scanning framework for metastastic breast cancer detection from whole-slide image. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 539–546. Cited by: §1.
- Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §1.
- Detecting cancer metastases on gigapixel pathology images. arXiv preprint arXiv:1703.02442. Cited by: §2.2.
- Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1, §5.2.
- Multi-instance multi-label learning for multi-class classification of whole slide breast histopathology images. IEEE transactions on medical imaging 37 (1), pp. 316–325. Cited by: §2.2.
- Pathological diagnosis of gastric cancers with a novel computerized analysis system. Journal of pathology informatics 8. Cited by: §2.1.
- A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics 9 (1), pp. 62–66. Cited by: §3.2, §4.1.
- Fast and accurate tumor segmentation of histology images using persistent homology and deep convolutional features. Medical image analysis 55, pp. 1–14. Cited by: §2.2.
- Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1.
- U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §5.2.
- In-place activated batchnorm for memory-optimized training of dnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5639–5647. Cited by: §5.1.
- Quantification of histochemical staining by color deconvolution. Analytical and quantitative cytology and histology 23 (4), pp. 291–299. Cited by: §3.2, §4.1.
- Region segmentation in histopathological breast cancer images using deep convolutional neural network. In 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), pp. 55–58. Cited by: §1.
- Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §1, §4.1.
- Hue-saturation-density (hsd) model for stain recognition in digital images from transmitted light microscopy. Cytometry: The Journal of the International Society for Analytical Cytology 39 (4), pp. 275–284. Cited by: §3.2, §4.1.
- Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE transactions on medical imaging 35 (1), pp. 119–130. Cited by: §1.
- Automated histological classification of whole-slide images of gastric biopsy specimens. Gastric Cancer 21 (2), pp. 249–257. Cited by: §2.1.
- Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2403–2412. Cited by: §4.1, §5.2.
- Towards large-scale histopathological image analysis: hashing-based image retrieval. IEEE Transactions on Medical Imaging 34 (2), pp. 496–506. Cited by: §1.
- Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nature Machine Intelligence 1 (5), pp. 236. Cited by: §1.
- Wsisa: making survival prediction from whole slide histopathological images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7234–7242. Cited by: §2.2.