A well-established method of assessing tumor proliferation is the Mitotic Count (MC) [meuten2016], a quantification of mitotic figures in a selected field of interest. Identifying mitotic figures, however, is prone to a high level of intra- and inter-observer variability [aubreville2020]. Recent work has shown that deep learning-based algorithms can guide pathologists during MC assessment and lead to faster and more accurate results [aubreville2020]. These algorithmic solutions, however, are highly domain-dependent, and performance decreases significantly when applying them to data from unseen domains [lafarge2017]. In histopathology, domain shifts are often attributed to varying sample preparation or staining protocols used at different laboratories. These sources of domain shift have been approached with a wide range of strategies, e.g., stain normalization [macenko2009], stain augmentation [tellez2018], and domain adversarial training [lafarge2017]. Domain shifts, however, cannot only be attributed to staining variations but can also include variations induced by different slide scanners [aubreville2021]. The MItosis DOmain Generalization (MIDOG) challenge [midog], hosted as a satellite event of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2021, addresses this topic by assessing the MC on a multi-scanner dataset. This work presents the reference algorithm developed out-of-competition as a baseline for the MIDOG challenge. The RetinaNet-based architecture was trained in a domain adversarial fashion and scored an F1 score of 0.7514 on the preliminary test set.
Material and Methods
The reference algorithm was developed on the official training subset of the MIDOG dataset. We did not use any additional datasets and had no access to the preliminary test set during method development. The algorithm is based on a publicly available implementation of RetinaNet [marzahl2020], which we extended by a domain classification path to enable domain adversarial training.
The MIDOG training subset consists of 200 Whole Slide Images (WSIs) from human breast cancer tissue samples stained with routine Hematoxylin & Eosin (H&E) dye. The samples were digitized with four slide scanning systems: the Hamamatsu XR NanoZoomer 2.0, the Hamamatsu S360, the Aperio ScanScope CS2, and the Leica GT450, resulting in 50 WSIs per scanner. For the slides of three scanners, a selected field of interest (equivalent in size to approximately ten high power fields) was annotated for mitotic figures and hard negative look-alikes. These annotations were collected in a multi-expert blinded set-up. For the Leica GT450, no annotations were available. The preliminary test set consists of five WSIs each for four undisclosed slide scanning systems, of which only two were also part of the training set. This preliminary test set was used for evaluating the algorithms prior to submission and for publishing preliminary results on a leaderboard. The final test set consists of 20 additional WSIs from the same four scanners used for the preliminary test set. The evaluation through a Docker-based submission system ensured that the participants had no access to the (preliminary) test images during method development.
Domain Adversarial RetinaNet
For domain adversarial training, we extended the RetinaNet architecture by a domain classification path, connected to the feature extractor through a gradient reversal layer (GRL). For the domain classifier, we chose a sequence of three blocks consisting of a convolutional layer, batch normalization, ReLU activation, and dropout, followed by an adaptive average pooling and a fully connected layer. We experimented with varying the number and positions of the domain classifiers but ultimately decided on positioning a single discriminator at the bottleneck of the encoding branch. Figure 1 schematically visualizes the modified RetinaNet architecture.
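A minimal PyTorch sketch of such a domain classification path is given below. The class names, channel widths, dropout rate, and the default of four domains are illustrative assumptions, not identifiers or hyperparameters of the reference implementation.

```python
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients on backprop."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing into the feature extractor.
        return -ctx.lambda_ * grad_output, None


class DomainClassifier(nn.Module):
    """Three conv/BN/ReLU/dropout blocks, adaptive average pooling,
    and a fully connected layer predicting the scanner of a patch."""

    def __init__(self, in_channels: int, num_domains: int = 4, p_drop: float = 0.25):
        super().__init__()
        blocks, channels = [], in_channels
        for _ in range(3):
            blocks += [
                nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels // 2),
                nn.ReLU(inplace=True),
                nn.Dropout2d(p_drop),
            ]
            channels //= 2
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_domains)

    def forward(self, x: torch.Tensor, lambda_: float) -> torch.Tensor:
        x = GradientReversal.apply(x, lambda_)  # attach at the encoder bottleneck
        x = self.pool(self.features(x)).flatten(1)
        return self.fc(x)
```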
We split our training data into 40 training and ten validation WSIs per scanner and ensured a similar distribution of samples with a high and a low density of mitotic figures in each subset. For network training, we used a patch size of 512 × 512 pixels and a batch size of 12, with each batch containing three images from each scanner. To overcome class imbalance, we employed a custom patch sampling scheme, sketched below, in which half of the training patches were sampled randomly from the slides and the other half were sampled within a 512-pixel radius around a randomly chosen mitotic figure. Furthermore, we performed online data augmentation with random flipping, affine transformations, and random lighting and contrast changes.
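The sampling scheme can be sketched as follows, assuming mitotic figure annotations are available as pixel coordinates per slide. All names are illustrative, and the square neighborhood is a simplification of the 512-pixel radius described above.

```python
import random

PATCH = 512  # patch size in pixels


def sample_patch_origin(slide_w, slide_h, mitosis_centers, force_mitosis):
    """Return the top-left corner of a 512 x 512 training patch."""
    if force_mitosis and mitosis_centers:
        # Center the patch near a randomly chosen mitotic figure,
        # offset by at most 512 px per axis (square approximation).
        cx, cy = random.choice(mitosis_centers)
        cx += random.randint(-PATCH, PATCH)
        cy += random.randint(-PATCH, PATCH)
        x, y = cx - PATCH // 2, cy - PATCH // 2
    else:
        # Sample uniformly from the slide.
        x = random.randint(0, slide_w - PATCH)
        y = random.randint(0, slide_h - PATCH)
    # Clamp to slide boundaries.
    x = max(0, min(x, slide_w - PATCH))
    y = max(0, min(y, slide_h - PATCH))
    return x, y

# Usage: alternate force_mitosis=True/False so that half of all
# patches lie near an annotated mitotic figure.
```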
We trained the network with a cyclical maximal learning rate for 200 epochs until convergence. For loss computation, we calculated the standard RetinaNet loss as the sum of the bounding box regression loss and the instance classification loss, and added the domain classification loss. Both classification losses (instance and domain) were calculated using the Focal Loss [lin2017].
For patches of the Leica scanner, which were not annotated, only the domain classification loss was considered. During backpropagation, the GRL negates the gradient and multiplies it with λ, a weighting factor which was gradually increased from 0 to 1 during training. Model selection was guided by the highest performance on the validation set as well as the highest domain confusion, i.e., the highest domain classification loss, to ensure domain independence of the computed features.
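A compact sketch of how the loss terms and the GRL weighting factor λ could be combined is shown below. The linear ramp is only one possible schedule, since the text specifies the 0-to-1 increase but not its exact shape; all names are illustrative.

```python
import torch


def grl_lambda(step: int, total_steps: int) -> float:
    """GRL weighting factor, ramped from 0 to 1 during training.
    A linear ramp is assumed here; the exact schedule is unspecified."""
    return min(1.0, step / total_steps)


def combined_loss(det_loss_per_patch: torch.Tensor,
                  domain_loss: torch.Tensor,
                  annotated: torch.Tensor) -> torch.Tensor:
    """det_loss_per_patch: bbox regression + instance focal loss per patch.
    annotated: 1 for patches of annotated scanners, 0 for Leica GT450
    patches, which contribute only the domain classification loss."""
    return (det_loss_per_patch * annotated).mean() + domain_loss
```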
Evaluation and Results
The training procedure elaborated above was repeated three times, and the validation slides of the three annotated scanners were used for performance assessment. To compare results across different model operating points, we constructed precision-recall curves and compared the area under the precision-recall curve (AUCPR) averaged over the three scanners for which mitotic figure annotations were available. As our final model, we selected the one with the highest mean AUCPR on the validation set and chose the operating point according to the highest mean F1 score. Figure 2 shows the precision-recall curves of the final model, which reached a mean AUCPR of 0.7964 and an F1 score of 0.7533 at an operating point of 0.62. When integrating the selected model into a submission Docker container and evaluating it on the preliminary test set, we scored a mean F1 score of 0.7514, resulting from a precision of 0.6939 and a recall of 0.8193.
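Assuming detections have already been matched to ground-truth mitotic figures, the per-scanner AUCPR and the F1-optimal operating point could be computed as in the following scikit-learn sketch. This simplification treats evaluation as a score-ranking problem over matched detections; the actual challenge evaluation additionally accounts for missed ground-truth figures.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc


def pr_curve_and_auc(labels, scores):
    """labels: 1 for detections matched to a ground-truth mitotic figure,
    0 for unmatched detections; scores: model confidences."""
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    return precision, recall, thresholds, auc(recall, precision)


def f1_optimal_threshold(labels, scores):
    precision, recall, thresholds, _ = pr_curve_and_auc(labels, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-8, None)
    best = int(np.argmax(f1[:-1]))  # the last PR point has no threshold
    return thresholds[best], f1[best]

# Mean AUCPR over the three annotated scanners guides model selection;
# the operating point is then chosen to maximize the mean F1 score.
```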
Discussion and Conclusion
In this work, we presented our baseline algorithm for the MIDOG challenge, which is based on domain adversarial training. With an F1 score of 0.7514 on the preliminary test set, the algorithm is in line with previous mitotic figure algorithms trained and tested on breast cancer images from the same domain [bertram2020]. The similar F1 scores on the validation and preliminary test sets indicate a successful domain generalization of the proposed network. The code used for training the network will be made publicly available in our GitHub repository (https://github.com/DeepPathology) after the final submission deadline.