To What Extent Does Downsampling, Compression, and Data Scarcity Impact Renal Image Analysis?

by   Can Peng, et al.

The condition of the glomeruli, or filter sacks, in renal Direct Immunofluorescence (DIF) specimens is a critical indicator for diagnosing kidney diseases. A digital pathology system that digitizes a glass histology slide into a Whole Slide Image (WSI) and then automatically detects and zooms in on the glomeruli with a higher magnification objective will be extremely helpful for pathologists. In this paper, using glomerulus detection as the study case, we provide analysis and observations on several important issues to help with the development of Computer Aided Diagnostic (CAD) systems that process WSIs. Large image resolution, large file size, and data scarcity are always challenging to deal with. To this end, we first examine image downsampling rates in terms of their effect on detection accuracy. Second, we examine the impact of image compression. Third, we examine the relationship between the size of the training set and detection accuracy. To understand the above issues, experiments are performed on four state-of-the-art detectors: Faster R-CNN, R-FCN, Mask R-CNN and SSD. Critical findings are observed: (1) the best balance between detection accuracy, detection speed and file size is achieved at 8 times downsampling of images captured with a 40× objective; (2) compression, which reduces the file size dramatically, does not necessarily have an adverse effect on overall accuracy; (3) reducing the amount of training data to some extent causes a drop in precision but has a negligible impact on recall; (4) in most cases, Faster R-CNN achieves the best accuracy in the glomerulus detection task. We show that the image file size of 40× WSI images can be reduced by a factor of over 6000 with negligible loss of glomerulus detection accuracy.




I Introduction

The introduction of the kidney biopsy is one of the major events in the history of nephrology [1]. The kidney biopsy helps diagnose diseases such as glomerulonephritis and glomerulosclerosis [1]. Direct Immunofluorescence (DIF) is usually used as the gold standard for immunohistochemical evaluation of renal biopsy specimens [20]. Traditionally, specimen diagnosis is performed manually by pathologists with microscopes. This process is subjective, time-inefficient and labour-intensive [31]. In addition, due to fluorescence bleaching, DIF slides can only be stored and viewed for a limited period. To solve these problems, many works such as [25] have developed systems that can digitize these slides for permanent recording and consequently allow computer-aided analysis.

Fig. 1: Diagram of the renal corpuscle structure of the glomerulus (this image by M. Komorniczak is licensed under CC BY-SA 4.0)

The challenges of DIF WSI renal analysis include large image resolution, large file size, and data scarcity. The DIF WSIs are extremely large, with file sizes of up to tens of gigabytes and resolutions of up to gigapixels. It is impossible to directly feed such large images to Computer Aided Diagnostic (CAD) systems based on Convolutional Neural Networks (CNNs). Moreover, hospitals would like to store patients’ diagnostic results for future review. However, storing such large DIF files is very expensive and hard to manage. Thus, pre-processing is a critical step in automating the WSI analysis task, and it is vital to explore to what extent the pre-processing methods affect the analysis. Furthermore, compared to many general images, renal biopsy images are significantly more costly and invasive to obtain as they involve surgery on patients. Lack of training data will affect the CAD system’s performance.

Fig. 2: The main procedure for glomerulus detection on the renal WSI — The raw renal DIF WSI is first pre-processed by downsampling and compressing to reduce both image resolution and file size. Then, the pre-processed image is cropped into patches of suitable size through a sliding-window method. After that, the patches are fed into the CNN detector to get detection result. Finally, the patch-level results are combined to generate the final WSI detection result.
Fig. 3: An example of glomerulus detection results for a renal DIF WSI. The red boxes are the ground truth bounding boxes and the green boxes are the predicted bounding boxes.

One way to address the above issues is to first detect the areas/objects of interest that pathologists use to make the correct diagnoses. This can also reduce the time spent viewing the WSIs, as the system can bring up the locations of the detected areas/objects of interest. Note that, in DIF WSI analysis, glomeruli are the primary objects of interest. Thus, we confine ourselves to studying the effect of the pre-processing steps and the amount of training data on the glomerulus detection problem. Figure 1 shows the renal corpuscle structure of the glomerulus. For our experiments, the image data were captured using a microscope camera with 40× magnification. Figure 2 illustrates the whole detection procedure.

The findings of the study are as follows:

  • With a fixed patch size (1024 × 1024 pixels), downsampling affects the detection accuracy. If the image is not downsampled, the image patch will only contain a whole or partial glomerulus. Training a model under these conditions increases the false positive rate as background information is insufficient. On the other hand, significantly reducing the resolution of the image increases the false negative rate as the glomeruli become small and ambiguous.

  • Applying JPEG compression to the raw renal DIF WSI does not affect detection performance significantly. However, the optimal compression for each detection model can be different. This finding may help in addressing the data storage issues.

  • Reducing the amount of training data to some extent causes a drop in precision but has a negligible impact on recall. The rate at which mean Average Precision (mAP) drops is significantly smaller than the rate of data reduction. For instance, reducing the training data by 20% only reduces Faster R-CNN detection performance from 0.781 mAP to 0.734 mAP at an Intersection over Union (IoU) threshold of 0.5. Note that this finding does not apply to SSD [18], which is also the worst performing method in the study.

  • Of the four object detectors, Faster R-CNN [24], Mask R-CNN [10], R-FCN [4] and SSD [18], Faster R-CNN has the best performance of 0.781 mAP at an IoU of 0.5.

We continue our paper as follows. In Section II, we discuss the related work. Section III presents the experimental setup and protocols. Section IV presents and discusses the results followed by Section V presenting the conclusions.

II Related Work

We first discuss common pre-processing methods in renal WSI analysis followed by discussions on recent state-of-the-art object detection methods.

II-A Renal WSI Pre-processing and Analysis

Although pre-processing and the amount of training data strongly affect detection accuracy in automatic glomerular analysis, most of the literature on glomerulus classification and detection says little about these problems. For example, Kawazoe et al. [14] proposed a method of using Faster R-CNN to detect glomeruli on periodic acid-Schiff (PAS), periodic acid-methenamine silver (PAM), Masson trichrome (MT) and Azan stained renal WSIs. In their methods, images are taken by objective lens magnification and detection is conducted with downsampling equivalent to objective lens magnification. Then the down-sampled images are sliding-window cropped by pixels. For training data, images are cropped by the window centred on each annotated glomerulus and incomplete glomerular bounding boxes within the sliding window are ignored. Zhao et al. [31] published a renal DIF dataset and explored several CNN methods to detect the glomeruli in renal DIF. For all the experiments, the authors used 12 times downsampling on the WSIs and cropped the resized images into patches for detection. Simon et al. [27] used a local binary patterns (LBPs) image feature vector to train a support vector machine (SVM) model to classify glomeruli on light microscopy (LM) renal images. Their training images were extracted manually at a size of 576 × 576 pixels from the original WSIs. For test data, each WSI was fed into the SVM classifier by the sliding-window approach with a stride of 64 pixels. The window size was the same as the size of the training images (576 × 576 pixels). Gallego et al. [6] proposed an AlexNet based CNN classifier to classify glomeruli for the PAS stained LM renal data. They also manually extracted glomerulus and non-glomerulus patches from the WSIs and resized all images to pixels. In contrast to these works on renal WSI, here we perform extensive experiments on how various pre-processing methods and the amount of training data affect glomerulus detection performance on several state-of-the-art CNN models.

II-B CNN Based Object Detectors

Since the CNN was first proposed for the object detection task as Region-based CNN (R-CNN) by Girshick et al. [9], various CNN-based models have produced impressive object detection performance. Several works have applied CNN methods to the computational pathology domain, such as cancer detection [5, 12, 28], organ segmentation on CT and MRI images [2, 21] and cell classification [3, 7]. Modern CNN based object detectors can be roughly divided into two categories: the two-stage detectors, such as Faster R-CNN [24], Mask R-CNN [10], and R-FCN [4], and the one-stage detectors, such as YOLO [22], YOLOv2 [23], SSD [18] and RetinaNet [16]. In the two-stage detector approach, the input image is first fed to a Region Proposal Network (RPN) to generate a sparse set of candidate boxes. These Regions of Interest (RoIs) are then further classified and regressed by RoI-wise branches to generate the final prediction results [4]. Unlike the two-stage detectors, one-stage detectors do not generate region proposals. This significantly reduces running time at the price of a reduction in accuracy. We employed four object detectors for our experiments: Faster R-CNN, R-FCN, Mask R-CNN and SSD.

Faster R-CNN [24] is an improved version of R-CNN [9] and Fast R-CNN [8]. Inspired by image classification, R-CNN directly applies a CNN based image classifier on a set of generated region proposals [13]. Although R-CNN improves the state-of-the-art detection accuracy, the proposal features are calculated multiple times, which leads to a long run time [13]. Fast R-CNN alleviates this problem by making all region proposals share features that are extracted only once [13]. However, both R-CNN and Fast R-CNN depend on external proposal generators, which then become the new bottleneck as everything apart from the region proposal generator runs on the GPU. Faster R-CNN solves this problem by using a neural network called RPN to generate the candidate anchors, and it can then be trained end-to-end.

R-FCN [4] is proposed based on Faster R-CNN. R-FCN modifies the backbone network used for feature extraction in Faster R-CNN. This modification is crucial as a typical backbone network is trained for image classification problems, which require the network to be translation invariant. This property, however, is the opposite of what the object detection problem requires, namely translation variance. To this end, R-FCN uses a position-sensitive convolution layer that is specifically trained to remove the translation invariant property.

Mask R-CNN [10] is mainly targeted at the instance segmentation problem. However, it includes several improvements that allow Mask R-CNN to outperform Faster R-CNN. For instance, to avoid the coarse spatial quantization that RoI Pool introduces during feature extraction, RoI Align is used instead. In addition, unlike Faster R-CNN, Mask R-CNN uses a Feature Pyramid Network (FPN) [15] with ResNet as its backbone. FPN builds an in-network feature pyramid from a single-scale input by a top-down architecture with lateral connections [15].

SSD [18] is a single-stage object detector. SSD is similar to RPN, since both provide the detection results in one step. The difference is that whilst RPN provides object/non-object classification, SSD provides class-level classification. In other words, it directly classifies default anchor boxes’ classes and regresses their real bounding boxes. SSD combines predictions from multiple feature maps with different resolutions to handle various object sizes.

III Experimental Setup and Protocols

III-A Data Collection

Tissue samples from biopsies are digitized into TIF format using an M12 microscopy camera with the Sony IMX253 CMOS global shutter sensor at 40× objective lens magnification. Figure 4 shows our scanning system that digitized the renal glass slides. The system uses a two-stage scanning method that first creates a general view of the specimen using the low magnification objective, and then this overview image is used as a location guide to scan the actual specimen using the high magnification objective (40×).

Fig. 4: The scanning system that digitized the renal glass slides.

The renal DIF dataset used in all experiments includes 230 WSIs collected from 30 patients. These WSIs have an average file size of 20 gigabytes and an average image size of pixels. The antibodies for staining were: IgG, IgA, IgM, C3, Fib, C1q, Kappa and Lambda. Each WSI was manually labelled by expert pathologists. Figure 5 shows some glomerular and non-glomerular examples from our dataset. Compared to generic object detection tasks, glomerulus detection on renal DIF images is more challenging. Renal DIF WSIs are enormous (90,000 × 72,000 pixels), but the glomeruli are relatively small, ranging from 4000 × 4000 pixels to 7000 × 7000 pixels. The glomerular staining intensity is highly variable as it relates to the different positivity grades of patients. In addition, due to different diseases, the patterns of glomeruli on renal DIF images can vary with conditions such as granular glomerular staining, linear glomerular deposit, and scanty glomerular immunostaining [31].

(a) Glomerular examples
(b) Non-glomerular examples
Fig. 5: Glomerulus and non-glomerulus examples.

The raw renal DIF images generated from the camera were downsized and converted into JPEG images before being sent to the CNN detectors. Note that the downsized images were still very large. For example, after 12 times downsizing, the image was still pixels and could not be directly fed into the detection models. Thus, after resizing, the shrunk images were divided into overlapping patches of 1024 × 1024 pixels. The sliding-window method with a stride of 256 pixels was used to perform the cropping. After the patch-level detection results were obtained, they were mapped back to the WSIs to generate the final prediction result. Figure 3 shows an example of the WSI detection result.
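The cropping and mapping steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `(x_min, y_min, x_max, y_max)` box layout, and the rule of adding one extra window flush with the image border are all assumptions made for the example.

```python
from typing import Iterator, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def window_origins(length: int, patch: int = 1024, stride: int = 256) -> List[int]:
    """Top-left offsets along one axis so that the windows cover [0, length)."""
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    # Assumption: add a final window flush with the border if the stride left a gap.
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins


def sliding_windows(width: int, height: int,
                    patch: int = 1024, stride: int = 256) -> Iterator[Tuple[int, int]]:
    """Yield the (x, y) origin of every patch in the downsampled image."""
    for y in window_origins(height, patch, stride):
        for x in window_origins(width, patch, stride):
            yield x, y


def patch_box_to_wsi(box: Box, origin: Tuple[int, int], downsample: int) -> Box:
    """Map a box from patch coordinates back to full-resolution WSI coordinates."""
    ox, oy = origin
    x0, y0, x1, y1 = box
    return ((x0 + ox) * downsample, (y0 + oy) * downsample,
            (x1 + ox) * downsample, (y1 + oy) * downsample)
```

Because neighbouring patches overlap, the same glomerulus can be detected in several patches. How such duplicates are merged is not specified here, so the sketch stops at the coordinate mapping.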

III-B Detector Setup

The cropped patches were randomly divided into a training set, a validation set, and a test set with a split ratio of 70%, 10% and 20%, respectively. Note that all data from a given patient were placed in exactly one of these sets. This patient-level split is required because a trained system should not have seen any data from a new patient at test time.
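A patient-level split of this kind can be sketched as below. The function name, the `patch_id -> patient_id` mapping, and the choice to apply the 70/10/20 ratios to the list of patients (rather than to the patch counts) are assumptions for illustration only.

```python
import random
from typing import Dict, List, Tuple


def split_by_patient(patch_patients: Dict[str, str],
                     ratios: Tuple[float, float, float] = (0.7, 0.1, 0.2),
                     seed: int = 0) -> Dict[str, List[str]]:
    """Assign whole patients to train/val/test so no patient spans two sets.

    patch_patients maps a (hypothetical) patch id to its patient id.
    """
    patients = sorted(set(patch_patients.values()))
    random.Random(seed).shuffle(patients)

    n_train = round(ratios[0] * len(patients))
    n_val = round(ratios[1] * len(patients))
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }

    split: Dict[str, List[str]] = {name: [] for name in groups}
    for patch_id, patient in patch_patients.items():
        for name, members in groups.items():
            if patient in members:
                split[name].append(patch_id)
                break
    return split
```

Splitting at the patient level means the resulting patch counts only approximate the target ratios, since patients contribute different numbers of patches.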

In our experiments, Faster R-CNN, Mask R-CNN and R-FCN used ResNet101 [11] as the backbone network. SSD used MobileNet v2 [26] as the backbone. All models were implemented using the TensorFlow framework. These models were pre-trained on the COCO dataset and then fine-tuned on our renal training set for 120,000 steps. Each model was trained and tested with the same pre-processing parameters. Faster R-CNN, Mask R-CNN and R-FCN used a batch size of 1 and Stochastic Gradient Descent (SGD) with a momentum value of 0.9 as the optimizer. SSD used a batch size of 32 and RMSprop [29] with a momentum value of 0.9 as the optimizer. The input image size for Faster R-CNN, Mask R-CNN and R-FCN was 1024 × 1024 pixels. The patches were further resized to pixels for SSD in order to align with SSD pre-trained models. The learning rate was set at with the weight decay regularization of . All experiments were performed on an NVIDIA Tesla V100 GPU.

III-C Evaluation Metric

For assessing experimental results, we use mAP as our main evaluation metric. The mAP calculates the area under the precision/recall curve. In addition, we further examine whether pre-processing and the amount of training data mainly affect precision or recall. Equation (1) defines precision (P) and Equation (2) defines recall (R), in terms of True Positive (TP), False Positive (FP) and False Negative (FN).

$$P = \frac{TP}{TP + FP} \tag{1}$$

$$R = \frac{TP}{TP + FN} \tag{2}$$
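As a small illustration of these metrics, the sketch below computes precision and recall from TP, FP and FN counts, and approximates the area under the precision/recall curve from a ranked list of detections. The function names are hypothetical, and the un-interpolated rectangle rule used for the area is only one simple variant; common evaluation toolkits use interpolated precision and may give slightly different values.

```python
from typing import List, Tuple


def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); returns 0.0 when there are no detections."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN); returns 0.0 when there is no ground truth."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(scored_hits: List[Tuple[float, bool]],
                      num_ground_truth: int) -> float:
    """Area under the precision/recall curve (simple rectangle rule).

    scored_hits holds (confidence, is_true_positive) for every detection.
    """
    ranked = sorted(scored_hits, key=lambda hit: hit[0], reverse=True)
    tp = fp = 0
    area = 0.0
    previous_recall = 0.0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        current_recall = tp / num_ground_truth
        area += (current_recall - previous_recall) * precision(tp, fp)
        previous_recall = current_recall
    return area
```

With a single object class (glomerulus), the mean over classes reduces to the AP of that one class.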


Glomerulus detection is a challenging task as glomeruli vary in size, shape, and pattern. Sometimes glomeruli do not even have visible boundaries because their structures are damaged by disease or by the specimen preparation procedure. Therefore, even though the annotation work is performed by pathologists, it is difficult to be 100% sure where the accurate boundaries are located in DIF renal images. An example is shown in Figure 6. The red boxes are the ground truth boxes and the green boxes are the predicted boxes. It is difficult to determine whether the ground truth boxes are more accurate than the predicted bounding boxes. Furthermore, our primary concern is to localize the glomeruli to help the pathologists, rather than to produce accurate boundaries. Thus, we focus on the mAP result at an Intersection over Union (IoU) threshold of 0.5.
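To make the IoU criterion concrete, the following sketch computes the IoU of two boxes and counts TP, FP and FN at a threshold of 0.5. The greedy one-to-one matching and the assumption that predictions arrive sorted by descending confidence are illustrative choices, not a description of the evaluation code used in the experiments.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions: Sequence[Box], ground_truth: Sequence[Box],
                  threshold: float = 0.5) -> Tuple[int, int, int]:
    """Return (TP, FP, FN); each ground truth box can be matched at most once.

    Predictions are assumed to be sorted by descending confidence.
    """
    unmatched: List[Box] = list(ground_truth)
    tp = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)
            tp += 1
    return tp, len(predictions) - tp, len(unmatched)
```

A threshold of 0.5 tolerates moderately loose boxes, which suits a setting where the ground truth boundaries themselves are uncertain.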

Fig. 6: An example of the 8 times downsampled patch-level glomerulus detection result. The red boxes and green boxes are the ground truth and predicted bounding boxes, respectively. The two glomeruli do not have clear boundaries. It is hard to say that ground truth boxes are more accurate than predicted bounding boxes.

IV Experimental Results and Analysis

In order to explore the key settings of the training images that most affect the glomerulus detection result, we performed extensive experiments on four factors: downsampling, compression, training data amount, and choice of detection model.

IV-A Experiments on Downsampling

Downsampling and compression are two commonly used pre-processing methods to handle large images. The main difference between them is that downsizing reduces both the input image’s file size and image size (resolution), whereas compression only reduces the input image’s file size at the expense of some compression artefacts. As mentioned in Section III-A, the raw WSIs are extremely large and the resizing operation is always required. Thus, we want to evaluate to what extent the original uncompressed (raw) TIF images can be resized whilst maintaining the detection accuracy. To study this, we downsize the original TIF images by factors of 4, 8, 12 and 16. The downsized images are then saved as JPEG files with no compression. Figure 7 shows the average file size and image size per image at different downsampling rates. For example, when a TIF image is downsized by 4 and then saved as JPEG, the file size decreases by a factor of 445, whereas the image resolution is reduced only by a factor of 4 relative to the original TIF file.

If the downsampling rate is less than 4, the size of many glomeruli will be greater than the patch size (1024 × 1024 pixels). This will cause many cropped patches to contain only a sub-region of a glomerulus and impede model learning. Thus, we use a downsampling rate of 4 as the minimum rate. Figures 8, 9 and 10 show the mAP, precision and recall performance, respectively, of CNN detectors trained and tested with images at different downsizing rates. Observing the experimental results, we find that, apart from the step to 8 times downsampling, which actually increases accuracy, increasing the downsampling rate results in a drop in detection accuracy. At 8 times downsampling, all four detectors attain their best performance, especially Faster R-CNN, which achieves an mAP of 0.781.

At 4 times downsampling, all detectors perform poorly. This is because the glomerulus occupies most of the patch and little background is seen in the training patches. The models are not provided with enough background information to learn the difference between glomeruli and background noise. Therefore, they suffer from false positives, leading to low precision as shown in Figure 9. With increasing downsampling rate, all models’ precision increases, since more background information is provided within the training set to avoid false positives. In contrast to the two-stage detectors, SSD has much lower detection accuracy. SSD’s low performance is due to its high false negative rate (low recall) as shown in Figure 10.
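The relationship between downsampling rate and patch coverage can be checked with simple arithmetic, using the glomerulus sizes reported in Section III-A (4000 to 7000 pixels per side at full resolution) and the 1024-pixel patch. The helper below is a sketch with hypothetical names.

```python
from typing import Tuple

PATCH_SIDE = 1024                 # patch side in pixels
GLOM_MIN, GLOM_MAX = 4000, 7000   # glomerulus side at full resolution, in pixels


def glomerulus_extent(rate: int) -> Tuple[float, float]:
    """Smallest and largest glomerulus side (pixels) after downsampling by `rate`."""
    return GLOM_MIN / rate, GLOM_MAX / rate


for rate in (4, 8, 12, 16):
    lo, hi = glomerulus_extent(rate)
    print(f"{rate:>2}x: {lo:.0f}-{hi:.0f} px "
          f"({lo / PATCH_SIDE:.0%}-{hi / PATCH_SIDE:.0%} of the patch side)")
```

At 4 times downsampling a glomerulus spans roughly 1000 to 1750 pixels, so it fills or exceeds the patch, whereas at 8 times it spans 500 to 875 pixels and leaves room for background.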

Due to the two-step cascaded classification and regression mechanism [30], two-stage detectors are more robust for glomerulus detection than one-stage detectors. To sum up, for glomerulus detection, when the detector’s input image size is set to 1024 × 1024 pixels, a downsampling rate of 8 times is optimal in terms of detection accuracy and small file size. Training a model using images with a very low downsampling rate (4 times) increases the false positive rate as not enough background is visible. On the other hand, significantly reducing the image size increases the false negative rate as the glomerulus becomes small and ambiguous. These observations are corroborated by the findings in [19] for the general object detection task, which state that re-scaling the image to a lower resolution may produce better accuracy.

Fig. 7: Average file size and image size per image against downsampling rate with no compression. Downsampling changes both input image’s file size and image resolution.
Fig. 8: The mAP performance of four detectors trained and tested on images against downsampling rate.
Fig. 9: The precision performance of four detectors trained and tested against downsampling rate.
Fig. 10: The recall performance of four detectors trained and tested against downsampling rate.

IV-B Experiments on Compression

Compression is also a good method to further reduce file size. All the images for the compression experiments are first downsampled to the optimal rate of 8 times. Then the resized images are compressed with JPEG compression rates of 0%, 20%, 40%, 60% and 80%. Figure 11 shows the average file size and image size at different JPEG compression rates after the 8 times downsampling. Figures 12, 13 and 14 show the mAP, precision, and recall performance of the four detectors trained and tested on the downsized images with different compression rates.

From the results, we find that for two-stage detectors, with a few exceptions, performance trends downwards as the compression rate increases. In contrast to the two-stage detectors, which have a significant accuracy drop at 80% compression, SSD shows an unexpected increase in accuracy. Due to its single-shot detection, SSD suffers from a strong false negative problem as shown in Figure 14. We conjecture that with increasing compression rate, background noise becomes more random and dissimilar to the appearance of the glomeruli, which may allow SSD to make better detections. For the other small anomalies in the graph for the two-stage detectors, we conjecture similarly that JPEG compression artefacts may sometimes help differentiate the glomeruli from the background.

When compressing the downsized renal image at 40%, file size drops from about 12.7 Megabytes (MB) to about 3.3 MB which is roughly 4 times smaller. In addition, according to the results in Figure 12, there is less than a 0.01 mAP accuracy decrease for both Faster R-CNN and Mask R-CNN. For SSD, its accuracy even increases by 0.035 mAP. Therefore, although pathology guidelines require raw images to be stored, our experimental results suggest that a suitable JPEG compression rate may help reduce file size with negligible cost to detection accuracy — at least for machine analysis, if not a human pathologist.
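The storage figures can be sanity-checked with a few lines of arithmetic. The sketch below uses the rounded averages quoted in the text (20 gigabytes per raw WSI, 12.7 MB and 3.3 MB per downsized image) and assumes 1 GB = 1024 MB, so the results only approximate the reduction factors of 1616 and 6157 reported in the conclusion.

```python
RAW_WSI_MB = 20.0 * 1024      # average raw WSI size; assumes 1 GB = 1024 MB
DOWNSIZED_MB = 12.7           # 8x downsampled, 0% JPEG compression
COMPRESSED_MB = 3.3           # 8x downsampled, 40% JPEG compression


def reduction_factor(before_mb: float, after_mb: float) -> float:
    """How many times smaller the file becomes."""
    return before_mb / after_mb


jpeg_gain = reduction_factor(DOWNSIZED_MB, COMPRESSED_MB)
downsizing_vs_raw = reduction_factor(RAW_WSI_MB, DOWNSIZED_MB)
both_vs_raw = reduction_factor(RAW_WSI_MB, COMPRESSED_MB)

print(f"40% compression alone: {jpeg_gain:.2f}x smaller")
print(f"8x downsizing vs raw: about {downsizing_vs_raw:.0f}x smaller")
print(f"downsizing + compression vs raw: about {both_vs_raw:.0f}x smaller")
```

The small differences from the reported factors are expected, since the inputs here are rounded averages rather than per-image measurements.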

Fig. 11: Average file size and image size per image at 8 times downsampling against compression rate. Compression only changes input image’s file size and does not change its image resolution.
Fig. 12: The mAP performance of four detectors trained and tested with images under 8 times downsizing against compression rate.
Fig. 13: The precision performance of four detectors trained and tested with images under 8 times downsizing against compression rate.
Fig. 14: The recall performance of four detectors trained and tested with images under 8 times downsizing against compression rate.

IV-C Experiments on Training Data Size

To explore how much data is required to train a model to accurately detect glomeruli on renal DIF images, we performed experiments with different sizes of training data. All training images are first downsampled 8 times and then saved as JPEG files with no compression. Figures 15, 16 and 17 show the mAP, precision and recall performance, respectively, of the four detectors trained with different amounts of training data. The full training set consists of 158 WSIs (1204 glomerular patches) from 20 patients. We hold back 64 WSIs from another 9 patients as the test set, and 8 WSIs from 1 patient are used as the validation set. We progressively reduce the training set, while the validation and test sets are unchanged. This step is performed three times and the average performance is reported.

Observing the experimental results in Figure 15, we find that the mAP performance drops as the training set size is reduced. However, the negative trend is not linear and the mAP drop is much smaller than the reduction in the data size. For instance, when the training set is reduced from 80% to 60%, Faster R-CNN only suffers an accuracy drop of about 0.01 (from 0.734 to 0.722). According to Figure 16 and Figure 17, the accuracy drop is mainly due to false positives rather than false negatives. As the amount of training data is reduced, in most cases, the two-stage detectors’ precision drops. SSD shows a spike at 60% training data. We conjecture that SSD’s unstable performance is due to its single-shot design, which makes it more easily affected by background noise. In summary, lack of training data may cause a decrease in detection accuracy, but the effect is not directly proportional. The performance reduction is mainly due to false positives. Less training data to some extent leads to worse precision but has little effect on recall.
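The claim that accuracy falls more slowly than the data size can be illustrated with the Faster R-CNN figures quoted in the text (0.781 mAP with the full training set, 0.734 at 80% and 0.722 at 60%). The helper name below is hypothetical.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Full set -> 80% of the data: 20% less data.
map_drop_100_to_80 = relative_drop(0.781, 0.734)
# 80% -> 60% of the data: 25% less data relative to the 80% set.
data_drop_80_to_60 = relative_drop(0.80, 0.60)
map_drop_80_to_60 = relative_drop(0.734, 0.722)

print(f"100% -> 80% data: mAP falls by {map_drop_100_to_80:.1%}")
print(f"80% -> 60% data: data falls by {data_drop_80_to_60:.0%}, "
      f"mAP falls by {map_drop_80_to_60:.1%}")
```

In both steps the relative mAP loss is several times smaller than the relative data loss.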

Fig. 15: The mAP performance of four detectors trained and tested on 8 times downsized images with no JPEG compression against different training set sizes.
Fig. 16: The precision performance of four detectors trained and tested on 8 times downsized images with no JPEG compression against different training set sizes.
Fig. 17: The recall performance of four detectors trained and tested on 8 times downsized images with no JPEG compression against different training set sizes.

IV-D Further Discussions on Different CNN Detectors

By observing the performance of the four selected detectors in Figures 8, 12 and 15, we find that at an IoU threshold of 0.5, Faster R-CNN always achieves the best performance, followed by R-FCN, Mask R-CNN, and SSD. One reason why Mask R-CNN has lower performance than Faster R-CNN and R-FCN may be that the ground truth masks fed into Mask R-CNN contain background noise. As mentioned in Section III-C, for renal DIF images, glomeruli vary greatly in size, shape and pattern, and some of them do not even have visible boundaries. Thus, during manual labelling, we are only able to draw bounding boxes around the glomeruli. For all Mask R-CNN experiments, we use the ground truth bounding boxes as ground truth masks, which hampers Mask R-CNN performance and leads to a high number of false positives (low precision). Therefore, although Mask R-CNN performs well on generic object detection tasks, such as the COCO dataset [17], in our scenario Faster R-CNN is the better choice.

The detection speed of the different detectors is also analysed in Table I. Due to its single-shot approach, SSD suffers from false negatives and has much lower detection accuracy than the two-stage detectors, but it has the fastest detection speed. Among the three two-stage detectors, R-FCN has the fastest detection speed. The main difference between Faster R-CNN and R-FCN is the depth of the RoI sub-network. R-FCN extracts features from the final convolutional layer of the ResNet101 network and uses position-sensitive score maps and position-sensitive RoI Pooling to get location information. Thus, R-FCN has a shallower RoI sub-network than Faster R-CNN, which helps to increase speed.

Table I: Average detection speed of different detectors on the WSI patch (1024 × 1024 pixels).

Methods                          Average Testing Time Per WSI Patch
Mask R-CNN (ResNet101) [10]      1036.17 ms
Faster R-CNN (ResNet101) [24]    865.33 ms
R-FCN (ResNet101) [4]            810.21 ms
SSD (MobileNet v2) [18]          745.46 ms
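To put the per-patch timings in context, the sketch below gives a rough estimate of the time needed to process one WSI. It assumes a 90,000 × 72,000 pixel WSI downsampled 8 times, 1024-pixel patches with a stride of 256, and that every patch is sent to the detector; these assumptions and the function names are illustrative, so the result is an order-of-magnitude figure rather than a measured one.

```python
import math

MS_PER_PATCH = {          # per-patch timings taken from Table I
    "Mask R-CNN": 1036.17,
    "Faster R-CNN": 865.33,
    "R-FCN": 810.21,
    "SSD": 745.46,
}


def windows_along(length: int, patch: int = 1024, stride: int = 256) -> int:
    """Number of sliding windows needed to cover one axis."""
    if length <= patch:
        return 1
    return math.ceil((length - patch) / stride) + 1


def patches_per_wsi(width: int, height: int, downsample: int,
                    patch: int = 1024, stride: int = 256) -> int:
    """Patch count for a WSI after integer downsampling."""
    return (windows_along(width // downsample, patch, stride)
            * windows_along(height // downsample, patch, stride))


def seconds_per_wsi(detector: str, n_patches: int) -> float:
    """Total detection time if every patch is processed sequentially."""
    return n_patches * MS_PER_PATCH[detector] / 1000.0


n = patches_per_wsi(90_000, 72_000, downsample=8)
for name in MS_PER_PATCH:
    print(f"{name}: about {seconds_per_wsi(name, n) / 60:.1f} min for {n} patches")
```

Under these assumptions a WSI yields 1353 patches, so the difference of a few hundred milliseconds per patch between detectors adds up to several minutes per slide.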

V Conclusion

For general object detection, the most important elements are detection accuracy and speed. There are two extra key elements for glomerulus detection: storage space and data scarcity. Downsampling affects detection accuracy, speed, and storage space by changing image resolution, number of patches and file size. Compression can further alleviate the storage problem caused by WSIs — often with negligible accuracy loss.

We have performed several experiments to find how the following factors affect glomerulus detection: downsampling rate, compression rate, amount of training data, and choice of detection models. For the best trade-off between performance, speed and storage, we conclude that using both resizing and compression to reduce WSI file size may have negligible effects on the final result, but can save a huge amount of storage space. At 8 times downsizing, Faster R-CNN can achieve 0.781 mAP at an IoU of 0.5 with file size 1616 times smaller than the original WSI. With both 40% compression and 8 times downsizing, Faster R-CNN can achieve 0.780 mAP at an IoU of 0.5 with file size 6157 times smaller than the original WSI. Lack of training data will lead to decreased accuracy and increased false positives, but the effect is not directly proportional. Finally, when investigating different CNN models, since detection accuracy instead of speed is our primary concern, accurate two-stage detectors are preferred. Although Mask R-CNN and R-FCN add novel modifications to Faster R-CNN and achieve exciting performance on the COCO dataset, due to the nature of renal DIF data, these modifications do not lead to better performance on the glomerulus detection task.


This research was funded by the Australian Government through the Australian Research Council and Sullivan Nicolaides Pathology under Linkage Project LP160101797.


  • [1] S. Agarwal, S. Sethi, and A. Dinda, “Basics of kidney biopsy: A nephrologist’s perspective,” Indian journal of nephrology, vol. 23, no. 4, p. 243, 2013.
  • [2] G. Chartrand, T. Cresson, R. Chav, A. Gotra, A. Tang, and J. A. De Guise, “Liver segmentation on ct and mr using laplacian mesh optimization,” IEEE Transactions on Biomedical Engineering, vol. 64, no. 9, pp. 2110–2121, 2017.
  • [3] C. L. Chen, A. Mahjoubfar, L.-C. Tai, I. K. Blaby, A. Huang, K. R. Niazi, and B. Jalali, “Deep learning in label-free cell classification,” Scientific reports, vol. 6, p. 21471, 2016.
  • [4] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional networks,” in Advances in neural information processing systems, 2016, pp. 379–387.
  • [5] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, p. 115, 2017.
  • [6] J. Gallego, A. Pedraza, S. Lopez, G. Steiner, L. Gonzalez, A. Laurinavicius, and G. Bueno, “Glomerulus classification and detection based on convolutional neural networks,” Journal of Imaging, vol. 4, no. 1, p. 20, 2018.
  • [7] Z. Gao, L. Wang, L. Zhou, and J. Zhang, “Hep-2 cell image classification with deep convolutional neural networks,” IEEE journal of biomedical and health informatics, vol. 21, no. 2, pp. 416–428, 2017.
  • [8] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 1, pp. 142–158, 2016.
  • [10] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [12] L. Hou, D. Samaras, T. M. Kurc, Y. Gao, J. E. Davis, and J. H. Saltz, “Patch-based convolutional neural network for whole slide tissue image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2424–2433.
  • [13] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama et al., “Speed/accuracy trade-offs for modern convolutional object detectors,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7310–7311.
  • [14] Y. Kawazoe, K. Shimamoto, R. Yamaguchi, Y. Shintani-Domoto, H. Uozaki, M. Fukayama, and K. Ohe, “Faster r-cnn-based glomerular detection in multistained human whole slide images,” Journal of Imaging, vol. 4, no. 7, p. 91, 2018.
  • [15] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
  • [16] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
  • [17] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
  • [18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
  • [19] T.-W. Chin, R. Ding, and D. Marculescu, “Adascale: Towards real-time video object detection using adaptive scaling,” in Conference on System and Machine Learning (SysML), 2019.
  • [20] J. Mölne, M. E. Breimer, and C. T. Svalander, “Immunoperoxidase versus immunofluorescence in the assessment of human renal biopsies,” American journal of kidney diseases, vol. 45, no. 4, pp. 674–683, 2005.
  • [21] T. Okada, M. G. Linguraru, M. Hori, R. M. Summers, N. Tomiyama, and Y. Sato, “Abdominal multi-organ segmentation from ct images using conditional shape–location and unsupervised intensity priors,” Medical image analysis, vol. 26, no. 1, pp. 1–18, 2015.
  • [22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • [23] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
  • [24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [25] A. Samak, A. Wiliem, P. Hobson, M. Walsh, T. Ditchmen, A. Troskie, S. Barksdale, R. Edwards, A. Jennings, and B. C. Lovell, “An optimization approach to scanning skin direct immunofluorescence specimens,” in 2015 International Conference on Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2015, pp. 1–8.
  • [26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
  • [27] O. Simon, R. Yacoub, S. Jain, J. E. Tomaszewski, and P. Sarder, “Multi-radial lbp features as a tool for rapid glomerular detection and assessment in whole slide histopathology images,” Scientific reports, vol. 8, no. 1, p. 2032, 2018.
  • [28] W. Sun, T.-L. B. Tseng, J. Zhang, and W. Qian, “Enhancing deep convolutional neural network scheme for breast cancer diagnosis with unlabeled data,” Computerized Medical Imaging and Graphics, vol. 57, pp. 4–9, 2017.
  • [29] T. Tieleman and G. Hinton, “Rmsprop: Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machine learning,” COURSERA Neural Networks Mach. Learn, 2012.
  • [30] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-shot refinement neural network for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4203–4212.
  • [31] K. Zhao, Y. J. J. Tang, T. Zhang, J. Carvajal, D. F. Smith, A. Wiliem, P. Hobson, A. Jennings, and B. C. Lovell, “Dgdi: A dataset for detecting glomeruli on renal direct immunofluorescence,” in 2018 Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2018, pp. 1–7.