Unsupervised learning algorithms[celebi2016unsupervised][sathya2013comparison], such as self-supervised learning, auto-encoders, and contrastive learning[chen2020simple], allow deep learning models to learn effective image representations from large-scale unlabeled data. However, even large-scale unannotated glomerular images can be difficult to obtain for individual labs[zhang2017deep]. Fortunately, many resources (e.g., the NIH Open-i[demner2012design] search engine and academic images released by journals) provide the opportunity to obtain extra large-scale images. However, the images from such resources include a considerably large number of compound figures with subplots (Fig. 1). To extract and curate individual subplots, compound figure separation algorithms can be applied[lee2015dismantling].
Various compound figure separation approaches have been developed[davila2020chart, lee2015detecting, apostolova2013image, tsutsui2017data, shi2019layout, jiang2021two, huang2005associating], especially with recent advances in deep learning. However, previous approaches typically required resource-intensive bounding box annotations to train detection models. In this paper, we propose a simple compound figure separation (SimCFS) framework that utilizes weak classification annotations from individual images for compound figure separation. Briefly, the contributions of this study are three-fold:
We propose a new side loss, an optimized detection loss for figure separation.
We introduce an intra-class image augmentation method to simulate hard cases.
The proposed framework enables an efficient deployment to new classes of images, without requiring resource-intensive bounding box annotations.
We apply our technique to conduct compound figure separation for renal pathology. Glomerular phenotyping[koziell2002genotype] is a fundamental task for efficient diagnosis and quantitative evaluations in renal pathology. Recently, deep learning techniques have played increasingly important roles in renal pathology to reduce the clinical workload of pathologists and enable large-scale population-based research [gadermayr2017cnn, bueno2020glomerulosclerosis, govind2018glomerular, kannan2019segmentation, ginley2019computational]. Due to the lack of publicly available annotated datasets for renal pathology, the related deep learning approaches are still limited to small-scale data [huo2021ai]. Therefore, it is appealing to extract large-scale glomerular images from public databases (e.g., the NIH Open-i search engine) for downstream unsupervised or semi-supervised learning[huo2021ai].
2 Related Work
In biomedical articles, about 40-60% of figures are multi-panel [kalpathy2015evaluating]. Several approaches have been proposed in the document analysis community to extract figures and their semantic information. For example, Huang et al. [huang2005associating] presented recognition results for textual and graphical information in literary figures. Davila et al. [davila2020chart] presented a survey of data mining pipelines as a guide for future research.
To collect scientific data massively and automatically, various approaches have been proposed by different researchers[10.1093/bioinformatics/btx611][10.1007/978-3-319-65813-1_20][lee2015dismantling]. For example, Lee et al. (2015) [lee2015detecting] proposed an SVM-based binary classifier to distinguish complete charts from visual markers like labels, legends, and ticks. Apostolova et al.[apostolova2013image] proposed a figure separation method based on capital letter indices. These traditional computer vision approaches commonly rely on the figure's grid-based layout or visual information, so the separation is usually accomplished by an x-y cut. However, compound figures contain more complicated cases, such as sub-figures with no white-space gaps or with overlapping content.
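To make the classical x-y cut concrete, below is a minimal sketch (not any cited author's implementation) that recursively splits a grayscale figure along sufficiently wide white-space gaps, alternating between row-wise and column-wise cuts. The thresholds `min_gap` and `white` are illustrative assumptions.

```python
import numpy as np

def _segments(blank, min_gap):
    """Split a boolean 'is blank' profile into content (start, end) spans,
    treating blank runs of at least min_gap lines as separators."""
    segs, start = [], None
    n = len(blank)
    for i, b in enumerate(blank):
        if not b and start is None:
            start = i
        if b and start is not None:
            # Look ahead: how long is this blank run?
            j = i
            while j < n and blank[j]:
                j += 1
            if j - i >= min_gap or j == n:
                segs.append((start, i))
                start = None
    if start is not None:
        segs.append((start, n))
    return segs

def xy_cut(img, min_gap=10, white=250):
    """Recursive x-y cut: return (top, bottom, left, right) boxes of
    sub-figures in a grayscale (2-D uint8) compound figure."""
    def rec(top, bottom, left, right, by_rows, can_flip):
        sub = img[top:bottom, left:right]
        # A row/column is blank if every pixel in it is near-white.
        blank = (sub >= white).all(axis=1 if by_rows else 0)
        segs = _segments(blank, min_gap)
        if len(segs) <= 1:
            if can_flip:  # try cutting in the other direction once
                return rec(top, bottom, left, right, not by_rows, False)
            return [(top, bottom, left, right)]
        boxes = []
        for s, e in segs:
            if by_rows:
                boxes += rec(top + s, top + e, left, right, False, True)
            else:
                boxes += rec(top, bottom, left + s, left + e, True, True)
        return boxes
    return rec(0, img.shape[0], 0, img.shape[1], True, True)
```

As the surrounding text notes, this style of method fails exactly when sub-figures touch (no blank run wider than `min_gap`) or overlap, which motivates the learning-based detectors discussed next.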
In the past few years, deep learning based algorithms using convolutional neural networks (CNNs) have provided considerably better performance in extracting and processing textual and non-textual content from scholarly articles. Tsutsui and Crandall (2017)[tsutsui2017data] proposed the first deep learning based approach to compound figure separation, applying a deep convolutional network to train the separator. They also trained on artificially-constructed datasets and reported superior performance on the ImageCLEF data sets[GSB2016]. Shi et al. [shi2019layout] developed a multi-branch output CNN to predict irregular panel layouts and provided augmented data to drive learning; their network can predict compound figures with structures of different sizes with better accuracy. Jiang et al. [jiang2021two]
combined traditional vision methods with the high performance of deep learning networks by first detecting the sub-figure labels and then optimizing the feature selection process in the sub-figure detection part, improving the detection precision by 9%. In Tsutsui's study [tsutsui2017data], they applied You Only Look Once (YOLO) Version 2 [redmon2016you], a CNN based detection network. Deep learning based detection approaches utilize a single convolutional network to predict bounding boxes and class probabilities from full images simultaneously, which achieves high-speed detection and is well suited to sub-figure detection tasks.
3 Method
The overall framework of the SimCFS approach is presented in Fig. 2. The training stage of SimCFS contains two major steps: (1) compound figure simulation, and (2) sub-figure detection. In the testing stage, only the detection network is needed.
3.1 Anchor-based detection
YOLOv5, the latest version in the YOLO family [bochkovskiy2020yolov4], is employed as the backbone network for sub-figure detection. The rationale for choosing YOLOv5 is that the sub-figures in compound figures are typically arranged in horizontal or vertical order. Herein, the grid-based design with anchor boxes is well suited to our application. A new side loss is introduced to the detection network that further optimizes the performance of compound figure separation.
3.2 Compound figure simulation
Our goal is to train a compound figure separation method using only single (non-compound) images with weak classification labels. In previous studies, the same task typically required stronger bounding box annotations of subplots in real compound figures. A unique advantage of compound figure separation tasks is that the sub-figures do not overlap. Moreover, their spatial distributions are more ordered than those of natural images in object detection. Therefore, we propose to directly simulate compound figures from individual images as the training data for the downstream sub-figure detection.
Tsutsui et al. [tsutsui2017data] proposed a compound figure synthesis approach (Fig. 3). The method first randomly samples a number of rows and a random height for each row. Then a random number of single figures fill the empty template. However, the single figures are naively resized to fit the template, introducing large distortion (Fig. 3).
Inspired by prior arts [tsutsui2017data], we propose a simple data augmentation strategy specific to compound figure separation, called SimCFS-AUG, to perform compound figure simulation. Two groups of simulated compound figures are generated: row-restricted and column-restricted. The length of each row or column is randomly generated within a certain range. Then, images from our database are randomly selected and concatenated to fit the preset space. As opposed to previous studies, our SimCFS-AUG simulator keeps the original aspect ratio of individual images, without distortion. Moreover, we introduce an intra-class augmentation to SimCFS-AUG to simulate the specific hard case in which all sub-figures belong to the same class.
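The row-restricted case above can be sketched as follows. This is a minimal illustration, not the authors' released simulator: each image is rescaled to a common row height while preserving its aspect ratio, placed left to right on a white canvas, and its bounding box is recorded as a detection label. A nearest-neighbour resize keeps the sketch dependency-free; the intra-class hard case corresponds to sampling all inputs from a single class.

```python
import numpy as np

def simulate_row(images, row_height, max_width, gap=5):
    """Concatenate single images into one simulated compound-figure row.

    Each image keeps its original aspect ratio (no distortion) and is
    rescaled to the common row height, then placed left to right until
    the preset width is filled. Returns the canvas and the
    (x1, y1, x2, y2) box of every placed sub-figure.
    """
    canvas = np.full((row_height, max_width, 3), 255, dtype=np.uint8)
    boxes, x = [], 0
    for img in images:
        h, w = img.shape[:2]
        new_w = int(w * row_height / h)        # preserve aspect ratio
        if x + new_w > max_width:
            break                              # row is full
        # Nearest-neighbour resize via integer index sampling.
        rows = (np.arange(row_height) * h / row_height).astype(int)
        cols = (np.arange(new_w) * w / new_w).astype(int)
        canvas[:, x:x + new_w] = img[rows][:, cols]
        boxes.append((x, 0, x + new_w, row_height))
        x += new_w + gap                       # white-space gap between panels
    return canvas, boxes
```

Stacking several such rows (or transposing the logic for column-restricted layouts) yields pseudo compound figures with free bounding-box labels for the detector.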
3.3 Side loss for compound figure separation
For object detection on natural images, there is no specific preference between over detection and under detection as objects can be randomly located and even overlapped. In medical compound images, however, the objects are typically closely attached to each other, but not overlapping. In this case, over detection would introduce undesired pixels from the nearby plots (Fig. 4), which are not ideal for downstream deep learning tasks. Unfortunately, the over detection is often encouraged by the current Intersection Over Union (IoU) loss in object detection (Fig. 4), compared with under detection.
In the SimCFS-DET network, we introduce a simple side loss to penalize over detection. We define a predicted bounding box as $B^{p} = (x_1^{p}, y_1^{p}, x_2^{p}, y_2^{p})$ and a ground truth box as $B^{gt} = (x_1^{gt}, y_1^{gt}, x_2^{gt}, y_2^{gt})$, where $(x_1, y_1)$ and $(x_2, y_2)$ are the top-left and bottom-right corner coordinates. The over detection penalty of the vertices for each box is computed as:
$$\Delta x_1 = \max(0, x_1^{gt} - x_1^{p}), \quad \Delta y_1 = \max(0, y_1^{gt} - y_1^{p}),$$
$$\Delta x_2 = \max(0, x_2^{p} - x_2^{gt}), \quad \Delta y_2 = \max(0, y_2^{p} - y_2^{gt}).$$
Then, the side loss is defined as:
$$\mathcal{L}_{side} = \Delta x_1 + \Delta y_1 + \Delta x_2 + \Delta y_2.$$
Side loss is combined with the canonical loss functions in YOLOv5, including the bounding box loss ($\mathcal{L}_{box}$), object probability loss ($\mathcal{L}_{obj}$), and classification loss ($\mathcal{L}_{cls}$):
$$\mathcal{L} = \lambda_{box}\mathcal{L}_{box} + \lambda_{obj}\mathcal{L}_{obj} + \lambda_{cls}\mathcal{L}_{cls} + \lambda_{side}\mathcal{L}_{side},$$
where $\lambda_{box}$, $\lambda_{obj}$, $\lambda_{cls}$, and $\lambda_{side}$ are constant weights that balance the four loss functions. Following the YOLOv5 implementation (https://github.com/ultralytics/yolov5), $\lambda_{box}$, $\lambda_{obj}$, and $\lambda_{cls}$ were set as functions of the number of classes, the number of layers, and the image size. The weight $\lambda_{side}$ of the Side loss was empirically fixed across all experiments, as the Side loss and Box loss are both based on box coordinates.
4 Data and Implementation Details
We collected two in-house datasets for evaluating the performance of different compound figure separation strategies. One compound figure dataset (called Glomeruli-2000) consisted of 917 training and 917 testing real figure plots retrieved from the American Journal of Kidney Diseases (AJKD) with the keywords “glomerular OR glomeruli OR glomerulus”. Each figure was annotated manually with four classes: glomeruli from (1) light microscopy, (2) fluorescence microscopy, and (3) electron microscopy, as well as (4) charts/plots.
To obtain single images to simulate compound figures, we downloaded 5,663 individual images from other resources. Briefly, we obtained 1,037 images from Twitter and 4,626 images from the Google search engine, covering five classes of single images: (1) glomeruli with light microscopy, (2) glomeruli with fluorescence microscopy, (3) glomeruli with electron microscopy, (4) charts/plots, and (5) others. The individual images were combined using the SimCFS-AUG simulator to generate 9,947 pseudo training images; 2,000 of the pseudo images were simulated using intra-class augmentation, while 2,947 of them were simulated with only single sub-figures. The implementation of SimCFS-DET was based on YOLOv5 with PyTorch. Google Colab was used to perform all experiments in this study.
In the experimental setting, the parameters were empirically chosen. We set the learning rate to 0.01, the weight decay to 0.0005, and the momentum to 0.937. The input image size was set to 640, the loss weights to 0.5, 1, and 0.5, and the number of layers to 3. For our in-house datasets, we trained 50 epochs using a batch size of 64. For the ImageCLEF2016 dataset[GSB2016], we trained 50 epochs using a smaller batch size of 8.
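For reference, the reported settings can be collected into a single configuration object. The key names below follow common YOLOv5-style hyper-parameter conventions but are illustrative, not the authors' exact configuration file.

```python
# Hypothetical training configuration mirroring the settings reported above.
TRAIN_CFG = {
    "lr0": 0.01,              # initial learning rate
    "weight_decay": 0.0005,
    "momentum": 0.937,        # SGD momentum
    "img_size": 640,          # input image size (pixels)
    "num_layers": 3,          # number of detection layers
    "epochs": 50,
    "batch_size_inhouse": 64, # in-house datasets
    "batch_size_clef": 8,     # smaller batch for ImageCLEF2016
}
```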
5 Results
5.1 Ablation Study
The Side loss is our major contribution on top of the YOLOv5 detection backbone. In this ablation study, we show the performance of using 917 real compound images with manual box annotations as training data (“Real Training Images”) in Table 1 and Fig. 5, along with the results of merely using simulated images as training data (“Simulated Training Images”). The proposed side loss consistently improves the detection performance by a decent margin. The intra-class self-augmentation further improves the performance when only simulated training images are used.
| Training Data | Method | Side loss | AUG | All | Light | Fluo. | Elec. | Chart |
|---|---|---|---|---|---|---|---|---|
| Real Training Images | YOLOv5 [bochkovskiy2020yolov4] | | | 69.8 | 77.1 | 71.3 | 73.4 | 57.4 |
| Simulated Training Images | YOLOv5 [bochkovskiy2020yolov4] | | | 66.4 | 79.3 | 62.1 | 76.1 | 48.0 |
*AUG is the intra-class self-augmentation. All is the overall mAP, reported together with the per-class mAP for the Light, Fluorescence, and Electron classes.
| Method | Backbone | Accuracy | mAP |
|---|---|---|---|
| Tsutsui et al. [tsutsui2017data] | YOLOv2 | 69.8 | - |
| Tsutsui et al. [tsutsui2017data] | Transfer | 77.3 | - |
| Zou et al. [zou2020unified] | ResNet152 | 78.4 | - |
| Zou et al. [zou2020unified] | VGG19 | 81.1 | - |
5.2 Comparison with State-of-the-art
We also compare SimCFS-DET with state-of-the-art approaches, including Tsutsui et al. [tsutsui2017data] and Zou et al. [zou2020unified], using the ImageCLEF2016 dataset[GSB2016]. ImageCLEF2016 is a commonly accepted benchmark for compound figure separation, including a total of 8,397 annotated multi-panel figures (6,783 figures for training and 1,614 figures for testing). Table 2 shows the results on the ImageCLEF2016 dataset. The proposed SimCFS-DET approach consistently outperforms the other methods across the evaluation metrics.
6 Conclusion
In this paper, we introduced the SimCFS framework to extract images of interest from large-scale compound figures with merely weak classification labels. The pseudo training data can be built using the proposed SimCFS-AUG simulator. The anchor-based SimCFS-DET detection achieves state-of-the-art performance by introducing a simple Side loss.