
OXnet: Omni-supervised Thoracic Disease Detection from Chest X-rays

by Luyang Luo, et al.

Chest X-ray (CXR) is the most common medical image worldwide for examining various thoracic diseases. Automatically localizing lesions from CXR is a promising way to alleviate radiologists' daily reading burden. However, CXR datasets often have numerous image-level annotations, scarce lesion-level annotations, and, more often, no annotations at all. Thus far, unifying different supervision granularities to develop thoracic disease detection algorithms has not been comprehensively addressed. In this paper, we present OXnet, to the best of our knowledge the first deep omni-supervised thoracic disease detection network, which uses as much available supervision as possible for CXR diagnosis. Besides fully supervised learning, to enable learning from weakly-annotated data, we guide the information from a global classification branch to the lesion localization branch by a dual attention alignment module. To further enhance global information learning, we impose intra-class compactness and inter-class separability with a global prototype alignment module. For unsupervised data learning, we extend the focal loss to its soft form to distill knowledge from a teacher model. Extensive experiments show that the proposed OXnet outperforms competitive methods by significant margins. Further, we investigate omni-supervision under various annotation granularities and corroborate that OXnet is a promising choice to mitigate the plight of annotation shortage for medical image diagnosis.




1 Introduction

Modern object detection algorithms often require a large amount of supervision signals. However, annotating abundant medical images for disease detection is infeasible due to the high expense of expert knowledge and tedious labor. Consequently, many medical datasets are weakly labeled to indicate the existence of abnormal findings or, more frequently, unlabeled [21]. This situation holds especially for chest X-rays (CXR), the world's most common medical image. Apart from a large amount of unlabeled data, CXR datasets often have image-level annotations that can be easily obtained by text mining from the numerous radiological reports [26, 9], while lesion-level annotations (e.g., bounding boxes) are scarce [7, 27]. Therefore, efficiently leveraging available annotations to develop thoracic disease detection algorithms has significant practical value.

Omni-supervised learning [19] aims to leverage the existing fully-annotated data and other available data (e.g., unlabeled data) as much as possible, which could practically address the mentioned challenge. Unlike previous studies that only include extra unlabeled data [19, 6, 24], we aim to utilize as much available supervision as possible from data of various annotation granularities, i.e., fully-labeled data, weakly-labeled data, and unlabeled data, to develop a unified framework for thoracic disease detection on chest X-rays. Note that the works on CXR14 [26] that leverage both image-level and lesion-level annotations [12, 15, 17, 33] only consider whether the attention covers a single target lesion, an assumption that often does not hold in real-world scenarios where multiple lesions could exist. We here aim to present a feasible and general solution for clinical use.

In this paper, we present OXnet, a unified deep framework for omni-supervised chest X-ray disease detection. To this end, we first leverage limited bounding box annotations to train a base detector. To enable learning from weakly labeled data, we introduce a dual attention alignment module that guides gradients from an image-level classification branch to the local lesion detection branch. To further enhance learning from global information, we propose a global prototype alignment module to impose intra-class compactness and inter-class separability. For unlabeled data, we present a soft focal loss to distill the knowledge from a teacher model. Extensive experiments show that OXnet not only outperforms the baseline detector by a large margin but also achieves better thoracic disease detection performance than other competitive methods. Further, OXnet also shows performance comparable to the fully-supervised method while using fewer fully-labeled data and sufficient weakly-labeled data. In summary, OXnet can effectively utilize all the available supervision signals, demonstrating a promisingly feasible and general solution to real-world applications.

2 Method

Let $\mathcal{D}_F$, $\mathcal{D}_W$, and $\mathcal{D}_U$ denote the fully-labeled data (with bounding box annotations), weakly-labeled data (with image-level annotations), and unlabeled data, respectively. As illustrated in Fig. 1, OXnet correspondingly consists of three main parts: a base detector to learn from $\mathcal{D}_F$, a global path to learn from $\mathcal{D}_W$, and an unsupervised branch to learn from $\mathcal{D}_U$. As the first part is a RetinaNet [14] backbone supervised by the focal loss and bounding box regression loss, we focus on introducing the latter two parts in the following sections.

Figure 1: Overview of OXnet. For the fully-labeled data $\mathcal{D}_F$, the model learns via the original supervised loss of RetinaNet. For $\mathcal{D}_F$ and the weakly-labeled data $\mathcal{D}_W$, DAA aligns the global and local attentions and enables global information to flow to the local branch (Sec. 2.1). Hence, GPA can further enhance the learning from global labels (Sec. 2.2). For $\mathcal{D}_W$ and the unlabeled data $\mathcal{D}_U$, a teacher model guides the RetinaNet with the soft focal loss (Sec. 2.3).

2.1 Dual Attention Alignment for Weakly-supervised Learning

To enable learning from the weakly-labeled data $\mathcal{D}_W$, we first add a global multi-label classification head to the ResNet [5] part of RetinaNet. This global head consists of a 1×1 convolutional layer for channel reduction, a global average pooling layer, and a sigmoid layer. Then, the binary cross entropy loss is used as the objective:

$$\mathcal{L}_{BCE}(p, y) = -\sum_{c=1}^{C}\left[y_c \log(p_c) + (1 - y_c)\log(1 - p_c)\right] \tag{1}$$

where $c$ is the index of the $C$ total categories, $y_c$ is the image-level label, and $p_c$ is the prediction. The 1×1 convolution takes the input $M$ and outputs a feature map $X$ with $C$ channels, where $X$ is exactly the global attention map, i.e., the class activation map [31]. Meanwhile, the local classification head of RetinaNet conducts dense prediction and generates outputs of size Width×Height×Classes×Anchors for each feature pyramid level [13]. We argue that these classification maps can actually serve as the local attentions for each class and anchor. Hence, we first take the maximum over the anchor dimension to get the attentions $A$ on the $N$ pyramid levels. Then, the final local attention for each class is obtained by:


Essentially, we observed that the global and local attentions are often mismatched as shown in Fig. 2 (1), indicating that the two branches make decisions based on different regions of interest (ROIs). Moreover, the local attention usually covers more precise ROIs by learning from lesion-level annotations. Therefore, we argue that the local attention can be used as the multi-instance pooling weights [8, 22] for the global classification. In other words, the local attention $A$ can weigh the importance of each pixel on the global attention map $X$. Hence, we resize the local attention to be the same shape as $X$ and propose a dual attention alignment (DAA) module as follows:



where $\sigma$ is the sigmoid function, $i$ is the index of the pixels, and $\odot$ denotes element-wise multiplication. Particularly, we let the pooling weights sum to one over the pixels to construct the multi-instance pooling function. DAA is then used to replace the mentioned global head as illustrated in Fig. 2 (2) and (3). Consequently, the local attention helps rectify the decision-making regions of the global branch. Meanwhile, the gradient from the global branch can thus flow to the local branch. At this stage, the local classification branch receives not only the strong yet limited supervision from the fully-labeled data $\mathcal{D}_F$ but also massive weak supervision from the weakly-labeled data $\mathcal{D}_W$.
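To make the pooling idea concrete, the following minimal Python sketch uses a resized local attention map as multi-instance pooling weights over a global attention map for a single class. It is an illustration under stated assumptions rather than the OXnet implementation: the sum-to-one normalization, the order of operations, and the names `daa_pool`, `global_map`, and `local_att` are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def daa_pool(global_map, local_att, eps=1e-8):
    """Attention-weighted pooling for one class (illustrative only).

    global_map: 2D list of global attention values (CAM-like) for the class.
    local_att:  2D list of non-negative local attention values, already
                resized to the same shape as global_map.
    Returns an image-level probability in (0, 1).
    """
    flat_x = [v for row in global_map for v in row]
    flat_a = [v for row in local_att for v in row]
    if len(flat_x) != len(flat_a):
        raise ValueError("maps must have the same shape")
    total = sum(flat_a) + eps
    # Assumption: weights are normalised to sum to one over the pixels.
    weights = [a / total for a in flat_a]
    # Element-wise product, sum over pixels, then sigmoid.
    pooled = sum(x * w for x, w in zip(flat_x, weights))
    return sigmoid(pooled)


# A global map that fires strongly on the top-left pixel only.
global_map = [[4.0, -4.0], [-4.0, -4.0]]
focused = [[1.0, 0.0], [0.0, 0.0]]   # local attention on the same pixel
uniform = [[1.0, 1.0], [1.0, 1.0]]   # equivalent to plain average pooling

p_focused = daa_pool(global_map, focused)
p_uniform = daa_pool(global_map, uniform)
```

In this toy case the focused weights let the single strong pixel dominate the image-level score, whereas uniform weights (plain average pooling) dilute it. Because the weights enter the score multiplicatively, an image-level loss on this score also produces gradients with respect to the local attention.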

Figure 2: (1). a, b, c, and d illustrate the ground truth of lesions, global attention without DAA, local attention map, and global attention with DAA, respectively. (2). global classification head without DAA. (3). DAA module, where solid line and dashed line represent forward and backward flow, respectively.

2.2 Global Prototype Alignment for Multi-label Metric Learning

With DAA, any global supervision signal can flow to the local branch. To further enhance this learning process, we propose a multi-label metric learning loss that imposes intra-class compactness and inter-class separability on the network. It is worth noting that multi-label metric learning can be hard as the deep features often capture multi-class information [16]. Specifically, in our case, we can obtain category-specific features under the guidance of local attention as follows:


where $M$ is the feature map before the global 1×1 convolutional layer. Particularly, $M$ has a spatial shape of $W \times H$ with $W$ and $H$ being width and height, respectively, $k$ is the index of the feature vectors ($M$ contains $W \times H$ vectors), and $\odot$ is the element-wise multiplication over the dimensions of $W$ and $H$. Here, the local attention highlights the feature vectors related to each category, leading to the aggregated category-specific feature vector $F_c$. For better metric learning regularization, we generate global prototypes [28, 30] for each category as follows:


where the update coefficient $\beta$ is 0 at the first training step and 0.7 otherwise, and $B$ is the number of data in a batch. Particularly, the prototype is updated with a confidence-weighted average as suggested in [20, 29]. We then align the category-specific features $F_c$ with the prototypes as follows:


where $\delta$ is a scalar representing the inter-class margin.
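The two ingredients described above, a momentum-style prototype update and an alignment objective with an inter-class margin, can be sketched as follows. This is only one plausible instantiation: the squared Euclidean distance, the hinge form of the separability term, and the names `update_prototype` and `alignment_losses` are assumptions made for illustration, not the formulation used in OXnet.

```python
import math


def update_prototype(prev, feats, confs, step, momentum=0.7):
    """Momentum update of one class prototype (illustrative only).

    prev:  previous prototype (list of floats); ignored at step 0.
    feats: category-specific feature vectors from the current batch.
    confs: prediction confidences used as averaging weights.
    """
    total = sum(confs)
    dim = len(feats[0])
    # Confidence-weighted average of the batch features.
    batch_mean = [
        sum(c * f[d] for f, c in zip(feats, confs)) / total for d in range(dim)
    ]
    if step == 0 or prev is None:
        return batch_mean  # coefficient 0 at the first step
    return [momentum * p + (1.0 - momentum) * m for p, m in zip(prev, batch_mean)]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def alignment_losses(feat, label, prototypes, margin=1.0):
    """Return (intra, inter) for one feature vector of class `label`.

    intra pulls the feature towards its own class prototype;
    inter is a hinge penalising other prototypes closer than `margin`.
    Both forms are assumptions for this sketch.
    """
    intra = euclidean(feat, prototypes[label]) ** 2
    inter = sum(
        max(0.0, margin - euclidean(feat, p)) ** 2
        for c, p in enumerate(prototypes)
        if c != label
    )
    return intra, inter


prototypes = [[0.0, 0.0], [3.0, 0.0]]
near_own = alignment_losses([0.5, 0.0], 0, prototypes)    # close to class 0
near_other = alignment_losses([2.5, 0.0], 0, prototypes)  # drifts to class 1
```

A feature that drifts towards another class's prototype incurs both a larger compactness term and a non-zero separability term, which is the behaviour the alignment is meant to discourage.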

2.3 Soft Focal Loss for Unsupervised Knowledge Distillation

To learn from the unlabeled data, we first obtain a mean teacher model by exponential moving average [23]: $\tilde{\theta}_t = \lambda \tilde{\theta}_{t-1} + (1 - \lambda)\theta_t$, where $\theta$ and $\tilde{\theta}$ are the parameters of RetinaNet and the teacher model, respectively, $\lambda$ is a decay coefficient, and $t$ represents the training step. Then, the student RetinaNet learns from the teacher model's classification predictions, which are probabilities in the range (0, 1). Particularly, the foreground-background imbalance problem [14] should also be taken care of. Therefore, we extend the focal loss to its soft form inspired by [11, 25]. Specifically, the original focal loss is as follows:

$$\mathrm{FL}(p) = -\alpha (1 - p)^{\gamma} \log(p) \tag{7}$$

where $\alpha$ is a pre-set weight for balancing foreground and background, $p$ is the local classification prediction from the model (we eliminate the subscript for simplicity), and $\gamma$ is a scalar controlling $(1 - p)^{\gamma}$ to assign more weight to samples that are less well classified. Here, instead of assigning $\alpha$ according to whether an anchor is positive or negative, we assume its value changes linearly with the teacher model's prediction $q$ and modify it to be $\alpha q + \kappa$, where $\kappa$ is a constant. Meanwhile, we notice that $(1 - p)^{\gamma}$ in Formula 7 depends on how close $p$ is to its target ground truth. Therefore, we modify the focal weight to be $|q - p|^{\gamma}$ to give larger weights to the instances with higher disagreement between the student and teacher models, and the soft focal loss hence is:
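As a hedged illustration of this section, the sketch below implements the exponential-moving-average teacher update and one plausible soft form of the focal loss for a single anchor and class. It is not the authors' code. The balancing weight `alpha * q + kappa`, the focal weight `|q - p| ** gamma`, and in particular the soft binary cross-entropy base term are assumptions, and the function names are hypothetical.

```python
import math


def ema_update(teacher, student, decay=0.99):
    """Exponential moving average of student parameters (mean teacher)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def soft_focal_loss(p, q, alpha=0.9, kappa=0.05, gamma=2.0, eps=1e-7):
    """Soft focal loss sketch for student probability p and teacher probability q.

    Assumptions for this sketch:
      * balancing weight: alpha * q + kappa (linear in the teacher prediction)
      * focal weight:     |q - p| ** gamma (student-teacher disagreement)
      * base term:        binary cross entropy with the soft target q
    """
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    weight = (alpha * q + kappa) * abs(q - p) ** gamma
    soft_ce = -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    return weight * soft_ce


agree = soft_focal_loss(0.8, 0.8)      # student matches the teacher
mild = soft_focal_loss(0.7, 0.9)       # small disagreement
strong = soft_focal_loss(0.2, 0.9)     # large disagreement
teacher = ema_update([1.0, 0.0], [0.0, 1.0])
```

With these choices the loss vanishes when student and teacher agree and grows with their disagreement, while anchors that the teacher considers likely foreground (large `q`) receive a larger balancing weight.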


2.4 Unified Training for Deep Omni-supervised Detection

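The overall training loss sums up the supervised losses [14], the weakly-supervised losses ($\mathcal{L}_{BCE}$, $\mathcal{L}_{intra}$, and $\mathcal{L}_{inter}$), and the unsupervised loss ($\mathcal{L}_{SFL}$). Particularly, the weakly-supervised losses are applied to the fully-labeled data $\mathcal{D}_F$ and the weakly-labeled data $\mathcal{D}_W$, and the unsupervised loss is applied to $\mathcal{D}_W$ and the unlabeled data $\mathcal{D}_U$. The prototype update coefficient $\beta$ is set to 0 for the first step and 0.7 otherwise, the inter-class margin $\delta$ is set to 1, and the teacher decay coefficient $\lambda$ is set to 0.99 by tuning on the validation set. As suggested in [25], $\alpha$ is set to 0.9, $\kappa$ is 0.05, and $\gamma$ is set to 2. CXR images are resized to 512×512 and equally sampled for different annotations. Augmentation is done following [32]. All implementations use PyTorch [18] on an NVIDIA TITAN Xp GPU.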

3 Experiments

3.1 Dataset and Evaluation Metrics

In total, 32,261 frontal CXR images taken from 27,253 patients were used. Thoracic diseases including aortic calcification, cardiomegaly, fracture, mass, nodule, pleural effusion, pneumonia, pneumothorax, and tuberculosis were annotated with bounding boxes. Each image was labeled by consensus among at least three experienced radiologists. The dataset was split into training, validation, and testing sets with 13,963, 5,118, and 13,180 images, respectively, without patient overlap. More details of the data can be found in the supplementary materials.

We adopt Average Precision (AP) [3] as the evaluation metric. Specifically, we report the mean AP (mAP) from AP$_{40}$ to AP$_{75}$ with an interval of 5 following [4]. We also report AP$_S$, AP$_M$, and AP$_L$ for small, medium, and large targets with the COCO API. All statistics reported are averaged over the nine abnormalities.
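For readers unfamiliar with threshold-averaged AP, the sketch below shows the two building blocks: the intersection over union (IoU) between a predicted and a ground-truth box, and the averaging of per-threshold AP values. The threshold range of 0.40 to 0.75 in steps of 0.05 is an assumption of this sketch, and the helper names are hypothetical. Computing AP itself (ranking detections and integrating the precision-recall curve) is left to an evaluation toolkit such as the COCO API.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds(start=40, stop=75, step=5):
    """IoU thresholds given in percent, returned as fractions."""
    return [t / 100.0 for t in range(start, stop + 1, step)]


def mean_ap(ap_per_threshold):
    """Average a mapping {IoU threshold: AP} over all thresholds."""
    return sum(ap_per_threshold.values()) / len(ap_per_threshold)


pred = (0, 0, 10, 10)
truth = (5, 0, 15, 10)
overlap = iou(pred, truth)
# A detection counts as a hit at a threshold only if its IoU reaches it.
hits = [overlap >= t for t in iou_thresholds()]
```

Here the two boxes share half of their width, so the prediction is rejected at every threshold in the assumed range; tighter localization is needed before a detection contributes to any of the averaged AP values.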

3.2 Comparison with Other Methods

To the best of our knowledge, few previous works have simultaneously leveraged fully-labeled data, weakly-labeled data, and unlabeled data. Hence, we first trained a RetinaNet (with ResNet-101 weights pre-trained on ImageNet [2]) on our dataset. Then, we implemented several state-of-the-art semi-supervised methods fine-tuned from RetinaNet and also constructed multi-task (MT) models by adding global classification heads. Compared methods included RetinaNet [14], Π Model [10], Mean Teacher [23], MMT-PSM [32], and FocalMix [25], and all semi-supervised methods conducted knowledge distillation on classification outputs or feature maps.

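| Method | Fully labeled | Weakly labeled | Unlabeled | mAP | AP$_{40}$ | AP$_{75}$ | AP$_S$ | AP$_M$ | AP$_L$ |
|---|---|---|---|---|---|---|---|---|---|
| RetinaNet [14] | 2725 | 0 | 0 | 18.4 | 27.7 | 7.5 | 8.0 | 16.7 | 25.4 |
| Π Model [10] | 2725 | 0 | 11238 | 20.0 | 29.3 | 9.3 | 9.2 | 20.8 | 27.0 |
| Mean Teacher [23] | 2725 | 0 | 11238 | 20.0 | 29.2 | 9.4 | 9.1 | 20.4 | 26.9 |
| MMT-PSM [32] | 2725 | 0 | 11238 | 19.1 | 28.4 | 8.4 | 8.8 | 19.3 | 26.5 |
| FocalMix [25] | 2725 | 0 | 11238 | 19.8 | 29.1 | 9.0 | 8.6 | 19.6 | 26.3 |
| Π Model [10] + MT | 2725 | 11238 | 0 | 20.2 | 29.6 | 9.2 | 9.4 | 21.3 | 27.6 |
| Mean Teacher [23] + MT | 2725 | 11238 | 0 | 20.4 | 29.6 | 9.4 | 9.3 | 20.6 | 27.1 |
| MMT-PSM [32] + MT | 2725 | 11238 | 0 | 19.3 | 28.4 | 8.8 | 8.4 | 17.9 | 26.8 |
| FocalMix [25] + MT | 2725 | 11238 | 0 | 20.1 | 29.7 | 8.8 | 8.5 | 18.3 | 27.4 |
| SFL | 2725 | 0 | 11238 | 20.4 | 29.7 | 9.4 | 9.3 | 21.6 | 29.8 |
| DAA | 2725 | 11238 | 0 | 21.2 | 31.2 | 9.4 | 9.8 | 20.7 | 28.0 |
| DAA + GPA | 2725 | 11238 | 0 | 21.4 | 31.4 | 9.5 | 11.0 | 20.9 | 27.1 |
| OXnet (SFL+DAA+GPA) | 2725 | 11238 | 0 | 22.3 | 32.4 | 10.3 | 9.6 | 21.8 | 31.4 |

Table 1: Quantitative comparison of different methods. The second to fourth columns give the number of images used at each annotation granularity.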

3.2.1 Quantitative Comparison.

All semi-supervised methods (rows 2 to 5 in Table 1) clearly outperform the RetinaNet baseline (1st row), demonstrating effective utilization of the unlabeled data. After incorporating the global classification heads, the four multi-task networks (rows 6 to 9) improve further, with gains of about 0.2 to 0.4 points in mAP. This finding suggests that abundant image-level supervision can help in learning abnormality detection, but the benefits are still limited without proper design. On the other hand, OXnet (the last row) achieves 22.3 in mAP and outperforms the multi-task models on various sizes of targets. The results corroborate that our proposed method can more effectively leverage the less well-labeled data for the thoracic disease detection task.

Figure 3: Visualization of results generated by RetinaNet (first row), Mean Teacher + MT (second row), and our method (third row). Ground truth is in red, true positives are in green, and false positives are in blue. Best viewed in color.

3.2.2 Qualitative Comparison.

We also visualize the outputs generated by RetinaNet, Mean Teacher + MT (the best method among those compared with ours), and OXnet in Fig. 3. As illustrated, our model yields more accurate predictions for multiple lesions of different diseases in each chest X-ray sample.

3.2.3 Ablation Study.

The last four rows in Table 1 report the ablation study of the proposed components, i.e., soft focal loss (SFL), dual attention alignment (DAA), and global prototype alignment (GPA). Our SFL achieves an mAP of 20.4 and outperforms other semi-supervised methods, demonstrating its better capability of utilizing the unlabeled data. On the other hand, incorporating only DAA reaches an mAP of 21.2, showing effective guidance from the weakly labeled data. Adding GPA to DAA improves the mAP to 21.4, demonstrating the effectiveness of the learned intra-class compactness and inter-class separability. By unifying the three components, OXnet reaches the best results in 5 out of 6 metrics, corroborating the complementarity of the proposed methods.

3.3 Omni-supervision under Different Annotation Granularities

Efficient learning is particularly essential for medical images as annotations are extremely valuable and scarce. Thus, we investigate the effectiveness of OXnet given different annotation granularities. With the results in Table 2, we find that: (1) Finer annotations always lead to better performance, and OXnet achieves consistent improvements as the annotation granularity becomes finer (rows 1 to 9); (2) Increasing fully labeled data benefits OXnet more (mAP improvements are 2.9 from row 4 to 5 and 4.5 from row 5 to 6) than RetinaNet (mAP improvements are 2.1 from row 1 to 2 and 3.8 from row 2 to 3); and (3) With less fully-labeled data and more weakly-labeled data, OXnet can achieve performance comparable to RetinaNet (row 2 vs. row 4, row 3 vs. row 5). These findings corroborate OXnet's effectiveness in utilizing as much available supervision as possible. Moreover, unlabeled data are easy to acquire without labeling burden, and weakly-labeled data can also be efficiently obtained with natural language processing methods [26, 9, 1]. Therefore, we believe OXnet could serve as a promisingly feasible and general approach to real-world clinical applications.

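| Method | Fully labeled | Weakly labeled | Unlabeled | mAP | AP$_{40}$ | AP$_{75}$ | AP$_S$ | AP$_M$ | AP$_L$ |
|---|---|---|---|---|---|---|---|---|---|
| RetinaNet | 682 | 0 | 0 | 12.5 | 18.9 | 5.5 | 7.0 | 16.4 | 18.1 |
| RetinaNet | 1372 | 0 | 0 | 14.6 | 21.8 | 6.5 | 5.8 | 10.4 | 19.5 |
| RetinaNet | 2725 | 0 | 0 | 18.4 | 27.7 | 7.5 | 8.0 | 16.7 | 25.4 |
| OXnet (Ours) | 682 | 13281 | 0 | 14.9 | 22.6 | 6.6 | 8.4 | 16.8 | 20.3 |
| OXnet (Ours) | 1372 | 12591 | 0 | 17.8 | 26.9 | 8.0 | 6.8 | 16.2 | 24.4 |
| OXnet (Ours) | 2725 | 11238 | 0 | 22.3 | 32.4 | 10.3 | 9.6 | 21.8 | 31.4 |
| OXnet (Ours) | 2725 | 8505 | 2733 | 21.9 | 32.0 | 10.0 | 9.7 | 21.6 | 29.9 |
| OXnet (Ours) | 2725 | 2733 | 8505 | 21.2 | 31.0 | 10.0 | 9.0 | 22.2 | 29.0 |
| OXnet (Ours) | 2725 | 0 | 11238 | 20.7 | 30.3 | 9.4 | 8.3 | 20.7 | 28.3 |

Table 2: Results under different ratios of annotation granularities. The second to fourth columns give the number of images used at each annotation granularity.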
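The mAP gains quoted in Sec. 3.3 can be re-derived from the mAP column of Table 2. The short Python check below does this arithmetic for the rows in which the remaining training images are weakly labeled (OXnet) or unused (RetinaNet); the dictionary and function names are only illustrative.

```python
# mAP values copied from Table 2, keyed by the number of fully labeled images.
retinanet_map = {682: 12.5, 1372: 14.6, 2725: 18.4}
oxnet_map = {682: 14.9, 1372: 17.8, 2725: 22.3}


def gains(map_by_size):
    """mAP gain between consecutive fully-labeled set sizes, to one decimal."""
    sizes = sorted(map_by_size)
    return [
        round(map_by_size[b] - map_by_size[a], 1) for a, b in zip(sizes, sizes[1:])
    ]


retinanet_gains = gains(retinanet_map)  # rows 1 -> 2 and 2 -> 3
oxnet_gains = gains(oxnet_map)          # rows 4 -> 5 and 5 -> 6

# Gap between OXnet with fewer boxes and RetinaNet with roughly twice as many.
gap_row2_vs_row4 = round(oxnet_map[682] - retinanet_map[1372], 1)
gap_row3_vs_row5 = round(oxnet_map[1372] - retinanet_map[2725], 1)
```

The per-step gains match the figures in the text, and the two gaps show what "comparable" means in finding (3): OXnet with 682 boxes is 0.3 mAP above RetinaNet with 1372, while OXnet with 1372 boxes is 0.6 mAP below RetinaNet with 2725.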

4 Conclusion

We present OXnet, a deep omni-supervised learning approach for thoracic disease detection from chest X-rays. OXnet simultaneously utilizes well-annotated, weakly-annotated, and unlabeled data in a unified framework. Extensive experiments have been conducted, and OXnet has demonstrated superiority in effectively utilizing various granularities of annotations. In summary, the proposed OXnet has proven to be a promisingly feasible and general solution to real-world applications by leveraging as much available supervision as possible.

Acknowledgement.

This work was supported by the Key-Area Research and Development Program of Guangdong Province, China (2020B010165004), the Hong Kong Innovation and Technology Fund (Project No. ITS/311/18FP and Project No. ITS/426/17FP), and the National Natural Science Foundation of China with Project No. U1813204.


  • [1] A. Bustos, A. Pertusa, J. Salinas, and M. de la Iglesia-Vayá (2020) PadChest: a large chest x-ray image dataset with multi-label annotated reports. MedIA 66, pp. 101797.
  • [2] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255.
  • [3] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338.
  • [4] T. Gabruseva, D. Poplavskiy, and A. Kalinin (2020) Deep learning for automatic pneumonia detection. In CVPR Workshops, pp. 350–351.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
  • [6] R. Huang, J. A. Noble, and A. I. Namburete (2018) Omni-supervised learning: scaling up to large unlabelled medical datasets. In MICCAI, pp. 572–580.
  • [7] Y. Huang, W. Liu, X. Wang, Q. Fang, R. Wang, Y. Wang, H. Chen, H. Chen, D. Meng, and L. Wang (2020) Rectifying supporting regions with mixed and active supervision for rib fracture recognition. IEEE TMI 39 (12), pp. 3843–3854.
  • [8] M. Ilse, J. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. In ICML, pp. 2127–2136.
  • [9] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. (2019) CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In AAAI, Vol. 33, pp. 590–597.
  • [10] S. Laine and T. Aila (2017) Temporal ensembling for semi-supervised learning. In ICLR.
  • [11] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang (2020) Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. In NeurIPS.
  • [12] Z. Li, C. Wang, M. Han, Y. Xue, W. Wei, L. Li, and L. Fei-Fei (2018) Thoracic disease identification and localization with limited supervision. In CVPR, pp. 8290–8299.
  • [13] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125.
  • [14] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, pp. 2980–2988.
  • [15] J. Liu, G. Zhao, Y. Fei, M. Zhang, Y. Wang, and Y. Yu (2019) Align, attend and locate: chest x-ray diagnosis via contrast induced attention network with limited supervision. In CVPR, pp. 10632–10641.
  • [16] L. Luo, L. Yu, H. Chen, Q. Liu, X. Wang, J. Xu, and P. Heng (2020) Deep mining external imperfect data for chest x-ray disease screening. IEEE TMI 39 (11), pp. 3583–3594.
  • [17] X. Ouyang, S. Karanam, Z. Wu, T. Chen, J. Huo, X. S. Zhou, Q. Wang, and J. Cheng (2020) Learning hierarchical attention for weakly-supervised chest x-ray abnormality localization and diagnosis. IEEE TMI.
  • [18] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, Vol. 32, pp. 8026–8037.
  • [19] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He (2018) Data distillation: towards omni-supervised learning. In CVPR, pp. 4119–4128.
  • [20] Y. Shi, X. Yu, K. Sohn, M. Chandraker, and A. K. Jain (2020) Towards universal representation learning for deep face recognition. In CVPR, pp. 6817–6826.
  • [21] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X. Ding (2020) Embracing imperfect datasets: a review of deep learning solutions for medical image segmentation. MedIA 63, pp. 101693.
  • [22] P. Tang, X. Wang, X. Bai, and W. Liu (2017) Multiple instance detection network with online instance classifier refinement. In CVPR, pp. 2843–2851.
  • [23] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. NeurIPS 30, pp. 1195–1204.
  • [24] L. Venturini, A. T. Papageorghiou, J. A. Noble, and A. I. Namburete (2020) Uncertainty estimates as data selection criteria to boost omni-supervised learning. In MICCAI, pp. 689–698.
  • [25] D. Wang, Y. Zhang, K. Zhang, and L. Wang (2020) FocalMix: semi-supervised learning for 3d medical image detection. In CVPR, pp. 3951–3960.
  • [26] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In CVPR, pp. 2097–2106.
  • [27] Y. Wang, K. Zheng, C. Chang, X. Zhou, Z. Zheng, L. Huang, J. Xiao, L. Lu, C. Liao, and S. Miao (2021) Knowledge distillation with adaptive asymmetric label sharpening for semi-supervised fracture detection in chest x-rays. In IPMI.
  • [28] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In ECCV, pp. 499–515.
  • [29] M. Xu, H. Wang, B. Ni, Q. Tian, and W. Zhang (2020) Cross-domain detection via graph-induced prototype alignment. In CVPR, pp. 12355–12364.
  • [30] H. Yang, X. Zhang, F. Yin, and C. Liu (2018) Robust classification with convolutional prototype learning. In CVPR, pp. 3474–3482.
  • [31] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In CVPR, pp. 2921–2929.
  • [32] Y. Zhou, H. Chen, H. Lin, and P. Heng (2020) Deep semi-supervised knowledge distillation for overlapping cervical cell instance segmentation. In MICCAI, pp. 521–531.
  • [33] Y. Zhou, T. Zhou, T. Zhou, H. Fu, J. Liu, and L. Shao (2021) Contrast-attentive thoracic disease recognition with dual-weighting graph reasoning. IEEE TMI.

5 Supplementary Materials

| Pathology | Images (Train) | Images (Val) | Images (Test) | Annotations (Train) | Annotations (Val) | Annotations (Test) |
|---|---|---|---|---|---|---|
| Aortic Calcification | 900 | 341 | 825 | 939 | 348 | 846 |
| Cardiomegaly | 1098 | 388 | 1003 | 1098 | 388 | 1003 |
| Fracture | 710 | 282 | 617 | 1893 | 707 | 1635 |
| Pleural Effusion | 2344 | 855 | 1985 | 2899 | 1064 | 2500 |
| Mass | 479 | 179 | 487 | 531 | 196 | 532 |
| Nodule | 1832 | 696 | 1711 | 2777 | 1131 | 2604 |
| Pneumonia | 2438 | 932 | 2153 | 3477 | 1307 | 3111 |
| Pneumothorax | 1377 | 422 | 1030 | 1508 | 478 | 1154 |
| Tuberculosis | 4455 | 1550 | 3899 | 7078 | 2525 | 6174 |

Table 3: Number of images with abnormalities and number of bounding boxes in the training, validation, and testing sets.

Figure 4: More samples of global attentions (i.e., CAM [31]) and local attentions. It can be seen that: (a) local attention often helps refine CAM (rows 1 and 2), but (b) sometimes CAM covers more accurate lesion regions (row 3); (c) joint learning by DAA could refine both attentions (row 4); and (d) CAM covers unnecessarily large regions for very small lesions, and DAA could lead to an averaged result of both attentions (rows 5 and 6).