Chest X-ray is one of the most common radiology exams used for screening of lung diseases. X-ray is economical and can be performed with minimal procedural steps. Moreover, each scan can reveal multiple suspected pathologies such as tuberculosis and pneumonia. Computer-Aided Detection (CADe) and Diagnosis (CADx) have been a major research focus in medicine, and CAD for chest X-ray could become a cost-effective assistive tool for radiologists.
Recent advances in artificial intelligence and machine learning have demonstrated that deep learning technologies excel at various X-ray analysis tasks, including image classification [1, 2, 3], NLP-based analysis, and localisation. The data-driven nature of deep learning benefits from the increasing volume of publicly accessible medical imaging datasets, such as CT, MRI, and X-ray [6, 1]. Wang et al. introduced a large-scale collection of chest X-rays annotated with the absence or presence of 14 lung diseases. Several methods explored the use of ResNet and DenseNet [3, 2], pre-trained on ImageNet, to classify the 14 chest-related diseases.
However, a few challenges may obstruct the improvement gained from direct transfer learning from general image classification tasks to detecting lung diseases in chest X-rays and, more generally, in the wider medical imaging domain. First, unlike the categories in ImageNet, many of which come from different branches of WordNet, the high visual similarity among a wealth of pathologies in lung X-rays can make them difficult to interpret and distinguish. Second, the presence of patterns from several potential pathologies in one medical scan requires the model to learn a huge number of possible label sets, which is exponential in the size of the label space ($2^L$ for $L$ labels). Third, issues such as class imbalance among disease labels and overwhelming numbers of normal samples cannot be fully addressed by simply using a sigmoid with binary cross-entropy loss [1, 3], as is done in standard transfer-learning-based methodologies. These challenges drive a need for innovative learning mechanisms that consider both the subtle differences among classes and the multi-label nature of the task.
This paper introduces two contributions to detect lung diseases from chest X-rays. First, a fine-grained classification approach based on bilinear pooling is adopted to learn discriminative representations. Second, a neural network loss named Multi-label SoftMax Loss (MSML) is proposed that captures the characteristics of multi-label learning. We further combine the idea of MSML with bilinear pooling to propose a novel mechanism for multi-label learning in a deep learning context.
Formally speaking, let $\mathcal{Y} = \{1, \dots, L\}$ be the finite set of labels and $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ be the training set, where $\mathbf{x}_i \in \mathcal{X}$ denotes an input and $\mathbf{y}_i$ is its relevant label vector. The label $\mathbf{y}_i \in \{0, 1\}^{L}$, where $y_{i,j} = 1$ if the $j$-th label is present and $y_{i,j} = 0$ otherwise. $\mathcal{X} \subseteq \mathbb{R}^{d}$ denotes the input space of $d$-dimensional feature vectors. The goal of our problem is to learn a multi-label hypothesis $f(\cdot; \theta)$ with parameters $\theta$ that outputs accurate predictions for each class.
2.1 Multi-Label Learning Loss
The error function of the hypothesis $f$ on the training set is defined as $E = \sum_{i=1}^{N} E_i$, where $E_i$ is the error of the hypothesis on the $i$-th input sample. Using the sigmoid as the activation function in the classification layer, $E_i$ can be defined as
$$E_i = -\sum_{j=1}^{L} \left[ y_{i,j} \log p_{i,j} + (1 - y_{i,j}) \log (1 - p_{i,j}) \right], \quad (1)$$
where $p_{i,j} = \sigma(o_{i,j})$ is the sigmoid output of the input feature on the $j$-th class, which takes a value between 0 and 1. During training, an optimisation method such as stochastic gradient descent (SGD) can be directly applied to learn from the multi-label annotations.
Our goal is to minimise the error on the training set for the model $f$. To examine the gradient being backpropagated, we compute the derivative of the error with respect to each class output using the chain rule; with $\sigma'(o) = \sigma(o)(1 - \sigma(o))$, this gives $\partial E_i / \partial o_{i,j} = p_{i,j} - y_{i,j}$.
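As a minimal numerical sketch of this derivation (generic names, not the paper's code), the gradient of the per-class sigmoid cross-entropy with respect to a logit reduces to $p - y$, which we can verify against a finite-difference estimate:

```python
import math

def sigmoid(o):
    return 1.0 / (1.0 + math.exp(-o))

def bce(o, y):
    # Per-class binary cross-entropy on the sigmoid output p = sigma(o).
    p = sigmoid(o)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_grad(o, y):
    # Analytic gradient dE/do = p - y from the chain rule above.
    return sigmoid(o) - y

# Finite-difference check that the analytic gradient matches.
o, y, eps = 0.7, 1.0, 1e-6
numeric = (bce(o + eps, y) - bce(o - eps, y)) / (2 * eps)
assert abs(numeric - bce_grad(o, y)) < 1e-6
```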
However, the error function in (1) only considers individual class discrimination and does not explicitly consider interdependencies among target classes. We therefore propose a novel loss function for deep-learning-based multi-label learning that explicitly considers the relationships among multiple labels. In this paper, the property of leveraging label correlations is addressed by the Multi-label Softmax Loss (MSML) error function as follows:
$$E_i = \frac{1}{|P_i|} \sum_{p \in P_i} -\log \frac{e^{o_{i,p}}}{e^{o_{i,p}} + \sum_{q \in N_i} e^{o_{i,q}}} \quad (2)$$
$P_i = \{j \mid y_{i,j} = 1\}$ defines the positive class indices of the current sample, and $N_i = \{j \mid y_{i,j} = 0\}$ is its complement. $|\cdot|$ measures cardinality and is used for normalisation. MSML is bootstrapped from the softmax function. Each positive response is fed to the exponential function; the denominator contains the activations from all negative outputs plus the positive activation from the numerator.
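For concreteness, here is a minimal sketch of the MSML error for a single sample, assuming raw logits and a 0/1 label list as inputs (generic names, not the paper's implementation):

```python
import math

def msml(logits, labels):
    """MSML sketch for one sample: for each positive class p, a softmax is
    formed over p together with all negative classes, the negative
    log-likelihood of p is taken, and the result is averaged over the
    positive classes."""
    P = [j for j, y in enumerate(labels) if y == 1]
    N = [j for j, y in enumerate(labels) if y == 0]
    neg_sum = sum(math.exp(logits[q]) for q in N)
    loss = 0.0
    for p in P:
        pos = math.exp(logits[p])
        loss += -math.log(pos / (pos + neg_sum))
    return loss / len(P)
```

Note that with a single positive class this reduces to ordinary softmax cross-entropy, which is the sense in which MSML is "bootstrapped" from the softmax function.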
2.2 Bilinear Pooling
Bilinear-based methods with deep convolutional neural networks have achieved good results on several fine-grained tasks, such as image recognition and video classification. The idea is that in one of the pooling layers, an outer product is computed at each spatial location between the feature maps of two networks to generate second-order, statistically discriminative local feature representations. Bilinear pooling can be calculated in a pooling layer as follows:
$$\phi(\mathcal{I}) = \mathrm{vec}\!\left( \sum_{l \in \mathcal{L}} f_A(l, \mathcal{I}) \otimes f_B(l, \mathcal{I}) \right), \quad (3)$$
where $f_A(l, \mathcal{I})$ is a local feature descriptor at location $l$ from one of the pooling layers in network $A$, $\otimes$ is the outer product of two vectors, $\mathrm{vec}(\cdot)$ is the vectorisation operation, and $\phi(\mathcal{I})$ is the resulting bilinear feature.
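A small sketch of this operation, assuming the two streams' descriptors are stacked as `(L, c)` arrays of `L` spatial locations (names are illustrative, not from the paper):

```python
import numpy as np

def bilinear_pool(fa, fb):
    # fa, fb: (L, c) arrays of local descriptors from two networks at
    # L spatial locations. At each location the outer product of the two
    # descriptors is taken, sum-pooled over locations, and vectorised.
    pooled = np.zeros((fa.shape[1], fb.shape[1]))
    for l in range(fa.shape[0]):
        pooled += np.outer(fa[l], fb[l])
    return pooled.reshape(-1)  # vectorised second-order feature
```

In practice the loop is equivalent to a single matrix product `fa.T @ fb` followed by flattening, which is how bilinear pooling is usually implemented efficiently.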
2.3 Fine-Grained Multi-Label Learning
As shown in prior work, CNNs that are symmetrically initialised with identical parameters may provide efficiency during training. However, the symmetric activations may lead to suboptimal solutions, since the model does not explore the different solution spaces arising from different CNNs. Therefore, in our proposed model we design separate auxiliary losses for the two CNN components to break the symmetry between the two feature extractors used for bilinear pooling.
We give an overview of our architecture for fine-grained multi-label learning, shown in Fig. 1. Two separate CNNs are initialised with identical pre-trained parameters. The first CNN stream is attached to a normal cross-entropy loss based on the sigmoid function. The second CNN's feature maps are fed into independent classifiers with the MSML loss, which focuses on learning label correlations. When outputs from the bilinear pooling layer are available, a fine-grained cross-entropy loss (FCE) can be applied at the fine-grained level. The bilinear pooling layer operates on the last convolutional layers of the base architectures, and signed square-root and $\ell_2$ normalisation are applied to the average-pooled bilinear feature. A further convolutional layer is added before the features are fed to the classification layer with the FCE loss. Our model is trained using image-level annotations only, for all classes. The fine-grained multi-label loss to optimise is the weighted sum of these losses.
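The signed square-root and $\ell_2$ normalisation applied to the bilinear feature can be sketched as follows (a plain-Python illustration on a flat vector, not the paper's code):

```python
import math

def signed_sqrt_l2(v):
    # Signed square root of each element: copysign keeps the sign while
    # sqrt compresses the magnitude of the second-order statistics.
    s = [math.copysign(math.sqrt(abs(x)), x) for x in v]
    # l2 normalisation (guarding against an all-zero vector).
    norm = math.sqrt(sum(x * x for x in s)) or 1.0
    return [x / norm for x in s]
```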
Dataset: To validate the effectiveness of our methods, we used the ChestX-ray14 dataset introduced by the NIH. This dataset contains 112,120 frontal-view X-rays with 14 disease labels (multiple labels per image), which are obtained from the associated radiology reports. We follow the same settings as in prior work, where the entire dataset is randomly split into three sets: 70% for training, 10% for validation, and 20% for testing. The split is done at the patient level so that there is no overlap between data folds.
Network and training: We use various CNN architectures (ResNet, DenseNet, VGG) with parameters pre-trained on the 2012 ImageNet Challenge as the base CNN components for the evaluated frameworks. The ADAM optimiser is used to train the networks, and the learning rate is decayed by a factor of 10 every three epochs. Weights $\alpha$ and $\beta$ balance the losses of the proposed multi-label fine-grained network. Training images are random crops taken from rescaled original images, normalised by the dataset mean and standard deviation. All reported metrics are computed on the test set.
Evaluation Metrics: Two metrics are used to evaluate the performance of the various frameworks. Area under the ROC curve (AUC): this metric is used in several chest X-ray classification works [1, 3]. The ROC curve plots sensitivity (true positive rate) on the vertical axis against 1 − specificity (false positive rate) on the horizontal axis, and the area under it is computed per class. There is a strong class-imbalance issue for some classes such as Hernia, which comprises about 2% of the whole dataset. To track performance on the majority classes, we propose the Weighted Area under the ROC curve (W-AUC), in which each class AUC is weighted by the class's proportion across the dataset. To gain insight into how the model performs on normality versus abnormality, we further investigate the AUC of disease vs. disease (D-AUC) and disease vs. non-disease (N-AUC) on the test set.
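The W-AUC aggregation described above can be sketched as follows, assuming per-class AUCs and label counts are already available (`weighted_auc` is a hypothetical helper, not the paper's implementation):

```python
def weighted_auc(aucs, counts):
    # Each per-class AUC is weighted by the class's share of labels
    # across the dataset, so majority classes dominate the score.
    total = sum(counts)
    return sum(a * n / total for a, n in zip(aucs, counts))
```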
3.1 Quantitative Results
We use the following acronyms. R18-CE: ResNet18 trained with cross-entropy (CE) loss. R18-BL-CE: ResNet18 with self-bilinear pooling and CE loss. R18-F-MSML: the proposed fine-grained multi-label architecture described in Sec. 2.3, with ResNet18 as the base architecture for the CNN components. Analogous acronyms are used for D121 (DenseNet-121).
Using the proposed method on the ChestX-ray14 dataset, we examine the class-average AUC and W-AUC. The main results are summarised in Table 1. In most instances, we observe that the proposed fine-grained multi-label architecture provides consistent performance improvements when fused with the baseline network. The proposed method surpasses the performance of fusion with R18-BL or D121-BL, which demonstrates that MSML provides more independence between the two CNN components in the bilinear-based model.
Somewhat surprisingly, the ensemble of R18 and D121 does not provide a higher AUC (first row under Ensemble Analysis), which indicates that the high-level abstract representations learned by the two models may be strongly correlated. This likely occurs because the baseline results of R18 and D121 show an insignificant performance difference (0.8239 vs. 0.8354) relative to their difference in model complexity (18 vs. 121 layers). The results in the last two rows indicate that performance can be improved further with a base architecture that works better with the bilinear pooling function [13, 14].
| R18 + R18-BL | 0.8254 | 0.5811 | 0.7756 | 0.8613 |
| R18 + R18-F-MSML | 0.8388 | 0.5892 | 0.7907 | 0.8733 |
| D121 + D121-BL | 0.8364 | 0.5896 | 0.7926 | 0.8679 |
| D121 + D121-F-MSML | 0.8462 | 0.5950 | 0.8011 | 0.8785 |
| R18 + D121 | 0.8355 | 0.5883 | 0.7930 | 0.8653 |
| R18 + VGG-F-MSML | 0.8438 | 0.5932 | 0.8011 | 0.8743 |
| D121 + VGG-F-MSML | 0.8537 | 0.5952 | 0.8060 | 0.8827 |
3.2 Training Strategy Analysis
In this part we examine three training strategies for the proposed method (to facilitate the verification process, we use ResNet18 for this experiment). Local: train the two CNN components with the CE and MSML losses, then fine-tune with FCE through bilinear pooling. Local-Fixed: same as Local, but the parameters of the two CNN components are fixed during the final-stage fine-tuning with FCE. Global: all parameters are trained simultaneously. We present the performance of these training strategies in column three. Overall, Global yields the best performance of the three, from which we infer that a global training strategy is beneficial for multi-loss training.
3.3 Comparative Study with Other Methods
In this section we show that the proposed method achieves state-of-the-art performance for chest disease detection. Recent work provided state-of-the-art performance on the ChestX-ray14 dataset by fine-tuning DenseNet-121 on the dataset. Our work shows that a fine-grained method used along with a baseline DenseNet can achieve better results. Moreover, our method with a much smaller architecture (R18) achieves approximately the same performance as the one using D121 (0.8438 vs. 0.8413). Yao et al. consider label dependency with a recurrent model. Both methods use similar label information, but ours achieves better performance with a more flexible model architecture by adding a new loss that explicitly exploits label correlations, rather than embedding the label information into a recurrent model.
In this article, we demonstrate the effectiveness of a multi-label loss function (MSML) for DCNN-based multi-label lung disease classification on X-ray inputs. We propose two key contributions: (i) the use of a fine-grained classification method that learns discriminative representations, and (ii) a novel MSML for deep-learning-based models that helps to leverage class dependencies. MSML can be interpreted as decomposing the multi-label learning problem into a number of independent classification problems, learning a separate distribution for the presence of each class with respect to all absent classes. The softmax property at the root of MSML is essential to facilitate learning in an exponential-sized output space. This property makes it appealing as a loss for multi-label learning: it predicts multiple labels while alleviating over-fitting to negative classes, because the outputs of absent classes are suppressed with respect to each present class. In medical data, the presence of multiple pathologies and class imbalance are very common; the proposed MSML handles both of these problems explicitly and embeds this capability into the network. We have illustrated the effectiveness of this approach through an improvement in AUC-ROC score for disease classification on the ChestX-ray14 dataset. Since similar problems occur in other real-world medical data, the proposed loss function provides a new direction for improved performance across the wider medical domain.
-  Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers, “Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in . IEEE, 2017, pp. 3462–3471.
-  Li Yao, Eric Poblenz, Dmitry Dagunts, Ben Covington, Devon Bernard, and Kevin Lyman, “Learning to diagnose from scratch by exploiting dependencies among labels,” arXiv preprint arXiv:1710.10501, 2017.
-  Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al., “Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning,” arXiv preprint arXiv:1711.05225, 2017.
-  Hoo-Chang Shin, Kirk Roberts, Le Lu, Dina Demner-Fushman, Jianhua Yao, and Ronald M Summers, “Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2497–2506.
-  Christian Payer, Darko Štern, Horst Bischof, and Martin Urschler, “Regressing heatmaps for multiple landmark localization using cnns,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 230–238.
-  Samuel G Armato, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, et al., “The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans,” Medical physics, vol. 38, no. 2, pp. 915–931, 2011.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2014.
-  Ryan Farrell, Om Oza, Ning Zhang, Vlad I Morariu, Trevor Darrell, and Larry S Davis, “Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 161–168.
-  Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji, “Bilinear CNN models for fine-grained visual recognition,” in ICCV, 2015.
-  David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams, “Learning internal representations by error propagation,” Tech. Rep., California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
-  ZongYuan Ge, Chris McCool, Conrad Sanderson, Peng Wang, Lingqiao Liu, Ian Reid, and Peter Corke, “Exploiting temporal information for dcnn-based fine-grained object classification,” arXiv preprint arXiv:1608.00486, 2016.
-  Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  Zongyuan Ge, Sergey Demyanov, Behzad Bozorgtabar, Mani Abedini, Rajib Chakravorty, Adrian Bowling, and Rahil Garnavi, “Exploiting local and generic features for accurate skin lesions classification using clinical and dermoscopy imaging,” in Biomedical Imaging (ISBI 2017), 2017 IEEE 14th International Symposium on. IEEE, 2017, pp. 986–990.
-  Tsung-Yu Lin and Subhransu Maji, “Improved bilinear pooling with cnns,” arXiv preprint arXiv:1707.06772, 2017.