Chest X-rays Classification: A Multi-Label and Fine-Grained Problem

07/19/2018 ∙ by ZongYuan Ge, et al. ∙ IBM ∙ Monash University

The widely used ChestX-ray14 dataset addresses an important medical image classification problem and has the following caveats: 1) many lung pathologies are visually similar, 2) a variety of diseases, including lung cancer, tuberculosis, and pneumonia, can be present in a single scan, i.e. multiple labels, and 3) healthy images are far more frequent than diseased samples, creating imbalanced data. These properties are common in the medical domain. Existing literature uses state-of-the-art DenseNet/ResNet models with transfer learning, where the output neurons of the networks are trained for individual diseases to cater for multiple disease labels in each image. However, most of these methods do not consider the relationships between multiple classes. In this work we propose a novel error function, Multi-label Softmax Loss (MSML), to specifically address the properties of multiple labels and imbalanced data. Moreover, we design a deep network architecture based on the fine-grained classification concept that incorporates MSML. We evaluate the proposed method on various network backbones and show consistent improvements in AUC-ROC scores on the ChestX-ray14 dataset. The proposed error function provides a new method to gain improved performance across wider medical datasets.


1 Introduction

Chest X-ray is one of the most common radiology exams used for screening of lung diseases. X-ray is economical and can be performed with minimal procedural steps. Moreover, each scan can detect multiple suspected pathologies, such as tuberculosis and pneumonia. Computer-Aided Detection (CADe) and Diagnosis (CADx) have been a major research focus in medicine. CAD for chest X-ray could potentially become a cost-effective assistive tool for radiologists.

Recent advances in artificial intelligence and machine learning have demonstrated that deep learning technologies excel at solving various X-ray analysis tasks involving image classification [1, 2, 3], NLP-based analysis [4] and localisation [5]. The data-driven nature of deep learning benefits from the increasing volume of publicly accessible medical imaging datasets, such as CT, MRI and X-ray [6, 1]. Wang et al. [1] introduced a large-scale collection of chest X-rays annotated with the absence or presence of 14 lung diseases. Several methods explored the use of ResNet [1] and DenseNet [3, 2], pre-trained on ImageNet [7], to classify the 14 chest-related diseases.

However, a few challenges may limit the improvement gained from directly transferring models from general image classification tasks to detecting lung diseases in chest x-rays and, more generally, in the wider medical imaging domain. First, unlike the categories in ImageNet, many of which come from different branches of WordNet, the many pathologies in lung X-rays have high visual similarity and can be difficult to interpret and distinguish. Second, the presence of patterns from various potential pathologies in one medical scan requires the model to learn a huge number of possible label sets for prediction, which is exponential in the size of the label space ($2^C$ for $C$ labels). Third, issues such as the class imbalance among disease labels and the overwhelming number of normal samples cannot be fully taken into consideration by just using sigmoid with binary cross-entropy loss [1, 3], as is done in standard transfer-learning-based methodologies. These challenges drive a need for innovative learning mechanisms that consider both the subtle differences among classes and the multi-label nature of the task.

This paper introduces two contributions to detect lung diseases from chest x-ray.

  1. First, we provide a fine-grained perspective on this problem, where fine-grained image classification is defined as the problem of categorising visually similar sub-categories [8]. This motivates us to explore and re-adapt the bilinear pooling method [9] from the fine-grained classification field.

  2. Second, a neural network loss named Multi-label SoftMax Loss (MSML) is proposed that captures the characteristics of multi-label learning. We further combine the idea of MSML with bilinear pooling to propose a novel mechanism for multi-label learning in a deep learning context.

2 Methods

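Formally speaking, let $\mathcal{L} = \{1, \dots, C\}$ be the finite set of labels and $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ be the training set, where $x_i \in \mathcal{X}$ denotes an input and $y_i$ is its relevant label. The label $y_i = [y_i^1, \dots, y_i^C]$, where $y_i^c = 1$ if the $c$-th label is present and $y_i^c = 0$ otherwise. $\mathcal{X} = \mathbb{R}^d$ denotes the input space of $d$-dimensional feature vectors. The goal of our problem is to learn a multi-label hypothesis $h$ with parameters $\theta$ that outputs accurate predictions for each class.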

2.1 Multi-Label Learning Loss

The error function of the hypothesis on the training set is defined as $E = \sum_{i} E_i$, where $E_i$ is the error of the hypothesis on the $i$-th input sample. Using the sigmoid $\sigma(z) = 1/(1 + e^{-z})$ as the activation function in the classification layer, $E_i$ is defined in terms of $p_i^c$, the sigmoid output for the input feature $x_i$ on the $c$-th class, which takes a value between 0 and 1. During training, an optimisation method such as stochastic gradient descent (SGD) can be applied directly to learn from the multi-label annotations.

Our goal is to minimise the error on the training set for the model. To look into the gradient being backpropagated, we compute the derivative of the error with respect to each class using the chain rule, following [10]:

(1)
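As a minimal sketch of the per-sample error and the gradient it backpropagates, the snippet below assumes that the per-sample error is the sigmoid binary cross-entropy that the Introduction attributes to standard methods. That choice, and the names `sigmoid`, `bce_error` and `bce_grad`, are assumptions made for illustration rather than the exact formulation used here. Under that assumption the chain rule collapses to `p - y` per class, so each class receives a gradient that depends only on its own output.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function, mapping activations to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_error(z, y):
    """Per-sample error: binary cross-entropy summed over classes (assumed form)."""
    p = sigmoid(z)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def bce_grad(z, y):
    """Gradient of bce_error with respect to the pre-sigmoid activations z."""
    return sigmoid(z) - y

z = np.array([2.0, -1.0, 0.5, 0.0])  # hypothetical class activations
y = np.array([1.0, 0.0, 1.0, 0.0])   # multi-hot label vector
grad = bce_grad(z, y)
```

Each entry of `grad` involves only the matching entry of `z`, which is the per-class independence that motivates a loss coupling the labels.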

However, the error function in (1) only considers individual class discrimination and does not consider interdependencies among target classes explicitly. Therefore, we propose a novel loss function for deep-learning-based multi-label learning that considers the relationship between multiple labels explicitly. In this paper, label correlations are leveraged by the MSML (Multi-label Softmax Loss) error function as follows:

(2)

$P_i$ defines the positive class indices of the current sample, and $N_i$ is its complement (the classes with $y_i^c = 0$). $|P_i|$ measures the cardinality of $P_i$ and is used for normalisation. MSML is bootstrapped from the softmax function. Each positive response is fed to the exponential function. The denominator contains the activations from all negative outputs together with the positive activation from the numerator.

We show that MSML is simple to evaluate and differentiate. Using a similar approach as for the standard error, we can backpropagate the gradient from the proposed MSML. The overall gradient is calculated as,

(3)
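The description of MSML above can be read as a softmax-style negative log-likelihood computed once per positive class, with only the negative classes as competitors. The sketch below follows that reading. It is an assumption-laden illustration: the use of a negative logarithm, the choice to feed pre-sigmoid activations `z` to the exponential, the zero loss for samples with no positive label, and the names `msml_loss` and `msml_grad` are all assumptions rather than details confirmed by the text.

```python
import numpy as np

def _logsumexp(v):
    """Numerically stable log(sum(exp(v))) for a non-empty 1-D array."""
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def msml_loss(z, y):
    """For each positive class c: -log(exp(z_c) / (exp(z_c) + sum over negatives of exp(z_k))),
    averaged over the positive classes of the sample."""
    z = np.asarray(z, dtype=float)
    pos = np.flatnonzero(np.asarray(y) == 1)
    neg = np.flatnonzero(np.asarray(y) == 0)
    if pos.size == 0:
        return 0.0  # assumption: a sample with no positive label contributes nothing
    terms = [_logsumexp(np.append(z[neg], z[c])) - z[c] for c in pos]
    return float(np.mean(terms))

def msml_grad(z, y):
    """Analytic gradient of msml_loss with respect to z."""
    z = np.asarray(z, dtype=float)
    pos = np.flatnonzero(np.asarray(y) == 1)
    neg = np.flatnonzero(np.asarray(y) == 0)
    grad = np.zeros_like(z)
    for c in pos:
        idx = np.append(neg, c)                  # competitors: all negatives plus c itself
        s = np.exp(z[idx] - _logsumexp(z[idx]))  # softmax restricted to those classes
        grad[idx] += s / pos.size
        grad[c] -= 1.0 / pos.size
    return grad

z = np.array([2.0, -1.0, 0.5, 0.0])  # hypothetical class activations
y = np.array([1, 0, 1, 0])           # multi-hot label vector
loss, grad = msml_loss(z, y), msml_grad(z, y)
```

In this reading, every negative activation appears in the gradient of every positive term, which is how the loss couples present and absent labels, while positive classes do not compete with one another.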

2.2 Bilinear Pooling

Bilinear-based methods [9] with deep convolutional neural networks have achieved good results on several fine-grained tasks, such as image recognition [9] and video classification [11]. The idea is that, in one of the pooling layers, an outer product is performed at each spatial location of two networks to generate second-order statistical discriminative local feature representations. The bilinear pooling can be calculated in a pooling layer as follows:

(4)

where $f_A$ is a local feature descriptor from one of the pooling layers in network $A$ (and $f_B$ likewise for network $B$), $\otimes$ is the outer product of two vectors, and $\mathrm{vec}(\cdot)$ is the vectorisation operation.
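A minimal NumPy sketch of bilinear pooling over two feature maps may help make the operation concrete. It assumes both networks produce maps on the same spatial grid, averages the per-location outer products, and then applies the signed square root and a normalisation step of the kind Section 2.3 mentions. The L2 form of that normalisation, the tensor layout, and the name `bilinear_pool` are assumptions for illustration.

```python
import numpy as np

def bilinear_pool(fa, fb, eps=1e-12):
    """fa: (H, W, Ca) and fb: (H, W, Cb) feature maps from two networks on the same grid.
    Returns a vector of length Ca * Cb."""
    h, w, _ = fa.shape
    # Outer product of the two local descriptors at each location, averaged over locations.
    b = np.einsum("hwi,hwj->ij", fa, fb) / (h * w)
    v = b.reshape(-1)                      # vectorisation
    v = np.sign(v) * np.sqrt(np.abs(v))    # signed square root
    return v / (np.linalg.norm(v) + eps)   # L2 normalisation (assumed form)

rng = np.random.default_rng(0)
fa = rng.standard_normal((7, 7, 8))  # hypothetical feature map from network A
fb = rng.standard_normal((7, 7, 6))  # hypothetical feature map from network B
feat = bilinear_pool(fa, fb)
```

The output dimension is the product of the two channel counts, which is why bilinear features are typically taken from the last convolutional layer, where the spatial grid is small.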

Figure 1: Architecture of the multi-label fine-grained network for chest x-ray disease detection. The MSML loss operates on each positive class of a sample and encourages the network to consider the independence between labels being present and absent while minimising the loss across classes.

2.3 Fine-Grained Multi-Label Learning

As shown in [9], CNNs that are symmetrically initialised with identical parameters may provide efficiency during training. However, the symmetric activations may lead to suboptimal solutions since the model doesn’t explore the different space arising from different CNNs. Therefore, in our proposed model we design separate auxiliary losses for two CNN components to break the symmetry between the two feature extractors used for bilinear pooling.

We give an overview of our architecture for fine-grained multi-label learning in Fig. 1. Two separate CNNs are initialised with identical pre-trained parameters. The first CNN stream is attached to a normal cross-entropy loss based on the sigmoid function. The second CNN’s feature maps are fed into independent classifiers with the MSML loss, which focuses on learning the label correlations. When the outputs from the bilinear pooling layers are available, a fine-grained cross-entropy loss (FCE) is applied at the fine-grained level. The bilinear pooling layer operates on the last convolutional layers of the base architectures; signed square-root normalisation and $\ell_2$ normalisation are applied to the average-pooled bilinear feature layer. A further convolutional layer is added before the features are fed to the classification layer with the FCE loss. Our model is trained using image-level annotations only for all classes. The fine-grained multi-label loss to optimise is the weighted sum of these two:

(5)

3 Experiments

Dataset: To validate the effectiveness of our methods, we used the ChestX-ray14 dataset introduced by NIH [1]. This dataset contains 112,120 frontal-view x-rays with 14 disease labels (multiple labels per image), which are obtained from the associated radiology reports. We follow the same settings as in [1], where the entire dataset is randomly split into three sets: 70% for training, 10% for validation and 20% for testing. The split is done at the patient level so that there is no overlap between the data folds.

Network and training: We use various CNN architectures (ResNet, DenseNet, VGG) with parameters pre-trained on the 2012 ImageNet Challenge as the base CNN component for our evaluated frameworks. The ADAM [12] optimiser with initial learning rate of and is used to optimise the networks. The learning rate is decreased by a factor of 10 every three epochs. We use and to weight the proposed multi-label fine-grained network. Training images are random crops with size by from by rescaled original images. Input images’ value is rearranged to with mean and std norm. All the reported metrics are computed on the test set.
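The learning-rate decay described above can be sketched as a step schedule. The initial learning rate `1e-3` used here is a placeholder, not the value used in the experiments, and the function name `step_decay_lr` and the 0-indexed epoch convention are assumptions.

```python
def step_decay_lr(epoch, initial_lr, drop=0.1, every=3):
    """Learning rate for a 0-indexed epoch: multiplied by `drop` once per `every` epochs."""
    return initial_lr * drop ** (epoch // every)

# Placeholder initial rate; epochs 0-2 share one rate, epochs 3-5 the next, and so on.
schedule = [step_decay_lr(e, initial_lr=1e-3) for e in range(9)]
```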

Evaluation Metrics: Two metrics are used to evaluate the performance of the various frameworks. Area under the ROC curve (AUC): this is the metric used in several lung x-ray classification studies [1, 3]. The ROC curve is plotted with sensitivity ($TP/(TP+FN)$) on the vertical axis and $1 - \text{specificity}$ (the false positive rate, $FP/(FP+TN)$) on the horizontal axis. There is a strong class-imbalance issue for some classes, such as Hernia, which accounts for about 2% of the whole dataset. To track the performance of the majority, we propose to use the weighted area under the ROC curve (W-AUC): each class AUC is associated with a weighting factor according to its proportion in the dataset. To gain insight into how the model performs on normality versus abnormality, we further investigate the AUC of disease vs. disease (D-AUC) and disease vs. non-disease (N-AUC) on the test set.
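As a rough sketch of how a prevalence-weighted AUC could be computed, the snippet below evaluates the AUC of each class as the probability that a positive sample outranks a negative one, then combines the per-class values with weights proportional to each class's positive count. Normalising the weights to sum to one is an assumption and may differ from the weighting behind the reported W-AUC values; the names `auc` and `weighted_auc` and the toy data are hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a random positive is scored above a random negative (ties count 0.5).
    Requires at least one positive and one negative sample."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def weighted_auc(score_matrix, label_matrix):
    """Per-class AUCs and their prevalence-weighted combination (assumed normalisation)."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    label_matrix = np.asarray(label_matrix)
    n_classes = label_matrix.shape[1]
    per_class = np.array([auc(score_matrix[:, c], label_matrix[:, c]) for c in range(n_classes)])
    weights = label_matrix.sum(axis=0) / label_matrix.sum()
    return per_class, float(np.sum(weights * per_class))

# Toy data: 6 samples, 2 classes; class 0 is frequent, class 1 is rare.
labels = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 0], [0, 0]])
scores = np.array([[0.9, 0.5], [0.8, 0.2], [0.7, 0.3], [0.1, 0.4], [0.2, 0.6], [0.3, 0.1]])
per_class, w_auc = weighted_auc(scores, labels)
```

With this weighting the frequent class dominates the summary, so a poorly ranked rare class lowers the weighted value less than it lowers the plain class average.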

3.1 Quantitative Results

We use the following acronyms. R18-CE: ResNet18 trained with cross-entropy (CE) loss. R18-BL-CE: ResNet18 with self bilinear pooling and CE loss. R18-F-MSML: the proposed fine-grained multi-label architecture described in Sec. 2.3, with ResNet18 as the base architecture of the CNN components. Analogous acronyms are used for D121 (DenseNet-121).

Using the proposed method on the ChestX-ray14 dataset, we examine the class-average AUC and W-AUC. The main results are summarised in Table 1. In most instances, we observe that the proposed fine-grained multi-label architecture provides consistent performance improvements when its results are fused with the baseline network. The proposed method surpasses the performance of fusion with R18-BL or D121-BL, which demonstrates that MSML provides more independence between the two CNN components in the bilinear-based model.

Somewhat surprisingly, the ensemble of R18 and D121 doesn’t provide a higher AUC (first row under Ensemble Analysis), which indicates that the high-level abstract representations learned by those two models may be strongly correlated. We attribute this to the insignificant performance difference between the R18 and D121 baselines (0.8239 vs. 0.8354) relative to their model complexity (18 layers vs. 121 layers). The results in the last two rows indicate that we can further improve the performance with a base architecture that works better with the bilinear pooling function [13, 14].

Methods                (avg) AUC   W-AUC    D-AUC    N-AUC
Residual Network
R18                    0.8239      0.5800   0.7801   0.8552
R18-BL                 0.7680      0.5543   0.6825   0.8307
R18 + R18-BL           0.8254      0.5811   0.7756   0.8613
R18 + R18-F-MSML       0.8388      0.5892   0.7907   0.8733
Dense Network
D121                   0.8354      0.5888   0.7937   0.8651
D121-BL                0.8107      0.5770   0.7523   0.8529
D121 + D121-BL         0.8364      0.5896   0.7926   0.8679
D121 + D121-F-MSML     0.8462      0.5950   0.8011   0.8785
Ensemble Analysis
R18 + D121             0.8355      0.5883   0.7930   0.8653
R18 + VGG-F-MSML       0.8438      0.5932   0.8011   0.8743
D121 + VGG-F-MSML      0.8537      0.5952   0.8060   0.8827
Table 1: AUC and W-AUC results on 14 abnormalities from the ChestX-ray14 dataset

3.2 Training Strategy Analysis

In this part we examine three training strategies for the proposed method. To facilitate the verification process, we use ResNet18 for this experiment. Local: train the two CNN components with the CE and MSML losses, then fine-tune with FCE and bilinear pooling. Local-Fixed: same as Local, but the parameters of the two CNN components are fixed during the final-stage fine-tuning with FCE. Global: all parameters are trained simultaneously. We present the performance of these training strategies in the third column of Table 2. Overall, Global achieves the best performance of the three. We may infer that the global training strategy is beneficial to multi-loss training.

Training Strategy   Methods      AUC
Local               R18-F-MSML   0.8310
Local-Fixed         R18-F-MSML   0.8356
Global              R18-F-MSML   0.8388
Table 2: Training strategies for the proposed architecture on the ChestX-ray14 dataset

3.3 Comparable Study to Other Methods

In this section we show that the proposed method can achieve state-of-the-art performance for chest disease detection. Recent work in [3] provided state-of-the-art performance on the ChestX-ray14 dataset by fine-tuning DenseNet-121 on the dataset. Our work shows that if a fine-grained method is used along with a baseline DenseNet, it can achieve better results. Moreover, our method with a much smaller architecture (R-18) achieves approximately the same performance as the one using D121 (0.8438 vs. 0.8413). Yao et al. [2] consider label dependency with a recurrent model. Both approaches use similar label information, but our method achieves better performance with a more flexible model architecture by adding a new loss that explicitly exploits label correlations rather than embedding the label information into a recurrent model.

Methods                CNN architecture   AUC
Wang et al. [1]        R-50               0.7363
Yao et al. [2]         D121-LSTM          0.7980
Rajpurkar et al. [3]   D121               0.8413
Proposed               R-18/D-121         0.8438/0.8537
Table 3: Comparison with other methods on the ChestX-ray14 dataset

4 Conclusion

In this article, we demonstrate the effectiveness of learning a DCNN with a multi-label loss function (MSML) for multi-label lung disease classification on x-ray inputs. We make two key contributions: (i) the use of a fine-grained classification method that learns discriminative representations, and (ii) a novel MSML for deep-learning-based models that helps to leverage the class dependencies. MSML can be interpreted as decomposing the multi-label learning problem into a number of independent classification problems while learning separate distributions for the presence of each class with respect to all the absent classes. The softmax property rooted inside MSML is essential to facilitate the learning process in an exponential-sized output space. This property makes it appealing for multi-label space learning, where multiple labels are predicted, and also helps the model alleviate over-fitting to negative classes, because the outputs of absent classes are suppressed with respect to each present class. In medical data, the presence of multiple labels and data imbalance are very common. The proposed MSML can handle both of these problems explicitly and embeds this capability into the network. We have illustrated the effectiveness of this approach through an improvement of the AUC-ROC score in disease classification on the ChestX-ray14 dataset. Similar problems occur in other real-world medical data. Therefore, the proposed loss function provides a new direction to attain improved performance for wider medical data.

References