Medical image classification is critical in clinical practice (e.g., for early detection of diseases). However, it remains a challenging task due to large intra-class variations, blurred boundaries between abnormalities, inconclusive abnormality patterns, etc. Furthermore, it is common for a patient to suffer from multiple symptoms simultaneously, which present as several kinds of abnormalities and complex comorbidities in the medical images. Therefore, medical image interpretation is commonly a multi-labeled image classification process. Recently, supervised deep learning models have attained high performance thanks to large amounts of well-annotated training data. However, manual annotation of medical images by medical experts is time-consuming, while automatic annotation (e.g., automatically extracting labels from reports [Wang2017CVPR]) is fast but may introduce a considerable number of corrupted (incorrect) labels.
Many methods, such as model ensembles [CheXNeXt], weighted loss functions [DNetLoc], and label hierarchies [chen2019deep], have been widely used in multi-labeled medical image classification. However, dealing with corrupted labels of multi-labeled medical images has rarely been studied, although it is a basic issue for using automatically annotated labels. It was shown that regularization methods can hinder memorization while preserving the generalization ability of models [arpit2017closer], which is advantageous for tackling label corruption. Many known regularization methods (e.g., Mixup [MIXUP], Manifold Mixup [manifoldmixup]) were proposed for single-output tasks, but do not meet the needs of multi-output tasks [xu2019survey]. Besides, the existence of complex correlations among abnormalities was confirmed in the literature [graphx, Wang2017CVPR], which requires additional consideration in model training. Thus, multi-labeled medical image classification with corrupted labels is a challenging problem and requires further research effort.
To this end, in this paper, we propose a new regularization approach called Flow-Mixup for multi-labeled medical image classification with label corruption. Specifically, we introduce a new dimension called “flow dimension” for the feature tensors in hidden states and apply a novel Mixing module to a selected hidden state. (In this paper, a “hidden state” denotes the general output of a model layer, and a “feature” refers to a particular representation of some data.) Thus, the model layers ahead of the selected hidden state are restricted to learning a nonlinear function while the subsequent layers are restricted to learning a linear function. Flow-Mixup guides the nonlinear part to decouple the complex features (where the features of abnormalities are correlative) into abnormality-specific features before feeding the features to the linear part. The decoupling is guaranteed as the linear function requires its input features to lie in a linearly separable space. We compare Flow-Mixup with Mixup [MIXUP] and Manifold Mixup [manifoldmixup] to highlight the characteristics of Flow-Mixup.
This work makes three main contributions:
We propose a new regularization method called Flow-Mixup for multi-labeled medical image classification, and show that Flow-Mixup is insensitive to corrupted labels.
We compare Flow-Mixup with Mixup [MIXUP] and Manifold Mixup [manifoldmixup], and show that the “correlation conflicts” phenomenon and the “distribution shift” phenomenon occur when using Mixup or Manifold Mixup.
Experiments on several multi-labeled medical image classification datasets with corrupted labels verify that our Flow-Mixup outperforms known regularization methods.
II Related Work
II-A Multi-labeled Medical Image Classification
Various automatic medical image interpretation applications involve multi-labeled image classification tasks, such as chest X-ray (CXR) interpretation [cicero2017training, Wang2017CVPR, graphx, CheXNeXt, subspace], electrocardiogram (ECG) monitoring [shen2019ambulatory, kachuee2018ecg, golany2019pgans], comorbidity identification of Alcohol Use Disorder and human immunodeficiency virus infection [adeli2018multi], bone fracture type diagnosis [lee2020long], etc. To better handle multi-labeled classification tasks, a new loss function was proposed to guide deep learning models to search the subspace of abnormality features [subspace], and label hierarchy [chen2019deep] and matrix completion [adeli2018multi] methods were also used in correlative feature refinement. In [uncertainty], an approach was specially designed to calculate the uncertainty of automated diagnosis. Abnormality location perception was considered in [DNetLoc, infomask] for CXR image classification. Also, adversarial learning approaches were designed for data augmentation and disease severity assessment in CXR [xing2019adversarial, lanfredi2019adversarial] and ECG [golany2019pgans] classification. A large dataset [Wang2017CVPR] catalyzed multi-labeled classification methods on CXR images. However, the labels in this dataset were mined from radiology reports by natural language processing (NLP), and the text-mined labels were somewhat corrupted. Most of the known supervised multi-labeled classification methods focused on tackling feature correlations among abnormalities, but few of them considered label corruption.
II-B Regularization Methods
Regularization methods are useful for dealing with label corruption [arpit2017closer]. Kurmann et al. [subspace] managed to drive class-specific features into different affine subspaces and enlarge the distances between the subspaces. This method outperformed the vanilla methods in multi-labeled CXR image classification. Many data augmentation methods were used to deal with multi-labeled medical image classification [graphx, DNetLoc, CheXNeXt], which had a similar effect to regularization methods. The state-of-the-art regularization methods for single-labeled classification are Mixup [MIXUP, analysisMixup] and Manifold Mixup [manifoldmixup], which introduce linear constraints into the models. However, neither of them is well suited to multi-labeled classification because of the “correlation conflicts” and “distribution shift” phenomena (discussed in Sec. IV). Mixup ignores the feature correlations among abnormalities, while Manifold Mixup is often unstable in training. In this paper, we propose Flow-Mixup for multi-labeled medical image classification, which avoids the drawbacks of Mixup and Manifold Mixup.
Mixup [MIXUP] introduced a linear constraint to single-labeled classification and achieved good performance. Considering a deep learning classifier as a function $f$, the standard Mixup is defined as:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \qquad (1)$$

where $x_i$ and $x_j$ are two input images while $y_i$ and $y_j$ are the corresponding labels, with $\lambda \in [0, 1]$. Mixup regularization restricts the whole model (the function $f$) to be a linear function, as $f(\lambda x_i + (1 - \lambda) x_j) = \lambda f(x_i) + (1 - \lambda) f(x_j)$. Similarly, Manifold Mixup [manifoldmixup] applies the mixing operation as in Eq. (1) to a hidden state, and restricts the subsequent parts of the model to learn a linear function. Note that the “linear function” and “nonlinear function” are different from the “linear layer” and “nonlinear layer” of the neural networks, as the former concepts are related to the learning objectives but the latter concepts are about the model entities.
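The mixing operation of Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `mixup_pair` is hypothetical, samples are plain Python lists, and drawing the weight from Beta(alpha, alpha) follows the usual Mixup convention [MIXUP].

```python
import random

def mixup_pair(x_i, x_j, y_i, y_j, alpha=1.0, rng=random):
    """Mix two inputs and their label vectors with one shared weight."""
    # Assumption: the weight is drawn from Beta(alpha, alpha), as in standard Mixup.
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam

# Two toy "images" (flattened) with multi-hot label vectors.
x_mix, y_mix, lam = mixup_pair([0.0, 1.0], [1.0, 0.0], [1, 0, 1], [0, 1, 1])
```

A label shared by both samples (the third entry above) stays at 1 after mixing, while labels present in only one sample become soft targets weighted by the mixing weight.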
III-B An Overview of Flow-Mixup
In this paper, we propose a new regularization approach, Flow-Mixup, for multi-labeled medical image classification. Consider a deep learning classifier $f = f_{\mathcal{L}} \circ f_{\mathcal{N}}$, where $f_{\mathcal{N}}$ is a nonlinear function and $f_{\mathcal{L}}$ is a linear function. A training forward process with Flow-Mixup takes several steps. First, before training, we select a hidden state to split the model into a nonlinear part and a linear part, as $\hat{y} = f_{\mathcal{L}}(f_{\mathcal{N}}(x))$, where $\hat{y}$ is the model output. Second, we process the data (e.g., the images) forward to the selected hidden state, and apply a new Mixing module to the features in the hidden state (our Mixing module is described below). After being processed by the Mixing module, the features continue the forward propagation until the output. With the Mixing module, Flow-Mixup restricts the front part of the model to learning a nonlinear function, and the rest of the model serves as a linear function. In dealing with multi-labeled medical images, the nonlinear function extracts abnormality-specific features, and the linear function (the subsequent part of the model) projects the abnormality-specific features into the label space. The constraint on the nonlinear part is guaranteed, as the output of the nonlinear part is fed to the linear part, which requires its input to lie in a linearly separable space. Different from Manifold Mixup, the special Mixing module introduces an extra flow dimension, so several Mixing modules can be used simultaneously in one model.
III-C Mixing Module
Generally, the tensors of an image in deep learning models have 4 dimensions: batch dimension, channel dimension, width and height dimensions. Our proposed Flow-Mixup introduces a new dimension, called flow dimension. As shown in the left part of Fig. 1, assume that the original feature has a flow dimension of size 1 before being processed by the Mixing module, and then the output of the Mixing module has a flow dimension of size 2. The flow size is increased by the feature concatenation operation. After the features are fed to the Mixing module, the first step is to make a copy of these features. Then, the feature copy is processed by a mixing operation and concatenated to the original features along the flow dimension. The forward process in the Mixing module is defined as:

$$\tilde{h} = h \oplus \mathrm{Mix}_{\lambda}(h), \qquad (2)$$

where the mixing operation $\mathrm{Mix}_{\lambda}(\cdot)$ transforms a feature copy into two mini-copies ($h_1$, $h_2$) and applies the standard Mixup to them by $\mathrm{Mix}_{\lambda}(h) = \lambda h_1 + (1 - \lambda) h_2$, as illustrated in the right part of Fig. 1. $h_1$ and $h_2$ are obtained by applying random index-shuffle to $h$. $\lambda$ is randomly sampled from a beta distribution $\mathrm{Beta}(\alpha, \alpha)$, and $\alpha$ is a hyper-parameter controlling the mixing degree [MIXUP]. $\oplus$ indicates flow-wise concatenation, which results in the flow dimension size increase. Following Eq. (2), the feature $h$ is transformed into $\tilde{h}$ with a double flow size. Since the flow size is doubled in the forward propagation, the Mixing module halves the gradients in the back-propagation in order to keep the magnitudes of the gradients. The back-propagation of the Mixing module is defined as:

$$\nabla h = \frac{1}{2}\left(\nabla h^{o} + \nabla h^{m}\right), \qquad (3)$$

where $\nabla h^{o}$ indicates the gradients of the original features, and $\nabla h^{m}$ represents the gradients of the mixed features (see Fig. 1). In this way, the Mixing module can be applied to several hidden states simultaneously with the original features being preserved, as shown in Fig. 2(b). Note that the regularization approach cannot entirely restrict the subsequent layers to be a linear function, and thus applying several Mixing modules is helpful in strengthening the linear constraints. In implementation, if a hidden state is the last one (see Fig. 2(b)) or there is only one state (see Fig. 2(a)) to apply the Mixing module, it is optional to compute the forward propagation of the original features to the output layer. If the original features do not go forward, the Mixing module degrades into the common Mixup operation, calculating $\mathrm{Mix}_{\lambda}(h)$ in the forward propagation and $\nabla h^{m}$ in the back-propagation.
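The forward pass of the Mixing module can be sketched as follows. This is only an illustration under stated assumptions: `mixing_forward` is a hypothetical name, a batch is a list of flattened feature vectors, the flow dimension is represented as an outer list of length 2, and the labels are mixed with the same weight and shuffles as the features. The back-propagation step (halving the gradients) is not shown.

```python
import random

def mixing_forward(features, labels, alpha=3.0, rng=random):
    """Keep the original flow and append a mixed flow (flow size 1 -> 2)."""
    n = len(features)
    lam = rng.betavariate(alpha, alpha)   # mixing weight, assumed Beta(alpha, alpha)
    idx1 = rng.sample(range(n), n)        # random index shuffle for mini-copy 1
    idx2 = rng.sample(range(n), n)        # random index shuffle for mini-copy 2

    def mix(rows):
        # Standard Mixup between the two shuffled mini-copies.
        return [
            [lam * a + (1.0 - lam) * b for a, b in zip(rows[i], rows[j])]
            for i, j in zip(idx1, idx2)
        ]

    # Flow-wise concatenation: flow 0 is the untouched copy, flow 1 is mixed.
    flows = [[list(f) for f in features], mix(features)]
    flow_labels = [[list(y) for y in labels], mix(labels)]
    return flows, flow_labels, lam

feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
labs = [[1, 0], [0, 1], [1, 1]]
flows, flow_labels, lam = mixing_forward(feats, labs)
```

Because the original flow is passed through unchanged, a later Mixing module can again see unmixed features, which is what allows several modules to be stacked in one model.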
IV Analysis and Comparisons
This section discusses the feasibility and the characteristics of Flow-Mixup and its differences with the known regularization methods, Mixup [MIXUP] and Manifold Mixup [manifoldmixup].
IV-A Feasibility of Flow-Mixup
Hypothesis 1: A learned sequential deep learning classifier for multi-labeled images can be reformulated as a composition of some linear functions and some nonlinear functions.
Since the feature correlations among abnormalities exist and the labels lie independently in the label space, a deep learning classifier needs to learn nonlinear functions in order to decouple the correlative features. Thus, it is reasonable to regard a learned sequential classifier as a composition of multiple nonlinear functions and linear functions.
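Based on Hypothesis 1, a learned sequential deep learning classifier $f$ can be mathematically decoupled by:

$$f = f_1 \circ f_2 \circ \cdots \circ f_{p+q}, \qquad f_k \in \{l_1, \dots, l_p\} \cup \{n_1, \dots, n_q\}, \qquad (4)$$

where the functions $l_i$ ($i = 1, \dots, p$) and $n_j$ ($j = 1, \dots, q$) belong to a linear function family $\mathcal{L}$ and a nonlinear function family $\mathcal{N}$, respectively. In practice, in Eq. (4), the order and the concrete expressions of $l_i$ and $n_j$ are obtained by learning from the data.
Theorem 1. A learned sequential deep learning classifier for multi-labeled images can be reformulated as a composition of some linear functions and some nonlinear functions, in a sequence where the nonlinear functions appear first and then the linear functions follow, as:

$$f = l'_1 \circ \cdots \circ l'_p \circ n'_1 \circ \cdots \circ n'_q. \qquad (5)$$

Proof: A commutative law for linear and nonlinear functions can be proved, as follows. Assume $y = n(l(x))$ for a linear function $l \in \mathcal{L}$ and a nonlinear function $n \in \mathcal{N}$. Then there exist $l' \in \mathcal{L}$ and $n' \in \mathcal{N}$ such that $n(l(x)) = l'(l'^{-1}(n(l(x)))) = l'(n'(x))$ (because a linear function is invertible), with $n' = l'^{-1} \circ n \circ l$. Thus, $n \circ l = l' \circ n'$. By applying this commutative law repeatedly, it is easy to prove that a model under Hypothesis 1 can be specified as:

$$f = f_{\mathcal{L}} \circ f_{\mathcal{N}}, \qquad (6)$$

where $f_{\mathcal{L}} = l'_1 \circ \cdots \circ l'_p \in \mathcal{L}$ and $f_{\mathcal{N}} = n'_1 \circ \cdots \circ n'_q \in \mathcal{N}$. The solution thus constructed (i.e., Eq. (6)) verifies the theorem, which suggests that any multi-labeled image classifier can find a solution under the constraint of Flow-Mixup if the equation of the original classifier has a solution.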
IV-B Comparisons with Mixup
As discussed in the previous work [graphx], the features of abnormalities can be correlative, and thus may not be linearly separable. In other words, the inherent correlation of abnormalities might be in conflict with the linear constraint of Mixup. Thus, training a multi-labeled image classifier with the Mixup regularization may result in a performance decrease. As illustrated in Fig. 3, such “correlation conflicts” happen because, after the data manifold is mapped to a low-dimensional space satisfying the Mixup linear constraint, the boundary line between two classes cannot deal with the data belonging to both of these classes. In contrast, with our Flow-Mixup, the correlative features of abnormalities can first be decoupled into abnormality-specific features by the nonlinear functions, and such features lie in a linearly separable space.
IV-C Comparisons with Manifold Mixup
Manifold Mixup [manifoldmixup] allows applying a mixing operation to several hidden states in the training process. However, this mixing operation cannot be performed on them simultaneously. Manifold Mixup randomly selects one of these hidden states to apply the mixing operation in every training iteration, and consequently suffers from two drawbacks. (1) Updating parameters in every iteration affects the final parameters. Therefore, it is hard to know exactly what degree of data mixing is applied to a hidden state, as the mixing operation is used with a probability. Thus, it is difficult to determine the hyper-parameters for the mixing operation. (2) Since the training condition of a hidden state (whether to use a mixing operation) keeps changing, the training process is unstable and suffers from a “distribution shift” phenomenon. “Distribution shift” means that the objective feature distribution is changed. Ideally, using a mixing operation on a hidden state restricts the features to lie in a linearly separable space. However, Manifold Mixup keeps changing the constraint on the hidden states, which leads to an unstable training process and decreases the performance.
To observe the occurrence of the “distribution shift” phenomenon in model training, we compare the feature distributions on the training set of CIFAR-10, as shown in Fig. 4. We train the PreAct-ResNet-32 model [he2016identity] on the training set of CIFAR-10 with Mixup (applied to the data input and the output of every residual block) and without Mixup. Then we collect the output of every residual block and the model output. To avoid the influence of the classification results, we apply the k-means clustering algorithm (partitioning into $K$ clusters) to the collected features of every block output and the model output. Then we calculate the average value of $R^2$ (similar to $\eta^2$ in the analysis of variance) to observe the feature distributions, with $R^2 = 1 - \mathrm{SSI}/\mathrm{SST}$, where SSI is the intra-cluster sum of squares and SST is the total sum of squares. $R^2$ represents the percentage of the total variance coming from the inter-cluster variance. The higher $R^2$ is, the clearer the boundaries of the clusters are. For the $k$-th hidden state, SSI and SST are defined by:

$$\mathrm{SST}^{(k)} = \frac{1}{d_k} \sum_{i=1}^{N} \left\| h_i^{(k)} - \bar{h}^{(k)} \right\|^2, \qquad \mathrm{SSI}^{(k)} = \frac{1}{d_k} \sum_{c=1}^{K} \sum_{i=1}^{N_c} \left\| h_{c,i}^{(k)} - \bar{h}_c^{(k)} \right\|^2,$$

where $K$ indicates the number of clusters, $N$ is the number of images, and $N_c$ is the number of the images belonging to the $c$-th cluster. $h_i^{(k)}$ is the features of the $i$-th image in the $k$-th hidden state, and $h_{c,i}^{(k)}$ is the features of the $i$-th image of the $c$-th cluster. $d_k$ denotes the feature size of one data sample in the $k$-th hidden state, i.e., $d_k = C_k \times H_k \times W_k$, where $C_k$, $H_k$, and $W_k$ are the channel, height, and width dimension sizes. $\bar{h}^{(k)}$ and $\bar{h}_c^{(k)}$ denote the data-wise average features in the $k$-th hidden state and the data-wise average features of the $c$-th cluster in the $k$-th hidden state, respectively. As shown in Fig. 4, one can see that $R^2$ of the features learned with Mixup is evidently higher than that learned without any mixing operations. Thus, the “distribution shift” phenomenon happens when using Manifold Mixup, as the objective feature distributions are very different with and without mixing operations.
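The inter-cluster share of variance described above can be computed directly once cluster assignments are available. The sketch below assumes the ratio takes the form 1 - SSI/SST, that features are already flattened to vectors, and that the cluster ids come from a separate k-means run (not shown); the function names are hypothetical. Any constant normalization by the feature size cancels in the ratio and is therefore omitted.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def inter_cluster_ratio(features, assignments):
    """Return 1 - SSI/SST for flattened features and their cluster ids."""
    overall = mean_vector(features)
    sst = sum(sq_dist(f, overall) for f in features)  # total sum of squares
    clusters = {}
    for f, c in zip(features, assignments):
        clusters.setdefault(c, []).append(f)
    ssi = 0.0                                          # intra-cluster sum of squares
    for members in clusters.values():
        centre = mean_vector(members)
        ssi += sum(sq_dist(f, centre) for f in members)
    return 1.0 - ssi / sst

# Compact clusters give a ratio near 1; spread-out clusters give a lower ratio.
tight = inter_cluster_ratio([[0.0, 0.0], [0.0, 0.2], [4.0, 4.0], [4.0, 4.2]], [0, 0, 1, 1])
loose = inter_cluster_ratio([[0.0, 0.0], [0.0, 3.0], [4.0, 4.0], [4.0, 1.0]], [0, 0, 1, 1])
```

Averaging this quantity over the hidden states of interest would give one summary number per training configuration, in the spirit of the comparison in Fig. 4.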
|Method||Best||Last||Diff.||Best||Last||Diff.||Best||Last||Diff.||Best||Last||Diff.|
|Mixup (α = 1.0)||77.0||76.2||0.8||76.8||76.4||0.4||76.4||75.9||0.5||75.7||74.6||1.1|
|Mixup (α = 3.0)||76.7||75.8||0.9||76.3||75.6||0.7||76.3||75.6||0.7||76.1||75.5||0.6|
|Manifold Mixup (α = 3.0)||77.3||75.5||1.8||76.6||75.3||1.3||76.6||74.8||1.8||75.8||74.3||1.5|
|Flow-Mixup (α = 3.0, Op=False)||77.8||76.9||0.9||77.1||76.3||0.8||76.5||75.3||1.2||76.1||75.1||1.0|
|Flow-Mixup (α = 3.0, Op=True)||76.9||76.5||0.4||76.9||76.7||0.2||77.0||75.7||1.3||76.3||75.4||0.9|

|Mixup (α = 1.0)||0.6773||0.6581||0.6337||0.5774|
|Mixup (α = 3.0)||0.6575||0.6225||0.6195||0.5894|
|Manifold Mixup (α = 3.0)||0.6436||0.6389||0.6378||0.5822|
|Flow-Mixup (α = 3.0, Op=False)||0.6994||0.6800||0.6347||0.5996|
|Flow-Mixup (α = 3.0, Op=True)||0.6846||0.6784||0.6495||0.5963|

|Mixup (α = 1.0)||0.5543||0.5119||0.4992||0.3474|
|Mixup (α = 3.0)||0.5509||0.5245||0.4953||0.4563|
|Manifold Mixup (α = 3.0)||0.5512||0.5390||0.4966||0.4655|
|Flow-Mixup (α = 3.0, Op=False)||0.5551||0.5416||0.5113||0.4426|
|Flow-Mixup (α = 3.0, Op=True)||0.5769||0.5454||0.5072||0.4570|
To evaluate our Flow-Mixup approach for multi-labeled medical image classification tasks, we conduct experiments on the ChestX-ray14 dataset [Wang2017CVPR] and two ECG record datasets of the Alibaba Tianchi Cloud Competition (https://tianchi.aliyun.com/competition/entrance/231754/introduction). These datasets are for multi-labeled medical image classification. The ChestX-ray14 dataset [Wang2017CVPR] consists of 112,120 CXR images of size 1024 × 1024 each. The corresponding labels cover 14 abnormalities extracted from radiology reports by natural language processing (NLP), and some of the CXR images are assigned more than one label. As estimated by the data collectors [Wang2017CVPR], there is 10% label corruption. For the ECG classification, we use the preliminary competition ECG dataset (the ECG-55 dataset) containing 55 arrhythmia categories, and a selected ECG dataset (the ECG-12 dataset) containing the 12 most common arrhythmia categories, in which the ECG records are selected from the preliminary dataset and the final competition dataset. The ECG-55 dataset consists of 31,779 8-lead ECG records and ECG-12 has 34,664 8-lead ECG records. The ECGs are 10-second records sampled at a frequency of 500 Hz. An ECG record can be treated as a special one-dimensional image and processed with 1-D convolutions [kachuee2018ecg, shen2019ambulatory]. Example samples of the ChestX-ray14 and ECG datasets are shown in Fig. 5. In experiments, for the ChestX-ray14 dataset, we follow the official split, and for the ECG datasets, we randomly split a dataset into training, validation, and test parts by 7:1:2 since the official test set is not available.
V-B Experimental Setups
We use DenseNet-121 [huang2017densely] as the CXR classifier baseline and ResNet-34 [he2016identity] as the ECG classifier baseline. Two convolutional layers, both with a kernel size of 3 and a stride of 2, are added ahead of the DenseNet-121 network, similarly to [DNetLoc]. For CXR image classification, we follow the weighted binary cross-entropy loss function [DNetLoc, Wang2017CVPR], weighting the loss term for an abnormality with its inverse proportion. In ResNet-34, 1-D convolution kernels of size 3 are used to replace the 2-D convolution kernels. During training, we set the batch size to 32 and employ the Adam optimizer [Adam]; the learning rate is reduced when the validation loss reaches a plateau. We run 50 epochs for CXR image classification and 200 epochs for ECG classification. To validate Flow-Mixup and compare it to the known regularization methods on multi-labeled classification with label corruption, we replace the labels with corrupted labels with a given probability (the label corruption rate). Four label corruption rates are used for the ECG records and four for the CXR images; the rates differ between the two because the original CXR labels are already corrupted. The mixing operation is applied to the input of the third and fifth ResBlock and DenseBlock, for both Manifold Mixup and Flow-Mixup. We report the average AUC (area under the ROC curve) over the 14 kinds of abnormalities on the CXR test set, and report Macro-F1 (macro-averaging on F1 scores) on the two ECG test sets.
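The label corruption step can be illustrated with a small sketch. The exact corruption procedure is not spelled out in the text, so the version below rests on an assumption: each binary entry of a multi-hot label vector is flipped independently with probability `rate`. The function name `corrupt_labels` is hypothetical.

```python
import random

def corrupt_labels(labels, rate, rng=random):
    """Flip each binary label entry independently with probability `rate`.

    Assumption: independent per-entry flips; the paper may use another scheme.
    """
    return [
        [1 - v if rng.random() < rate else v for v in row]
        for row in labels
    ]

clean = [[1, 0, 0], [0, 1, 1], [0, 0, 0]]
noisy = corrupt_labels(clean, rate=0.1)
```

With `rate=0.0` the labels are returned unchanged, and with `rate=1.0` every entry is flipped, which gives two easy sanity checks for the routine.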
V-C Experimental Results
V-C1 Performance Comparison
The experimental results on the ChestX-ray14 dataset are reported in Table I and the results on the ECG-55 and ECG-12 datasets are in Table II. We compare the indicators of our proposed Flow-Mixup with those of several known state-of-the-art regularization methods, including the Empirical Risk Minimization (ERM) principle [ERM], Mixup [MIXUP], and Manifold Mixup [manifoldmixup]. We report the best and the last performances in CXR classification; we report only the best performances in ECG classification, since the last performances are very close to the best performances on the ECG datasets. One can see that Flow-Mixup outperforms the other regularization methods in dealing with various degrees of label corruption, which validates the capability of Flow-Mixup. Flow-Mixup attains better performances than Mixup, which might result from the abnormality-specific features extracted by the nonlinear part (see Sec. IV-B). In ECG classification, Flow-Mixup outperforms Mixup and Manifold Mixup, and a similar conclusion can be derived. Further, one can see from Fig. 6 that Flow-Mixup outperforms Mixup in most classes in F1 scores and AUCs.
V-C2 Correlation Conflict Reduction
To further evaluate Flow-Mixup’s ability to reduce correlation conflicts, we compare the F1 score and AUC of every abnormality between Mixup and Flow-Mixup on the ChestX-ray14 test set and ECG-12 test set, respectively. The histograms of these F1 scores and AUCs are shown in Fig. 6, both of them with 10% label corruption. For easy comparison, we set two new indicators for every class $c$: the “Performance Ratio”, computed from the per-class performances of Mixup and Flow-Mixup and raised to an exponent (chosen separately for CXR images and for ECG records), and the “Independent Ratio” $= N_c^{\mathrm{only}} / N_c^{\mathrm{all}}$, where $N_c^{\mathrm{only}}$ is the number of images with only the class $c$, and $N_c^{\mathrm{all}}$ is the number of all the images with the class $c$, including multi-labeled images. The “Performance Ratio” (which is the “AUC Ratio” for CXRs and “F1 Ratio” for ECGs in Fig. 6) indicates the relative performance of Mixup and Flow-Mixup on every abnormality, while the “Independent Ratio” suggests to what degree a class is independent in a dataset. The performances are normalized before computing the “Performance Ratio”. Thus, one can see whether the relative performances are related to the class independence by comparing how closely the “Performance Ratio” and the “Independent Ratio” coincide. In Fig. 6, the “Performance Ratio” curves coincide with the “Independent Ratio” curves (as measured by their Spearman correlation coefficients), indicating that Flow-Mixup can obtain better performance in a relatively dependent class. Hence, we believe Flow-Mixup is able to reduce the correlation conflicts.
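The “Independent Ratio” is simple to compute from multi-hot label vectors. The sketch below is illustrative only; the function name `independent_ratio` and the toy label matrix are hypothetical and not taken from the datasets.

```python
def independent_ratio(label_rows, class_index):
    """Share of a class's positive samples that carry no other label."""
    with_class = [row for row in label_rows if row[class_index] == 1]
    if not with_class:
        return None  # class never occurs, ratio undefined
    only_class = [row for row in with_class if sum(row) == 1]
    return len(only_class) / len(with_class)

# Toy multi-hot labels for three classes.
toy = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 0],
]
ratios = [independent_ratio(toy, c) for c in range(3)]
```

In this toy example the third class never appears alone, so its ratio is 0; such strongly dependent classes are the ones where the text expects Flow-Mixup to help most.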
V-C3 Distribution Shift Reduction
To evaluate Flow-Mixup’s ability to reduce the “distribution shift” phenomenon, we compute the differences (Diff.s) between the Best AUCs and the Last AUCs on the ChestX-ray14 test set, shown in Table III. The Diff.s on the ECG test sets are not reported since the best and the last Macro F1 scores are very close. Further, we compute the variances of the normalized performance indicators (AUCs for CXR images and Macro F1 scores for ECG records) over some epochs on the test sets, as:

$$\mathrm{Var} = \frac{1}{T} \sum_{t=1}^{T} \left( p_t - \bar{p} \right)^2,$$

where $p_t$ is the normalized performance on the test set at epoch $t$, $\bar{p}$ is the average performance, and $t$ is the index of epochs. The normalization method is the min-max normalization. The variances are computed for the first 20 epochs on the ChestX-ray14 dataset ($T = 20$) and for the first 100 epochs on the two ECG datasets ($T = 100$), as the indicators fluctuate only slightly in the remaining epochs. As shown in Table III, Flow-Mixup has lower variances than Manifold Mixup, which suggests that Flow-Mixup is more stable. Comparing the Diff.s and variances, training models with Manifold Mixup is not as stable as with Flow-Mixup, which might be due to the instability caused by “distribution shift”, as discussed in Sec. IV-C.
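The stability measure above can be sketched as follows. The assumptions are stated explicitly: the variance is the population variance (divided by the number of epochs), and the min-max bounds default to the minimum and maximum of the given sequence unless the caller supplies shared bounds (for example, taken across all compared runs). The function name `normalized_variance` is hypothetical.

```python
def normalized_variance(scores, lo=None, hi=None):
    """Variance of min-max normalized per-epoch scores.

    Assumption: population variance; bounds default to the sequence's own range.
    """
    lo = min(scores) if lo is None else lo
    hi = max(scores) if hi is None else hi
    if hi == lo:
        return 0.0  # constant curve, no fluctuation
    norm = [(s - lo) / (hi - lo) for s in scores]
    mean = sum(norm) / len(norm)
    return sum((p - mean) ** 2 for p in norm) / len(norm)

# Hypothetical per-epoch test scores of one training run.
var = normalized_variance([0.70, 0.74, 0.72, 0.78])
```

Passing shared `lo` and `hi` values keeps the variances of different methods on a common scale, which matters when the goal is to compare their training stability.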
The works [MIXUP, manifoldmixup] give suggestions for setting the mixing degree $\alpha$ when dealing with corrupted labels. In our tasks, we find that the models perform well with the $\alpha$ values reported in the tables. Flow-Mixup seems to be insensitive to $\alpha$: the results fluctuate only within a narrow range on the ECG datasets and on the ChestX-ray14 dataset with different $\alpha$.
|ChestX-ray14 (T = 20)||ECG-12 (T = 100)||ECG-55 (T = 100)|
In this paper, we proposed a new regularization approach, Flow-Mixup, for multi-labeled medical image classification with corrupted labels. Guided by Flow-Mixup, a deep learning classifier extracts abnormality-specific features and then maps such features into the label space. Experiments verified that Flow-Mixup can handle datasets containing corrupted labels, and thus makes it possible to apply automatic annotation. Besides, we compared Flow-Mixup with the common Mixup and Manifold Mixup methods, highlighted the characteristics of Flow-Mixup, and discussed the “correlation conflicts” and “distribution shift” phenomena that occur when using Mixup or Manifold Mixup.
This research was partially supported by the National Research and Development Program of China under grant No. 2019YFB1404802, No. 2019YFC0118802, and No. 2018AAA0102102, the National Natural Science Foundation of China under grant No. 61672453, the Zhejiang University Education Foundation under grants No. K18-511120-004, No. K17-511120-017, and No. K17-518051-02, the Zhejiang public welfare technology research project under grant No. LGF20F020013, and the Key Laboratory of Medical Neurobiology of Zhejiang Province. D. Z. Chen’s research was supported in part by NSF Grant CCF-1617735.