I Introduction
Medical image classification is critical in clinical practice (e.g., for early detection of diseases). However, it remains a challenging task due to large intra-class variations, blurred boundaries between abnormalities, inconclusive abnormality patterns, etc. Furthermore, it is common for a patient to suffer multiple symptoms simultaneously, which present several kinds of abnormalities and some complex comorbidities in the medical images. Therefore, medical image interpretation is commonly a multi-labeled image classification process. Recently, supervised deep learning models have attained high performance thanks to large amounts of well-annotated data for model training. However, manual annotation of medical images by medical experts is time-consuming, while automatic annotation (e.g., automatically extracting labels from reports [Wang2017CVPR]) is fast but may introduce a considerable number of corrupted (incorrect) labels.
Many methods, such as model ensemble [CheXNeXt], weighted loss function [DNetLoc], and label hierarchy [chen2019deep], have been widely used in multi-labeled medical image classification. However, dealing with corrupted labels of multi-labeled medical images has rarely been studied, although it is a basic issue for using automatically annotated labels. It was shown that regularization methods can hinder the memorization of models while preserving their generalization ability [arpit2017closer], which is advantageous for tackling label corruption. Many known regularization methods (e.g., Mixup [MIXUP], Manifold Mixup [manifoldmixup]) were proposed for single-output tasks, but cannot meet the needs of multi-output tasks [xu2019survey]. Besides, the existence of complex correlations among abnormalities was confirmed in the literature [graphx, Wang2017CVPR], which requires additional considerations in model training. Thus, multi-labeled medical image classification with corrupted labels is a challenging problem and requires further research effort.
To this end, in this paper, we propose a new regularization approach called FlowMixup for multi-labeled medical image classification with label corruption. Specifically, we introduce a new dimension called “flow dimension” for the feature tensors in hidden states and apply a novel Mixing module to a selected hidden state. (In this paper, a “hidden state” denotes the general output of a model layer, and a “feature” refers to a particular representation of some data.) Thus, the model layers ahead of the selected hidden state are restricted to learning a nonlinear function while the subsequent layers are restricted to learning a linear function. FlowMixup guides the nonlinear part to decouple the complex features (where the features of abnormalities are correlative) into abnormality-specific features before feeding the features to the linear part. The decoupling is guaranteed as the linear function requires its input features to lie in a linearly separable space. We compare FlowMixup with Mixup [MIXUP] and Manifold Mixup [manifoldmixup] to highlight the characteristics of FlowMixup.
This work makes three main contributions:

We propose a new regularization method called FlowMixup for multi-labeled medical image classification, and show that FlowMixup is insensitive to corrupted labels.

We compare FlowMixup with Mixup [MIXUP] and Manifold Mixup [manifoldmixup], and show that the “correlation conflicts” phenomenon and the “distribution shift” phenomenon occur when using Mixup or Manifold Mixup.

Experiments on several multi-labeled medical image classification datasets with corrupted labels verify that our FlowMixup outperforms known regularization methods.
II Related Work
II-A Multi-labeled Medical Image Classification
Various automatic medical image interpretation applications involve multi-labeled image classification tasks, such as chest X-ray (CXR) interpretation [cicero2017training, Wang2017CVPR, graphx, CheXNeXt, subspace], electrocardiogram (ECG) monitoring [shen2019ambulatory, kachuee2018ecg, golany2019pgans], comorbidity identification of Alcohol Use Disorder and human immunodeficiency virus infection [adeli2018multi], bone fracture type diagnosis [lee2020long], etc. To better handle multi-labeled classification tasks, a new loss function was proposed to guide deep learning models to search the subspace of abnormality features [subspace], and label hierarchy [chen2019deep] and matrix completion [adeli2018multi] methods were also used in correlative feature refinement. In [uncertainty], an approach was specifically designed to calculate the uncertainty of automated diagnosis. Abnormality location perception was considered in [DNetLoc, infomask] for CXR image classification. Also, adversarial learning approaches were designed for data augmentation and disease severity assessment in CXR [xing2019adversarial, lanfredi2019adversarial] and ECG [golany2019pgans] classification. A large dataset [Wang2017CVPR] catalyzed multi-labeled classification methods on CXR images. However, the labels in this dataset were mined from radiology reports by natural language processing (NLP), and the text-mined labels were somewhat corrupted. Most of the known supervised multi-labeled classification methods focused on tackling feature correlations among abnormalities, but few of them considered label corruption.
II-B Regularization Methods
Regularization methods are useful for dealing with label corruption [arpit2017closer]. Kurmann et al. [subspace] managed to drive class-specific features into different affine subspaces and enlarge the distances between the subspaces. This method outperformed the vanilla methods in multi-labeled CXR image classification. Many data augmentation methods were used to deal with multi-labeled medical image classification [graphx, DNetLoc, CheXNeXt], which had a similar effect to regularization methods. The state-of-the-art regularization methods for single-labeled classification are Mixup [MIXUP, analysisMixup] and Manifold Mixup [manifoldmixup], which introduce linear constraints into the models. However, neither of them is well suited to multi-labeled classification because of the “correlation conflicts” and “distribution shift” phenomena (discussed in Sec. IV). Mixup ignores the feature correlations among abnormalities, while Manifold Mixup is often unstable in training. In this paper, we propose FlowMixup for multi-labeled medical image classification, which avoids the drawbacks of Mixup and Manifold Mixup.
III Approach
III-A Preliminaries
Mixup [MIXUP] introduced a linear constraint to single-labeled classification and achieved good performance. Considering a deep learning classifier as a function $f$, the standard Mixup is defined as:
$$\tilde{x} = \lambda x_i + (1-\lambda)\, x_j, \qquad \tilde{y} = \lambda y_i + (1-\lambda)\, y_j, \qquad (1)$$
where $x_i$ and $x_j$ are two input images while $y_i$ and $y_j$ are the corresponding labels, with $\lambda \in [0, 1]$. Mixup regularization restricts the whole model (the function $f$) to be a linear function, as $f(\tilde{x}) = \tilde{y}$. Similarly, Manifold Mixup [manifoldmixup] applies the mixing operation as in Eq. (1) to a hidden state, and restricts the subsequent parts of the model to learn a linear function. Note that the “linear function” and “nonlinear function” are different from the “linear layer” and “nonlinear layer” of the neural networks, as the former concepts are related to the learning objectives but the latter concepts are about the model entities.
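The mixing operation of Eq. (1) can be illustrated with a minimal NumPy sketch. The function name `mixup_batch`, the in-batch random pairing, and the symmetric beta sampling of the mixing coefficient are assumptions made for illustration, following common Mixup practice; this is not the authors' implementation.

```python
import numpy as np

def mixup_batch(x, y, alpha=1.0, rng=None):
    """Mix every sample of a batch with a randomly chosen partner.

    x: inputs of shape (batch, ...); y: label vectors of shape (batch, classes).
    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)           # mixing coefficient in [0, 1]
    perm = rng.permutation(x.shape[0])     # partner index for every sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix, lam, perm

# Usage: mix four toy "images" with multi-hot labels.
rng = np.random.default_rng(0)
images = rng.random((4, 1, 8, 8))
labels = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]], dtype=float)
x_mix, y_mix, lam, perm = mixup_batch(images, labels, alpha=1.0, rng=rng)
```

The same coefficient and the same pairing are applied to inputs and labels, which is what makes the training target a linear interpolation of the two original targets.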
III-B An Overview of FlowMixup
In this paper, we propose a new regularization approach, FlowMixup, for multi-labeled medical image classification. Consider a deep learning classifier $f = f_L \circ f_N$, where $f_N$ is a nonlinear function and $f_L$ is a linear function. A training forward process with FlowMixup takes several steps. First, we select a hidden state to split the model into a nonlinear part and a linear part before training, as $\hat{y} = f_L(f_N(x))$, where $\hat{y}$ is the model output. Second, we process the data (e.g., the images) forward to the selected hidden state, and apply a new Mixing module to the features in the hidden state (our Mixing module is described below). After being processed by the Mixing module, the features continue the forward propagation until the output. With the Mixing module, FlowMixup restricts the front part of the model to learning a nonlinear function, and the rest of the model serves as a linear function. In dealing with multi-labeled medical images, the nonlinear function extracts abnormality-specific features, and the linear function (the subsequent part of the model) projects the abnormality-specific features into the label space. The constraint on the nonlinear part is guaranteed, as the output of the nonlinear part is fed to the linear part, which requires its input to lie in a linearly separable space. Different from Manifold Mixup, the Mixing module introduces an extra flow dimension, which allows several Mixing modules to be used simultaneously in a model.
III-C Mixing Module
Generally, the tensors of an image in deep learning models have 4 dimensions: batch dimension, channel dimension, width and height dimensions. Our proposed FlowMixup introduces a new dimension, called flow dimension. As shown in the left part of Fig. 1, assume that the original feature has a flow dimension of size 1 before being processed by the Mixing module, and then the output of the Mixing module has a flow dimension of size 2. The flow size is increased by the feature concatenation operation. After the features are fed to the Mixing module, the first step is to make a copy of these features. Then, the feature copy is processed by a mixing operation and then concatenated into the original features along the flow dimension. The forward process in the Mixing module is defined as:
$$\hat{X} = X \oplus \mathrm{Mix}(X), \qquad (2)$$
where the mixing operation $\mathrm{Mix}(\cdot)$ transforms a feature copy into two mini-copies ($X_a$, $X_b$) and applies the standard Mixup to them by $\mathrm{Mix}(X) = \lambda X_a + (1-\lambda) X_b$, as illustrated in the right part of Fig. 1. $X_a$ and $X_b$ are obtained by applying random index-shuffle to $X$. $\lambda$ is randomly sampled from a beta distribution $\mathrm{Beta}(\alpha, \alpha)$, and $\alpha$ is a hyperparameter controlling the mixing degree [MIXUP]. $\oplus$ indicates flow-wise concatenation, which results in the flow dimension size increase. Following Eq. (2), the feature $X$ is transformed into $\hat{X}$ with a double flow size. Since the flow size is doubled in the forward propagation, the Mixing module shall halve the gradients in the backpropagation in order to keep the magnitudes of the gradients. The backpropagation of the Mixing module is defined as:
$$\nabla X = \tfrac{1}{2}\left(\nabla X_o + \nabla X_m\right), \qquad (3)$$
where $\nabla X_o$ indicates the gradients of the original features, and $\nabla X_m$ represents the gradients of the mixed features (see Fig. 1). In this way, the Mixing module can be applied to several hidden states simultaneously with the original features being preserved, as shown in Fig. 2(b). Note that the regularization approach cannot entirely restrict the subsequent layers to be a linear function, and thus applying several Mixing modules is helpful in strengthening the linear constraints. In implementation, if a hidden state is the last one (see Fig. 2(b)) or there is only one state (see Fig. 2(a)) to apply the Mixing module, it is optional to compute the forward propagation of the original features to the output layer. If the original features do not go forward, the Mixing module degrades into the common Mixup operation, calculating $\mathrm{Mix}(X)$ in the forward propagation and $\nabla X_m$ in the backpropagation.
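The forward step of the Mixing module (copy, index-shuffle, mix, and flow-wise concatenation) might be sketched as follows. This is only an illustration under stated assumptions: the flow dimension is placed on the leading axis, the input has a flow size of 1, and the name `mixing_module_forward` is hypothetical. The gradient handling in the backward pass is not covered here.

```python
import numpy as np

def mixing_module_forward(features, alpha=3.0, rng=None):
    """Forward pass sketch of a Mixing module.

    features: array of shape (1, batch, channel, height, width); the leading
    axis is the flow dimension (an assumed layout, not taken from the paper).
    Returns the flow-wise concatenation of the original and mixed features,
    the mixing coefficient, and the two shuffle indices (needed to mix the
    labels in the same way).
    """
    if features.shape[0] != 1:
        raise ValueError("this sketch expects a flow dimension of size 1")
    rng = np.random.default_rng() if rng is None else rng
    copy = features[0].copy()              # copy of the original features
    batch = copy.shape[0]
    idx_a = rng.permutation(batch)         # random index-shuffle, mini-copy a
    idx_b = rng.permutation(batch)         # random index-shuffle, mini-copy b
    lam = rng.beta(alpha, alpha)           # mixing coefficient
    mixed = lam * copy[idx_a] + (1.0 - lam) * copy[idx_b]
    out = np.concatenate([features, mixed[None]], axis=0)  # flow size 1 -> 2
    return out, lam, idx_a, idx_b

# Usage: the original features stay in flow slot 0, the mixed ones in slot 1.
rng = np.random.default_rng(0)
hidden = rng.random((1, 4, 2, 3, 3))
out, lam, idx_a, idx_b = mixing_module_forward(hidden, alpha=3.0, rng=rng)
```

Keeping the unmixed features in their own flow slot is what lets a later Mixing module still see clean features, in contrast to overwriting the hidden state with mixed features.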
IV Analysis and Comparisons
This section discusses the feasibility and the characteristics of FlowMixup and its differences with the known regularization methods, Mixup [MIXUP] and Manifold Mixup [manifoldmixup].
IV-A Feasibility of FlowMixup
Hypothesis 1: A learned sequential deep learning classifier for multi-labeled images can be reformulated as a composition of some linear functions and some nonlinear functions.
Since the feature correlations among abnormalities exist and the labels lie independently in the label space, a deep learning classifier needs to learn nonlinear functions in order to decouple the correlative features. Thus, it is reasonable to regard a learned sequential classifier as a composition of multiple nonlinear functions and linear functions.
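Based on Hypothesis 1, a learned sequential deep learning classifier $f$ can be mathematically decoupled by:
$$f = \phi_K \circ \phi_{K-1} \circ \cdots \circ \phi_1, \qquad \phi_k \in \mathcal{L} \cup \mathcal{N}, \qquad (4)$$
where each component $\phi_k$ is either a linear function $l_i$ or a nonlinear function $n_j$, and the functions $l_i$ and $n_j$ belong to a linear function family $\mathcal{L}$ and a nonlinear function family $\mathcal{N}$, respectively. In practice, in Eq. (4), the order and the concrete expressions of $l_i$ and $n_j$ are obtained by learning from the data.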
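Theorem 1. A learned sequential deep learning classifier for multi-labeled images can be reformulated as a composition of some linear functions and some nonlinear functions, in a sequence where the nonlinear functions appear first and then the linear functions follow, as:
$$f = l_p \circ \cdots \circ l_1 \circ n_q \circ \cdots \circ n_1, \qquad l_i \in \mathcal{L},\ n_j \in \mathcal{N}. \qquad (5)$$
Proof: A commutative law for linear and nonlinear functions can be proved, as follows. Assume $y = n(l(x))$ for a linear function $l \in \mathcal{L}$ and a nonlinear function $n \in \mathcal{N}$. Then $\exists\, l' \in \mathcal{L}$ and $n' \in \mathcal{N}$ such that $n(l(x)) = l'\big(l'^{-1}(n(l(x)))\big) = l'(n'(x))$ (because a linear function is invertible), with $n' = l'^{-1} \circ n \circ l$. Thus, $n \circ l = l' \circ n'$. By applying this commutative law repeatedly, it is easy to prove that a model under Hypothesis 1 can be specified as:
$$f = l'_p \circ \cdots \circ l'_1 \circ n'_q \circ \cdots \circ n'_1, \qquad (6)$$
where $l'_i \in \mathcal{L}$ and $n'_j \in \mathcal{N}$. The solution thus constructed (i.e., Eq. (6)) verifies the theorem, which suggests that any multi-labeled image classifier can find a solution under the constraint of FlowMixup if the equation of the original classifier has a solution.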
IV-B Comparisons with Mixup
As discussed in the previous work [graphx], the features of abnormalities can be correlative and thus may not be linearly separable. In other words, the inherent correlation of abnormalities might be in conflict with the linear constraint of Mixup. Thus, training a multi-labeled image classifier with the Mixup regularization may result in a performance decrease. As illustrated in Fig. 3, such “correlation conflicts” happen because the boundary line of two classes cannot deal with the data belonging to both of these two classes once the data manifold is mapped to a low-dimensional space satisfying the Mixup linear constraint. In contrast, with our FlowMixup, the correlative features of abnormalities can be decoupled into abnormality-specific features by the nonlinear functions first, and such features lie in a linearly separable space.
IV-C Comparisons with Manifold Mixup
Manifold Mixup [manifoldmixup] allows applying a mixing operation to several hidden states in the training process. However, this mixing operation cannot be performed on these hidden states simultaneously. Manifold Mixup randomly selects one of these hidden states to apply the mixing operation in every training iteration, and consequently suffers two drawbacks. (1) Updating parameters in every iteration affects the final parameters. Therefore, it is hard to know exactly what degree of data mixing is applied to a hidden state, as the mixing operation is used with a probability. Thus, it is difficult to determine the hyperparameters for the mixing operation. (2) Since the training condition of a hidden state (whether to use a mixing operation) keeps changing, the training process is unstable and suffers a “distribution shift” phenomenon. “Distribution shift” means that the objective feature distribution is changed. Ideally, using a mixing operation on a hidden state restricts the features to lie in a linearly separable space. However, Manifold Mixup keeps changing the constraint on the hidden states, which leads to an unstable training process and decreases the performance.
To observe the occurrence of the “distribution shift” phenomenon in model training, we compare the feature distributions on the training set of CIFAR-10, as shown in Fig. 4. We train the PreActResNet32 model [he2016identity] on the training set of CIFAR-10 with Mixup (applied to the data input and the output of every residual block with a fixed mixing hyperparameter $\alpha$) and without Mixup. Then we collect the output of every residual block and the model output. To avoid the influence of the classification results, we utilize the k-means clustering algorithm (partitioning into $K$ classes) on the collected features of every block output and model output. Then we calculate the average value of $R^2$ (similar to $\eta^2$ in the analysis of variance) to observe the feature distributions. $R^2 = 1 - \mathrm{SSI}/\mathrm{SST}$, where SSI is the intra-cluster sum of squares and SST is the total sum of squares. $R^2$ presents the percentage of the total variance coming from the inter-cluster variance. The higher $R^2$ is, the clearer the boundaries of the clusters are. SSI and SST are defined by:
$$\mathrm{SSI}_l = \frac{1}{d_l}\sum_{k=1}^{K}\sum_{i=1}^{N_k}\left\| x^{l}_{k,i} - \bar{x}^{l}_{k} \right\|_2^2, \qquad \mathrm{SST}_l = \frac{1}{d_l}\sum_{i=1}^{N}\left\| x^{l}_{i} - \bar{x}^{l} \right\|_2^2, \qquad (7)$$
where $K$ indicates the number of clusters, $N$ is the number of images, and $N_k$ is the number of the images belonging to the $k$-th cluster. $x^{l}_{i}$ is the feature of the $i$-th image in the $l$-th hidden state (and $x^{l}_{k,i}$ is the feature of the $i$-th image of the $k$-th cluster). $d_l$ denotes the feature size of one data sample in the $l$-th hidden state, i.e., $d_l = C_l \times H_l \times W_l$, where $C_l$, $H_l$, and $W_l$ are the channel, height, and width dimension sizes. $\bar{x}^{l}$ and $\bar{x}^{l}_{k}$ denote the data-wise average features in the $l$-th hidden state and the data-wise average features of the $k$-th cluster in the $l$-th hidden state, respectively. As shown in Fig. 4, one can see that $R^2$ of the features learned with Mixup is evidently higher than that without any mixing operations. Thus, the “distribution shift” phenomenon happens when using Manifold Mixup, as the objective feature distributions are very different with and without mixing operations.
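The cluster-separation statistic described above can be computed for one hidden state as in the following sketch. It assumes the statistic is one minus the ratio of the intra-cluster sum of squares to the total sum of squares; any normalization constant shared by both sums cancels in the ratio and is therefore omitted. The function name `r_squared` and the flattened `(N, d)` feature layout are illustrative choices.

```python
import numpy as np

def r_squared(features, cluster_ids):
    """Share of total variance explained by the clustering of one hidden state.

    features: (N, d) array, one flattened feature vector per image.
    cluster_ids: (N,) integer cluster assignment (e.g., from k-means).
    """
    features = np.asarray(features, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    # Total sum of squares around the global mean feature.
    sst = np.sum((features - features.mean(axis=0)) ** 2)
    # Intra-cluster sum of squares around each cluster's mean feature.
    ssi = 0.0
    for k in np.unique(cluster_ids):
        members = features[cluster_ids == k]
        ssi += np.sum((members - members.mean(axis=0)) ** 2)
    return 1.0 - ssi / sst

# Usage: two well-separated clusters give a value close to 1.
toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
score = r_squared(toy, [0, 0, 1, 1])
```

Averaging this value over the collected hidden states would give the per-model summary used in the comparison.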
V Experiments
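Table I: AUCs on the ChestX-ray14 test set under four label corruption rates (Rate 1 to Rate 4). Best and Last are the best and the last AUCs; Diff. is their difference.
| Regularization | Rate 1 Best | Rate 1 Last | Rate 1 Diff. | Rate 2 Best | Rate 2 Last | Rate 2 Diff. | Rate 3 Best | Rate 3 Last | Rate 3 Diff. | Rate 4 Best | Rate 4 Last | Rate 4 Diff. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERM | 74.2 | 69.4 | 4.8 | 73.1 | 69.9 | 3.2 | 72.8 | 69.3 | 3.5 | 72.2 | 67.3 | 4.9 |
| Mixup ($\alpha$ = 1.0) | 77.0 | 76.2 | 0.8 | 76.8 | 76.4 | 0.4 | 76.4 | 75.9 | 0.5 | 75.7 | 74.6 | 1.1 |
| Mixup ($\alpha$ = 3.0) | 76.7 | 75.8 | 0.9 | 76.3 | 75.6 | 0.7 | 76.3 | 75.6 | 0.7 | 76.1 | 75.5 | 0.6 |
| Manifold Mixup ($\alpha$ = 3.0) | 77.3 | 75.5 | 1.8 | 76.6 | 75.3 | 1.3 | 76.6 | 74.8 | 1.8 | 75.8 | 74.3 | 1.5 |
| FlowMixup ($\alpha$ = 3.0, Op = False) | 77.8 | 76.9 | 0.9 | 77.1 | 76.3 | 0.8 | 76.5 | 75.3 | 1.2 | 76.1 | 75.1 | 1.0 |
| FlowMixup ($\alpha$ = 3.0, Op = True) | 76.9 | 76.5 | 0.4 | 76.9 | 76.7 | 0.2 | 77.0 | 75.7 | 1.3 | 76.3 | 75.4 | 0.9 |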
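Table II: Macro-F1 scores on the ECG12 and ECG55 test sets under four label corruption rates (Rate 1 to Rate 4).
| Dataset | Regularization | Rate 1 | Rate 2 | Rate 3 | Rate 4 |
|---|---|---|---|---|---|
| ECG12 | ERM | 0.6617 | 0.6531 | 0.6238 | 0.5590 |
| ECG12 | Mixup ($\alpha$ = 1.0) | 0.6773 | 0.6581 | 0.6337 | 0.5774 |
| ECG12 | Mixup ($\alpha$ = 3.0) | 0.6575 | 0.6225 | 0.6195 | 0.5894 |
| ECG12 | Manifold Mixup ($\alpha$ = 3.0) | 0.6436 | 0.6389 | 0.6378 | 0.5822 |
| ECG12 | FlowMixup ($\alpha$ = 3.0, Op = False) | 0.6994 | 0.6800 | 0.6347 | 0.5996 |
| ECG12 | FlowMixup ($\alpha$ = 3.0, Op = True) | 0.6846 | 0.6784 | 0.6495 | 0.5963 |
| ECG55 | ERM | 0.5535 | 0.5319 | 0.4501 | 0.3280 |
| ECG55 | Mixup ($\alpha$ = 1.0) | 0.5543 | 0.5119 | 0.4992 | 0.3474 |
| ECG55 | Mixup ($\alpha$ = 3.0) | 0.5509 | 0.5245 | 0.4953 | 0.4563 |
| ECG55 | Manifold Mixup ($\alpha$ = 3.0) | 0.5512 | 0.5390 | 0.4966 | 0.4655 |
| ECG55 | FlowMixup ($\alpha$ = 3.0, Op = False) | 0.5551 | 0.5416 | 0.5113 | 0.4426 |
| ECG55 | FlowMixup ($\alpha$ = 3.0, Op = True) | 0.5769 | 0.5454 | 0.5072 | 0.4570 |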
V-A Datasets
To evaluate our FlowMixup approach for multi-labeled medical image classification tasks, we conduct experiments on the ChestX-ray14 dataset [Wang2017CVPR] and two ECG record datasets of the Alibaba Tianchi Cloud Competition (https://tianchi.aliyun.com/competition/entrance/231754/introduction). These datasets are for multi-labeled medical image classification. The ChestX-ray14 dataset [Wang2017CVPR] consists of 112,120 CXR images of size $1024 \times 1024$ each. The corresponding labels cover 14 abnormalities extracted from radiology reports by natural language processing (NLP), and some of the CXR images are assigned more than one label. As estimated by the data collectors [Wang2017CVPR], there is 10% label corruption. For the ECG classification, we use the preliminary competition ECG dataset (the ECG55 dataset) containing 55 arrhythmia categories, and a selected ECG dataset (the ECG12 dataset) containing the 12 most common arrhythmia categories, in which the ECG records are selected from the preliminary dataset and the final competition dataset. The ECG55 dataset consists of 31,779 8-lead ECG records and ECG12 has 34,664 8-lead ECG records. The ECGs are 10-second records sampled at a frequency of 500 Hertz. An ECG record can be treated as a special one-dimensional image and processed with 1D convolutions [kachuee2018ecg, shen2019ambulatory]. Example samples of the ChestX-ray14 and ECG datasets are shown in Fig. 5. In experiments, for the ChestX-ray14 dataset, we follow the official split, and for the ECG datasets, we randomly split a dataset into training, validation, and test parts by 7:1:2 since the official test set is not available.
V-B Experimental Setups
We use DenseNet-121 [huang2017densely] as the CXR classifier baseline and ResNet-34 [he2016identity] as the ECG classifier baseline. Two convolutional layers are added ahead of the DenseNet-121 network, similar to [DNetLoc], both with a kernel size of 3 and a stride of 2. For CXR image classification, we follow the weighted binary cross-entropy loss function [DNetLoc, Wang2017CVPR], weighting the loss term for an abnormality with its inverse proportion. In ResNet-34, 1D convolution kernels of size 3 are used to replace the 2D convolution kernels. During training, we set the batch size as 32 and employ the Adam optimizer [Adam]. The learning rate is reduced by a fixed factor when the validation loss reaches a plateau. We run 50 epochs for CXR image classification and 200 epochs for ECG classification. To validate FlowMixup and compare it to the known regularization methods on multi-labeled classification with label corruption, we replace the labels with corrupted labels with a given probability (the label corruption rate). Four label corruption rates are used for the ECG records and four for the CXR images; the rates for the CXR images take into account that the original labels are already corrupted. The mixing operation is applied to the input of the third and fifth ResBlock and DenseBlock, for both Manifold Mixup and FlowMixup. We report the average AUC (area under the ROC curve) over the 14 kinds of abnormalities on the CXR test set, and report Macro-F1 (macro-averaging on F1 scores) on the two ECG test sets.
V-C Experimental Results
V-C1 Performance Comparison
The experimental results on the ChestX-ray14 dataset are reported in Table I and the results on the ECG55 and ECG12 datasets are in Table II. We compare the indicators of our proposed FlowMixup with those of several known state-of-the-art regularization methods, including the Empirical Risk Minimization (ERM) principle [ERM], Mixup [MIXUP], and Manifold Mixup [manifoldmixup]. We report the best and the last performances in CXR classification; we report only the best performances in ECG classification, since the last performances are very close to the best performances on the ECG datasets. One can see that, in terms of the best performance, FlowMixup outperforms the other regularization methods in 11 of the 12 dataset and corruption-rate settings (the exception is the highest-numbered corruption rate on ECG55, where Manifold Mixup is slightly better). This validates the capability of FlowMixup in dealing with various degrees of label corruption. FlowMixup attains better performances than Mixup, which might result from the abnormality-specific features extracted by the nonlinear part (see Sec. IV-B). In ECG classification, FlowMixup outperforms Mixup and Manifold Mixup in most settings, and a similar conclusion can be derived. Further, one can see from Fig. 6 that FlowMixup outperforms Mixup in most classes in F1 scores and AUCs.
V-C2 Correlation Conflict Reduction
To further evaluate FlowMixup’s ability to reduce correlation conflicts, we compare the AUC and the F1 score of every abnormality between Mixup and FlowMixup on the ChestX-ray14 test set and the ECG12 test set, respectively. The histograms of these AUCs and F1 scores are shown in Fig. 6, both with 10% label corruption. For easy comparison, we define two indicators for every class $c$. The “Performance Ratio” is the ratio between the normalized performances of Mixup and FlowMixup on class $c$, raised to an exponent (chosen separately for CXR images and for ECG records). The “Independent Ratio” is $N^{\mathrm{single}}_{c} / N_{c}$, where $N^{\mathrm{single}}_{c}$ is the number of images with only the class $c$, and $N_{c}$ is the number of all the images with the class $c$, including multi-labeled images. The “Performance Ratio” (which is the “AUC Ratio” for CXRs and the “F1 Ratio” for ECGs in Fig. 6) indicates the relative performance of Mixup and FlowMixup on every abnormality, while the “Independent Ratio” suggests to what degree a class is independent in a dataset. The performances are normalized before computing the “Performance Ratio”. Thus, one can see whether the relative performances are related to the class independence by checking how closely the “Performance Ratio” and the “Independent Ratio” coincide. In Fig. 6, the “Performance Ratio” curves coincide with the “Independent Ratio” curves (as measured by Spearman correlation coefficients), indicating that FlowMixup can obtain better performance on relatively dependent classes. Hence, we believe FlowMixup is able to reduce the correlation conflicts.
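The “Independent Ratio” can be computed directly from a multi-hot label matrix, as in the sketch below. The function name `independent_ratio` and the convention of returning NaN for a class that never occurs are assumptions for illustration.

```python
import numpy as np

def independent_ratio(labels):
    """Per-class share of images whose only label is that class.

    labels: (N, C) multi-hot matrix (1 if the image carries the class).
    Classes that never occur get NaN (an illustrative convention).
    """
    labels = np.asarray(labels).astype(bool)
    n_class = labels.sum(axis=0)                        # images containing class c
    single = labels.sum(axis=1) == 1                    # images with exactly one label
    n_single = (labels & single[:, None]).sum(axis=0)   # images with only class c
    out = np.full(labels.shape[1], np.nan)
    return np.divide(n_single, n_class, out=out, where=n_class > 0)

# Usage: class 0 appears alone in 1 of its 2 images, class 1 in 2 of its 3.
toy_labels = np.array([[1, 0, 0],
                       [1, 1, 0],
                       [0, 1, 0],
                       [0, 1, 0]])
ratios = independent_ratio(toy_labels)
```

A low ratio marks a class that mostly co-occurs with other abnormalities, which is where correlation conflicts would be expected to matter most.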
V-C3 Distribution Shift Reduction
To evaluate FlowMixup’s ability to reduce the “distribution shift” phenomenon, we compute the differences (Diff.) between the Best AUCs and the Last AUCs on the ChestX-ray14 test set, shown in Table I. The differences on the ECG test sets are not reported since the best and the last Macro-F1 scores are very close. Further, we compute the variances of the normalized performance indicators (AUCs for CXR images and Macro-F1 scores for ECG records) of some epochs on the test sets, as:
$$\sigma^2 = \frac{1}{T}\sum_{t=1}^{T}\left(p_t - \bar{p}\right)^2, \qquad (8)$$
where $p_t$ is the normalized performance on the test set at epoch $t$, $\bar{p}$ is the average performance, and $t$ is the index of epochs. The normalization method is the min-max normalization. The variances are computed for the early 20 epochs on the ChestX-ray14 dataset ($T = 20$) and for the early 100 epochs on the two ECG datasets ($T = 100$), as the indicators fluctuate only slightly in the remaining epochs. As shown in Table III, FlowMixup has lower variances than Manifold Mixup in 10 of the 12 settings, which suggests that FlowMixup is more stable. Comparing the differences and variances, training models with Manifold Mixup is not as stable as with FlowMixup, which might be due to the instability caused by “distribution shift”, as discussed in Sec. IV-C.
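The stability indicator described above (variance of min-max normalized per-epoch performance) can be sketched as follows. Two details are assumptions made for illustration: the min-max normalization is applied over the same window of epochs, and the population variance (dividing by the number of epochs) is used. The function name `normalized_variance` is hypothetical.

```python
import numpy as np

def normalized_variance(scores):
    """Variance of min-max normalized per-epoch scores (AUC or Macro-F1)."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return 0.0                        # constant curve: no fluctuation
    p = (scores - lo) / (hi - lo)         # min-max normalization to [0, 1]
    return float(np.mean((p - p.mean()) ** 2))

# Usage: per-epoch test scores of the early epochs of one training run.
epoch_scores = [0.70, 0.75, 0.80]
stability = normalized_variance(epoch_scores)
```

A smaller value indicates a smoother test curve over the early epochs, which is the sense in which one method is called more stable than another.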
V-C4 Hyperparameter
Following the suggestions in [MIXUP, manifoldmixup] on setting the mixing degree $\alpha$ when dealing with corrupted labels, we find that, in our tasks, the models perform well with the $\alpha$ values reported in Tables I and II. FlowMixup seems to be insensitive to $\alpha$: the results fluctuate only within a small range on the ECG datasets and on the ChestX-ray14 dataset with different $\alpha$.
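Table III: Variances of the normalized performance indicators over the early 20 epochs (ChestX-ray14) and the early 100 epochs (ECG12, ECG55) under four label corruption rates (Rate 1 to Rate 4).
| Regularization | ChestX-ray14 Rate 1 | ChestX-ray14 Rate 2 | ChestX-ray14 Rate 3 | ChestX-ray14 Rate 4 | ECG12 Rate 1 | ECG12 Rate 2 | ECG12 Rate 3 | ECG12 Rate 4 | ECG55 Rate 1 | ECG55 Rate 2 | ECG55 Rate 3 | ECG55 Rate 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Manifold Mixup | 0.0636 | 0.0891 | 0.1208 | 0.0909 | 0.0350 | 0.0336 | 0.0233 | 0.0316 | 0.0376 | 0.0538 | 0.0422 | 0.0276 |
| FlowMixup | 0.0745 | 0.0865 | 0.0732 | 0.0714 | 0.0310 | 0.0280 | 0.0196 | 0.0237 | 0.0408 | 0.0530 | 0.0374 | 0.0270 |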
VI Conclusions
In this paper, we proposed a new regularization approach, FlowMixup, for multi-labeled medical image classification with corrupted labels. Guided by FlowMixup, a deep learning classifier extracts abnormality-specific features and then maps such features into the label space. Experiments verified that FlowMixup can handle datasets containing corrupted labels, which makes it possible to apply automatic annotation. Besides, we compared FlowMixup with the common Mixup and Manifold Mixup methods, highlighted the characteristics of FlowMixup, and discussed the “correlation conflicts” and “distribution shift” phenomena that occur when using Mixup or Manifold Mixup.
VII Acknowledgements
This research was partially supported by the National Research and Development Program of China under grant No. 2019YFB1404802, No. 2019YFC0118802, and No. 2018AAA0102102, the National Natural Science Foundation of China under grant No. 61672453, the Zhejiang University Education Foundation under grants No. K18511120004, No. K17511120017, and No. K1751805102, the Zhejiang public welfare technology research project under grant No. LGF20F020013, and the Key Laboratory of Medical Neurobiology of Zhejiang Province. D. Z. Chen’s research was supported in part by NSF Grant CCF1617735.