Meta ordinal weighting net for improving lung nodule classification

by Yiming Lei, et al.

The progression of lung cancer implies an intrinsic ordinal relationship among lung nodules at different stages, from benign to unsure and then to malignant. This problem can be addressed by ordinal regression methods, which lie between classification and regression because of the ordinal labels. However, existing convolutional neural network (CNN)-based ordinal regression methods focus only on modifying the classification head based on a randomly sampled mini-batch of data, ignoring the ordinal relationship residing in the data itself. In this paper, we propose a Meta Ordinal Weighting Network (MOW-Net) to explicitly align each training sample with a meta ordinal set (MOS) containing a few samples from all classes. During the training process, the MOW-Net learns a mapping from samples in the MOS to the corresponding class-specific weights. In addition, we propose a meta cross-entropy (MCE) loss to optimize the network in a meta-learning scheme. The experimental results demonstrate that the MOW-Net achieves better accuracy than the state-of-the-art ordinal regression methods, especially for the unsure class.




1 Introduction

Deep convolutional neural networks (CNNs) have recently achieved impressive results in lung nodule classification based on computed tomography (CT) [1, 2, 3, 4, 5, 6, 7, 8]. However, these methods usually perform a binary classification (malignant vs. benign) and omit the unsure nodules (those between benign and malignant), which wastes a large amount of medical data, especially for data-hungry deep learning methods. Therefore, how to leverage unsure data to learn a robust model is crucial for lung nodule classification.

To this end, ordinal regression has been widely explored to utilize those unsure nodules. The unsure data model (UDM) [9] was proposed to learn with unsure lung nodules; it regards this classification as an ordinal regression problem and optimizes the negative logarithm of cumulative probabilities. However, the UDM has some additional parameters that need to be carefully tuned. The neural stick-breaking (NSB) method calculates the probabilities through the $C-1$ predicted classification bounds, where $C$ is the number of classes [10]. The unimodal method makes each fully-connected output follow a unimodal distribution such as Poisson or Binomial [11]. However, these methods do not guarantee a strict ordinal relationship. Recently, convolutional ordinal regression forests (CORFs) have aimed to solve this problem by combining CNNs with random forests [12], and they have been shown to be effective for lung nodule classification in a meta-learning framework [13]. In summary, the existing methods do not explicitly leverage the ordinal relationship residing in the data itself.

Figure 1: Illustration of the MOW-Net. Each training sample is aligned with a meta ordinal set that contains samples of each class. Forward: $\mathcal{L}^{\mathrm{CE}}_{c,i}$ denotes the cross-entropy loss of class $c$ w.r.t. the $i$-th training sample, and each $\mathcal{L}^{\mathrm{CE}}_{c,i}$ is mapped to the corresponding weight $\omega_c$ through the class-specific MLP. Backward: the two parts of parameters, $\theta$ and $\phi$, are updated alternately based on the proposed MCE loss and the meta loss, respectively.

In this paper, we assume that the ordinal relationship resides in not only the label but also the data itself. Put differently, the ordinal relationship of features from different classes also dominates the generalization ability of the model. Therefore, we propose a meta ordinal weighting network (MOW-Net) to learn the ordinal regression between each training sample and a set of samples of all classes simultaneously, where this set is termed the meta ordinal set (MOS). Here, the MOS carries the feature and semantic information of the dataset and acts as the meta knowledge for the training sample. As shown in Fig. 1, each training sample is associated with an MOS, and the meta samples in the MOS are mapped to the corresponding weights representing the meta knowledge of all classes. Furthermore, we propose a novel meta cross-entropy (MCE) loss for training the MOW-Net; each term is weighted by the learned meta weight, as shown in Fig. 1. Unlike the normal CE loss, the MCE loss means that the training sample is guided by the meta weights learned from the MOS. Hence, the MOW-Net is able to boost the classification performance and generalization ability with supervision from the meta ordinal knowledge. Moreover, our MOW-Net is able to reflect the classification difficulty of each class in the dataset, which is analyzed in Sec. 3.4.

The main contributions of this work are summarized as follows. First, we propose a meta ordinal weighting network (MOW-Net) for lung nodule classification with unsure nodules. The MOW-Net contains a backbone network for classification and a mapping branch for meta ordinal knowledge learning. Second, we propose a meta cross-entropy (MCE) loss for the training of the MOW-Net in a meta-learning scheme, which is based on a meta ordinal set (MOS) that contains a few training samples from all classes and provides the meta knowledge for the target training sample. Last, the experimental results demonstrate the significant performance improvements in contrast to the state-of-the-art ordinal regression methods. In addition, the changes in learned meta weights reflect the difficulties of classifying each class.

2 Methodology

This section introduces the proposed meta ordinal set (MOS), meta cross-entropy (MCE) loss, and the meta training algorithm, respectively.

2.1 Meta Ordinal Set (MOS)

We assume that the ordinal relationship resides in not only the label but also the data itself. Therefore, we align each target training sample with an MOS that contains samples from each class. The MOS for the $i$-th training sample is formally defined as follows:

$$\mathcal{S}_i = \{x_c^k\}_{c=1,\dots,C}^{k=1,\dots,K}, \tag{1}$$

where $C$ is the number of classes and $K$ is the number of samples drawn from each class. Note that the samples in $\mathcal{S}_i$ are not ordered, but the samples of each class should go to the corresponding multi-layer perceptron (MLP) in Fig. 1. The MLPs are then able to learn class-specific knowledge. For a target training sample $x_i$, $\mathcal{S}_i$ is randomly sampled from the training set, and $x_i \notin \mathcal{S}_i$.
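The sampling described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `sample_mos`, the toy label list, and the choice to return sample indices grouped by class are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict


def sample_mos(labels, target_index, k, rng):
    """Draw k sample indices per class, never including the target sample."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        if idx != target_index:  # the target sample must not appear in its own MOS
            by_class[label].append(idx)
    # One list of k indices per class; the order inside each list carries no meaning.
    return {c: rng.sample(pool, k) for c, pool in sorted(by_class.items())}


# Hypothetical toy labels: 0 = benign, 1 = unsure, 2 = malignant.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
mos = sample_mos(labels, target_index=4, k=2, rng=random.Random(0))
print(mos)
```

Grouping the indices by class mirrors the routing in Fig. 1, where the samples of one class feed the MLP of that class.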

2.2 Meta Cross-Entropy Loss

In order to enable the MOW-Net to absorb the meta knowledge provided by the MOS, we propose an MCE loss to align the meta knowledge of each class to the corresponding entropy term:

$$\mathcal{L}_{\mathrm{MCE}} = -\sum_{c=1}^{C} \omega_c \, y_c \log \hat{y}_c, \tag{2}$$

where $\hat{y}_c$ and $\omega_c$ are the prediction and the learned meta weight of the $c$-th class, respectively, and $y_c$ is the $c$-th entry of the one-hot label. Note that the MCE loss involves no ordinal regression tricks such as cumulative probabilities; it only couples the meta data with the predictions. Compared with the conventional CE loss, the MCE loss enables the training samples to be supervised by the corresponding meta data; hence, the learning of the MOW-Net takes into account the meta ordinal knowledge residing in the data itself.
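As a small numerical illustration of how class-specific weights change the cross-entropy, the sketch below assumes that the MCE loss is a per-class weighted cross-entropy over a one-hot label, i.e. the sum of -w_c * y_c * log(p_c) over classes. The function names and the example numbers are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Conventional CE loss for one sample with an integer class label."""
    return -math.log(probs[label])


def meta_cross_entropy(probs, label, weights):
    """Weighted CE: each class term is scaled by its meta weight (assumed form)."""
    total = 0.0
    for c, (p, w) in enumerate(zip(probs, weights)):
        y_c = 1.0 if c == label else 0.0  # one-hot label entry
        total -= w * y_c * math.log(p)
    return total


probs = [0.2, 0.5, 0.3]    # hypothetical softmax output (benign, unsure, malignant)
weights = [0.2, 0.9, 0.4]  # hypothetical learned meta weights
print(cross_entropy(probs, 1), meta_cross_entropy(probs, 1, weights))
```

With all weights equal to one, this form reduces to the conventional CE loss, which is consistent with the statement that the classification head itself is unchanged.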

2.3 Training Algorithm

The MOW-Net is trained in a meta-learning scheme, which requires second derivatives [14, 15, 16, 17, 18] and two parts of the parameters, $\theta$ and $\phi$, to be updated alternately. We highlight that the MOW-Net offers a novel way of utilizing the ordinal relationship encapsulated within the data itself; however, the model is still the same as the one trained with the CE loss. Moreover, our MOW-Net does not modify the classification head and can be adapted to various backbones.

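Although the whole MOW-Net contains two parts of parameters, $\theta$ and $\phi$, the trained model discards the MLPs ($\phi$) at the inference stage. In other words, the MLPs are only involved in the meta training phase to produce the class-specific knowledge. The optimal $\theta$ is then learned by minimizing the following objective function:

$$\theta^{*}(\phi) = \arg\min_{\theta} \mathcal{L}_{\mathrm{MCE}}(\theta; \phi) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} \mathcal{V}_c\big(\mathcal{L}^{\mathrm{CE}}_{c,i}(\theta); \phi_c\big)\, y_{i,c} \log \hat{y}_{i,c}, \tag{3}$$

where $\mathcal{L}^{\mathrm{CE}}_{c,i}$ denotes the conventional CE loss of the $c$-th meta data with respect to the $i$-th training sample, $\mathcal{V}_c$ is the $c$-th MLP with $\mathcal{L}^{\mathrm{CE}}_{c,i}$ being the input (Fig. 1), $y_{i,c}$ and $\hat{y}_{i,c}$ are the one-hot label and the prediction of the $i$-th sample for class $c$, and $N$ is the number of training samples.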
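To make the loss-to-weight mapping concrete, the following sketch shows one possible class-specific MLP that maps a scalar CE loss to a weight in (0, 1). The architecture (one ReLU hidden layer and a sigmoid output) and all parameter values are assumptions for illustration; the text above does not specify them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_weight(ce_loss, params):
    """Hypothetical one-hidden-layer MLP: scalar CE loss -> weight in (0, 1)."""
    # Hidden layer with ReLU activation (assumed).
    hidden = [max(0.0, w * ce_loss + b) for w, b in zip(params["w1"], params["b1"])]
    # Linear output followed by a sigmoid so the weight stays bounded (assumed).
    out = sum(v * h for v, h in zip(params["w2"], hidden)) + params["b2"]
    return sigmoid(out)


# Hypothetical parameters for a single class.
params = {"w1": [1.0, -1.0], "b1": [0.0, 0.0], "w2": [1.0, 1.0], "b2": 0.0}
print(class_weight(0.5, params))
```

In this reading, one such parameter set would exist per class, and only these parameters are dropped at inference time.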

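Following [14], $\phi$ can be updated through the following objective function:

$$\phi^{*} = \arg\min_{\phi} \mathcal{L}^{\mathrm{Meta}}(\theta^{*}(\phi)) = \frac{1}{M}\sum_{j=1}^{M} \mathcal{L}^{\mathrm{CE}}_{j}(\theta^{*}(\phi)), \tag{4}$$

where $M$ is the size of the MOS. Then we update these two parts of parameters alternately using meta-learning [14, 15]. First, we calculate the derivative of $\theta$ through Eq. (3):

$$\hat{\theta}^{(t)}(\phi) = \theta^{(t)} - \alpha \nabla_{\theta} \mathcal{L}_{\mathrm{MCE}}(\theta; \phi)\big|_{\theta^{(t)}}. \tag{5}$$

Next, $\phi$ can be updated as follows:

$$\phi^{(t+1)} = \phi^{(t)} - \beta \nabla_{\phi} \mathcal{L}^{\mathrm{Meta}}\big(\hat{\theta}^{(t)}(\phi)\big)\big|_{\phi^{(t)}}. \tag{6}$$

Last, $\theta$ is updated based on $\phi^{(t+1)}$:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_{\theta} \mathcal{L}_{\mathrm{MCE}}(\theta; \phi^{(t+1)})\big|_{\theta^{(t)}}. \tag{7}$$

In the above equations, the superscript $(t)$ represents the $t$-th iteration during training, and $\alpha$ and $\beta$ are the learning rates for $\theta$ and $\phi$, respectively. The training algorithm is detailed in Algorithm 1. Note that after updating $\phi$ (line 8 in Algorithm 1), the MOW-Net obtains the meta knowledge by taking the normal CE loss of the training sample as the input to all MLPs. Here, we can regard the updated MLPs as the prior knowledge for each ordinal class.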

To further analyze Eq. (6), we can obtain:


Eq. (9) shows that the derivative of the entropy of the training sample (second term) is driven toward the derivative of the MOS (first term). This implies that the learning of the backbone network is guided by the meta knowledge. Please refer to for the detailed derivation.

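Input: Training data $\mathcal{D}$, randomly sampled MOS $\mathcal{S}$.
Output: The learned parameter $\theta$.
1: Initialize $\theta^{(0)}$ and $\phi^{(0)}$.
2: for each training iteration $t$ do
3:     Sample a mini-batch of $x_i$ from $\mathcal{D}$, and align each $x_i$ with a MOS $\mathcal{S}_i$.
4:     Forward: input $x_i$ and $\mathcal{S}_i$ with $\theta^{(t)}$ to obtain their respective outputs.
5:     Compute the first-order derivative by Eq. (5).
6:     Forward: input $\mathcal{S}_i$ with $\hat{\theta}^{(t)}$, then obtain the meta CE losses.
7:     Update $\phi$ by Eq. (6).
8:     Forward: input $x_i$ with $\theta^{(t)}$ and calculate the CE loss; then input this loss into all the MLPs to obtain the weights $\omega_c$.
9:     Update $\theta$ by Eq. (7).
10: end for
Algorithm 1 Training Algorithm of MOW-Net.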
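The alternating update can be illustrated on a deliberately tiny problem. The sketch below is not the MOW-Net itself: it assumes a scalar backbone parameter `theta`, a scalar weighting parameter `phi` whose weight is `sigmoid(phi)`, and quadratic stand-ins for the training loss `w * (theta - a)**2` and the meta loss `(theta - b)**2`. All names and constants are hypothetical; the point is only the three-step pattern of a virtual step, a meta update through that step, and the actual update.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def meta_step(theta, phi, a, b, alpha=0.1, beta=1.0):
    """One alternating update on a toy scalar problem (assumed losses, see lead-in)."""
    w = sigmoid(phi)
    # Step 1: virtual update of theta on the weighted training loss.
    grad_train = w * 2.0 * (theta - a)
    theta_hat = theta - alpha * grad_train
    # Step 2: update phi by differentiating the meta loss through theta_hat.
    # d(theta_hat)/d(phi) = -alpha * sigmoid'(phi) * 2 * (theta - a)
    dtheta_hat_dphi = -alpha * w * (1.0 - w) * 2.0 * (theta - a)
    grad_phi = 2.0 * (theta_hat - b) * dtheta_hat_dphi
    phi_new = phi - beta * grad_phi
    # Step 3: actual update of theta using the refreshed weight.
    theta_new = theta - alpha * sigmoid(phi_new) * 2.0 * (theta - a)
    return theta_new, phi_new


# Training target agrees with the meta target: the weight should grow.
print(meta_step(0.0, 0.0, a=1.0, b=1.0))
# Training target conflicts with the meta target: the weight should shrink.
print(meta_step(0.0, 0.0, a=-1.0, b=1.0))
```

The dependence of `theta_hat` on `phi` is what makes the second step require a derivative through the first, which is why the scheme needs second derivatives in the general case.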

3 Experiments

In this section, we report the classification performance of our MOW-Net on the dataset LIDC-IDRI [19].

3.1 Dataset

LIDC-IDRI is a publicly available dataset for low dose CT-based lung nodule analysis, which includes 1,010 patients. Each nodule was rated on a scale of 1 to 5 by four thoracic radiologists, indicating an increased probability of malignancy. In this paper, the ROI of each nodule was cropped at its annotated center, with a square shape of a doubled equivalent diameter. An averaged score of a nodule was used as ground-truth for the model training. All volumes were resampled to have 1mm spacing (original spacing ranged from mm to mm) in each dimension, and the cropped ROIs are of the size . The averaged scores range from 1 to 5, and in our experiments, we regard a nodule with a score between 2.5 and 3.5 as the unsure nodule; benign and malignant nodules are those with scores lower than 2.5 and higher than 3.5, respectively [9].
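The labeling rule above can be written as a short function. This is a sketch under stated assumptions: the function name and the example ratings are hypothetical, and scores of exactly 2.5 or 3.5 are treated as unsure, following the "between 2.5 and 3.5" wording.

```python
from statistics import mean


def nodule_label(avg_score):
    """Map an averaged malignancy score (1 to 5) to one of three ordinal classes."""
    if avg_score < 2.5:
        return "benign"
    if avg_score > 3.5:
        return "malignant"
    return "unsure"  # 2.5 <= score <= 3.5


# Hypothetical ratings from four radiologists for one nodule.
ratings = [2, 3, 3, 4]
print(nodule_label(mean(ratings)))
```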

3.2 Implementation Details

We used the VGG-16 as the backbone network [20], and made the following changes: 1) the input channel is 32 following [4]; 2) we only keep the first seven convolutional layers due to the small size of the input, each followed by batch normalization (BN) and ReLU; and 3) the final classifier is a two-layer perceptron that has 4096 neurons in the hidden layer. We use 80% of the data for training and the remaining data for testing.

The hyperparameters for all experiments are set as follows: the learning rate is 0.0001 and is decayed by a factor of 0.1 every 80 epochs; the mini-batch size is 16; and the weight decay for the Adam optimizer [21] is 0.0001. The symbols P, R, and F1 in our results stand for precision, recall, and F1 score, respectively [9].

Method          Accuracy  Benign                Malignant             Unsure
                          P      R      F1      P      R      F1      P      R      F1
CE Loss         0.517     0.538  0.668  0.596   0.562  0.495  0.526   0.456  0.360  0.402
Poisson [11]    0.542     0.548  0.794  0.648   0.568  0.624  0.594   0.489  0.220  0.303
NSB [10]        0.553     0.565  0.641  0.601   0.566  0.594  0.580   0.527  0.435  0.476
UDM [9]         0.548     0.541  0.767  0.635   0.712  0.515  0.598   0.474  0.320  0.382
CORF [12]       0.559     0.590  0.627  0.608   0.704  0.495  0.581   0.476  0.515  0.495
MOW-Net ()      0.629     0.752  0.489  0.592   0.558  0.851  0.675   0.600  0.675  0.635
MOW-Net ()      0.672     0.764  0.596  0.670   0.600  0.802  0.686   0.642  0.690  0.665
MOW-Net ()      0.687     0.768  0.623  0.688   0.668  0.705  0.686   0.606  0.792  0.687
Table 1: Classification results on the LIDC-IDRI dataset. Following [9], underlined values indicate results that are the best but less important in clinical diagnosis.
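The F1 column can be cross-checked against the precision and recall columns with the standard harmonic-mean formula. The sketch below uses two rows copied from Table 1; the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from Table 1.
ce_benign = (0.538, 0.668)        # CE Loss, benign class
mow_unsure_last = (0.606, 0.792)  # last MOW-Net row, unsure class

print(round(f1_score(*ce_benign), 3))        # 0.596
print(round(f1_score(*mow_unsure_last), 3))  # 0.687
```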

Figure 2: The visualization results on the testing set using t-SNE.

Figure 3: The variations of the learned weights for all classes.

3.3 Classification Performance

In our experiments, we mainly focus on the precision of the benign class and the recall of the malignant and unsure classes [9]. We compared our MOW-Net with the state-of-the-art ordinal regression methods and the normal CE loss. Table 1 shows that the MOW-Net achieves the best accuracy by a large margin over the other methods. Specifically, the MOW-Net improves the recall of the unsure class by 0.28 over the previous best result (0.792 vs. 0.515 for CORF). This matters for clinical diagnosis, since a higher recall of the unsure class can encourage more follow-ups and reduce the probability that nodules are misdiagnosed as malignant or benign. In addition, the precision of the benign class and the recall of the malignant class also improve considerably.
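The reported margins can be reproduced directly from Table 1. The sketch below copies the unsure-class recall and the accuracy of the baseline rows and compares them with the last MOW-Net row; the variable names are hypothetical.

```python
# Unsure-class recall and overall accuracy of the baselines, copied from Table 1.
baseline_unsure_recall = {
    "CE Loss": 0.360, "Poisson": 0.220, "NSB": 0.435, "UDM": 0.320, "CORF": 0.515,
}
baseline_accuracy = {
    "CE Loss": 0.517, "Poisson": 0.542, "NSB": 0.553, "UDM": 0.548, "CORF": 0.559,
}
# Last MOW-Net row of Table 1.
mow_unsure_recall = 0.792
mow_accuracy = 0.687

best_prior = max(baseline_unsure_recall, key=baseline_unsure_recall.get)
recall_gain = mow_unsure_recall - baseline_unsure_recall[best_prior]
accuracy_gain = mow_accuracy - max(baseline_accuracy.values())

print(best_prior, round(recall_gain, 2), round(accuracy_gain, 3))
```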

3.4 Analysis on Learned Weights

To further understand the weighting scheme of the MOW-Net, we plot the variations of the learned weights in Fig. 3. At the beginning of training, the weight for the unsure class increases while the weight for the malignant class decreases, indicating that the MOW-Net focuses on separating the unsure class from the other two classes and that the malignant class is easy to classify. Then, at epoch 10, the trends of these two weights reverse. The curve of the benign class fluctuates slightly throughout the training process. At epoch 45, the weights for all three classes begin to converge. This indicates that the model pays different amounts of attention (weights) to different classes, and this attention affects the update of the backbone network. At the end of training, the model has similar sensitivities for each class.

Consistent with Fig. 2, the malignant samples are easier to classify than the other two classes at the beginning. At epoch 10, the unsure samples are heavily mixed with the other samples, so the unsure class has the highest weight. At the same time, the malignant class performs worse than at the beginning. As training continues, the weight for the malignant class begins to increase. At epoch 45, the malignant samples are clustered again and the unsure samples are more concentrated than in previous epochs. At epoch 97, the model achieves the best accuracy, and the samples are clearly distributed in order, which demonstrates the effectiveness of the meta ordinal set.

3.5 Effects on the Size of MOS

The definition of the MOS in Eq. (1) shows that the parameter determines the number of samples of each class. Here, we explore the effect of varied . Table 1 shows that when , the MOW-Net obtained the best performance. The performance of and is better than that of , which indicates that the more number in MOS, the better generalizability of the model.

4 Conclusions

In this paper, we proposed the MOW-Net and the corresponding MOS to explore the ordinal relationship residing in the data itself for lung nodule classification in a meta-learning scheme. The experimental results empirically demonstrate a significant improvement over existing methods. The visualization results further confirm the effectiveness of the weighting scheme and the learned ordinal relationship.


  • [1] A. A. A. Setio, F. Ciompi, G. Litjens et al., “Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1160–1169, 2016.
  • [2] S. Hussein, R. Gillies, K. Cao, Q. Song, and U. Bagci, “TumorNet: Lung nodule characterization using multi-view convolutional neural network with Gaussian process,” in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI), 2017, pp. 1007–1010.
  • [3] H. Shan, G. Wang, M. K. Kalra, R. de Souza, and J. Zhang, “Enhancing transferability of features from pretrained deep neural networks for lung nodule classification,” in The Proceedings of the 2017 International Conference on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine (Fully3D), 2017, pp. 65–68.
  • [4] Y. Lei, Y. Tian, H. Shan, J. Zhang, G. Wang, and M. K. Kalra, “Shape and margin-aware lung nodule classification in low-dose CT images via soft activation mapping,” Medical Image Analysis, vol. 60, p. 101628, 2020.
  • [5] Q. Zhang, J. Zhou, and B. Zhang, “A noninvasive method to detect diabetes mellitus and lung cancer using the stacked sparse autoencoder,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1409–1413.
  • [6] Y. Li, D. Gu, Z. Wen, F. Jiang, and S. Liu, “Classify and explain: An interpretable convolutional neural network for lung cancer diagnosis,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1065–1069.
  • [7] F. Li, H. Huang, Y. Wu, C. Cai, Y. Huang, and X. Ding, “Lung nodule detection with a 3D convnet via iou self-normalization and maxout unit,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 1214–1218.
  • [8] R. Xu, Z. Cong, X. Ye, Y. Hirano, and S. Kido, “Pulmonary textures classification using a deep neural network with appearance and geometry cues,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 1025–1029.
  • [9] B. Wu, X. Sun, L. Hu, and Y. Wang, “Learning with unsure data for medical image diagnosis,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 10590–10599.
  • [10] X. Liu, Y. Zou, Y. Song, C. Yang, J. You, and B. K Vijaya Kumar, “Ordinal regression with neuron stick-breaking for medical diagnosis,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 335–344.
  • [11] C. Beckham and C. Pal, “Unimodal probability distributions for deep ordinal classification,” arXiv preprint arXiv:1705.05278, 2017.
  • [12] H. Zhu, Y. Zhang, H. Shan, L. Che, X. Xu, J. Zhang, J. Shi, and F.-Y. Wang, “Convolutional ordinal regression forest for image ordinal estimation,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [13] Y. Lei, H. Zhu, J. Zhang, and H. Shan, “Meta ordinal regression forest for learning with unsure lung nodules,” in 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2020, pp. 442–445.
  • [14] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng, “Meta-weight-net: Learning an explicit mapping for sample weighting,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 1919–1930.
  • [15] S. Liu, A. Davison, and E. Johns, “Self-supervised generalisation with meta auxiliary learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 1679–1689.
  • [16] R. Vuorio, S.-H. Sun, H. Hu, and J. J. Lim, “Multimodal model-agnostic meta-learning via task-aware modulation,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 1–12.
  • [17] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.
  • [18] M. A. Jamal and G.-J. Qi, “Task agnostic meta-learning for few-shot learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11719–11727.
  • [19] S. G. Armato III, G. McLennan, L. Bidaut et al., “The lung image database consortium (LIDC) and image database resource initiative (IDRI): A completed reference database of lung nodules on CT scans,” Medical Physics, vol. 38, no. 2, pp. 915–931, 2011.
  • [20] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015, pp. 1–14.
  • [21] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.