
MVCNet: Multiview Contrastive Network for Unsupervised Representation Learning for 3D CT Lesions

by Penghua Zhai, et al.

Objective and Impact Statement. With the renaissance of deep learning, automatic diagnostic systems for computed tomography (CT) have achieved many successful applications. However, these successes are mostly attributable to careful expert annotations, which are often scarce in practice. This motivates our interest in unsupervised representation learning. Introduction. Recent studies have shown that self-supervised learning is an effective approach for learning representations, but most of them rely on the empirical design of transformations and pretext tasks. Methods. To avoid the subjectivity associated with these methods, we propose MVCNet, a novel unsupervised three-dimensional (3D) representation learning method that works in a transformation-free manner. We view each 3D lesion from different orientations to collect multiple two-dimensional (2D) views. Then, an embedding function is learned by minimizing a contrastive loss so that the 2D views of the same 3D lesion are aggregated and the 2D views of different lesions are separated. We evaluate the representations by training a simple classification head on top of the embedding layer. Results. Experimental results show that MVCNet achieves state-of-the-art accuracies on the LIDC-IDRI (89.55%), LNDb (77.69%) and TianChi (79.96%) datasets for unsupervised representation learning. When fine-tuned on 10% of the labeled data, the accuracies are comparable to those of the supervised learning model (89.46% vs. 85.03%, 73.85% vs. 73.44%, 83.56% vs. 83.34% on the three datasets, respectively). Conclusion. The results indicate the superiority of MVCNet in learning representations with limited annotations.





1 Introduction

Computed tomography (CT) is one of the most widely used examinations in clinical practice. CT scans contain hundreds of slices, making it time-consuming for physicians to browse and analyze them slice by slice. In addition, image interpretation may vary among physicians, leading to ambiguity in decision-making. With the development of deep learning, computer-aided diagnosis (CAD) systems have greatly improved the efficiency and diagnostic accuracy of physicians. Currently, building a CAD system usually requires a large amount of annotated data. However, annotating CT scans is laborious and time-consuming for physicians. It is also difficult to aggregate data from different institutes due to the data island effect. Therefore, CAD systems are still constrained by the lack of annotated volumetric data.

To cope with the lack of annotated data, researchers have attempted to exploit useful information from unlabeled data with unsupervised learning algorithms [dike2018unsupervised, pouyanfar2018survey, chellappa2021advances]. Recently, self-supervised learning (SSL), a new unsupervised learning paradigm, has attracted increasing attention due to its excellent representation learning ability [noroozi2016unsupervised, caron2018deep, chen2020simple, he2020momentum, grill2020bootstrap, jing2020self, lee2020predicting, liu2021self]. In Table S1, we compare several typical SSL methods in terms of three aspects: data type, transformation, and pretext task. SSL methods can be divided into two categories: pretext task methods [gidaris2018unsupervised, zhu2020rubik, chen2019self, doersch2015unsupervised] and transformation-based methods [chen2020simple, he2020momentum, grill2020bootstrap, chen2021exploring]. The former learn representations from pseudo-labels, for example by solving jigsaw puzzles [noroozi2016unsupervised] or predicting the relative positions of spatial patches [doersch2015unsupervised]. The latter typically use contrastive losses to make the model invariant to transformations such as rotation, painting, and coloring. However, the pretext tasks and transformations are designed empirically, and different transformations or pretext tasks have different optimal configurations depending on the downstream task [xiao2020should].

Due to the complexity of medical imaging, it is important to consider expert knowledge when designing pretext tasks and transformations [zhou2019models, zhu2020rubik, shen2017deep, li2020self]. For example, [zhu2020rubik] learned representations by solving a Rubik’s Cube task, and [zhou2019models] designed a series of transformations and learned representations by reconstructing the original image. Nevertheless, these transformations depend on human experience [zhou2019models, xiao2020should]. Developing a new SSL paradigm for learning representations in 3D CT scans without transformations and pretext tasks is a challenging yet important task.

Figure 1: Examples of lesions in the LIDC-IDRI, LNDb, and TianChi datasets. For LIDC-IDRI and LNDb, (a)-(d) are the malignant nodules and (e)-(h) are the benign nodules. For TianChi, (a)-(b), (c)-(d), (e)-(f) and (g)-(h) are the lung nodules, streak shadow, arteriosclerosis (calcification) and lymph node calcification.

To deal with these challenges, we propose a novel multi-view contrastive network (MVCNet) for learning 3D CT representations. Instead of applying traditional transformations, we extract nine views of each lesion from different orientations. Multi-view approaches of this kind have become prevalent in recent years [setio2016pulmonary, xie2018knowledge, xie2019semi, su15mvcnn, li2018survey, zhao2017multi]. We regard the views of the same lesion as positive pairs, and the views of different lesions as negative pairs. Inspired by the procedures of previous studies [tian2020contrastive, chen2020simple], we learn representations by minimizing a contrastive loss, resulting in an embedding space with within-lesion compactness and between-lesion separability. We validate the effectiveness of our method on three lung CT datasets. Experimental results show that the proposed MVCNet outperforms state-of-the-art SSL methods.

The contribution of this paper is threefold. 1) We propose a multi-view contrastive learning framework for unsupervised representation learning on 3D lung CT data. 2) We exploit a transformation-free method for SSL, where the transformations are replaced by 2D views of 3D lesions from different orientations. To the best of our knowledge, this is the first work to implement 3D SSL without transformations and explicit pretext tasks in the medical image analysis field. 3) Extensive experiments on three open CT datasets (LIDC-IDRI, LNDb and TianChi) demonstrate the superiority of MVCNet over existing state-of-the-art SSL methods. Moreover, MVCNet obtains results comparable to those of the 3D supervised methods, indicating that the gap between supervised learning and unsupervised learning has been largely bridged in our task. Figure 1 shows some lesions from the three datasets.

2 Results

Methods AUC () Sensitivity () Specificity () Accuracy () Precision ()
Natural Image Augmentation
Context [doersch2015unsupervised]
RotNet [gidaris2018unsupervised]
MoCo [he2020momentum]
MoCo V2 [chen2020improved]
SimCLR [chen2020simple]
BYOL [grill2020bootstrap]
SimSiam [chen2021exploring]
Medical Image Augmentation
MoCo [he2020momentum]
MoCo V2 [chen2020improved]
SimCLR [chen2020simple]
BYOL [grill2020bootstrap]
SimSiam [chen2021exploring]
Models Genesis [zhou2019models]
Rubik’s Cube+ [zhu2020rubik]
Restoration  [chen2019self]
Table 1: Comparison of MVCNet with state-of-the-art SSL methods on the LIDC-IDRI

2.1 Linear Evaluation

We evaluate the representations by training a linear classifier on top of the frozen representation, which is a common practice in previous studies [tian2020contrastive, chen2020simple]. Three sets of experiments on LIDC-IDRI are listed in Table 1: SSL based on natural image augmentations (NIA), SSL based on medical image augmentations (MIA), and the proposed MVCNet using nine views. Because of space limitations, the results on LNDb and TianChi are listed in Table S3 and Table S4. We use the transformations of SimCLR as the basic natural image augmentation and the augmentations of Models Genesis as the medical image augmentation (see Table S2). Models Genesis, Rubik’s Cube+, and Restoration were originally developed for medical imaging, and the others were developed for natural images. By comparing the metrics, we make the following observations:

First, the methods based on medical image augmentation perform better than the methods based on natural image augmentation when applied to CT scans. For SimCLR and MoCo V2, we obtain improvements of 1% and 7%, respectively, when changing from natural image augmentations to medical image augmentations. Rubik’s Cube+ obtains the highest accuracy among all the methods except the proposed MVCNet.

Second, for the proposed MVCNet, we use the different views decomposed from lesion volumes instead of transformations to learn representations for 3D lesions. As shown in Table 1, we obtain an accuracy of 89.55% using nine views. Compared with SimCLR (NIA), SimCLR (MIA) and Rubik’s Cube+, MVCNet achieves 10%, 9%, and 8% improvement, respectively. The results show that the proposed MVCNet performs better than previous state-of-the-art SSL methods.
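The threshold-based metrics reported in Table 1 (sensitivity, specificity, accuracy, precision) follow directly from the confusion counts of a binary classifier. The snippet below is a minimal sketch of those standard definitions; the function name `binary_metrics` and the convention that label 1 denotes the positive (e.g., malignant) class are assumptions made here for illustration and are not taken from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Standard confusion-matrix metrics for 0/1 labels.

    Assumption: label 1 is the positive class (e.g., malignant).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
    }
```

AUC is omitted because it is computed from the ranking of predicted scores rather than from thresholded predictions.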

Methods AUC () Sensitivity () Specificity () Accuracy () Precision ()
SimCLR (NIA) [chen2020simple]
SimCLR (MIA) [chen2020simple]
AlexNet [krizhevsky2012imagenet]
AlexNet (3D) [krizhevsky2012imagenet]
SimCLR (NIA) [chen2020simple]
SimCLR (MIA) [chen2020simple]
AlexNet [krizhevsky2012imagenet]
AlexNet (3D) [krizhevsky2012imagenet]

NIA: natural image augmentation. MIA: medical image augmentation. LE: linear evaluation. AlexNet and AlexNet (3D) are trained on the annotated data.

Table 2: Performance comparison of MVCNet using 10% data for fine-tuning on LIDC-IDRI and LNDb

2.2 Performance with Fine-tuning

To further evaluate the effectiveness of MVCNet with limited data, we fine-tune the model with 10% of the annotated samples in each dataset. This protocol has been used in previous studies [chen2020simple]. In Table 2, Figure 2 and Table 3, we also fine-tune SimCLR with 10% of each dataset and compare the results with the fully supervised methods trained on all the samples.

As shown in Table 2, the accuracies are 89.46% and 73.85% on LIDC-IDRI and LNDb, respectively. For LIDC-IDRI, MVCNet outperforms SimCLR and AlexNet by large margins (4% improvement over each). For LNDb, MVCNet is better than AlexNet but inferior to the 3D AlexNet. For the TianChi dataset, we diagnose each of the four diseases and use AUC as the metric because of the class imbalance. Figure 2 shows that MVCNet performs better than SimCLR and is comparable to the fully supervised AlexNet. We also conduct multi-disease diagnosis experiments on the TianChi dataset, as shown in Table 3. The diagnostic accuracy of MVCNet is slightly higher (0.22%) than that of AlexNet and 3% higher than that of SimCLR. Overall, MVCNet fine-tuned with 10% of each dataset is better than SimCLR and comparable to the fully supervised AlexNet.

Figure 2: Results of fine-tuning with 10% data on TianChi dataset. The vertical axis represents the diagnosis results evaluated by AUC. The horizontal axis represents different methods. NIA: natural image augmentation, MIA: medical image augmentation. LE: linear evaluation, FT: fine-tuning. Different colors represent different diseases (black: Nodules; red: Streak shadows; yellow: Arteriosclerosis (Calcification); white: Lymph node calcification).
Methods Accuracy ()
SimCLR (NIA) [chen2020simple]
SimCLR (MIA) [chen2020simple]
AlexNet [krizhevsky2012imagenet]
AlexNet (3D) [krizhevsky2012imagenet]

NIA: natural image augmentation. MIA: medical image augmentation. LE: linear evaluation. FT: fine-tuning.

Table 3: Performance comparison of MVCNet using 10% data for fine-tuning on TianChi for multi-diseases diagnosis.

Figure 3: Results of fine-tuning with different numbers of samples.

2.3 Fine-tuning with Different Percentages of Samples

We also investigate how the performance of MVCNet depends on the number of fine-tuning samples on the three datasets and summarize the results in Figure 3. We consider 1%, 5%, 10%, 25%, 50%, 75% and 100% of the fine-tuning samples. Overall, the accuracy increases as the number of samples increases. When the sample size is very small (e.g., 1%), the mean accuracy of MVCNet is poor, especially for LIDC-IDRI (accuracy = 54.31%). The main reason is that there are not enough samples to learn the inter-sample differences. However, the accuracy already reaches a rather high level when using 10% of the samples.

Figure 4: The visualization of the learned representations of the MVCNet on three datasets by t-SNE. We opt for simplicity and show the representations of three views (view1, view2, and view3 in Figure 5), and randomly sample 30 examples. Each shape represents a view, and different colors represent different lesions. The paired views are fed into MVCNet to obtain the representations, followed by t-SNE to reduce the representation dimension to 2. The closer the distance between the three views of the same lesion, the better the spatial aggregation of the learned representations. Best viewed electronically.

2.4 Representation Visualizations

To visually demonstrate the effect of representation learning, we visualize the representations learned from the three datasets in Figure 4. For clarity, we randomly sample 30 lesions and three views from each dataset. These views are fed into the network to yield the representations, which are then reduced to two dimensions by t-SNE [van2008visualizing]. In Figure 4, each shape represents a view, and different colors represent different lesions. The closer the views of the same lesion are to each other, the better the within-lesion compactness. Figure 4 shows that MVCNet minimizes the distance among views of the same lesion. At the same time, the representations of different lesions do not collapse together. All three datasets show the same property.

3 Discussion

Learning view-invariant representations for 3D lesions in CT is fundamental in many applications since the local structures of tissues occur at arbitrary views (see Supplementary Figure S2, Figure S3 and Figure S4). MVCNet aggregates multiple 2D views of the same 3D lesion so that the model makes view-invariant predictions. From the perspective of the training scheme, our view-based method is quite different from transformation-based works. Transformation-based methods can learn useful semantic representations via virtual transformations. However, the transformations are designed empirically and are not universal for all data. For example, color transformation can hurt the fine-grained classification of birds, and rotation is not useful for coarse-grained classification [xiao2020should]. The proposed framework has two advantages. First, it is transformation-free, so no time and effort need to be spent on designing and validating transformations, and the risk that inappropriate transformations degrade the performance of the model is avoided. Second, multi-view learning fits CT data naturally, given that CT data is itself three-dimensional.

The number of views used for training is an important variable that is associated with the diagnostic accuracy. To understand this, Figure S1 depicts the accuracy on LIDC-IDRI with respect to the number of views. We find that the diagnostic accuracy increases as the number of views increases. This result supports our motivation that multiple views can effectively improve the diagnostic performance for 3D lesions. Essentially, a lesion (e.g., a lung nodule) is a 3D object and has different characteristics in different views. Therefore, the challenge in learning 3D lesion representations from multiple views is twofold: maximizing the commonality of all views while still preserving the differences between the views. We address the first challenge with contrastive learning: the contrastive loss maximizes the similarity between the representations of different views of the same lesion. For the second, modeling each view separately ensures that the views are treated fairly and that as many characteristics of each view as possible are learned. Therefore, the proposed MVCNet can achieve the goal of learning 3D representations of the lesion.

The proposed MVCNet is a novel multi-view contrastive learning framework for learning representations from volumetric CT data. Unlike existing SSL methods, MVCNet works in a transformation-free manner and no explicit pretext task is needed. Our approach is evaluated on three CT datasets for disease diagnosis. Experimental results show that our method outperforms state-of-the-art SSL methods, indicating the superiority of MVCNet in unsupervised representation learning. Being fine-tuned with a small percentage of the datasets (10%), our model is comparable to the 3D fully supervised model, demonstrating its superiority in small-data scenarios and the potential of reducing the annotation efforts in 3D CAD systems.

The main limitation of our work is that MVCNet is designed for the lesion identification task and is not optimized for the localization task on whole CT volumes. Lesion identification is similar to object recognition in computer vision in that both are foreground-dominant. The effectiveness of multi-view methods in lesion localization (which is often background-dominant) remains to be explored. Our future work is twofold. First, in theory, we can extend MVCNet to any number of views from unlimited orientations. We will investigate the number of views and the correlations between views. Second, we will adapt MVCNet to more tasks (lesion localization and segmentation) and evaluate it on more imaging modalities such as magnetic resonance imaging.

4 Materials and Methods

4.1 LIDC-IDRI

The LIDC-IDRI is a commonly used CT dataset. It contains 1,018 cases used for lung nodule diagnosis [armato2011lung]. The malignancy of each nodule was evaluated by up to four experienced radiologists in two stages with a 5-point scale from benign to malignant. All suspicious lesions are categorized into three types according to the diameter in long axis: non-nodule , nodule , and nodule . We only consider nodules in diameter, since nodules were not considered to be clinically relevant. The slice thickness of CT scans ranges from to with a median of . In this study, scans with slice thickness thicker than were eliminated, as this could easily lead to the omission of small nodules [setio2016pulmonary, naidich2013recommendations]. Following the procedures used in previous studies [xie2018knowledge], we calculate the mean malignancy degree () of a nodule which was annotated by at least three radiologists, and annotate a nodule whose as benign, a nodule whose as uncertain and a nodule whose as malignant. In total, there are 369 benign, 405 uncertain and 335 malignant nodules. To reduce the impact of uncertain evaluation, we exclude all uncertain lung nodules from the dataset.
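The labeling rule described above (average the radiologists' 5-point ratings, then bin the mean into benign, uncertain, or malignant) can be sketched as follows. The cut-off values `low` and `high` are hypothetical placeholders chosen only so that the example runs; they are not the thresholds used in the paper. The function name `label_nodule` is likewise illustrative.

```python
from statistics import mean


def label_nodule(scores, low=3.0, high=3.0, min_readers=3):
    """Bin a nodule by the mean of its 1-5 malignancy ratings.

    `low` and `high` are HYPOTHETICAL thresholds, not the paper's values.
    Returns None when fewer than `min_readers` radiologists rated the nodule.
    """
    if len(scores) < min_readers:
        return None  # not enough annotations to compute a reliable mean
    m = mean(scores)
    if m < low:
        return "benign"
    if m > high:
        return "malignant"
    return "uncertain"


# Uncertain nodules are excluded before training, as described in the text.
ratings = [[1, 2, 2], [4, 5, 4, 5], [3, 3, 3], [5, 5]]
kept = [r for r in ratings if label_nodule(r) in ("benign", "malignant")]
```

For LNDb, where few nodules have three or more readers, the same function could be called with `min_readers=1`.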

4.2 LNDb

The LNDb dataset contains a total of 294 CT scans [pedrosa2019lndb]. The annotation method is adapted from that of LIDC-IDRI. Each CT scan is read by at least one radiologist to evaluate the nodules with a 5-point scale from benign to malignant. Therefore, we process the LNDb dataset in the same way as LIDC-IDRI. However, few nodules are annotated by at least three radiologists, so we calculate the mean malignancy degree () of all annotated nodules. Finally, there are 451 benign and 768 malignant nodules after removing the uncertain nodules.

4.3 TianChi

The TianChi lung multi-disease diagnosis dataset contains a total of 1,470 CT scans with four diseases. In the annotation file, radiologists record the center coordinates, size (consisting of three diameters) and disease category of each lesion. The annotation process is divided into two stages: two physicians performed the original annotation, then a third independent physician resolved disagreements to ensure the consistency of the annotations. We use all the annotated lesions in the dataset. There are 3,264 nodules, 3,613 streak shadows, 4,201 arteriosclerosis or calcification lesions, and 1,140 lymph node calcifications.

Figure 5: Illustration of the training stage of the MVCNet. For simplicity, we show two 3D lesions. Each is fed into a set of view filters with nine orientations to generate multiple 2D views. Following [tian2020contrastive], each view has a private encoder and projector that generate representations in an embedding space. Views of the same lesion attract each other, and views of different lesions repel each other. Optimized by a contrastive loss, the shared embedding space acquires good local-aggregating properties while the spread-out properties of the representations are preserved. Note that each view model is shared across lesions.

4.4 Implementation Details

We extract 2D views of lesion volumes from different orientations, and resize each view to . We use a tiny AlexNet [krizhevsky2012imagenet] without fully connected layers as our base encoder. The model contains three convolutional layers, and the numbers of output channels are 48, 192, and 128, respectively. For the projector, we use a light neural network that maps representations to the embedding space where the contrastive loss is applied. The projector includes three fully connected layers, and the first two are followed by a batch normalization layer and activated by ReLU [nair2010Rectified]. The output dimensions of the layers are 2,048, 2,048 and 128. After the projector, we apply an $\ell_2$-normalization. Stochastic gradient descent with a base learning rate of 0.1 is used to optimize the encoder and projector. The model is trained for 240 epochs, and the learning rate is multiplied by 0.1 at epochs 120, 160, and 200. We use a batch size of 64, and the weight decay rate is . The temperature is set to 0.07, following previous works [tian2020contrastive, wu2018unsupervised]. Most of the details are based on the Contrastive Multiview Coding (CMC) implementation [tian2020contrastive]. All the experiments are conducted with PyTorch using a single Tesla V100 32GB GPU.
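The step schedule described above (base rate 0.1, decayed by a factor of 0.1 at epochs 120, 160, and 200 over 240 epochs) can be written as a small function. This is a minimal sketch; it assumes that epochs are counted from 0 and that the decay takes effect from the milestone epoch onward, which is a convention chosen here rather than a detail stated in the text.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(120, 160, 200), gamma=0.1):
    """Step decay: multiply base_lr by gamma once per milestone already reached.

    Assumption: epochs are 0-indexed and the decay applies from the
    milestone epoch itself.
    """
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** reached


# Learning rate at the start of each phase of the 240-epoch schedule.
schedule = {e: learning_rate(e) for e in (0, 120, 160, 200)}
```

In PyTorch, the same behavior is usually obtained with `torch.optim.lr_scheduler.MultiStepLR`, given the same milestones and `gamma`.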

We show the flowchart of the proposed MVCNet in Figure 5. Note that we focus on the lesion diagnosis task and assume that the detection of suspected lesions has been completed. Nine views are extracted from each lesion volume from different orientations. Next, we construct a convolutional neural network to learn representations by minimizing a contrastive loss. Finally, we evaluate the representations by 1) linear evaluation, in which a classification head is trained on top of MVCNet with fixed parameters, and 2) fine-tuning the model with a small fraction of annotated data.

4.5 View Extraction

Before extracting lesion volumes, we process all datasets with the same procedure, since they all consist of lung CT scans. Following previous studies, we truncate the range of Hounsfield unit (HU) values to and then scale it to to reduce the influence of other organs [setio2016pulmonary, xie2018knowledge]. Then, we resample all CT scans to a pixel resolution of , which corresponds to the most common resolution of CT scans [xie2018knowledge, shen2017multi].
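The intensity preprocessing step (truncate HU values to a window, then rescale linearly) can be sketched as below. The window limits `hu_min` and `hu_max` and the target range of 0 to 1 are hypothetical defaults used only to make the example concrete; the values used in the paper are not reproduced here. The function name `normalize_hu` is also illustrative.

```python
def normalize_hu(values, hu_min=-1000.0, hu_max=400.0):
    """Clip HU values to [hu_min, hu_max] and scale linearly to [0, 1].

    hu_min, hu_max and the [0, 1] target range are HYPOTHETICAL choices.
    """
    span = hu_max - hu_min
    return [(min(max(v, hu_min), hu_max) - hu_min) / span for v in values]


# Values outside the window saturate at 0 or 1.
scaled = normalize_hu([-2000, -1000, -300, 400, 3000])
```

Clipping before scaling keeps very dense structures (bone, metal) and air outside the body from compressing the dynamic range available to lung tissue.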

After preprocessing, we extract the lesion volumes according to the lesion diameter. The LIDC-IDRI and LNDb are used to diagnose lung nodules. We set the size of each lesion volume to since the diameter of nodules is generally between and [xie2018knowledge]. For the TianChi dataset, there are four different types of diseases, and we calculate the lesion size based on the longest diameter of each lesion. Assuming that the longest diameter of the lesion is , the size of the lesion volume is . Finally, to avoid the empirical design of transformations and retain the 3D characteristics of lesions, we introduce a new approach to extract nine views of a lesion volume from different orientations [setio2016pulmonary]. We show the nine orientations as view filters at the bottom of Figure 5. Supplementary Figure S2, Figure S3 and Figure S4 present some examples of 2D views for 3D lesions.
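One common way to obtain nine 2D views of a cubic volume, used for example in [setio2016pulmonary], is to slice it along the nine planes of symmetry of a cube: three axis-aligned central planes and six diagonal planes. The sketch below assumes this scheme and a cubic nested-list volume indexed as `vol[z][y][x]`; the exact orientations used by MVCNet are those shown in Figure 5 and may differ. The function name `extract_nine_views` is illustrative.

```python
def extract_nine_views(vol):
    """Slice a cubic volume vol[z][y][x] of side S along nine symmetry planes.

    Assumption: the nine views are the three central axis-aligned planes
    plus the six diagonal planes of the cube. Returns nine S x S lists.
    """
    s = len(vol)
    c = s // 2  # index of the central slice
    r = range(s)
    return [
        [[vol[c][y][x] for x in r] for y in r],          # axial: z fixed
        [[vol[z][c][x] for x in r] for z in r],          # coronal: y fixed
        [[vol[z][y][c] for y in r] for z in r],          # sagittal: x fixed
        [[vol[z][i][i] for i in r] for z in r],          # plane y = x
        [[vol[z][i][s - 1 - i] for i in r] for z in r],  # plane y + x = S - 1
        [[vol[i][y][i] for i in r] for y in r],          # plane z = x
        [[vol[i][y][s - 1 - i] for i in r] for y in r],  # plane z + x = S - 1
        [[vol[i][i][x] for x in r] for i in r],          # plane z = y
        [[vol[i][s - 1 - i][x] for x in r] for i in r],  # plane z + y = S - 1
    ]


# Toy 3 x 3 x 3 volume whose voxel value encodes its coordinates as 100z + 10y + x.
toy = [[[100 * z + 10 * y + x for x in range(3)] for y in range(3)] for z in range(3)]
views = extract_nine_views(toy)
```

For odd S, all nine planes pass through the central voxel, so every view shares the lesion center. The diagonal slices are taken without resampling, so their in-plane spacing along the diagonal is larger by a factor of √2; resizing each view to a common size, as described in Section 4.4, hides this difference.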

4.6 Contrastive Loss

Contrastive learning aims at constructing a latent embedding space for separating samples from different clusters in an unsupervised manner. Like most of the existing works [tian2020contrastive, chen2020simple, he2020momentum], we use a contrastive loss to enhance the similarity within lesions and separability between lesions.

Following the CMC [tian2020contrastive], we build a multi-view contrastive learning model to learn 3D representation through 2D views. In this way, we can maximize the mutual information and preserve the differences among the views. We consider the 3D characteristics of lesions by combining the representations of all the 2D views for downstream tasks.

The model is composed of an encoder $f(\cdot)$ and a projector $g(\cdot)$. We illustrate the process with two views (represented by $v^1$ and $v^2$, e.g., view1 and view2 in Figure 5). Following previous works [chen2020simple, tian2020contrastive], we learn a representation vector $h$ from each view with the encoder, i.e., $h^1 = f^1(v^1)$ and $h^2 = f^2(v^2)$, where the superscript indexes the view because each view has its own encoder and projector (Figure 5). Then, we map each representation to a projection with the projector, i.e., $z^1 = g^1(h^1)$ and $z^2 = g^2(h^2)$. We adopt a tiny AlexNet [krizhevsky2012imagenet] as our encoder, and $h$ is the output after the last convolutional layer. The projector is a multi-layer perceptron (MLP) with three hidden layers and an $\ell_2$-normalization. The MLP can be seen as the fully connected layers of AlexNet. The model is initialized randomly. As in most SSL methods [chen2020simple, he2020momentum, grill2020bootstrap], the role of the projector is to eliminate the semantically irrelevant low-level information in the representation. After pretraining, the projector is discarded and the representations of the encoder are used for downstream tasks.

We randomly sample a batch of $N$ lesions and define the contrastive task within the batch, so the number of views in each batch is $2N$ in the two-view case. We denote a given batch as $\{(v_i^1, v_i^2)\}_{i=1}^{N}$, where $v_i^1$ and $v_i^2$ are the two views of the $i$-th lesion. Then, the views of the same lesion form a positive pair (e.g., $(v_i^1, v_i^2)$), and we treat the remaining views of different lesions as negative samples, which is consistent with previous studies [tian2020contrastive, chen2020simple].

We exploit a contrastive loss to achieve high similarity for positive pairs and low similarity for negative pairs. We use the cosine similarity [tian2020contrastive] as the metric to evaluate the similarity of views. Given the normalized projections of two different views, we define the objective function for a positive pair as


where $\tau$ denotes a temperature parameter that adjusts the dynamic range of the loss, and the indicator function determines whether two views belong to the same lesion: it equals 1 if they are extracted from the same lesion and 0 otherwise. This loss is computed across all positive pairs in a batch, treating the first view as the anchor and enumerating over the second view. Symmetrically, we obtain a second loss by anchoring the second view. The final loss is the sum of the two:


where represents all views extracted from the same lesion, and .
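The two-view objective described above can be illustrated with a small in-batch implementation. This is a minimal sketch that assumes an InfoNCE-style loss in the spirit of CMC [tian2020contrastive]: for each anchor projection, the matching view of the same lesion is the positive and the views of the other lesions in the batch are the negatives. The function names and the averaging over the batch are choices made here for illustration, not details confirmed by the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anchored_loss(anchor_view, other_view, tau=0.07):
    """Mean over lesions of -log softmax of the positive similarity.

    anchor_view[i] and other_view[i] are projections of two views of
    lesion i; other_view[k] with k != i act as negatives (assumed form).
    """
    n = len(anchor_view)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchor_view[i], other_view[k]) / tau for k in range(n)]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / n


def two_view_loss(z1, z2, tau=0.07):
    """Symmetric loss: anchor on view 1, then on view 2, and add."""
    return anchored_loss(z1, z2, tau) + anchored_loss(z2, z1, tau)


# Two lesions whose views agree give a loss near zero; swapping the
# second view between lesions makes every positive pair dissimilar.
z1 = [[1.0, 0.0], [0.0, 1.0]]
aligned = two_view_loss(z1, [[1.0, 0.0], [0.0, 1.0]])
swapped = two_view_loss(z1, [[0.0, 1.0], [1.0, 0.0]])
```

With the temperature of 0.07 used in the paper, a similarity gap of 1 between the positive and a negative already corresponds to a logit gap of about 14, which is why the aligned case is so close to zero.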

MVCNet is a general framework that can be applied to different numbers of views. In this paper, we limit the maximum number of views to nine. Suppose we have a collection of $M$ views of each lesion in a batch. We treat the view that we want to optimize as the anchor and form positive pairs between the anchor and each of the other views. According to Equation 1, we compute the loss by summing over all these positive pairs:


where the left-hand side is the loss between the anchor and the other views. Since all views act as the anchor in turn, the objective function is formulated as:


where the left-hand side is the final loss when only the $i$-th lesion in the batch is considered. Taking a step further, we formulate the objective function over the batch as:


We optimize all the encoders and projectors with the objective function . From Equation 5 . Empirically, the more views we consider, the heavier the computational cost.
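The chain of objectives described in this subsection can be written out compactly. The following is a sketch in assumed notation ($z_i^{m}$ is the normalized projection of view $m$ of lesion $i$, $N$ is the number of lesions in a batch, $M$ is the number of views, and $\tau$ is the temperature), following the in-batch form of CMC [tian2020contrastive]. It may differ in detail from the authors' exact formulation, for instance in how the indicator function and the normalization over the batch are written.

```latex
% Assumed notation; a sketch, not the authors' exact equations.
\begin{align}
s(a, b) &= \frac{a^{\top} b}{\lVert a \rVert \, \lVert b \rVert}
  && \text{cosine similarity} \\
\ell_i^{m \to n} &= -\log
  \frac{\exp\!\big(s(z_i^{m}, z_i^{n})/\tau\big)}
       {\sum_{k=1}^{N} \exp\!\big(s(z_i^{m}, z_k^{n})/\tau\big)}
  && \text{view } m \text{ anchors view } n \\
\ell_i^{(m,n)} &= \ell_i^{m \to n} + \ell_i^{n \to m}
  && \text{symmetric two-view loss} \\
\mathcal{L}_i^{m} &= \sum_{n \neq m} \ell_i^{m \to n}
  && \text{anchor } m \text{ against all other views} \\
\mathcal{L}_i &= \sum_{m=1}^{M} \mathcal{L}_i^{m}
  && \text{every view acts as anchor in turn} \\
\mathcal{L} &= \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i
  && \text{objective over the batch}
\end{align}
```

Under this form, each lesion contributes $M(M-1)$ anchored terms, so the number of pairwise losses grows quadratically with the number of views, which is consistent with the observation that more views mean a heavier computational cost.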

Figure 6: The inference stage of MVCNet. We extract representations of each view and concatenate them together. Then we feed the concatenated representation to a simple task head, computing the disease probability with a softmax function.

4.7 Target Task

The goal of the target task is to evaluate the quality of the representations learned by the MVCNet. Inspired by [tian2020contrastive, chen2020simple], we evaluate the representation using the diagnostic accuracy by adding a simple classification head (see Figure 6). It only contains a fully connected layer trained from scratch. The input of the head is the concatenated representations from different views. It is noteworthy that we extract representations from the encoder rather than the projector . After pretraining, the projector is discarded. Then, we feed the concatenated representations to the head. The dimension of the input representation is subject to the number of views. Namely, the input dimension is , where represents the number of views, and the dimension from each view is . Finally, we get the disease probability with a softmax function.
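The classification head described above (concatenate the per-view encoder representations, apply one fully connected layer, then a softmax) can be sketched in a few lines. The function names, the toy dimensions, and the hand-set weights below are illustrative assumptions; in practice the weights are learned and the per-view dimension is fixed by the encoder.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(view_representations, weights, bias):
    """Single fully connected layer on concatenated view representations.

    view_representations: M vectors of length d (one per view).
    weights: one row of length M * d per class; bias: one value per class.
    """
    x = [value for rep in view_representations for value in rep]  # concatenate
    if any(len(row) != len(x) for row in weights):
        raise ValueError("each weight row must have length M * d")
    logits = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy example: M = 2 views, d = 2 features per view, 2 classes.
reps = [[1.0, 0.0], [0.0, 2.0]]
probs = classify(reps, weights=[[1, 0, 0, 0], [0, 0, 0, 1]], bias=[0.0, 0.0])
```

Because the head input is the concatenation of all views, its input dimension scales linearly with the number of views, as noted in the text.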

Conflicts of Interest

The authors declare no competing interests.


This work was supported by the National Natural Science Foundation of China (62020106015), the Strategic Priority Research Program of CAS (XDB32040000), the Zhejiang Provincial Natural Science Foundation of China (LQ20F030013), the Ningbo Public Service Technology Foundation, China (202002N3181), and the Ningbo Natural Science Foundation, China (202003N4270). The authors thank the doctors at the Department of Radiology, HwaMei Hospital, University of Chinese Academy of Sciences, for their comments on learning representations with limited annotated data.

Author Contributions

J. Li conceived the idea and designed the experiments. P. Zhai conducted the experiments. H. Cong, G. Zhao, C. Fang, T. Cai and H. He contributed equally to this work.

Supplementary Materials

Table S1: Several typical self-supervised learning approaches proposed to process natural and medical images. Table S2: The augmentations of natural images and medical imaging. Table S3-S4: Comparison of MVCNet with state-of-the-art SSL methods in LNDb and TianChi. Figure S1: The accuracy with respect to view numbers on LIDC-IDRI. Figure S2-S4: Visualization of nine views of malignant and benign nodules on LIDC-IDRI, LNDb and TianChi.