Fibro-CoSANet: Pulmonary Fibrosis Prognosis Prediction using a Convolutional Self Attention Network

04/13/2021 ∙ by Zabir Al Nazi, et al. ∙ 16

Idiopathic pulmonary fibrosis (IPF) is a restrictive interstitial lung disease that causes lung function decline by lung tissue scarring. Although lung function decline is assessed by the forced vital capacity (FVC), determining the accurate progression of IPF remains a challenge. To address this challenge, we proposed Fibro-CoSANet, a novel end-to-end multi-modal learning-based approach, to predict the FVC decline. Fibro-CoSANet utilized CT images and demographic information in convolutional neural network frameworks with a stacked attention layer. Extensive experiments on the OSIC Pulmonary Fibrosis Progression Dataset demonstrated the superiority of our proposed Fibro-CoSANet by achieving the new state-of-the-art modified Laplace Log-Likelihood score of -6.68. This network may benefit research areas concerned with designing networks to improve the prognostic accuracy of IPF. The source-code for Fibro-CoSANet is available at: <>.




1 Introduction

Idiopathic pulmonary fibrosis (IPF) is a chronic lung disease caused by the formation of scar tissue within the lungs [spagnolo_idiopathic_2015]. IPF leads to a gradual, irreversible deterioration of lung function as healthy lung tissue is replaced by scar tissue over time. The disease can shift from long-term stability to rapid deterioration, which results in complete pulmonary dysfunction [raghu_diagnosis_2018]. Because the speed of deterioration varies widely between patients, management of pulmonary fibrosis relies on tracking the decline in lung function. Therefore, an accurate estimate of the lung function decline would lead to better management of IPF.


The current guideline for IPF diagnosis follows several procedures, such as surgical or transbronchial lung biopsy [raghu_diagnosis_2018]. After the diagnosis, physicians often assess the decline of lung function by measuring the forced vital capacity (FVC) with spirometry tests to monitor the prognosis of IPF. FVC measures the total amount of air exhaled after breathing in as deeply as possible [zappala2010marginal]. To assess lung function, observing the FVC at intervals of six to twelve months is recommended [raghu_diagnosis_2018]. While the FVC provides a general understanding of the prognosis of IPF [flaherty_idiopathic_2006], there are no widely used techniques to estimate IPF progression. Given the heterogeneous course of IPF, imaging modalities may provide valuable additional information regarding IPF prognosis.

Computed tomography (CT) images of the chest can be effectively used to assess the lung function decline from pulmonary fibrosis, as CT scans contain several visual signs essential for assessment by radiologists. Shi et al. [shi_prediction_2019] developed a voxel-wise radiological model using high-resolution CT scans and achieved accuracy in predicting the progression of IPF. Furthermore, Salisbury et al. [salisbury_idiopathic_2016] utilized CT scans of IPF patients to predict survival and FVC decline over 12 months, with a significant correlation value of 0.6 between visual and predicted measurements. These studies have demonstrated the effectiveness of CT imaging as an important modality for predicting the progression of pulmonary fibrosis. However, precisely predicting the progression of IPF from CT images remains challenging due to the high variability.

The recent advancements of artificial intelligence (e.g., convolutional neural networks (CNNs) [he2016deep]) and the Kaggle: OSIC Pulmonary Fibrosis Progression Challenge [osic_kaggle] have inspired the development of CT image-based machine learning systems that support computer-aided clinical decisions for IPF prognosis. In particular, Wong et al. [wong_fibrosis-net_2021] recently proposed Fibrosis-Net, based on deep CNNs, for predicting pulmonary fibrosis progression from chest CT images. Fibrosis-Net utilized the chest CT scans of a patient along with spirometry measurements and clinical metadata to predict the FVC of a patient at a specific time-point in the future [wong_fibrosis-net_2021]. While the existing CNN-based approaches have a higher capacity to predict pulmonary fibrosis progression from chest CT images, we believe there is still room for improvement in terms of overall correctness. In this work, we argue that convolutional features extracted from chest CT scans, even when combined with a patient's clinical or demographic features, are not discriminative enough to correctly predict the FVC of a patient in cases where the network needs to focus on a specific region of the lung. To address this issue, we proposed a simple and efficient end-to-end multi-modal network, termed Fibro-CoSANet, which utilized both chest CT scan images and demographic information, such as sex, age, and smoking history, to predict the FVC of a patient at a specific time-point. Our proposed Fibro-CoSANet used a convolutional self-attention network that extracted features from a randomly selected CT image, which were then merged with the normalized demographic features. The merged features were passed through a one-layer perceptron to obtain the predicted FVC.

While Fibrosis-Net [wong_fibrosis-net_2021] utilized multiple CT slices to generate convolutional features, we introduced an efficient formulation of the IPF prognosis task in which we randomly selected a single CT image from the multiple scans to extract convolutional features. To retain volumetric information, we used the approximated lung volume computed from all the available scans as a shallow feature, which was merged with the convolutional features. In addition, we predicted the slope of FVC based on a linear prior assumption to reduce the computational overhead, while Fibrosis-Net [wong_fibrosis-net_2021] used an elastic net to obtain the local FVCs.

We summarize our main contributions as follows:

  • To the best of our knowledge, this is the first study that proposed a simple and efficient end-to-end multi-modal based convolutional self-attention network to predict the progress of IPF by utilizing the deeper CT and shallower demographic features.

  • We introduced an intuitive and efficient way to apply a stacked self-attention layer on top of extracted convolutional CT features for further refinement and the advantages of this module are demonstrated with extensive experiments.

  • We further introduced a unique formulation for FVC measurement of a patient where the goal of the proposed network was to predict the slope of the FVC trend.

  • We showed, through extensive quantitative experiments under different settings, that our proposed Fibro-CoSANet achieved a better (higher) modified Laplace Log-Likelihood score than existing works on the publicly available Kaggle: OSIC Pulmonary Fibrosis Progression dataset.

2 Data Preparation

In this section, we discuss the preprocessing steps conducted to prepare the training inputs and labels. We used the recently introduced Kaggle: OSIC Pulmonary Fibrosis Progression Challenge Benchmark Dataset, which consists of CT scans, FVC measurements, and associated demographic features, such as age, sex, and smoking status [osic_kaggle]. As the main goal of our method was to predict the slope of the FVC trend of a patient, we first calculated the initial slope of the FVC values using singular value decomposition (SVD) [golub_singular_1971]; these slopes were used as pseudo-labels in our proposed model (see Fig. 1 (3)). Then, we estimated the lung volume from the CT scans (Fig. 1 (1)), followed by the extraction and normalization of demographics, such as age, sex, and smoking status (Fig. 1 (2)). Note that we trained our proposed Fibro-CoSANet using a random CT image, estimated lung volume, age, sex, and smoking status.

2.1 Forced Vital Capacity (FVC) Formulation

We introduced a unique formulation to predict the slope of the FVCs by using the calculated initial slope of FVC values as ground-truth. First, we pre-processed the CT scans, $\{C_i\}_{i=1}^{N}$, where $N$ refers to the number of patients. For each CT scan, $C_i$, we randomly selected a slice, $c_i^k$, from $C_i$ for extracting features, where $K_i$ is the number of slices in $C_i$ and $k \in \{1, \dots, K_i\}$ is the index of the selected slice. The selected slice, $c_i^k$, was fed to a self-attention driven CNN model to extract pulmonary CT features. In parallel, we generated the demographic and volumetric feature sets, $D_i$, where $D_i = \{\text{age}, \text{sex}, \text{smoking status}, \text{lung volume}\}$. Finally, these two sets of features were concatenated to predict the slope of FVC, $s_i$, based on a linear prior assumption. Each patient had FVC values, $\{F_{i,j}\}_{j=1}^{n}$, measured at weeks $\{w_j\}_{j=1}^{n}$, where $n$ and $w_j$ refer to the number of FVC values and the corresponding weeks, respectively. We can formalize the FVC value of patient $i$ in week $w_j$ as follows:

$$F_{i,j} = b_i + s_i\, w_j \qquad (1)$$

Figure 1: Illustration of the process of dataset preparation ((1), (2)) and ground-truth generation ((3)) for training Fibro-CoSANet. We randomly selected a CT slice and passed it through the Fibro-CoSANet. We further normalized the demographic features and estimated the lung volume to fuse with the convolutional CT features. We calculated the slope of FVC using the SVD least squares method explained in Sec. 2.1, which is used as ground-truth to measure the correctness of the predicted FVC slope.

where $b_i$ is the base FVC and $s_i$ is the slope of patient $i$. We can further extend Eq. 1 by expanding FVC along the weeks, up to $w_n$, as follows:

$$F_{i,1} = b_i + s_i\, w_1, \quad F_{i,2} = b_i + s_i\, w_2, \quad \dots, \quad F_{i,n} = b_i + s_i\, w_n \qquad (2)$$

For ease of presentation, we vectorized Eq. 2 as follows:

$$\mathbf{y} = A\,\mathbf{x}, \quad \mathbf{y} = \begin{bmatrix} F_{i,1} \\ \vdots \\ F_{i,n} \end{bmatrix}, \quad A = \begin{bmatrix} w_1 & 1 \\ \vdots & \vdots \\ w_n & 1 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} s_i \\ b_i \end{bmatrix} \qquad (3)$$

Next, we decomposed the matrix $A$ into singular value form as:

$$A = U \Sigma V^{\top} \qquad (4)$$

where $U$ and $\Sigma$ refer to an $n \times n$ orthogonal matrix and an $n \times 2$ diagonal matrix, respectively. $V$ is a $2 \times 2$ orthogonal matrix. We replaced $A$ in Eq. 3 with Eq. 4 to achieve our desired least square solution [golub_singular_1971]. The replacement operations can be formalized and solved using singular value decomposition [golub_singular_1971] as follows:

$$\hat{\mathbf{x}} = A^{+}\mathbf{y} = V \Sigma^{+} U^{\top} \mathbf{y} \qquad (5)$$

where $A^{+}$ is the Moore–Penrose inverse of the matrix $A$ and $\hat{\mathbf{x}}$ minimizes our desired least-squares objective, $\lVert A\mathbf{x} - \mathbf{y} \rVert_2^2$. Thus, we calculated the slope, $s_i$, for a patient $i$, and used it as a pseudo ground-truth slope to train our network.
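The pseudo-label computation described in this subsection can be sketched in NumPy. This is a minimal illustration, not the authors' code: the function name `fit_fvc_line`, the column order of the design matrix, and the sample measurements are assumptions made for the example.

```python
import numpy as np

def fit_fvc_line(weeks, fvc):
    """Least-squares fit of fvc ~ slope * week + intercept via the SVD pseudo-inverse."""
    weeks = np.asarray(weeks, dtype=float)
    fvc = np.asarray(fvc, dtype=float)
    # Design matrix: one row [week, 1] per FVC measurement (assumed column order).
    A = np.column_stack([weeks, np.ones_like(weeks)])
    # Thin SVD: A = U @ diag(S) @ Vt.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    # Invert only the non-negligible singular values (Moore-Penrose pseudo-inverse).
    tol = max(A.shape) * np.finfo(float).eps * S.max()
    S_inv = np.array([1.0 / s if s > tol else 0.0 for s in S])
    slope, intercept = Vt.T @ (S_inv * (U.T @ fvc))
    return float(slope), float(intercept)

# Hypothetical measurements lying exactly on FVC = 3000 - 5 * week.
weeks = [0, 4, 8, 12, 20]
fvc = [3000 - 5 * w for w in weeks]
slope, intercept = fit_fvc_line(weeks, fvc)
print(round(slope, 6), round(intercept, 6))
```

For noisy measurements the same call returns the slope of the best-fitting line in the least-squares sense, which is the quantity used as the pseudo ground-truth.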

2.2 CT Pre-processing and Lung Volume Extraction

Contrary to natural images, CT scans consist of inconsistent and high-dimensional redundant information [park2020annotated], which is computationally expensive to process and can result in poor performance. Therefore, to achieve a better signal-to-noise ratio, it is imperative to pre-process the CT scan images before feeding them to the CNN model. We applied the following pre-processing steps to resolve these issues.

Slice Selection. Each patient's CT scan contains many CT slices, which represent the depth information of the lung. To reduce the computational complexity, we selected one slice per CT scan by the following operations: (i) We first discarded a fixed portion of the slices at the start and at the end of the scan, as these slices contain minimal volume information. (ii) Then, we randomly selected one CT slice from the remaining middle slices to feed into the CNN.
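The two-step slice selection can be sketched as follows. This is only an illustration under stated assumptions: the helper name `select_middle_slice` is hypothetical, and the `trim_fraction` value of 0.2 is a placeholder, not the fraction used in the study.

```python
import random

def select_middle_slice(num_slices, trim_fraction=0.2, rng=None):
    """Return the index of a random slice after trimming both ends of the scan."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    rng = rng or random.Random()
    trim = int(num_slices * trim_fraction)   # slices dropped at each end
    lo, hi = trim, num_slices - trim         # keep indices in [lo, hi)
    if hi <= lo:
        raise ValueError("no slices left after trimming")
    return rng.randrange(lo, hi)

# Example: a scan with 100 slices; the placeholder fraction keeps indices 20..79.
idx = select_middle_slice(100, trim_fraction=0.2, rng=random.Random(0))
print(20 <= idx < 80)  # True
```

Drawing a fresh index each time a patient is sampled means that different slices of the same scan can be seen across epochs, even though only one slice is processed per step.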

Resizing and Normalizing CT Slices. We resized the randomly chosen CT slice based on the input specification of the backbone CNN model. Further, to mitigate the inconsistency in tissue intensities across different scanners and improve convergence of the model, we normalized the pixel values using, , where and for any CT slice, .

Lung Volume Estimation from the CT Slices. Along with demographics, we calculated the lung volume from the CT images for each patient. The main motivation for using the lung volume estimate was to incorporate approximate volumetric information in the feature set, as we did not use all the CT slices to extract volumetric features due to the computational complexity. Given a CT slice, $c_i^k$ (slice $k$ of patient $i$), we applied the watershed algorithm [beucher_use_1979] to extract a segmentation map, $M_i^k$, of the lung for that slice. The generated map, $M_i^k$, is a binary map with values in $\{0, 1\}$, where 1 and 0 indicate whether a pixel belongs to the lung or not. Then, we generated the segmented lung image by simply multiplying the binary segmentation map with the original CT image. The procedure for generating the lung volume, $V_i$, for patient $i$ can be formalized as follows:

$$V_i = \sum_{k=1}^{K_i} \sum_{x, y} M_i^k(x, y)\, \Delta x\, \Delta y\, \Delta z$$

where $K_i$ is the number of slices of patient $i$, $\Delta x$ and $\Delta y$ are the pixel spacing in the $x$ and $y$ directions respectively, and $\Delta z$ denotes the thickness of the slice.
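As a rough sketch of the volume computation, the snippet below counts the lung voxels in a stack of binary masks and multiplies by the voxel size. It assumes the masks have already been produced (for example by a watershed segmentation, which is not shown), that spacing and thickness are constant across the scan, and that they are given in millimetres; the function name `lung_volume_ml` is hypothetical.

```python
import numpy as np

def lung_volume_ml(masks, spacing_x_mm, spacing_y_mm, thickness_mm):
    """Approximate lung volume in millilitres from binary lung masks.

    masks: array of shape (num_slices, height, width) with 1 for lung pixels.
    """
    masks = np.asarray(masks)
    voxel_mm3 = spacing_x_mm * spacing_y_mm * thickness_mm
    n_lung_voxels = int(np.count_nonzero(masks))
    return n_lung_voxels * voxel_mm3 / 1000.0   # 1 mL = 1000 mm^3

# Synthetic example: 10 slices, each with a 32 x 32 block of lung pixels.
masks = np.zeros((10, 64, 64), dtype=np.uint8)
masks[:, 16:48, 16:48] = 1
volume = lung_volume_ml(masks, spacing_x_mm=0.5, spacing_y_mm=0.5, thickness_mm=5.0)
print(volume)  # 12.8
```

In practice the spacing and thickness would be read from the DICOM metadata of each scan rather than passed as constants.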

2.3 Extracting Demographic Features

As IPF is associated with demographics, such as baseline FVC [raghu_diagnosis_2018, ley_multidimensional_2012], age [garcia-sancho_familial_2011, ley_multidimensional_2012], gender [garcia-sancho_familial_2011], and smoking status [garcia-sancho_familial_2011], we took inspiration from [wong_fibrosis-net_2021] and incorporated these features along with the CT image to improve the performance of our proposed method. We normalized the estimated lung volume, age, sex, and smoking status features using $z = (x - \mu) / \sigma$, where $x$ is the raw numeric feature, $\mu$ is the arithmetic mean of $x$, and $\sigma$ is the standard deviation of $x$.
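The normalization of a numeric feature by its mean and standard deviation can be sketched as below. The use of the population standard deviation and the zero-variance fallback are assumptions of this example, and the sample ages are made up.

```python
import numpy as np

def zscore(values):
    """Standardize a numeric feature to zero mean and unit standard deviation."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()          # population standard deviation (assumption)
    if std == 0:
        return np.zeros_like(values)   # constant feature: nothing to scale
    return (values - mean) / std

ages = [60, 65, 70, 75, 80]     # hypothetical patient ages
ages_z = zscore(ages)
print(np.round(ages_z, 3))
```

Categorical features such as sex and smoking status would first need a numeric encoding before this step; the encoding is not specified here.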


Figure 2: Illustration of the dual-stream pipeline of our proposed Fibro-CoSANet. We passed the randomly selected CT image to a pre-trained CNN (e.g., ResNet18 [he_deep_2016]), which produced a CT feature map. Then, we integrated a stacked self-attention module on top of the last convolutional layer of the CNN to allow the network to focus on specific regions. The output feature map of the stacked self-attention module was passed through global average pooling to produce the final representation of the CT image. In parallel, we merged the demographic and estimated lung volume features with this representation, and the result was passed through a linear layer to predict the slope of FVC. We computed the loss between the predicted slope and the pseudo ground-truth slope.


3 Network Architecture of Fibro-CoSANet

We proposed a novel multi-modal convolutional self-attention network, Fibro-CoSANet, to predict the slope of FVC in an end-to-end manner. The overall pipeline of our proposed Fibro-CoSANet is illustrated in Fig. 2. Our proposed training framework consists of two key steps: (i) extraction of the deep features from the normalized CT image using a CNN with a self-attention module (Sec. 3.1), and (ii) fusion of the deep features extracted from the CNN with the shallow lung volume and demographic features, followed by a fully-connected layer which predicts the slope of the FVC (Sec. 3.2).

3.1 Convolutional Self-Attention Network

In this section, we present our proposed convolutional self-attention network for extracting features from CT scan images with the ultimate goal of predicting the slope of FVC. Our proposed deep CT feature extractor network consists of two key components, (i) a CNN-based feature extractor network and (ii) a self-attention module which further refined the convolutional features extracted from the CNN. We first discuss the convolutional feature extractor network (Sec. 3.1.1) followed by the self-attention module (Sec. 3.1.2).

3.1.1 Deep CNN for CT Feature Extraction

In recent years, CNNs have been widely adopted for processing medical images (e.g., CT scans) [sarvamangalaconvolutional]. In general, CNN-based networks in medical imaging can be characterized as generic feature extractor networks, which are termed encoder networks. The encoder network is simply a CNN that extracts features from a given input image. However, one downside of training CNNs is that it requires a huge amount of labeled training data to learn the millions of parameters involved in the network. This issue limits the adoption of CNNs for medical image-based tasks, as the majority of the datasets have a small volume of training data. To address this limitation, inspired by the existing works [sajja2019lung, wang2020classification], we fine-tuned the feature extractor CNN on CT scan images rather than training from scratch with random initialization. Let $X \in \mathbb{R}^{H \times W}$ be the input CT scan image (where $H$, $W$ are the spatial dimensions). Given the input CT image, $X$, we adopted a CNN, $\Phi$, to extract a feature representation from the last convolutional layer of the CNN. Let $F \in \mathbb{R}^{C \times h \times w}$ be the extracted feature map, which has smaller spatial dimensions than the original CT image, $X$. We used ResNet [he2016deep], ResNeXt [Xie2016], and EfficientNet [tan2019efficientnet] based architectures to build encoder networks in our study. We formalize the key operations as follows:

$$F = \Phi(X; \theta) = \theta * X$$

where $\theta$ denotes the weights of the CNN model and $*$ denotes the convolutional operation. The extracted feature map, $F$, was fed to a self-attention module which further refined the feature representations before combining them with the demographic and lung volume features.

3.1.2 Self-Attention Module

The extracted feature map, $F$, was likely to capture high-level semantics of the CT images; however, allowing the network to focus on specific regions of the CT feature map was important to accurately predict the progression of IPF. Since IPF can result in honeycomb cysts in the lungs [gruden_ct_2016], these regions in the CT image require more attention than others. To focus on these regions of interest, we took inspiration from the existing works [ramachandran_stand-alone_2019, zhang_self-attention_2019] and applied a self-attention module on top of the CNN feature extractor.

In the self-attention module, we first rearranged the extracted convolutional feature map, $F$, resulting in a feature map, $F' \in \mathbb{R}^{C \times N}$, where $C$ is the number of channels and $N$ is the product of the spatial dimensions $h$ and $w$ of $F$. Then we fed the feature map, $F'$, to the self-attention layer [zhang_self-attention_2019], and obtained an attention map, $F_{att}$, with the same dimension.

We further multiplied the output of the attention layer, $F_{att}$, by a scaling parameter, $\gamma$, and added it to the input feature map, $F'$, to obtain the self-attention map, $O$. We formalized the operations as follows:

$$O = \gamma\, F_{att} + F'$$

Note that $\gamma$ is a learnable scalar that is initialized from a uniform distribution in our work. The main advantage of learning $\gamma$ was that it enabled the network to first rely on local neighborhood cues, since this was easier, and then gradually assign more weight to non-local regions. Thus the module learned simpler tasks first to improve convergence [zhang_self-attention_2019]. Finally, $O$ was passed through an adaptive global average pooling (GAP) operation, followed by a linear layer to obtain the deep CT feature set, $f_{ct}$ (see Fig. 2). We considered $f_{ct}$ as our final feature representation extracted from the CT scan image.

We augmented the CNN with the self-attention layer to realize a richer effective receptive field and learn a better feature representation, as recent work [ramachandran_stand-alone_2019] has shown the advantages of applying a self-attention layer on top of a convolutional feature representation. Unlike the previous self-attention-based works [ramachandran_stand-alone_2019, zhang_self-attention_2019], we extended the existing idea by stacking self-attention layers just before applying the GAP operation. In our implementation, the stacking factor (SF) is treated as a hyper-parameter. Note that we placed the attention module between the last convolutional layer of the CNN and the pooling layer, as convolutions were likely to better capture the low-level features while stand-alone attention layers may integrate global information by modeling long-range pixel interactions [ramachandran_stand-alone_2019]. Furthermore, this placement reduced the computational complexity, as the attention module was applied on a relatively low-dimensional convolutional feature map.
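To make the data flow of the stacked module concrete, here is a simplified NumPy sketch of dot-product self-attention with a scaled residual connection, applied several times before global average pooling. It is not the authors' PyTorch implementation: the projection matrices, their sizes, the fixed `gamma` values, and the function names are assumptions for illustration, and the linear layer that follows pooling is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feat, w_q, w_k, w_v, gamma):
    """feat: (C, N) flattened feature map; returns gamma * attention + feat."""
    q = w_q @ feat                   # (d, N) queries
    k = w_k @ feat                   # (d, N) keys
    v = w_v @ feat                   # (C, N) values
    scores = q.T @ k                 # (N, N) similarity of every position pair
    attn = softmax(scores, axis=-1)  # each row sums to 1
    out = v @ attn.T                 # (C, N) weighted sum of values per position
    return gamma * out + feat        # scaled residual connection

def stacked_attention_features(feature_map, layers):
    """Apply the attention layers in sequence, then global average pooling."""
    c, h, w = feature_map.shape
    x = feature_map.reshape(c, h * w)
    for w_q, w_k, w_v, gamma in layers:
        x = self_attention(x, w_q, w_k, w_v, gamma)
    return x.mean(axis=1)            # (C,) pooled CT feature vector

rng = np.random.default_rng(0)
C, H, W, D, SF = 32, 7, 7, 8, 3      # hypothetical sizes; SF = stacking factor
fmap = rng.normal(size=(C, H, W))
layers = [
    (0.1 * rng.normal(size=(D, C)),  # query projection
     0.1 * rng.normal(size=(D, C)),  # key projection
     0.1 * rng.normal(size=(C, C)),  # value projection
     0.1)                            # gamma (learned in a real model)
    for _ in range(SF)
]
ct_features = stacked_attention_features(fmap, layers)
print(ct_features.shape)  # (32,)
```

With `gamma` set to zero every layer reduces to the identity, so the module starts from the plain convolutional features and departs from them only as `gamma` grows.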

Figure 3: Overview of the proposed self-attention module for Fibro-CoSANet. Note that the stacking factor is one here.

3.2 Hybrid Fusion of Convolutional and Shallow Modality Feature

Finally, we concatenated the deep CNN features, $f_{ct}$, extracted by the self-attention driven CNN backbone from the CT modality, with the shallow modality feature representation, $f_{s}$, to generate a hybrid multi-modal feature representation, $f_{h}$. The resultant feature representation, $f_{h}$, was passed to a fully connected layer to obtain the slope, $\hat{s}$, of FVC, which was used to predict the patient's progression curve. We computed FVC from the predicted slope, $\hat{s}$, along the timeline, $w$, as follows:

$$\widehat{FVC}_{w} = FVC_{0} + \hat{s}\, w$$

where $FVC_{0}$ is the baseline forced vital capacity and $w$ is the week index. (Footnote 1: The patient index is used with variables when specifying a particular subject; in the general case, the index is omitted for simplicity.)
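The fusion and the linear progression curve can be illustrated with a short sketch. Everything numeric here is made up: the feature values, the weights of the fully connected layer, and the baseline FVC are placeholders, and the week index is assumed to be measured relative to the baseline.

```python
import numpy as np

def predict_slope(ct_features, shallow_features, weight, bias):
    """Concatenate both feature sets and apply a single linear layer."""
    fused = np.concatenate([ct_features, shallow_features])
    return float(fused @ weight + bias)

def fvc_curve(baseline_fvc, slope, weeks):
    """Linear prior: FVC at week w is baseline + slope * w."""
    return baseline_fvc + slope * np.asarray(weeks, dtype=float)

ct_features = np.array([0.2, -0.1, 0.4])            # hypothetical deep CT features
shallow_features = np.array([1.0, 0.0, -0.5, 0.3])  # hypothetical normalized demographics + volume
weight = np.array([1.0, 2.0, -1.0, -3.0, 0.5, 2.0, 0.0])
bias = -1.0

slope = predict_slope(ct_features, shallow_features, weight, bias)
curve = fvc_curve(2800.0, slope, [0, 10, 20])
print(round(slope, 3), np.round(curve, 1))
```

Because the whole curve follows from one scalar, a single forward pass per patient is enough to produce predictions for every requested week.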

4 Experiments

We evaluated the effectiveness of our proposed approach for predicting the progression of pulmonary fibrosis and demonstrated the efficacy of the method under different settings. First, we showed the superiority of our proposed multi-modal learning pipeline followed by a comparison with recent approaches. Then, we evaluated our approach to generating baseline results under different backbone and metric settings to show generalizability and consistency. We further conducted an ablation study to investigate the necessity of each component of our proposed approach. Finally, we provided a comparison between different variants of our approach in terms of computational complexity, inference time, and total memory.


Dataset. In this study, we used the publicly available Kaggle: OSIC Pulmonary Fibrosis Progression challenge benchmark dataset provided by the Open Source Imaging Consortium (OSIC) [osic_kaggle]. The dataset consists of chest CT scans and associated demographics of patients diagnosed with fibrosis. It contains 176 unique patients with a total of 1576 demographic records (multiple records per patient) collected from numerous follow-up visits over the course of approximately 1-2 years. The demographics include the patient's ID, weeks, FVC, percent, age, sex, and smoking status. Note that the weeks represent the relative number of weeks before or after the baseline CT scan for each patient, and we determined the time series of the weeks of a specific patient based on the patient's ID. For each patient, CT scan images (between 10 and 180 per patient) are provided as DICOM files that contain meta-data about the patient and the scan. We used a 5-fold cross-validation scheme to validate the best-performing model. In the cross-validation setting, we ensured that no subject appeared in both the train and test splits. The test set includes a baseline CT scan with only the initial FVC measurement for each patient.
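A subject-level split of the kind described above can be sketched with the standard library alone. The generator name `patient_kfold`, the seed, and the patient identifiers are hypothetical; the point is only that folds are formed over unique patient IDs, so all records of one patient land on the same side of a split.

```python
import random

def patient_kfold(patient_ids, k=5, seed=0):
    """Yield (train_ids, test_ids) pairs with no patient shared between them."""
    ids = sorted(set(patient_ids))        # split on unique patients, not on records
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test

patients = [f"ID{n:03d}" for n in range(176)]   # placeholder IDs for 176 patients
splits = list(patient_kfold(patients, k=5))
print(len(splits), [len(test) for _, test in splits])
```

Rows of the tabular data would then be assigned to a split by looking up their patient ID in `train` or `test`.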

Evaluation Metric. We used the modified Laplace Log-Likelihood ($LLL_m$) and root mean square error (RMSE) metrics to report the performance of our proposed model. We chose $LLL_m$ to evaluate a model's confidence in its decisions, as it is designed to reflect both the accuracy and the certainty of each prediction. For each true FVC measurement, we calculated the predicted FVC and a confidence measure, and scored them as follows [osic_kaggle]:

$$\sigma_{clipped} = \max(\sigma, 70), \qquad \Delta = \min\left(\lvert FVC_{true} - FVC_{pred} \rvert, 1000\right), \qquad LLL_m = -\frac{\sqrt{2}\,\Delta}{\sigma_{clipped}} - \ln\left(\sqrt{2}\,\sigma_{clipped}\right)$$

where $\sigma$ is the standard deviation, and we thresholded the error at 1000 ml to avoid the adverse penalty due to large errors. The confidence values were clipped at 70 ml to reflect the approximate measurement uncertainty in FVC. We calculated the final score by averaging the metric across all weeks. Note that the value of the metric is always negative, and a higher value (closer to zero) is better.
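The metric is compact enough to write out directly. The sketch below follows the clipping rules stated above (confidence floor of 70 ml, error cap of 1000 ml); the function names and the sample values are chosen for this example only.

```python
import math

def laplace_log_likelihood(fvc_true, fvc_pred, sigma):
    """Modified Laplace log-likelihood for a single FVC measurement."""
    sigma_clipped = max(sigma, 70.0)                  # confidence floor of 70 ml
    delta = min(abs(fvc_true - fvc_pred), 1000.0)     # error capped at 1000 ml
    return (-math.sqrt(2.0) * delta / sigma_clipped
            - math.log(math.sqrt(2.0) * sigma_clipped))

def mean_lll(records):
    """Average the metric over (fvc_true, fvc_pred, sigma) triples."""
    return sum(laplace_log_likelihood(*r) for r in records) / len(records)

# A perfect prediction with the smallest allowed confidence gives the best possible score.
best = laplace_log_likelihood(2500.0, 2500.0, 70.0)
print(round(best, 4))  # -4.5951
```

Since even a perfect prediction scores about -4.595, reported values such as -6.68 and -6.82 should be compared as "closer to this bound is better".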

| Modality | $LLL_m$ | RMSE |
|---|---|---|
| Multi modality | -6.68 ± 0.31 | 181.5 ± 25.88 |
| CT modality | -6.69 ± 0.28 | 184.16 ± 22.84 |
| Demographics + Lung Volume | -6.75 ± 0.33 | 185.52 ± 22.89 |

Table 1: Performance comparison between different modalities. Our proposed multi-modality based Fibro-CoSANet outperforms the single modalities (e.g., CT and Demographics).

Implementation Details. We used the publicly available PyTorch framework to implement our proposed Fibro-CoSANet, and ran our experiments on a machine with an Intel(R) Xeon(R) Gold 5118 CPU, 187 GB of physical RAM, and an Nvidia Tesla V100 SXM2 (32 GB) GPU. We trained the models for 40 epochs using the Adam optimizer with a decoupled weight decay regularization of 0.01. We initialized the backbone CNN with ImageNet pre-trained weights and optimized the network to minimize the loss between the predicted slope and the pseudo ground-truth slope.

4.1 Results of Proposed Fibro-CoSANet

We first conducted an ablation study to analyze the effectiveness of the multi-modality feature fusion technique by comparing it with other available modes. To demonstrate the superiority of our proposed multi-modal training pipeline, we conducted experiments under three different modes: (i) Multi modality: convolutional features from CT images + shallow features (demographics + lung volume), (ii) CT modality: convolutional features from only the CT modality, (iii) Shallow modality: only lung volume and demographic features were used to train our model without any CNN backbone. We found that the multi-modality mode achieved better performance than the standalone CT modality and shallow modality in terms of $LLL_m$ and RMSE (Table 1). These results suggested that demographics with lung volume or CT scans independently achieve reasonable performance, while combining these two modalities improved the overall performance. Note that the experimental results reported in the following sections are based on the multi-modality setting only.

| Work | Regression type / backbone | $LLL_m$ | RMSE |
|---|---|---|---|
| Fibrosis-Net [wong_fibrosis-net_2021] | Elastic Net | -6.82 | - |
| Mandal et al. [mandal_prediction_2020] | Quantile | -6.92 | - |
| | Ridge | -6.81 | - |
| | Elastic Net | -6.72 | - |
| Fibro-CoSANet (Ours) | EfficientNet-b2 | -6.68 ± 0.31 | 181.5 ± 25.88 |
| | ResNet-50 | -6.68 ± 0.31 | 181.6 ± 22.89 |
| | EfficientNet-b3 | -6.68 ± 0.28 | 182.58 ± 24.04 |
| | EfficientNet-b1 | -6.68 ± 0.28 | 183.96 ± 22.89 |

Table 2: Performance comparison of Fibro-CoSANet with recent works in terms of $LLL_m$ and RMSE. Our proposed Fibro-CoSANet outperforms the existing state-of-the-art works on predicting the progression of pulmonary fibrosis.

4.2 Comparison with Recent Approaches

We compared the overall performance of our proposed method with recent state-of-the-art approaches which predict the progression of pulmonary fibrosis (Table 2). Mandal et al. used Multiple Quantile Regression, Ridge Regression, and Elastic Net Regression to predict the progression; of these, Elastic Net Regression achieved the best result, with an $LLL_m$ of -6.72 [mandal_prediction_2020]. Wong et al. achieved an $LLL_m$ of -6.82 [wong_fibrosis-net_2021]. Our proposed algorithm with EfficientNet-b2 performed better than [mandal_prediction_2020, wong_fibrosis-net_2021], resulting in an $LLL_m$ of -6.68 ± 0.31 and an RMSE of 181.5 ± 25.88 (Table 2). Also, the approximate complexity of our model was linear with respect to the number of patients, as we processed all the information of a patient in a single mini-batch.

| Backbone | $LLL_m$ (CV) | RMSE (CV) |
|---|---|---|
| ResNet-18 | -6.70 ± 0.29 | 183.68 ± 23.52 |
| ResNet-34 | -6.72 ± 0.28 | 185.18 ± 22.71 |
| ResNet-50 | -6.72 ± 0.27 | 186.52 ± 21.03 |
| ResNet-101 | -6.71 ± 0.25 | 188.92 ± 19.94 |
| ResNet-152 | -6.73 ± 0.28 | 186.19 ± 21.75 |
| ResNeXt-50 | -6.72 ± 0.27 | 186.39 ± 24.64 |
| ResNeXt-101 | -6.70 ± 0.26 | 184.04 ± 22.62 |
| EfficientNet-b0 | -6.70 ± 0.29 | 183.00 ± 23.60 |
| EfficientNet-b1 | -6.72 ± 0.31 | 183.22 ± 23.35 |
| EfficientNet-b2 | -6.74 ± 0.34 | 184.17 ± 22.89 |
| EfficientNet-b3 | -6.74 ± 0.34 | 184.17 ± 22.89 |
| EfficientNet-b4 | -6.70 ± 0.30 | 183.00 ± 22.42 |

Table 3: Fibro-CoSANet results under different CNN backbones. (Footnote 3: $LLL_m$ = Modified Laplace Log-Likelihood, RMSE = Root mean square error, CV = Cross-validation (5-fold).)

| Backbone | FS | SF | $LLL_m$ | RMSE |
|---|---|---|---|---|
| EfficientNet-B0 | 32 | 3 | -6.7 ± 0.29 | 183.7 ± 23.55 |
| | 32 | 5 | -6.77 ± 0.31 | 185.63 ± 21.33 |
| | 64 | 1 | -6.72 ± 0.34 | 182.13 ± 22.63 |
| | 128 | 3 | -6.73 ± 0.33 | 183.67 ± 24.56 |
| | 128 | 5 | -6.74 ± 0.36 | 183.57 ± 22.55 |
| EfficientNet-B1 | 32 | 3 | -6.68 ± 0.28 | 183.96 ± 22.89 |
| | 32 | 5 | -6.7 ± 0.28 | 185.64 ± 24.25 |
| | 64 | 1 | -6.71 ± 0.29 | 184.16 ± 24.78 |
| | 128 | 3 | -6.79 ± 0.38 | 186.31 ± 24.79 |
| | 128 | 5 | -6.69 ± 0.31 | 183.12 ± 22.05 |
| EfficientNet-B2 | 32 | 3 | -6.68 ± 0.31 | 181.5 ± 25.88 |
| | 32 | 5 | -6.69 ± 0.3 | 183.39 ± 21.98 |
| | 64 | 1 | -6.73 ± 0.28 | 184.71 ± 20.74 |
| | 128 | 3 | -6.77 ± 0.33 | 187.13 ± 21.03 |
| | 128 | 5 | -6.75 ± 0.33 | 186.03 ± 23.14 |
| EfficientNet-B3 | 32 | 3 | -6.72 ± 0.34 | 183.28 ± 22.87 |
| | 32 | 5 | -6.74 ± 0.31 | 184.68 ± 21.05 |
| | 64 | 1 | -6.71 ± 0.34 | 183.34 ± 22.57 |
| | 128 | 3 | -6.68 ± 0.28 | 182.58 ± 24.04 |
| | 128 | 5 | -6.72 ± 0.33 | 184.01 ± 24.4 |
| EfficientNet-B4 | 32 | 3 | -6.73 ± 0.3 | 184.86 ± 22.6 |
| | 32 | 5 | -6.68 ± 0.3 | 183.45 ± 23.19 |
| | 64 | 1 | -6.73 ± 0.39 | 182.29 ± 23.11 |
| | 128 | 3 | -6.74 ± 0.32 | 184.35 ± 22.04 |
| | 128 | 5 | -6.71 ± 0.28 | 184.06 ± 23.57 |
| ResNet-18 | 32 | 3 | -6.73 ± 0.32 | 184.71 ± 23.79 |
| | 32 | 5 | -6.71 ± 0.31 | 183.79 ± 21.39 |
| | 64 | 1 | -6.69 ± 0.28 | 183.84 ± 21.9 |
| | 128 | 3 | -6.72 ± 0.27 | 184.96 ± 22.54 |
| | 128 | 5 | -6.73 ± 0.35 | 185.29 ± 24.12 |
| ResNet-34 | 32 | 3 | -6.73 ± 0.31 | 184.79 ± 21.45 |
| | 32 | 5 | -6.72 ± 0.28 | 185.4 ± 21.98 |
| | 64 | 1 | -6.71 ± 0.31 | 183.3 ± 22.79 |
| | 128 | 3 | -6.7 ± 0.29 | 183.32 ± 23.83 |
| | 128 | 5 | -6.72 ± 0.28 | 184.33 ± 21.54 |
| ResNet-50 | 32 | 3 | -6.72 ± 0.31 | 184.07 ± 21.74 |
| | 32 | 5 | -6.7 ± 0.28 | 184.94 ± 22.52 |
| | 64 | 1 | -6.73 ± 0.27 | 185.46 ± 22.6 |
| | 128 | 3 | -6.68 ± 0.31 | 181.6 ± 22.89 |
| | 128 | 5 | -6.74 ± 0.34 | 185.06 ± 24.97 |
| ResNet-101 | 32 | 3 | -6.71 ± 0.3 | 183.13 ± 24.87 |
| | 32 | 5 | -6.69 ± 0.28 | 184.17 ± 23.38 |
| | 64 | 1 | -6.73 ± 0.28 | 185.33 ± 23.05 |
| | 128 | 3 | -6.72 ± 0.3 | 185.53 ± 22.72 |
| | 128 | 5 | -6.7 ± 0.3 | 183.63 ± 22.39 |
| ResNet-152 | 32 | 3 | -6.7 ± 0.29 | 183.89 ± 22.25 |
| | 32 | 5 | -6.7 ± 0.3 | 183.64 ± 23.6 |
| | 64 | 1 | -6.72 ± 0.33 | 183.05 ± 22.54 |
| | 128 | 3 | -6.73 ± 0.35 | 184.15 ± 23.15 |
| | 128 | 5 | -6.72 ± 0.33 | 182.92 ± 23.9 |
| ResNeXt-50 | 32 | 3 | -6.7 ± 0.27 | 184.75 ± 22.75 |
| | 32 | 5 | -6.71 ± 0.3 | 184.61 ± 23.26 |
| | 64 | 1 | -6.71 ± 0.32 | 183.84 ± 20.58 |
| | 128 | 3 | -6.73 ± 0.28 | 185.22 ± 22.57 |
| | 128 | 5 | -6.73 ± 0.32 | 184 ± 22.88 |
| ResNeXt-101 | 32 | 3 | -6.72 ± 0.28 | 184.01 ± 23.38 |
| | 32 | 5 | -6.7 ± 0.3 | 183.46 ± 23.37 |
| | 64 | 1 | -6.7 ± 0.31 | 182.5 ± 24.82 |
| | 128 | 3 | -6.72 ± 0.3 | 184.9 ± 22.38 |
| | 128 | 5 | -6.7 ± 0.28 | 183.57 ± 22.21 |

Table 4: Performance of Fibro-CoSANet under different backbones with respect to the attention module hyper-parameters. FS and SF refer to filter size and stacking factor, respectively. It is clear that stacking the self-attention module improves the overall performance.

4.3 Baseline Analysis of Fibro-CoSANet

We conducted an extensive experimental evaluation using widely-used CNNs, including ResNet, ResNeXt, and EfficientNet, to show the consistency and generalizability of our proposed approach. We reported experimental results under two key variants of our proposed pipeline as follows:

Baseline Model without Self-Attention Module. We implemented the base model under different network backbones without any self-attention layer. To show the consistency and generalizability of our approach, we used different CNN architectures (five ResNets, two ResNeXts, and five EfficientNets) as the feature extractor for our proposed Fibro-CoSANet. Table 3 presents the baseline results in terms of $LLL_m$ and RMSE. Interestingly, ResNets with a lighter architecture (e.g., ResNet-18, $LLL_m$: -6.70, RMSE: 183.68) and ResNeXt-101 achieved a better $LLL_m$ and a lower RMSE than the deeper ResNets. Furthermore, EfficientNet-b0 and EfficientNet-b4 achieved performance comparable to ResNet-18 ($LLL_m$: -6.70 and RMSE: 183.00). These results altogether suggested that our proposed Fibro-CoSANet was able to predict the FVC slope with various backbones. The heavier models were prone to over-fitting, as the dataset was relatively small.

| Backbone | MACs (G) | Params (M) | Infer (s) | $LLL_m$ | RMSE |
|---|---|---|---|---|---|
| EfficientNet-B0 | 0.07 | 4.05 | 0.67 | -6.7 ± 0.29 | 183.7 ± 23.55 |
| EfficientNet-B1 | 0.1 | 6.65 | 0.7 | -6.68 ± 0.28 | 183.96 ± 22.89 |
| EfficientNet-B2 | 0.1 | 7.75 | 0.73 | -6.68 ± 0.31 | 181.5 ± 25.88 |
| EfficientNet-B3 | 0.14 | 10.75 | 0.7 | -6.72 ± 0.34 | 183.28 ± 22.87 |
| EfficientNet-B4 | 0.18 | 17.61 | 0.71 | -6.73 ± 0.3 | 184.86 ± 22.6 |
| ResNet-18 | 9.11 | 11.19 | 0.67 | -6.73 ± 0.32 | 184.71 ± 23.79 |
| ResNet-34 | 18.79 | 21.3 | 0.63 | -6.73 ± 0.31 | 184.79 ± 21.45 |
| ResNet-50 | 21.13 | 23.57 | 0.69 | -6.7 ± 0.27 | 184.75 ± 22.75 |
| ResNet-101 | 40.61 | 42.56 | 0.69 | -6.71 ± 0.3 | 183.13 ± 24.87 |
| ResNet-152 | 60.1 | 58.21 | 0.7 | -6.7 ± 0.29 | 183.89 ± 22.25 |
| ResNeXt-50 | 21.92 | 23.04 | 0.68 | -6.7 ± 0.27 | 184.75 ± 22.75 |
| ResNeXt-101 | 85.84 | 86.81 | 0.7 | -6.72 ± 0.28 | 184.01 ± 23.38 |

Table 5: Comparison of different variants of our proposed Fibro-CoSANet in terms of MACs (G), parameters (M), inference time (s), $LLL_m$, and RMSE.

Baseline Model with Self-Attention Module. To further improve the overall performance, we introduced a stacked self-attention module (Sec. 3.1.2) on top of each CNN backbone (Table 4). Here, we used a fixed output channel dimension (32) for the CNN backbones with several attention filter sizes (32, 64, and 128) and several stacking factors (1, 3, and 5) to empirically identify the combination that achieved the best performance. As shown in Table 4, the overall performance improved for most of the models with the addition of the self-attention layer. For instance, EfficientNet-B1, B2, B3, B4, and ResNet-50 improved by a considerable margin, each reaching an $LLL_m$ of -6.68. EfficientNet-B2 and ResNet-50 achieved better scores than the other models in terms of RMSE (181.5 and 181.6, respectively). Comparing the results of the different variants under various design choices, EfficientNet-B2 achieved the overall best performance ($LLL_m$ of -6.68 ± 0.31 and RMSE of 181.5 ± 25.88), followed by ResNet-50 ($LLL_m$ of -6.68 ± 0.31 and RMSE of 181.6 ± 22.89). We empirically found that EfficientNet-B2 and ResNet-50 achieved their best performance with attention filter sizes of 32 and 128, respectively, and three attention layers. The ResNeXt-101 results did not improve with the addition of the self-attention layer.

4.4 Performance Analysis

We analyzed the overall performance of our proposed approach under two key aspects: (i) Efficiency and (ii) Computational complexity.
Efficiency. One important aspect of high-volume biomedical data analysis is the latency, or inference speed, of the system. Our approach used a single CT image and shallow modality features to calculate the prognosis line from a single scalar, the slope of the FVC trend. This simple linear prior assumption made training and inference much faster, which makes our pipeline efficient at handling large amounts of data. Note that the training complexity depends on the number of patients.
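To illustrate why a single predicted scalar is enough under the linear prior, the following minimal sketch extrapolates a prognosis line from one baseline measurement and one slope. The function name, the argument names, and the units (FVC in mL, time in weeks) are assumptions made for illustration and are not taken from the authors' implementation.

```python
def fvc_prognosis(base_fvc, base_week, slope, weeks):
    """Extrapolate FVC linearly from a baseline measurement.

    base_fvc:  FVC measured at `base_week` (assumed to be in mL)
    base_week: week of the baseline measurement
    slope:     predicted FVC change per week (the single scalar output)
    weeks:     iterable of weeks at which to evaluate the prognosis line
    """
    return [base_fvc + slope * (week - base_week) for week in weeks]


# Hypothetical patient: 2500 mL at week 0, predicted decline of 5 mL per week.
line = fvc_prognosis(base_fvc=2500, base_week=0, slope=-5, weeks=[0, 10, 20])
print(line)  # [2500, 2450, 2400]
```

Because the whole line follows from one multiplication and one addition per time point, inference cost is dominated by the single forward pass that produces the slope.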
Computational complexity of CNNs. Table 5 compares the different baseline models in terms of multiply-accumulate operations (MACs), the total number of parameters, and inference time. Note that we reported the best result for each CNN used in our experiments. EfficientNet-B0 had the lowest computational complexity (0.07 GMacs, 4.05 million parameters, 0.67 s inference) among the CNN backbones; however, it failed to achieve the best performance. This could be because the EfficientNet-B0 architecture is relatively lightweight compared to the ResNets. As EfficientNet-B2 achieved the best result with relatively low computational complexity (0.1 GMacs, 7.75 million parameters, 0.73 s inference), we consider EfficientNet-B2 the best network for FVC slope prediction.
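As a reminder of what the parameter and MAC counts measure, the sketch below computes both for a single standard (non-grouped) 2D convolution layer. It is a generic illustration with hypothetical function names and an arbitrary example layer; it does not reproduce the figures in Table 5, which depend on the full architectures and the input resolution.

```python
def conv2d_params(c_in, c_out, kernel, bias=True):
    """Number of learnable parameters in a standard k x k convolution."""
    weights = kernel * kernel * c_in * c_out
    return weights + (c_out if bias else 0)


def conv2d_macs(c_in, c_out, kernel, h_out, w_out):
    """Multiply-accumulate operations for one forward pass of the layer.

    Each output position of each output channel needs k * k * c_in MACs.
    """
    return kernel * kernel * c_in * c_out * h_out * w_out


# Arbitrary example: 3x3 convolution, 3 -> 32 channels, 112 x 112 output map.
print(conv2d_params(3, 32, 3))          # 896
print(conv2d_macs(3, 32, 3, 112, 112))  # 10838016
```

Summing these quantities over all layers gives the totals usually reported in millions of parameters and in GMacs, which is why a deeper or wider backbone raises both columns.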

5 Discussion and Conclusion

We proposed a novel multi-modal convolutional self-attention-based learning pipeline to predict the prognosis of IPF. To the best of our knowledge, this work was one of the earliest attempts to incorporate both CT scans and demographic information in an end-to-end manner. Furthermore, we integrated a self-attention layer on top of the CNNs to further refine the convolutional features by allowing the network to focus on a specific region of the CT scan image. Moreover, we predicted the slope of the FVC trend of a patient based on a simple linear prior assumption. Extensive experiments demonstrated the superiority of our proposed approach over recent models tested on the same dataset [mandal_prediction_2020, wong_fibrosis-net_2021].

Despite the impressive performance, one major limitation of our proposed approach was that the prognosis of pulmonary fibrosis was assumed to be linear. This assumption limited our ability to predict the actual FVC values at each time point. However, it allowed us to treat the task as a simple regression problem for predicting the future prognosis status. Fibro-CoSANet failed to produce better performance with deeper architectures, which could be due to the relatively small sample size. Finally, throughout our experiments, we used a fixed set of hyper-parameters. The overall performance of each backbone could be further improved by careful hyper-parameter selection.

In conclusion, we aimed to provide the research community with a framework that can be used on larger datasets and in clinical trials in the future. As accurate progression prediction for IPF patients is crucial for effective treatment and IPF-based datasets are rarely available, our proposed algorithm could shed light on new approaches to building trustworthy algorithms for IPF prognosis.