
BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis

Vision-and-language (V&L) models take image and text as input and learn to capture the associations between them. Prior studies show that pre-trained V&L models can significantly improve model performance on downstream tasks such as Visual Question Answering (VQA). However, V&L models are less effective when applied in the medical domain (e.g., on X-ray images and clinical notes) due to the domain gap. In this paper, we investigate the challenges of applying pre-trained V&L models in medical applications. In particular, we identify that the visual representation in general V&L models is not suitable for processing medical data. To overcome this limitation, we propose BERTHop, a transformer-based model based on PixelHop++ and VisualBERT, for better capturing the associations between the two modalities. Experiments on the OpenI dataset, a commonly used thoracic disease diagnosis benchmark, show that BERTHop achieves an average Area Under the Curve (AUC) of 98.12, higher than the state of the art (SOTA), while being trained on a dataset 9 times smaller.





1 Introduction

Computer-Aided Diagnosis (CADx) [14] systems could provide valuable benefits for disease diagnosis, including but not limited to improving the quality and consistency of predictions and reducing medical mistakes, as they are not subject to human error. Although most existing studies focus on diagnosis based on medical images such as chest X-ray (CXR) images [4, 2, 1], radiology reports often contain substantial information (e.g., patient history and previous studies) that is difficult to detect from the image alone. Besides, diagnosis from both image and text is more closely aligned with how human experts diagnose disease. Therefore, V&L models that take both images and text as input are potentially more accurate for CADx, and several attempts have been made in this direction [40, 42, 23].

Figure 1: An overview of BERTHop. BERTHop takes an X-ray image and a clinical report as input. It first encodes the image and text and extracts potential features from both modalities. Then a transformer-based model learns the associations between these two modalities. By applying appropriate vision and text extractors, the model is capable of identifying the abnormality and associating it with the text labels.

However, the shortage of annotated data in the medical domain makes utilizing V&L models challenging. Annotating medical data is an expensive process as it requires human experts. Although a couple of recent large-scale auto-labeled datasets have been provided for some medical tasks, e.g., chest X-ray [39, 6, 19], they are often noisy (low-quality) and degrade the performance of models. Besides, such datasets are not available for most medical tasks. Therefore, training V&L models with limited annotated data remains a key challenge.

Recently, pre-trained V&L models have been proposed for reducing the amount of labeled data required to train an accurate downstream model [22, 37, 36, 25] in the general domain. These models are first trained on large-scale image caption data with self-supervision signals (e.g., using a masked language model loss) to learn the association between objects and text tokens. Then, the pre-trained V&L models are used to initialize the downstream models and are fine-tuned on the target tasks. In most V&L tasks, it has been reported that V&L pre-training is a major source of performance improvement. However, we identify a key problem in applying common pre-trained V&L models to the medical domain: the large gap between the medical (target) domain and the general (source) domain makes the pre-train-and-fine-tune paradigm considerably less effective. Therefore, domain-specific designs have to be applied.

Notably, V&L models mainly leverage object-centric feature extraction methods such as Faster R-CNN [31], which is pre-trained on the general domain to detect everyday objects, e.g., cats and dogs. However, the abnormalities in X-ray images do not resemble everyday objects and will likely be ignored by a general-domain object detector.

To overcome this challenge, we propose BERTHop, a transformer-based V&L model designed for medical applications. BERTHop resolves the domain gap issue by leveraging a pre-trained language encoder, BlueBERT [28], a BERT [12] variant that has been trained on biomedical and clinical datasets. Furthermore, in BERTHop, the visual encoder of the V&L architecture is redesigned around PixelHop++ [9] and is fully unsupervised, which significantly reduces the need for labeled data [32]. PixelHop++ can extract image representations at different frequency levels, which is beneficial for abnormality detection.

We evaluate BERTHop by conducting extensive experiments and analysis for CADx in chest disease diagnosis on the OpenI dataset [11], whose labels cover 14 common thoracic diseases. Compared to SOTA (TieNet [40]), BERTHop outperforms in 11 out of 14 thoracic disease diagnoses and achieves an average AUC of 98.23%, which is 1.73% higher, while using significantly less training data (TieNet is trained on the ChestX-ray14 [39] dataset, which is 9 times larger than OpenI). Compared to a similar transformer-based V&L model pre-trained on the general domain and fine-tuned on OpenI [22, 23], BERTHop requires no expensive V&L pre-training yet outperforms it by 14.37%.

We summarize our contributions as follows: (1) We propose BERTHop, a novel data-efficient V&L model for CXR disease diagnosis that surpasses existing approaches. (2) Our proposed model incorporates PixelHop++ into a transformer-based model. To the best of our knowledge, this is the first study that integrates PixelHop++ and Deep Neural Network (DNN) models. (3) We conduct extensive experiments to demonstrate the effectiveness of each submodel used in BERTHop. (4) We show that initializing the transformer with a model pre-trained on in-domain data, even on a single modality, is highly beneficial in the medical domain.

2 Related Work

Transformer-based V&L models

Inspired by the success of BERT for NLP tasks, various transformer-based V&L models have been proposed [22, 8, 37]. They generally use an object detector pre-trained on Visual Genome [20] to extract visual features from an input image and then use a transformer to model the visual features and input sentence. They are pre-trained on a massive amount of paired image-text data with a mask-and-predict objective similar to BERT. During pre-training, part of the input is masked and the objective is to predict the masked words or image regions based on the remaining contexts. Such models have been applied to many V&L applications [43, 26, 10] including the medical domain [23]. However, the performance of these models is not satisfactory due to the domain shift between the general domain and medical domain.

V&L models in the medical domain

Various CNN-RNN-based V&L models have been proposed for disease diagnosis on CXR. Zhang et al. [42] proposed TNNT (Text-guided Neural Network Training) which helps a CNN model get guidance from text report embedding for a more efficient training process on V&L data and evaluated the model on four V&L datasets including the OpenI dataset. They showed that the text report has important information that can improve the diagnosis compared with prior vision-only models, e.g., ResNet.

TieNet is a CNN-RNN-based model for V&L embedding integrating multi-level attention layers into an end-to-end CNN-RNN framework for disease diagnosis and radiology report generation tasks. TieNet uses a ResNet-50 pre-trained for general-domain visual feature extraction and an RNN for V&L fusion. As a result, it requires a large amount of in-domain training data (ChestX-ray14) for adapting to the medical domain, limiting its practical usage. In contrast, our method achieves higher performance with very limited in-domain data.

Recently, Li et al. [23] evaluated the transferability of well-known pre-trained V&L models by fine-tuning them on MIMIC-CXR [19] and OpenI. However, these models are designed and pre-trained for the general domain, and directly fine-tuning them with limited in-domain data leads to suboptimal performance. We refer to this method as VB w/ BUTD (Section 4.2).

Figure 2: The proposed BERTHop framework for CXR disease diagnosis. A PixelHop++ model followed by a “PCA and concatenation” block is used to generate Q feature vectors. These features, along with the language embedding, are fed to the transformer that is initialized with BlueBERT.

PixelHop++ for visual feature learning

PixelHop++ was originally proposed as an alternative to deep convolutional neural networks for feature extraction from images and video frames in resource-constrained environments. It is a multi-level model that generates output channels representing an image at different frequencies.

PixelHop++ has been used in various applications and shown to be highly effective on small datasets. These applications include face gender classification [33], face recognition [34], and deep fake detection [7]. It has also recently been applied to a medical task: VoxelHop [24] leveraged this model on 3D magnetic resonance imaging (MRI) data and achieved superior results on an Amyotrophic Lateral Sclerosis (ALS) classification task.

To the best of our knowledge, this is the first study which integrates PixelHop++ and DNN models. Our proposed model takes advantage of the attention mechanism to integrate visual features extracted from PixelHop++ and the language embedding.

3 Approach

Inspired by the architecture of VisualBERT, our framework uses a single transformer to integrate visual features and language embeddings. The overall framework of our proposed approach is shown in Figure 2. We first utilize PixelHop++ to extract visual features from the X-ray image; then the text (a radiology report) is encoded into subword embeddings; finally, a joint transformer is applied on top to model the relationship between the two modalities and capture implicit alignments.
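As a rough illustration of how a single-stream transformer can consume both modalities, the sketch below builds one joint input sequence from text embeddings and visual features. It is a minimal sketch under stated assumptions, not the BERTHop implementation: the arrays are random stand-ins, `proj` is a hypothetical learned projection from the visual feature dimension to the transformer hidden size, and the sizes (128 tokens, 15 visual features of dimension 2048, hidden size 768 for a BERT-Base model) are taken from the settings reported later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 768                         # hidden size of a BERT-Base transformer
num_tokens, num_visual, visual_dim = 128, 15, 2048

# Random stand-ins for the two encoders' outputs (assumptions for illustration).
text_emb = rng.normal(size=(num_tokens, hidden))
visual_feats = rng.normal(size=(num_visual, visual_dim))

# Hypothetical learned projection that maps visual features to the hidden size.
proj = rng.normal(size=(visual_dim, hidden)) * 0.02
visual_emb = visual_feats @ proj

# Segment ids mark which positions are text (0) and which are visual (1).
segment_ids = np.concatenate([np.zeros(num_tokens, int), np.ones(num_visual, int)])

# One sequence: self-attention can then relate any token to any visual feature.
joint = np.concatenate([text_emb, visual_emb], axis=0)
print(joint.shape)  # (143, 768)
```

Because both modalities share one sequence, every attention layer can align report tokens with visual features directly, which is the property the single-transformer design relies on.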

There are two main differences between BERTHop and previous approaches:

  • Visual feature encoder

    Considering the lack of data in the medical domain, instead of using an object detector pre-trained on a general-domain dataset, we leverage PixelHop++, an unsupervised data-efficient method, to extract visual features. As the PixelHop++ output channels are too large to be fed directly into the transformer, we apply Principal Component Analysis (PCA) to the output channels for dimension reduction. PCA is an orthogonal linear transformation that maps the data to a new, lower-dimensional coordinate system in which as much of the variation of the data as possible is preserved. By applying PCA to the PixelHop++ output channels, we capture the most prominent features and prevent over-fitting. Then, we concatenate the results to generate the final visual feature vectors. (Section 3.1)


  • In-domain text pre-training

    Instead of resorting to computationally expensive V&L pre-training on a general-domain image-text dataset, we find in-domain text-only pre-training considerably more beneficial in our application. Thus, we use BlueBERT, a transformer pre-trained on biomedical and clinical datasets, as the backbone for our model. (Section 3.2)

3.1 Visual encoder

We argue that extracting visual features from a general-domain object detector, i.e., the BUTD [3] approach that is dominant in most V&L tasks, is not suitable for the medical domain. (In the following, we use the term “BUTD” to refer to extracting visual features from a pre-trained object detector rather than the full model from [3].) BUTD takes an image and employs a ResNet-based Faster R-CNN [31] for object detection and feature extraction from each object. The detector is pre-trained on Visual Genome [20] to detect objects in everyday scenes. Such an approach fails to detect medical abnormalities when applied to X-ray images. The reason is that the abnormalities in the image, which are of high importance for facilitating diagnosis, usually do not resemble the normal notion of an “object” and will likely be ignored by a general-domain object detector. Further, there exists no large-scale annotated dataset for disease abnormality detection from which to train a reliable detector [35].

We propose to adopt PixelHop++ [9] for unsupervised visual feature learning in the medical domain, which has been shown to be highly effective when trained on small-scale datasets. The key idea of PixelHop++ is to compute the parameters of its model with a closed-form expression, without using back-propagation [32]. As PixelHop++ leverages PCA for computing parameters, the model is able to extract image representations at various frequencies in an unsupervised manner. Inspired by the architecture of DNN models, PixelHop++ is a multi-level model in which each level consists of one or several PixelHop++ units followed by a max-pooling layer. An illustration of data flow in a 3-level PixelHop++ model is shown in Figure 3. When training a PixelHop++ model, the parameters of the PixelHop++ units (kernels and biases) are computed; during inference, they are used for feature extraction from pixel blocks.

Figure 3: Data flow in a 3-level PixelHop++ model. A node represents a channel.

Training phase of PixelHop++

Suppose that we have $N$ training images of size $s_1 \times s_2 \times d$, where $d$ is 1 for gray-scale and 3 for color images. They are all fed into a single PixelHop++ unit in the first level of the model. The goal of training a PixelHop++ unit is to compute linearly independent projection vectors (kernels) that can extract strong features from its input data. There are one or more PixelHop++ units in each level of a PixelHop++ model.

In the first step of processing data in a PixelHop++ unit, using a sliding window of size $s_0 \times s_0 \times d$ and a fixed stride, patches from each training image are extracted and flattened, i.e., $x_{i,1}, x_{i,2}, \dots, x_{i,M}$, where $x_{i,j}$ is the $j$th flattened patch for image $i$, and $M$ is the number of extracted patches per image.
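The patch-extraction step can be sketched as follows. This is a minimal illustration assuming a NumPy image of shape (height, width, d); the function name `extract_patches` and the toy window and stride values are hypothetical and are not taken from the paper.

```python
import numpy as np

def extract_patches(image, window, stride):
    """Return flattened patches of shape (M, window * window * d) from an (H, W, d) image."""
    h, w, d = image.shape
    patches = []
    for r in range(0, h - window + 1, stride):
        for c in range(0, w - window + 1, stride):
            # Each window is flattened into one row vector.
            patches.append(image[r:r + window, c:c + window, :].reshape(-1))
    return np.stack(patches)

# Toy gray-scale image (d = 1) whose pixel values are 0..35.
image = np.arange(6 * 6 * 1, dtype=float).reshape(6, 6, 1)
patches = extract_patches(image, window=3, stride=1)
print(patches.shape)  # (16, 9)
```

With a 6 x 6 image, a 3 x 3 window and stride 1, there are 4 window positions along each axis, hence 16 patches of 9 values each.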

In the second step, the set of all patches extracted from training images are used to compute the kernels of the PixelHop++ unit. Kernels are computed as follows:

  • The first kernel, called the DC kernel, is the mean filter, i.e., $\frac{1}{\sqrt{n}}(1, 1, \dots, 1)$, where $n$ is the size of the input vector, and extracts the mean of each input vector.

  • After computing the mean (DC component) of each vector, PCA kernels of the residuals are computed and stored as AC kernels. The first $k$ PCA kernels are the top $k$ orthogonal projection vectors that best capture the variation of the residuals.

Each image patch is projected on the computed kernels, and a scalar bias is added to the projection result to avoid the sign-confusion problem [21]. This transformation on the input vector $\mathbf{x}$ can be written as follows:

$$y_m = \mathbf{a}_m^{T}\mathbf{x} + b_m \tag{1}$$

where $\mathbf{a}_m$ represents the kernel parameters associated with the $m$th kernel of a PixelHop++ unit and $b_m$ is the kernel’s corresponding bias term.
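To make the kernel computation and the affine transformation concrete, here is a minimal NumPy sketch of one unit. It is an illustration under stated assumptions rather than the PixelHop++ reference code: the names `fit_unit` and `transform` are hypothetical, the input is random toy data, and the bias is set to the largest patch norm, which is one simple choice that keeps every response non-negative.

```python
import numpy as np

def fit_unit(patches, num_ac):
    """Compute the DC kernel, the top `num_ac` AC (PCA) kernels, and a shared bias."""
    n = patches.shape[1]
    dc = np.ones(n) / np.sqrt(n)                      # mean filter with unit norm
    residual = patches - np.outer(patches @ dc, dc)   # remove the DC component
    residual = residual - residual.mean(axis=0)       # center before PCA
    # Right singular vectors of the centered residuals are the PCA directions.
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    kernels = np.vstack([dc, vt[:num_ac]])
    # Assumed bias choice: |a^T x| <= ||x|| for unit-norm kernels, so adding the
    # largest patch norm makes every response non-negative.
    bias = np.linalg.norm(patches, axis=1).max()
    return kernels, bias

def transform(patches, kernels, bias):
    """Apply y_m = a_m^T x + b_m for every kernel a_m (one output column per kernel)."""
    return patches @ kernels.T + bias

rng = np.random.default_rng(0)
toy_patches = rng.normal(size=(200, 9))       # 200 flattened 3 x 3 patches
kernels, bias = fit_unit(toy_patches, num_ac=4)
responses = transform(toy_patches, kernels, bias)
print(kernels.shape, responses.shape)  # (5, 9) (200, 5)
```

No gradient step appears anywhere: the kernels come from a closed-form decomposition of the training patches, which is what makes the feature learner unsupervised and cheap to fit on small datasets.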

By transforming the input vector with a kernel in a PixelHop++ unit, one output channel is generated. For example, in the first level of the model, the PixelHop++ unit generates one DC channel and multiple AC channels. Each channel is shown by a node in Figure 3.

In the last step, model pruning is executed to remove the channels that carry little information. The ratio of the variance explained by each kernel to the variance of the training data is called the “energy ratio” of the kernel (or of its corresponding channel) and is used as the criterion for pruning the model. An energy ratio threshold value, $T$, is selected and model pruning is performed using the following rule:

  • If the energy ratio of a channel is less than $T$, it will be discarded (discarded nodes/channels in Figure 3), as the variation of data along the corresponding kernel is very small.

  • If the energy ratio of a channel is more than $T$, it is forwarded to the next level for further energy compaction (intermediate nodes/channels in Figure 3).

Each output intermediate channel generated by a PixelHop++ unit will be fed into one separate PixelHop++ unit in the next level. So, except for the first level of the model, other levels contain more than one PixelHop++ unit.
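The pruning rule can be sketched in a few lines. This is a simplified illustration, assuming the energy ratio of a channel is the variance of its responses divided by the total variance over all channels of the unit; the function name `split_by_energy`, the toy responses, and the threshold value are hypothetical, and a channel exactly at the threshold is kept here by assumption.

```python
import numpy as np

def split_by_energy(responses, threshold):
    """Partition channels by energy ratio (channel variance over total variance)."""
    variances = responses.var(axis=0)
    energy = variances / variances.sum()
    keep = np.flatnonzero(energy >= threshold)    # forwarded to the next level
    discard = np.flatnonzero(energy < threshold)  # pruned
    return energy, keep, discard

# Three toy channels with variances 4, 1 and 0.01.
toy_responses = np.array([[0.0, 0.0, 0.0],
                          [4.0, 2.0, 0.2]])
energy, keep, discard = split_by_energy(toy_responses, threshold=0.01)
print(keep.tolist(), discard.tolist())  # [0, 1] [2]
```

The third channel carries roughly 0.2% of the total variance, so it falls below the threshold and is dropped, while the first two would each feed their own unit in the next level.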

Inference phase of PixelHop++

Data flow is similar to the training phase, but all parameters, including kernel weights and biases, have already been computed during training. Therefore, according to Equation 1, feature extraction from test images is conducted in each PixelHop++ unit using the computed kernels and biases.

3.2 In-domain text pre-training

As shown in an example in Figure 4, the report is written by an expert radiologist, who lists the normal and abnormal observations in the “finding” section and other important patient information e.g. patient history in the “impression” section of the report. The text style of the report is drastically different from that of the pretraining corpora of BERT (Wikipedia and BookCorpus) or V&L models (MSCOCO and Conceptual Captions).

However, previous methods [23] do not take such a significant domain gap into consideration. Rather, they initialize the transformer with a model trained on general-domain image-text corpora, as in most V&L tasks. Meanwhile, pre-training with text-only corpora has been reported to show only marginal or no benefit [37]. In the medical domain, however, we find that using a transformer pre-trained on in-domain text corpora as our initialized backbone serves as a simpler yet stronger approach.

Peng et al. [28] proposed the Biomedical Language Understanding Evaluation (BLUE) benchmark, which evaluated the performance of BERT and ELMo [29] on 5 common biomedical text-mining tasks with ten corpora and showed the superiority of BERT when it is pre-trained on biomedical and clinical datasets (BlueBERT). Recently, BlueBERT has been widely used in the bioNLP community for various NLP tasks [15, 13, 38] and a few V&L tasks, e.g., data labeling [18]. Thus, we leverage this pre-trained version of BERT as the backbone in BERTHop and initialize its single-stream transformer [22] with BlueBERT to better capture the clinical report information.

Figure 4: A sample image-text pair in the OpenI dataset. The text report from a radiologist is important for disease diagnosis but has a significantly different style compared to general-domain text.

4 Experiments

In this section, we evaluate BERTHop on the OpenI dataset and compare it with other existing models. To understand the effectiveness of the model designs, we also conduct detailed studies to verify the value of the visual encoder and the transformer initialization. Finally, we demonstrate a case study to show that BERTHop can effectively identify abnormal regions in CXR images.

4.1 Experiment setup


For CADx in CXR disease diagnosis, commonly used datasets include ChestX-ray14, MIMIC-CXR, and OpenI. In this paper, we focus on the OpenI dataset for which professional annotators labeled the data. OpenI comprises 3,996 reports and 8,121 associated images from 3,996 unique patients collected by Indiana University from multiple institutes. Its labels include 14 commonly occurring thoracic chest diseases, i.e., Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia, Pneumothorax, Consolidation, Edema, Emphysema, Fibrosis, Pleural Thickening (PT), and Hernia. OpenI is a reliable choice for both training and evaluating V&L models as it is annotated by experts (labels are not learned from text reports or images). The disadvantage of using OpenI for training is that it contains a small amount of training data which is a challenge for DNN models. We apply the same pre-processing as TieNet and obtain 3,684 image-text pairs.

We do not consider ChestX-ray14 and MIMIC-CXR for benchmarking because their labels are generated automatically from the images and/or associated reports. Specifically, ChestX-ray14 labels are mined from the radiology reports using text-processing techniques, and MIMIC-CXR labels are generated using the ChexPert [17] and NegBio [27] auto-labelers. As their labels are machine-generated, evaluating V&L models on these datasets is not reliable. Therefore, we evaluate on OpenI to accurately compare the performance of BERTHop with human expert performance.

Figure 5: OpenI label statistics: (A) Percentage of normal and abnormal cases (B) Percentage of different diseases.

Model and training parameters

We first resize all images of OpenI to and apply the unsupervised feature learner, PixelHop++. We use a three-level PixelHop++ with the following hyper-parameters: , , , and . Then, we apply PCA to its output channels and concatenate the generated vectors to form a set of $Q$ visual features of dimension $D$, i.e., $V = [v_1, v_2, \dots, v_Q]$ with $v_i \in \mathbb{R}^{D}$. In BERTHop, $D$ is set to 2048. In our experimental setup, $Q$ is equal to 15, but it may vary depending on the size of the output channels of the PixelHop++ model and on the number of PCA components.
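The “PCA and concatenation” step can be illustrated with a small NumPy sketch. The paper does not spell out how channels are grouped, so everything specific here is an assumption made for illustration: the helper `pca_reduce`, the number of components kept per channel, the grouping of consecutive channels into one feature vector, and the toy sizes, which are far smaller than the 2048-dimensional features used in BERTHop.

```python
import numpy as np

def pca_reduce(x, k):
    """Project the rows of x (samples, features) onto the top-k principal components."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(0)
num_images, num_channels, side = 20, 6, 8
# Random stand-in for PixelHop++ output channels: one (side x side) map per channel.
channels = rng.normal(size=(num_images, num_channels, side, side))

k = 4       # PCA components kept per channel (hypothetical)
group = 3   # channels concatenated into one feature vector (hypothetical)

# Reduce each channel independently, then concatenate groups of reduced channels.
reduced = [pca_reduce(channels[:, c].reshape(num_images, -1), k)
           for c in range(num_channels)]
features = np.stack(
    [np.concatenate(reduced[g:g + group], axis=1)
     for g in range(0, num_channels, group)],
    axis=1,
)
print(features.shape)  # (20, 2, 12)
```

In this toy setting each image ends up with 2 feature vectors of dimension 12, which play the roles of Q and D; changing the number of channels or PCA components changes Q and D accordingly.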

As for the transformer backbone, we use BlueBERT-Base (Uncased, PubMed+MIMIC-III) from Huggingface [41], a transformer library. Having the visual features from the visual encoder and text embedding, we train the transformer on the training set of OpenI with 2,912 image-text pairs. We use batch size = 18, learning rate =

, max-seq-length = 128, and Stochastic Gradient Descent (SGD) as the optimizer with momentum = 0.9 and train it for 240 epochs.

Evaluation metric

All mentioned datasets are highly imbalanced and mostly contain normal cases. Figure 5 shows the percentages of different diseases compared with normal cases in OpenI. Therefore, evaluating models using metrics such as accuracy does not reflect model performance. Instead, we follow prior studies and evaluate models based on the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) score.
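The following toy example shows why accuracy is misleading on imbalanced labels while AUC is not. It is a self-contained sketch: `roc_auc` is a hypothetical helper that computes AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative case (ties count as one half), and the labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0] * 8 + [1] * 2          # 80% normal cases, 20% abnormal cases

# A degenerate model that always predicts "normal".
always_normal = [0.0] * 10
accuracy = sum((s >= 0.5) == bool(y) for y, s in zip(labels, always_normal)) / len(labels)
print(accuracy, roc_auc(labels, always_normal))  # 0.8 0.5

# A model that ranks most abnormal cases above the normal ones.
ranked = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.5, 0.9]
print(roc_auc(labels, ranked))  # 0.9375
```

The degenerate model reaches 80% accuracy without detecting anything, yet its AUC is 0.5, the chance level, which is why AUC is the more informative metric here.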

| Disease | TNNT [42] | TieNet* [40] | VB w/ BUTD [23] | BERTHop |
| --- | --- | --- | --- | --- |
| Atelectasis | - | 0.976 | 0.9247 | 0.9838 |
| Cardiomegaly | - | 0.962 | 0.9665 | 0.9896 |
| Effusion | - | 0.977 | 0.9049 | 0.9432 |
| Infiltration | - | 0.984 | 0.8867 | 0.9926 |
| Mass | - | 0.903 | 0.6428 | 0.9900 |
| Nodule | - | 0.960 | 0.8480 | 0.9810 |
| Pneumonia | - | 0.994 | 0.8537 | 0.9967 |
| Pneumothorax | - | 0.960 | 0.8931 | 1.0000 |
| Consolidation | - | 0.989 | 0.7870 | 0.9671 |
| Edema | - | 0.995 | 0.9500 | 0.9987 |
| Emphysema | - | 0.868 | 0.8565 | 0.9971 |
| Fibrosis | - | 0.960 | 0.6274 | 0.9966 |
| PT | - | 0.953 | 0.7612 | 0.9330 |
| Hernia | - | - | - | - |
| AVG | 0.854 | 0.965 | 0.8386 | 0.9823 |

Table 1: AUC comparison of our model with three other methods for thoracic disease diagnosis on OpenI. BERTHop significantly outperforms models trained with a similar amount of data (e.g., VB w/ BUTD). *TieNet is trained on a much larger dataset than BERTHop.

4.2 Main results

We train BERTHop on the OpenI training dataset containing 2,912 image-text pairs and evaluate it on the corresponding test set comprising 772 image-text pairs. The ROC curve for each disease is plotted in Figure 6.

We compare BERTHop with the following approaches:

  • TNNT [42]: a Text-guided Neural Network Training method. See the details in Section 2.

  • TieNet [40]: a CNN-RNN-based model. See the details in Section 2.

  • VB w/ BUTD [22, 23]: Fine-tuning the original VisualBERT.

We evaluate all the models using the same AUC implementation in scikit-learn [5]. Table 1 summarizes the performance of BERTHop compared with existing methods.

The results demonstrate that BERTHop outperforms SOTA (TieNet) in 11 out of 14 thoracic disease diagnoses and achieves an average AUC of 98.23%, which is 14.37%, 12.83%, and 1.73% higher than VB w/ BUTD, TNNT, and TieNet, respectively. Note that TieNet has been trained on a much larger annotated dataset, i.e., the ChestX-ray14 dataset containing 108,948 training examples, while BERTHop is trained on only 2,912 examples.
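The reported gaps are absolute differences between average AUC values, expressed in percentage points. The short check below recomputes them from the AVG row of Table 1; the dictionary and variable names are only illustrative.

```python
# Average AUC per method, copied from the AVG row of Table 1.
avg_auc = {"TNNT": 0.854, "TieNet": 0.965, "VB w/ BUTD": 0.8386, "BERTHop": 0.9823}

# Gap of BERTHop over each baseline, in percentage points.
gaps = {
    name: round((avg_auc["BERTHop"] - auc) * 100, 2)
    for name, auc in avg_auc.items()
    if name != "BERTHop"
}
print(gaps)  # {'TNNT': 12.83, 'TieNet': 1.73, 'VB w/ BUTD': 14.37}
```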

Regarding the VB w/ BUTD results, we re-evaluate them based on the code released by the original authors. However, we could not reproduce the results reported in the paper, even after contacting the authors.

Figure 6: ROC curve of BERTHop for all 14 thoracic diseases.

4.3 In-domain text pre-training

We further investigate the influence of different transformer backbone initializations on model performance by pairing it with different visual encoders. The results are listed in Table 2.

First, we find that the proposed initialization with a model pre-trained on in-domain text corpora (BlueBERT) brings significant performance boosts when paired with PixelHop++. Initializing with BlueBERT gives a 6.46% performance increase compared to initializing with BERT.

Second, when using BUTD, the model is less sensitive to the transformer initialization and the performance is generally low (from 83.09% to 85.64%). In contrast to other V&L tasks [22], general-domain V&L pre-training is not instrumental.

The above findings suggest that for medical V&L applications, in-domain single modality pre-training can bring larger performance improvement than using pre-trained V&L models from the general domain, even though the latter is trained on a larger corpus. The relation and trade-off between single-modality pre-training and cross-modality pre-training are overlooked by previous works [22] and we advocate for future research on this.

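| Disease | BUTD / VB | BUTD / BERT | BUTD / BlueBERT | PixelHop++ / BERT | PixelHop++ / BlueBERT |
| --- | --- | --- | --- | --- | --- |
| Atelectasis | 0.9247 | 0.8677 | 0.8866 | 0.9890 | 0.9838 |
| Cardiomegaly | 0.9665 | 0.8877 | 0.8875 | 0.9772 | 0.9896 |
| Effusion | 0.9049 | 0.8940 | 0.9120 | 0.9013 | 0.9432 |
| Mass | 0.6428 | 0.7365 | 0.7373 | 0.8886 | 0.9900 |
| Consolidation | 0.7870 | 0.8766 | 0.8906 | 0.8949 | 0.9671 |
| Emphysema | 0.8565 | 0.7313 | 0.8261 | 0.9641 | 0.9971 |
| AVG | 0.8386 | 0.8309 | 0.8564 | 0.9177 | 0.9823 |

Column headers give the visual encoder followed by the transformer backbone.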

Table 2: Effect of the transformer backbones when paired with different visual encoders. We find that when using BUTD features, the model becomes insensitive to the transformer initialization and the expensive V&L pre-training brings little benefit compared to BERT initialization. When using PixelHop++, the model benefits significantly from initialization with BlueBERT, which is pre-trained on in-domain text corpora.

4.4 Visual encoder

To better understand which visual encoder is suitable for medical applications, we compare three visual feature extraction methods (BUTD, ChexNet [30], and PixelHop++). In particular, we replace the visual encoder of BERTHop with different visual encoders and report their performance. BUTD extracts visual features from a Faster R-CNN pre-trained on Visual Genome, which is prevalent in recent V&L models. ChexNet is a CNN-based method proposed for pneumonia detection. It is a 121-layer DenseNet [16] trained on the ChestX-ray14 dataset for pneumonia detection, with all pneumonia cases labeled as positive examples and all other cases as negative examples. By modifying the loss function, it is also trained to classify all 14 thoracic diseases, and it achieved state-of-the-art results among existing vision-only models, e.g., [39]. To augment the data, it extracts 10 crops from the image (the 4 corners, the center, and their horizontally flipped versions) and feeds them into the network to generate a feature vector of dimension 1024 for each of them. To make it compatible with our transformer framework, we apply a linear transformation that maps the feature vectors of size 1024 generated by ChexNet to size 2048. We fine-tune ChexNet and train the parameters of the linear transformation on the OpenI dataset.
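The cropping scheme and the dimension-matching linear map can be sketched as below. This is an illustration under stated assumptions, not the ChexNet pipeline: `ten_crop` is a hypothetical helper, the image and crop sizes are arbitrary, the CNN is replaced by random 1024-dimensional stand-in features, and the weights of the linear map are random rather than trained.

```python
import numpy as np

def ten_crop(image, size):
    """Return 10 crops: the 4 corners, the center, and their horizontal flips."""
    h, w = image.shape[:2]
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    crops = [image[t:t + size, l:l + size] for t, l in offsets]
    crops += [crop[:, ::-1] for crop in crops]   # horizontally flipped versions
    return np.stack(crops)

rng = np.random.default_rng(0)
image = rng.random((256, 256))                   # toy gray-scale image
crops = ten_crop(image, size=224)

# Stand-in for the CNN: one 1024-dimensional feature vector per crop.
cnn_features = rng.normal(size=(len(crops), 1024))

# Linear transformation from 1024 to 2048 dimensions (random, untrained weights).
weight = rng.normal(size=(1024, 2048)) * 0.02
bias = np.zeros(2048)
visual_features = cnn_features @ weight + bias
print(crops.shape, visual_features.shape)  # (10, 224, 224) (10, 2048)
```

Each image therefore contributes 10 visual feature vectors of dimension 2048, matching the feature dimension the transformer expects from the other visual encoders.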

The results in Table 3 show that the visual encoder of BERTHop, PixelHop++, can extract richer features from the CXR images, as it uses a data-efficient method capable of extracting image representations at different frequencies. The transformer can then highlight the most informative features in the image-text data through its attention mechanism to make the final decision. In Section 5, we explore the visual encoder of BERTHop and its effectiveness in capturing abnormal regions.

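| Disease | BUTD | ChexNet | PixelHop++ |
| --- | --- | --- | --- |
| Atelectasis | 0.8866 | 0.9787 | 0.9838 |
| Cardiomegaly | 0.8875 | 0.9797 | 0.9896 |
| Effusion | 0.9120 | 0.8894 | 0.9432 |
| Mass | 0.7373 | 0.7529 | 0.9900 |
| Consolidation | 0.8906 | 0.9000 | 0.9671 |
| Emphysema | 0.8261 | 0.9067 | 0.9971 |
| AVG | 0.8564 | 0.8798 | 0.9823 |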

Table 3: Comparison between different visual encoders (BUTD, ChexNet, and PixelHop++) under the same transformer backbone, BlueBERT. PixelHop++ outperforms BUTD and even ChexNet, which is pre-trained on a large in-domain disease diagnosis dataset.

5 Analysis

Effectiveness of BERTHop with different dataset scales

To demonstrate the effectiveness of BERTHop on datasets of different scales and justify our designs, we conduct an experiment to compare BERTHop with its two variants: (1) In PH_BERT, we replace BlueBERT with BERT. We compare BERTHop with PH_BERT to show how a domain-specific BERT model helps to improve the performance in medical applications. (2) In BUTD_BlueBERT, we replace the visual encoder PixelHop++ with the general visual encoder of BUTD.

We randomly select fractions of the training set of OpenI to train these three models and compare their performance on the entire test set of OpenI. Figure 7 illustrates that the performance of BERTHop is consistently better than the other two settings.

Figure 7: Avg AUC of three settings with different percentages of training data. BERTHop remains effective with different dataset scales.

Visualize abnormal regions identified by BERTHop

We visualize PixelHop++ output channels of BERTHop to probe whether it can effectively capture abnormal regions in CXR images. In this study, we asked two radiologists to annotate pathology regions of a few examples related to different diseases. As shown in Figure 8, some output channels can successfully highlight the abnormalities in CXR images. This is due to the fact that PixelHop++ extracts image representations at different frequencies which is beneficial for abnormality detection.

Figure 8: On the top, we mark the pathology regions annotated by two radiologists (the yellow circles and lines); on the bottom, we visualize the visual features from BERTHop (brighter colors mean higher feature values). BERTHop can successfully highlight the abnormal regions identified by expert radiologists.

6 Conclusion and future work

In this paper, we proposed a high-performance data-efficient V&L model, BERTHop, for CXR disease diagnosis. We showed that BERTHop outperforms state-of-the-art while it is trained on a much smaller training set. Our studies verify the effectiveness of the visual feature extractor PixelHop++ and the transformer backbone initialization BlueBERT.

As a future research direction, we plan to study how anomaly detection techniques can be incorporated to further improve the performance of the model. As no large-scale annotated CXR dataset for anomaly detection is available, we may use weakly supervised techniques or knowledge transfer from similar tasks. We are also interested in how our proposed BERTHop model can help other biomedical tasks, e.g., COVID-19 diagnosis and radiology report generation.


  • [1] R. H. Abiyev and M. K. S. Ma’aitah (2018) Deep convolutional neural networks for chest diseases detection. Journal of healthcare engineering 2018. Cited by: §1.
  • [2] I. Allaouzi and M. B. Ahmed (2019) A novel approach for multi-label chest x-ray classification of common thorax diseases. IEEE Access 7, pp. 64279–64288. Cited by: §1.
  • [3] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086. Cited by: §3.1, footnote 1.
  • [4] E. Ayan and H. M. Ünver (2019) Diagnosis of pneumonia from chest x-ray images using deep learning. In 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT), pp. 1–5. Cited by: §1.
  • [5] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux (2013) API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122. Cited by: §4.2.
  • [6] A. Bustos, A. Pertusa, J. Salinas, and M. de la Iglesia-Vayá (2020) Padchest: a large chest x-ray image dataset with multi-label annotated reports. Medical image analysis 66, pp. 101797. Cited by: §1.
  • [7] H. Chen, M. Rouhsedaghat, H. Ghani, S. Hu, S. You, and C. -C. J. Kuo (2021) DefakeHop: a light-weight high-performance deepfake detector. External Links: 2103.06929 Cited by: §2.
  • [8] Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020) Uniter: universal image-text representation learning. In European Conference on Computer Vision, pp. 104–120. Cited by: §2.
  • [9] Y. Chen, M. Rouhsedaghat, S. You, R. Rao, and C. J. Kuo (2020) Pixelhop++: a small successive-subspace-learning-based (ssl-based) model for image classification. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 3294–3298. Cited by: §1, §3.1.
  • [10] S. Chou, W. Chao, W. Lai, M. Sun, and M. Yang (2020) Visual question answering on 360° images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1607–1616. Cited by: §2.
  • [11] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald (2016) Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23 (2), pp. 304–310. Cited by: §1.
  • [12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • [13] K. C. Fraser, I. Nejadgholi, B. De Bruijn, M. Li, A. LaPlante, and K. Z. E. Abidine (2019) Extracting umls concepts from medical text using general and domain-specific deep learning models. arXiv preprint arXiv:1910.01274. Cited by: §3.2.
  • [14] M. L. Giger and K. Suzuki (2008) Computer-aided diagnosis. In Biomedical information technology, pp. 359–XXII. Cited by: §1.
  • [15] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon (2020) Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779. Cited by: §3.2.
  • [16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §4.4.
  • [17] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. (2019) Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 590–597. Cited by: §4.1.
  • [18] S. Jain, A. Smit, S. Q. Truong, C. D. Nguyen, M. Huynh, M. Jain, V. A. Young, A. Y. Ng, M. P. Lungren, and P. Rajpurkar (2021) VisualCheXbert: addressing the discrepancy between radiology report labels and image labels. arXiv preprint arXiv:2102.11467. Cited by: §3.2.
  • [19] A. E. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng (2019) MIMIC-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042. Cited by: §1, §2.
  • [20] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123 (1), pp. 32–73. Cited by: §2, §3.1.
  • [21] C. J. Kuo, M. Zhang, S. Li, J. Duan, and Y. Chen (2019) Interpretable convolutional neural networks via feedforward design. Journal of Visual Communication and Image Representation 60, pp. 346–359. Cited by: §3.1.
  • [22] L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §1, §1, §2, §3.2, 3rd item, §4.3, §4.3.
  • [23] Y. Li, H. Wang, and Y. Luo (2020) A comparison of pre-trained vision-and-language models for multimodal representation learning across medical images and reports. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1999–2004. Cited by: §1, §1, §2, §2, §3.2, 3rd item, Table 1.
  • [24] X. Liu, F. Xing, C. Yang, C. J. Kuo, S. Babu, G. E. Fakhri, T. Jenkins, and J. Woo (2021) VoxelHop: successive subspace learning for als disease classification using structural mri. arXiv preprint arXiv:2101.05131. Cited by: §2.
  • [25] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265. Cited by: §1.
  • [26] J. Lu, V. Goswami, M. Rohrbach, D. Parikh, and S. Lee (2020) 12-in-1: multi-task vision and language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10437–10446. Cited by: §2.
  • [27] Y. Peng, X. Wang, L. Lu, M. Bagheri, R. Summers, and Z. Lu (2018) Negbio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits on Translational Science Proceedings 2018, pp. 188. Cited by: §4.1.
  • [28] Y. Peng, S. Yan, and Z. Lu (2019) Transfer learning in biomedical natural language processing: an evaluation of bert and elmo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474. Cited by: §1, §3.2.
  • [29] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §3.2.
  • [30] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. (2017) Chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225. Cited by: §4.4.
  • [31] S. Ren, K. He, R. Girshick, and J. Sun (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39 (6), pp. 1137–1149. Cited by: §1, §3.1.
  • [32] M. Rouhsedaghat, M. Monajatipoor, Z. Azizi, and C. J. Kuo (2021) Successive subspace learning: an overview. arXiv preprint arXiv:2103.00121. Cited by: §1, §3.1.
  • [33] M. Rouhsedaghat, Y. Wang, X. Ge, S. Hu, S. You, and C. J. Kuo (2020) Facehop: a light-weight low-resolution face gender classification method. arXiv preprint arXiv:2007.09510. Cited by: §2.
  • [34] M. Rouhsedaghat, Y. Wang, S. Hu, S. You, and C. J. Kuo (2020) Low-resolution face recognition in resource-constrained environments. arXiv preprint arXiv:2011.11674. Cited by: §2.
  • [35] H. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers (2016) Deep convolutional neural networks for computer-aided detection: cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35 (5), pp. 1285–1298. Cited by: §3.1.
  • [36] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §1.
  • [37] H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §1, §2, §3.2.
  • [38] S. Wada, T. Takeda, S. Manabe, S. Konishi, J. Kamohara, and Y. Matsumura (2020) A pre-training technique to localize medical bert and enhance biobert. arXiv preprint arXiv:2005.07202. Cited by: §3.2.
  • [39] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106. Cited by: §1, §1, §4.4.
  • [40] X. Wang, Y. Peng, L. Lu, Z. Lu, and R. M. Summers (2018) Tienet: text-image embedding network for common thorax disease classification and reporting in chest x-rays. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9049–9058. Cited by: §1, §1, 2nd item, Table 1.
  • [41] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. (2019) HuggingFace’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. Cited by: §4.1.
  • [42] Z. Zhang, P. Chen, X. Shi, and L. Yang (2019) Text-guided neural network training for image recognition in natural scenes and medicine. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §2, 1st item, Table 1.
  • [43] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao (2020) Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 13041–13049. Cited by: §2.