Detecting clinically relevant objects in medical images remains a challenge despite large datasets, owing to the lack of detailed labels. To address the label issue, we utilize scene-level labels with a detection architecture that incorporates natural language information. We present a challenging new set of radiologist-paired bounding box and natural language annotations on the publicly available MIMIC-CXR dataset, especially focussed on pneumonia and pneumothorax. Along with the dataset, we present a joint vision language weakly supervised transformer layer-selected one-stage dual head detection architecture (LITERATI), alongside strong baseline comparisons with class activation mapping (CAM), gradient CAM, and relevant implementations on the NIH ChestXray-14 and MIMIC-CXR datasets. Borrowing from advances in vision language architectures, the LITERATI method accepts joint image and referring expression (objects localized in the image using natural language) input for detection and scales in a purely weakly supervised fashion. The architectural modifications address three obstacles: implementing a supervised vision and language detection method in a weakly supervised fashion, incorporating clinical referring expression natural language information, and generating high-fidelity detections with map probabilities. Nevertheless, the challenging clinical nature of the radiologist annotations, including subtle references, multi-instance specifications, and relatively verbose underlying medical reports, ensures that the vision language detection task at scale remains stimulating for future investigation.
Recently, the release of large-scale chest x-ray datasets has enabled methods that scale with such data [7, 18, 25]. Whereas image classification implementations may reach adequate performance using scene-level labels, significantly more effort is required to annotate bounding boxes around the numerous visual features of interest. Yet there is detailed information present in the released clinical reports that could inform natural language (NL) methods. The proposed method brings together advances in object detection, language modeling, and their joint usage [26, 28].
Typically, object detection algorithms are either multi-stage, with a region proposal stage, or single stage, with proposed regions scored against object classes as they are proposed [20, 4]. Single stage detectors have the benefit of fast inference time at often nearly the same accuracy [20, 21]. Object detection networks benefit from sharing the same backbone architectures as their image classification cousins, where the visual features carry significance and the shared modularity of networks holds. This modularity is further realized in recent vision language architectures.
Vision language networks seek to encode the symbolic and information-dense content of NL alongside visual features to solve applications such as visual question answering, high-fidelity image captioning, and other multi-modal tasks, some of which have seen application in medical imaging [26, 27, 31, 16]. Recent advances in NLP incorporate the transformer unit, a computational building block that allows every word in a sequence to learn an attentional weighting with respect to every other word, given standard NLP tasks such as cloze, next sentence prediction, etc. Furthermore, deep transformer networks of a dozen or more layers trained for the language modeling task (next word prediction) were found to be adaptable to a variety of tasks, in part due to their ability to learn the components of a traditional NLP processing pipeline. The combination of NLP with the vision task of object detection centers on the issue of visual grounding, namely, given a referring phrase, how the phrase places an object in the image. The computational generation of referring phrases in a natural fashion is a nontrivial problem centered on photographs of cluttered scenes, where prevailing methods are based on probabilistically mapped potentials for attribute categories. Related detection methods on cluttered scenes emphasize a single stage and end-to-end training [10, 8].
In particular, our method builds on a supervised single-stage visual grounding method that incorporates fixed transformer embeddings. The original one-stage method fuses the referring expression, in the form of a transformer embedding, with the spatial features in the YOLOv3 detector. Our method adapts this to a weakly supervised implementation, taking care to ensure adequate training signal can propagate to the object detection backbone via a technique that passes training signal directly through a fixed global average pooling layer. The fast and memory-efficient DarkNet-53 backbone combines with fixed bidirectionally encoded features to visually ground radiologist-annotated phrases. The fixed transformer embeddings are allowed increased flexibility through a learned selection layer, as corroborated by concurrent work, though here our explicit motivation is to boost the NL information (verified by ablation study) of phrasing more sophisticated than that encountered in the generic visual grounding setting. To narrow our focus for object detection, we consider two datasets: the ChestXray-14 dataset, which was released with 984 bounding box annotations spread across 8 labels, and the MIMIC-CXR dataset, for which we collected over 400 board-certified radiologist bounding box annotations. Our weakly supervised transformer layer-selected one-stage dual head detection architecture (LITERATI) forms a strong baseline on a challenging set of annotations with scant information provided by the disease label.
The architecture of our method is presented in Fig. 1 with salient improvements numbered. The inputs of the model are anteroposterior view chest x-ray images and a referring expression parsed from the associated clinical report for the study. The output of the model is a probability map for each disease class (pneumonia and pneumothorax) as well as a classification for the image to generate the scene-level loss. Intersection over union (IOU) is calculated for the prediction and ground truth annotation as IOU = |A ∩ B| / |A ∪ B|, where A is the predicted region and B is the ground truth annotation.
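For concreteness, the IOU criterion can be computed for axis-aligned boxes as follows; this is a generic sketch (box layout `(x1, y1, x2, y2)` is an assumption, not the paper's exact data format):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # coordinates of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection counts as a hit at, e.g., IOU=0.2 when this value meets or exceeds 0.2 against some ground truth box.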
The MIMIC-CXR dataset second stage release included a reference label file built on the CheXpert labeler, which we used for our filtering and data selection. The CheXpert labels are used to isolate pneumonia and pneumothorax images, and the corresponding chest x-ray reports are retrieved by subject and study number. The images are converted from the full resolution (typically 2544x3056) to 416x416 to match the preferred one-stage resolution. For the ChestXray-14 dataset, the 1024x1024 PNG files are likewise converted to 416x416.
For the MIMIC-CXR dataset, the radiologist reports are parsed to search for referring expressions, i.e., phrases in which an object of interest is identified and located in the image. The referring expressions are created using tools adapted from . Namely, the tooling uses the Stanford CoreNLP parser and the NLTK tokenizer to separate sentences into the R1-R7 attributes, reformed where there is an object in the image as either subject or direct object. Specifically, the referring phrase consists of the R7 attributes (generics), R1 attributes (entry-level name), R5 attributes (relative location), and finally R6 attributes (relative object). Sample referring phrases in the reports are “confluent opacity at bases”, “left apical pneumothorax”, and “multifocal bilateral airspace consolidation”. As the referring phrase occasionally does not capture the disease focus, the reports are additionally processed to excerpt phrases containing “pneumonia” and “pneumothorax” to create a disease emphasis dataset split. For example, a phrase that does not qualify as a canonical referring expression but is present for the disease is “vague right mid lung opacity, which is of uncertain etiology, although could represent an early pneumonia”, a positive mention, standing in contrast to “no complications, no pneumothorax”, a negative mention. To include the presence of normal and negative example images, data negative for pneumothorax and pneumonia was mixed in at an equal ratio to positive data for either category. The experiments on the disease emphasis phrases are presented as an interrogative data distribution ablation of the NL function. To further probe the language function, the scene-level label can be substituted for the language embedding itself. For the ChestXray-14 dataset, the scene-level label is the only textual input publicly available.
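The disease emphasis split above can be illustrated with a deliberately simplified sketch. This is not the paper's CoreNLP/NLTK pipeline; it is a minimal stand-in that excerpts disease-mention sentences and tags polarity with a short, assumed list of negation cues:

```python
import re

# Assumed, simplified negation cues; the actual pipeline uses richer parsing.
NEGATION_CUES = ("no ", "without ", "negative for ")

def disease_emphasis_phrases(report, diseases=("pneumonia", "pneumothorax")):
    """Excerpt sentences mentioning a target disease and tag each mention
    as a positive or negative mention via simple negation cues."""
    phrases = []
    for sentence in re.split(r"[.;]\s*", report.lower()):
        for disease in diseases:
            if disease in sentence:
                polarity = "negative" if any(c in sentence for c in NEGATION_CUES) else "positive"
                phrases.append((sentence.strip(), disease, polarity))
    return phrases
```

On the report excerpts quoted above, this tags the "early pneumonia" sentence as a positive mention and "no complications, no pneumothorax" as a negative mention.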
In the NL ablation, experiments are performed on the MIMIC-CXR dataset with three different levels of referring phrases provided during training. The tersest level of phrasing is simply the scene-level label, which includes the cases pneumonia, pneumothorax, pneumonia and pneumothorax, or the negative phrase “no pneumo”. The second and third levels use the phrasing described previously. At test time, the full annotation provided by the radiologist is used.
Once the referring expressions are in place, the ingredients are set for board-certified radiologist clinical annotations. We pair the images with highlighted relevant phrases in the clinical report by building functionality on the publicly available MS COCO annotator. The radiologist is given free rein to select phrases in the report that are clinically relevant. The radiologist annotated 455 clinically relevant phrases with bounding boxes on the MIMIC-CXR dataset, which we release at https://github.com/leotam/MIMIC-CXR-annotations. As of writing, the annotations constitute the largest disease-focussed bounding box labels with referring expressions publicly released, and we hope they will be a valuable contribution to the clinical visual grounding milieu.
There are six modifications to  noted in Fig. 1 beyond the parsing discussed. To adapt the network from supervised to weakly supervised, the classification layer must not be trained, to reduce partitioning of the data information. To achieve that purpose, a global average pooling layer was combined with a cross-entropy loss on the scene labels, Fig. 1 (5, 6), replacing the convolution-batch norm-convolution (CBC) layer that generates the bounding box anchors in the original implementation. In place of the CBC layer, a deconvolution-layer norm layer was prepended to the pooling layer, Fig. 1 (3), which additionally allowed grounding on an image-scale probability map, Fig. 1 (7), instead of the anchor shifts typically dictated by the YOLOv3 implementation.
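The deconvolution, layer norm, and fixed global average pooling arrangement can be sketched in PyTorch as below. Channel counts, upsampling factor, and map size are illustrative assumptions, not the paper's exact configuration; the point is that pooling is parameter-free, so the scene-level loss backpropagates directly into the probability map and backbone:

```python
import torch
import torch.nn as nn

class WeakSupHead(nn.Module):
    """Sketch of a weakly supervised dual head: a deconvolution + layer norm
    produces a per-class score map, and a fixed (parameter-free) global
    average pooling collapses it to scene-level logits for the loss."""
    def __init__(self, in_channels=256, num_classes=2, map_size=104):
        super().__init__()
        # upsample backbone features toward image scale (factor of 4 here)
        self.deconv = nn.ConvTranspose2d(in_channels, num_classes,
                                         kernel_size=4, stride=4)
        self.norm = nn.LayerNorm([num_classes, map_size, map_size])

    def forward(self, features):
        score_map = self.norm(self.deconv(features))  # (B, C, H, W) map
        logits = score_map.mean(dim=(2, 3))           # fixed global average pooling
        return score_map, logits

head = WeakSupHead()
features = torch.randn(1, 256, 26, 26)               # dummy backbone features
score_map, logits = head(features)
loss = nn.functional.cross_entropy(logits, torch.tensor([1]))  # scene-level loss only
```

Because the pooling has no weights, gradients from the scene label flow unimpeded into the map that is later used for grounding.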
For the NL implementation, the ability of transformer architectures to implement aspects of the NLP pipeline suggests a trained layer may succeed in increasing the expressivity of the language information for the task. Normally, a fully connected layer is appended to a transformer model to fine-tune it for a given classification task. Our architecture instead includes a convolutional 1D layer in the language mapping module, Fig. 1 (2), that allows access to all the transformer layers, rather than the linear layers on the averaged last four transformer layer outputs in the original implementation. Such a modification echoes the custom attention-on-layers mechanism from a concurrent study on automatic labeler work.
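A minimal version of this learned layer selection is sketched below, assuming BERT-base dimensions (12 layers, hidden size 768) and fixed, pre-pooled sentence embeddings per layer; the exact pooling and dimensions in the paper may differ:

```python
import torch
import torch.nn as nn

class LayerSelect(nn.Module):
    """Sketch of learned transformer layer selection: instead of averaging
    the last four layers, a 1D convolution over the stacked per-layer
    embeddings learns a weighting across all layers."""
    def __init__(self, num_layers=12, hidden=768):
        super().__init__()
        # layer axis as input channels; kernel size 1 mixes layers at
        # every hidden dimension independently
        self.mix = nn.Conv1d(num_layers, 1, kernel_size=1)

    def forward(self, layer_states):
        # layer_states: (B, num_layers, hidden) fixed sentence embeddings,
        # one per transformer layer (the encoder itself is not fine-tuned)
        return self.mix(layer_states).squeeze(1)  # (B, hidden)

select = LayerSelect()
states = torch.randn(2, 12, 768)  # dummy per-layer BERT embeddings
embedding = select(states)
```

The convolution's weights act as a learned, task-specific alternative to the fixed last-four-layer average.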
For bounding box generation, we move to a threshold detection method similar to , which differs from the threshold detection method in . The current generation method uses the tractable probability map output after the deconvolution-layer norm stage, close to the weak supervision signal, to select regions of high confidence given a neighborhood size. Specifically, a maximal filter is applied to the map probabilities as follows:
First, the map probabilities M are generated from the convolutional outputs C via softmax, followed by maximal filtering (with the threshold d as a hyperparameter set by a tree-structured Parzen estimator, as are all hyperparameters across methods) to generate the regions S; the centers of mass of S are then collected as bounding box centers. The original method used the confidence scores to assign a probability to a superimposed anchor grid. As the LITERATI output is at parity resolution with the image, the deconvolution and maximal filtering obviate the anchor grid and offset prediction mechanism.
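The filtering step above can be sketched with SciPy's `maximum_filter`, `label`, and `center_of_mass`; the neighborhood size and threshold below are illustrative placeholders for the tuned hyperparameters, and the map is assumed to be already softmaxed:

```python
import numpy as np
from scipy.ndimage import maximum_filter, label, center_of_mass

def map_to_centers(prob_map, d=0.5, size=5):
    """Sketch: threshold the probability map M at d, keep pixels that are
    local maxima within a `size` neighborhood, and return the centers of
    mass of the resulting regions S as bounding box centers."""
    peaks = (prob_map == maximum_filter(prob_map, size=size)) & (prob_map > d)
    regions, n = label(peaks)
    return center_of_mass(prob_map, regions, list(range(1, n + 1)))

prob_map = np.zeros((16, 16))
prob_map[4, 4] = 0.9    # one confident region
prob_map[12, 10] = 0.8  # a second region
centers = map_to_centers(prob_map)
```

Box extents around each center would then be set by the region geometry or a fixed prior, since no anchor grid or offset regression remains.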
The implementation in PyTorch allows for straightforward data parallelism via the nn.DataParallel class, which we found was key to training in a timely fashion. The LITERATI network was trained for 100 epochs (as per the original implementation) on an NVIDIA DGX-2 (16x 32 GB V100) with 16-way data parallelism using a batch size of 416 for approximately 12 hours. The dataset for the weakly supervised case was split in an 80/10/10 ratio for training, test, and validation, or 44627, 5577, and 5777 images respectively. The test set was checked infrequently for divergence from the validation set. The epoch with the best validation performance was selected for the visual grounding task and was observed to always perform better than the overfitted last epoch. For the supervised implementation of the original, there were not enough annotations and instead the default parameters from  were used.
In Tab. 1, experiments are compared across the ChestXray-14 dataset. The results show the gradient CAM implementation derived from [22, 9] outperforms the traditional CAM implementation, likely due to the higher resolution localization map. Meanwhile, the LITERATI with scene label method improves on the gradient CAM method most significantly at IOU=0.2, though lagging at IOU=0.4 and higher. Above both methods is the jointly trained supervised and unsupervised method from . The LITERATI method and the one-stage method were designed for visual grounding, and the detection performance of the supervised one-stage method with minimal modification is not strong at the dataset size available here: 215 ChestXray-14 annotations specifically for pneumonia and pneumothorax, as that is the supervision available.
| Method | IOU=0.1 | IOU=0.2 | IOU=0.3 | IOU=0.4 | IOU=0.5 |
|---|---|---|---|---|---|
| CAM WS | 0.505 | 0.290 | 0.150 | 0.075 | 0.030 |
| Multi-stage S + WS | 0.615 | 0.505 | 0.415 | 0.275 | 0.180 |
| Gradient CAM WS | 0.565 | 0.298 | 0.175 | 0.097 | 0.049 |
Supervised (S), weakly supervised (WS), and scene-level NLP (SWS) methods
While we include two relevant prior art baselines without language input, the present visual grounding task is more specifically contained by the referring expression label and at times further removed from the scene level label due to an annotation’s clinical importance determined by the radiologist. The data in Tab. 2 shows detection accuracy on the visual grounding task. The LITERATI method improves on the supervised method at IOU=0.1 and on gradient CAM at all IOUs.
| Method | IOU=0.1 | IOU=0.2 | IOU=0.3 | IOU=0.4 | IOU=0.5 |
|---|---|---|---|---|---|
| Gradient CAM WS | 0.316 | 0.104 | 0.049 | 0.005 | 0.001 |
Supervised (S), weakly supervised (WS), and NLP (NWS) methods
We present qualitative results in Fig. 2 that cover a range of annotations from disease-focussed (“large left pneumonia”) to more subtle features (“patchy right infrahilar opacity”). Of interest are the multi-instance annotations (e.g. “multifocal pneumonia” or “bibasilar consolidations”), which would typically fall out of scope for referring expressions but were included at the radiologist’s discretion in approximately 46 of the annotations. Considering the case of “bibasilar consolidations”, the ground truth annotation indicates two symmetrical boxes, one on each lung lobe. Such annotations are especially challenging for the one-stage method as it does not consider the multi-instance problem.
| Method | IOU=0.1 | IOU=0.2 | IOU=0.3 | IOU=0.4 | IOU=0.5 |
|---|---|---|---|---|---|
| Referring disease emphasis | 0.349 | 0.125 | 0.060 | 0.024 | 0.007 |
Weakly supervised experiments with varying language input, i.e. scene label (“pneumonia”), referring expression (“patchy right infrahilar opacity”), or referring expression with scene label disease (“large left pneumonia”).
The improvements from using detail ranging from the scene-level label to a full disease-related sentence show performance gains at high IOUs. Although the training NL phrasing differs in the ablation, at test time the phrasing is the same across methods. Since a pretrained fixed BERT encoder is used, the ablation probes the adaptability of the NL portion of the architecture to the task. Since the pretrained encoder is trained on large generic corpora, it likely retains the NLP pipeline components (named entity recognition, coreference disambiguation, etc.) necessary for the visual grounding task. In limited experiments (not shown), the benefit of fine-tuning appears to depend granularly on matching the domain-specific corpus to the task.
We present a weakly supervised vision language method and an associated clinical referring expressions dataset on pneumonia and pneumothorax chest x-ray images at scale. The clinical reports generate expressions that isolate discriminative objects inside the images. As parsing into referring expressions is accurate and mostly independent of vocabulary (i.e., it is tractable to identify a direct object without knowing exactly the meaning of the object), the referring phrases represent a valuable source of information during the learning process.
Though not necessarily motivated by learning processes in nature, algorithms incorporating NL bend toward explainable mechanisms, i.e., the localized image and language pairs form clear concepts. The explainable nature of visually grounded referring expressions in a clinical setting, while cogent here, merits further investigation along the lines of workflow performance. For pure NLP tasks, training on a data distribution closely matching the testing distribution has encountered success. An appropriately matching referring expressions dataset may draw from an ontology [25, 27] or from didactic literature.
The study suggests that vision language approaches may be valuable for accessing information within clinical reports.
2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1440–1448.
DenseCap: fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Automated labeling of bugs and tickets using attention-based mechanisms in recurrent neural networks. CoRR abs/1807.02892.
Learning deep features for discriminative localization. CoRR abs/1512.04150.