1 Introduction
Large-scale labelled datasets are one of the key ingredients in the rapidly developing domain of computer-aided diagnosis systems. In particular, deep learning, which has come to dominate the field of medical imaging just as it has taken over computer vision more generally, is a very data-hungry approach. The combination of large labelled datasets with deep learning techniques has resulted in state-of-the-art (SOTA) algorithms for many medical imaging tasks, including classification, detection, and segmentation.
A problematic aspect of this approach is the cost of labelling, particularly in localization tasks – either detection or segmentation. In detection, one wishes to find an object in an image by placing a bounding box around it; in the more fine-grained segmentation, one wishes to localize an object with pixel-level granularity. Generally, in order to train networks to perform either of these tasks in the standard fashion, one requires a fair number of annotated images whose labels mirror the desired output: bounding boxes in the case of detection, and pixel-level masks in the case of segmentation. The major problem is that collecting such annotations can be very expensive. Indeed, these annotations are much more expensive than their counterparts in the corresponding classification task, in which the labeller must simply specify a label for the image. This problem plagues the general computer vision problems of detection and segmentation, but is even worse in the case of medical imaging: the labeller must generally be a physician, resulting in a very costly labelling procedure.
Our goal in this paper is to learn to perform localization with considerably less annotation. In particular, we consider the following setting: relatively cheap whole-image (i.e. classification-style) labels are available for each image in the dataset, but only a very small number of examples have bounding box or segmentation mask labels. This sort of weakly supervised task has been studied before in the computer vision literature [babenko2008multiple]. In medical imaging, Li et al. [Li_2018_CVPR] proposed an approach in the spirit of multiple instance learning, and achieved SOTA results on the ChestX-ray14 dataset [8099852].
Despite the success of the approach in [Li_2018_CVPR], it has a number of shortcomings, both in terms of the underlying probabilistic model – which, for example, assumes patch independence – and in terms of the numerical problems that arise from the formulation. We propose a new technique in which these problems are resolved. In particular, our contributions are twofold. First, we propose a novel loss function, which is a continuous relaxation of a well-defined discrete formulation of weakly supervised learning, and which is numerically well-posed. Second, we propose a new architecture which accounts for both patch dependence and shift-invariance, through the inclusion of CRF layers and anti-aliasing filters, respectively. Using the new technique, we show SOTA results for the localization task on the ChestX-ray14 dataset.
The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the new technique, focusing in turn on the new loss function and the proposed architecture. Section 4 presents the experimental results, and Section 5 concludes the paper.
2 Related Work
As our results are presented on the ChestX-ray14 dataset of Wang et al. [8099852], we begin with a brief discussion thereof. The dataset is a collection of over 100K frontal-view X-ray images. Each image was automatically labelled with up to 14 different thoracic pathology classes, using natural language processing applied to the associated radiological reports; in addition, a small number of images with pathology were annotated with hand-labelled bounding boxes for a subset of 8 of the 14 diseases.
Simultaneously with the release of the dataset, Wang et al. [8099852] also presented the first benchmark for classification and localization using weakly-supervised CNN architectures pretrained on ImageNet. This benchmark used only the whole-image labels for training, ignoring the bounding box annotations. Following the release of the dataset and initial benchmark, several notable works proposed more sophisticated networks for more accurate classification or localization. Yao et al. [Yao2018LearningTD] leveraged the interdependencies among all 14 diseases using long short-term memory (LSTM) networks without pretraining, outperforming Wang et al. [8099852] on 13 of the 14 classes. Rajpurkar et al. [Rajpurkar2017CheXNetRP] proposed classifying multiple thoracic pathologies by transfer learning with fine-tuning, using a 121-layer Dense Convolutional Network (DenseNet) [Huang2016DenselyCC], yielding SOTA results for the classification task on all 14 diseases. Unlike both previous methods, which do not exploit any of the bounding box annotations, Li et al. [Li_2018_CVPR] took advantage of these annotations to simultaneously perform disease identification and localization through the same underlying model. Although their method did not attain the top classification results, it did achieve a new SOTA for localization. Subsequent works [Guan2018DiagnoseLA, 10.1007/9783030134693_88] have focused on improving the SOTA in classification; by contrast, our focus is localization rather than classification, so we shall not elaborate further on these results.
Another generally related area of interest is object detection. Current SOTA object detectors are of two types: two-stage and single-stage. The two-stage family is represented by the R-CNN framework [Girshick_2014_CVPR], comprised of a region proposal stage followed by the application of a classifier to each of the candidate proposals. This architecture, through a sequence of advances [girshick2015fast, ren2015faster, he2017mask], consistently achieves SOTA results on the challenging COCO benchmark [10.1007/9783319106021_48]. The initial single-stage detectors, such as YOLO [Redmon_2016_CVPR] and SSD [10.1007/9783319464480_2], exhibited greater runtime speed at the expense of some accuracy. More recently, Lin et al. proposed RetinaNet [Lin2017FocalLF], whose training is based on the "focal loss"; this network matches the speed of previous single-stage detectors while surpassing the accuracy of existing two-stage detectors. These detection approaches, however, are not aimed at settings such as ours, with a small number of annotated samples, and are often prone to low accuracy on such datasets.
Finally, we mention the multiple instance learning (MIL) literature [babenko2008multiple]. Examples of MIL in medical imaging include [yan2016multi, zhu2017deep, hou2016patch].
3 Localization with Limited Annotation
3.1 Problem Formulation
Setup: The Base Model As our starting point we take the approach of Li et al. [Li_2018_CVPR], which proposes a technique for the classification and localization of abnormalities in radiological images. This approach is very appealing in that it allows for localization to be achieved with a very limited number of bounding box annotations. We now give a brief summary of the technique.
The architecture used in [Li_2018_CVPR] is shown in Figure 2. A preact-ResNet network [10.1007/9783319464930_38], with the final classification layer and global pooling layer removed, is used as the backbone; this part of the architecture encodes the images into a set of feature maps. These feature maps are subsequently divided into a P × P grid of patches. Through the application of two convolutional layers (including batch normalization and ReLU activation), the number of channels is modified to K, where K is the number of possible disease types. A per-patch probability for each disease class is then derived by the application of a sigmoid function; this is denoted p_j^k, the probability that the j-th patch of the image belongs to class k. Note that a sigmoid function is applied, rather than a softmax, as a particular patch may belong to more than one disease.
As mentioned above, it is assumed that some images have bounding box annotations, while most do not. Let us define some terms: the image is x; for a disease k, the label y_k = 1 if the disease is present, and y_k = 0 otherwise; if disease k is annotated with a bounding box B_k, then η_k = 1, and η_k = 0 otherwise. The loss function can then be broken into two cases, according to whether a bounding box annotation is supplied. In the case in which there is a bounding box for a disease of class k, i.e. η_k = 1, the bounding box B_k consists of a subset of the patches. Then the annotated loss is taken to be ℓ_ann^k = −log p(B_k),
where p(B_k) denotes the probability that disease k lies within bounding box B_k of image x, and is given by

p(B_k) = ∏_{j ∈ B_k} p_j^k · ∏_{j ∈ B_k^c} (1 − p_j^k)    (1)
and B_k^c is the complement of the bounding box B_k. The above formula is simply the standard formula for combining independent patch probabilities. In the case in which no bounding box is supplied, i.e. η_k = 0, the unannotated loss is the cross-entropy ℓ_un^k = −[y_k log p_k + (1 − y_k) log(1 − p_k)], where

p_k = p(y_k = 1 | x) = 1 − ∏_j (1 − p_j^k)    (2)
The latter probability is simply one minus the probability that there are exactly zero patches with disease k, again assuming independence of patches. Finally, the overall loss per image is

L = λ Σ_k η_k ℓ_ann^k + Σ_k (1 − η_k) ℓ_un^k    (3)

where λ weights the annotated term.
We refer to this model – the architecture and the loss – as the base model. It was shown to attain SOTA performance in terms of localization on the NIH Chest X-ray dataset [8099852].
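To make the base model's probabilistic formulation concrete, the following sketch computes Equations (1)–(3) in NumPy for a single image and a single disease class. The grid size, patch probabilities, and value of λ here are illustrative assumptions, not the values used in [Li_2018_CVPR].

```python
import numpy as np

rng = np.random.default_rng(0)
P = 4                                     # toy patch grid (the real model uses a larger grid)
p = rng.uniform(0.05, 0.2, size=(P, P))   # per-patch probabilities p_j^k for one class k
p[1:3, 1:3] = 0.9                         # patches inside a hypothetical bounding box

bbox = np.zeros((P, P), dtype=bool)
bbox[1:3, 1:3] = True                     # bounding box mask B_k

# Eq. (1): p(B_k) = prod_{j in B} p_j * prod_{j not in B} (1 - p_j)
p_box = np.prod(p[bbox]) * np.prod(1.0 - p[~bbox])
loss_ann = -np.log(p_box)                 # annotated loss

# Eq. (2): p(y_k = 1 | x) = 1 - prod_j (1 - p_j)
p_pos = 1.0 - np.prod(1.0 - p)
y = 1                                     # disease present in this toy image
loss_un = -(y * np.log(p_pos) + (1 - y) * np.log(1.0 - p_pos))

# Eq. (3): per-image loss, with lambda weighting the (rarer) annotated term
lam = 5.0                                 # hypothetical value
loss = lam * loss_ann + loss_un
print(loss_ann, loss_un, loss)
```

Note how `p_box` is a product over every patch in the grid; the numerical consequences of this are discussed below.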
Issues with the Approach of Li et al. The probabilistic formulation of Li et al. is an appealing approach to localization tasks with a very limited number of bounding box annotations. Nevertheless, there are several issues with the technique that limit its performance:

Single Patch Positive Declaration: In Equation (2), only a single patch needs to be positive for a positive disease detection to be declared within the image. This criterion is prone to false positives: a single positive detection could easily be caused by noise. One would therefore prefer to require multiple positive patches for a declaration; we do away with this assumption through a novel loss function.

Numerical Issues: The paper refers to a particular numerical problem that results from the multiplication of many small numbers, as in Equation (1). This numerical underflow is fatal to the approach, and is circumvented in [Li_2018_CVPR] through a series of unjustified heuristics. We propose a formulation of the loss in which these issues never arise.

Patch Independence: In both Equations (1) and (2), the probabilities of each patch containing an object of a particular class (p_j^k) are treated as independent between patches. This is not correct in practice; moreover, we would like to integrate more elaborate terms that model contextual relationships between object/patch classes. We solve this through the use of a Conditional Random Field (CRF) model.

Lack of Shift-Invariance: As has been pointed out by Zhang [Zhang2019MakingCN], modern CNNs are not technically shift-invariant. In order to improve the performance of the localization, one can therefore address this issue through the addition of anti-aliasing filters prior to downsampling, as suggested in [Zhang2019MakingCN].
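The underflow in Issue (2) is easy to reproduce. In single precision – in which deep learning frameworks typically compute – an Equation (1)-style product collapses to zero after a few hundred small factors, after which the log-loss is no longer usable. A minimal demonstration with synthetic probabilities:

```python
import numpy as np

# 400 patches outside a bounding box, each contributing a small (1 - p_j) factor:
factors = np.full(400, 1e-3, dtype=np.float32)

prod = np.float32(1.0)
for f in factors:
    prod *= f                      # Eq. (1)-style multiplication in float32
print(prod)                        # underflows to exactly 0.0

with np.errstate(divide="ignore"):
    print(np.log(prod))            # -inf: the loss is undefined

# Working in the log domain avoids the underflow; our loss below avoids
# the product altogether:
log_prod = np.sum(np.log(factors.astype(np.float64)))
print(log_prod)                    # finite
```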
Summary of the Proposed Approach: Our technique is related to the base model, but is differentiated in two key ways:

The use of a novel loss function, which addresses many of the aforementioned issues.

Modifications to the architecture, specifically (a) the incorporation of conditional random field (CRF) layers and (b) the inclusion of anti-aliasing filters.
We now elaborate on each of these, in turn.
3.2 The New Loss Function
Notation: As described above, the output of the base model is a tensor of shape P × P × K; we continue to denote the output as p_j^k, the probability that the j-th patch of the image belongs to class k. This output is then fed into a series of layers which implement a CRF model, with the p_j^k representing the unary terms of the CRF. The output of the CRF is denoted q_j^k; we discuss the details of the CRF model in Section 3.3. For now, we may think of q_j^k as a sharper estimate of whether a particular disease k is present in patch j, and we write q^k for the length-P² vector of all patch values for a given disease k.
The Loss Function: First Pass
We first consider an annotated example with η_k = 1, i.e. one with a bounding box. As above, the bounding box B_k, consisting of a subset of the patches, contains an object of class k. In this case, the following loss function is natural:
ℓ_ann^k = 1[B_k contains no blob of at least T_k^+ patches j with q_j^k > 1/2] + 1[at least T_k^- patches j ∈ B_k^c have q_j^k > 1/2]    (4)

where 1[·] is the indicator function; T_k^+ and T_k^- are thresholds; and B_k^c is the complement of the bounding box B_k. A blob may be made precise as a connected component; however, this notion will not lead to a nice differentiable loss. We therefore make the following continuous relaxation of the above discrete formulation:
ℓ_ann^k = σ(T_k^+ − 1^T(b ⊙ q^k)) · σ(1^T((1 − b) ⊙ q^k) − T_k^-)    (5)

where b is now the indicator vector of the bounding box; ⊙ is the Hadamard product; 1 is the vector of all 1's; and σ is a sigmoid function, i.e. a smooth approximation to the indicator function. What this says is that there must be a total of T_k^+ patches within B_k which detect class k; the relaxation is that this total no longer has to lie in a single connected component. This is a reasonable relaxation, especially since the CRF already encourages smoothness. In addition, we require that there be fewer than T_k^- patches outside of B_k which detect class k.
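One plausible instantiation of this sigmoid relaxation can be sketched as follows; the temperature on the sigmoid, the thresholds, and the toy patch vector are illustrative assumptions, not values from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relaxed_annotated_loss(q, b, t_in, t_out, temp=5.0):
    """Sigmoid relaxation of the discrete annotated loss: penalize having
    fewer than t_in (soft) detections inside the box, and t_out or more
    outside it.  q: flat vector of CRF outputs in [0,1]; b: 0/1 box mask."""
    inside = np.sum(b * q)            # soft count of detections inside the box
    outside = np.sum((1 - b) * q)     # soft count outside
    # multiplicative ("fuzzy and") combination of the two constraint
    # violations -- the form later replaced by the additive ReLU version:
    return sigmoid(temp * (t_in - inside)) * sigmoid(temp * (outside - t_out))

q = np.array([0.9, 0.8, 0.9, 0.1, 0.05, 0.1, 0.2, 0.1])   # strong detections in the box
b = np.array([1, 1, 1, 0, 0, 0, 0, 0])                    # first three patches are the box
print(relaxed_annotated_loss(q, b, t_in=2.0, t_out=1.0))  # small: both constraints met
```

Note that when the argument of either sigmoid is large in magnitude, its gradient is nearly zero; this is precisely the vanishing-gradient behaviour discussed below.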
Note that this logic extends in a straightforward manner to the case of an unannotated example with η_k = 0, when no bounding box is specified. In the case of a positive label (i.e. an image in which disease k is present), the loss is simply

ℓ_+^k = σ(S_k^+ − 1^T q^k)    (6)
so that the threshold S_k^+ now has a meaning in absolute terms, i.e. the absolute number of patches rather than the number of patches relative to the size of a bounding box. In the case of a negative label – where disease k is absent – we have an equation analogous to (6):

ℓ_−^k = σ(1^T q^k − S_k^-)    (7)
where S_k^- is another threshold, whose meaning is again in terms of the absolute number of patches. Finally, we can combine Equations (6) and (7) to get

ℓ_un^k = y_k σ(S_k^+ − 1^T q^k) + (1 − y_k) σ(1^T q^k − S_k^-)    (8)
Addressing Issues (1) and (2). The above formulation addresses Issues (1) and (2) raised in Section 3.1. Regarding Issue (1), Equation (6) requires more than a single patch in order to make a positive declaration; the number of detected patches must be at least S_k^+, a per-class parameter which can be chosen. Regarding Issue (2), neither Equation (5) nor (8) involves the multiplication of many small values; indeed, both are well-posed from a numerical point of view.
Dealing with Vanishing Gradients: Due to the presence of sigmoid functions in Equations (5) and (8), in practice we experience vanishing gradients during training. We propose the following remedy, based on a different relaxation. We replace Equation (5) with

ℓ_ann^k = (1/|B_k|) ReLU(T_k^+ − 1^T(b ⊙ q^k)) + (1/|B_k^c|) ReLU(1^T((1 − b) ⊙ q^k) − T_k^-)    (9)
Note that there are three fundamental differences between Equations (5) and (9). First, we have replaced the sigmoid functions with ReLU functions; this still leads to minimal loss (in this case, zero) once the constraints are satisfied, but yields better behaved gradients. Second, we have replaced the multiplication with addition. Once sigmoids have been replaced by ReLUs, the notion of a "fuzzy and" relaxation is no longer relevant; in this case, addition makes more sense, and again leads to better behaved gradients. Finally, sigmoids are bounded between 0 and 1, whereas ReLUs can grow without bound; this necessitates the insertion of scaling factors of 1/|B_k| and 1/|B_k^c|, to ensure that the two terms in the sum are properly balanced.
Similarly, for unannotated samples we replace Equation (8) with:
ℓ_un^k = y_k ReLU(S_k^+ − 1^T q^k) + (1 − y_k) ReLU(1^T q^k − S_k^-)    (10)
The thresholds T_k^+, T_k^-, S_k^+, S_k^- can be treated as parameters of the network, to be optimized during training, or can be considered hyperparameters. Note that the losses described in Equations (9) and (10) do not suffer from any numerical issues: they are not products of many individual probabilities, but rather aggregate information across patches in a way that is numerically stable.
Balancing Factors and the Final Loss Function: There are two sources of data imbalance to account for. The first is the large imbalance of negative (non-diseased) vs. positive (diseased) examples in the data. To deal with this, we modify Equation (10) slightly, to read
ℓ_un^k = y_k ReLU(S_k^+ − 1^T q^k) + β_k (1 − y_k) ReLU(1^T q^k − S_k^-)    (11)
where β_k is the ratio of positive to negative examples in the data. The factor β_k de-emphasizes negative examples relative to positive examples, and follows the practice in [8099852].
The second form of imbalance we must account for is that between the annotated and unannotated examples; in practice, there are many more unannotated examples. However, this is already accounted for by the factor λ in Equation (3), which is set to a large value. Combining our own annotated and unannotated losses in Equations (9) and (11), respectively, in the manner of Equation (3), we arrive at the final form of the per-example loss:
L(x) = λ Σ_k η_k(x) ℓ_ann^k(x) + Σ_k (1 − η_k(x)) ℓ_un^k(x)    (12)
where the dependence of the variables on the image has been made explicit.
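Putting the pieces together, the per-example loss combines a ReLU-based annotated term for classes with boxes and a β-balanced unannotated term for the rest. The sketch below is self-contained; the thresholds, λ, β, and toy inputs are all hypothetical illustration values.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def annotated_term(q, b, t_in, t_out):
    # penalize too few soft detections inside box b, too many outside,
    # with each term scaled by the size of its region
    return (relu(t_in - np.sum(b * q)) / b.sum()
            + relu(np.sum((1 - b) * q) - t_out) / (1 - b).sum())

def unannotated_term(q, y, t_pos, t_neg, beta):
    # positive images need at least t_pos soft detections anywhere;
    # negative images at most t_neg; beta de-emphasizes the abundant negatives
    total = np.sum(q)
    return y * relu(t_pos - total) + beta * (1 - y) * relu(total - t_neg)

def per_example_loss(q, y, eta, boxes, lam=5.0, t_in=2.0, t_out=1.0,
                     t_pos=2.0, t_neg=1.0, beta=0.2):
    """q: (K, N) CRF outputs; y: (K,) labels; eta: (K,) box indicators;
    boxes: (K, N) 0/1 masks (rows with eta[k] == 0 are ignored)."""
    loss = 0.0
    for k in range(q.shape[0]):
        if eta[k]:
            loss += lam * annotated_term(q[k], boxes[k], t_in, t_out)
        else:
            loss += unannotated_term(q[k], y[k], t_pos, t_neg, beta)
    return loss

K, N = 3, 8
q = np.array([[0.9, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1],     # class 0: boxed, well localized
              [0.9, 0.9, 0.8, 0.2, 0.1, 0.1, 0.1, 0.1],     # class 1: positive, no box
              [0.05, 0.05, 0.1, 0.1, 0.05, 0.05, 0.1, 0.1]])  # class 2: negative
y = np.array([1, 1, 0])
eta = np.array([1, 0, 0])
boxes = np.zeros((K, N))
boxes[0, :3] = 1
print(per_example_loss(q, y, eta, boxes))
```

For these inputs all constraints are satisfied, so every term is exactly zero; flipping a label or corrupting the detections makes the corresponding term strictly positive.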
3.3 Architectural Modifications
CRF Model: As mentioned in Issue (3) in Section 3.1, we would like to do away with the assumption of patch independence; indeed, in our derivation of Equation (12) we made no use of such assumptions. However, to further bolster the dependence between neighbouring patches, we introduce a CRF model into our network. The CRF introduces, in an explicit manner, a spatial dependency between patches; its effect is to increase the confidence of a given patch's predicted label, and thereby to improve localization. There are several choices amongst neural-network-compatible CRFs; we choose the recent pixel-adaptive convolution (PAC) approach of Su et al. [su_2019_CVPR], due to its simplicity and excellent performance. We thus integrate the PAC-CRF modification into our base network and train the model end-to-end.
Given the patch probability outputs p_j^k of the base model, the unary potentials of the CRF are simply taken as the p_j^k. Thus, in the absence of neighbour dependence, the CRF will simply choose q_j^k = p_j^k. Neighbour dependence is introduced through pairwise potentials. As in [su_2019_CVPR], we take this potential to be K(f_i, f_j) μ(u_i − u_j), where the f_i are a set of learnable features on the grid; K is a fixed Gaussian kernel; u_i is the pixel coordinates of patch i; and μ is the inter-class compatibility function, which varies across different spatial offsets and is also learned. The pairwise connections are defined over a fixed window around each patch.
Our training regime for the CRF is as follows. First, we train the unary terms using the base model until convergence. Then, we freeze the base model and train only the PAC-CRF part. As our unary model outputs a P × P × K tensor, we add to the PAC-CRF model two 2D convolution layers, each followed by a rectified linear unit and batch normalization, so as to output a tensor of the same size.
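The flavour of the pairwise term can be illustrated with a toy, one-dimensional pixel-adaptive filtering step. This is a deliberate simplification of PAC-CRF, not the implementation of [su_2019_CVPR]: each patch's unary score is mixed with its neighbours' scores, weighted by a Gaussian kernel on per-patch features, so that only patches with similar features reinforce one another.

```python
import numpy as np

def pac_step(unary, feats, window=1, sigma=1.0):
    """One pixel-adaptive smoothing pass over a 1-D patch grid.
    unary: (N,) unary scores; feats: (N, D) per-patch features."""
    n = len(unary)
    out = np.empty(n)
    for i in range(n):
        num, den = 0.0, 0.0
        for j in range(max(0, i - window), min(n, i + window + 1)):
            # fixed Gaussian kernel on the feature difference
            w = np.exp(-np.sum((feats[i] - feats[j]) ** 2) / (2 * sigma ** 2))
            num += w * unary[j]
            den += w
        out[i] = num / den
    return out

# two regions with distinct features; the middle unary of the first is noisy
unary = np.array([0.9, 0.2, 0.9, 0.1, 0.1, 0.1])
feats = np.array([[0.0], [0.0], [0.0], [5.0], [5.0], [5.0]])
q = pac_step(unary, feats)
print(q)  # the noisy 0.2 is pulled up by its same-feature neighbours, while
          # patch 2 is barely affected by the feature-dissimilar patch 3
```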
Anti-Aliasing: An important property of any model whose goal is localization or segmentation is that the output of the model should be shift-invariant with regard to its input. However, as Zhang [Zhang2019MakingCN] has noted, standard CNNs use downsampling layers while ignoring the sampling theorem, and are therefore not shift-invariant; this is in spite of the fact that CNNs are commonly used as the backbone of many localization/segmentation systems. To circumvent this problem, an anti-aliasing filter is required prior to every downsampling step in the network. In particular, Zhang [Zhang2019MakingCN] proposed the insertion of a blur kernel as a low-pass filter prior to each downsampling step, and thereby demonstrated increased accuracy across several commonly used architectures and tasks. Following [Zhang2019MakingCN], we modify the backbone of our base model to integrate such low-pass filters into the preact-ResNet network [10.1007/9783319464930_38]. This effectively addresses Issue (4) in Section 3.1.
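The shift-sensitivity of naive downsampling, and the effect of blurring before subsampling, can be seen in a small one-dimensional experiment. This is a toy illustration of the idea in [Zhang2019MakingCN], not their implementation; the signal and filter are illustrative.

```python
import numpy as np

def subsample(x):
    return x[::2]

def blur_subsample(x):
    # [1, 2, 1]/4 low-pass filter applied before subsampling
    padded = np.pad(x, 1, mode="edge")
    blurred = (padded[:-2] + 2 * padded[1:-1] + padded[2:]) / 4.0
    return blurred[::2]

x = np.array([0., 0., 1., 1., 0., 0., 1., 1.])
x_shift = np.roll(x, 1)                       # the same signal, shifted by one sample

print(subsample(x), subsample(x_shift))       # naive: the two outputs disagree everywhere
print(blur_subsample(x), blur_subsample(x_shift))  # anti-aliased: outputs are closer
```

The blurred outputs still differ (exact shift-invariance across a stride-2 step is impossible), but the disagreement shrinks substantially, which is the point of the fix.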
4 Results
Dataset: As in [Li_2018_CVPR], we have evaluated our model on the NIH Chest X-ray dataset [8099852]. The dataset consists of 112,120 frontal-view X-ray images with 14 disease labels (each image can have multiple labels); images may also carry no label at all. Of the more than 100K images, only 880 come with bounding box annotations; some images have more than one such box, so there are a total of 984 labelled bounding boxes. The remaining 111,240 images have no bounding box annotations, but do possess class labels. The images are 1024 × 1024; we have resized them for faster processing, and have also normalized the intensity range.
The 984 bounding box annotations are given for only 8 of the 14 disease types. Despite this fact, we have found that we obtain superior results by continuing to train on the full complement of 14 disease types, which seems to imply some interesting interdependence amongst the disease classes.
Evaluation Metrics: Although we train over both annotated and unannotated examples, we are interested in localization accuracy, so we evaluate solely over annotated samples. We use Intersection-over-Union (IoU) and Intersection-over-Region (IoR) to measure localization accuracy. A patch j is taken to be positive for disease k, i.e. the disease is declared present in patch j, if q_j^k > 0.5. The union of all positive patches is the detected region. The IoU and IoR can then be computed between the ground-truth bounding box and the detected region. A localization is taken to be correct if the IoU or IoR exceeds a given threshold; we follow the practice of [Li_2018_CVPR] in the choice of thresholds. Performance statistics are then computed over a 5-fold cross-validation of the annotated samples, as is also done in [Li_2018_CVPR].
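Concretely, with the detected region represented as a boolean patch mask (thresholded at 0.5) and the ground-truth box as another mask, the two metrics might be computed as follows. This is a sketch with illustrative shapes and values, and it reads IoR as intersection over the detected region.

```python
import numpy as np

def iou_ior(detected, gt_box):
    """detected, gt_box: boolean masks over the patch grid.
    IoU = |D ∩ G| / |D ∪ G|; IoR = |D ∩ G| / |D|."""
    inter = np.logical_and(detected, gt_box).sum()
    union = np.logical_or(detected, gt_box).sum()
    iou = inter / union if union else 0.0
    ior = inter / detected.sum() if detected.sum() else 0.0
    return iou, ior

q = np.zeros((8, 8))
q[2:5, 2:5] = 0.9                 # predicted patch probabilities for one class
detected = q > 0.5                # positive patches form the detected region
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True               # ground-truth bounding box

iou, ior = iou_ior(detected, gt)
print(iou, ior)                   # 9/16 and 9/9: the detection sits inside the box
```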
Table 1: Localization accuracy, measured by IoU (top) and IoR (bottom). "un[%]" is the percentage of unannotated samples added to training; "ref." is the method of Li et al. [Li_2018_CVPR].

IoU accuracy
un[%]  model  Atelectasis  Cardiomegaly  Effusion  Infiltration  Mass   Nodule  Pneumonia  Pneumothorax
0%     ours   0.818        1.000         0.882     0.927         0.695  0.404   0.918      0.726
0%     ref.   0.488        0.989         0.693     0.842         0.342  0.081   0.715      0.437
20%    ours   0.779        1.000         0.843     0.945         0.709  0.444   0.915      0.732
20%    ref.   0.687        0.978         0.831     0.900         0.634  0.241   0.568      0.576

IoR accuracy
un[%]  model  Atelectasis  Cardiomegaly  Effusion  Infiltration  Mass   Nodule  Pneumonia  Pneumothorax
0%     ours   0.889        1.000         0.920     0.950         0.773  0.580   0.933      0.767
0%     ref.   0.528        1.000         0.753     0.875         0.452  0.111   0.786      0.473
20%    ours   0.844        1.000         0.896     0.967         0.808  0.520   0.935      0.806
20%    ref.   0.724        0.991         0.874     0.921         0.674  0.271   0.644      0.624
Training Details: We use ResNet50 as the backbone of our model. We initialize the weights from ImageNet [deng2009imagenet] pretraining, and then allow them to evolve as training proceeds. The feature maps are divided into a P × P grid of patches. We use the ADAM optimizer [kingma2014adam] with weight decay regularization and an exponentially decaying learning rate. We take the batch size to be 48.
It is natural to treat the thresholds as parameters of the network, to be optimized during training. Unfortunately, this quickly leads to a degenerate solution for the thresholds. To avoid reaching the degenerate solution, we employ the following procedure: we begin by freezing the thresholds and training only the network weights; we then freeze the network weights and find the optimal thresholds; and we continue to alternate this procedure until convergence.
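The freeze-and-alternate mechanics can be sketched schematically as follows, with a toy differentiable objective standing in for the network loss; the objective, step sizes, and iteration counts are illustrative and not specific to our model.

```python
def alternate(w, t, rounds=20, lr=0.3):
    """Alternating gradient descent on f(w, t) = (w - t)^2 + 0.1 * t^2,
    where w plays the role of the network weights and t of the thresholds.
    Each round optimizes one group while the other is frozen."""
    for _ in range(rounds):
        for _ in range(10):                      # phase 1: thresholds frozen
            w -= lr * 2 * (w - t)                # descend df/dw
        for _ in range(10):                      # phase 2: weights frozen
            t -= lr * (-2 * (w - t) + 0.2 * t)   # descend df/dt
    return w, t

w, t = alternate(w=5.0, t=3.0)
print(w, t)   # the two groups track each other as the alternation converges
```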
Experiments: We compare our results with those of Li et al. [Li_2018_CVPR], which represent the SOTA for the localization task on the NIH Chest X-ray dataset. As mentioned above, we use 5-fold cross-validation. We examine two separate settings: (a) the model is trained using only 80% of the annotated samples, the 80% representing the training part of the fold; (b) the model is trained using the same 80% of the annotated samples, as well as 20% of the unannotated samples – about 20K samples, selected randomly. In both cases, the results are evaluated on the remaining 20% of the annotated samples in each fold.
In Table 1 we present two versions of the localization accuracy, the top based on IoU and the bottom based on IoR. Our method outperforms that of Li et al. for all 8 disease classes, for both IoU and IoR, and in both settings: with no extra unannotated data added, and with 20% unannotated data added. Examining the IoU data more closely, we see several patterns. First, we perform considerably better than [Li_2018_CVPR] when no unannotated data is added; for example, the accuracy on Atelectasis and Mass is nearly double that of [Li_2018_CVPR], while the performance on Nodule is five times better. Second, with the addition of unannotated data, the gap narrows – for example, Nodule is now slightly less than double, and many other disease classes show a considerably smaller gap – but the gap is still present for every disease class. We hypothesize that our improvement is smaller in most cases simply because the algorithm trained with no unannotated data already attains fairly high performance; thus, the marginal benefit of adding the unannotated data is smaller. Third, in examining our own results, we see that the addition of unannotated data often helps, but does not always do so. In particular, there is an increase in localization accuracy for four of the eight diseases – Infiltration, Mass, Nodule, Pneumothorax – while two diseases, Cardiomegaly and Pneumonia, undergo little or no change. The remaining two, Atelectasis and Effusion, actually suffer a decrease in accuracy with the addition of the extra unannotated data. In examining the data, the solution to this puzzle becomes apparent: Atelectasis and Effusion have the largest number of annotated examples of the eight disease classes. This explains why they have quite high localization accuracies to begin with, when no unannotated data has been added (0.818 and 0.882, respectively), and why the addition of extra unannotated examples does not help.
(It is interesting to note that Pneumonia has a high accuracy and a decent number of annotated examples; the reason it differs from Atelectasis and Effusion is that it also has a relatively small number of unannotated examples, so that adding them does not affect its performance too much.)
We note that the IoR data is fairly similar to the IoU data, and most of the observations above hold in this case as well. Qualitative examples of the localizations derived are shown in Figure 1.
5 Conclusions
We have presented a new technique for localization with limited annotation. Our method is based on a novel loss function, which is mathematically and numerically well-posed, and an architecture which explicitly accounts for patch non-independence and shift-invariance. We present SOTA results for localization on the ChestX-ray14 dataset. Future work will focus on applying these ideas to the realm of semantic segmentation.