Recent advances in deep learning have achieved remarkable performance in various challenging computer vision tasks. Especially in object localization, deep convolutional neural networks outperform traditional approaches by extracting data/task-driven features instead of hand-crafted features. Although location information of regions of interest (ROIs) gives a good prior for object localization, it requires heavy annotation effort from human resources. Thus a weakly supervised framework for object localization is introduced. The term "weakly" means that this framework only uses image-level labeled datasets to train a network. With the help of transfer learning, which adopts weight parameters of a pre-trained network, the weakly supervised learning framework for object localization performs well because the pre-trained network already has well-trained class-specific features. However, those approaches cannot be used for applications which do not have pre-trained networks or well-localized large scale images. Medical image analysis is representative of those applications because it is impossible to obtain such pre-trained networks. In this work, we present a "fully" weakly supervised framework for object localization ("semi"-weakly is the counterpart which uses pre-trained filters for weakly supervised localization) named self-transfer learning (STL). It jointly optimizes both classification and localization networks simultaneously. By controlling the supervision level of the localization network, STL helps the localization network focus on correct ROIs without any type of prior. We evaluate the proposed STL framework using two medical image datasets, chest X-rays and mammograms, and achieve significantly better localization performance compared to previous weakly supervised approaches.
Recently, deep convolutional neural networks (CNN) have shown promising performance in various computer vision tasks such as object classification [14, 22], localization (or detection) [19, 4], segmentation [6, 7], video classification, and pose estimation. A CNN hierarchically builds up high-level semantic concepts from low-level visual features in a layer-by-layer manner based on convolution kernels which convolve pixels on local receptive fields. Among those tasks, object localization (or detection) is one of the fundamental problems in this research field. In object localization tasks, regions of interest (ROIs), the most discriminative regions in terms of semantic concepts, should be properly defined for each given image. A lot of training images with annotated bounding boxes or segmentation maps of ROIs are required in order to achieve good performance in object localization, since that information gives a strong prior for finding exact ROIs on test images [19, 4, 6, 7]. However, a dataset with such location information is hard to obtain because it requires heavy annotation effort.
Weakly supervised learning for localization only uses a weakly labeled (i.e. image-level label) dataset, which does not have any location information, to localize objects in an image. In terms of finding common semantic features within a set of images having the same class label, this can be interpreted as a variant of multiple instance learning (MIL) [1, 15, 5].
Several previous works on CNN-based weakly supervised object localization have reported good performance with the help of transfer learning from pre-trained networks [16, 28, 18]. Those approaches, however, require base networks pre-trained on relatively well-localized datasets (e.g., ILSVRC classification dataset), which are able to extract discriminative regions appropriately from semantically similar datasets (e.g., VOC) while providing a good initial seed for localization. In other words, they fine-tune good initial feature maps extracted from pre-trained networks with respect to the objectives of localization tasks. Weakly supervised localization methods which rely on those base networks cannot be used in applications which do not have enough well-localized images. Medical image analysis is a representative example because it is impossible to obtain such pre-trained networks. Furthermore, it is not feasible to use base networks pre-trained on general images such as the ILSVRC or VOC datasets, since ROI characteristics of medical images are thoroughly different from those of general images.
In this work, we propose a self-transfer learning (STL) framework for fully weakly supervised localization. STL co-optimizes both classification and localization networks simultaneously in order to guide the localization network with the most discriminative features in terms of the classification task (see Figure 1). The term fully means that the proposed method requires neither location information nor any type of pre-trained network in the training stage, and the term self stands for weight sharing between classification and localization networks.
Our contributions can be summarized as follows:
We develop a fully weakly supervised learning framework based on CNN, a self-transfer learning, which enables accurate ROI localization given only the image-level labeled dataset without any pre-trained model.
We show that a weakly supervised localization based on CNN without good initial weights is not effective by itself since errors are backpropagated through a restricted path or with insufficient information.
We conduct computational experiments on medical applications, one of the most important areas in computer vision. We use chest X-rays and mammograms to show the localization performance of the proposed STL. It is shown that STL helps the localization network find a good local optimum.
The remainder of this paper is organized as follows. Section 2 presents previous works for the weakly supervised learning for object localization. In Section 3, the proposed STL framework is described in detail including its architecture and training scenarios. Section 4 shows experimental setup and results, and finally Section 5 concludes this paper.
There exist many studies to develop learning algorithms for object localization based on the weakly labeled dataset. Most previous works can be interpreted as the same framework, which use candidate regions extracted from an image and then select the correct localization among those regions [26, 17, 25, 20]. In this work, recent methods based on CNN are considered since they have shown a promising performance on weakly supervised object localization [16, 28, 18].
In a weakly supervised object localization task, we should find common features within a set of intra-class images, discriminate those intra-class features with each other, and define the most probable region in terms of target class. Transfer learning based on well pre-trained networks (preliminarily trained on different-but-similar datasets) helps to perform those challenging tasks, since pre-trained CNN properly defines ROIs based on discriminative convolutional filters already learned from semantically similar datasets[16, 28, 18].
In one such approach, the convolutional layers are pre-trained on the ImageNet dataset [3], so the network appropriately extracts class-specific features from semantically similar datasets. From those feature maps, adaptation layers build up as many score maps as there are classes. Adaptation layers consist of additional convolutional and global max pooling layers. The convolutional layer in the adaptation layers generates per-class score maps, and the most probable positions (with the highest activation) with respect to each class are pooled at each iteration in order to compute and backpropagate errors in a training phase. Since all the layers in this architecture are convolutional or pooling layers, rescaled input images generate score maps of corresponding sizes. Each score map stands for the confidence level of existence of each object, i.e. per-class localization maps.
In another work, the authors train a deep multiple instance learning framework that jointly learns object proposals and keywords. They exploit parameters pre-trained on the ImageNet dataset for the object proposal network by regarding the object proposals of each image as a positive bag in the MIL framework. This assumption is quite reasonable, since the pre-trained network extracts class-specific feature representations properly. Given an image, the class probabilities for all object proposals are pooled at each iteration in order to be compared with the true label vector while propagating errors backward. In an inference stage, multiple object proposals with appropriate keywords enable image-level auto-annotation.
Inferring a segmentation map is more challenging than object localization, since the class label must be inferred for each pixel. In a related work, the authors propose a weakly supervised segmentation framework which uses datasets with only image-level labels in the training phase. A convolutional network pre-trained on the ImageNet dataset is used to extract feature maps from an input image. The N extracted feature maps (one per class) are aggregated into a single N-dimensional vector in the proposed aggregation layer to be compared with the true label. Compared to the global max pooling described above, the proposed aggregation layer adopts a Log-Sum-Exponential pooling method, which is a smooth version of max pooling, in order to fairly explore the entire feature maps. In an inference stage, the N feature maps extracted from an input image are used as initial maps for segmentation. By adding some segmentation priors (image-level prior and smoothing prior), they achieve good segmentation results compared to previous works on weakly supervised object segmentation.
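The smooth-max behaviour of Log-Sum-Exponential pooling can be illustrated with a minimal sketch. The normalized form used here and the sharpness parameter `r` are assumptions chosen for illustration, not details taken from the cited work: a large `r` approaches global max pooling, while a small `r` approaches global average pooling.

```python
import math


def log_sum_exp_pool(score_map, r=5.0):
    """Smooth approximation of global max pooling over a 2-D score map.

    r is a hypothetical sharpness parameter (an assumption of this sketch):
    large r behaves like the max, small r behaves like the mean.
    """
    values = [v for row in score_map for v in row]
    m = max(values)  # subtract the max before exponentiating for stability
    mean_exp = sum(math.exp(r * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / r


score_map = [[0.1, 0.2], [0.3, 2.0]]
near_max = log_sum_exp_pool(score_map, r=50.0)   # close to max(score_map)
near_mean = log_sum_exp_pool(score_map, r=0.01)  # close to mean(score_map)
```

Because every location contributes to the pooled value, every location also receives a non-zero gradient, which is the sense in which the pooling explores the entire feature map.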
Those weakly supervised approaches for object localization or segmentation are very helpful in general image domain because they do not require heavy annotation efforts as mentioned in the previous section. Strictly speaking, however, those are semi-weakly instead of fully weakly supervised framework, since they essentially require base networks pre-trained on different-but-similar datasets. Such networks pre-trained on well-localized datasets (e.g., ILSVRC classification dataset) can extract discriminative regions appropriately from different-but-similar datasets (e.g., VOC). Good initial feature maps extracted from pre-trained networks can be easily fine-tuned with respect to the objectives of localization tasks. Those previous works for semi-weakly supervised localization cannot be used in the applications which do not have base networks pre-trained on semantically similar datasets.
In this section, we present our proposed STL framework for weakly supervised learning. STL consists of three main components: shared convolutional layers, fully connected layers (i.e. classifier), and localization layers (i.e. localizer) (see Figure 1). The key features of STL are twofold. First, it simultaneously propagates errors backward from both classifier and localizer to prevent the localizer from wandering the loss surface in search of a local optimum. Second, an adjustable hyperparameter is introduced to control the relative importance between classifier and localizer.
For the task of image classification, CNN works well by virtue of its ability to extract useful features which discriminate the classes. As we pointed out in Section 2, all previous works based on CNN utilize those features which already have good representation capability.
The common strategy for weakly supervised localization based on CNN is to produce activation maps (in other words, score maps) for each class, and select or extract some representative activation value. Specifically, in case of an $N$-class classification problem, the network should give an $N$-dimensional class probability vector as an output to calculate errors using the true label vector. The intuitive way to make such an output vector is to extract or select a per-class representative activation from $N$-channel activation maps. The dimensions of those maps are automatically determined by the network architecture. For example, if the fully connected layers of a classification network are replaced with convolutional layers and a 512×512 input image is fed into that network, whose global receptive field is 224×224, we obtain 10×10 activation maps since the global stride of the network is 32. If such a network is trained well, it is expected that a target object can be easily localized by examining the activation maps corresponding to its class.
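The activation map size in this example follows from simple arithmetic on the input size, the global receptive field, and the global stride. A minimal sketch, where the function name and default arguments are illustrative assumptions:

```python
def activation_map_size(input_size, receptive_field=224, stride=32):
    """Side length of the activation map of a fully convolutional network.

    The receptive field window slides over the input with the global stride,
    so the number of positions along one axis is
    (input_size - receptive_field) // stride + 1.
    """
    if input_size < receptive_field:
        raise ValueError("input must be at least as large as the receptive field")
    return (input_size - receptive_field) // stride + 1


side = activation_map_size(512)  # (512 - 224) // 32 + 1 = 10
```

An input exactly the size of the receptive field gives a single activation, which recovers the ordinary classification setting.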
To select or extract the representative activations for each class, typical pooling methods can be effectively used. In previous work, a global max pooling method is used, and its classification and localization performance is verified in the domain of general images. Another choice is a global average pooling method. It might be more effective if there are some classes which have no ROIs characterizing the image-level label. Such a class might be a background class in the general image domain or the normal class in the medical images considered in this work.
Those approaches can be interpreted as a variant of multiple instance learning (MIL), which is designed for classification where labels are associated with sets of instances, called bags, instead of individual instances. In image classification tasks, the full size image and its subsampled patches are considered as a bag and instances, respectively. For instance, if we use a global max pooling to select a representative value among activations of patches, it is equivalent to using a single well-classified patch for building the decision boundary.
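The two pooling choices can be compared on a toy example. The sketch below is only an illustration under assumed values: the score maps are made up, and the helper names are hypothetical. Each per-class map is reduced to one score, and a softmax turns the scores into class probabilities.

```python
import math


def global_max_pool(score_map):
    """Keep only the single highest activation (the best patch in MIL terms)."""
    return max(v for row in score_map for v in row)


def global_avg_pool(score_map):
    """Average over all locations, so every patch contributes equally."""
    values = [v for row in score_map for v in row]
    return sum(values) / len(values)


def softmax(scores):
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def localizer_probs(score_maps, pool):
    """One pooled score per class map, then a softmax over classes."""
    return softmax([pool(m) for m in score_maps])


# Illustrative 2x2 maps for two classes: index 0 = normal, 1 = abnormal.
score_maps = [
    [[0.0, 0.1], [0.2, 0.1]],
    [[0.0, 3.0], [0.1, 0.0]],
]
p_max = localizer_probs(score_maps, global_max_pool)
p_avg = localizer_probs(score_maps, global_avg_pool)
```

With a single strong local response, max pooling yields a more confident abnormal probability than average pooling, since the average dilutes the peak over the whole map.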
The proposed STL framework is basically based on a joint learning of classifier and localizer. For successful training of localizer, the initial values of network weights (i.e. filters) are very important because the learned filters can guide the localizer. Without such good filters, it is hard to find a good local optimum since the localizer consistently concentrates on the subsampled region of original image whether it is a correct ROI or not.
To overcome such a limitation, classifier and localizer are trained simultaneously based on a weighting strategy in our framework. Figure 1 illustrates a systematic view of STL. In detail, it consists of shared convolutional layers $C_s$, classification layers $F_c$, and localization layers $C_l$. Two losses, $\mathrm{Loss}_c$ from classifier and $\mathrm{Loss}_l$ from localizer, are computed at the forward pass, and the weighted sum of those errors is propagated at the backward pass. The errors from classifier contribute to training the filters in an overall view, while those from localizer are backpropagated through the subsampled region which is the most important window for classifying the training set. At the early stage of the training phase, the errors from classifier should be weighted more than those from localizer to prevent the localizer from falling into a bad local optimum. By reducing the effect of errors from localizer, good filters which have discriminative power can be well trained even though localizer fails to find objects associated with the class label. As training proceeds, the weight for localizer increases to focus on the subsampled region of the input image. At this stage, the network's filters are fine-tuned for the task of localization.
Consider a data set of $M$ input-target pairs $\{(x_i, t_i)\}_{i=1}^{M}$, where $x_i$ and $t_i$ denote the $i$-th image and the corresponding $N$-dimensional true label vector, respectively, and $N$ represents the number of classes. Note that if $x_i$ contains a single object, $t_i$ will be a one-hot coded vector. Assuming an image with a single class label, our objective function to be optimized is a weighted sum of cross-entropy losses from classifier and localizer, which can be defined as follows:

$$L = -\sum_{i=1}^{M} \left[ (1-\alpha)\, t_i^{\top} \log y_i^{c} + \alpha\, t_i^{\top} \log y_i^{l} \right] \qquad (1)$$

where $y_i^{c}$ and $y_i^{l}$ are $N$-dimensional class probability vectors from classifier and localizer, respectively, for the $i$-th input, $\alpha \in [0, 1]$ is the hyperparameter controlling the relative importance of localizer, and $\log$ denotes an element-wise log operation. Note that the loss function can be extended to deal with the case in which a single image has multiple labels.
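The weighted objective for a single training image can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the small `eps` added inside the logarithm, and the example probabilities are assumptions, and `alpha` is taken to be the weight on the localizer term.

```python
import math


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a label vector and a probability vector."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


def stl_loss(target, y_cls, y_loc, alpha):
    """Weighted sum of classifier and localizer cross-entropy losses.

    alpha (assumed to lie in [0, 1]) is the weight of the localizer term;
    the classifier term receives the remaining 1 - alpha.
    """
    return ((1.0 - alpha) * cross_entropy(target, y_cls)
            + alpha * cross_entropy(target, y_loc))


t = [0.0, 1.0]       # one-hot label: abnormal
y_cls = [0.2, 0.8]   # illustrative classifier output
y_loc = [0.6, 0.4]   # illustrative localizer output (still poor early on)
early = stl_loss(t, y_cls, y_loc, alpha=0.1)  # dominated by the classifier
late = stl_loss(t, y_cls, y_loc, alpha=0.9)   # dominated by the localizer
```

With a small `alpha`, a poorly performing localizer contributes little to the total, so the shared filters are driven mainly by the classifier.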
The effect of the proposed STL can be explained by examining the backpropagation process at the end of the shared convolutional layers $C_s$, which is depicted in Figure 2. In this figure, node $i$ represents a particular node in the last layer of $C_s$, which is connected with nodes $j$ in the fully connected layer $F_c$ of the classifier and nodes $k$ in the convolutional layer $C_l$ of the localizer. Note that the layer $C_l$ is obtained by 1×1 convolution on the layer $C_s$ as shown in Figure 1, and the backpropagated error at node $i$ can be written as follows:

$$\delta_i = h'(a_i) \left( \sum_{j \in F_c} w_{ji}\, \delta_j^{c} + \sum_{k \in C_l} w_{ki}\, \delta_k^{l} \right) \qquad (2)$$

where $a_i$ is the pre-activation of node $i$, $h$ is the activation function, $w_{ji}$ and $w_{ki}$ are the weights connecting node $i$ to nodes $j$ and $k$, and $\delta_j^{c}$ and $\delta_k^{l}$ are the errors at those nodes. It should be noted that the relative importance between classifier and localizer is already reflected in the errors $\delta_j^{c}$ and $\delta_k^{l}$ through the weighted loss function defined in Equation 1.

It can be seen that the errors are backpropagated undesirably without $\delta_j^{c}$ due to the special treatment, a global pooling, applied to the activation maps in the layer $C_l$. For instance, if a global max pooling is used to aggregate the activations within each activation map and the location corresponding to node $i$ in the layer $C_s$ is not selected as the maximum, all $\delta_k^{l}$'s to be backpropagated from the layer $C_l$ will be zero. Therefore, the computed errors of most nodes in the layer $C_s$, except for the nodes whose locations correspond to the maximal responses of each activation map, will be zero. In case of a global average pooling, zero errors will merely be replaced with a mean of errors. This situation is certainly not desirable, especially when we train the network from the bottom up (i.e. without pre-trained filters). By incorporating classifier into the network architecture, the shared convolutional layers $C_s$ can be consistently improved even if the backpropagated errors from localizer do not contribute to learning useful features.
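The restricted error path through a global pooling layer can be made concrete with a small sketch. The functions below are illustrative assumptions, not part of the original work: they return the gradient that each location of one activation map receives from the pooled output.

```python
def global_max_pool_grad(score_map, upstream=1.0):
    """Gradient of global max pooling: only the arg-max location is non-zero."""
    rows, cols = len(score_map), len(score_map[0])
    best_r, best_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: score_map[rc[0]][rc[1]],
    )
    grad = [[0.0] * cols for _ in range(rows)]
    grad[best_r][best_c] = upstream
    return grad


def global_avg_pool_grad(score_map, upstream=1.0):
    """Gradient of global average pooling: the same share at every location."""
    rows, cols = len(score_map), len(score_map[0])
    share = upstream / (rows * cols)
    return [[share] * cols for _ in range(rows)]


score_map = [[0.1, 0.5, 0.2], [0.9, 0.3, 0.4]]
g_max = global_max_pool_grad(score_map)  # [[0, 0, 0], [1, 0, 0]]
g_avg = global_avg_pool_grad(score_map)  # 1/6 everywhere
```

In the max case only one location carries any error, and in the average case every location carries the same uninformative error, which is why an additional error signal from the classifier helps the shared layers.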
The proposed STL framework is similar to the well-known multi-task learning (MTL) [24, 27, 29]. MTL improves the performance of learning algorithms by learning classifiers for multiple related tasks jointly using a shared representation. We show that such an MTL framework works well even if the network jointly learns exactly the same tasks. From this point of view, the proposed framework is more appropriately called multi-purpose learning, since its two branches have exactly the same task (i.e. classification with the same loss function) but totally different purposes, one for classification and the other for localization.
Table 1: Classification and localization results on the Shenzhen and MC test sets.
In this section we use two medical image datasets, chest X-rays (CXRs) and mammograms, to evaluate the classification and localization performances of the proposed STL. CXRs and mammograms are very effective and frequently used for screening at the early stage of diagnosis process for tuberculosis and breast cancer. As mentioned in Section 1, such medical images generally do not have any additional information for localization (e.g., bounding boxes and/or segmentation maps for ROIs) except for image-level labels (e.g., normal, abnormal). However, it is very important not only to predict the precise image-level diagnosis result, but to provide finely localized ROIs for understanding of abnormalities.
As abnormal ROIs usually occupy only a small portion of the entire image, the network should be trained with images whose resolution is high enough to capture the ROIs in its global receptive field. Therefore, all training CXRs and mammograms are resized to 500×500. The network architecture used in this experiment is slightly modified from an existing network architecture. We add one convolutional layer (i.e. the 6th convolutional layer) since the resolution of the input image is relatively high compared to images for general object classification tasks. Also, we set the number of hidden nodes in the fully connected layers to 2048. For localizer, a 1×1 convolution is performed on the added convolutional layer, and therefore 15×15 activation maps are obtained.
In this experiment, the number of activation maps is two since we are dealing with classification of two classes, normal and abnormal. To verify the effectiveness of STL, two typical pooling methods, max and average poolings, are applied globally to the activation maps, and their performance improvements with STL are examined. As depicted in Figure 1, the softmax loss function is used for both classifier and localizer.
All the experiments in this work are performed using Caffe. Each training set is randomly divided into 80% for training and 20% for validation, and the final model with the best validation accuracy is selected for performance evaluation. The learning rate starts from an initial value and is decreased by a factor of 2 every 30 epochs. The network is trained via stochastic gradient descent with momentum 0.9, and the minibatch size is set to 64. The weight decay parameter is determined by a grid search through the comparison of validation accuracy. There is an additional hyperparameter $\alpha$ in STL which determines the level of importance between classifier and localizer. We set its initial value to 0.1 so that the network focuses more on learning the representative features at the early stage, and it is increased to 0.9 after 60 epochs to fine-tune the localizer.
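The two schedules described here can be sketched as plain functions of the epoch index. Several details are assumptions of this sketch: the base learning rate of 0.01 is a placeholder (the value used in the experiments is not given here), the localizer weight is modelled as a single step change at epoch 60, and the function names are hypothetical.

```python
def learning_rate(epoch, base_lr, step=30, factor=2.0):
    """Step decay: divide the learning rate by `factor` every `step` epochs."""
    return base_lr / (factor ** (epoch // step))


def localizer_weight(epoch, switch_epoch=60, early=0.1, late=0.9):
    """Weight of the localizer loss, modelled as one step from 0.1 to 0.9."""
    return early if epoch < switch_epoch else late


base_lr = 0.01  # placeholder value, not taken from the experiments
schedule = [
    (epoch, learning_rate(epoch, base_lr), localizer_weight(epoch))
    for epoch in (0, 29, 30, 59, 60, 90)
]
```

A gradual ramp of the localizer weight would be an equally plausible reading of the description; the step form is used only because it is the simplest.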
To compare the classification performance, the area under the receiver operating characteristic curve (AUC), accuracy, and average precision (AP) of each class are used. For STL, class probabilities obtained from localizer are used for measuring performance. For the localization task, a performance metric similar to that of previous work is used. It is based on AP, but the difference is the way true positives and false positives are counted. In classification, a test image is a true positive if its class probability exceeds some threshold. Under the localization metric, a test image whose class probability is greater than the threshold (i.e. a true positive in case of classification) but whose maximal response in the activation map does not fall within the ground truth annotations, allowing some tolerance, is counted as a false positive.
In our experiment, only positive class is considered for localization AP since there is no ROI on negative class. First, the activation map of positive class is resized to the size of original image via simple bilinear interpolation, then it is examined whether the maximal response falls into the ground truth annotations within 16 pixels tolerances which is a half of the global stride 32 of the considered network architecture. If the response is located inside true annotations, the test image is counted as a true positive. If not, it is counted as a false positive.
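The true-positive test for localization can be sketched as follows. This is a simplified illustration under stated assumptions: the ground truth annotation is treated as an axis-aligned box `(top, left, bottom, right)`, and instead of resizing the whole activation map by bilinear interpolation, the centre of the peak cell is scaled to image coordinates. All names are hypothetical.

```python
def peak_location(activation_map):
    """(row, col) of the maximal response in a 2-D activation map."""
    rows, cols = len(activation_map), len(activation_map[0])
    return max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: activation_map[rc[0]][rc[1]],
    )


def to_image_coords(cell, map_size, image_size):
    """Scale the centre of an activation-map cell to image coordinates."""
    scale = image_size / map_size
    r, c = cell
    return ((r + 0.5) * scale, (c + 0.5) * scale)


def is_true_positive(point, box, tolerance=16.0):
    """True if the point lies inside the box enlarged by `tolerance` pixels."""
    y, x = point
    top, left, bottom, right = box
    return (top - tolerance <= y <= bottom + tolerance
            and left - tolerance <= x <= right + tolerance)


activation_map = [[0.1, 0.2, 0.1], [0.3, 0.2, 0.9], [0.0, 0.1, 0.2]]
point = to_image_coords(peak_location(activation_map), map_size=3, image_size=96)
hit = is_true_positive(point, box=(40, 90, 60, 120))    # within tolerance
miss = is_true_positive(point, box=(40, 100, 60, 120))  # outside tolerance
```

The default tolerance of 16 pixels matches half of the global stride of 32 mentioned in the text.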
Tuberculosis (TB) is one of the major global health threats. Many curable TB patients in developing countries die because of delayed diagnosis, partly due to the lack of radiography and radiologists. Therefore, developing a computer-aided diagnosis (CAD) system for TB screening can contribute to early diagnosis of TB, which helps prevent deaths from TB.
We use three CXRs datasets, namely KIT, Shenzhen and MC sets in this experiment. All the CXRs used in this work are de-identified by the corresponding image providers. KIT set contains 10,848 DICOM images, consisting of 7,020 normal and 3,828 abnormal (TB) cases, from the Korean Institute of Tuberculosis (KIT) under Korean National Tuberculosis Association (KNTA), South Korea. Shenzhen and MC sets are available limited to research purpose [2, 11, 10]. Shenzhen set has 326 normal and 336 TB cases from Shenzhen No. 3 People’s Hospital, Guangdong Medical College, Shenzhen, China. Finally, MC set was collected from National Library of Medicine, National Institutes of Health, Bethesda, MD, USA. It consists of 80 normal and 58 TB cases.
We train the models using the KIT set, and test the classification and localization performances using the Shenzhen and MC sets. To evaluate the localization performance, we obtain their detailed annotations from the TB clinician since the testsets, Shenzhen and MC sets, do not contain any annotations for TB ROIs. Table 1 summarizes the experimental results. For both classification and localization tasks, STL consistently outperforms other methods. The best performance model is STL+AvePool. A global average pooling works well in this experiment since images of negative class act as a background. Since the value of localization AP is always less than that of classification AP (refer to the definition of localization AP in Section 4.2), it is important to see the improvement ratio. For a global average pooling, the localization APs are improved about 26% and 58% for Shenzhen and MC sets, respectively, while the improvement of classification APs for positive class are about 2% and 17%. This means that STL certainly assists localizer to find the most important ROIs which define the class label. Precision-recall curve is shown in Figure 3. The left half of Figure 4 shows the representative examples among the test sets.
We use two public mammography databases, the Digital Database for Screening Mammography (DDSM) [8, 9] and the Mammographic Image Analysis Society (MIAS) database, in this experiment. DDSM and MIAS are used for training and testing, respectively. We preprocess DDSM images to have two labels, positive (abnormal) and negative (normal). Originally, abnormal mammographic images contain several types of abnormalities such as masses, microcalcification, etc. We merge all types of abnormalities into the positive class to distinguish any abnormality from normal, so the numbers of positive and negative images are 4025 and 6338, respectively, in the training set (DDSM). In the test set (MIAS), there are 112 positive and 210 negative images. Note that we do not use any additional information except for image-level labels for training networks, although the training set (DDSM) has boundary information for abnormal ROIs. The boundary information of the test set (MIAS) is utilized to evaluate the localization performance.
Table 2 reports the classification and localization results. As we can see, classification of mammograms is much more difficult than TB detection. First of all, mammograms used for training are low quality images which contain some degree of artifact and distortion generated by the scanning process that creates digital images from films. Moreover, this task is inherently complicated since there also exist quite a few irregular patterns in the normal class caused by various shapes and characteristics of fatty tissues. Nevertheless, it is confirmed that STL is significantly better than other methods for both classification and localization. Again, for a global average pooling, the localization performance is improved by about 242% while the classification performance is improved by about 6%. For a global max pooling without STL, the training loss does not decrease at all, i.e., the network cannot be trained. Therefore, its localization performance is not reported in Table 2 and Figure 3, since there are no true positives for any probability threshold. Figure 3 and Figure 4 show the precision-recall curve and representative examples from the test sets.
In this work, we propose a novel framework, STL, which enables training a CNN for object localization with neither location information nor pre-trained models. Our framework jointly learns both classifier and localizer using a weighted loss as an objective function, in order to prevent localizer from falling into a bad local optimum. Self-transfer is realized via a weight controlling the relative importance between classifier and localizer. Also, the effect of classifier on localizer is discussed to provide the rationale behind the advantages of the proposed framework. Computational experiments on medical vision tasks given only image-level labels show that the proposed framework outperforms the existing approaches in terms of both classification and localization performance metrics.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
The Knowledge Engineering Review, 25(01):1–25, 2010.