AttentionBoost: Learning What to Attend by Boosting Fully Convolutional Networks

08/06/2019 · by Gozde Nur Gunesli, et al.

Dense prediction models are widely used for image segmentation. One important challenge is to sufficiently train these models to yield good generalizations for hard-to-learn pixels. A typical group of such hard-to-learn pixels are boundaries between instances. Many studies have proposed to give specific attention to learning the boundary pixels. They include designing multi-task networks with an additional task of boundary prediction and increasing the weights of boundary pixels' predictions in the loss function. Such strategies require defining what to attend beforehand and incorporating this predefined attention into the learning model. However, there may exist other groups of hard-to-learn pixels, and manually defining and incorporating the appropriate attention for each group may not be feasible. In order to provide a more attainable and scalable solution, this paper proposes AttentionBoost, a new multi-attention learning model based on adaptive boosting. AttentionBoost designs a multi-stage network and introduces a new loss adjustment mechanism for a dense prediction model to adaptively learn what to attend at each stage directly on image data, without requiring any prior definition of what to attend. This mechanism modulates the attention of each stage to correct the mistakes of previous stages, by adjusting the loss weight of each pixel prediction separately with respect to how accurate the previous stages are on this pixel. This enables AttentionBoost to learn different attentions for different pixels at the same stage, according to the difficulty of learning these pixels, as well as multiple attentions for the same pixel at different stages, according to the confidence of these stages on their predictions for this pixel. Using gland segmentation as a showcase application, our experiments demonstrate that AttentionBoost improves the results of its counterparts.


I Introduction

Due to their ability to learn high-level complex features on image data [1], convolutional neural networks (CNNs) have shown great success on various image classification [2, 3, 4] and object detection [5] tasks in recent years. For segmentation tasks, dense prediction models using fully convolutional networks (FCNs) in particular have provided significant improvements in both efficiency and accuracy [6]. Thus, FCNs have also become a popular architectural choice for medical image segmentation [7]. Despite the success of FCNs trained on very large datasets, training becomes much more difficult when only small quantities of annotated data are available and when the pixels of the background and foreground classes are highly imbalanced, both of which are typical for medical images. In such cases, without further adjustments, the networks tend to yield poor generalizations for pixels of a minority class as well as for hard-to-learn pixels.

The most common approach to mitigate the class-imbalance problem is to increase the relative weight of minority class predictions in the loss function. Although this approach forces the network to give more attention to learning the minority class, it may not increase the performance on hard-to-learn pixels when these pixels occur in both the majority and minority classes and are distributed unevenly within a class. For instance, for the task of segmenting glands in a histopathological image, it is harder to learn the pixels close to gland boundaries, regardless of whether they belong to the foreground or the background class. Furthermore, although the number of such hard-to-learn pixels (and as a result, the total weight contribution of their predictions to the loss function) is relatively low, their correct classification greatly affects the success of the entire segmentation task since these boundary pixels separate multiple gland instances from each other.

To address this problem, it has been proposed to give specific attention to the classification of boundary pixels. One solution is to adjust the weights of these pixels in the loss function based on their distances to the boundary of the closest gland instances [8]. Another solution is to give this attention by designing a multi-task architecture. This has been achieved by defining boundary prediction as an additional task, learning it together with the main task of gland segmentation, and combining the predicted maps at the end, either with a simple fusion function [9] or with an additional fusion network [10]. The multi-task architecture proposed in [11] also includes one more additional task to predict the bounding boxes of the gland instances. Both of these solutions help better classify the boundary pixels since they give specific attention to decreasing the mistakes that their networks would make on these pixels. This attention is defined to alleviate a single mistake type relating to one group of hard-to-learn pixels, namely "incorrect boundary classification", and this mistake type needs to be manually (externally) identified before designing and learning a network. This manual identification is indeed a natural choice for the gland segmentation task since multiple gland instances may appear to touch in histopathological images due to their nature. On the other hand, there may exist other groups of hard-to-learn pixels, and thus other types of mistakes associated with these pixels, in the images (see Fig. 1). In order for these solutions to be scalable against multiple mistake types, either new weight adjustments or new additional tasks would have to be defined for each mistake type separately. Nevertheless, this must be done externally and manually, which might be challenging especially when these mistakes are related not to the nature of the images but to noise and artifacts. As shown in Fig. 1, histopathological images typically contain such noise and artifacts due to the tissue preparation (fixation, sectioning, and staining) procedures.

 
Fig. 1: Examples of histopathological images of colon glands. In the gland segmentation task, it is more difficult to correctly classify the boundary pixels when two glands are very close to each other. The image shown in (a) contains such glands. Additionally, these images typically contain noise and artifacts due to the tissue preparation procedures. For example, due to the density difference between glands and connective tissue (inside and outside of a gland), the fixation and sectioning procedures may result in large white artifacts outside the glands. The images given in (b) and (c) contain such artifacts. It is common for gland segmentation algorithms to identify some of these large white artifacts as false glands. The images consist of (a)-(b) normal glands and (c) cancerous glands.

In response to these issues, this paper introduces an iterative attention learning model based on adaptive boosting. This model, which we call AttentionBoost, learns multiple attentions directly on image data at the same time as it learns the network weights. To this end, AttentionBoost first designs a multi-stage system that contains a fully convolutional segmentation network in each stage. Then, it modulates the attention of each segmentation network for each training image, based on the pixel-wise errors of the previous stage networks, by introducing a new loss adjustment method for a dense prediction model. This method is inspired by the AdaBoost algorithm [12] and adjusts the loss weight of each pixel prediction separately with respect to how confident the previous stage networks are on their correct/incorrect predictions for the same pixel. By doing so, the proposed AttentionBoost model makes it possible to assign different attention levels to different pixels of the same image, according to the difficulty of learning these pixels, and to adaptively learn what image parts (e.g., gland boundaries and artifacts) need more attention during network training. This also forces the next stages to give more attention to learning the pixels incorrectly segmented by the previous stage networks. With this adaptive loss adjustment, AttentionBoost trains its multi-stage network end-to-end and combines the outputs of all stages to obtain the final segmentation. Using gland instance segmentation as a showcase application, our experiments demonstrate that this type of attention learning improves segmentation results not only for the boundary pixels but also for other hard-to-learn pixels, mostly corresponding to false positives that emerge as a result of noise and artifacts.

II Related Work

The proposed AttentionBoost model mainly differs from the related networks in the following aspects: The literature contains single-attention models that externally define what to attend before network training starts [8, 9, 10]. These attention points are manually determined as boundary pixels, assuming that these pixels are hard to learn. In contrast, AttentionBoost is an error-driven multi-attention model and adaptively learns what to attend directly on image data without making any prior assumption.

AttentionBoost also differs from the iterative methods that have been proposed to correct the mistakes of a single model and refine its results. The basic idea of these methods is to decompose a segmentation task into iterative stages where image features are learned together with high-level context features from the previous map to improve the result at the current stage [13, 14, 15, 16, 17]. For that, these methods iteratively give an input image and a predicted label map from the previous stage to the next stage, starting with a null label map [14, 15] or a segmentation map obtained from another model [16, 17], and use the last predicted map after some number of iterations. As opposed to the proposed AttentionBoost model, these methods learn the same task and use the same objective (loss) function in every stage, which does not explicitly force the network to change its attention to learning incorrectly segmented pixels but rather expects the network to implicitly learn how to correct its mistakes. Although AttentionBoost uses the same segmentation task definition in all stages, since it adaptively changes the objective function from one stage to another, it can be considered to learn a different subtask in each of these stages.

The literature also contains studies that use different weight contributions in their loss functions. However, almost all of these studies address the class-imbalance problem. To this end, they calculate a constant weight for each class, typically inversely proportional to its pixel frequency, and use this constant weight for all predictions of the pixels belonging to the same class [18, 19, 20]. Different from these studies, instead of just calculating such constant weights based on the class pixel frequencies, AttentionBoost learns how to adjust the weights in the loss function on image data, with the ability to give different weights to pixel-wise predictions of the same class. There exists only a single study that attempts to learn the loss weights on image data for object detection [21]. However, this previous study neither constructs multiple networks nor trains them iteratively; rather, it focuses on training a single-stage network. Each epoch of this training updates the loss weight for each object to be detected separately, and the next epoch uses the same updated weight for all pixels in the bounding box of the same object. Such an approach may increase the importance of learning misdetected, and most probably harder-to-learn, objects in later epochs. However, since the use of a single network requires using the same network weights for all types of object detections and since the common type of (in)correctly detected objects may still dominate the loss function, it becomes harder to explicitly focus on multiple detection subtasks with different difficulty levels at the same time. On the other hand, the proposed AttentionBoost model makes it possible to define multiple stages, each of which can contain a network with a different attention (by adaptively changing the loss function). This, in turn, allows each stage to focus on a different aspect of the segmentation task. Additionally, this previous study [21] uses the same loss weight for all pixels of the same object (bounding box) without considering their pixel-wise contributions. On the contrary, AttentionBoost updates the loss weight for each pixel separately, according to the difficulty of learning this pixel.

In the literature, there also exist studies that combine the AdaBoost algorithm [12] with a neural network architecture [22, 23, 24, 25, 26]. However, these studies do not involve a dense prediction task using an FCN; they rather focus on the task of classifying an image instance. Therefore, they use the same attention for each image, either by arranging different training sets for each learner or by arranging loss weights for the training instances of each learner. These non-dense prediction models, which have been designed for a classification task, are beyond the scope of this paper. This paper adapts the idea of adjusting the loss weights of pixel-wise predictions in a dense prediction model for a segmentation task.

III Methodology

The AttentionBoost model proposes to train a multi-stage network that adjusts (learns) the attention of each of its stages automatically and to combine the outputs of all these stages for obtaining a final segmentation. To this end, it introduces an attention learning mechanism for a dense prediction model. This mechanism relies on devising a new loss adjustment method, in which the loss contribution of each pixel prediction at each stage is adjusted depending on the confidence levels of the correct/incorrect predictions of the previous stages.

The motivation behind designing such a multi-stage network is as follows: A network is trained to optimize its objective function, and thus, the definition of this function greatly affects the network's outputs. When there exist imbalanced data distributions and when all data points contribute to the objective function evenly, the network is biased toward learning the most common patterns in the data. In this case, learning less common patterns requires adjustments in the objective function. However, making adjustments for many different patterns may not be easy for a model that trains a single network with a single objective function. On the other hand, when the model allows training multiple (sub)networks that may use different objective (loss) functions, it is easier to make such adjustments since this gives the model an opportunity to modulate each network's attention to a different goal.

With this motivation, this paper designs a multi-stage network architecture, each stage of which trains a network with a different loss function. To do so, each stage iteratively takes an image and a probability map from the previous stage as its input, adjusts its loss function according to this probability map, and outputs a new probability map for the next stage. The architecture of this multi-stage network is illustrated in Fig. 2 and its details are given in the following subsections.

Fig. 2: An overview of the proposed multi-stage network architecture that consists of four segmentation networks (FCNs). The $t$-th stage network inputs an original image and a probability map estimated by the previous stage and outputs a new probability map for the next stage. While end-to-end training the multi-stage network, the loss contribution map for the $t$-th stage is modulated by $w_t$ and $\beta_t$, as given in Eqns. 2 and 3. In order to illustrate how this multi-stage network iteratively corrects its errors for an unseen image, this figure shows the posterior maps and loss contribution maps calculated for a test set image. Note that the loss contribution maps of this test set image are calculated just for demonstration purposes since these maps are only calculated for the training images during network training. In the illustration of the contribution maps, the whiter the color of a pixel, the higher its contribution to the corresponding loss function. The posterior maps include the probability of each pixel belonging to the foreground object. In these maps, posteriors between 1 and 0.5 are shown with increasing tints of red and posteriors between 0 and 0.5 are shown with increasing tints of blue; posteriors close to 0.5 appear whitish.

III-A Attention Learning

Let $I$ be an image in the training set $\mathcal{D}$, $q$ be a pixel in the training image $I$, and $y_q$ be the ground truth for this pixel. Here $y_q = 1$ if the pixel $q$ belongs to a foreground object and $y_q = 0$ otherwise. Then, the loss function for the $t$-th stage network is defined as

$$\mathcal{L}_t = -\sum_{q \in I} w_t(q) \left[ \, y_q \log \hat{y}_t(q) + (1 - y_q) \log\left(1 - \hat{y}_t(q)\right) \right] \qquad (1)$$

where $\hat{y}_t(q)$ is the foreground probability for pixel $q$ estimated by the $t$-th stage network and $w_t(q)$ is the contribution of this pixel prediction to the loss function $\mathcal{L}_t$. The attention learning mechanism of the AttentionBoost model iteratively learns these contributions $w_t(q)$, for each pixel $q$ and for each stage $t$, at the same time as learning the network weights by backpropagation. In particular, this mechanism decreases the loss contributions for correctly estimated pixels and increases them for incorrectly estimated ones, in the framework of adaptive boosting.
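To make Eqn. 1 concrete, the following is a minimal NumPy sketch of this weighted per-pixel cross-entropy; the function name and the clipping constant are ours, not the paper's:

```python
import numpy as np

def weighted_bce_loss(y_true, y_pred, w, eps=1e-7):
    """Weighted binary cross-entropy over all pixels of an image (Eqn. 1).

    y_true: ground-truth map with entries in {0, 1}
    y_pred: foreground probability map estimated by one stage
    w:      per-pixel loss contribution map of the same stage
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # numerical stability
    pixel_loss = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return np.sum(w * pixel_loss)
```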

To this end, it defines a coefficient $\beta_t(q)$ that controls how much to update the current loss contribution for the next stage. That is, this coefficient is used to calculate $w_{t+1}(q)$ from $w_t(q)$ as follows, provided that the initial loss contributions $w_1(q)$ are selected with respect to the class pixel frequencies. Note that one may also select the same $w_1(q)$ for all pixels.

$$w_{t+1}(q) = \beta_t(q) \, w_t(q) \qquad (2)$$
$$\beta_t(q) = e^{\, 2 \left| y_q - \hat{y}_t(q) \right| - 1} \qquad (3)$$

The term $\left| y_q - \hat{y}_t(q) \right|$ in Eqn. 3 quantifies how confident the $t$-th stage network is on its estimation for pixel $q$. Since $0 \le \hat{y}_t(q) \le 1$, the resulting coefficient $\beta_t(q)$ will converge to its minimum value of $e^{-1}$ if the current network correctly estimates pixel $q$ and if it is very confident on this correct estimation. In this case, the loss contribution $w_{t+1}(q)$ becomes smaller, which forces the next stage network to decrease its attention to learning this pixel $q$. On the other hand, if the current network incorrectly estimates $q$ but is very confident on this incorrect estimation, $\beta_t(q)$ will converge to its maximum value of $e$. This time, the loss contribution $w_{t+1}(q)$ becomes larger, which forces the next stage network to increase its attention to learning pixel $q$. Thus, the $\beta_t(q)$ coefficients, which are calculated based on the estimations of the current stage network, are used to modulate the attention of the next stage network.

Here it is worth noting that after calculating the loss contributions using Eqn. 2, these contributions are normalized separately for the correctly estimated pixels of a training image and for its incorrectly estimated pixels, such that $\sum_{q \in \mathcal{C}_t} w_{t+1}(q) = 1/2$ over the set $\mathcal{C}_t$ of correctly estimated pixels and $\sum_{q \in \mathcal{M}_t} w_{t+1}(q) = 1/2$ over the set $\mathcal{M}_t$ of incorrectly estimated pixels. This prevents the next stage networks from completely abandoning their attention to learning the correctly segmented pixels. This is important since the output maps of all stages will be aggregated at the end to obtain the final segmentation (Sec. III-D).
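Under the reconstructed forms of Eqns. 2 and 3 above, the weight update and the separate normalization can be sketched as follows; this is a minimal NumPy sketch, and the exact normalization targets are our assumption:

```python
import numpy as np

def update_loss_contributions(y_true, y_pred, w):
    """Boosting-style per-pixel weight update (Eqns. 2 and 3).

    Pixels that the current stage predicts confidently and correctly get
    smaller weights; confident mistakes get larger weights. The updated
    weights are then normalized separately over the correctly and
    incorrectly estimated pixels so that each group sums to 1/2.
    """
    beta = np.exp(2.0 * np.abs(y_true - y_pred) - 1.0)   # Eqn. 3
    w_next = beta * w                                     # Eqn. 2

    correct = (y_pred >= 0.5) == (y_true >= 0.5)          # correctly estimated pixels
    for mask in (correct, ~correct):                      # normalize each group to 1/2
        total = w_next[mask].sum()
        if total > 0:
            w_next[mask] *= 0.5 / total
    return w_next
```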

III-B Base Model for Each Stage

This work uses the same FCN architecture for the networks in all of its stages. (AttentionBoost does not require all networks to be the same; however, we select the same architecture for all networks for the sake of simplicity.) The FCN at the $t$-th stage takes a normalized RGB image $I$ as an input together with the probability map $\hat{Y}_{t-1}$ that is estimated for this image by the previous stage network and outputs the probability map $\hat{Y}_t$. In order to employ the same base model for all stages, a null map is used for $\hat{Y}_0$, in which the same initial value is assigned to all pixels.

The FCN architecture used as the base model consists of an encoder and a decoder path that are connected by symmetric connections (see Fig. 3). This architecture is similar to the one proposed in [8], with extra dropout layers [27] added to reduce overfitting. This base model has convolution layers with $3 \times 3$ filters and pooling/upsampling layers with $2 \times 2$ filters. It uses the sigmoid activation function at its last layer and the ReLU activation function elsewhere. Note that with this base model, our multi-stage network fits in GPU memory during end-to-end training of its four networks, and thus, training is faster.
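For illustration, a reduced Keras sketch of such an encoder-decoder base model is given below. It is not the exact architecture of Fig. 3 (the numbers of layers and channels here are placeholders), but it shows the 4-channel input that concatenates the RGB image with the previous stage's probability map:

```python
from tensorflow.keras import layers, Model

def build_base_fcn(height=None, width=None, dropout=0.2):
    """Illustrative U-Net-style FCN: RGB image + previous probability map -> probability map."""
    inputs = layers.Input(shape=(height, width, 4))  # 3 RGB channels + 1 probability channel

    # Encoder path
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    c1 = layers.Dropout(dropout)(c1)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    c2 = layers.Dropout(dropout)(c2)
    p2 = layers.MaxPooling2D(2)(c2)

    # Bottleneck
    b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)

    # Decoder path with symmetric (skip) connections
    u2 = layers.UpSampling2D(2)(b)
    c3 = layers.Conv2D(64, 3, padding="same", activation="relu")(layers.concatenate([u2, c2]))
    u1 = layers.UpSampling2D(2)(c3)
    c4 = layers.Conv2D(32, 3, padding="same", activation="relu")(layers.concatenate([u1, c1]))

    # Sigmoid at the last layer produces the foreground probability map
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return Model(inputs, outputs)
```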

Fig. 3: Architecture of the FCN used as the base model. This architecture consists of an encoder and a decoder path that are connected by symmetric connections, similar to [8]. Each box represents a feature map, with its dimensions and number of channels indicated on its right. Each arrow corresponds to an operation, distinguishable by its color.

III-C Multi-Stage Network Training

During the network training, the normalized RGB images in the training set are fed to the network together with their ground truth segmentation maps and the overall multi-stage network is trained in an end-to-end manner using the backpropagation algorithm. At each epoch, the forward pass calculates the loss contributions for each training image from the first stage to the last one iteratively, as described in Sec. III-A. Then, the loss functions are updated according to the calculated loss contributions and the backward pass updates the network weights by differentiating the updated loss functions.
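A sketch of one such training step is given below, assuming TensorFlow/Keras and reusing the `update_loss_contributions` sketch from Sec. III-A; here `stages` is a list of base FCNs, `w_init` encodes the class-frequency-based initial contributions, and the zero-valued null map and the use of `tf.numpy_function` are our assumptions:

```python
import tensorflow as tf

def train_step(stages, optimizer, image, y_true, w_init):
    """One end-to-end training step over all stages for a single image batch."""
    with tf.GradientTape() as tape:
        y_prev = tf.fill(tf.shape(y_true), 0.0)  # null map fed to the first stage
        w = w_init                               # initial loss contributions
        total_loss = 0.0
        for net in stages:
            y_pred = net(tf.concat([image, y_prev], axis=-1), training=True)
            pixel_loss = -(y_true * tf.math.log(y_pred + 1e-7)
                           + (1.0 - y_true) * tf.math.log(1.0 - y_pred + 1e-7))
            total_loss += tf.reduce_sum(w * pixel_loss)
            # The weight update is treated as data: no gradient flows through it
            w = tf.stop_gradient(
                tf.numpy_function(update_loss_contributions,
                                  [y_true, y_pred, w], tf.float32))
            y_prev = y_pred
    variables = [v for net in stages for v in net.trainable_variables]
    grads = tape.gradient(total_loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return total_loss
```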

III-D Gland Segmentation

After training its multi-stage network, for a given image $I$, the AttentionBoost model aggregates the probability maps estimated by all of the stages by taking their average. Then, it first identifies the "certain" foreground and background regions on this average map and grows these regions onto the "uncertain" pixels. We use such an approach to alleviate the negative effects of noisy pixels that may arise in the average map due to the aggregation. This approach first classifies each pixel $q$ with a label $\ell(q)$ as follows, based on its average probability $\bar{y}(q)$ and a confidence parameter $\tau$.

$$\ell(q) = \begin{cases} \text{foreground} & \text{if } \bar{y}(q) \ge 0.5 + \tau \\ \text{background} & \text{if } \bar{y}(q) \le 0.5 - \tau \\ \text{uncertain} & \text{otherwise} \end{cases} \qquad (4)$$

Then, it identifies foreground and background seed regions by finding connected components of the foreground pixels and of the background pixels separately. After eliminating the seeds smaller than an area threshold $\theta$ and assigning the pixels of these eliminated seeds to the uncertain class, it grows the remaining seeds onto the uncertain pixels with respect to their average probabilities. Each grown foreground seed region is considered a gland in the final segmentation map. At the end, to smooth their boundaries, a majority filter of size $f \times f$ is applied on the segmented glands.

Here we use a simple approach that calculates the average over the probability maps of all stages and then uses a region growing algorithm on this average map. One may consider designing more sophisticated approaches to process these probability maps; we leave this as future work.
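A sketch of this postprocessing with scikit-image is given below, using watershed as a stand-in for the region growing step (the paper does not specify the exact growing algorithm, and the default parameter values are placeholders, not the selected values):

```python
import numpy as np
from scipy import ndimage
from skimage.measure import label
from skimage.morphology import remove_small_objects
from skimage.segmentation import watershed

def segment_glands(avg_prob, tau=0.35, area_threshold=250, filter_size=5):
    """Seed-controlled region growing on the average probability map."""
    # Certain pixels (Eqn. 4); everything else is uncertain
    foreground = avg_prob >= 0.5 + tau
    background = avg_prob <= 0.5 - tau

    # Seeds: connected components, with small ones demoted to the uncertain class
    fg_seeds = label(remove_small_objects(foreground, min_size=area_threshold))
    bg_seeds = remove_small_objects(background, min_size=area_threshold)

    # Grow all seeds onto the uncertain pixels, guided by the average probabilities
    markers = fg_seeds.copy()
    bg_label = fg_seeds.max() + 1
    markers[bg_seeds] = bg_label                 # one marker for the background
    grown = watershed(-avg_prob, markers)        # stand-in for region growing
    glands = np.where(grown == bg_label, 0, grown)

    # Majority filter (mean-threshold on a binary mask) to smooth gland boundaries
    smooth = ndimage.uniform_filter((glands > 0).astype(float), size=filter_size) > 0.5
    return label(smooth)
```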

IV Experiments

IV-A Dataset

We test our model on a dataset of 200 microscopic images of colon biopsy samples obtained from the Pathology Department Archives of Hacettepe University School of Medicine. These samples are hematoxylin-and-eosin stained tissue sections containing normal and cancerous (colon adenocarcinomatous) glands. Their images are taken using a Nikon Coolscope Digital Microscope.

The dataset is divided into training, validation, and test sets. The training images are used by the backpropagation algorithm to learn the weights of the proposed multi-stage network and the validation images are used for early stopping of the backpropagation algorithm. Both the training and validation images are employed to select the confidence parameter $\tau$, the area threshold $\theta$, and the majority filter size $f$ used by the gland segmentation step. This parameter selection is explained in Sec. IV-D. The test images are used neither for network training nor for parameter selection; they are used only for evaluation purposes. Table I presents the number of images and the number of glands for each set.

IV-B Implementation Details

The multi-stage network containing four FCNs is implemented in Python using the Keras deep learning framework. The network is trained on a GPU (GeForce GTX 1080 Ti). It is trained from scratch using randomly initialized network weights and with an early stopping approach based on the loss calculated for the validation images. The batch size is 1 and the dropout rate is 0.2. The learning rate and the momentum value are adaptively adjusted using the AdaDelta optimizer [28].

           Number of images            Number of glands
           Training  Validation  Test  Training  Validation  Test
Normal     40        10          50    570       174         621
Cancerous  40        10          50    321       49          367
Total      80        20          100   891       223         988
TABLE I: Number of images and number of glands in the training, validation, and test sets.

IV-C Evaluation

Segmentation results are quantitatively assessed using three criteria: 1) the object-level F-score to assess what percentage of gland objects are detected correctly, 2) the object-level Dice index to assess how accurately the pixels of the segmented gland objects overlap with those of their matching (maximally overlapping) ground truth objects, and 3) the Hausdorff distance to assess the shape similarity between the segmented gland objects and their matching ground truth objects. Note that these measures were also used in the GlaS Challenge Contest [29].

IV-C1 F-score

A segmented gland object is considered as true positive (TP) if it intersects with at least 50 percent of a ground truth object, and as false positive (FP) otherwise. A ground truth object is considered as false negative (FN) if at least its 50 percent does not intersect with any segmented gland object. The object-level F-score is defined as:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad \text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (5)$$
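Given lists of per-object binary masks, this matching and F-score computation can be sketched as follows (the helper names are ours):

```python
import numpy as np

def object_level_fscore(segmented, ground_truth):
    """segmented, ground_truth: lists of boolean masks, one mask per object."""
    tp = 0
    for s in segmented:
        # s is a true positive if it covers at least 50% of some ground truth object
        if any(np.sum(s & g) >= 0.5 * np.sum(g) for g in ground_truth):
            tp += 1
    fp = len(segmented) - tp
    fn = sum(1 for g in ground_truth
             if all(np.sum(s & g) < 0.5 * np.sum(g) for s in segmented))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```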

IV-C2 Dice index

Let $S = \{S_1, \dots, S_m\}$ be the set of segmented gland objects in all images of a given dataset and $G = \{G_1, \dots, G_n\}$ be the set of ground truth objects in these images. To calculate the object-level Dice index on these two sets, the objects in $S$ and $G$ are first matched: Each $S_i$ is matched with the ground truth object $G(S_i)$ that maximally overlaps $S_i$. Similarly, each $G_j$ is matched with the segmented gland object $S(G_j)$ that maximally overlaps $G_j$. Then, by accumulating the Dice indices calculated for all matching object pairs, the object-level Dice index is defined as follows:

$$\text{DICE} = \frac{1}{2} \left[ \sum_{i=1}^{m} \omega_i \, \text{dice}\!\left(S_i, G(S_i)\right) + \sum_{j=1}^{n} \tilde{\omega}_j \, \text{dice}\!\left(G_j, S(G_j)\right) \right] \qquad (6)$$

where $\omega_i = |S_i| / \sum_{k=1}^{m} |S_k|$ and $\tilde{\omega}_j = |G_j| / \sum_{k=1}^{n} |G_k|$. Here $\text{dice}(A, B) = 2\,|A \cap B| / (|A| + |B|)$ is the Dice index of a pair of objects $A$ and $B$, one from the segmented gland objects and the other from the ground truth objects. Note that if there is no matching ground truth object for a segmented gland object (or vice versa), the contribution of this object to the Dice index is zero.
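A corresponding sketch of the object-level Dice index, assuming non-empty object lists; `best_match` implements the maximal-overlap matching, and unmatched objects contribute zero:

```python
import numpy as np

def object_level_dice(segmented, ground_truth):
    """Object-level Dice index (Eqn. 6) over lists of boolean object masks."""
    def dice(a, b):
        return 2.0 * np.sum(a & b) / (np.sum(a) + np.sum(b))

    def best_match(obj, candidates):
        overlaps = [np.sum(obj & c) for c in candidates]
        return candidates[int(np.argmax(overlaps))] if max(overlaps) > 0 else None

    total_s = sum(np.sum(s) for s in segmented)
    total_g = sum(np.sum(g) for g in ground_truth)
    term_s = sum((np.sum(s) / total_s) * dice(s, m)
                 for s in segmented if (m := best_match(s, ground_truth)) is not None)
    term_g = sum((np.sum(g) / total_g) * dice(g, m)
                 for g in ground_truth if (m := best_match(g, segmented)) is not None)
    return 0.5 * (term_s + term_g)
```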

IV-C3 Hausdorff distance

Likewise, the objects in $S$ and $G$ are matched to calculate the object-level Hausdorff distance. Each $S_i$ is matched with the $G(S_i)$ that maximally overlaps $S_i$. If there is no overlap, $G(S_i)$ is the ground truth object that has the minimum Hausdorff distance from $S_i$. Similarly, each $G_j$ is matched with the $S(G_j)$ that maximally overlaps $G_j$. If there is no overlap, $S(G_j)$ is the segmented gland object that has the minimum Hausdorff distance from $G_j$. Then, by accumulating the Hausdorff distances calculated for all matching object pairs, the object-level Hausdorff distance is defined as follows:

$$\text{HAUS} = \frac{1}{2} \left[ \sum_{i=1}^{m} \omega_i \, H\!\left(S_i, G(S_i)\right) + \sum_{j=1}^{n} \tilde{\omega}_j \, H\!\left(G_j, S(G_j)\right) \right] \qquad (7)$$

Here $H(A, B) = \max \left\{ \sup_{a \in A} \inf_{b \in B} d(a, b), \; \sup_{b \in B} \inf_{a \in A} d(a, b) \right\}$ is the Hausdorff distance between a pair of objects $A$ and $B$, one from the segmented gland objects and the other from the ground truth objects. Note that $\sup_{a \in A} \inf_{b \in B} d(a, b)$ gives the maximum of the minimum distances calculated from every pixel of the object $A$ to any pixel of the object $B$.
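The pairwise distance $H(A, B)$ between two object masks can be computed with SciPy's `directed_hausdorff`, e.g.:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(mask_a, mask_b):
    """Symmetric Hausdorff distance between two boolean object masks."""
    a = np.argwhere(mask_a)   # pixel coordinates of object A
    b = np.argwhere(mask_b)   # pixel coordinates of object B
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
```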

IV-D Parameter Selection

AttentionBoost uses three external parameters in its gland segmentation step. These are the confidence parameter $\tau$ to identify certain pixels for region growing, the area threshold $\theta$ to eliminate small regions, and the majority filter size $f$ to control how much to smooth gland boundaries. Grid search is used to select their values. For that, all combinations of the candidate values of $\tau$, $\theta$, and $f$ are considered and the combination that yields the highest Dice index for the training and validation images is selected. The test set images are not used in this selection at all. Sec. V-A will discuss the effects of this parameter selection on the model's performance in detail. Note that the same procedure is used to select the external parameters of the comparison methods.
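The grid search itself is straightforward; a sketch with a caller-supplied evaluation function (the names are ours):

```python
from itertools import product

def select_parameters(tau_values, theta_values, f_values, evaluate_dice):
    """Exhaustive grid search; evaluate_dice(tau, theta, f) should return the
    Dice index measured on the training and validation images."""
    return max(product(tau_values, theta_values, f_values),
               key=lambda params: evaluate_dice(*params))
```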

IV-E Comparisons

We compare our model with three approaches implemented based on previously reported dense prediction models [8, 9, 14]. The first two, the BoundaryAttentionWithLossAdjustment and BoundaryAttentionWithMultiTask methods, are single-stage models that give specific attention to predicting gland boundaries. However, as opposed to our proposed model, which automatically learns multiple attentions directly on image data, these comparison methods require a prior definition of what to attend and include this definition in their system design. We use these comparison methods to explore the benefits of our proposed multi-attention learning. The last comparison method, MultiStageWithoutAdaptiveBoosting, is a multi-stage model, each stage of which also takes an input image and a segmentation (probability) map from the previous stage and produces another segmentation map for the next stage. However, unlike our model, it always uses the same objective (loss) function at all of its stages. It neither explicitly forces its network to modulate its attention to learning incorrectly predicted pixels nor employs adaptive boosting for this purpose. We use this last comparison method to understand the effectiveness of using adaptive boosting in a dense prediction model. The details of these three comparison methods are given below. Note that for fair comparison, all these methods use the same FCN architecture, given in Fig. 3, in their base models.

                                     Normal glands               Cancerous glands            All glands
                                     F-score  Dice   Hausdorff   F-score  Dice   Hausdorff   F-score  Dice   Hausdorff
AttentionBoost                       95.39    94.58  25.89       91.76    92.50  42.74       94.03    93.56  34.12
BoundaryAttentionWithLossAdjustment  89.39    86.36  71.16       87.57    90.66  55.09       88.69    88.46  63.29
BoundaryAttentionWithMultiTask       95.59    92.48  33.51       84.14    89.84  46.05       91.13    91.20  39.61
MultiStageWithoutAdaptiveBoosting    88.50    84.04  86.08       90.60    91.66  50.37       89.31    87.77  68.62
TABLE II: Quantitative results of the proposed AttentionBoost model and the comparison methods obtained on the test set images.

IV-E1 BoundaryAttentionWithLossAdjustment

It gives specific attention to learning boundary pixels by increasing the importance of their correct prediction. For that, it adjusts the loss contributions of all pixels based on their distances to the boundary of the closest gland instances, as explained in [8]. Note that this method relies on the U-Net model that uses such loss adjustments in its training. The pixels predicted as gland by this trained network typically form undersegmented components for multiple gland instances that are close to each other; some of these instances are connected to each other by narrow bridges. Thus, to improve the results of this comparison method, the gland pixels are postprocessed as follows: they are first eroded by a disk structuring element, eroded components smaller than a threshold are eliminated, and the remaining components are dilated using the same structuring element. The size of the structuring element and the threshold are selected using grid search on the training and validation images (see Sec. IV-D).

IV-E2 BoundaryAttentionWithMultiTask

This method gives specific attention to learning boundary pixels by designing a multi-task architecture, similar to the DCAN model proposed in [9]. This architecture defines an additional task for boundary prediction and concurrently learns it together with the main task of gland segmentation. After training its network, the BoundaryAttentionWithMultiTask method locates glands in an image by subtracting the predicted boundary pixels from the predicted gland pixels and applying postprocessing. The postprocessing includes finding large connected components on the subtracted map and dilating them with a disk structuring element. Likewise, the area threshold and structuring element size are selected by the grid search.

IV-E3 MultiStageWithoutAdaptiveBoosting

It uses the same multi-stage network as the proposed AttentionBoost model and iteratively trains this network as proposed in [14]. However, it uses the same loss function at all of its stages and does not use adaptive boosting at all. After training, the segmentation map produced by its last stage is postprocessed to locate glands in a given image. Its postprocessing procedure is the same as that of the BoundaryAttentionWithLossAdjustment method. The parameters used in this procedure are also selected by grid search.

Fig. 4: (a) Example images containing normal (first three rows) and cancerous (last three rows) glands. (b) Ground truths. (c) Results of the proposed AttentionBoost model. (d) Results of the BoundaryAttentionWithLossAdjustment method, which gives specific attention to learning boundaries by changing the loss contributions of the boundary pixel predictions [8]. (e) Results of the BoundaryAttentionWithMultiTask method, which gives specific attention to learning boundaries by defining an additional task [9]. (f) Results of the MultiStageWithoutAdaptiveBoosting method, which uses a multi-stage network without adaptive boosting (without learning and adaptively changing the loss contributions) [14]. Note that these are the test set images; they are not used in any part of network training or parameter selection.

V Results

Table II reports the quantitative results of our proposed AttentionBoost model as well as those of the comparison methods. It presents the results obtained on all of the test set images, as well as those obtained separately on the test set images containing normal and cancerous glands. These results show that AttentionBoost is more successful at detecting and segmenting glands (higher F-score and Dice index values) and yields more accurate gland shapes (lower Hausdorff distances). This is attributed to the ability of our model to automatically learn what to attend in images as well as to focus on different types of mistakes. To explore this further, we examine the following types of mistakes the methods make in their segmentations, visually (Fig. 4) and quantitatively (Table III).

  • Undersegmented ground truth objects: A ground truth object $G_j$ is considered undersegmented if a segmented gland object $S_i$ intersects with at least 50 percent of $G_j$ but also intersects with at least 50 percent of another ground truth object $G_k$. This mistake type commonly occurs when a method cannot correctly predict the labels of pixels close to the gland boundaries. As also mentioned in the introduction, this is the mistake type that most of the previous methods have attempted to solve by either adjusting the weights of the boundary pixels in the loss function [8] or defining boundary prediction as an additional task in a multi-task architecture [9, 10].

  • False positives: A segmented gland object $S_i$ is considered a false positive if it does not intersect with at least 50 percent of any ground truth object $G_j$. In our experiments, we observe this mistake type for two main reasons. The first is segmenting non-gland regions as gland objects. These non-gland regions are typically located around white artifacts, which are usually formed in tissues as a result of the tissue preparation (fixation and sectioning) procedures. Such an example can be seen in the first row of Fig. 4(d). The second reason is oversegmenting small objects in a gland, usually close to its boundary. Two such examples (two small oversegmented objects) can be seen in the third row of Fig. 4(c). To distinguish these two sorts of false positives, we call $S_i$ a false segmented object if it does not intersect with at least 50 percent of any $G_j$ and if no $G_j$ intersects with at least 50 percent of $S_i$. On the other hand, we call it a small oversegmented object, again if it does not intersect with at least 50 percent of any $G_j$, but if a ground truth object $G_j$ intersects with at least 50 percent of $S_i$.

  • False negatives: A ground truth object $G_j$ is considered a false negative (missing object) if at least 50 percent of it does not intersect with any segmented gland object $S_i$.

                                     Undersegmented         False              Small                  Missing
                                     ground truth objects   segmented objects  oversegmented objects  ground truth objects
AttentionBoost                       60                     15                 27                     42
BoundaryAttentionWithLossAdjustment  222                    46                 15                     20
BoundaryAttentionWithMultiTask       80                     55                 50                     30
MultiStageWithoutAdaptiveBoosting    215                    16                 16                     31
TABLE III: Number of each type of mistake that the proposed AttentionBoost model and the comparison methods make on the test set images.

The numbers of each type of mistake that the methods make on the test set images are reported in Table III and visual results on exemplary test set images are provided in Fig. 4. These results demonstrate that the proposed AttentionBoost model leads to the best results both for undersegmentations, which emerge as a result of incorrectly classifying boundary pixels, and for false segmented objects, which are incorrectly located because of not differentiating true gland pixels from those belonging to non-gland regions mostly containing noise and artifacts. These are the two most common mistake types for this gland segmentation problem and our proposed model improves segmentation results for both at the same time, in contrast to its counterparts, which are good at either one mistake type or the other. This improvement is attributed to the following: AttentionBoost is a multi-stage, error-driven multi-attention learning model, each stage of which is able to give a different level of attention to learning different parts (pixels) of an image. This enables each stage to produce a segmentation (posterior) map complementary to those of the other stages. The maps of different stages are complementary on the incorrect predictions, especially for hard-to-learn pixels, since it is usually quite difficult for a single network to produce correct predictions for all such pixels. With such complementary maps, errors in one map may be compensated for by another. Thus, when these maps are aggregated, more robust predictions are expected. This can also be seen in Fig. 5, which provides the posterior maps produced for two exemplary test set images. Note that AttentionBoost misses slightly more ground truth objects. However, in our experiments, we observe that most of these correspond to small ground truth objects close to image edges. The one at the upper-right corner of the image shown in the last row of Fig. 4(b) is an example of such small objects.

  
  
Fig. 5: (a) Posterior map generated by the first stage. (b) Posterior map generated by the second stage. (c) Posterior map generated by the third stage. (d) Posterior map generated by the fourth stage. (e) Average posterior map obtained by aggregating the posterior maps of all stages. (f) Posterior map produced from the ground truth segmentation. These maps include the pixel posteriors, where 1 indicates that a pixel belongs to the gland class and 0 indicates that it belongs to the background. Posteriors between 1 and 0.5 are shown with increasing tints of red and posteriors between 0 and 0.5 are shown with increasing tints of blue. Note that in these images posteriors close to 0.5 appear whitish.
Fig. 6: Test set F-scores, Dice indices, and Hausdorff distances as a function of the model parameters: (a) confidence parameter $\tau$, (b) area threshold $\theta$, and (c) majority filter size $f$.

When these results are compared with those of the other methods, we have the following observations: First, MultiStageWithoutAdaptiveBoosting, which is also a multi-stage model but uses the same loss function in all of its stages, is successful to eliminate false positives. However, it cannot sufficiently improve boundary pixel prediction throughout its stages, which leads to a significantly higher number of undersegmentations. This suggests the benefits of automatically adjusting the loss functions of consecutive stages via adaptive boosting. Second, BoundaryAttentionWithMultiTask, which designs a multi-task architecture that includes an additional task to give specific attention to boundary pixel prediction, gives relatively better results for undersegmentations. On the other hand, this method is effective for this specific mistake type at the expense of locating more false positives, as also seen in Fig 4(e). This indicates the effectiveness of learning multiple attentions directly on image data instead of externally defining specific attention type beforehand. The proposed AttentionBoost model adaptively learns multiple attentions by designing a multi-stage network and modulating the attention of each stage by adaptive boosting. Last, BoundaryAttentionWithLossAdjustment is less successful for reducing both undersegmented ground truth objects and false segmented glands. Most probably, it tends to locate glands more than necessary, which also results in missing only a small number of ground truth objects.

V-A Parameter Analysis

AttentionBoost has three external parameters used in its gland segmentation step: the confidence parameter $\tau$, the area threshold $\theta$, and the filter size $f$. We analyze the effects of these parameters on the model's performance. To this end, for each parameter, we fix the selected values of the other two and measure the test set F-score, Dice index, and Hausdorff distance as a function of the parameter of interest. These analyses are depicted in Fig. 6.

The gland segmentation step inputs the average probability map for an image and locates gland objects on this map. For that, it first identifies certain foreground and background pixels, from which the gland objects and background are grown. The confidence parameter $\tau$ determines which pixels are to be considered certain, as given in Eqn. 4. When this parameter is selected too large, only the pixels for which $\bar{y}(q)$ is very close to 1 are selected for the foreground and those for which $\bar{y}(q)$ is very close to 0 are selected for the background. Such average posteriors can only be obtained when the networks at all stages give the same output with high confidence. However, this is not an expected output of our multi-stage network, especially for hard-to-learn pixels, since it is designed with the purpose of correcting the mistakes of one stage in another. Thus, larger $\tau$ values result in selecting a smaller number of certain foreground pixels, which decreases the number of gland objects to be grown. This, in turn, greatly lowers the model's performance (lower F-scores, lower Dice indices, and higher Hausdorff distances). On the other hand, when this parameter is selected too small, almost all pixels are considered certain. This also lowers the performance, by leading to more undersegmented gland objects, since pixels whose $\bar{y}(q)$ is around 0.5 are typically found on gland boundaries and these pixels are considered certain when smaller $\tau$ values are used. This analysis is depicted in Fig. 6(a).

The area threshold $\theta$ is used to eliminate small certain seed regions, from which the gland objects and the background are grown. Too small values cannot eliminate noisy gland objects, which leads to false positives. On the other hand, too large values also eliminate small true glands, which this time leads to false negatives. Both lower the F-score. It is worth noting that this parameter only slightly affects the Dice index and Hausdorff distance. The reason is that both of these measures are weighted averages of the Dice indices and Hausdorff distances calculated on individual gland objects, where the weights are determined by the areas of these objects (see Eqns. 6 and 7). Since this elimination typically applies to small-sized glands, it does not change these measures much. This analysis is depicted in Fig. 6(b).

The last parameter is the size $f$ of the majority filter, which is applied on the grown gland objects to smooth their boundaries. Although it improves the appearance of the gland boundaries, this parameter does not change the number of detected glands and changes their areas only slightly. Thus, it only very slightly affects the performance measures, as shown in Fig. 6(c).

VI Conclusion

This paper presents an error-driven multi-attention learning model for image segmentation. This model, which we call AttentionBoost, relies on designing a multi-stage network and adaptively learning, directly on image data, what image parts (pixels) each stage needs to attend and the level of this attention. To this end, it introduces a new loss adjustment mechanism that, for the first time, uses adaptive boosting for a dense prediction model. This mechanism modulates the attention of each stage to correct the mistakes of its previous stages by adjusting the loss weight of each pixel separately, according to how confident the previous stages are on their predictions for this pixel. We tested our model on the problem of gland instance segmentation in histopathological images. Our experiments revealed that the proposed AttentionBoost model, which makes it possible to learn different attentions for different pixels at the same stage as well as multiple attentions for the same pixel at different stages, leads to more accurate segmentation results compared to existing approaches.

For an unseen image, AttentionBoost obtains the probability map by averaging those estimated by all stages of the multi-stage network. Then, it applies a simple seed-controlled region growing algorithm on the average map. One future research direction is to investigate more sophisticated ways of combining the probability maps of different stages. For example, one could train another neural network that inputs these probability maps and outputs the final segmentation. This work used gland instance segmentation as a showcase application. Applying this model to other instance segmentation problems is another future research direction of this study.

References

  • [1] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. European Conf. Computer Vision, 2014, pp. 818–833.
  • [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
  • [3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [4] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comp. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
  • [5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comp. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
  • [6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comp. Vis. Pattern Recognit., Jun. 2015, pp. 3431–3440.
  • [7] G. Litjens et al., “A survey on deep learning in medical image analysis,” Med. Image Anal., vol. 42, pp. 60–88, 2017.
  • [8] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput. Assist. Intervent., 2015, pp. 234–241.
  • [9] H. Chen, X. Qi, L. Yu, Q. Dou, J. Qin, and P.-A. Heng, “DCAN: Deep contour-aware networks for object instance segmentation from histology images,” Med. Image Anal., vol. 36, pp. 135–146, 2017.
  • [10] Y. Xu et al., “Gland instance segmentation by deep multichannel side supervision,” in Proc. Int. Conf. Med. Image Comput. Comput. Assist. Intervent., 2016, pp. 496–504.
  • [11] Y. Xu et al., “Gland instance segmentation using deep multichannel neural networks,” IEEE Trans. Biomed. Eng., vol. 64, no. 12, pp. 2901–2912, 2017.
  • [12] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.
  • [13] Z. Tu and X. Bai, “Auto-context and its application to high-level vision tasks and 3D brain image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 10, pp. 1744–1757, 2010.
  • [14] K. Li, B. Hariharan, and J. Malik, “Iterative instance segmentation,” in Proc. IEEE Conf. Comp. Vis. Pattern Recognit., Jun. 2016, pp. 3659–3667.
  • [15] H. Shen, R. Wang, J. Zhang, and S. J. McKenna, “Boundary-aware fully convolutional network for brain tumor segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput. Assist. Intervent., 2017, pp. 433–441.
  • [16] S. Gidaris and N. Komodakis, “Detect, replace, refine: Deep structured prediction for pixel wise labeling,” in Proc. IEEE Conf. Comp. Vis. Pattern Recognit., Jun. 2017, pp. 5248–5257.
  • [17] A. Romero, M. Drozdzal, A. Erraqabi, S. Jegou, and Y. Bengio, “Image segmentation by iterative inference from conditional score estimation,” arXiv preprint arXiv:1705.07450, 2017.
  • [18] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proc. IEEE Int. Conf. Comp. Vis., 2015, pp. 2650–2658.
  • [19] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, 2017.
  • [20] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. J. Cardoso, “Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Springer, 2017, pp. 240–248.
  • [21] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comp. Vis., 2017, pp. 2980–2988.
  • [22] H. Schwenk and Y. Bengio, “Boosting neural networks,” Neural Comput., vol. 12, no. 8, pp. 1869–1887, 2000.
  • [23] D. Medera and S. Babinec, “Incremental learning of convolutional neural networks,” in Proc. Int. Joint Conf. Comput. Intell., 2009, pp. 547–550.
  • [24] Y. Gao, W. Rong, Y. Shen, and Z. Xiong, “Convolutional neural network based sentiment analysis using adaboost combination,” in Proc. Int. Joint Conf. Neural Networks, 2016, pp. 1333–1338.
  • [25] L. Wang, B. Zhang, J. Han, L. Shen, and C.-S. Qian, “Robust object representation by boosting-like deep learning architecture,” Signal Process. Image Comm., vol. 47, pp. 490–499, 2016.
  • [26] S. Han, Z. Meng, A.-S. Khan, and Y. Tong, “Incremental boosting convolutional neural network for facial action unit recognition,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 109–117.
  • [27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
  • [28] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
  • [29] K. Sirinukunwattana et al., “Gland segmentation in colon histology images: The GlaS Challenge Contest,” arXiv preprint arXiv:1603.00275v2, 2016.