Computed tomography (CT) is the standard imaging modality used in radiation treatment planning. However, the low soft-tissue contrast in CT restricts the segmentation accuracy achievable for various soft-tissue organ-at-risk structures. Advanced methods developed for medical images typically combine features computed from deeper layers [2, 3] to focus and improve segmentation.
Recent developments in self attention networks enable aggregation of long-range contextual information, which has been shown to produce more accurate segmentations of real-world images [5, 6]. Self attention aggregates features from all pixels within a context, such as the whole image, in order to build support for a particular voxel. Such feature aggregation requires intensive computation to model the long-range dependencies [7, 8]. Therefore, we developed a new approach that employs local computations using 2D self attention blocks.
Our approach uses the insight that normal organs exhibit regularity in their spatial location and in their relation to one another. It exploits this regularity to simplify feature aggregation by processing information from local self attention blocks to model the pixel context. These blocks pass information between each other as the image (or its feature representation) is scanned in raster-scan order, thereby deriving the attention flow. We also developed a stacked attention formulation that adds attention layers to improve inference, both by increasing the contextual view and by capturing information from different parts of the image.
Prior works attempted to reduce the computational burden of self attention by modeling relations between objects in an image, by successive pooling, and by aggregating information spatially and across feature channels. Another approach reduced computations by considering only the pixels lying along the horizontal and vertical directions of each pixel. However, this ignores relations between pixels in diagonal orientations.
Our approach improves on the segmentation and computational performance of prior methods, as shown by our results. Figure 1 shows an example case with the self attention map generated for a pixel (indicated by a yellow dot) randomly placed within the submandibular glands (Figure 1A), using the non-local network, hereafter referred to as SA (Figure 1B), and our method with a single attention layer (single attention block, SAB; Figure 1C) and two attention layers (dual attention block, DAB; Figure 1D). As seen, the attention maps derived from our approach tend to focus within local structures while also capturing information from relevant structures. For reproducible research, we will make the code available upon acceptance.
Given an image, the goal of the proposed method is to produce a segmentation of one or more structures quickly and with a small memory requirement. We achieve this by using local attention blocks. An attention block is a region within which local self attention is computed. Adding multiple attention layers enables attention to flow and increases the contextual field, thereby modeling long-range contextual information (Figure 2).
2.1 Self Attention
In standard self attention, given an image represented by a feature $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $W$, $H$ denote the feature channel size, width, and height, three feature embeddings of size $\hat{C} \times H \times W$, where $\hat{C} \leq C$, corresponding to query $Q$, key $K$, and value $V$ are computed by projecting the feature through convolutional filters (Figure 2C). The attention map is calculated by taking a matrix product of the query and the key features as:

$$A_{j,i} = \mathrm{softmax}\left(Q_i^{\top} K_j\right),$$

where $A_{j,i}$ is the result of the softmax computation and measures the impact of the feature at position $i$ on the feature at position $j$, whereby similar features give rise to higher attention values. Thus the attention for each feature roughly corresponds to its similarity to all other features in the context. The attention-aggregated feature representation is computed by multiplying with the value as:

$$Y_j = \sum_{i} A_{j,i} V_i.$$
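As a concrete illustration, the query/key/value computation and softmax aggregation above can be sketched with NumPy on a toy feature map. This is a minimal sketch, not the paper's implementation: random projection matrices `W_q`, `W_k`, `W_v` stand in for the learned convolutional filters, and all variable names are ours.

```python
# Toy sketch of standard self attention on a flattened C x N feature map.
# Random matrices replace the learned 1x1 convolution projections.
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 8, 4, 4          # toy channel / height / width
C_hat = C // 2             # embedding channel size (here C/2)
N = H * W                  # number of spatial positions

X = rng.normal(size=(C, N))              # feature map, flattened spatially
W_q = rng.normal(size=(C_hat, C))        # stand-ins for learned projections
W_k = rng.normal(size=(C_hat, C))
W_v = rng.normal(size=(C_hat, C))

Q, K, V = W_q @ X, W_k @ X, W_v @ X      # query, key, value: C_hat x N

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# A[j, i] measures the impact of the feature at position i on position j;
# each row of A is a probability distribution over all N positions.
A = softmax(Q.T @ K, axis=1)             # N x N attention map
Y = V @ A.T                              # aggregated features, C_hat x N
```

Each output position in `Y` is a convex combination of all value vectors, which is what gives self attention its global contextual view, and also its quadratic cost in `N`.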
Computing a global self attention map requires $HW \times HW$ computations to cover an image feature of size $H \times W$, which in turn leads to a time and space complexity of $O(N^2)$, where $N = H \times W$. This can quickly become prohibitive even for standard medical image volumes. Our approach simplifies these computations as described in the following subsection.
2.2 Local block-wise self attention
Our approach is similar to the image transformer approach for image generation. However, instead of employing an encoder for each pixel, which ignores the correlation between spatially adjacent pixels, we use 2D convolutions to jointly encode multiple pixels. Our approach reduces the computations compared with global self attention methods by restricting feature aggregation to within fixed-size attention blocks. When using non-overlapping attention blocks, an image represented by features of size $H \times W$ is divided into $(H/B) \times (W/B)$ blocks by scanning in raster-scan order, where $B \times B$ is the size of the attention block. Non-overlapping attention blocks may result in block-like artifacts on the attention map. Therefore, we use overlapping attention blocks with stride $s < B$ (Figure 2(b)). Overlapping strides also facilitate information flow when stacking attention layers (red arrow in Figure 2d), as described below. The number of computations required to calculate the attention of a single block is $B^2 \times B^2$, which corresponds to $O(N B^2)$ for the whole image (we assume the image feature size is divisible by $B$; otherwise, padding is required).
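The raster-scan partitioning into overlapping or non-overlapping blocks can be sketched as follows; the helper name and shapes are illustrative, not taken from the paper's code. With stride $s = B$ the blocks tile the feature map, while $s < B$ makes adjacent blocks share pixels.

```python
# Sketch: carve a (C, H, W) feature map into B x B attention blocks
# in raster-scan order, with configurable stride s.
import numpy as np

def attention_blocks(X, B, s):
    """Yield B x B spatial blocks of a (C, H, W) feature, raster order."""
    C, H, W = X.shape
    for top in range(0, H - B + 1, s):         # rows first (raster scan)
        for left in range(0, W - B + 1, s):
            yield X[:, top:top + B, left:left + B]

X = np.zeros((8, 8, 8))                        # toy feature map
non_overlap = list(attention_blocks(X, B=4, s=4))  # s = B: 2x2 = 4 blocks
overlap = list(attention_blocks(X, B=4, s=2))      # s < B: 3x3 = 9 blocks
```

Within each yielded block, local self attention is computed exactly as in the global formulation, just over $B^2$ positions instead of $N$.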
Using attention blocks restricts the contextual field (within which features are aggregated) to a $B \times B$ region for one attention layer. Therefore, we increase the contextual field by adding a second attention layer (layer 2 in Figure 2d). We call single-layer attention a single attention block (SAB) and two-layer attention a dual attention block (DAB). Each added attention layer multiplies the computational cost, giving $l \times B^2 \times B^2$ computations per block, where $l$ is the number of attention layers ($l = 2$ for DAB).
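A back-of-the-envelope comparison makes the savings concrete. Assuming the penultimate-layer feature size used later in the paper (spatial size 128×128) and non-overlapping blocks for simplicity, the block-wise attention-map cost stays well below the global $O(N^2)$ cost for every ablated block size:

```python
# Attention-map operation counts: global self attention vs. block-wise
# attention with two layers (DAB), for a 128 x 128 spatial feature.
H = W = 128
N = H * W

global_ops = N * N                  # O(N^2) comparisons for global SA

for B in (24, 36, 48):              # block sizes from the ablation study
    n_blocks = (H // B) * (W // B)  # non-overlapping; remainder ignored
    block_ops = n_blocks * B**4     # B^2 x B^2 comparisons per block
    dab_ops = 2 * block_ops         # second attention layer doubles cost
    assert dab_ops < global_ops     # block-wise stays cheaper throughout
```

Even with the doubling from the second layer, the largest block size tested keeps the attention-map computation well under the global count.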
2.3 Implementation and Network Structure
All networks were implemented using the PyTorch library [paszke2017automatic] and trained on an Nvidia GTX 1080Ti GPU with 12GB memory. The ADAM algorithm [kingma2014adam] with an initial learning rate of 2e-4 was used during training. We implemented the attention method using U-net, which we modified to include batch normalization after each convolution layer to speed up convergence.
3 Datasets and evaluation metrics
A total of 96 head and neck CT datasets were analyzed. All networks were trained using 48 patients from an internal archive and tested on 48 patients from the external public domain database for computational anatomy (PDDCA). Training was performed by cropping the images to the head area and resizing image patches to 256×256, resulting in a total of 8000 training images. Models were constructed to segment multiple organs present in both datasets: the parotid glands, submandibular glands, brain stem, and mandible. Segmentation accuracy was evaluated by comparing with expert delineations using the Dice similarity coefficient (DSC).
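The DSC used for evaluation is the standard overlap measure $\mathrm{DSC} = 2|A \cap B| / (|A| + |B|)$; a minimal sketch on binary masks (mask contents here are illustrative, not from the datasets):

```python
# Dice similarity coefficient between two binary segmentation masks.
import numpy as np

def dice(pred, truth):
    """DSC = 2|A intersect B| / (|A| + |B|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0                       # both masks empty: perfect match
    return 2.0 * np.logical_and(pred, truth).sum() / denom

a = np.zeros((4, 4)); a[:2, :] = 1       # 8 foreground pixels
b = np.zeros((4, 4)); b[:3, :] = 1       # 12 foreground pixels, overlap 8
score = dice(a, b)                       # 2*8 / (8 + 12) = 0.8
```

A DSC of 1.0 indicates perfect agreement with the expert delineation, 0.0 indicates no overlap.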
The attention blocks were placed in the penultimate layer of U-net, with feature size 64×128×128 (C×H×W); each unit consisting of CONV, BN, and ReLU is treated as a layer. Details of attention block placement are listed in the supplementary document. For equitable comparison, other modules or methods were also implemented on the penultimate layer. We set the self attention feature channel embedding to $\hat{C} = C/2 = 32$. Ablation tests were conducted to study the influence of (a) single vs. multiple attention layers, (b) placement of attention blocks (penultimate vs. last layer), (c) attention block sizes (B = 24, 36, 48), and (d) overlapping ($s < B$) vs. non-overlapping ($s = B$) attention blocks.
5.1 Segmentation accuracy
The results table compares the segmentation accuracy achieved by the various methods (mean and standard deviation), the computational complexity, the number of model parameters with the % increase over standard U-net, and the computing time measured as an average during training with the % increase in computations over U-net. Our method (DAB) produced the most accurate segmentation for all the analyzed organs. It required fewer computations and parameters than all methods except the criss-cross attention (CCA) method, and was the fastest to compute among the self attention methods. The multiplication by 2 in the complexity of DAB and CCA is due to the addition of a second attention layer. Figure 3 shows two representative examples with the algorithm and expert delineations; arrows indicate problematic segmentations. Whereas U-net and SA over-segmented the parotid gland (top row) and multiple methods under-segmented the right parotid gland (bottom row), DAB closely approximated the expert delineation.
5.2 Ablation experiments
Attention layers: As shown in Table 1(a) and 1(b), adding attention block layers in the penultimate layer improves the segmentation performance with little additional computational time. Placing attention in the last layer slightly increased accuracy. Besides CCA, it is infeasible to add even a single attention layer in the last layer using the other methods due to memory limitations.
Attention block size: There is only a minimal difference in the segmentation accuracy by increasing the block sizes (see Table 1(c),1(d)).
Overlapping vs. non-overlapping attention blocks: The difference between overlapping and non-overlapping blocks becomes more obvious at smaller block sizes, where overlapping blocks achieved higher accuracy, especially when the attention blocks were placed in the last layer.
5.3 Qualitative results using attention map
Figure 4 shows attention maps for representative examples computed using SAB and DAB placed on the penultimate layer. Increasing the number of attention layers from one (SAB) to two (DAB) expands the context of the structures involved, from local in SAB to adjacent and relevant structures in DAB. For comparison, the attention map computed using the SA method is also shown; it includes all portions of the image, which does not lead to improved performance (Table 1).
We developed a computationally efficient approach that achieved similar or better performance than state-of-the-art self attention methods for normal organ segmentation in head and neck CT scans. While computing global attention as in prior methods [8, 13] clearly has the benefit of incorporating information for structures with large variability in their location, such as in scene parsing, such methods do not necessarily improve performance for normal organ segmentation compared to block-wise attention.
In conclusion, we developed a novel block-wise self attention approach for segmenting normal organ structures from head and neck CT scans. Our results show that our method is computationally more efficient and achieves better segmentation than multiple state-of-the-art self attention methods.
-  Whitfield, G.A., Price, P., Price, G.J., Moore, C.J.: Automated delineation of radiotherapy volumes: are we going in the right direction? The British Journal of Radiology 86(1021) (2013) 20110718
-  Dou, T., Zhang, L., Zheng, H., Zhou, W.: Local and non-local deep feature fusion for malignancy characterization of hepatocellular carcinoma. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2018) 472–479
-  Oktay, O., Schlemper, J., Le Folgoc, L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N., Kainz, B., Glocker, B., Rueckert, D.: Attention u-net: Learning where to look for the pancreas. In: Proc. Machine Learning in Medical Imaging. (2018)
-  Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A.: Image transformer. arXiv preprint arXiv:1802.05751 (2018)
-  Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. arXiv preprint arXiv:1811.11721 (2018)
-  Fu, J., Liu, J., Tian, H., Fang, Z., Lu, H.: Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983 (2018)
-  Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. (2017) 5998–6008
-  Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 7794–7803
-  Wang, Y., Zhou, Y., Shen, W., Park, S., Fishman, E., Yuille, A.: Abdominal multi-organ segmentation with organ-attention networks and statistical fusion. CoRR abs/1804.08414 (2018)
-  Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 3588–3597
-  Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
-  Raudaschl, P.F., Zaffino, P., Sharp, G.C., Spadea, M.F., Chen, A., Dawant, B.M., Albrecht, T., Gass, T., Langguth, C., Lüthi, M., et al.: Evaluation of segmentation methods on head and neck ct: Auto-segmentation challenge 2015. Medical physics 44(5) (2017) 2020–2036
-  Zhao, H., Zhang, Y., Liu, S., Shi, J., Change Loy, C., Lin, D., Jia, J.: Psanet: Point-wise spatial attention network for scene parsing. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 267–283
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer (2015) 234–241