We propose the autofocus convolutional layer for semantic segmentation with the objective of enhancing the capabilities of neural networks for multi-scale processing. Autofocus layers adaptively change the size of the effective receptive field based on the processed context to generate more powerful features. This is achieved by parallelising multiple convolutional layers with different dilation rates, combined by an attention mechanism that learns to focus on the optimal scales driven by context. By sharing the weights of the parallel convolutions we introduce only a small number of trainable parameters. The proposed autofocus layer can be easily integrated into existing networks to cope with biological variability among tasks. We evaluate our models on the challenging tasks of multi-organ segmentation in pelvic CT and brain tumor segmentation in MRI and achieve very promising performance.
Semantic segmentation is a fundamental problem in medical image analysis. Automatic segmentation systems can improve clinical pipelines, facilitating quantitative assessment of pathology, treatment planning and monitoring of disease progression. They can also facilitate large-scale research studies, by extracting measurements from magnetic resonance images (MRI) or computational tomography (CT) scans of large populations in an efficient and reproducible manner.
For high performance, segmentation algorithms are required to use multi-scale context , while still aiming for pixel-level accuracy. Multi-scale processing provides detailed cues, such as texture information of a structure, combined with contextual information, such as a structure’s surroundings, which can facilitate decisions that are ambiguous when based only on local context. Note that such a mechanism is also part of the human visual system, via foveal and peripheral vision.
A large volume of research has sought algorithms for effective multi-scale processing; an overview of traditional approaches can be found in . Contemporary segmentation systems are often powered by convolutional neural networks (CNNs). The network architectures proposed to effectively capture image context can be broadly grouped into three categories. The first type creates an image pyramid at multiple scales: the image is down-sampled and processed at different resolutions. Farabet et al. trained the same filters on all such versions of an image to achieve scale invariance . In contrast, DeepMedic  proposed learning dedicated pathways for several scales, enabling 3D CNNs to extract patterns from a larger context in a computationally efficient manner. The second type uses an encoder that gradually down-samples to capture more context, followed by a decoder that learns to upsample the segmentations, combining multi-scale context via skip connections . Later extensions include U-net , which uses a larger decoder to learn upsampling features instead of segmentations as in . Learning to upsample with a decoder, however, increases model complexity and computational requirements, and downsampling may not even be necessary. Finally, driven by this idea, [3, 16] proposed dilated convolutions to process greater context without ever downsampling the feature maps. Taking this further, DeepLab  introduced the Atrous Spatial Pyramid Pooling (Aspp) module, in which dilated convolutions with different rates are applied in parallel to capture multi-scale information. The activations from all scales are, however, naively fused via summation or concatenation.
We propose the autofocus layer, a novel module that enhances the multi-scale processing of CNNs by learning to select the ‘appropriate’ scale for identifying different objects in an image. Our work on autofocus shares similarities with Aspp in that we also use parallel dilated convolutional filters to capture both local and more global context. The crucial difference is that instead of naively aggregating features from all scales, the autofocus layer adaptively chooses the optimal scale to focus on in a data-driven, learned manner. In particular, our autofocus module uses an attention mechanism  to indicate the importance of each scale when processing different locations of an image (Fig. 1). The computed attention maps, one per scale, serve as filters for the patterns extracted at that scale. Autofocus also enhances interpretability of a network as the attention maps reveal how it locally ‘zooms in or out’ to segment different context. Compared to the use of attention in , our solution is modular and independent of architecture.
We extensively evaluate and compare our method with strong baselines on two tasks: multi-organ segmentation in pelvic CT and brain tumor segmentation in MRI. We show that thanks to its adaptive nature, the autofocus layer copes well with biological variability in the two tasks, improving performance of a well-established model. Despite its simplicity, our system is competitive with more elaborate pipelines, demonstrating the potential of the autofocus mechanism. Additionally, autofocus can be easily incorporated into existing architectures by replacing a standard convolutional layer.
As they are fundamental to our work, we first present the basics of dilated convolutions [3, 16] while introducing notation. The standard 3D dilated convolutional layer at depth $l$ with dilation rate $r^l$ can be represented as a mapping $\mathbf{F}^{l-1} \mapsto \mathbf{F}^{l}$, where $\mathbf{F}^{l-1}$ and $\mathbf{F}^{l}$ are input and output tensors with $C$ channels (feature maps) of size $H \times W \times D$ (height, width, depth). For neurons in $\mathbf{F}^{l}$, the size $\varphi^l$ of their receptive field on the input image can be controlled via $r^l$. For dilated convolution layers with kernel size $\theta^l$, $\varphi^l$ can be derived recursively as follows:

$$\varphi^l = \varphi^{l-1} + \left(\theta^l - 1\right) r^l s^{l-1} \qquad (1)$$

Here $s^{l-1}$ denotes the stride of the receptive field at layer $l-1$, which is the product of the strides of the kernels in preceding layers. It can be observed from Eqn (1) that greater context can be captured by increasing the dilation rate $r^l$, but in less detail, as the input signal is probed more sparsely. Thus a greater $r^l$ leads to a ‘zoom out’ behavior. Usually, the dilation rate $r^l$ is a hyperparameter that is manually set and fixed for each layer; standard convolution is the special case $r^l = 1$. Below we describe the autofocus mechanism, which adaptively chooses the optimal dilation rate for different areas of the input.
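As a quick sanity check of Eqn (1), the recursive growth of the receptive field (kernel size theta, dilation r, kernel stride) can be computed with a short helper; the function name, the (kernel, dilation, stride) layer encoding and the starting value of one input pixel are our own choices for illustration:

```python
def receptive_field(layers, phi0=1):
    """Recursively apply Eqn (1): phi_l = phi_{l-1} + (theta_l - 1) * r_l * s_{l-1}.

    `layers` is a list of (kernel_size, dilation, stride) tuples; the
    receptive-field stride s is the running product of the kernel strides.
    """
    phi, s = phi0, 1
    for theta, r, stride in layers:
        phi = phi + (theta - 1) * r * s
        s *= stride
    return phi

# Three 3x3x3 layers with stride 1: all dilation 1 vs. dilation 2 in the last two.
print(receptive_field([(3, 1, 1)] * 3))                    # 7
print(receptive_field([(3, 1, 1), (3, 2, 1), (3, 2, 1)]))  # 11
```

With stride 1 throughout, raising the dilation of the last two layers from 1 to 2 enlarges the receptive field from 7 to 11 voxels per axis without adding parameters, which is exactly the ‘zoom out’ behavior described above.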
Unambiguously classifying different objects in an image is likely to require different combinations of local and global information. For example, large structures may be better segmented with a large receptive field, at the expense of fine details, while small objects may require focusing on high-resolution local information. Consequently, architectures that statically define multi-scale processing may be suboptimal. Our adaptive solution, the autofocus module, is summarized in Fig. 1 and formalized in the following.
Given the activations of the previous layer, $\mathbf{F}^{l-1}$, we capture multi-scale information by processing them in parallel via $K$ convolutional layers with different dilation rates $r_k$. These produce $K$ tensors $\mathbf{F}^{l}_{k}$ (Fig. 1(b)), each set to have the same number of channels $C^l$. They detect patterns at different scales, which we merge in a data-driven manner by introducing a soft attention mechanism .
Within the module we construct a small attention network (Fig. 1(a)) that processes $\mathbf{F}^{l-1}$. In this work it consists of two convolutional layers. The first, $\operatorname{Conv}_1$, applies $3 \times 3 \times 3$ kernels and produces half as many channels as $\mathbf{F}^{l-1}$. The second, $\operatorname{Conv}_2$, applies $1 \times 1 \times 1$ filters and produces a tensor with $K$ channels, one per scale. It is followed by an element-wise softmax that normalizes the activations for each voxel to add up to one. Let this normalized output be $\mathbf{att}^{l}$. Formally:

$$\mathbf{att}^{l} = \operatorname{softmax}\left(\operatorname{Conv}_2\left(\operatorname{Conv}_1\left(\mathbf{F}^{l-1}\right)\right)\right) \qquad (2)$$
In the above, $\mathbf{att}^{l}_{k}$ is the attention map that corresponds to the $k$-th scale. For any specific spatial location (voxel), the corresponding values from the $K$ attention maps can be interpreted as how much focus to put on each scale. Thus the final output of the autofocus layer is computed by fusing the outputs of the parallel dilated convolutions as follows:

$$\mathbf{F}^{l} = \sum_{k=1}^{K} \mathbf{att}^{l}_{k} \odot \mathbf{F}^{l}_{k} \qquad (3)$$
where $\odot$ denotes element-wise multiplication. Note that the attention weights $\mathbf{att}^{l}_{k}$ are shared across all channels of the tensor $\mathbf{F}^{l}_{k}$ for scale $k$. Since the attention maps are predicted by a fully convolutional network, different attention is predicted for each voxel, driven by the image context, for an optimal choice of scale (Fig. 1(c)).
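As a minimal sketch of this fusion step (in NumPy, not the authors' PyTorch code; the shapes, names and the random tensors standing in for the $K$ parallel dilated-convolution outputs are illustrative), the per-voxel softmax over scales and the attention-weighted sum can be written as:

```python
import numpy as np

def autofocus_fuse(responses, attention_logits):
    """Fuse K parallel dilated-conv outputs with per-voxel soft attention.

    responses:        array of shape (K, C, H, W, D), one tensor per dilation rate
    attention_logits: array of shape (K, H, W, D), unnormalized per-scale scores
    """
    # Element-wise softmax over the K scales at every voxel.
    e = np.exp(attention_logits - attention_logits.max(axis=0, keepdims=True))
    att = e / e.sum(axis=0, keepdims=True)          # (K, H, W, D), sums to 1 over K
    # Weighted sum over scales; attention is shared across the C channels.
    return (att[:, None] * responses).sum(axis=0)   # (C, H, W, D)

K, C, H, W, D = 4, 8, 5, 5, 5
rng = np.random.default_rng(0)
resp = rng.normal(size=(K, C, H, W, D))
logits = rng.normal(size=(K, H, W, D))
out = autofocus_fuse(resp, logits)
print(out.shape)  # (8, 5, 5, 5)
```

Because the attention weights are non-negative and sum to one at every voxel, the fused output is a convex combination of the per-scale responses, i.e. the layer interpolates between scales rather than merely summing them as in Aspp.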
The increase in representational power offered by each autofocus layer naturally comes with increased computational requirements, as the module is based on the parallelism of dilated convolutional layers. Therefore an appropriate balance should be sought; we investigate this in Sec. 3, with very promising results.
The size of some anatomical structures, such as bones and organs, may vary while their overall appearance remains rather similar. For others, size may correlate with appearance; for instance, the texture of large, developed tumors differs from that of early-stage small tumors. This suggests that scale invariance could be leveraged to regularize learning, but must be done appropriately. We therefore make the $K$ parallel filters in an autofocus layer share parameters. This makes the number of trainable parameters independent of $K$, with only the attention module adding parameters over a standard convolution. As a result, each parallel filter seeks patterns with similar appearance but of different sizes. The network is hence adaptively scale-invariant: the attention mechanism chooses the scale in a data-driven manner, unlike Farabet et al. , whose network learns filters shared between different scales but naively concatenates all their responses.
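To make the parameter accounting concrete, here is a back-of-the-envelope count following the description above (parallel 3x3x3 filters shared across K scales, plus an attention head that first halves the channels with 3x3x3 kernels and then maps to K channels with 1x1x1 kernels); the channel numbers are illustrative and biases are omitted:

```python
def conv3d_params(c_in, c_out, k=3):
    """Weight count of a k*k*k 3D convolution (biases omitted for simplicity)."""
    return c_in * c_out * k ** 3

def autofocus_params(c_in, c_out, K, k=3):
    """Shared parallel filters cost one conv's weights, independent of K,
    plus the small two-layer attention network."""
    shared = conv3d_params(c_in, c_out, k)
    attention = conv3d_params(c_in, c_in // 2, 3) + conv3d_params(c_in // 2, K, 1)
    return shared + attention

def aspp_like_params(c_in, c_out, K, k=3):
    """Unshared parallel convolutions grow linearly with the number of scales K."""
    return K * conv3d_params(c_in, c_out, k)

c = 64
print(autofocus_params(c, c, K=4))   # 166016: one shared conv + attention head
print(aspp_like_params(c, c, K=4))   # 442368: four independent convs
```

Unlike an Aspp-style block, whose cost grows linearly with $K$, the shared-weight autofocus layer pays a fixed overhead for the attention head regardless of the number of scales.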
The proposed autofocus layer can be integrated into existing architectures, replacing standard or dilated convolutions, to improve their multi-scale processing capabilities. To demonstrate this, we chose DeepMedic (Dm)  with residual connections  as a starting point. Dm uses different pathways with high- and low-resolution inputs for multi-scale processing. Instead, we keep only its high-resolution pathway and seek to empower it with our method. First, we enhance it with standard dilated convolutions with rate 2 in its last 6 hidden layers to enlarge its receptive field, arriving at the Basic model that serves as another baseline. We then define a family of AFNets by converting the last $i$ hidden layers of Basic to autofocus layers, denoted “Afn-$i$”, where $i \in \{1, \dots, 6\}$. Fig. 2 shows AFNet-4. The proposed AFNets are trained end-to-end.
We extensively evaluate AFNets on the tasks of multi-organ and brain tumor segmentation. Specifically, on both tasks we perform: (1) an ablation study in which we successively convert more layers of the Basic network to autofocus layers to explore their impact, and (2) a comparison of AFNets with strong baselines. Finally, (3) we evaluate on the public benchmark BRATS’15 and show that, despite its simplicity, our method competes with state-of-the-art pipelines, demonstrating its potential.
We compare AFNets with the previously defined Basic model to show the contribution of the autofocus layer over standard dilated convolutions. Similarly, we compare with DeepMedic , denoted Dm, to contrast our adaptive multi-scale processing with its static multi-scale pathways. Finally, we place an Aspp module  on top of Basic; comparing this against Afn-1 shows the contribution of the attention mechanism. Aspp-c and Aspp-s denote fusion of the Aspp activations via concatenation and summation, respectively. Source code and pretrained models in the PyTorch framework are available online at: https://github.com/yaq007/Autofocus-Layer.
Material: We use two databases of pelvic CT scans, collected from patients diagnosed with prostate cancer in different clinical centers. The first, referred to as Add, contains 86 scans with a varying number of 512×512 slices and 3 mm inter-slice spacing. Uw consists of 34 scans of 512×512 slices with 1 mm inter-slice spacing. Expert oncologists manually delineated the following structures in all images: prostate gland, seminal vesicles (SV), bladder, rectum, left femur and right femur. Each scan is normalized so that its intensities have zero mean and unit variance. We also re-sample Uw to the spacing of Add. To produce a stringent test of the models’ generalization, we train them for this multi-class problem on the Add data and then evaluate them on the Uw data.
Configuration details: The Basic, Aspp and Afn models were trained with the ADAM optimizer for 300 epochs to minimize the soft Dice loss. Each batch consists of 7 image segments. The learning rate starts at 0.001 and is reduced to 0.0001 after 200 epochs. We use the same set of dilation rates for both the Aspp and autofocus modules. Training an AFNet takes around 20 hours on 2 NVIDIA TITAN X GPUs. The performance of DeepMedic was obtained by training the public software  with default parameters, but without augmentation and with each class sampled equally, similar to the other methods.
Material: The training database of BRATS’15  consists of multi-modal MR scans of 274 cases, along with corresponding annotations of the tumors. We normalize each scan so that intensities belonging to the brain have zero mean and unit variance. For our ablation study, we train all models on the same 193 subjects and evaluate their performance on 54 subjects. These subsets were chosen randomly and include both high- and low-grade gliomas. Results on the remaining 23 cases are not reported, as they were used for configuration during development. Following the standard protocol, we report performance for segmenting the whole tumor, the core and the enhancing tumor. Finally, to compare with other methods, we train AFNet-6 on all 274 images, segment the 110 test cases of BRATS’15 (for which no annotations are publicly available) and submit the predictions for online evaluation.
Ablation study on BRATS’15 training database via cross-validation on 54 random held-out cases. Dice scores shown in format mean(standard deviation).
Models compared: Afn-6, peres1†*, bakas1†, kamnk1/Dm , kayab1*, isenf1*.
Results from the ablation study on the pelvic CT database and the BRATS database are summarized in Table 1 and Table 2, respectively. We observe the following: (a) Building Afn-1 by converting the last layer of Basic to autofocus improves performance, and (b) the gains surpass those of the popular Aspp for most classes of both tasks. It is important to note that Aspp adds multiple parallel convolutional layers without sharing weights between them. This incurs a large increase in the number of parameters, which partly explains the improvements of Aspp over Basic (see Table 3). (c) Converting more layers of the Basic baseline to autofocus layers tends to improve performance. An exception is Afn-4 vs. Afn-5/6 on the Uw dataset, which we attribute to randomness in training and suboptimal optimization. (d) Empowering the high-resolution pathway of DeepMedic with adaptive autofocus outperforms the gains from the static second pathway on both pelvic and brain tumor segmentation, except for the enhancing tumor. We speculate that the gains are more pronounced in the former task because of the greater variation in the size of its structures, where the adaptive nature of autofocus shines. Finally, we note that by sharing weights across scales, AFNets have a small number of trainable parameters (Table 3), which could enable rapid learning from little data; we leave this for future work. On the downside, the multiple scales in each autofocus layer increase memory and computation requirements.
Performance on the test data of BRATS’15, obtained via the online evaluation platform, is shown in Table 4, along with other top published methods. Afn-6 compares favorably to the semi-automatic methods that topped the BRATS’15 challenge [2, 14], as well as to DeepMedic with its second, static lower-resolution pathway. Note that in , high- and low-grade gliomas were separated by visual inspection and then passed to an appropriately specialized CNN, giving that method an advantage over the others. Our model is surpassed only by the pipelines of  and , which both used ensembles of CNNs with deep supervision and more aggressive data augmentation. The promising performance obtained by our simple method indicates the potential of the autofocus layer, which can be adopted in more elaborate systems.
We proposed the autofocus convolutional layer for the segmentation of biomedical images. An autofocus layer can adapt the network’s receptive field at different spatial locations in a data-driven manner. Our extensive evaluation of AFNets shows that they cope well with biological variability in different tasks and generalize well to both MR and CT images. We have shown that the autofocus convolutional layer can be integrated into existing network architectures to substantially increase their representational power, with only a small increase in model parameters. In addition, the interpretability of autofocus layers can aid the understanding of deep learning systems. Investigating the potential of autofocus modules in regression problems would be interesting future work.
G.W.C. and Y.Q. were partially supported by Guangzhou Science and Technology Planning Project (Grant No. 201704030051).
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. ICLR (2015)
Bakas, S., et al.: GLISTRboost: Combining multimodal MRI segmentation, registration, and biophysical tumor growth modeling with gradient boosting machines for glioma segmentation. In: MICCAI BraTS Challenge. Springer (2015)
Galleguillos, C., Belongie, S.: Context based object categorization: A critical survey. Computer vision and image understanding 114(6), 712–722 (2010)