ComBiNet
ComBiNet: Compact Convolutional Bayesian Neural Network for Image Segmentation
Fully convolutional U-shaped neural networks have largely been the dominant approach for pixel-wise image segmentation. In this work, we tackle two defects that hinder their deployment in real-world applications: 1) Predictions lack uncertainty quantification that may be crucial to many decision making systems; 2) Large memory storage and computational consumption demanding extensive hardware resources. To address these issues and improve their practicality we demonstrate a few-parameter compact Bayesian convolutional architecture, that achieves a marginal improvement in accuracy in comparison to related work using significantly fewer parameters and compute operations. The architecture combines parameter-efficient operations such as separable convolutions, bi-linear interpolation, multi-scale feature propagation and Bayesian inference for per-pixel uncertainty quantification through Monte Carlo Dropout. The best performing configurations required fewer than 2.5 million parameters on diverse challenging datasets with few observations.
Image segmentation is the pixel-level computer vision task of segregating an image into discrete, semantically meaningful regions. Among various algorithms, convolutional neural networks (CNNs) have been key to this task, demonstrating outstanding performance [10, 8, 22, 11, 12, 9, 29, 15, 1]. CNNs are able to express predictions as pixel-wise output masks by learning appropriate feature representations in an end-to-end fashion, while allowing inputs of various sizes to be processed. This is especially useful in inferring object support relationships for robotics, autonomous driving or healthcare, as well as scene geometry [12, 16, 23].
A practical drawback of regular CNNs is that they are unable to capture their uncertainty, which is crucial for many safety-critical applications [6]. Bayesian CNNs [5] adopt Bayesian inference to provide a principled uncertainty estimate on top of the segmentation masks. However, as the research community seeks to improve accuracy and better capture information in a wider range of applications, CNN architectures become deeper and more complicated in their connectivity [8, 11, 14, 29]. As a result, they are increasingly compute- and memory-demanding, and a regular modern CNN architecture cannot be easily adopted for Bayesian inference. As an analytical prediction of uncertainty is not tractable with such architectures, it has to be approximated through Monte Carlo sampling with multiple runs through the network. The increased runtime cost, primarily due to sampling, has been a limiting factor of Bayesian CNNs in real-world image segmentation.
To address the aforementioned issues of lacking uncertainty quantification in regular CNNs and extensive execution cost, the contribution of this work is in improving the hardware performance of Bayesian CNNs for image segmentation, while also considering an efficient pixel-wise uncertainty quantification. Our approach builds on recent successes in improving software-hardware performance [8, 11, 3, 22, 9, 12, 27] and extends these into a novel Bayesian CNN architectural template as shown in Figure 1. Specifically, we focus on few-parameter/few-operation models which decrease the runtime cost of each feedforward pass, and present a compact design named ComBiNet. Monte Carlo Dropout [6] is used for Bayesian inference. We demonstrate ComBiNet's superior performance on the few-sample, video-based CamVid [2] dataset. We also achieve good performance on a bacteria segmentation task from a database of darkfield microscopy images [20]. Based on the results obtained, we demonstrate designs that achieve accuracy comparable to the state-of-the-art [1, 12, 11, 7, 17, 26, 28, 19], while requiring only a fraction of the parameters or operations in comparison.
CNN-based architectures for image segmentation typically consist of an encoder-decoder network: an encoder first encodes the input into features, and an upsampling decoder then recovers the output from these features [15, 1]. The decoder is usually hierarchically opposite to the encoder, and both consist of multiple levels of computationally expensive convolutions. Through this encoder-decoder structure, the input is progressively refined to obtain the segmentation mask.
Long et al. [15] first proposed the idea of a Fully Convolutional Network (FCN) for this task, which outputs a segmentation mask in any given spatial dimensionality. Further improvements were achieved using bi-linear interpolation and skip connections [8]. However, FCN is limited to few-pixel local information and is therefore prone to losing global semantic context. SegNet [1] was the first CNN trained end-to-end for segmentation. The novelty of the architecture was in eliminating the need for learning to upsample by using fixed bi-linear interpolation for resolution recovery. Ronneberger et al. [22] introduced a contracting and an expansive pathway to better capture context and improve localisation precision, forming the characteristic "U"-shaped network.
Atrous convolutions [3, 29] have also been key to recent advancements, as they allow the receptive field to be increased without changing the feature map resolution. Multiple such convolutional layers, which accept the input in parallel, allow multi-scale contextual information across images to be better accounted for. This is termed Atrous Spatial Pyramid Pooling (ASPP) [29].
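To illustrate why an atrous convolution widens the receptive field while keeping the resolution, the following is a minimal one-dimensional sketch in plain Python. It relies on the standard relation $k_{\text{eff}} = k + (k-1)(d-1)$ for kernel size $k$ and dilation rate $d$. The helper names `effective_kernel_size` and `dilated_conv1d`, the zero padding and the toy input are assumptions made for illustration and are not taken from the paper's implementation.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span, in input samples, covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """1D dilated cross-correlation with zero padding; output length equals input length."""
    k = len(kernel)
    pad = effective_kernel_size(k, dilation) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    for i in range(len(signal)):
        # Kernel taps are spaced `dilation` samples apart.
        out.append(sum(kernel[j] * padded[i + j * dilation] for j in range(k)))
    return out


impulse = [0, 0, 0, 1, 0, 0, 0]
print(effective_kernel_size(3, 1), effective_kernel_size(3, 4))
print(dilated_conv1d(impulse, [1, 1, 1], dilation=2))
```

With three weights and a dilation of 2, the impulse response spreads over five input positions, yet the output has the same length as the input and no extra parameters are introduced.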
The downsampling of input images in deep classification networks can be hardware-inefficient, and several works have addressed this in the context of embedded vision applications. MobileNets [9] introduced the idea of factorising the standard convolution into depth-wise/kernel-wise separable convolutions, formed of a depth-wise convolution layer that filters the input and a 1×1 convolution that combines the filtered outputs to create new features. In [21], the authors employed kernel-wise separable convolutions to construct a compact model with the objective of enabling efficient real-time semantic segmentation. ESPNet [17] used a hierarchical pyramid of dilated and 1×1 convolutions to reduce the architecture size. Nekrasov et al. [19] developed an automatic way to find extremely light-weight architectures for image segmentation.
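The saving offered by the depth-wise separable factorisation can be checked with simple arithmetic. The sketch below counts multiply-accumulate operations for a standard convolution and for a depth-wise convolution followed by a 1×1 point-wise convolution; the ratio reduces to $1/C_{\text{out}} + 1/k^2$. The function names and the layer dimensions are hypothetical and chosen only for illustration.

```python
def standard_conv_macs(height, width, c_in, c_out, k):
    """MACs of a dense k x k convolution producing a height x width output."""
    return height * width * c_in * c_out * k * k


def depthwise_separable_macs(height, width, c_in, c_out, k):
    """MACs of a k x k depth-wise convolution followed by a 1 x 1 point-wise one."""
    depthwise = height * width * c_in * k * k   # one k x k filter per input channel
    pointwise = height * width * c_in * c_out   # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 32 x 32 feature map, 64 -> 128 channels, 3 x 3 kernel.
dense = standard_conv_macs(32, 32, 64, 128, 3)
separable = depthwise_separable_macs(32, 32, 64, 128, 3)
print(dense, separable, separable / dense)
```

For this assumed layer the separable variant needs roughly 12% of the operations of the dense convolution.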
Bayesian neural networks [7, 5, 16, 13, 12] assign a probability distribution on the network weights instead of point estimates, to provide uncertainty measurements in the predictions. Employing this Bayesian mathematical grounding for CNNs enables us to obtain both the mask and the uncertainty associated with it in the context of image segmentation. To the best of our knowledge, there are only two works focusing on Bayesian CNNs in image segmentation for robust uncertainty quantification. Both of these approaches use Monte Carlo Dropout (MCD) [6], in which Gal and Ghahramani cast dropout [24] training in a NN as Bayesian inference without the need for additional parametrisation. In [12] the authors searched and utilised dropout positioning in a SegNet [1]. In [7] the authors learned the dropout rates with respect to a DenseNet-like architecture [11].
In comparison to the related work, our work repurposes existing approaches [8, 11, 3, 22, 9, 12] to construct hardware-efficient networks by decreasing the number of parameters and multiply-add-accumulate (MAC) operations while also providing improvements in accuracy. Furthermore, unlike previous hardware-efficient works, we use Bayesian inference for uncertainty quantification through MCD. Finally, it is important to emphasise that while many previous image-classification approaches rely on pre-trained networks or fine-tuning to improve their results [1, 12, 26, 3, 19], this procedure is completely avoided in our work.
The network architecture of ComBiNet is presented in Figure 1. It is based on a U-Net-like architecture [11, 22] that is divided into downsampling and upsampling paths, as briefly described in Section 2. Skip connections between the paths preserve sharp edges by reducing the coarseness of masks, and as a result contextual information from the input images can be preserved. The general building unit of the network is a Repeat block. A Dense block at the bottom of the network is used to capture global image features, in addition to the optional ASPP blocks that are placed with skip connections. The input is processed using a 2D convolution Pre-processing block, while the output is processed through a 2D convolution Post-processing block.
Repeat blocks have the dual purpose of extracting features, through the Dense block, and extracting contextual information, through an optional ASPP block. Each block spatially downsamples the input by a factor of 2 and later upsamples it back to the block's input resolution.
The Repeat block is reusable, such that multiple blocks can be appended to one another to extract contextually richer features. The output of the Downsampling block is the input into the next Repeat block. This means the features and the input are processed at different spatial sizes. It is important to highlight there is a connection between the input of the Repeat block and the output of its encoding Dense block, prior to Downsampling. The input is concatenated to the output of the block, without being processed through the feature extracting Dense block, to enable propagation of local and global contextual information.
The Dense block is inspired by [10, 11] and is shown expanded in Figure 1 on the right. It is a gradual concatenation of previous features, in which new feature maps are produced by a Basic Layer (BL). A BL accepts inputs from all previous layers in a Dense block. The output channel number of each BL is restricted to a fixed growth rate, which is constant for all BLs in the network, to avoid an exponential increase in the channels propagated. More intuitively, it regulates the amount of new information each layer can contribute to the global state. For similar reasons, the output of the Dense block does not automatically include the original input, except on the downsampling path. The Dense block can have an arbitrary number of BLs, and their count is increased towards smaller spatial input sizes. Efficient gradient and feature propagation is ensured by concatenations between all previous stages and the current stage. Details of the serially connected individual operations of BL, Downsampling and Upsampling are given below.
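The channel bookkeeping implied by this description can be sketched in a few lines. In the following plain-Python example, each BL sees the block input plus the outputs of all earlier BLs and contributes a fixed number of new channels; the block output contains only the newly produced features unless the input is retained, as on the downsampling path. The function name `dense_block_channels` and the numbers used (48 input channels, growth rate 12, 4 layers) are hypothetical.

```python
def dense_block_channels(c_in, growth_rate, num_layers, keep_input):
    """Return (input channels seen by each BL, output channels of the Dense block)."""
    # The i-th BL receives the block input plus the outputs of the i previous BLs.
    bl_inputs = [c_in + i * growth_rate for i in range(num_layers)]
    new_features = num_layers * growth_rate
    # The original input is only carried through when requested (downsampling path).
    block_output = new_features + (c_in if keep_input else 0)
    return bl_inputs, block_output


print(dense_block_channels(48, 12, 4, keep_input=False))
print(dense_block_channels(48, 12, 4, keep_input=True))
```

The input width of each BL grows linearly with depth, while the number of newly created channels per layer stays constant.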
Basic Layer: Batch normalisation; ReLU; Completely separable convolution; Dropout
Downsample: Batch normalisation; ReLU; Convolution; Dropout; Max-pooling with stride 1; Blur with stride 2
Upsample: Bi-linear interpolation; Convolution
The BL first performs batch normalisation (BN), which pre-processes the inputs coming from the different BLs. This operation is followed by ReLU and a completely separable convolution for feature extraction. The completely separable convolution consists of serially connected, spatially separated convolutions that are channel-wise separated and whose output channel size equals that of the input, followed by a reshaping pointwise convolution. We use completely separable convolutions for their efficiency in parameter and MAC operation counts. In particular, when paired with an appropriate growth rate, the BL can be an extremely compact feature extractor. Additionally, we empirically observed that it is important to include BN between the spatially separated and pointwise parts of the completely separable convolution. The convolution is followed by a 2D dropout to provide regularisation and perform Bayesian inference [5].
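A rough parameter count shows why such a factorisation is compact. The sketch below assumes that the spatially separated part is realised as a $k \times 1$ followed by a $1 \times k$ depth-wise convolution and that the pointwise part is a $1 \times 1$ convolution; these kernel shapes, the omission of biases and BN parameters, and the channel numbers are our assumptions for illustration, since they are not specified in the text above.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a dense k x k convolution (biases ignored)."""
    return c_in * c_out * k * k


def completely_separable_params(c_in, c_out, k):
    """Weights of k x 1 and 1 x k depth-wise convolutions plus a 1 x 1 point-wise one."""
    vertical = c_in * k         # assumed k x 1 depth-wise convolution
    horizontal = c_in * k       # assumed 1 x k depth-wise convolution
    pointwise = c_in * c_out    # reshapes the channel dimension
    return vertical + horizontal + pointwise


# Hypothetical BL: 64 input channels, 16 output channels, k = 3.
print(standard_conv_params(64, 16, 3))
print(completely_separable_params(64, 16, 3))
```

Under these assumptions the pointwise convolution dominates the cost, so a small output channel count keeps the whole layer small.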
The Downsampling extracts coarse semantic features. The combined operations are BN, ReLU, convolution, dropout, max-pooling with stride 1 and blurring with stride 2. We combine blurring with max-pooling to preserve the shift-invariance of convolutions [27].
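The pooling part of this operation can be sketched in one dimension: the maximum is first evaluated densely (stride 1), and the subsampling is then performed by a low-pass filter applied with stride 2. The binomial kernel `(0.25, 0.5, 0.25)`, the pooling window of 2, the edge replication and the function names are assumptions made for this example; the text above does not specify them.

```python
def max_pool1d_stride1(x, window=2):
    """Max over a sliding window, evaluated at every position (stride 1)."""
    return [max(x[i:i + window]) for i in range(len(x) - window + 1)]


def blur_downsample1d(x, kernel=(0.25, 0.5, 0.25), stride=2):
    """Low-pass filter with edge replication, then keep every `stride`-th sample."""
    pad = len(kernel) // 2
    padded = [x[0]] * pad + list(x) + [x[-1]] * pad
    blurred = [sum(w * padded[i + j] for j, w in enumerate(kernel))
               for i in range(len(x))]
    return blurred[::stride]


def antialiased_downsample1d(x):
    """Max-pooling with stride 1 followed by blurring with stride 2."""
    return blur_downsample1d(max_pool1d_stride1(x))


print(antialiased_downsample1d([0, 0, 1, 1, 0, 0, 1, 1]))
```

Because the blur averages neighbouring responses before samples are discarded, a one-pixel shift of the input changes the output more gradually than with a strided max-pooling alone.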
The Upsampling uses the parameter-less bi-linear interpolation to save computational and memory resources. Furthermore, it also preserves shift invariance of objects in the input images and avoids aliasing [27]. We add a 2D convolution to the output of the interpolation to refine the upsampled features.
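Bi-linear interpolation has no learnable weights, which is what makes it attractive here. A minimal plain-Python sketch is given below; it uses a corner-aligned coordinate mapping, which is an assumption of this example (the convention used by the authors is not stated), and the function name `bilinear_upsample` is hypothetical.

```python
def bilinear_upsample(image, out_h, out_w):
    """Resize a 2D list of numbers with bi-linear interpolation (corner-aligned)."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for r in range(out_h):
        # Map the output row to a fractional input row.
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate along the width first, then along the height.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


print(bilinear_upsample([[0, 2], [4, 6]], 3, 3))
```

The refinement convolution described above would then operate on the interpolated feature map, so all learnable parameters of the upsampling step reside in that convolution.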
ASPP [29, 3], as briefly introduced in Section 2, has been successfully used in various segmentation models to capture contextual information. It consists of atrous (dilated) convolutions, which enable the preservation of shift-invariance while at the same time increasing the receptive field and enhancing the robustness to augmentations [27]. Specifically, it is composed of convolutions interleaved with BN and ReLU to extract information over a wide spatial range through setting wider dilation rates in convolutions. Global average pooling and convolution are used for global feature aggregation at the given scale. Each part accepts inputs from all channels, downscales them such that the output is only channels. These are concatenated with all others along the channel dimension and refined to the output channel dimension by convolution. Finally, we regularise by applying dropout. In our work we use ASPP blocks in all Repeat blocks, except the first and the last one. We also changed the original ordering of the dilated convolutions to place BN first, instead of the convolution, for better regularisation. We kept the partial channel numbers to to limit computation.
MCD [6, 5] provides a scalable way to learn a predictive distribution, by applying dropout [24] to the output of convolutions at both training and test time. This leads to Bayesian inference over the network's weights. The distribution induced by the dropout is used to sample models from the learnt variational posterior distribution. Although this can be achieved without additional parameters, it requires sampling and repeating feedforward steps through the network with the same input. The repeated steps linearly increase the compute demand, and hence it is of further importance that the network is hardware efficient both in terms of memory consumption and the number of operations for the individual runs. A pixel-wise entropy can be derived, based on the repeated runs, that quantifies uncertainty as $H = -\sum_{c} \bar{p}_c \log \bar{p}_c$, where the sum runs over the output classes $c$. Here $\bar{p}$ is the pixel-wise mean of the softmax outputs across the runs. The dropout rate presents a trade-off between data fit and uncertainty estimation. For convenience of hardware implementation, we use a dropout rate of 0.05 across the entire network for all experiments.
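The entropy computation for a single pixel can be sketched as follows. The function takes the softmax outputs of several stochastic forward passes (dropout kept active), averages them over the passes and evaluates the entropy of the mean in nats. The function name `predictive_entropy` and the toy probabilities are hypothetical; obtaining the softmax samples from an actual network is outside the scope of this sketch.

```python
import math


def predictive_entropy(softmax_samples):
    """Entropy (in nats) of the mean softmax output over several stochastic passes.

    softmax_samples: list of probability vectors for one pixel, one per pass.
    """
    num_samples = len(softmax_samples)
    num_classes = len(softmax_samples[0])
    # Pixel-wise mean of the softmax outputs across the passes.
    mean_probs = [sum(sample[c] for sample in softmax_samples) / num_samples
                  for c in range(num_classes)]
    # Classes with zero mean probability contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return mean_probs, entropy


# Passes that agree give zero entropy; passes that disagree give a higher value.
print(predictive_entropy([[1.0, 0.0], [1.0, 0.0]]))
print(predictive_entropy([[1.0, 0.0], [0.0, 1.0]]))
```

For two classes the entropy is bounded by $\log 2 \approx 0.69$ nats, which is reached when the averaged prediction is split evenly between the classes.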
This section first discusses our experimental settings and then presents an assessment of the results on the CamVid and bacteria datasets. We did not perform pre-training on additional image data or post-training fine-tuning. We introduce three ComBiNet models, ComBiNet-51, ComBiNet-62 and ComBiNet-87, with the aim of trading off computational complexity, accuracy and uncertainty quantification capabilities. We evaluated uncertainty through the mean per-pixel entropy of networks trained on CamVid or bacteria with respect to a random subset of 250 PascalVOC images [4]. The number of MACs was calculated with respect to input size and . We initialised the weights of all ComBiNets with the He-Uniform initialisation [8]. To train, we used Adam for 800 epochs with an initial learning rate of and an exponential decrease with a factor . We trained ComBiNets with a batch size of and with BN applied to each batch individually, as we found it essential not to use train-time statistics during evaluation. We set for the quantitative and qualitative software evaluation. For quantitative evaluation we measured the standard per-pixel mean intersection over union (mIoU), entropy, MACs and number of trainable parameters. We repeated each experiment 3 times, from which we report the mean and a single standard deviation in the following tables.
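For reference, the mIoU metric can be computed as in the minimal sketch below, which operates on flat lists of per-pixel class labels. Skipping classes that are absent from both prediction and target, the optional `ignore_index` for an ignored label, and the function name `mean_iou` are conventions assumed for this example; the exact evaluation protocol used in the experiments may differ.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean intersection over union between two flat lists of class labels."""
    ious = []
    for c in range(num_classes):
        intersection = 0
        union = 0
        for p, t in zip(pred, target):
            if t == ignore_index:
                continue  # pixels with the ignored label do not count
            intersection += int(p == c and t == c)
            union += int(p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else float("nan")


print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 255], num_classes=2, ignore_index=255))
```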
The CamVid road scenes dataset [2] originates from fully segmented videos from the perspective of a driving car. It consists of 367 frames for training, 101 frames for validation and 233 frames for testing of RGB images with a input resolution. There are 11 manually labelled classes that include roads, cars, signs etc. and a background that is usually ignored during training and evaluation. To augment the dataset we carried out channel-wise normalisation and the following randomly: re-scale inputs between a factor of to ; change aspect ratios between to ; crop with a square size of ; horizontal flips; and random colour changes with respect to contrast, saturation and hue for training. We used the combo loss function [25], and weighted it proportionally to class-pixels in the images as CamVid is unbalanced. A weight decay of was applied.
We summarise the performance of the different ComBiNets in Table 1, comparing to other state-of-the-art segmentation networks, including those focused on hardware efficiency with respect to their number of parameters and those considering Bayesian inference. The results show that all ComBiNets obtained competitive mIoU with significantly fewer parameters and MACs. One result that stands out is [30], which used video, fine-tuning and an overparametrised architecture. ComBiNet-87 is the most accurate of the ComBiNets with approximately fewer parameters and MACs than its current equivalent with . ComBiNet-51 is the most hardware efficient with fewer parameters and fewer MACs than the Bayesian SegNet when , while achieving an accuracy that is still close to the related works. We also compared the entropy pixel-wise, in which ComBiNets are marginally better in comparison to [12, 7]. In Figures 2 and 3 we demonstrate the qualitative results. In general, the model is more uncertain about objects that are more distant, occluded or surrounded by the background class (black), which was ignored during training and evaluation. The results of the segmentation showed that the most problematic classes were fence and sign/symbol, whilst roads and the sky were most accurately distinguished. Figure 2 demonstrates on one sample that the model is also accurate in comparison to the related work consisting of a non-Bayesian or a Bayesian model.
Method | mIoU [%] | Params [M] | MACs [G] | Entropy [nats] |
SegNet [1] | 55.6 | 29.7 | - | - |
Bayesian SegNet [12] | 63.1 | 29.7 | 30.8 | 0.68 |
DenseNet-103 [11] | 66.9 | 9.4 | 24.9 | - |
DenseNet-103 + CD [7] | 67.4 | 9.4 | 24.9 | 0.47 |
ESPNet [17] | 55.6 | 0.36 | - | - |
BiSeNet [26] | 65.6 | 5.8 | - | - |
ICNet [28] | 67.1 | 6.7 | - | - |
Compact Nets [19] | 63.9 | 0.28 | - | - |
DeepLab-v3+-ResNet50 [3] | 57.6 | 16.6 | 13.2 | - |
Video-WideResNet38 [30] | 79.8 | 137.1 | - | - |
ComBiNet-51 | 66.1 ± 0.3 | 0.7 | 4.2 | 0.69 ± 0.02 |
ComBiNet-62 | 66.9 ± 0.2 | 1.3 | 7.9 | 0.68 ± 0.01 |
ComBiNet-87 | 67.9 ± 0.1 | 2.3 | 9.4 | 0.65 ± 0.02 |
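The relative savings can be read off the table above with a short calculation. The sketch below copies the rounded parameter and MAC values of selected rows and divides them; because the tabulated values are rounded, the resulting factors are approximate. The dictionary `TABLE_1` and the function `reduction_factors` are names introduced here for illustration.

```python
# (parameters in millions, MACs in billions), copied from the table above.
TABLE_1 = {
    "Bayesian SegNet": (29.7, 30.8),
    "DenseNet-103 + CD": (9.4, 24.9),
    "ComBiNet-51": (0.7, 4.2),
    "ComBiNet-62": (1.3, 7.9),
    "ComBiNet-87": (2.3, 9.4),
}


def reduction_factors(baseline, model, table=TABLE_1):
    """Return how many times fewer parameters and MACs `model` needs than `baseline`."""
    base_params, base_macs = table[baseline]
    model_params, model_macs = table[model]
    return base_params / model_params, base_macs / model_macs


for baseline, model in [("Bayesian SegNet", "ComBiNet-51"),
                        ("DenseNet-103 + CD", "ComBiNet-87")]:
    params_x, macs_x = reduction_factors(baseline, model)
    print(f"{model} vs {baseline}: {params_x:.1f}x fewer parameters, {macs_x:.1f}x fewer MACs")
```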
The bacteria dataset [20] comprises 366 darkfield microscopy images with manually annotated masks for segmentation. The task is to detect bacteria of the phylum Spirochaetes in blood. This leads to a problem of segmenting two classes, corresponding to the bacteria (Spirochaetes) and the red blood cells (Erythrocytes). This is a challenging task due to the nature of the problem, a heavily unbalanced dataset, and the collection methodology, which results in considerably noisy RGB input images of varying sizes from to pixels. We randomly split the dataset into 219, 73 and 74 images for training, validation and test respectively. We then applied the same augmentations as those mentioned in Section 4.1 for the CamVid dataset, extended further with vertical flips. We trained with the combo loss function and added a log-dice coefficient. Weight decay was set to .
Method | mIoU [%] | Params [M] | MACs [G] | Entropy [nats] |
Bayesian SegNet [12] | 76.1 | 29.7 | 30.8 | 0.19 |
DenseNet-103 + CD [7] | 75.8 | 9.4 | 24.9 | 0.32 |
U-Net [22] | 71.4 | 31.0 | 41.9 | - |
DeepLab-v3+-ResNet50 [3] | 80.4 | 16.6 | 13.2 | - |
ComBiNet-51 | 82.3 ± 0.4 | 0.7 | 4.2 | 0.18 ± 0.02 |
ComBiNet-62 | 83.0 ± 0.4 | 1.3 | 7.9 | 0.16 ± 0.01 |
ComBiNet-87 | 82.3 ± 0.2 | 2.3 | 9.4 | 0.16 ± 0.02 |
Table 2 shows that all ComBiNets obtain better accuracy with significantly fewer parameters and MACs. ComBiNet-51 is the most hardware efficient with fewer parameters and fewer MACs than DenseNet when . We note that ComBiNet-87 achieves a worse accuracy than ComBiNet-62 in our experiments with this dataset, showing that a bigger network is not always the best. All ComBiNets infer that all unrecognisable objects should be classified as background, resulting in smaller entropy than the related work. The qualitative evaluation in Figures 3 and 4 demonstrates the ability of the architecture to segment noisy images, while comparing it to DenseNet with Concrete Dropout (CD) [7] and Bayesian SegNet. We further depict the corresponding predictive uncertainty of this sample in Figure 5, which helps us understand the portions of the image where the architecture was less certain in its predictions. It can be seen that the network is uncertain about suspicious bacteria bodies, which can further help practitioners to better understand their samples.
With respect to the qualitative results in Figure 3, along with the quantified uncertainty measured by per-pixel information entropy, we observe that, due to the skip connections and gradual downsampling and upsampling, the model retains sharp edges and detail in the predictions. It is important to highlight that the model was also more uncertain in sections of the images that were misclassified.
The main bottleneck of this work lies in its use of MCD for Bayesian inference, as it requires multiple feedforward runs, but no extra network weights, to obtain an uncertainty estimate in the output mask. These runs multiply the MAC cost and hence represent a trade-off between hardware demand and quality of approximation of the predictive distribution. For this reason, lowering MACs at the level of the individual feedforward pass was the focus of this work. Additionally, in hardware it is possible to simply parallelise these runs [18]. Lastly, if uncertainty estimation is not needed, the presented networks can still provide high accuracy through weight averaging, disabling dropout and setting , although this accuracy was lower by approximately one standard deviation than that shown in Tables 1 and 2 for CamVid and bacteria respectively.
We propose a compact Bayesian architecture, ComBiNet, that re-purposes hardware-efficient operations for the task of image segmentation. We demonstrated that good accuracy along with predictive uncertainties can be achieved with significantly fewer parameters and MACs, lowering hardware resources and computational costs. We show that ComBiNet performs well with an imbalanced dataset, as well as the established CamVid dataset, showing higher uncertainty in misclassified sections. Furthermore, it was not necessary to perform any pre-training or post-training fine-tuning to reach the observed accuracy. In the future, we would like to measure and optimise the architectures with respect to other hardware performance metrics such as power consumption, and to consider structured instance-wise uncertainty estimation instead of pixel-wise.
Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680.
Concrete problems for autonomous vehicle safety: advantages of Bayesian deep learning. In IJCAI, IJCAI'17, pp. 4745–4753.
The Journal of Machine Learning Research 15 (1), pp. 1929–1958.