ComBiNet: Compact Convolutional Bayesian Neural Network for Image Segmentation

04/14/2021 ∙ by Martin Ferianc, et al. ∙ Imperial College London ∙ UCL

Fully convolutional U-shaped neural networks have largely been the dominant approach for pixel-wise image segmentation. In this work, we tackle two defects that hinder their deployment in real-world applications: 1) predictions lack the uncertainty quantification that may be crucial to many decision-making systems; 2) large memory storage and computational consumption demand extensive hardware resources. To address these issues and improve their practicality, we demonstrate a compact, few-parameter Bayesian convolutional architecture that achieves a marginal improvement in accuracy over related work while using significantly fewer parameters and compute operations. The architecture combines parameter-efficient operations, such as separable convolutions, bi-linear interpolation and multi-scale feature propagation, with Bayesian inference for per-pixel uncertainty quantification through Monte Carlo Dropout. The best-performing configurations required fewer than 2.5 million parameters on diverse, challenging datasets with few observations.


1 Introduction

Image segmentation is the pixel-level computer vision task of semantically segregating an image into discrete regions. Among various algorithms, convolutional neural networks (CNNs) have been key to this task, demonstrating outstanding performance [10, 8, 22, 11, 12, 9, 29, 15, 1]. CNNs are able to express predictions as pixel-wise output masks by learning appropriate feature representations in an end-to-end fashion, while allowing inputs of varying size to be processed. This is especially useful in inferring object support relationships for robotics, autonomous driving or healthcare, as well as scene geometry [12, 16, 23].

Figure 1: ComBiNet, a U-Net-like [22, 11] architecture consisting of Repeat blocks with different input scales and dilation rates in an Atrous Spatial Pyramid Pooling (ASPP) module. Each block contains Dense feature-extracting blocks, Downsampling to reduce the spatial dimensionality by a factor of 2, and Upsampling to restore it after processing the features at the lower resolution. The context is transferred through an optional ASPP module and concatenated (C), pairing the same spatial resolution. On the right is the detail of the Dense block consisting of Basic Layers (BLs). The arrows represent data flow.

A practical drawback of regular CNNs is that they are unable to quantify the uncertainty of their predictions, which is crucial for many safety-critical applications [6]. Bayesian CNNs [5] adopt Bayesian inference to provide principled uncertainty estimates on top of the segmentation masks. However, as the research community seeks to improve accuracy and better capture information in a wider range of applications, potential CNN architectures become deeper and more complicated in their connectivity [8, 11, 14, 29]. As a result, they are increasingly more compute and memory demanding, and a regular modern CNN architecture cannot be easily adopted for Bayesian inference. Because an analytical prediction of uncertainty is not tractable with such architectures, it has to be approximated through Monte Carlo sampling with multiple runs through the network. The increased runtime cost, primarily due to sampling, has been a limiting factor for Bayesian CNNs in real-world image segmentation.

To address the aforementioned issues of lacking uncertainty quantification in regular CNNs and extensive execution cost, the contribution of this work is in improving the hardware performance of Bayesian CNNs for image segmentation, while also providing efficient pixel-wise uncertainty quantification. Our approach builds on recent successes in improving software-hardware performance [8, 11, 3, 22, 9, 12, 27] and extends them into a novel Bayesian CNN architectural template, as shown in Figure 1. Specifically, we focus on few-parameter/few-operation models that decrease the runtime cost of each feedforward pass, and present a compact design named ComBiNet. Monte Carlo Dropout [6] is used for Bayesian inference. We demonstrate ComBiNet's superior performance on the few-samples, video-based CamVid [2] dataset. We also achieve good performance on a bacteria segmentation task from a database of darkfield microscopy images [20]. Based on the results obtained, we demonstrate designs that achieve accuracy comparable to the state-of-the-art [1, 12, 11, 7, 17, 26, 28, 19], while requiring only a fraction of the parameters or operations in comparison.

2 Related Work

CNN-based architectures for image segmentation comprise an encoder-decoder network: an encoder that first transforms the input into features, and an upsampling decoder that then recovers the output from those features [15, 1]. The decoder is usually hierarchically opposite to the encoder, although both consist of multiple levels of computationally expensive convolutions. Based on this encoder-decoder structure, the input is refined to obtain the segmentation mask.

Long et al. [15] first proposed the idea of a Fully Convolutional Network (FCN) for this task, which outputs a segmentation mask for any given spatial dimensionality. Further improvements were achieved using bi-linear interpolation and skip connections [8]. However, the FCN is limited to few-pixel local information and is therefore prone to losing global semantic context. SegNet [1] was the first CNN trained end-to-end for segmentation. The novelty of the architecture was in eliminating the need for learning to upsample, using fixed bi-linear interpolation for resolution recovery. Ronneberger et al. [22] introduced a contracting and expansive pathway to better capture context and improve localisation precision, forming the characteristic "U"-shaped network.

Atrous convolutions [3, 29] have also been key to recent advancements, as they allow increasing the receptive field without changing the feature map resolution. Multiple such convolutional layers, accepting the input in parallel, allow the network to better account for multi-scale contextual information across images. This arrangement is termed Atrous Spatial Pyramid Pooling (ASPP) [29].

The downsampling of input images in deep classification networks can be hardware-inefficient, and several works have addressed this in the context of embedded vision applications. MobileNets [9] introduced the idea of factorising the standard convolution into depth-wise/kernel-wise separable convolutions, formed of a depth-wise convolution layer that filters the input and a point-wise convolution that combines the filtered channels into new features. In [21], the authors employed kernel-wise separable convolutions to construct a compact model with the objective of enabling efficient real-time semantic segmentation. ESPNet [17] used a hierarchical pyramid of dilated convolutions to reduce the architecture size. Nekrasov et al. [19] developed an automatic way to find extremely light-weight architectures for image segmentation.
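For intuition on the savings this factorisation brings, the sketch below (PyTorch; the channel sizes are arbitrary and chosen only for illustration, not taken from any of the cited networks) compares the parameter counts of a standard convolution and its depth-wise separable counterpart.

```python
import torch
import torch.nn as nn

# A standard 3x3 convolution mixes all channels in a single, parameter-heavy layer.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

separable = nn.Sequential(
    # Depth-wise convolution: filters each input channel independently.
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False),
    # Point-wise (1x1) convolution: combines the filtered channels into new features.
    nn.Conv2d(64, 128, kernel_size=1, bias=False),
)

x = torch.randn(1, 64, 32, 32)
assert standard(x).shape == separable(x).shape  # identical output shapes
print(sum(p.numel() for p in standard.parameters()))   # 73728 weights
print(sum(p.numel() for p in separable.parameters()))  # 576 + 8192 = 8768 weights
```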

Bayesian neural networks [7, 5, 16, 13, 12] place a probability distribution over the network weights instead of point estimates, to provide uncertainty measurements for the predictions. Employing this Bayesian mathematical grounding for CNNs enables us to obtain both the mask and the uncertainty associated with it in the context of image segmentation. To the best of our knowledge, there are only two works focusing on Bayesian CNNs in image segmentation for robust uncertainty quantification. Both of these approaches use Monte Carlo Dropout (MCD) [6], in which Gal and Ghahramani cast dropout [24] training in a NN as Bayesian inference without the need for additional parametrisation. In [12] the authors searched for and utilised dropout positioning in a SegNet [1]. In [7] the authors learned the dropout rates with respect to a DenseNet-like architecture [11].

In comparison to the related work, our work repurposes existing approaches [8, 11, 3, 22, 9, 12] to construct hardware-efficient networks by decreasing the number of parameters and multiply-add-accumulate (MAC) operations while also providing improvements in accuracy. Furthermore, unlike previous hardware-efficient works, we use Bayesian inference for uncertainty quantification through MCD. Finally, it is important to emphasise that while many previous approaches rely on networks pre-trained on image classification or on fine-tuning to improve their results [1, 12, 26, 3, 19], this procedure is completely avoided in our work.

3 ComBiNet

The network architecture of ComBiNet is presented in Figure 1. It is based on a "U"-Net-like architecture [11, 22] that is divided into downsampling and upsampling paths, as briefly described in Section 2. Skip connections connecting the paths preserve sharp edges by reducing the coarseness of the masks, and as a result contextual information from the input images is preserved. The general building unit of the network is a Repeat block. A Dense block at the bottom of the network captures global image features, in addition to the optional ASPP blocks that are placed on the skip connections. The input is processed by a 2D convolution Pre-processing block, while the output is produced by a 2D convolution Post-processing block.

3.1 Repeat Block

Repeat blocks have the dual purpose of extracting features, through the Dense block, and contextual information, through an optional ASPP block. Each block spatially downsamples the input by a factor of 2 and later upsamples it back to the block's input resolution.

The Repeat block is reusable, such that multiple blocks can be appended to one another to extract contextually richer features. The output of the Downsampling block is the input to the next Repeat block, which means that features are processed at progressively smaller spatial sizes. It is important to highlight that there is a connection between the input of the Repeat block and the output of its encoding Dense block, prior to Downsampling. The input is concatenated to the output of the block, without being processed through the feature-extracting Dense block, to enable the propagation of both local and global contextual information.
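A minimal structural sketch of this recursion is given below (PyTorch); the sub-module interfaces and channel bookkeeping are simplifying assumptions made for illustration, not the exact ComBiNet implementation.

```python
import torch
import torch.nn as nn

class RepeatBlock(nn.Module):
    """Hypothetical sketch of one Repeat block: an encoding Dense block, an
    optional ASPP module on the skip path, and a recursive inner block that
    processes the features at a lower spatial resolution."""

    def __init__(self, dense_down, downsample, inner, upsample, dense_up, aspp=None):
        super().__init__()
        self.dense_down = dense_down  # encoding Dense block
        self.downsample = downsample  # max-pool (stride 1) followed by blur (stride 2)
        self.inner = inner            # next Repeat block, or the bottom Dense block
        self.upsample = upsample      # bi-linear interpolation + convolution
        self.dense_up = dense_up      # decoding Dense block
        self.aspp = aspp              # optional context module on the skip connection

    def forward(self, x):
        # The block input is concatenated with the encoding features, so local
        # context bypasses the feature extractor and reaches the decoder directly.
        skip = torch.cat([x, self.dense_down(x)], dim=1)
        context = self.aspp(skip) if self.aspp is not None else skip
        y = self.upsample(self.inner(self.downsample(skip)))  # process at lower resolution
        return self.dense_up(torch.cat([context, y], dim=1))  # pair equal resolutions
```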

3.1.1 Basic Building Blocks

The Dense block is inspired by [10, 11] and is shown expanded in Figure 1 on the right. It is a gradual concatenation of previous features, allowing for feature-map changes processed through a Basic Layer (BL). A BL accepts inputs from all previous layers in a Dense block. The output channel number of each BL is restricted to a fixed growth rate, constant for all BLs in the network, to avoid an exponential increase in the number of propagated channels. More intuitively, the growth rate regulates the amount of new information each layer can contribute to the global state. For similar reasons, the output of the Dense block does not automatically include the original input, except on the downsampling path. The Dense block can have an arbitrary number of BLs, and their count increases towards smaller spatial input sizes. Efficient gradient and feature propagation is ensured by concatenations between all previous stages and the current stage. Details of the serially connected individual operations of the BL, Downsampling and Upsampling are given below.

  • Basic Layer: Batch normalisation; ReLU; Completely separable convolution; Dropout

  • Downsample: Batch normalisation; ReLU; Convolution; Dropout; Max-pooling with stride 1; Blur with stride 2

  • Upsample: Bi-linear interpolation; Convolution

The BL first performs batch normalisation (BN), which pre-processes the inputs coming from the different BLs. This operation is followed by ReLU and a completely separable convolution for feature extraction. The latter consists of serially connected, channel-wise separated convolutions whose output channel count matches the input, followed by a reshaping point-wise convolution. We use completely separable convolutions for their parameter and MAC operation count efficiency. In particular, when paired with an appropriate growth rate, the BL can be an extremely compact feature extractor. Additionally, we empirically observed that it is important to include BN between the spatially separated and point-wise parts of the completely separable convolution. The convolution is followed by a 2D dropout to provide regularisation and perform Bayesian inference [5].
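A minimal sketch of a BL is shown below (PyTorch), assuming 3x1/1x3 spatial kernels for the completely separable convolution; the exact kernel sizes are an assumption, while the remaining ordering follows the description above.

```python
import torch.nn as nn

def basic_layer(in_channels: int, growth_rate: int, p_drop: float = 0.05) -> nn.Sequential:
    """Sketch of one Basic Layer; not the authors' exact implementation."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        # Spatially separated, channel-wise (depth-wise) convolutions.
        nn.Conv2d(in_channels, in_channels, kernel_size=(3, 1), padding=(1, 0),
                  groups=in_channels, bias=False),
        nn.Conv2d(in_channels, in_channels, kernel_size=(1, 3), padding=(0, 1),
                  groups=in_channels, bias=False),
        # BN between the spatially separated and point-wise parts, as noted in the text.
        nn.BatchNorm2d(in_channels),
        # Point-wise (1x1) convolution reshaping the channels to the growth rate.
        nn.Conv2d(in_channels, growth_rate, kernel_size=1, bias=False),
        # 2D dropout for regularisation and Monte Carlo Dropout inference.
        nn.Dropout2d(p=p_drop),
    )
```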

The Downsampling extracts coarse semantic features. The combined operations include BN, ReLU, a convolution, dropout, max-pooling with stride 1 and blurring with stride 2. We pair max-pooling with the additional blurring to preserve the shift-invariance of the convolutions [27].

The Upsampling uses the parameter-less bi-linear interpolation to save computational and memory resources. Furthermore, it also preserves shift invariance of objects in the input images and avoids aliasing [27]. We add a 2D convolution to the output of the interpolation to refine the upsampled features.
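The two resampling blocks could be sketched as follows (PyTorch); the 1x1 convolution in the Downsample block, the max-pool kernel size, and the average pool standing in for the stride-2 blur are assumptions, not the exact implementation.

```python
import torch.nn as nn

def downsample(channels: int, p_drop: float = 0.05) -> nn.Sequential:
    """Sketch of the Downsample block: stride-1 max-pooling followed by a
    stride-2 blur, in the spirit of anti-aliased downsampling [27]."""
    return nn.Sequential(
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=1, bias=False),
        nn.Dropout2d(p=p_drop),
        nn.MaxPool2d(kernel_size=3, stride=1, padding=1),  # dense (stride-1) max-pooling
        nn.AvgPool2d(kernel_size=2, stride=2),              # blur + subsample by a factor of 2
    )

def upsample(in_channels: int, out_channels: int) -> nn.Sequential:
    """Parameter-less bi-linear interpolation followed by a refining convolution."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
    )
```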

3.1.2 Atrous Spatial Pyramid Pooling (ASPP)

ASPP [29, 3], as briefly introduced in Section 2, has been successfully used in various segmentation models to capture contextual information. It consists of atrous (dilated) convolutions, which preserve shift-invariance while at the same time increasing the receptive field and enhancing the robustness to augmentations [27]. Specifically, it is composed of dilated convolutions interleaved with BN and ReLU to extract information over a wide spatial range through wider dilation rates. Global average pooling followed by a convolution is used for global feature aggregation at the given scale. Each branch accepts inputs from all channels and downscales them to a smaller number of output channels. The branch outputs are concatenated in the channel dimension and refined to the output channel dimension by a convolution. Finally, we regularise by applying dropout. In our work we use ASPP blocks in all Repeat blocks, except the first and the last one. We also changed the original ordering of the dilated convolutions to place BN first, instead of the convolution, for better regularisation. We kept the per-branch channel numbers small to limit computation.
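A sketch of such an ASPP module is given below (PyTorch); the dilation rates and the reduced per-branch width are placeholders, not the settings used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of the ASPP context module with BN placed first in each dilated branch."""

    def __init__(self, in_ch: int, branch_ch: int, out_ch: int,
                 dilations=(2, 4, 8), p_drop: float = 0.05):
        super().__init__()
        # Parallel dilated branches, each reducing the channel count to branch_ch.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch, branch_ch, kernel_size=3,
                          padding=d, dilation=d, bias=False),
            ) for d in dilations
        ])
        # Global average pooling branch for image-level context.
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1, bias=False),
        )
        # Concatenated branches are refined to the output channel dimension and regularised.
        self.project = nn.Sequential(
            nn.Conv2d(branch_ch * (len(dilations) + 1), out_ch, kernel_size=1, bias=False),
            nn.Dropout2d(p=p_drop),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        g = F.interpolate(self.global_branch(x), size=(h, w),
                          mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))
```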

3.2 Bayesian Inference

MCD [6, 5] provides a scalable way to learn a predictive distribution by applying dropout [24] to the output of convolutions at both training and test time. This corresponds to approximate Bayesian inference over the network's weights: each dropout mask sampled at test time yields a model drawn from the learnt variational posterior distribution. Although this can be achieved without additional parameters, it requires sampling and repeating feedforward steps through the network with the same input. The repeated steps linearly increase the compute demand, and hence it is of further importance that the network is hardware-efficient, both in terms of memory consumption and the number of operations for the individual runs. Based on the repeated runs, a pixel-wise entropy that quantifies uncertainty can be derived as H = -Σ_c p̄_c log p̄_c, where p̄_c is the pixel-wise mean of the softmax outputs for output class c across the runs. The dropout rate presents a trade-off between data fit and uncertainty estimation. For convenience of hardware implementation, we use a dropout rate of 0.05 across the entire network for all experiments.
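The sampling procedure can be sketched as follows (PyTorch); the sample count and helper names are illustrative, but the mechanism (dropout kept active at test time, softmax outputs averaged, per-pixel entropy computed in nats) follows the description above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_predict(model, image, num_samples: int = 10):
    """Monte Carlo Dropout inference sketch: returns the mean-probability mask
    and the per-pixel predictive entropy over `num_samples` stochastic passes."""
    model.eval()
    # Re-enable only the dropout layers; BN keeps its evaluation-time statistics.
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()
    probs = torch.stack([F.softmax(model(image), dim=1) for _ in range(num_samples)])
    mean_probs = probs.mean(dim=0)                                           # (B, C, H, W)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=1)   # (B, H, W), nats
    return mean_probs.argmax(dim=1), entropy
```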

4 Experiments

This section first discusses our experimental settings and then presents an assessment of the results on the CamVid and bacteria datasets. We did not perform pre-training on additional image data or post-training fine-tuning. We introduce three ComBiNet models: ComBiNet-51, ComBiNet-62 and ComBiNet-87, with the aim of trading off computational complexity, accuracy and uncertainty quantification capability. We evaluated uncertainty through the mean per-pixel entropy of networks trained on CamVid or bacteria with respect to a random subset of 250 PascalVOC images [4]. The number of MACs was calculated with respect to a fixed input size. We initialised the weights of all ComBiNets with the He-Uniform initialisation [8]. To train, we used Adam for 800 epochs with an exponentially decaying learning rate. We trained ComBiNets with a fixed batch size and with BN applied to each batch individually, as we found it essential not to use train-time statistics during evaluation. We fixed the number of Monte Carlo samples for the quantitative and qualitative software evaluation. For the quantitative evaluation we measured the standard per-pixel mean intersection over union (mIoU), entropy, MACs and the number of trainable parameters. We repeated each experiment 3 times, from which we report the mean and a single standard deviation in the following tables.
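For reference, the mIoU metric reported in the tables can be computed per image as in the following sketch (a straightforward reference implementation, not the authors' evaluation code; the ignore index is a placeholder).

```python
import torch

def mean_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int,
             ignore_index: int = 255) -> float:
    """Per-pixel mean intersection-over-union across classes present in the image."""
    ious = []
    valid = target != ignore_index          # mask out the ignored background class
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        target_c = (target == c) & valid
        union = (pred_c | target_c).sum().item()
        if union == 0:
            continue                         # class absent from prediction and ground truth
        ious.append((pred_c & target_c).sum().item() / union)
    return sum(ious) / len(ious) if ious else float('nan')
```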

Figure 2: Qualitative evaluation. (From left) The first column depicts the input image, the second column the ground-truth segmentation mask, and the remaining columns show the predictions of ComBiNet-87, DenseNet-103 + CD and DeepLab-v3+-ResNet50.

4.1 CamVid

The CamVid road scenes dataset [2] originates from fully segmented videos captured from the perspective of a driving car. It consists of 367 frames for training, 101 frames for validation and 233 frames for testing of RGB images. There are 11 manually labelled classes, including roads, cars, signs etc., and a background class that is usually ignored during training and evaluation. To augment the training data we carried out channel-wise normalisation and applied the following random augmentations: input re-scaling; aspect-ratio changes; square crops; horizontal flips; and random colour changes with respect to contrast, saturation and hue. We used the combo loss function [25], weighted proportionally to the class-pixel counts in the images, as CamVid is unbalanced. Weight decay was also applied.

We summarise the performance of the different ComBiNets in Table 1, comparing them to other state-of-the-art segmentation networks, including those focused on hardware efficiency with respect to their number of parameters and those considering Bayesian inference. The results show that all ComBiNets obtained competitive mIoU with significantly fewer parameters and MACs. One result that stands out is [30], which used video, fine-tuning and an overparametrised architecture. ComBiNet-87 is the most accurate of the ComBiNets, with substantially fewer parameters and MACs than its closest competitor in accuracy. ComBiNet-51 is the most hardware-efficient, with far fewer parameters and MACs than the Bayesian SegNet, while achieving an accuracy that is still close to the related works. We also compared the pixel-wise entropy, in which ComBiNets are marginally better in comparison to [12, 7]. In Figures 2 and 3 we demonstrate the qualitative results. In general, the model is more uncertain about objects that are more distant, occluded or surrounded by the background class (black), which was ignored during training and evaluation. The segmentation results showed that the most problematic classes were fence and sign/symbol, whilst roads and the sky were most accurately distinguished. Figure 2 demonstrates on one sample that the model is accurate also in comparison to the related work, comprising a non-Bayesian and a Bayesian model.

Method mIoU [%] Params [M] MACs [G] Entropy [nats]
SegNet [1] 55.6 29.7 - -
Bayesian SegNet [12] 63.1 29.7 30.8 0.68
DenseNet-103 [11] 66.9 9.4 24.9 -
DenseNet-103 + CD [7] 67.4 9.4 24.9 0.47
ESPNet [17] 55.6 0.36 - -
BiSeNet [26] 65.6 5.8 - -
ICNet [28] 67.1 6.7 - -
Compact Nets [19] 63.9 0.28 - -
DeepLab-v3+-ResNet50 [3] 57.6 16.6 13.2 -
Video-WideResNet38 [30] 79.8 137.1 - -
ComBiNet-51 66.1 ± 0.3 0.7 4.2 0.69 ± 0.02
ComBiNet-62 66.9 ± 0.2 1.3 7.9 0.68 ± 0.01
ComBiNet-87 67.9 ± 0.1 2.3 9.4 0.65 ± 0.02
Table 1: Comparison with other networks on the CamVid test dataset. Some related works were trained and tested at a different input resolution, and some values were replicated in this work rather than taken from official reports; entries reporting an entropy value correspond to Bayesian approaches. '-' denotes not reported.

4.2 Bacteria

The bacteria dataset [20] comprises 366 darkfield microscopy images with manually annotated masks for segmentation. The task is to detect bacteria of the phylum Spirochaetes in blood. This leads to a problem of segmenting two classes corresponding to bacteria and red blood cells, Spirochaetes and Erythrocytes respectively. This is a challenging task due to the nature of the problem, a heavily unbalanced dataset, and a collection methodology that results in considerably noisy RGB input images of varying sizes. We randomly split the dataset into 219, 73 and 74 images for training, validation and testing respectively. We then apply the same augmentations as those described in Section 4.1 for the CamVid dataset, extended further with vertical flips. We train with the combo loss function with an added log-dice term, and weight decay was also applied.

Method mIoU [%] Params [M] MACs [G] Entropy [nats]
Bayesian SegNet [12] 76.1 29.7 30.8 0.19
DenseNet-103 + CD [7] 75.8 9.4 24.9 0.32
U-Net [22] 71.4 31.0 41.9 -
DeepLab-v3+-ResNet50 [3] 80.4 16.6 13.2 -
ComBiNet-51 82.3 ± 0.4 0.7 4.2 0.18 ± 0.02
ComBiNet-62 83.0 ± 0.4 1.3 7.9 0.16 ± 0.01
ComBiNet-87 82.3 ± 0.2 2.3 9.4 0.16 ± 0.02
Table 2: Comparison with other networks on the bacteria test dataset. Some values were replicated in this work rather than taken from official reports; entries reporting an entropy value correspond to Bayesian approaches. '-' denotes not reported.
Figure 3: Qualitative evaluation. (From left) The first column depicts the input image, the second column shows the ground-truth segmentation masks, the third column the predictions, and the fourth column the per-pixel uncertainties measured through the predictive entropy of ComBiNet-87 and ComBiNet-62. The top two rows correspond to the CamVid models and the bottom two rows to the bacteria models.

Table 2 shows that all ComBiNets obtain better accuracy with significantly fewer parameters and MACs. ComBiNet-51 is the most hardware-efficient, with far fewer parameters and MACs than DenseNet-103 + CD. We note that ComBiNet-87 achieves a worse accuracy than ComBiNet-62 in our experiments with this dataset, showing that a bigger network is not always the best. All ComBiNets classify unrecognisable objects as background, resulting in smaller entropy than the related work. The qualitative evaluation in Figures 3 and 4 demonstrates the ability of the architecture to segment noisy images, while comparing it to DenseNet with Concrete Dropout (CD) [7] and Bayesian SegNet. We further depict the corresponding predictive uncertainty of this sample in Figure 5, which helps us understand the portions of the image where the architecture was less certain in its predictions. It can be seen that the network is uncertain about suspicious bacteria bodies, which can further help practitioners to better understand their samples.

Figure 4: Qualitative evaluation. (From left) The first column depicts the input image, the second column the ground-truth segmentation mask, and the remaining columns show the predictions of ComBiNet-62, DenseNet-103 + CD and Bayesian SegNet.
Figure 5: Qualitative evaluation of predictive uncertainty. (From left) The first column depicts the input image, the second column the ground-truth segmentation mask, and the remaining columns show the corresponding predictive uncertainties of ComBiNet-62, DenseNet-103 + CD and Bayesian SegNet.

4.3 Discussion

With respect to the qualitative results in Figure 3, along with the quantified uncertainty measured by per-pixel information entropy, we observe that, due to the skip connections and the gradual downsampling and upsampling, the model retains sharp edges and detail in the predictions. It is important to highlight that in image regions that were misclassified, the model was also more uncertain.

The main bottleneck of this work lies in its use of MCD for Bayesian inference, as it requires multiple feedforward runs, although no extra network weights, to obtain an uncertainty estimate for the output mask. These runs multiply the MAC cost and hence represent a trade-off between hardware demand and the quality of the approximation of the predictive distribution. For this reason, lowering the MACs of an individual feedforward pass was the focus of this work. Additionally, in hardware it is possible to simply parallelise these runs [18]. Lastly, if uncertainty estimation is not needed, the presented networks can still guarantee high accuracy through weight averaging, disabling dropout and performing a single forward pass; the resulting accuracy was lower by approximately one standard deviation compared to the values shown in Tables 1 and 2 for CamVid and bacteria respectively.

5 Conclusion

We propose a compact Bayesian architecture, ComBiNet, that re-purposes hardware-efficient operations for the task of image segmentation. We demonstrated that good accuracy along with predictive uncertainties can be achieved with significantly fewer parameters and MACs, lowering hardware resource and computational costs. We show that ComBiNet performs well on an imbalanced dataset, as well as on the established CamVid dataset, exhibiting higher uncertainty in misclassified sections. Furthermore, it was not necessary to perform any pre-training or post-training fine-tuning to reach the observed accuracy. In the future, we would like to measure and optimise the architectures with respect to other hardware performance metrics, such as power consumption, and to explore structured instance-wise uncertainty estimation instead of pixel-wise estimation.

References

  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495. Cited by: §1, §1, §2, §2, §2, §2, Table 1.
  • [2] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla (2008) Segmentation and recognition using structure from motion point clouds. In ECCV, pp. 44–57. Cited by: §1, §4.1.
  • [3] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pp. 801–818. Cited by: §1, §2, §2, §3.1.2, Table 1, Table 2.
  • [4] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §4.
  • [5] Y. Gal and Z. Ghahramani (2015) Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158. Cited by: §1, §2, §3.1.1, §3.2.
  • [6] Y. Gal and Z. Ghahramani (2015) Dropout as a bayesian approximation. arXiv preprint arXiv:1506.02157. Cited by: §1, §1, §2, §3.2.
  • [7] Y. Gal, J. Hron, and A. Kendall (2017) Concrete dropout. In NeurIPS, pp. 3581–3590. Cited by: §1, §2, §4.1, §4.2, Table 1, Table 2.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §1, §1, §2, §2, §4.
  • [9] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1, §1, §2, §2.
  • [10] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, pp. 4700–4708. Cited by: §1, §3.1.1.
  • [11] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio (2017) The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In CVPR workshops, pp. 11–19. Cited by: Figure 1, §1, §1, §1, §2, §2, §3.1.1, §3, Table 1.
  • [12] A. Kendall, V. Badrinarayanan, and R. Cipolla (2015) Bayesian segnet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680. Cited by: §1, §1, §2, §2, §4.1, Table 1, Table 2.
  • [13] F. Liang, Q. Li, and L. Zhou (2018) Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association 113 (523), pp. 955–972. Cited by: §2.
  • [14] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In CVPR, pp. 82–92. Cited by: §1.
  • [15] J. Long, E. Shelhamer, and T. Darrell (2014) Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038. Cited by: §1, §2, §2.
  • [16] R. McAllister, Y. Gal, A. Kendall, M. Van Der Wilk, A. Shah, R. Cipolla, and A. Weller (2017) Concrete problems for autonomous vehicle safety: advantages of bayesian deep learning. In IJCAI, IJCAI'17, pp. 4745–4753. Cited by: §1, §2.
  • [17] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi (2018) Espnet: efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, pp. 552–568. Cited by: §1, §2, Table 1.
  • [18] T. Myojin, S. Hashimoto, and N. Ishihama (2020) Detecting uncertain bnn outputs on FPGA using monte carlo dropout sampling. In ICANN, pp. 27–38. Cited by: §4.3.
  • [19] V. Nekrasov, C. Shen, and I. Reid (2020) Template-based automatic search of compact semantic segmentation architectures. In The IEEE Winter Conference on Applications of Computer Vision, pp. 1980–1989. Cited by: §1, §2, §2, Table 1.
  • [20] L. Nguyen (2020) Bacteria detection with darkfield microscopy. Note: data retrieved from work at Hochschule Heilbronn, www.kaggle.com/longnguyen2306/bacteria-detection-with-darkfield-microscopy/metadata Cited by: §1, §4.2.
  • [21] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo (2017) Erfnet: efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 19 (1), pp. 263–272. Cited by: §2.
  • [22] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: Figure 1, §1, §1, §2, §2, §3, Table 2.
  • [23] J. Ruiz-del-Solar, P. Loncomilla, and N. Soto (2018) A survey on deep learning methods for robot vision. arXiv preprint arXiv:1803.10862. Cited by: §1.
  • [24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §2, §3.2.
  • [25] S. A. Taghanaki, Y. Zheng, S. K. Zhou, B. Georgescu, P. Sharma, D. Xu, D. Comaniciu, and G. Hamarneh (2019) Combo loss: handling input and output imbalance in multi-organ segmentation. Computerized Medical Imaging and Graphics, pp. 24–33. Cited by: §4.1.
  • [26] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) Bisenet: bilateral segmentation network for real-time semantic segmentation. In ECCV, pp. 325–341. Cited by: §1, §2, Table 1.
  • [27] R. Zhang (2019) Making convolutional networks shift-invariant again. arXiv preprint arXiv:1904.11486. Cited by: §1, §3.1.1, §3.1.1, §3.1.2.
  • [28] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia (2018) Icnet for real-time semantic segmentation on high-resolution images. In ECCV, pp. 405–420. Cited by: §1, Table 1.
  • [29] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, pp. 2881–2890. Cited by: §1, §1, §2, §3.1.2.
  • [30] Y. Zhu, K. Sapra, F. A. Reda, K. J. Shih, S. Newsam, A. Tao, and B. Catanzaro (2019) Improving semantic segmentation via video propagation and label relaxation. In CVPR, pp. 8856–8865. Cited by: §4.1, Table 1.