1 Introduction
Nondestructive 3D imaging techniques allow scientists to study the interior of objects that cannot otherwise be observed. For example, radiologists use X-ray Computed Tomography (CT) to measure organ perfusion and Magnetic Resonance Imaging (MRI) to diagnose prostate carcinoma, among other applications [Buzug, Nitz]. In addition to medical applications, CT scans are used in manufacturing to identify defects before a part is deployed in a production environment and to certify physical properties of materials. A critical step in the analysis of CT scans is segmentation, wherein an analyst labels each voxel in a scan (e.g., as a tumor in the medical case or as a defect in the manufacturing case). However, due to the noise and artifacts found in CT scans along with human error, these segmentations are often expensive, irreproducible, and unreliable [Martinez]. Deep learning models such as convolutional neural networks (CNNs) have revolutionized the automated segmentation of 3D imaging by providing a fast, accurate solution to many challenges in segmentation.
For use with high-consequence part certification, segmentation must include uncertainty quantification (UQ). When deploying critical parts, such as those in cars and airplanes, analysts must provide accurate safety confidence intervals. Recent research casts deep neural networks as probabilistic models in order to obtain uncertainty measurements. Two common UQ architectures are Monte Carlo dropout networks (MCDNs) [Gal] and variational inference-based Bayesian neural networks (BNNs) [Blundell]. MCDNs are easy to implement and enable UQ in the output space with little computational cost. In contrast, BNNs measure uncertainty in the weight space, resulting in mathematically grounded, comprehensive UQ at the cost of at least double the number of trainable parameters and increased convergence time [Gal]. These difficulties combined with the curse of dimensionality have prevented the successful implementation of variational inference-based BNNs in 3D domains.
Our contributions combine the previously distinct subfields of volumetric segmentation and UQ with our novel 3D Bayesian CNN (BCNN) architecture, which effectively predicts binary segmentations of CT scans of engineering materials in addition to generating interpretable, comprehensive uncertainty maps. In contrast to the theory that a variational inference-based Bayesian architecture is computationally infeasible [Gal, Lak], especially in 3D, we show via experimental results on CT scan datasets of lithium-ion battery electrode materials and laser-welded metals that our BCNN outperforms the regularly adapted MCDN. As shown in Figure 1, the BCNN segmentation results in a continuous uncertainty map with gradients that enable uncertainty quantification in numerical simulations that are sensitive to variances in geometry. To the best of our knowledge, our BCNN is the first variational inference-based model designed for segmentation and UQ in a 3D domain.
2 Related Work
In this section, we describe recent publications in volumetric segmentation and UQ which enabled the success of our BCNN.
2.1 Volumetric Segmentation
The problem of volumetric segmentation has seen much high-impact work in the past three years. The 2D Fully Convolutional Network [Long] and U-Net [Ronneberger] led Milletari et al. [Milletari] to propose the first 3D CNN for binary segmentation of MRI images, called V-Net. At around the same time, Çiçek et al. [Cicek] proposed 3D U-Net, a direct extension of the U-Net to a 3D domain. While V-Net was designed for binary segmentation of the human prostate and 3D U-Net was designed for binary segmentation of the kidney of the Xenopus, they both employ an encoder-decoder architecture inspired by U-Net [Milletari, Cicek]. In this technique, a 3D volume is mapped to a latent space via successive convolutional and pooling layers; this latent representation is then upsampled and convolved until it reaches the size of the original volume and outputs the resulting per-voxel segmentation [Ronneberger].
While most volumetric segmentation work pertains to the medical field, 3D materials segmentation is also an active area of research due to the importance of quality segmentations in physics simulations. In 2018, Konopczyński et al. [Kono] employed fully convolutional networks to segment CT scan volumes of short glass fibers, outperforming traditional non-deep-learning techniques and achieving the first accurate results in low-resolution fiber segmentation. More recently, MacNeil et al. [MacNeil2019] proposed a semi-supervised algorithm for segmentation of woven carbon fiber volumes from sparse input.
2.2 Uncertainty Quantification
While deep learning models often outperform traditional statistical approaches in terms of accuracy and generalizability, they do not have built-in uncertainty measurements like their statistical counterparts. Gal and Ghahramani [Gal] showed that predictive probabilities (i.e., the softmax outputs of a model) are often erroneously interpreted as an uncertainty metric. Instead, recent work has cast neural networks as Bayesian models via approximating probabilistic models [Gal] or utilized variational inference to learn the posterior distribution of the network weights [Blundell].
2.2.1 Monte Carlo Dropout Networks
Gal and Ghahramani [Gal] showed that a neural network with dropout applied before every weight layer (an MCDN) is mathematically equivalent to an approximation to Damianou and Lawrence’s [Damianou] deep Gaussian process. Specifically, one can approximate a deep Gaussian process with a given covariance function by placing a variational distribution over each component of a spectral decomposition of the covariance function; this maps each layer of the deep Gaussian process to a layer of hidden units in a neural network. By averaging stochastic forward passes through the dropout network at inference time, one obtains a Monte Carlo approximation of the intractable approximate predictive distribution of the deep Gaussian process [Gal]; thus the voxel-wise standard deviations of the predictions are usable as an uncertainty metric.
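As a minimal sketch of the Monte Carlo procedure described above, the following NumPy example keeps dropout active at inference time, runs several stochastic forward passes, and reports the voxel-wise mean and standard deviation. The one-layer sigmoid model, the function name `mc_dropout_predict`, and the toy data are assumptions made purely for illustration; they are not the networks used in this work.

```python
import numpy as np

def mc_dropout_predict(x, weights, drop_rate, n_samples, rng):
    """Run n_samples stochastic forward passes of a toy one-layer sigmoid model."""
    keep = 1.0 - drop_rate
    samples = []
    for _ in range(n_samples):
        # Draw a fresh Bernoulli dropout mask for every forward pass.
        mask = rng.random(x.shape) < keep
        h = np.where(mask, x / keep, 0.0)  # inverted dropout scaling
        logits = h @ weights
        samples.append(1.0 / (1.0 + np.exp(-logits)))
    samples = np.stack(samples)  # shape: (n_samples, n_voxels)
    # Mean is the prediction; standard deviation is the uncertainty estimate.
    return samples.mean(axis=0), samples.std(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # 5 toy "voxels" with 8 features each
weights = rng.normal(size=8)  # hypothetical fixed weights
mean_pred, uncertainty = mc_dropout_predict(x, weights, 0.5, 48, rng)
```

With a dropout rate of zero every pass is identical, so the standard deviation collapses to zero; the uncertainty in this scheme comes entirely from the dropout masks.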
A key benefit of the MCDN is its ease of implementation; as an architecture-agnostic technique which is dependent only on the dropout layers, Monte Carlo dropout can easily be added to very large networks without an increase in parameters. As a result, MCDNs have been implemented with good results in several different applications. In particular, Liu et al. [Liu] successfully implemented a 3D MCDN for UQ in binary segmentations of MRI scans of the amygdala, and Martinez et al. [Martinez] used V-Net with Monte Carlo dropout for UQ in binary segmentations of CT scans of woven composite materials.
While the MCDN is one of the most common UQ architectures used in deep learning, its statistical soundness has been called into question. Osband [Osband] argues that Monte Carlo dropout provides an approximation to the risk of a model rather than its uncertainty (in other words, that it approximates the inherent stochasticity of the model rather than the variability of the model’s posterior belief). Osband [Osband] also shows that the posterior distribution given by dropout does not necessarily converge as more data is gathered; instead, the posterior depends only on the interaction between the dropout rate and the model size.
2.2.2 Bayesian Neural Networks
Another approach to UQ in deep neural networks is Bayesian learning via variational inference (a BNN). Instead of point estimates, the network learns the posterior distribution over the weights given the dataset, denoted $P(\mathbf{w} \mid \mathcal{D})$, given the prior distribution $P(\mathbf{w})$. However, calculating the exact posterior distribution is intractable due to the extreme overparametrization found in neural networks. Previous work by Hinton and Van Camp [Hinton] and Graves [Graves] proposed variational learning as a method to approximate the posterior distribution. Variational learning finds the parameters $\theta$ of the distribution $q(\mathbf{w} \mid \theta)$ via the minimization of the variational free energy cost function, which is the negative of the evidence lower bound (ELBO). It consists of the sum of the Kullback-Leibler (KL) divergence and the negative log-likelihood (NLL), which Blundell et al. [Blundell] explain as embodying a trade-off between satisfying the complexity of the dataset (represented by the NLL term) and satisfying the simplicity prior (represented by the KL term):
$$\mathcal{F}(\mathcal{D}, \theta) = \mathrm{KL}\left[q(\mathbf{w} \mid \theta) \,\|\, P(\mathbf{w})\right] - \mathbb{E}_{q(\mathbf{w} \mid \theta)}\left[\log P(\mathcal{D} \mid \mathbf{w})\right] \tag{1}$$
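To illustrate the two terms of the variational free energy, the sketch below evaluates a KL term and an NLL term in NumPy. It assumes a fully factorized Gaussian posterior and a standard normal prior, for which the KL divergence has a closed form, and a Bernoulli likelihood for binary labels; the NLL is evaluated for a single set of predicted probabilities, standing in for a one-sample estimate of the expectation. The function names are hypothetical.

```python
import numpy as np

def kl_gaussian_to_standard_normal(mu, sigma):
    """KL[N(mu, sigma^2) || N(0, 1)], summed over all weights."""
    return np.sum(0.5 * (sigma**2 + mu**2 - 1.0) - np.log(sigma))

def bernoulli_nll(probs, labels, eps=1e-7):
    """Negative log-likelihood of binary labels under predicted probabilities."""
    p = np.clip(probs, eps, 1.0 - eps)
    return -np.sum(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

def free_energy(mu, sigma, probs, labels):
    """Complexity cost (KL) plus data-fit cost (NLL)."""
    return kl_gaussian_to_standard_normal(mu, sigma) + bernoulli_nll(probs, labels)

# Toy values: a posterior equal to the prior has zero KL cost.
mu = np.zeros(3)
sigma = np.ones(3)
probs = np.full(4, 0.5)
labels = np.array([0.0, 1.0, 1.0, 0.0])
cost = free_energy(mu, sigma, probs, labels)
```

Moving the posterior mean away from zero or shrinking its variance raises the KL term, which is the sense in which that term enforces the simplicity prior.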
Blundell et al. [Blundell] proposed the Bayes by Backprop algorithm, which combines variational inference with traditional backpropagation to find the best approximation to the posterior in a computationally feasible manner. Bayes by Backprop works by using the gradients calculated in backpropagation to “scale and shift” the variational parameters of the posterior, thus updating the posterior with minimal additional computation [Blundell].
One challenge associated with probabilistic weights is that all examples in a minibatch typically have similarly sampled weights, limiting the variance reduction effect of large minibatches [Wen]. Kingma and Welling [KingmaWelling] introduced local reparametrization, which greatly reduces the variance of stochastically sampled weights by transforming global weight uncertainty into independent local noise across examples in the minibatch. In a similar vein, Wen et al. [Wen] proposed the Flipout estimator, which empirically achieves ideal variance reduction by sampling weights pseudo-independently for each example. An important difference is that local reparametrization works only for fully connected networks, while Flipout can be used effectively in fully connected, convolutional, and recurrent networks [Wen].
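The idea behind Flipout can be sketched for a single fully connected layer. One weight perturbation is sampled for the whole minibatch, and each example then multiplies it elementwise by its own rank-one matrix of random signs, which can be done without ever materializing per-example weight matrices. The function name `flipout_dense` and the toy shapes are assumptions for illustration only; the convolutional case used in this work follows the same principle but is not shown.

```python
import numpy as np

def flipout_dense(x, w_mean, delta_w, rng):
    """Dense layer output with per-example pseudo-independent weight noise.

    x:       (batch, d_in) inputs
    w_mean:  (d_in, d_out) posterior mean of the weights
    delta_w: (d_in, d_out) one shared perturbation sampled for the minibatch
    """
    n, d_in = x.shape
    d_out = w_mean.shape[1]
    # Independent random sign vectors for every example in the minibatch.
    s = rng.choice([-1.0, 1.0], size=(n, d_in))
    r = rng.choice([-1.0, 1.0], size=(n, d_out))
    # Equivalent to using weights w_mean + delta_w * outer(s_n, r_n) per example.
    y = x @ w_mean + ((x * s) @ delta_w) * r
    return y, s, r

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 4))
w_mean = rng.normal(size=(4, 2))
delta_w = 0.1 * rng.normal(size=(4, 2))
y, s, r = flipout_dense(x, w_mean, delta_w, rng)
```

The returned sign vectors make it possible to check the result against the explicit per-example weight matrices, which is how the equivalence can be verified on small inputs.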
To the best of our knowledge, there are no implementations of a 3D BCNN with variational inference such as ours in the literature. In the 2D domain, Shridhar et al. [Shridhar] proposed a 2D BCNN with variational inference that extended local reparametrization to convolutional networks with a double-convolution approach and Softplus normalization [Shridhar]. In contrast, we employ the Flipout estimator [Wen], which Shridhar et al. do not reference. Furthermore, Ovadia et al. [Ovadia] showed that 2D BCNNs with Flipout and variational inference are effective for UQ on the MNIST and CIFAR-10 datasets, but they found it was difficult to get BCNNs to work with complex datasets. We provide a solution via our 3D BCNN which is effective across multiple high-complexity datasets with tens of millions of voxels.
3 Methodology
In this section, we present our BCNN architecture and describe our reasoning behind several design decisions.
3.1 Architecture
In Figure 2, we present a schematic representation of our BCNN architecture. Similarly to V-Net, we employ an encoder-decoder architecture. The encoder half (left) of the network compresses the input into a latent space while the decoder half (right) decompresses the latent representation of the input into a segmentation map. We do not include stochastic layers in the encoder half of the network to maximize the amount of information transfer between the original volume and the latent space.
The encoder half of the network consists of four stages, each with two convolutional layers and normalization layers followed by a max pooling layer to reduce the size of the input. Thus, after each stage, our volume’s depth, height, and width are halved while its channels are doubled, reducing the size of our volume by a factor of four.
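The factor of four follows from simple shape arithmetic: halving three spatial dimensions divides the element count by eight, and doubling the channels multiplies it by two. The short sketch below checks this; the starting shape is a hypothetical example and not the chunk size used in our experiments.

```python
def encoder_stage_shape(shape):
    """Shape after one encoder stage: spatial dims halved, channels doubled."""
    depth, height, width, channels = shape
    return (depth // 2, height // 2, width // 2, channels * 2)

def num_elements(shape):
    depth, height, width, channels = shape
    return depth * height * width * channels

in_shape = (64, 64, 64, 8)                 # hypothetical (D, H, W, C)
out_shape = encoder_stage_shape(in_shape)  # (32, 32, 32, 16)
reduction = num_elements(in_shape) / num_elements(out_shape)
```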
The decoder half of the network consists of three stages, corresponding to the first three stages of the encoder half. First, we upsample the output of the previous layer and apply convolutional and normalization layers to double our volume’s depth, height, and width while halving its channels. We then concatenate this volume with the pre-pooling output of the corresponding encoder stage; this skip connection assists in feature forwarding through the network. Then, we apply two more convolutional and normalization layers. At the end of the third stage, we apply a final convolutional layer as well as a sigmoid activation. This results in a volume of the same size as the input representing a binary segmentation probability map.
In the decoder half of the network, we implement volumetric convolutional layers with distributions over the weights. Each Bayesian convolutional layer is initialized with a prior and employs the aforementioned Flipout estimator to approximate the distribution during forward passes [Wen]. Our implementation draws from the Bayesian Layers library [Tran] included in TensorFlow Probability [Dillon], which keeps track of losses representing the KL divergence of the layer’s posterior distribution with respect to its prior. Our BCNN has 1,924,964 trainable parameters, while its Monte Carlo dropout counterpart has 1,403,059.
3.2 Design Decisions
Since training volumes can be quite large, our batch size is constrained by the amount of available GPU memory, resulting in a batch size too small for batch normalization to accurately compute batch statistics. Thus, we implement a recent technique proposed by Wu and He [Wu] called group normalization, which normalizes groups of channels and is shown to have accurate performance independent of batch size. Proper normalization was observed to be a critical factor in the convergence of our model; by tuning the number of groups used in the group normalization layers, we found that our model converged most reliably when using 4 groups.
At each downward layer , we apply filters. This was found to be more effective than a simpler model with filters and a more complex model with filters. We hypothesize that some minimum number of learned parameters was necessary to produce accurate segmentations, but that with filters, the overparameterization made training significantly more difficult.
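To make the group normalization step described above concrete, the following NumPy sketch normalizes a channels-last 5D tensor using 4 groups. Statistics are computed per example over the spatial dimensions and the channels within each group, so the result does not depend on the batch size. The learned per-channel scale and offset of a full implementation are omitted here, and the tensor shape is a hypothetical example.

```python
import numpy as np

def group_norm(x, num_groups=4, eps=1e-5):
    """Group normalization for x of shape (batch, depth, height, width, channels)."""
    n, d, h, w, c = x.shape
    if c % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    # Split the channel axis into (groups, channels_per_group).
    g = x.reshape(n, d, h, w, num_groups, c // num_groups)
    # Statistics per example and per group; the batch axis is never reduced.
    mean = g.mean(axis=(1, 2, 3, 5), keepdims=True)
    var = g.var(axis=(1, 2, 3, 5), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, d, h, w, c)

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=(1, 4, 4, 4, 8))  # batch size of 1
y = group_norm(x, num_groups=4)
```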
We tested many priors, including scale mixture [Blundell], spike-and-slab [Mitchell], and a normal distribution with increased variance, but found that a standard normal prior provided the best balance between weight initialization and weight exploration. Skip connections were found to slightly increase the accuracy of our predictions by forwarding fine-grained features that otherwise would have been lost in the encoder half of the network. We experimented with both max pooling and downward convolutional layers and observed negligible difference.
4 Experiments
In this section, we describe our datasets and detail our training and testing procedures.
4.1 Datasets
Two 3D imaging datasets are used to test our BCNN. The first is a series of CT scans of graphite electrodes for lithium-ion batteries, which we refer to as the Graphite dataset [Mueller2018, Pietsch2018]. This material consists of nonspherical particles (dark objects in the images) that are coated onto a substrate and calendered to densify. The academically manufactured (“numbered”) electrodes [Mueller2018] were imaged with 325 nm resolution and a domain size of m. The commercial (“named”) electrodes [Pietsch2018] were imaged at 162.5 nm resolution and a domain size of m. Eight samples were studied, each with approximately one billion voxels. Each volume was hand-segmented using commercial tools [Norris2020]; these manual segmentations were used for training and testing. We trained our BCNN on the GCA400 volume and tested on the remaining seven electrodes.
Laser-welded metal joints comprise a second dataset, which we refer to as the Laser Weld dataset. To generate these volumes, two metal pieces are put into contact and joined with an incident laser beam. The light regions of the resulting scans represent voids or defects in the weld. The Laser Weld dataset consists of CT scans of nine laser-welded metal joint examples, each with tens of millions of voxels. Similarly to the battery materials, these volumes were manually segmented and used for training and testing. We trained a separate BCNN on samples S2, S24, and S25, testing on the remaining six held-out volumes.
For both datasets, we normalized each CT scan to have voxel values with zero mean and unit variance. Additionally, each CT scan was large enough to require that we process subvolumes of the 3D image rather than ingesting the entire scan as a whole into the neural network on the GPU.
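The preprocessing described above can be sketched as follows: the scan is standardized to zero mean and unit variance, then cut into subvolumes that fit in GPU memory. The non-overlapping cubic chunks, the chunk edge length, and the function names are assumptions for illustration; the sketch also assumes each dimension is an exact multiple of the chunk size.

```python
import numpy as np

def normalize(volume):
    """Standardize voxel values to zero mean and unit variance."""
    return (volume - volume.mean()) / volume.std()

def split_into_chunks(volume, chunk):
    """Cut a 3D volume into non-overlapping cubic chunks of edge length `chunk`."""
    d, h, w = volume.shape
    if d % chunk or h % chunk or w % chunk:
        raise ValueError("volume dimensions must be multiples of the chunk size")
    chunks = []
    for z in range(0, d, chunk):
        for y in range(0, h, chunk):
            for x in range(0, w, chunk):
                chunks.append(volume[z:z + chunk, y:y + chunk, x:x + chunk])
    return np.stack(chunks)

rng = np.random.default_rng(3)
scan = rng.normal(loc=100.0, scale=15.0, size=(8, 8, 8))  # toy stand-in for a CT scan
normalized = normalize(scan)
chunks = split_into_chunks(normalized, chunk=4)  # hypothetical chunk size
```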
4.2 Training
We use the Adam optimizer [KingmaBa] with learning rate initialized at for the Graphite dataset and for the Laser Weld dataset; this difference is necessary because the volumes in the Graphite dataset are significantly larger than those of the Laser Weld dataset. Our learning rate exponentially decays after the tenth epoch as detailed in Equation 2; this decay was necessary for the reliable convergence of our model, likely due to its stochastic nature.
(2)
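We use the aforementioned Bayes by Backprop algorithm to train our BCNN, minimizing the variational free energy cost function as stated in Equation 1. Graves [Graves] notes that variational free energy is amenable to minibatch optimization by scaling the cost for minibatch $i = 1, \dots, M$ as:
$$\mathcal{F}_i(\mathcal{D}_i, \theta) = \frac{1}{M} \mathrm{KL}\left[q(\mathbf{w} \mid \theta) \,\|\, P(\mathbf{w})\right] - \mathbb{E}_{q(\mathbf{w} \mid \theta)}\left[\log P(\mathcal{D}_i \mid \mathbf{w})\right] \tag{3}$$
Here $q(\mathbf{w} \mid \theta)$ is the variational posterior over the weights, $P(\mathbf{w})$ is the prior, and $\mathcal{D}_i$ is the $i$-th of $M$ minibatches. The factor $1/M$ divides the KL divergence penalty such that it is distributed evenly over each minibatch; without this scaling, the KL divergence term dominates the equation, causing the model to converge to a posterior with suboptimal accuracy.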
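A small numerical sketch shows why this scaling is consistent: when the KL term is divided by the number of minibatches, the per-minibatch costs sum over one epoch to the full free energy, with the KL penalty counted exactly once. The numbers below are arbitrary toy values, and `minibatch_cost` is a hypothetical helper name.

```python
def minibatch_cost(kl, nll_batch, num_batches):
    """Free energy for one minibatch with the KL penalty spread evenly."""
    return kl / num_batches + nll_batch

kl = 1200.0                         # toy KL divergence of posterior from prior
nll_per_batch = [3.5, 4.1, 2.9, 3.8]  # toy NLL values for four minibatches
num_batches = len(nll_per_batch)

epoch_cost = sum(minibatch_cost(kl, nll, num_batches) for nll in nll_per_batch)
full_cost = kl + sum(nll_per_batch)

# Without scaling, the KL term would be counted once per minibatch.
unscaled_cost = sum(kl + nll for nll in nll_per_batch)
```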
We parallelized our model and trained on two NVIDIA Tesla V100 GPUs with 32GB of memory each. For our BCNN, one epoch of 1331 chunks of size took approximately 17 minutes and 30 seconds with a maximum batch size of 3. We trained each model for 21 epochs on the Graphite dataset; for the Laser Weld dataset, the MCDN converged much faster, so we trained our BCNN for 27 epochs and the MCDN for 10 epochs.
4.3 Testing
We computed 48 Monte Carlo samples on each test volume to obtain a distribution of sigmoid values for each voxel. The Monte Carlo dropout technique is justified in representing uncertainty as the standard deviation of the sigmoid values because it approximates a Gaussian process [Gal]; however, the BCNN does not guarantee adherence to a normal distribution. Thus, in order to effectively compare the outputs of both networks, we represent confidence intervals on the segmentation as the second and eighth deciles of the sigmoid values, and uncertainty as the difference. We compare our results against an MCDN of identical architecture to our BCNN except with regular convolutional layers instead of Bayesian convolutional layers and spatial dropout [Tompson] applied at the end of each stage prior to upsampling.
5 Results
In this section, we present inference results of our BCNN and compare its performance with the MCDN.
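The uncertainty values reported in this section follow the decile rule of Section 4.3. As a minimal sketch, given a stack of Monte Carlo sigmoid samples per voxel, the second and eighth deciles (the 20th and 80th percentiles) bound the confidence interval and their difference is the uncertainty. The random samples below are placeholders for real network outputs, and the function name is hypothetical.

```python
import numpy as np

def decile_uncertainty(samples):
    """Confidence bounds and uncertainty from samples of shape (n_samples, ...)."""
    lower = np.percentile(samples, 20, axis=0)  # second decile
    upper = np.percentile(samples, 80, axis=0)  # eighth decile
    return lower, upper, upper - lower

rng = np.random.default_rng(4)
samples = rng.random(size=(48, 4, 4, 4))  # 48 toy Monte Carlo samples per voxel
lower, upper, uncertainty_map = decile_uncertainty(samples)
```

Because the rule uses order statistics rather than a standard deviation, it makes no assumption that the per-voxel samples are normally distributed.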
Sample  Method  Accuracy  UQ Mean () 

I  MCDN  
BCNN (ours)  
III  MCDN  
BCNN (ours)  
IV  MCDN  
BCNN (ours)  
GCA2000  MCDN  
BCNN (ours)  
25R6  MCDN  
BCNN (ours)  
E35  MCDN  
BCNN (ours)  
Litarion  MCDN  
BCNN (ours) 
5.1 Graphite Dataset
Figure 3 shows a successful segmentation and uncertainty measurements on the GCA2000 sample from the Graphite dataset. Our BCNN provides an equivalent or better segmentation than the MCDN and produces a usable, credible uncertainty map. Figure 1 shows a zoomed-in portion of the III sample uncertainty map which highlights the continuity and visual gradients captured in our BCNN uncertainty map, while the MCDN produces uninterpretable voxel-by-voxel uncertainty measurements. We hypothesize that this is an advantage of our BCNN measuring the uncertainty in the weight space, rather than in the output space like the MCDN.
Table 1 lists a selection of descriptive statistics regarding model performance on the Graphite dataset. Our BCNN achieves a higher segmentation accuracy than the MCDN on the numbered datasets but slightly lower accuracy on the named datasets. The manual labels resulted from thresholding techniques and are known to contain inaccuracies, especially at particle boundaries. As such, we conclude that the accuracy performance of our BCNN is similar to that of the MCDN with respect to these labels, but further assessments against refined labels are left for future work.
5.2 Laser Weld Dataset
Figure 4 shows a successful segmentation and uncertainty measurements on the S33 sample from the Laser Weld dataset. Note that the BCNN uncertainty map captures the uncertainty gradient (corresponding to the gray portion of the CT scan slice) at the top left and bottom left of the segmentation, while the MCDN uncertainty map displays a straight line. The uncertainty map contains low-uncertainty artifacts on the lines where the chunks were separated from the original CT scan volume; however, this is of little consequence because the artifacts are of significantly lower intensity than the measured uncertainty, so they can easily be removed via a thresholding algorithm.
Table 2 lists a selection of descriptive statistics regarding model performance on the Laser Weld dataset. Note that it is slightly more difficult for our BCNN to produce accurate segmentations on the Laser Weld dataset than the Graphite dataset. While the accuracy of our BCNN prediction is usually less than a percentage point away from the MCDN and outperforms it on the S26 sample, our BCNN experiences a clear segmentation failure along the edges of the material in the S4 sample. Figure 5 shows a failure case of our BCNN on the S4 sample. Note however that the most intense areas of the uncertainty map correspond to the incorrect areas of the segmentation, indicating a successful uncertainty measurement. Thus, we use the uncertainty-based domain-shift algorithm proposed by Martinez et al. [Martinez] to refine the segmentation and achieve an accuracy on par with the MCDN, also shown in Figure 5.
Sample  Method  Accuracy  UQ Mean () 

S1  MCDN  
BCNN (ours)  
S4  MCDN  
BCNN (ours)  
S15  MCDN  
BCNN (ours)  
S26  MCDN  
BCNN (ours)  
S31  MCDN  
BCNN (ours)  
S32  MCDN  
BCNN (ours)  
S33  MCDN  
BCNN (ours) 
5.3 Validation
Validation of UQ results is a difficult subject, and there has not been much work on determining whether a model’s UQ is justified given the data. For validating our BCNN, the most relevant work in this area is due to Mukhoti and Gal [Mukhoti]. They define two desiderata for quality uncertainty maps:

1. A high probability of being accurate when the model is certain, denoted $p(\text{accurate} \mid \text{certain})$.
2. A high probability of being uncertain when the model is inaccurate, denoted $p(\text{uncertain} \mid \text{inaccurate})$.
They estimate these quantities by evaluating accuracy and uncertainty using a sliding patch; if the patch accuracy is equal to or above a certain threshold, the entire patch is labeled accurate, and if the patch uncertainty is equal to or above a certain threshold, the entire patch is labeled uncertain. They define a metric called PAvPU (Patch Accuracy vs. Patch Uncertainty), which encodes the above two desiderata in addition to penalizing patches which are simultaneously accurate and uncertain [Mukhoti].
We implement PAvPU to validate our uncertainty results using a patch with accuracy threshold and uncertainty threshold equal to the mean of the uncertainty map. We detail our results in Table 3. In particular, note that our BCNN consistently outperforms the MCDN in both conditional probabilities, even doubling the score. Thus, we conclude that our BCNN is more effective than the MCDN in encoding the relationship between uncertainty and accuracy.
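A minimal sketch of the patch-based computation on a single 2D slice is given below. Each patch is labeled accurate or inaccurate and certain or uncertain, and the four resulting counts give the two conditional probabilities and the PAvPU score, following the definitions of Mukhoti and Gal [Mukhoti]. The non-overlapping patch layout, the use of the patch mean as the patch uncertainty, the patch size, and the accuracy threshold are assumptions for illustration; only the choice of the uncertainty threshold as the mean of the uncertainty map comes from the text above.

```python
import numpy as np

def _safe_ratio(num, den):
    return num / den if den > 0 else float("nan")

def pavpu(correct, uncertainty, patch, acc_thresh, unc_thresh):
    """Return (PAvPU, p(accurate | certain), p(uncertain | inaccurate))."""
    counts = {"ac": 0, "au": 0, "ic": 0, "iu": 0}
    h, w = correct.shape
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            accurate = correct[i:i + patch, j:j + patch].mean() >= acc_thresh
            uncertain = uncertainty[i:i + patch, j:j + patch].mean() >= unc_thresh
            key = ("a" if accurate else "i") + ("u" if uncertain else "c")
            counts[key] += 1
    total = sum(counts.values())
    score = _safe_ratio(counts["ac"] + counts["iu"], total)
    p_acc_certain = _safe_ratio(counts["ac"], counts["ac"] + counts["ic"])
    p_unc_inaccurate = _safe_ratio(counts["iu"], counts["ic"] + counts["iu"])
    return score, p_acc_certain, p_unc_inaccurate

# Toy slice: the top half is segmented correctly, the right half is uncertain.
correct = np.zeros((4, 4), dtype=bool)
correct[:2, :] = True
uncertainty = np.zeros((4, 4))
uncertainty[:, 2:] = 1.0

score, p_ac, p_ui = pavpu(correct, uncertainty, patch=2,
                          acc_thresh=0.5, unc_thresh=uncertainty.mean())
```

In this toy slice each of the four patches falls into a different category, so all three quantities equal 0.5; the accurate-but-uncertain patch lowers the PAvPU score, which is the penalty discussed in the following paragraph.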
As PAvPU was designed for use with 2D semantic segmentations and not for 3D binary segmentations, it may not be sufficient to characterize the improvement in UQ achieved by the BCNN. Furthermore, the PAvPU calculation involves a penalty for patches which are accurate and uncertain, which may not necessarily be a detrimental characteristic of the segmentation [Mukhoti]. This is the term that most significantly affects the PAvPU values where MCDN achieves a better result than our BCNN: our BCNN simply measures more uncertainty than the MCDN. Additionally, introducing this penalty term encodes the goal of training a network which is not simultaneously uncertain and accurate; however, in the Bayesian view, uncertainty and accuracy are not mutually exclusive because uncertainty quantifies the proximity of a sample to the training distribution rather than confidence in a correct segmentation. We leave the development of a more relevant uncertainty metric as future work.
Sample  Method  PAvPU  

Litarion, Slice 324  MCDN  
(Graphite)  BCNN  
GCA2000, Slice 212  MCDN  
(Graphite)  BCNN  
III, Slice 64  MCDN  
(Graphite)  BCNN  
S1, Slice 176  MCDN  
(Laser Weld)  BCNN  
S26, Slice 596  MCDN  
(Laser Weld)  BCNN 
5.4 Advantages for Material Simulations
The objective of performing UQ on materials datasets is to obtain uncertainties which can inform and propagate throughout simulations involving said materials. For example, when simulating the performance of a sample from the Graphite dataset to bound its various physical properties, it is crucial to know the contact points of the material; the uncertainty maps generated by our BCNN represent confidence intervals on the segmentation, so we can infer the probability of a certain contact point occurring in the CT scanned material.
The voxel-by-voxel nature of the uncertainty maps given by the MCDN produces very jagged, unrealistic confidence intervals with little physical meaning. In contrast, the continuity and visual gradients of the uncertainty map generated by our BCNN enable better approximations to the actual geometric uncertainty in both the Graphite and Laser Weld materials. Our BCNN allows us to smoothly probe the uncertainty when performing simulations and justify each error bound we obtain with interpretable uncertainty maps, a major advantage when performing simulations for high-consequence scenarios.
6 Conclusion
In this work, we present a novel 3D Bayesian convolutional neural network (BCNN) for uncertainty quantification of binary segmentations, the first variational inference-based architecture to do so. By measuring uncertainty in the weight space, our BCNN provides interpretable, comprehensive UQ in 3D segmentations and outperforms the state-of-the-art Monte Carlo dropout technique. We present results in the material simulations domain, including segmentation of battery and laser weld CT scans. Our BCNN produces uncertainty maps which capture continuity and visual gradients, outperforms Monte Carlo dropout networks (MCDNs) on recent uncertainty metrics, and achieves equal or better segmentation accuracy than MCDNs in most cases. Future investigation will likely include extending our BCNN to semantic segmentation and medical applications and comparing our results with other UQ techniques such as Lakshminarayanan’s [Lak] deep ensembles.
7 Acknowledgements
We would like to thank Kellin Rumsey for his advice on effectively comparing the uncertainty outputs of the MCDN and our BCNN. We would also like to thank Kyle Karlson for providing the Laser Weld dataset and Chance Norris for curating and segmenting the Graphite dataset.
References