Log In Sign Up

Back to the Roots: Reconstructing Large and Complex Cranial Defects using an Image-based Statistical Shape Model

by   Jianning Li, et al.

Designing implants for large and complex cranial defects is a challenging task, even for professional designers. Current efforts on automating the design process focused mainly on convolutional neural networks (CNN), which have produced state-of-the-art results on reconstructing synthetic defects. However, existing CNN-based methods have been difficult to translate to clinical practice in cranioplasty, as their performance on complex and irregular cranial defects remains unsatisfactory. In this paper, a statistical shape model (SSM) built directly on the segmentation masks of the skulls is presented. We evaluate the SSM on several cranial implant design tasks, and the results show that, while the SSM performs suboptimally on synthetic defects compared to CNN-based approaches, it is capable of reconstructing large and complex defects with only minor manual corrections. The quality of the resulting implants is examined and assured by experienced neurosurgeons. In contrast, CNN-based approaches, even with massive data augmentation, fail or produce less-than-satisfactory implants for these cases. Codes are publicly available at


page 1

page 3

page 5

page 6


Easter2.0: Improving convolutional models for handwritten text recognition

Convolutional Neural Networks (CNN) have shown promising results for the...

Automatic Foot Ulcer segmentation Using an Ensemble of Convolutional Neural Networks

Foot ulcer is a common complication of diabetes mellitus; it is associat...

Synthetic Aperture Radar Image Change Detection via Layer Attention-Based Noise-Tolerant Network

Recently, change detection methods for synthetic aperture radar (SAR) im...

Adversarial Shape Learning for Building Extraction in VHR Remote Sensing Images

Building extraction in VHR RSIs remains to be a challenging task due to ...

Return of the Devil in the Details: Delving Deep into Convolutional Nets

The latest generation of Convolutional Neural Networks (CNN) have achiev...

Shape-Tailored Deep Neural Networks

We present Shape-Tailored Deep Neural Networks (ST-DNN). ST-DNN extend c...

TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization

The dominant CNN-based methods for cross-view image geo-localization rel...

Code Repositories


Back to the Roots: Reconstructing Large and Complex Cranial Defects using an Image-based Statistical Shape Model

view repo

I Introduction

Before deep learning gained wide popularity [egger2021deep], statistical shape model (SSM) and its variants (e.g., active shape models (ASMs) [cootes1992active], active blobs [sclaroff1998active] and active appearance models (AAMs) [cootes2001active]) were broadly adopted in medical reconstruction and segmentation tasks, such as the reconstruction of craniofacial defects [SemperHogg2017VirtualRO, Fuessinger2019VirtualRO, Fuessinger2017PlanningOS, Lamecker2008VariationalAS, pimentel2020automated, ZhangKun2014] and human rib cage [dworzak20103d], and the segmentation of hip joints [kainmueller2009articulated] and organs [lamecker2002statistical]. In contrast to the latent shape features learned by deep neural nets, which are difficult to interpret, SSM offers the option to express a shape in an explicit manner, by linearly combining the mean shape and the principal modes of shape variations of a given shape pool. Surface meshes were common choices for anatomical shape representation in many SSM-based studies. Establishing dense point correspondence among the meshes is deemed the most demanding part in building an SSM, especially when the medical images, from which the meshes are derived, are of high resolution [heimann2009statistical]. Methods that establish points correspondence automatically are typically based on a mesh-to-mesh registration procedure (e.g., Iterative Closest Point [besl1992method]), where the meshes are registered to a reference mesh through a similarity transformation (scaling, rotation and translation). However, popular state of the art segmentation methods and various medical applications are image-based [ronneberger2015u, zhou2018unet++, li2018h, milletari2016v, pepe2020detection]. It is therefore desirable to circumvent the image-to-mesh conversion procedure and build an SSM directly on images [reyneke2018construction, grauman2003inferring, bharath2018radiologic].

Automatic cranial implant design is another typical application that uses images as the initial shape representation [li2021synthetic]. Existing deep learning-based methods usually train a deep neural net on hundreds of skull images with either synthetic defects [morais2019automated, li2021automatic, li2020baseline, li2021autoimplant] or clinical defects [kodym2021deep], depending on the availability of the clinical images. These approaches are data- and computation- intensive, and most importantly, the quality of their reconstructions for large and complex defects, which are common in cranioplasty, remains inadequate for clinical use [ellis2021qualitative, mahdi2021u]. The failure can largely be attributed to domain shift: the synthetic defects in the training set have different distribution to that of the test set. Augmenting the training set intensively is a potentially practical and effective solution to the problem [wodzinski2021improving]. However, current development in data augmentation-enhanced deep learning is still evaluated as substandard by clinical experts [ellis2021qualitative, wodzinski2021improving].

A method that is independent from and insensitive to the defects may optimally avoid the domain shift problem in cranial defect reconstruction. To this end, we propose an SSM-based method for automatic cranial implant design. Unlike previous mesh-based SSM for craniofacial defect reconstruction [Fuessinger2019VirtualRO, Fuessinger2017PlanningOS, pimentel2020automated], our SSM is built directly on the volumetric skull images. We show that dense point correspondence among the training skull images can simply be achieved through an image registration and warping step, and the mean shape and shape variations of the skulls can be calculated thereafter. Besides the expected robustness against large and complex cranial defects, another favorable property of an SSM-based method is that the skull shapes can be expressed mathematically and explicitly, unlike deep learning-based approaches that try to learn an implicit and latent shape representation.

The proposed SSM is evaluated on both synthetic defects and irregularly shaped clinical defects from three cranial implant design tasks. We show that even though deep learning still beats SSM on synthetic defect reconstruction, the performance of SSM is superior and stable when it comes to large and complex clinical defects.

Figure 2: A: the input defective skull. B: the mean skull ( (30)). C: the subtraction between B and A. D: the reference skull . E: the subtraction between D and A.

Ii Method

Ii-a Overview

The main workload of building an SSM lies in finding the mean shape and the primary shape variations of a given shape pool, as specified by Equation 1:


is the weight of variation . is the number of shape variations chosen for reconstructing the new shape . Let be a shape pool containing binary volumetric images , in which the image () is selected as a reference. is the image dimension. The non-zero voxels in these images constitute the geometry of the shapes and are regarded by default as the shape landmarks. Therefore, establishing the point correspondences between all the images in the shape pool can be achieved by simply registering () to the reference image :


is a transformation that warps the images in into the space of : , . Let , be the set of warped shapes. The mean shape of is calculated as:


To extract the shape variations

, principal component analysis (PCA)

[peason1901lines] is used. Let , () be the set of chosen shape variations. Transforming into the PCA space can be achieved via:


. The is given by the PCA function from the scikit-learn package and is computed from the training set . Therefore, we can calculate the variation matrix as follows111Note that we compute explicitly based on Equation 5:


is a pseudo inverse of . Given a test shape , it is first registered to the reference image and then mapped into the PCA space defined by the shape pool :


. We rescale to [0,1] via: . Given , and , the new shape can be computed according to Equation 1. In our work, is chosen to be a similarity transformation. The reconstructed shapes can be warped to their original space via an inverse transformation .

Ii-B Volumetric Shape Completion

Ii-B1 Shape Warping

An intuitive way to complete a defective shape is to warp it to the space of a complete shape , which can be achieved through a registration process (i.e., ). Since the anatomical landmarks of the two shapes are aligned because of registration, a following subtraction operation 222Between two volumes of the same dimension. between the two shapes can yield the missing portion of the defective shape :


The addition of and produces the complete shape corresponding to . By inverting the registration, we can obtain the complete shape corresponding to in its original space:


The concept is similar to that of a template-based shape completion approach [kraevoy2005template], in which the missing part of a defective shape is taken from a complete template shape. The choice of the template shape affects the authenticity of . Optimally, a shape that is general and representative of the shape class should be chosen as the template, to ensure that the registration error between and is small and the missing part is taken from anatomically close regions on the template. The template shape can be from a single image like , or the mean shape of a shape pool, as specified in Equation 3.

Ii-B2 SSM for Volumetric Shape Completion

If the shape pool consists of complete shapes, while the test shape refers to a defective shape, applying Equation 1 - Equation 6 would give the complete counterpart corresponding to . In this sense, SSM can be used for shape completion tasks. In [yu2021pca], the authors used PCA for skull shape completion and showed that, by applying PCA and an inverse PCA consecutively to a defective skull, a complete skull can be obtained. Equation 1 and Equation 6 give the mathematical explanation: the PCA computes the skull shape variations from the training samples and the weights from the warped defective skull , while the inverse PCA, according to the the implementation of inverse_transform from the scikit-learn package, computes:


which is equivalent to Equation 1. Incorporating Equation 6 into Equation 9 we get:


In both [yu2021pca] and Equation 1, the principal components of a defective skull are used as the weights of the shape variations. An obvious shortcoming is that, if the defects are too large, the principal components computed from a defective skull might not reflect the true distribution of the shape variations of a complete skull. For example, given a defective skull whose facial bone far outweighs the cranium due to a large cranial defect, the weight of the variation concerning the facial area () would overwhelm the other variations, resulting in an inappropriate reconstruction of the region of interest (ROI, i.e., the cranium) for the cranial implant design task. To address this problem, we first use a shape completion network (denoted as DL)333The shape completion network refers to in [li2020baseline, li2021learning] to generate the complete counterpart of the defective skull and use the completed skull to calculate :


However, whether using Equation 11 for weight calculation has an positive effect on the results than using Equation 6 depends on the quality of the DL reconstructions.

Iii Experiment and Results

Iii-a Dataset and Metrics

We evaluated our method on three datasets: the 11 clinical cases of defective skulls from Tasks 2 of the AutoImplant II challenge444, the 29 craniotomy skulls from MUG500+ [li2021mug500+], and the 110 test skulls with synthetic defects from Task 3 of AutoImplant II. To conform to the evaluation scheme of the AutoImplant II challenge, we measure the reconstruction accuracy using dice similarity coefficient (DSC), border DSC and 95 percentile Hausdorff distance (HD95).

The complete skulls from the training set of Task 3 were used as the shape pool . The image dimension is 555Except the last two cases from Task 2. ( differs for different images). As calculating (Equation 4 and 5) from high resolution images is a computationally expensive process, we only used (out of 100) complete skulls for experiments involving . The reference skull is chosen to be case 001.nrrd in the Task 3 training set and . All the training and test samples are registered to 001.nrrd through a similarity transformation .

Iii-B Reconstruction of Synthetic Cranial Defects

The 110 test skulls from Task 3 contain synthetic defects. In this experiment, we evaluate different methods of creating a skull template for shape warping: averaging 30 complete skulls ( (30)), averaging 50 complete skulls ( (50)) and using only a single skull (). The 30 skulls are a subset of the 50 skulls. We also evaluate how the weights of shape variations affect the reconstruction accuracy of SSM (Equation 1): computed via Equation 6 (SSM (30)) and computed via Equation 11 (SSM (30) + DL). The CNN-based shape completion network - DL is taken from [li2021learning, li2020baseline] 666[li2021learning, li2020baseline] did not report bDSC and HD95. We calculate the two metrics in this paper based on the prediction files (implants) from [li2021learning, li2020baseline].. Besides, shape reconstruction using only the shape variations is also evaluated: () and . For the former, the is set to one. For the latter, is calculated regularly based on Equation 6. The DSC, bDSC and HD95 for these shape completion methods are reported in Table  I and Figure  3.

Methods Scores DSC bDSC HD95 (mm)
(30) 0.7840 0.8265 3.1989
(50) 0.7853 0.8287 3.2447
0.7854 0.8285 3.1700
SSM (30) 0.7832 0.8255 3.2157
SSM (30) + DL 0.7830 0.8253 3.2170
DL [li2021learning, li2020baseline] 0.8058 0.7638 13.2891
() 0.7054 0.7403 3.6783
0.7064 0.7411 3.6601
L. Yu. et. al. [yu2021pca] 0.7728 0.7716 3.6803
D. G. Ellis, [ellis2020deep] 0.9440 - -
M. Wodzinski et. al. [wodzinski2021improving] 0.9329 0.9530 1.4781
Table I: Quantitative results (mean DSC, bDSC and HD95) on the 110 test cases of Task 3.
Figure 3: Boxplots of DSC, bDSC and HD95. 0: SSM (30), 1: SSM (30) + DL, 2: (), 3: , 4: (50), 5: (30), 6: , 7: DL, 8: [yu2021pca].
Figure 4: ( axis) computed based on Equation 6 and Equation 11 for different shape variations ( axis refers to the index of variation). The averaged weights of the 110 test cases: is used for this plot.

The results bear important implications: (1) Using a single skull image as the template for shape warping can produce reasonable results, qualitatively and quantitatively (third row, Table  I, and Figure  2 E). However, it should be noted that the conclusion applies only to the cranial implant design task, where the ROI to reconstruct is the structurally uncomplicated cranium. The single image-based template would likely to fail for reconstructing individual-specific facial bones, which contain complex and subtle structures. (2) Shape template derived from a single shape () produces comparable results to that from a mean shape averaged from 30 ( (30)) or 50 ( (50)) images. Figure  2 gives an example of the results obtained using shape warping. We can see that (Figure  2 B) shows no noticeable difference on the cranium compared to (Figure  2 D). As a result, subtracting the input from the templates (Equation 7) produces similar implants. The main difference lies in the facial area and the interior subtle structures (Figure  2 C and E). Applicable to both statements (1) and (2), the reconstruction accuracy depends largely on how well the target matches with the template on the ROI (e.g., cranium or facial bones) during the warping and registration process. It is relatively easier to register among different craniums than different facial bones. For a facial reconstruction task (e.g., facial implant design), a mean shape, as illustrated in Figure  2 (B), possesses significantly more facial landmarks than a single image, which might potentially make the facial registration more accurate. (3) The weights of the shape variations affect the accuracy of cranium reconstruction. An analysis of the shape variation matrix reveals that, around 53% of the variations are related to the full skull and the remaining are associated with the facial area only, which do not contribute to the cranium reconstruction. The weight distribution (calculated based on Equation 6) of is shown in Figure  4. We can see that the largest three weights in correspond to the variations: (facial bones), (full skull) and (full skull). The results presented in the () and rows of Table  I only show marginal differences, meaning that the two full skull-related variations ( and ) already carries the information necessary for reconstructing a complete skull. Figure  4 also shows the weight distribution calculated based on Equation 11. The variations corresponding to the three largest weights are (facial bones), (full skull) and (full skull). Compared to ’SSM (30)’, the deviation in weight distribution for ’SSM (30) + DL’ leads to marginal decline in the quantitative scores, meaning that a better network or a training method should be chosen in order that the DL can benefit ’SSM (30)’. (4) In comparison to the deep learning-based approaches [ellis2020deep, wodzinski2021improving], the shape warping- and SSM-based methods achieve inferior results on synthetic defects quantitatively. However, it should be noted that both [ellis2020deep] and [wodzinski2021improving] used an intensively augmented dataset during training, while only 30 images were used to build the SSM.

Figure 5: Manually extract the implants from the subtraction results using 3D Slicer.
Figure 6: Exemplary results (from (50)) on the craniotomy skulls from the MUG500+ dataset. The 29 generated implants can be downloaded at


Figure 7: Qualitative comparison between different implant design methods on Task 2@AutoImplant II. (A) [wodzinski2021improving] (B) [yu2021pca] (C) [mahdi2021u] (D) Ours. The 11 generated implants can be downloaded at


Iii-C Mug500+

This section presents the implant generation results on the craniotomy skulls from the MUG500+ dataset [li2021mug500+]. Figure  5 shows how to manually post-process an implant (Figure  5 (A)) generated by subtracting the input defective skull from the skull reconstructed by SSM (30). First, a median smoothing filter is applied to the subtraction result to (partly) disconnect the implant from the noise (Figure  5 B). The smoothing kernel size should be chosen individually. Second, the scissors functionality is used to erase the delineated area (Figure  5 C) to fully remove the noise and extract the implant. Figure  5 (D) shows the post-processing result. Step (B) and (C) is done manually using 3D Slicer ( [egger2013gbm]. Alternatively, the implant can be extracted automatically by applying morphological opening and connected component analysis consecutively to the subtraction result. However, it is recommended to follow the manual post-processing workflow in Figure  5 for an optimal outcome 777For example, in Figure  5 (C), the implant is still connected to the facial bones after smoothing. Where to erase in the connecting region should optimally be determined according to the defect shape.. Figure  6 presents the automatically generated implants for some large and complex defects in the MUG500+ dataset. The implants are generated by (50) and manually post-processed. We can see that some of the defects are large enough to cover half of the cranium or have rather irregular shapes. Nonetheless, the defects are still satisfactorily reconstructed. Notably, Figure  1 (the teaser image) shows that the method is still effective when multiple large defects exit on the craniotomy skull. The completed skulls obtained according to Equation  8 preserve the anatomical aesthetics of normal human skulls.

The MUG500+ dataset contains the manually designed implants (i.e., surface models in .stl format) for the 29 craniotomy skulls. The manual designs are converted to images (.nrrd) for a quantitative comparison with our automatic designs. The results are given in Table  II. Keep in mind that, while interpreting the scores, the quantitative results are only a reflection of how well our automatic reconstructions match with the manual designs, which is subjective and experience-dependent888For example, for large defects, different designers might decide differently for the curvature of the implants based on their own aesthetic views.. Experts’ manual and qualitative evaluation of the implants has closer correlation with the true quality of the reconstructions [ellis2021qualitative].

Methods Scores DSC bDSC HD95
(50) 0.5471 0.5761 5.0000

Table II: Quantitative results (produced by (50)) for the MUG500+ craniotomy dataset.

Iii-D Task2

To show the utility of our methods on real clinical cases, the implant designs were created for the 11 clinical defective skulls from the Task 2 of the AutoImplant II challenge. As described in Ellis et al. [ellis2021qualitative], the implant designs were quantitatively compared to reconstructions from postoperative imaging of the actual implant the patients’ received. Table  III shows these quantitative results from (50) and SSM (30), as well as from the AutoImplant II submissions. The (50) and SSM (30) had the best Hausdorff 95 scores than all other submissions but scored worse than some other submissions in the dice similarity and boundary dice similarity scores.

Methods Scores DSC bDSC HD95
(50) 0.5007 0.4449 8.2539
SSM (30) 0.5055 0.4470 7.9042
M. Wodzinski. et al. [wodzinski2021improving] 0.5241 0.4823 54.5165
L. Yu. et al. [yu2021pca] 0.5118 0.4547 8.3486
H. Mahdi. et al. [mahdi2021u] 0.3028 0.3092 71.4193

Table III: Quantitative results for Task 2 of the AutoImplant II challenge.

However, comparing the implant designs to the reconstructions of the implants from the postoperative CT imaging do not necessarily serve as a reliable metric for the quality of the implant designs [ellis2021qualitative]. For this reason, the implant designs were also qualitatively evaluated by experienced neurosurgeon, MRA. The implant designs were judged based on completeness, false positive area, fit, and overall feasibility as described in Ellis et al. [ellis2021qualitative]. As shown in Table  IV, the (50) implant designs had better overall feasibility, better fit, and less false positive area than the submissions from the Autoimplant II challenge. While none of the submissions from the Autoimplant II challenge were deemed feasible without modifications, 4 out of 11 of the (50) designs were deemed feasible with only minor flaws. Therefore, the (50) designs represent a substantial improvement in the clinical feasibility of implant designs. The main issues plaguing the (50) implant designs were that they did not always extend far enough in the superior direction to fully restore the natural skull shape and that the implants were often too thick.

Methods Scores Comp FPA Fit Feasibility
(50) 0.89 0.73 0.64 0.62
M. Wodzinski. et al. [wodzinski2021improving] 0.93 0.57 0.55 0.42
L. Yu. et al. [yu2021pca] 0.80 0.59 0.36 0.42
H. Mahdi. et al. [mahdi2021u] 0.76 0.43 0.45 0.33

Table IV: Qualitative evaluation scores for Task 2 of the AutoImplant II challenge by neurosurgeon MRA. The scores have been normalized such that 0 is the lowest possible score and 1 is the highest possible score. Completeness (Comp) evaluates the amount of the defect that the implant design covers. False positive area (FPA) evaluates the amount of amount of implant design outside of the defect area. Fit evaluates the shape of the implant design relative to the defect. Feasibility evaluates whether the implant design could be used in surgery. See Ellis et al. for the qualitative analysis methods [ellis2021qualitative].

Iv Discussion

In automatic cranial implant design, deep learning-based approaches that rely on a defect-complete or defect-implant pair for training often fail to generalize to large and complex cranial defects in the test set, since the synthetic defects used during training have different distributions to the real defects during evaluation (i.e., domain shift). One popular solution to this problem is resorting to intensive data augmentation: augment the defects [li2021automatic, kodymskullbreak] and/or augment the skull images [wodzinski2021improving, ellis2020deep]. The former tries to create realistic synthetic defects for training. Data augmentation has shown to be effective to the generalization problem for deep learning in automatic cranial implant design. However, the computational cost is substantially increased. Besides, the patient-specific cranial defects can depict considerable variations among individuals, making it impractical to cover all defect patterns through augmentation. The SSM-based approach can circumvent the defect-related generalization problem, as an SSM relies only on the complete skulls available in the training set for training. Therefore, an SSM is insensitive to and unaffected by the changes in defect patterns in the test cases. One factor affecting the performance of an SSM is the registration accuracy. Since human craniums are structurally uncomplicated and topologically stable among individuals compared to the defects and facial bones, precise registration among different craniums is highly achievable. Therefore, an SSM or simply a shape warping-based approach, though methodologically simple, is ideal for the cranial defect reconstruction task.

Furthermore, due to the black-box nature of deep learning, the shape features learned by a deep neural network are often entangled and lacking interpretability. In contrast, the basic components of an SSM - the mean shape and shape variations, are explicitly expressed and therefore highly comprehensible to humans. In clinical settings, interpretability and transparency adds to the trust in the methods.

V Conclusion

As an alternative to mesh-based SSM, we demonstrate in this paper that an SSM can be built directly on (volumetric) skull images. The image-based SSM of the skull is evaluated on three cranial defect reconstruction tasks, and the results reveal that even if state of the art deep learning-based approaches still prevail in reconstructing synthetic defects, our SSM-based method shows advantages in the reconstruction of large and complex defects that are common in cranioplasty. Besides, the SSM-based methods are not dependent on large quantities of training data as deep learning, making the proposed approach highly scalable and applicable in a clinical setting.