ssm
Back to the Roots: Reconstructing Large and Complex Cranial Defects using an Imagebased Statistical Shape Model
view repo
Designing implants for large and complex cranial defects is a challenging task, even for professional designers. Current efforts on automating the design process focused mainly on convolutional neural networks (CNN), which have produced stateoftheart results on reconstructing synthetic defects. However, existing CNNbased methods have been difficult to translate to clinical practice in cranioplasty, as their performance on complex and irregular cranial defects remains unsatisfactory. In this paper, a statistical shape model (SSM) built directly on the segmentation masks of the skulls is presented. We evaluate the SSM on several cranial implant design tasks, and the results show that, while the SSM performs suboptimally on synthetic defects compared to CNNbased approaches, it is capable of reconstructing large and complex defects with only minor manual corrections. The quality of the resulting implants is examined and assured by experienced neurosurgeons. In contrast, CNNbased approaches, even with massive data augmentation, fail or produce lessthansatisfactory implants for these cases. Codes are publicly available at https://github.com/Jianningli/ssm
READ FULL TEXT VIEW PDFBack to the Roots: Reconstructing Large and Complex Cranial Defects using an Imagebased Statistical Shape Model
Before deep learning gained wide popularity [egger2021deep], statistical shape model (SSM) and its variants (e.g., active shape models (ASMs) [cootes1992active], active blobs [sclaroff1998active] and active appearance models (AAMs) [cootes2001active]) were broadly adopted in medical reconstruction and segmentation tasks, such as the reconstruction of craniofacial defects [SemperHogg2017VirtualRO, Fuessinger2019VirtualRO, Fuessinger2017PlanningOS, Lamecker2008VariationalAS, pimentel2020automated, ZhangKun2014] and human rib cage [dworzak20103d], and the segmentation of hip joints [kainmueller2009articulated] and organs [lamecker2002statistical]. In contrast to the latent shape features learned by deep neural nets, which are difficult to interpret, SSM offers the option to express a shape in an explicit manner, by linearly combining the mean shape and the principal modes of shape variations of a given shape pool. Surface meshes were common choices for anatomical shape representation in many SSMbased studies. Establishing dense point correspondence among the meshes is deemed the most demanding part in building an SSM, especially when the medical images, from which the meshes are derived, are of high resolution [heimann2009statistical]. Methods that establish points correspondence automatically are typically based on a meshtomesh registration procedure (e.g., Iterative Closest Point [besl1992method]), where the meshes are registered to a reference mesh through a similarity transformation (scaling, rotation and translation). However, popular state of the art segmentation methods and various medical applications are imagebased [ronneberger2015u, zhou2018unet++, li2018h, milletari2016v, pepe2020detection]. It is therefore desirable to circumvent the imagetomesh conversion procedure and build an SSM directly on images [reyneke2018construction, grauman2003inferring, bharath2018radiologic].
Automatic cranial implant design is another typical application that uses images as the initial shape representation [li2021synthetic]. Existing deep learningbased methods usually train a deep neural net on hundreds of skull images with either synthetic defects [morais2019automated, li2021automatic, li2020baseline, li2021autoimplant] or clinical defects [kodym2021deep], depending on the availability of the clinical images. These approaches are data and computation intensive, and most importantly, the quality of their reconstructions for large and complex defects, which are common in cranioplasty, remains inadequate for clinical use [ellis2021qualitative, mahdi2021u]. The failure can largely be attributed to domain shift: the synthetic defects in the training set have different distribution to that of the test set. Augmenting the training set intensively is a potentially practical and effective solution to the problem [wodzinski2021improving]. However, current development in data augmentationenhanced deep learning is still evaluated as substandard by clinical experts [ellis2021qualitative, wodzinski2021improving].
A method that is independent from and insensitive to the defects may optimally avoid the domain shift problem in cranial defect reconstruction. To this end, we propose an SSMbased method for automatic cranial implant design. Unlike previous meshbased SSM for craniofacial defect reconstruction [Fuessinger2019VirtualRO, Fuessinger2017PlanningOS, pimentel2020automated], our SSM is built directly on the volumetric skull images. We show that dense point correspondence among the training skull images can simply be achieved through an image registration and warping step, and the mean shape and shape variations of the skulls can be calculated thereafter. Besides the expected robustness against large and complex cranial defects, another favorable property of an SSMbased method is that the skull shapes can be expressed mathematically and explicitly, unlike deep learningbased approaches that try to learn an implicit and latent shape representation.
The proposed SSM is evaluated on both synthetic defects and irregularly shaped clinical defects from three cranial implant design tasks. We show that even though deep learning still beats SSM on synthetic defect reconstruction, the performance of SSM is superior and stable when it comes to large and complex clinical defects.
The main workload of building an SSM lies in finding the mean shape and the primary shape variations of a given shape pool, as specified by Equation 1:
(1) 
is the weight of variation . is the number of shape variations chosen for reconstructing the new shape . Let be a shape pool containing binary volumetric images , in which the image () is selected as a reference. is the image dimension. The nonzero voxels in these images constitute the geometry of the shapes and are regarded by default as the shape landmarks. Therefore, establishing the point correspondences between all the images in the shape pool can be achieved by simply registering () to the reference image :
(2) 
is a transformation that warps the images in into the space of : , . Let , be the set of warped shapes. The mean shape of is calculated as:
(3) 
To extract the shape variations
, principal component analysis (PCA)
[peason1901lines] is used. Let , () be the set of chosen shape variations. Transforming into the PCA space can be achieved via:(4) 
. The is given by the PCA function from the scikitlearn package and is computed from the training set . Therefore, we can calculate the variation matrix as follows^{1}^{1}1Note that we compute explicitly based on Equation 5:
(5) 
is a pseudo inverse of . Given a test shape , it is first registered to the reference image and then mapped into the PCA space defined by the shape pool :
(6) 
. We rescale to [0,1] via: . Given , and , the new shape can be computed according to Equation 1. In our work, is chosen to be a similarity transformation. The reconstructed shapes can be warped to their original space via an inverse transformation .
An intuitive way to complete a defective shape is to warp it to the space of a complete shape , which can be achieved through a registration process (i.e., ). Since the anatomical landmarks of the two shapes are aligned because of registration, a following subtraction operation ^{2}^{2}2Between two volumes of the same dimension. between the two shapes can yield the missing portion of the defective shape :
(7) 
The addition of and produces the complete shape corresponding to . By inverting the registration, we can obtain the complete shape corresponding to in its original space:
(8) 
The concept is similar to that of a templatebased shape completion approach [kraevoy2005template], in which the missing part of a defective shape is taken from a complete template shape. The choice of the template shape affects the authenticity of . Optimally, a shape that is general and representative of the shape class should be chosen as the template, to ensure that the registration error between and is small and the missing part is taken from anatomically close regions on the template. The template shape can be from a single image like , or the mean shape of a shape pool, as specified in Equation 3.
If the shape pool consists of complete shapes, while the test shape refers to a defective shape, applying Equation 1  Equation 6 would give the complete counterpart corresponding to . In this sense, SSM can be used for shape completion tasks. In [yu2021pca], the authors used PCA for skull shape completion and showed that, by applying PCA and an inverse PCA consecutively to a defective skull, a complete skull can be obtained. Equation 1 and Equation 6 give the mathematical explanation: the PCA computes the skull shape variations from the training samples and the weights from the warped defective skull , while the inverse PCA, according to the the implementation of inverse_transform from the scikitlearn package, computes:
(9) 
which is equivalent to Equation 1. Incorporating Equation 6 into Equation 9 we get:
(10) 
In both [yu2021pca] and Equation 1, the principal components of a defective skull are used as the weights of the shape variations. An obvious shortcoming is that, if the defects are too large, the principal components computed from a defective skull might not reflect the true distribution of the shape variations of a complete skull. For example, given a defective skull whose facial bone far outweighs the cranium due to a large cranial defect, the weight of the variation concerning the facial area () would overwhelm the other variations, resulting in an inappropriate reconstruction of the region of interest (ROI, i.e., the cranium) for the cranial implant design task. To address this problem, we first use a shape completion network (denoted as DL)^{3}^{3}3The shape completion network refers to in [li2020baseline, li2021learning] to generate the complete counterpart of the defective skull and use the completed skull to calculate :
(11) 
We evaluated our method on three datasets: the 11 clinical cases of defective skulls from Tasks 2 of the AutoImplant II challenge^{4}^{4}4https://autoimplant2021.grandchallenge.org/, the 29 craniotomy skulls from MUG500+ [li2021mug500+], and the 110 test skulls with synthetic defects from Task 3 of AutoImplant II. To conform to the evaluation scheme of the AutoImplant II challenge, we measure the reconstruction accuracy using dice similarity coefficient (DSC), border DSC and 95 percentile Hausdorff distance (HD95).
The complete skulls from the training set of Task 3 were used as the shape pool . The image dimension is ^{5}^{5}5Except the last two cases from Task 2. ( differs for different images). As calculating (Equation 4 and 5) from high resolution images is a computationally expensive process, we only used (out of 100) complete skulls for experiments involving . The reference skull is chosen to be case 001.nrrd in the Task 3 training set and . All the training and test samples are registered to 001.nrrd through a similarity transformation .
The 110 test skulls from Task 3 contain synthetic defects. In this experiment, we evaluate different methods of creating a skull template for shape warping: averaging 30 complete skulls ( (30)), averaging 50 complete skulls ( (50)) and using only a single skull (). The 30 skulls are a subset of the 50 skulls. We also evaluate how the weights of shape variations affect the reconstruction accuracy of SSM (Equation 1): computed via Equation 6 (SSM (30)) and computed via Equation 11 (SSM (30) + DL). The CNNbased shape completion network  DL is taken from [li2021learning, li2020baseline] ^{6}^{6}6[li2021learning, li2020baseline] did not report bDSC and HD95. We calculate the two metrics in this paper based on the prediction files (implants) from [li2021learning, li2020baseline].. Besides, shape reconstruction using only the shape variations is also evaluated: () and . For the former, the is set to one. For the latter, is calculated regularly based on Equation 6. The DSC, bDSC and HD95 for these shape completion methods are reported in Table I and Figure 3.
Methods Scores  DSC  bDSC  HD95 (mm) 
(30)  0.7840  0.8265  3.1989 
(50)  0.7853  0.8287  3.2447 
0.7854  0.8285  3.1700  
SSM (30)  0.7832  0.8255  3.2157 
SSM (30) + DL  0.7830  0.8253  3.2170 
DL [li2021learning, li2020baseline]  0.8058  0.7638  13.2891 
()  0.7054  0.7403  3.6783 
0.7064  0.7411  3.6601  
L. Yu. et. al. [yu2021pca]  0.7728  0.7716  3.6803 
D. G. Ellis, et.al. [ellis2020deep]  0.9440     
M. Wodzinski et. al. [wodzinski2021improving]  0.9329  0.9530  1.4781 
The results bear important implications: (1) Using a single skull image as the template for shape warping can produce reasonable results, qualitatively and quantitatively (third row, Table I, and Figure 2 E). However, it should be noted that the conclusion applies only to the cranial implant design task, where the ROI to reconstruct is the structurally uncomplicated cranium. The single imagebased template would likely to fail for reconstructing individualspecific facial bones, which contain complex and subtle structures. (2) Shape template derived from a single shape () produces comparable results to that from a mean shape averaged from 30 ( (30)) or 50 ( (50)) images. Figure 2 gives an example of the results obtained using shape warping. We can see that (Figure 2 B) shows no noticeable difference on the cranium compared to (Figure 2 D). As a result, subtracting the input from the templates (Equation 7) produces similar implants. The main difference lies in the facial area and the interior subtle structures (Figure 2 C and E). Applicable to both statements (1) and (2), the reconstruction accuracy depends largely on how well the target matches with the template on the ROI (e.g., cranium or facial bones) during the warping and registration process. It is relatively easier to register among different craniums than different facial bones. For a facial reconstruction task (e.g., facial implant design), a mean shape, as illustrated in Figure 2 (B), possesses significantly more facial landmarks than a single image, which might potentially make the facial registration more accurate. (3) The weights of the shape variations affect the accuracy of cranium reconstruction. An analysis of the shape variation matrix reveals that, around 53% of the variations are related to the full skull and the remaining are associated with the facial area only, which do not contribute to the cranium reconstruction. The weight distribution (calculated based on Equation 6) of is shown in Figure 4. We can see that the largest three weights in correspond to the variations: (facial bones), (full skull) and (full skull). The results presented in the () and rows of Table I only show marginal differences, meaning that the two full skullrelated variations ( and ) already carries the information necessary for reconstructing a complete skull. Figure 4 also shows the weight distribution calculated based on Equation 11. The variations corresponding to the three largest weights are (facial bones), (full skull) and (full skull). Compared to ’SSM (30)’, the deviation in weight distribution for ’SSM (30) + DL’ leads to marginal decline in the quantitative scores, meaning that a better network or a training method should be chosen in order that the DL can benefit ’SSM (30)’. (4) In comparison to the deep learningbased approaches [ellis2020deep, wodzinski2021improving], the shape warping and SSMbased methods achieve inferior results on synthetic defects quantitatively. However, it should be noted that both [ellis2020deep] and [wodzinski2021improving] used an intensively augmented dataset during training, while only 30 images were used to build the SSM.
This section presents the implant generation results on the craniotomy skulls from the MUG500+ dataset [li2021mug500+]. Figure 5 shows how to manually postprocess an implant (Figure 5 (A)) generated by subtracting the input defective skull from the skull reconstructed by SSM (30). First, a median smoothing filter is applied to the subtraction result to (partly) disconnect the implant from the noise (Figure 5 B). The smoothing kernel size should be chosen individually. Second, the scissors functionality is used to erase the delineated area (Figure 5 C) to fully remove the noise and extract the implant. Figure 5 (D) shows the postprocessing result. Step (B) and (C) is done manually using 3D Slicer (https://www.slicer.org/) [egger2013gbm]. Alternatively, the implant can be extracted automatically by applying morphological opening and connected component analysis consecutively to the subtraction result. However, it is recommended to follow the manual postprocessing workflow in Figure 5 for an optimal outcome ^{7}^{7}7For example, in Figure 5 (C), the implant is still connected to the facial bones after smoothing. Where to erase in the connecting region should optimally be determined according to the defect shape.. Figure 6 presents the automatically generated implants for some large and complex defects in the MUG500+ dataset. The implants are generated by (50) and manually postprocessed. We can see that some of the defects are large enough to cover half of the cranium or have rather irregular shapes. Nonetheless, the defects are still satisfactorily reconstructed. Notably, Figure 1 (the teaser image) shows that the method is still effective when multiple large defects exit on the craniotomy skull. The completed skulls obtained according to Equation 8 preserve the anatomical aesthetics of normal human skulls.
The MUG500+ dataset contains the manually designed implants (i.e., surface models in .stl format) for the 29 craniotomy skulls. The manual designs are converted to images (.nrrd) for a quantitative comparison with our automatic designs. The results are given in Table II. Keep in mind that, while interpreting the scores, the quantitative results are only a reflection of how well our automatic reconstructions match with the manual designs, which is subjective and experiencedependent^{8}^{8}8For example, for large defects, different designers might decide differently for the curvature of the implants based on their own aesthetic views.. Experts’ manual and qualitative evaluation of the implants has closer correlation with the true quality of the reconstructions [ellis2021qualitative].
Methods Scores  DSC  bDSC  HD95 

(50)  0.5471  0.5761  5.0000 

To show the utility of our methods on real clinical cases, the implant designs were created for the 11 clinical defective skulls from the Task 2 of the AutoImplant II challenge. As described in Ellis et al. [ellis2021qualitative], the implant designs were quantitatively compared to reconstructions from postoperative imaging of the actual implant the patients’ received. Table III shows these quantitative results from (50) and SSM (30), as well as from the AutoImplant II submissions. The (50) and SSM (30) had the best Hausdorff 95 scores than all other submissions but scored worse than some other submissions in the dice similarity and boundary dice similarity scores.
Methods Scores  DSC  bDSC  HD95 

(50)  0.5007  0.4449  8.2539 
SSM (30)  0.5055  0.4470  7.9042 
M. Wodzinski. et al. [wodzinski2021improving]  0.5241  0.4823  54.5165 
L. Yu. et al. [yu2021pca]  0.5118  0.4547  8.3486 
H. Mahdi. et al. [mahdi2021u]  0.3028  0.3092  71.4193 

However, comparing the implant designs to the reconstructions of the implants from the postoperative CT imaging do not necessarily serve as a reliable metric for the quality of the implant designs [ellis2021qualitative]. For this reason, the implant designs were also qualitatively evaluated by experienced neurosurgeon, MRA. The implant designs were judged based on completeness, false positive area, fit, and overall feasibility as described in Ellis et al. [ellis2021qualitative]. As shown in Table IV, the (50) implant designs had better overall feasibility, better fit, and less false positive area than the submissions from the Autoimplant II challenge. While none of the submissions from the Autoimplant II challenge were deemed feasible without modifications, 4 out of 11 of the (50) designs were deemed feasible with only minor flaws. Therefore, the (50) designs represent a substantial improvement in the clinical feasibility of implant designs. The main issues plaguing the (50) implant designs were that they did not always extend far enough in the superior direction to fully restore the natural skull shape and that the implants were often too thick.
Methods Scores  Comp  FPA  Fit  Feasibility 

(50)  0.89  0.73  0.64  0.62 
M. Wodzinski. et al. [wodzinski2021improving]  0.93  0.57  0.55  0.42 
L. Yu. et al. [yu2021pca]  0.80  0.59  0.36  0.42 
H. Mahdi. et al. [mahdi2021u]  0.76  0.43  0.45  0.33 

In automatic cranial implant design, deep learningbased approaches that rely on a defectcomplete or defectimplant pair for training often fail to generalize to large and complex cranial defects in the test set, since the synthetic defects used during training have different distributions to the real defects during evaluation (i.e., domain shift). One popular solution to this problem is resorting to intensive data augmentation: augment the defects [li2021automatic, kodymskullbreak] and/or augment the skull images [wodzinski2021improving, ellis2020deep]. The former tries to create realistic synthetic defects for training. Data augmentation has shown to be effective to the generalization problem for deep learning in automatic cranial implant design. However, the computational cost is substantially increased. Besides, the patientspecific cranial defects can depict considerable variations among individuals, making it impractical to cover all defect patterns through augmentation. The SSMbased approach can circumvent the defectrelated generalization problem, as an SSM relies only on the complete skulls available in the training set for training. Therefore, an SSM is insensitive to and unaffected by the changes in defect patterns in the test cases. One factor affecting the performance of an SSM is the registration accuracy. Since human craniums are structurally uncomplicated and topologically stable among individuals compared to the defects and facial bones, precise registration among different craniums is highly achievable. Therefore, an SSM or simply a shape warpingbased approach, though methodologically simple, is ideal for the cranial defect reconstruction task.
Furthermore, due to the blackbox nature of deep learning, the shape features learned by a deep neural network are often entangled and lacking interpretability. In contrast, the basic components of an SSM  the mean shape and shape variations, are explicitly expressed and therefore highly comprehensible to humans. In clinical settings, interpretability and transparency adds to the trust in the methods.
As an alternative to meshbased SSM, we demonstrate in this paper that an SSM can be built directly on (volumetric) skull images. The imagebased SSM of the skull is evaluated on three cranial defect reconstruction tasks, and the results reveal that even if state of the art deep learningbased approaches still prevail in reconstructing synthetic defects, our SSMbased method shows advantages in the reconstruction of large and complex defects that are common in cranioplasty. Besides, the SSMbased methods are not dependent on large quantities of training data as deep learning, making the proposed approach highly scalable and applicable in a clinical setting.