1 Introduction
Statistical shape modeling (SSM) is an indispensable tool for the analysis of anatomy and biological structures. Such models can be viewed as a composite of two distinct steps: shape representation and shape analysis. Shape representation is a quantifiable description of the shape/structure of a sample from a population of anatomies (usually given as a cohort of images or surface meshes) that is consistent with the population statistics and is easy to use for subsequent analysis. There are two prominent families of algorithms for shape representation: (i) landmarks, which express shapes as point clouds that define an explicit correspondence map from one shape to another using invariant points across populations that vary in their form, and (ii) deformation fields, which rely on transformations between images to encode implicit shape information. Shape analysis then uses these shape representations to analyze the population's statistics; in most cases, the representation is projected onto a low-dimensional space via principal component analysis (PCA). This low-dimensional representation is used as a
shape descriptor for subsequent shape analysis. Beyond analyzing the different modes of shape variation captured by this descriptor, it can also be utilized in a range of downstream applications. For instance, the shape descriptor can serve as a feature vector for classifying different morphological classes [27], can quantify the severity of a particular deformity [5], or can be employed to interpret and discover shape characteristics associated with a particular disease [12]. We consider such downstream applications, which depend on how well the shape descriptors characterize the given shape, to showcase the efficacy of the shape descriptor.

Due to their simplicity and computational efficiency, correspondence-based models are the most prominently used models for shape representation. Correspondences are landmarks on the anatomy that are geometrically consistent across the samples of the population. In the earliest works [43], correspondence was achieved by hand-picked landmarks corresponding to distinguishable features. The field has come a long way, with many state-of-the-art correspondence discovery algorithms [41, 14]. However, many of these algorithms require segmentation of the anatomy from images as well as heavy preprocessing. Such segmentation and/or preprocessing often comes with a significant computational overhead as well as a cost in human resources. Segmentation of some anatomies is prone to subjective decisions and hence requires domain expertise. These problems prevent automated correspondence discovery models from being fully end-to-end, i.e., an automated pipeline that, at inference, takes only images as input and produces shape descriptors for analysis.
In recent years, deep learning and neural network models have had a significant impact on both image registration and shape analysis. With their ability to learn complex functions, several methods [6, 36] have proposed learning correspondences from images, bypassing the need for segmentation and preprocessing. However, these methods are supervised and data-hungry: they require considerable training data with correspondences, which is not always available in clinical applications. They also need anatomy segmentation and preprocessing for the training set, which might not be readily available. Deep networks have also played an essential role in developing computationally fast, unsupervised learning-based algorithms for image registration (e.g.,
[3]) that perform equivalently to state-of-the-art, optimization-based registration methods. However, transformations are not as friendly as correspondences for shape analysis; they often require the development of a fixed atlas [30]. Systems that process image-to-image transformations express shape information in a high-dimensional space. For shape analysis, a low-dimensional space is typically preferred; therefore, these representations are projected onto a low-dimensional space via PCA (or some equivalent for nonlinear spaces), and the resulting modes of shape variation need to be vetted by domain experts for usability in downstream applications.

To address the above-stated challenges, we propose an end-to-end system for extracting a shape descriptor from only a population of input images. Ideally, this shape descriptor would not require any post-processing for subsequent analysis. This paper proposes a self-supervised deep learning approach for landmark discovery that uses image registration as the primary task. The proposed method alleviates the need for segmentation and heavy preprocessing (even during model training) to obtain a landmark-based shape descriptor. The discovered landmarks are relatively few in number; hence, they can be used directly for shape analysis, bypassing the post-processing required to convert the representation into a low-dimensional space. The work presented here is an extension of the preliminary work presented in [8]. This work significantly extends and improves on the previous paper in the following ways:

Additional experiments, results, and analysis on several different datasets with associated downstream applications for shape descriptors.

We propose two different model variants that can incorporate prior information about shape into the model during training and implicitly encourage the landmarks to encode such information.

We propose an additional image matching loss function that preserves local structure and allows for cross-modality registration or the use of datasets with significant intensity variation.
2 Related Work
Since the groundbreaking work of D'Arcy Thompson [43], who utilized manually placed landmarks to study variations in the shapes of fishes, statistical shape modeling (SSM) has become an indispensable tool for medical researchers and biologists. SSM finds applications in various fields such as cardiology [21], neurology [22], growth modeling [17], orthopaedics [26], and instrument design [23]. Shape representation for SSM can be achieved via explicit representation of points on surfaces [18, 40]; direct usage of surface meshes or distance transforms [35] or their features [11]; or implicitly via functional maps [37] or deformation fields [4].
Correspondence-based models, or point distribution models (PDMs) [24], place a dense set of particles onto the shapes' surfaces. Automatic PDM algorithms rely on nonlinear optimization that reduces the complexity of the generative model [14, 18]. In most cases, PCA is used to project the high-dimensional shape space to a low-dimensional shape descriptor [43, 5]. Since these algorithms require heavy preprocessing/segmentation, deep learning has been used to learn correspondences directly from a population of 2D/3D images [6, 36]. These methods, being supervised, still incur preprocessing overhead during training and also need large datasets or data-augmentation methods to learn effectively. Neither requirement is easily met in many cases, especially with medical data.
Also relevant is work on deformable registration of images, which has been used as a tool for shape representation [4] and atlas building [30]. The implicit deformation fields are hard to interpret; therefore, having a fixed atlas to which each image in the population is deformed helps standardize the fields. Deep learning-based unsupervised registration (e.g., [3]) has attracted a lot of attention in recent years. This unsupervised registration framework has been extended to model diffeomorphic registration [16], construct image atlases [15], and leverage empirical information about the shape population [7].
Another relevant body of work stems from the computer vision literature that uses image alignment to obtain dense feature maps [19, 39], in a similar vein to the widely popular scale-invariant feature transform (SIFT) [34] features. Other works focus on utilizing convolutional neural networks (CNNs) to learn surface features and use them to obtain correspondence points [10] or as shape features for subsequent correspondence optimization [1]. All these works concentrate on discovering surface/shape features using CNNs, whereas our work proposes an unsupervised approach for landmark discovery.

3 Methods
This section covers the necessary background for statistical shape modeling and image registration, the proposed model architecture and training, loss functions and optimization, and generalized model variants.
3.1 Shape Analysis and Image Registration
Statistical shape modeling (SSM) can be broadly categorized into two parts: (i) shape representation and (ii) shape analysis. Shape representation entails taking the raw data (which can be in the form of images, meshes, label maps, etc.) and expressing it in a usable, quantifiable form for subsequent shape analysis. Shape analysis then finds relevant statistics from the shape representation pertinent to the downstream application. To reiterate, in this paper, we refer to downstream applications as applications that utilize the shape descriptors to perform a task on given data, for example, using shape descriptors as features for shape classification. In this paper, we restrict our shape representation to point correspondences, which are a geometrically consistent set of 2D/3D points across the population of shapes. Hence, each shape from a population of shapes can be expressed via $X \in \mathbb{R}^{M \times d}$, where $M$ is the number of landmarks/correspondences per shape and $d$ is the space dimension ($d = 2, 3$ for 2D and 3D shapes, respectively). We can use these $X$'s for shape analysis, which usually involves performing PCA and using the low-dimensional representation to analyze the modes of shape variation.
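As a concrete sketch of this pipeline stage, the PCA projection of landmark-based shape representations onto a low-dimensional descriptor can be written as follows (a minimal NumPy illustration with hypothetical array shapes, not the paper's implementation):

```python
import numpy as np

def pca_shape_descriptor(shapes, k):
    """Project flattened landmark sets onto the top-k PCA modes.

    shapes: array of shape (N, M, d) -- N shapes, M landmarks, d dims.
    Returns the k-dimensional loadings (the shape descriptor), the
    principal components, and the mean shape vector.
    """
    X = shapes.reshape(len(shapes), -1)      # (N, M*d)
    mu = X.mean(axis=0)
    # SVD of the centered data matrix gives the PCA modes of shape variation
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    loadings = (X - mu) @ Vt[:k].T           # (N, k) low-dimensional descriptor
    return loadings, Vt[:k], mu
```

The loadings serve as the low-dimensional shape descriptor; each row of `Vt[:k]` is one mode of shape variation.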
Landmarks also play an important role in image registration. A common underlying assumption in image registration is that a well-registered image will also match the landmarks placed at anatomically relevant features. Furthermore, landmarks can be used to perform image registration; for instance, radial basis functions (RBFs) can be used to parametrize the image deformation based on landmarks (used as control points) provided on the source and the target images. Mathematically, if $T$ represents the image transformation, we can model the deformation as follows:

$T(x) = \sum_{i=1}^{M} w_i\, \phi(\|x - c_i\|)$  (1)

Here, $\phi$ represents the RBF kernel used, $x$ represents the coordinates of the image, and $c_i$ represents the control points. If we are given the control points on the source and target images, we can solve the resulting linear system of equations to find the weights $w_i$ and apply the transformation to the entire image coordinate grid. The transformed coordinates are interpolated to obtain the warped image from the source image.
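A minimal sketch of this landmark-driven warp, assuming a Gaussian RBF kernel (the kernel choice and `sigma` are illustrative; the paper's exact formulation may differ):

```python
import numpy as np

def solve_rbf_warp(src_pts, tgt_pts, sigma=0.1):
    """Solve the linear system so the warp maps the source control points
    onto the target control points (Gaussian kernel; sigma is hypothetical)."""
    d = np.linalg.norm(src_pts[:, None, :] - src_pts[None, :, :], axis=-1)
    A = np.exp(-(d / sigma) ** 2)              # kernel matrix A[i, j] = phi(||c_i - c_j||)
    w = np.linalg.solve(A, tgt_pts - src_pts)  # one linear solve per spatial dimension
    return w

def apply_rbf_warp(coords, src_pts, w, sigma=0.1):
    """Displace arbitrary coordinates using the solved RBF weights."""
    d = np.linalg.norm(coords[:, None, :] - src_pts[None, :, :], axis=-1)
    return coords + np.exp(-(d / sigma) ** 2) @ w
```

Solving once per image pair yields a displacement for every pixel coordinate; interpolating the source image at the transformed grid produces the warped image.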
3.2 Model Description
Here, we propose a model to obtain anatomically relevant landmarks directly from images. To achieve this, we rely on the assumption that a good image registration between a pair of images should indicate good anatomical feature correspondence. This assumption ties in naturally with self-supervised learning, where we use image registration as the primary task and, in turn, obtain key landmark locations on the input images.
The resulting architecture is broken into three components, broadly described as follows:

Landmark Encoder: A deep network that takes an image as input and predicts the locations of $M$ landmarks on it; these discovered landmarks act as the control points for the RBF-based registration.

RBF Linear Solver: We use the landmarks to construct the linear system matrix $\mathbf{A}$ and its associated output vector $\mathbf{b}$; for the transformation parameters $\mathbf{w}$, we have $\mathbf{A}\mathbf{w} = \mathbf{b}$. The RBF linear solver module consists of formulating this system of equations and solving it to find the transformation parameters. The matrix construction and linear solver are described in Appendix 0.A.

Spatial Warp Module: We use the transformation parameters $\mathbf{w}$ to warp the coordinate grid and interpolate the source image, obtaining the registered image. This can be easily performed using a spatial transformer unit [28].
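The spatial warp step can be illustrated with a plain bilinear sampler (a NumPy sketch of what a spatial transformer unit does; a real implementation would typically use a differentiable op such as `torch.nn.functional.grid_sample`):

```python
import numpy as np

def bilinear_warp(img, coords):
    """Sample `img` at transformed (float) coordinates.

    coords has shape (H, W, 2) in (row, col) order; out-of-bounds
    coordinates are clamped to the image border.
    """
    H, W = img.shape
    r = np.clip(coords[..., 0], 0, H - 1)
    c = np.clip(coords[..., 1], 0, W - 1)
    r0, c0 = np.floor(r).astype(int), np.floor(c).astype(int)
    r1, c1 = np.minimum(r0 + 1, H - 1), np.minimum(c0 + 1, W - 1)
    wr, wc = r - r0, c - c0
    # blend the four neighbouring pixels
    top = img[r0, c0] * (1 - wc) + img[r0, c1] * wc
    bot = img[r1, c0] * (1 - wc) + img[r1, c1] * wc
    return top * (1 - wr) + bot * wr
```

Because bilinear weights are piecewise-linear in the coordinates, gradients can flow from the image matching loss back to the predicted landmarks.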
3.3 Loss Function and Regularization
The training loss function of the proposed network can be given as follows:
$\mathcal{L} = \mathcal{L}_{match}(I_T, I_S \circ T) + \lambda\, \mathcal{L}_{reg}(\mathbf{A})$  (2)
The first term is the image matching or registration loss between the target image $I_T$ and the registered (i.e., warped source) image $I_S \circ T$. The second term is the regularization applied to the matrix $\mathbf{A}$ from the RBF linear solver module, weighted by $\lambda$. We describe both terms in detail in the following paragraphs.
3.3.1 Image Matching Loss
In the classical image registration literature, there is a research emphasis on the interpretation as well as the effect of different loss functions in the context of image matching optimization [42]. In the context of deep learning, methods are restricted to loss functions that allow for backpropagation. The two most commonly used image-matching functions in deep learning-based image registration are the $\ell_2$ loss and the normalized cross-correlation (NCC) loss. These are given as follows:
$\mathcal{L}_{\ell_2}(I_T, I_S \circ T) = \|I_T - I_S \circ T\|_2^2$  (3)

$\mathcal{L}_{NCC}(I_T, I_S \circ T) = -\frac{1}{|\Omega|} \sum_{x \in \Omega} NCC\left(P_{I_T}(x), P_{I_S \circ T}(x)\right)$  (4)

Here, $P_I(x)$ represents an image patch on image $I$ centered at voxel location $x$ with patch size $p$. $\Omega$ denotes the set of all patches/possible patch centers. The function $NCC(\cdot, \cdot)$ represents the normalized cross-correlation between two patches.
Both the $\ell_2$ loss and NCC operate on pixel intensities and work well when the data population has a consistent intensity profile. CT scans are a good example of images with a consistent intensity profile. However, in datasets such as cardiac MRI, the intensity histograms of individual MRI scans are highly variable, and these intensity-driven loss functions will fail to capture structural matching. These losses will also fail when the input data come from two different sources or modalities, such as two different scanners/centers or a dataset containing both T1-weighted and T2-weighted MRI images. In such scenarios, pixel intensities are not the correct measure to quantify image matching, and we need losses that can capture structural correlation. Therefore, we also use the modality independent neighborhood descriptor (MIND) features to formulate a registration loss; several other recent registration works have used MIND features as a loss [45]. MIND features rely on image patches; for a given image $I$ at a pixel/voxel location $x$, its image patch is denoted as $P_I(x)$, with $p$ being the patch size. The MIND feature for an image at a pixel/voxel is given as:
$MIND(I, x, r) = \frac{1}{Z} \exp\left(-\frac{\|P_I(x) - P_I(x + r)\|_2^2}{V(I, x)}\right)$  (5)

Here, $r$ is the displacement vector, $Z$ is a normalization constant, and $V(I, x)$ is the local variance of the image patch at $x$. The matching loss function using these MIND features is given as:
$\mathcal{L}_{MIND}(I_T, I_S \circ T) = \frac{1}{|\Omega||R|} \sum_{x \in \Omega} \sum_{r \in R} \left| MIND(I_T, x, r) - MIND(I_S \circ T, x, r) \right|$  (6)
$\Omega$ is the set of voxel locations and $R$ is the set of displacements used. We generally use a set of displacement vectors describing a local neighborhood (such as the 4-neighborhood for 2D images or the 6-neighborhood for 3D). The loss function is parametrized by (i) the patch size $p$, and (ii) the displacement magnitude, i.e., $\|r\|$. In all of our experiments using the MIND loss, both in 2D and 3D, we use an isotropic patch of size 3, and the displacement magnitude is kept fixed.
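A simplified 2D version of the MIND matching loss can be sketched as follows (4-neighbour displacements, box-filtered patch distances, periodic image edges, and per-pixel max normalization are simplifying assumptions for this sketch, not the exact formulation of [45]):

```python
import numpy as np

def mind_features(img, eps=1e-6):
    """Simplified 2D MIND descriptor: per-pixel exp(-patch distance / local
    variance) for each 4-neighbour displacement (periodic edges via np.roll)."""
    shifts = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # 4-neighbourhood displacements

    def box3(x):  # 3x3 box filter approximating the patch sum of squared diffs
        out = np.zeros_like(x)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out += np.roll(np.roll(x, dy, 0), dx, 1)
        return out / 9.0

    dists = [box3((img - np.roll(np.roll(img, dy, 0), dx, 1)) ** 2)
             for dy, dx in shifts]
    D = np.stack(dists)                    # (4, H, W) patch distances
    V = D.mean(axis=0) + eps               # local variance estimate
    feats = np.exp(-D / V)
    return feats / feats.max(axis=0, keepdims=True)  # per-pixel normalisation

def mind_loss(img_a, img_b):
    return np.mean(np.abs(mind_features(img_a) - mind_features(img_b)))
```

Because the patch distances and the local variance scale together, the descriptor, and hence the loss, is largely insensitive to global intensity rescaling, which is the property needed for cross-modality matching.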
3.3.2 Regularization
The linear system used to solve for the RBF warp parameters requires the matrix $\mathbf{A}$ to be nonsingular. However, the positions of the landmarks coming from the landmark encoder are arbitrary; hence, during optimization, the matrix $\mathbf{A}$ can become poorly conditioned or even singular. A singular matrix has an infinite condition number, and a poorly conditioned matrix has a large condition number. Such a scenario can result in an infinite number of solutions to the linear system and can cause the optimization to break. To ensure stable optimization, we introduce a regularization term that minimizes the condition number of the matrix $\mathbf{A}$, given as:
$\mathcal{L}_{reg}(\mathbf{A}) = \kappa_F(\mathbf{A}) = \|\mathbf{A}\|_F\, \|\mathbf{A}^{-1}\|_F$  (7)
$\|\cdot\|_F$ denotes the Frobenius norm of the matrix; this formulation allows us to easily differentiate the regularizer. Minimizing the condition number of $\mathbf{A}$ provides an additional benefit. We note that a poorly conditioned $\mathbf{A}$ occurs when one or more pairs of landmarks (used in the construction of $\mathbf{A}$) are very close together. Hence, making the matrix well-conditioned forces the landmarks to spread out across the image, allowing them to span larger regions. This effect is showcased in Figure 2, demonstrated on diatoms as a simple example. One can see that landmarks spread along the diatom boundary are achieved even with no regularization. However, many landmarks lie close together, which can cause the optimization to break; indeed, without regularization the optimization is observed to break (generating NaNs) because of the singularity of the kernel matrix $\mathbf{A}$. With points close together, the overall shape descriptor is also less informative. The regularized version spreads landmarks more uniformly over the image domain. Landmarks that are far from the boundary can be removed by the redundancy removal described in the next section. The effect of regularization on the registration loss is described in Section 4.1.
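The regularizer amounts to a few lines (NumPy shown for clarity; during training this would be computed with differentiable tensor operations):

```python
import numpy as np

def condition_regularizer(A):
    """Frobenius-norm condition number kappa_F(A) = ||A||_F * ||A^{-1}||_F.

    Both factors are smooth in A away from singularity, so the term can be
    minimised jointly with the registration loss by gradient descent.
    """
    return np.linalg.norm(A, 'fro') * np.linalg.norm(np.linalg.inv(A), 'fro')
```

As two landmarks approach each other, two rows of the kernel matrix become nearly identical, $\|\mathbf{A}^{-1}\|_F$ blows up, and the penalty pushes the landmarks apart.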
3.4 Training Procedure
The entire model is trained jointly with Adam [32], and network parameters are initialized randomly in the showcased experiments. However, one can imagine providing a starting initialization for the output layer of the landmark encoder, for instance, by using the mean landmarks from a precomputed PDM on the population. The hyperparameter $\lambda$ controls the amount of regularization and can be chosen via cross-validation. For a given set of images, training is performed on a set of image pairs. For smaller datasets, we use all possible image pairs; for larger datasets, we can either randomly choose image pairs or employ a sampling heuristic. As the model is trained on pairs of images, even small datasets (as small as 50 images) can be used effectively for training. Such low-resource learning is imperative for medical imaging, where data is often too scarce to train neural networks effectively.
In this work, we treat the number of landmarks to be predicted as a hyperparameter. As the landmark locations can be arbitrary, some landmarks may be redundant in characterizing the structure of interest. To remove such landmarks, we use a simple heuristic based on the assumption that removing a redundant landmark will not affect the registration. Using this heuristic, we compute the change in mean registration accuracy obtained by removing one landmark at a time (performed on an already trained model as post-processing). This difference is the importance value attached to each landmark: the larger the difference, the more important the landmark is for accurate registration. We remove landmarks in a greedy fashion, following these steps:

For the remaining set of landmarks, compute the registration loss. Initially, the set of remaining landmarks is the same as the full starting set.

Temporarily remove one landmark at a time and compute the registration loss and its difference from the one computed in step 1. Do this for all remaining landmarks.

Select the landmark with the least difference (least importance) and remove it.

Repeat steps 1-3 until either the desired number of landmarks is reached or a difference threshold is reached.
This removal allows for a smaller and more informative landmarkbased shape descriptor, and the effect is shown in Figure 2.
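The greedy pruning above can be sketched as follows (`registration_loss` is assumed to be a callable that evaluates the trained model's registration loss for a given landmark subset; all names are illustrative):

```python
import numpy as np

def greedy_landmark_removal(landmarks, registration_loss, keep, tol=np.inf):
    """Greedily drop the landmark whose removal changes the registration
    loss the least, until `keep` landmarks remain or every remaining
    landmark's importance exceeds `tol` (steps 1-4 above)."""
    remaining = list(range(len(landmarks)))
    while len(remaining) > keep:
        base = registration_loss([landmarks[i] for i in remaining])
        # importance of landmark j = loss change when j is temporarily removed
        diffs = [registration_loss([landmarks[i] for i in remaining if i != j]) - base
                 for j in remaining]
        if min(diffs) > tol:       # every remaining landmark matters enough
            break
        remaining.pop(int(np.argmin(diffs)))
    return remaining
```

The loop re-evaluates importances after every removal, so a landmark that becomes important once its neighbour is gone is retained.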
Regularization and redundancy removal: The regularization term tends to spread the particles evenly across the image and is applied as a soft constraint alongside the image matching loss. The regularization acts in conjunction with the registration loss, i.e., if a feature in an image exhibits a higher registration loss, the particles will be distributed to match that feature better. In cases where the anatomy of interest is localized, the lower registration loss in that region would cause the redundancy removal to disregard the spread-out particles. In such scenarios, the registration loss must be spatially weighted to introduce a preference for the localized anatomy; this model variant is introduced in the following section. Furthermore, the redundancy removal needs to be applied carefully, with quality control, so that particles are not removed from the region of interest. We can also modify the redundancy removal process to only consider selective regions of the image while computing the registration accuracy.
Note on Training and Inference Time: We trained the network (2D architecture described in Appendix 0.B) with 30 2D landmarks on a dataset of 100 toy 2D images (i.e., 10,000 image pairs, the actual training set size). We utilize a single 12GB NVIDIA TITAN V GPU. The training time for one epoch with a batch size of 20 is 6.2 minutes, and the inference time for a single scan is 0.4 seconds. Another important aspect is the GPU memory requirement, which for this experiment is 4600MB using the PyTorch deep learning framework.
3.5 Model Variants
Weak Supervision Variant: The proposed model can be easily modified for a weakly supervised learning framework to introduce a prior that informs the landmark positioning via changes to its loss function. In many cases, medical data can be presented along with some form of shape description. The most common of these forms is via segmentation of the anatomy of interest. However, in a typical scenario, only a limited number of data have segmentation associated with the image. In such cases, segmentation can be used to improve the landmark positioning by the following changes to the model:

We use the learned transformation parameters to transform a source segmentation image $S_S$ to the registered segmentation $S_S \circ T$.

Introduce the matching function between the registered segmentation and the target segmentation $S_T$, and optimize the model with the updated loss. This loss term is only activated when both the source and target segmentations are present, providing weak supervision for the landmark (shape descriptor) discovery task.
The loss function can thus be expressed as:
$\mathcal{L} = \mathcal{L}_{match}(I_T, I_S \circ T) + \lambda\, \mathcal{L}_{reg}(\mathbf{A}) + \mathbb{1}_{seg}\, \mathcal{L}_{match}(S_T, S_S \circ T)$  (8)
Here, $\mathbb{1}_{seg}$ is an indicator variable that is 1 when both $S_S$ and $S_T$ exist and zero otherwise.
There are two other aspects to note here: (i) the input to the landmark encoder is still an image, and therefore during testing, we do not need an additional segmentation input, and (ii) instead of binary segmentations, any other form of shape information that can be deformed can be used as well, such as signed distance transforms or correspondences.
Localized Variant: In some instances, the anatomical area of interest is localized, and we would like the landmarks to be expressive of that particular location. For instance, in CT scans of the femur, the diagnosis and characterization of cam-type femoroacetabular impingement [38] is done by analyzing the localized region below the femoral head and its deviation from a representative femur shape of a healthy patient. Such localized characterization requires the shape descriptor to contain sufficient information about this localized anatomy. To achieve this, we again assume that image registration drives landmark placement, i.e., for the landmarks to be more expressive of the localized region, we want to achieve the best image registration in that region. Therefore, we propose a simple modification to the loss function:
$\mathcal{L}_{match}^{loc}(I_T, I_S \circ T) = \mathcal{L}_{match}\left(B \odot I_T,\; B \odot (I_S \circ T)\right)$  (9)
Here, $B$ represents a mask encoding the location of the localized region of interest. In a special case, these masks are fixed (i.e., the same) for the given dataset if the anatomy of interest occupies a common space across the image population, i.e., the images are roughly aligned.
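A sketch of building such a soft mask from a bounding box and applying it as a spatial weight in an $\ell_2$ matching loss (the FFT-based Gaussian blur with periodic boundaries is an implementation shortcut for illustration, not the paper's exact procedure):

```python
import numpy as np

def box_mask(shape, lo, hi, blur_sigma=3.0):
    """Binary bounding-box mask softened by a Gaussian blur, mirroring how
    the fixed ROI mask is built in Section 4.3 (periodic-edge blur assumed)."""
    mask = np.zeros(shape)
    mask[lo[0]:hi[0], lo[1]:hi[1]] = 1.0
    # Gaussian kernel centered at the origin in FFT layout, then circular conv
    yy, xx = np.meshgrid(*[np.fft.fftfreq(n) * n for n in shape], indexing='ij')
    kern = np.exp(-(yy ** 2 + xx ** 2) / (2 * blur_sigma ** 2))
    kern /= kern.sum()
    return np.real(np.fft.ifft2(np.fft.fft2(mask) * np.fft.fft2(kern)))

def localized_loss(target, warped, mask):
    """Spatially weighted L2 matching loss: the soft mask B up-weights the ROI."""
    return np.mean(mask * (target - warped) ** 2)
```

The blur makes the weighting smooth at the box edges, so the gradient does not change abruptly at the ROI boundary.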
4 Results
This section shows the results of the proposed methods on different 2D/3D datasets and is divided into subsections corresponding to each dataset. We also demonstrate the usefulness of the landmark-based shape descriptor obtained in each case, paired with a downstream application. This section also includes an analysis of regularization, redundancy removal, and the application of the different proposed framework variants. In most cases, the number of epochs is chosen via early stopping based on the best validation loss.
4.1 Diatom Dataset
Diatoms are a biological group of algae found in different water bodies and fossilized deposits. Diatoms are unicellular and are categorized into different classes based on their shape/structure. Any characterization of diatoms is based on their size and shape; therefore, this is an ideal dataset for applying the proposed method and finding landmark-based shape descriptors. The diatoms dataset (downloaded from the DIADIST project page: rbgweb2.rbge.org.uk/DIADIST/) contains 2D isolated cellular images from four different diatom morphologies, namely, Eunotia (68 samples), Fragilariforma (100 samples), Gomphonema (100 samples), and Stauroneis (72 samples). This dataset was collected as part of the automatic diatom identification and classification project [20]. The data is split 80%/10%/10% into training, validation, and testing sets.
We train the proposed network with the $\ell_2$ loss for image matching, with a regularization parameter $\lambda$ found using cross-validation (as described below), and with 16 landmarks. We also keep four predetermined landmarks on the corners while computing the warp; these are not learned by the network. We train the network (using the 2D image architecture given in Appendix 0.B) for 20 epochs on all possible image pairs, with no additional data augmentation. As a post-processing step, we perform the redundancy removal described in Section 3.4 to retain 11 landmarks. Results shown in Figure 3 highlight the structural correspondence between different diatom classes. We notice that some of the landmarks are not precisely on the border of the shape. Such positioning arises from the fully unsupervised training of the model with respect to the landmarks: the network has no prior on how and where to place the landmarks, and hence their placement is dictated solely by achieving the best possible registration loss. For instance, landmark 9 in Figure 3 is placed just inside the image border, in a region of reasonably similar intensity, because doing so reduces the image matching loss.
Downstream Application:
These landmarks are easy to use as shape-based features of diatoms. We perform spectral clustering [44] using these landmarks as features, and it separates the classes into clusters well, except for Fragilariforma and Eunotia, which exhibit very similar shapes when their scales match. For classification, a multilayer perceptron (with a single hidden layer) can distinguish between the four classes using these landmarks as inputs with 100% test accuracy.
Regularization parameter: Using this simple dataset, we also showcase the selection of the regularization parameter via cross-validation. We perform three-fold cross-validation with different values of $\lambda$ and compute the average registration loss at every fold. The plot for this experiment is shown at the top-right of Figure 3. It highlights that there is an optimal $\lambda$ that minimizes the registration loss, which is a notable result. Since the regularization does not act directly on the network parameters to limit the space of solutions, there is no guarantee that generalization in registration performance should improve. The regularization term is introduced only for optimization stability and to ensure a uniform particle spread, tasks that it performs well. However, empirically, we observe an optimal $\lambda$ that achieves the best generalization, showing that the proposed regularization term also improves the generalization of the network.
4.2 Metopic Craniosynostosis Dataset
Metopic craniosynostosis [29] is a morphological condition of the skull/cranium that affects infants. It occurs due to premature fusion of the metopic suture in the skull, and subsequent brain growth causes deformed head shapes. The morphological symptoms include a triangular-shaped forehead (trigonocephaly) [31] and compensatory expansion of the back of the head. In severe cases, along with abnormal morphology, patients are affected by increased intracranial pressure, which can cause several neurological complications. In current practice, the severity of metopic craniosynostosis is gauged subjectively by surgeons, affecting the subsequent treatment protocol. The usual treatment entails a risky surgical procedure for severe cases or continued observation for milder ones. In recent research, the skull shape of metopic patients and its deviation from normal have been used to devise an objective severity measure [31, 6]. These methods use CT scans that underwent segmentation and/or were processed for shape representation; these steps involve manual and computational overhead.
We use the proposed method directly on the CT scans, aiming to obtain a shape descriptor that can subsequently be used for severity quantification of metopic craniosynostosis. Our dataset comprises cranial CT scans of infants between 5 and 15 months of age, acquired at UPMC Children's Hospital of Pittsburgh between 2002 and 2016. Of these, 27 are scans of patients diagnosed with metopic craniosynostosis; this diagnosis was performed by a board-certified craniofacial plastic surgeon using the CT images as well as a physical examination. The remaining 93 patients were trauma patients who underwent CT scans but demonstrated no morphological abnormalities; these scans form our control set. All CT scans were acquired as part of a routine clinical care protocol under an IRB-approved registry. We collect this data with HIPAA protocols and de-identification of the scans. In the following analysis, we refer to the CT scans from the control set as controls and the CT scans diagnosed with metopic craniosynostosis as metopic. We train the network only on the control set with an 80%/10%/10% data split for training, validation, and testing, respectively. We use $\ell_2$ as our image matching loss function because the dataset has minimal intensity variation across samples and the primary structure of the cranium is well-defined. We use 100 3D landmarks (8 fixed at the corners) and a regularization parameter $\lambda$ chosen via cross-validation. We perform redundant landmark removal by selecting the best 50 landmarks as post-processing. In addition to the test set, we also evaluate the model on the CT scans diagnosed with metopic craniosynostosis; these scans are not observed by the model during training, as it is trained only on the control set. Figure 4 showcases qualitative results, showing the registration performance between a single source image (metopic) and two targets, one metopic and one from the control set.
Downstream Application:
Using the landmarks from the proposed method as a shape descriptor, we can compute the Mahalanobis distance/Z-score of each scan with respect to the population of control scans. If we consider $\{X_1, \dots, X_N\}$ to be the shape descriptors of $N$ scans from the control set, whose mean and covariance are given as $\mu$ and $\Sigma$ respectively, the Mahalanobis distance/Z-score for a scan with shape descriptor $X$ is given by:

$Z(X) = \sqrt{(X - \mu)^T\, \Sigma^{-1}\, (X - \mu)}$  (10)
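The Z-score reduces to a few lines of linear algebra (NumPy sketch; `eps` is a hypothetical ridge term guarding against a rank-deficient sample covariance):

```python
import numpy as np

def mahalanobis_z(x, controls, eps=1e-8):
    """Mahalanobis distance of shape descriptor x from the control
    population (rows of `controls`); eps regularises the covariance."""
    mu = controls.mean(axis=0)
    cov = np.cov(controls, rowvar=False) + eps * np.eye(controls.shape[1])
    diff = x - mu
    # solve instead of explicit inverse for numerical stability
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```

A scan close to the control mean scores near zero; the further a descriptor deviates along low-variance directions of the control distribution, the larger its score.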
This Z-score represents the deviation of a scan from the control population (which has no symptoms of cranial deformity). Hence, a metopic skull should have a larger Z-score than a control scan; Figure 5 (left) shows that this is indeed the case. We also compare the efficacy of the obtained landmarks as shape descriptors against dense correspondences from a point distribution model (PDM). For this, we utilize the ShapeWorks [13] package, a state-of-the-art method for automatic correspondence discovery. As a PDM method, ShapeWorks has been used in the context of metopic severity prediction [5]. ShapeWorks produces a set of 2048 correspondences on the surfaces of cranial CT scans using their segmented images, which is represented as a low-dimensional PCA loading vector to be used as a shape descriptor. We compare the efficacy of both shape descriptors in characterizing the severity of metopic craniosynostosis by computing the Mahalanobis distance (or Z-score) of each shape descriptor with respect to the control data distribution. The Pearson correlation between the two sets of scores is 0.81, and the corresponding scatter plot (after normalizing each set of Z-scores) is given in Figure 5. The Z-scores are normalized by dividing by the maximum Z-score in each set; this constrains them to lie between 0 and 1 for better visualization in the scatter plot and does not affect the correlation. Such a significant correlation shows that both methods capture similar shape information for characterizing the severity of the condition. Additionally, we compare the Z-scores from the proposed method with aggregate severity scores from 16 craniosynostosis experts' ratings. Each rating uses a Likert scale between 0 and 5, with 5 being the most severe, and only the 27 metopic scans are rated. The Z-scores and the aggregate ratings show a positive correlation with a Pearson coefficient of 0.64 (see Figure 5).
In comparison, Pearson’s correlation between the Mahalanobis distance from ShapeWorks and the expert ratings is only 0.28.
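The Z-score computation (Eq. 10) and the scale-invariance of the Pearson correlation can be sketched in a few lines of NumPy. The control and metopic descriptor sets below are synthetic stand-ins for illustration, not the paper's data:

```python
import numpy as np

def z_score(z, control):
    """Mahalanobis distance of descriptor z from the control distribution (Eq. 10)."""
    mu = control.mean(axis=0)
    cov = np.cov(control, rowvar=False)
    d = z - mu
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

rng = np.random.default_rng(0)
control = rng.normal(size=(100, 5))           # hypothetical control descriptors
metopic = rng.normal(loc=2.0, size=(20, 5))   # hypothetical metopic descriptors

zc = np.array([z_score(z, control) for z in control])
zm = np.array([z_score(z, control) for z in metopic])

# Normalizing by the maximum Z-score maps scores into (0, 1] for plotting;
# the Pearson correlation is invariant to this positive rescaling.
all_z = np.concatenate([zc, zm])
norm = all_z / all_z.max()
```

As in the experiment above, the shifted (metopic-like) group ends up with systematically larger Z-scores than the controls.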
4.3 Cam-type Femoroacetabular Impingement (cam-FAI)
Femoroacetabular impingement (FAI) [2] is an orthopedic disorder that affects movement in the hip joint. Cam-FAI is a primary cause of hip osteoarthritis and is characterized by abnormal bone growth of the femoral head; it affects a localized region, as shown in Figure 6. Of interest is the deviation of a femur diagnosed with cam-FAI from a representative femur anatomy of a healthy patient population; this deviation can inform operative decisions and subsequent treatment planning. With the localized variant of the proposed method (see Section 3.4), we aim to discover a shape descriptor that captures the localized structure of the cam-FAI pathology. In this study, we use a dataset of 59 CT scans of the femur: 50 scans are of patients without any diagnosed morphological defect in their femurs (we call this the control set), and the remaining 9 scans are from patients diagnosed with cam-FAI. All data were originally collected for research purposes, specifically for the evaluation of hip biomechanics [26, 25] at the Orthopaedics Research Laboratory, School of Medicine, University of Utah. All participants provided informed consent prior to participation in this University of Utah IRB-approved study. The dataset contains both left and right femurs, and all right femurs have been reflected across the mid-sagittal plane for consistent orientation across the dataset. We use the median CT scan of the control set to define a common/fixed mask image for the image-matching loss (Eq. 9); this is obtained by selecting a bounding box around the anatomy of interest and then blurring it via convolution with a Gaussian filter (Figure 6). We apply the method with loss weights found by cross-validation, train the model for 20 epochs on the joint set (control and cam-FAI scans) with 80 landmarks, and perform redundancy removal by selecting the top 40 landmarks.
Figure 7 shows the landmarks on a single source–target image pair and the corresponding registration output. We note that the registration error is low in the region of interest, as highlighted by the close-up section of the registration.
Downstream Application: We want to characterize how well the mean pathology is captured. In this experiment, we use the segmentation masks of the femur anatomy available for each image. We want to discover the mean anatomy of the femurs from the control set and the mean of the femurs diagnosed with cam-FAI; for a given image set, we follow these steps:

1. We compute the mean landmarks (using the landmarks discovered by the proposed model) over the set of CT scans.
2. We register each image's segmentation mask to the mean landmarks (using the landmarks discovered on that image), giving one approximation of the mean image per scan.
3. We average these approximations over all images in the population to obtain the mean segmentation image.
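The steps above can be sketched with SciPy's `RBFInterpolator` (a thin-plate RBF, assumed here as the warp model) and `map_coordinates`. The masks and landmarks below form a synthetic 2-D toy population, not femur data:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates

def warp_to_mean(mask, lm, lm_mean):
    """Resample `mask` in the mean-landmark frame: an RBF maps each
    mean-frame coordinate to the corresponding image-frame coordinate."""
    f = RBFInterpolator(lm_mean, lm, kernel='thin_plate_spline')
    h, w = mask.shape
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing='ij'), -1)
    coords = f(grid.reshape(-1, 2)).reshape(h, w, 2)
    return map_coordinates(mask.astype(float), [coords[..., 0], coords[..., 1]], order=1)

# toy population: vertically shifted square masks with 4 landmarks each
masks, lms = [], []
for s in [0, 2, 4]:
    m = np.zeros((32, 32)); m[8 + s:20 + s, 8:20] = 1.0
    masks.append(m)
    lms.append(np.array([[8 + s, 8], [8 + s, 19], [19 + s, 8], [19 + s, 19]], float))

lm_mean = np.mean(lms, axis=0)                                        # step 1
warped = [warp_to_mean(m, lm, lm_mean) for m, lm in zip(masks, lms)]  # step 2
mean_seg = np.mean(warped, axis=0)                                    # step 3
```

Each mask is resampled in the mean-landmark frame, and averaging the warped masks yields the mean segmentation image.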
We follow the above steps for the control set, obtaining a representative shape for healthy femur anatomy (normal mean shape), and for the set of cam-FAI scans, obtaining a representative femur shape with cam-FAI (cam-FAI mean shape). We compare the two using the surface-to-surface distance from the cam-FAI mean to the normal mean; this distance is projected onto the mesh of the normal mean shape in Figure 8. Negative values indicate regions where the average cam-FAI anatomy bulges outward from the average normal, whereas positive values indicate regions where it lies inward. The visualized cam-FAI pathology (outward regions) matches the clinically plausible location of the cam-FAI-affected region on the femur. We believe this behavior would be even more pronounced with more pathological scans in training. This experiment and its results are promising, as the cam-FAI deformity is very subtle and difficult to capture; capturing such subtle variation would otherwise require full segmentation and dense correspondences [2]. Furthermore, we compute the Mahalanobis distance using the landmarks as shape descriptors, with the normal CT scans forming the base distribution, to verify whether CT scans with a cam-FAI diagnosis deviate from the control population. Note that the number of scans is smaller than the landmark descriptor dimension, so we first project the landmarks onto a PCA subspace whose dimension equals the number of scans, which captures 100% of the shape variation; this is necessary because otherwise the covariance matrix is singular and the Mahalanobis distance produces non-interpretable values. From the histogram shown in Figure 8, we see that the scans with a cam-FAI diagnosis deviate significantly from the control set, indicating that the shape descriptor captures the localized pathology well.
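A minimal NumPy sketch of the PCA-then-Mahalanobis step, with synthetic descriptors standing in for the 50 control and 9 cam-FAI scans (the 120-dim descriptor size is chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
controls = rng.normal(size=(50, 120))             # 50 scans, 120-dim descriptors
pathological = rng.normal(loc=0.8, size=(9, 120)) # shifted, "pathological" scans

# With fewer scans than descriptor dimensions the covariance is singular, so
# first project onto a PCA basis of the control set (at most N-1 components,
# capturing 100% of its variance), then compute Mahalanobis distances there.
mu = controls.mean(axis=0)
_, s, vt = np.linalg.svd(controls - mu, full_matrices=False)
basis = vt[s > 1e-10]                             # drop degenerate directions

def mahalanobis(x, ref):
    m, c = ref.mean(axis=0), np.cov(ref, rowvar=False)
    d = x - m
    return np.sqrt(d @ np.linalg.inv(c) @ d)

ctrl_p = (controls - mu) @ basis.T
path_p = (pathological - mu) @ basis.T
d_ctrl = np.array([mahalanobis(x, ctrl_p) for x in ctrl_p])
d_path = np.array([mahalanobis(x, ctrl_p) for x in path_p])
```

Without the projection, `np.linalg.inv` would fail (or return meaningless values) on the rank-deficient 120 × 120 covariance.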
4.4 Cardiac LGE Dataset
This dataset consists of 3D late gadolinium enhancement (LGE) images of the left atrium (LA) for 207 patients, all diagnosed with irregular heartbeats or atrial fibrillation (AF). These scans were acquired after the first ablation, a procedure used for the treatment of AF. Cardiac MR imaging was performed on AF patients presenting at the University of Utah Hospital's Electrophysiology Clinic. Image sequences include a respiratory- and ECG-gated MRA, acquired during continuous gadolinium contrast agent injection (0.1 mmol/kg, MultiHance [Bracco Diagnostic Inc.]), followed by a 15-minute post-contrast LGE sequence. Images were acquired on either a 1.5 T or 3 T clinical MR scanner (Siemens Medical Solutions) using phased-array receiver coils. LGE-MRI scans were acquired approximately 15 minutes after the contrast agent injection, at the end-diastole phase of the cardiac cycle. The scanning protocol used a 3D inversion-recovery, respiration-navigated, ECG-gated, gradient echo pulse sequence. Typical acquisition parameters include free breathing using navigator gating, a transverse imaging volume with voxel size 1.25 × 1.25 × 2.5 mm (reconstructed to 0.625 × 0.625 × 1.25 mm), and an inversion time of 270–320 ms. Inversion times for the LGE-MRI scan were identified using a TI scout scan. Other parameters for the 1.5 T scanner included a repetition time of 5.4 ms, an echo time of 2.3 ms, and a flip angle of 20°; scans on the 3 T scanner used a repetition time of 3.1 ms, an echo time of 1.4 ms, and a flip angle of 14°. ECG gating was used to acquire a small subset of phase-encoding views during the diastolic phase of the LA cardiac cycle.
This is a very challenging dataset for two reasons: (i) the LGE images are low resolution, noisy, and highly varied in their intensity profiles (see the image pair in Figure 9), and (ii) the LA shape is not easily distinguishable, intensity-wise, from neighboring anatomies. We employ our model on this dataset to test its limits. The goal of this experiment is not to achieve accurate registration but rather to find usable shape descriptors. The downstream task for this application is the prediction of atrial fibrillation recurrence from LA shape, expressed via the proposed landmark shape descriptor. Due to the challenging nature of the data, we use distance transforms via the weak-supervision variant of the framework, using all 207 distance transforms. We apply the MIND loss for both image-matching terms (on the LGE images and on the distance transforms) and use 100 landmarks. Additionally, we perform a center-of-mass alignment on the images (using the corresponding segmentations) followed by cropping; this step is necessary to highlight the LA shape and isolate it from neighboring anatomies. We train the network for 15 epochs and perform redundancy removal by retaining 50 landmarks per image. A source–target image pair with landmarks is shown in Figure 9. The registration performance is not as good as on the other datasets due to the low quality of the images and the high variability of LGE intensities and LA shapes.
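The center-of-mass alignment and cropping step can be sketched with `scipy.ndimage`; the helper and the toy volume below are illustrative, not the paper's pipeline code:

```python
import numpy as np
from scipy.ndimage import center_of_mass, shift

def com_align_and_crop(image, seg, size):
    """Shift `image` so the segmentation's center of mass sits at the volume
    center, then crop a fixed-size region around that center."""
    com = np.array(center_of_mass(seg))
    center = (np.array(image.shape) - 1) / 2.0
    aligned = shift(image, center - com, order=1)
    lo = (np.array(image.shape) - np.array(size)) // 2
    sl = tuple(slice(int(l), int(l + s)) for l, s in zip(lo, size))
    return aligned[sl]

# toy 3-D volume with an off-center blob; aligning on the blob's own
# segmentation should center it in the crop
seg = np.zeros((16, 16, 16))
seg[2:6, 2:6, 2:6] = 1.0
cropped = com_align_and_crop(seg, seg, (8, 8, 8))
```

After alignment, the blob's center of mass lands at the center of the cropped volume, isolating the anatomy of interest.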
Downstream Application: The LA scans are acquired post-ablation; however, AF can occur again even after ablation, which is known as AF recurrence, so patients are monitored post-ablation for recurrence. The shape of the LA and the left atrial appendage has been shown to be predictive of AF recurrence [9]. For recurrence prediction, we again use a simple multi-layer perceptron with three hidden layers, which yields a test accuracy of . In comparison, we also use ShapeWorks to place a dense set of 2048 particles on these left atrium shapes and perform PCA to reduce them to a 20-dimensional shape descriptor; using this with the same MLP architecture for AF recurrence classification, we get a test accuracy of . This shows that the discovered landmark shape descriptor captures almost the same information as a dense correspondence model for this particular downstream task. A better initialization of the landmarks, introducing a degree of supervision and domain expertise into the model, would likely improve the shape descriptor; however, such experiments move away from the completely unsupervised and domain-independent framework of the model.
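The recurrence classifier can be sketched with scikit-learn's `MLPClassifier` (three hidden layers, as in the text; the layer widths are an assumption). The descriptors and labels below are synthetic and deliberately separable, so the resulting accuracy says nothing about the paper's reported numbers:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# hypothetical landmark shape descriptors (50 landmarks x 3 coords = 150-d)
X_no_rec = rng.normal(size=(100, 150))        # no recurrence
X_rec = rng.normal(loc=1.0, size=(100, 150))  # recurrence (shifted mean)
X = np.vstack([X_no_rec, X_rec])
y = np.r_[np.zeros(100), np.ones(100)]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64, 64, 64), max_iter=1000, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

The same architecture can be fed either the landmark descriptor or a PCA-reduced dense-correspondence descriptor, which is how the two representations are compared above.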
5 Conclusions
This paper proposes an end-to-end framework for obtaining usable shape descriptors directly from a set of 2D/3D images. The proposed model is a self-supervised network built on the assumption that anatomically consistent landmarks will register a pair of images well under a particular class of transformations. The model consists of a landmark encoder, an RBF solver, and a spatial transformer during training; at test time, we only use the landmark encoder to obtain a set of landmarks on a given image. The methodology was initially proposed in our previous work [8]; this paper provides detailed explanations and significantly extends it with different image-matching loss functions, two loss-function variants that incorporate prior shape information, and extensive experimentation on several 2D and 3D datasets. We find that the landmark shape descriptor obtained via the proposed model can be used directly for shape analysis and subsequent downstream tasks such as disease classification and severity quantification.
6 Acknowledgements
This work was supported by the National Institutes of Health under grant numbers NIBIB-U24EB029011, NIAMS-R01AR076120, NHLBI-R01HL135568, NIBIB-R01EB016701, NIBIB-R21EB026061, and NIGMS-P41GM103545. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We would also like to thank Dr. Jesse Goldstein, Dr. Andrew Anderson, Dr. Nassir Marrouche, and Dr. Penny Atkins for making their data available for use in this manuscript.
Appendix 0.A RBF Solver for Image Registration
Consider now the 2D image case, with $\mathbf{x} = (x, y)$ specifying a coordinate on the image grid. Let $T_x$ and $T_y$ be transformations acting on each coordinate, so the full transformation is $T(\mathbf{x}) = (T_x(\mathbf{x}), T_y(\mathbf{x}))$. The individual transformations follow the RBF equation (Eq. 1), which results in:

$$T_x(\mathbf{x}) = \sum_{i=1}^{M} w_i^{x}\,\phi(\|\mathbf{x} - \mathbf{p}_i\|) + a_1^{x} + a_2^{x} x + a_3^{x} y \qquad (11)$$

$$T_y(\mathbf{x}) = \sum_{i=1}^{M} w_i^{y}\,\phi(\|\mathbf{x} - \mathbf{p}_i\|) + a_1^{y} + a_2^{y} x + a_3^{y} y \qquad (12)$$

Here, $\{w_i^{x}, a_j^{x}\}$ and $\{w_i^{y}, a_j^{y}\}$ are the unknown parameters for the x and y coordinates, respectively. Given a set of $M$ control points on the source and target images ($\{\mathbf{p}_i\}$ and $\{\mathbf{q}_i\}$), we require $T(\mathbf{p}_i) = \mathbf{q}_i$. This gives us $2M$ linear equations; we need six more to solve for all $2M + 6$ unknowns. We arrive at these via the constraint that the RBF part of the transformation should have no linear or constant term, i.e., $\sum_i w_i^{x} = 0$, $\sum_i w_i^{x} x_i = 0$, and $\sum_i w_i^{x} y_i = 0$, with three analogous equations for the y coordinate. We therefore have the system of equations:

$$\begin{bmatrix} K & P \\ P^{T} & 0 \end{bmatrix} \begin{bmatrix} W \\ A \end{bmatrix} = \begin{bmatrix} Q \\ 0 \end{bmatrix} \qquad (13)$$

where $K_{ij} = \phi(\|\mathbf{p}_i - \mathbf{p}_j\|)$ and the $i$-th row of $P$ is $[1, x_i, y_i]$, and

$$L = \begin{bmatrix} K & P \\ P^{T} & 0 \end{bmatrix}, \qquad \theta = \begin{bmatrix} W \\ A \end{bmatrix}, \qquad b = \begin{bmatrix} Q \\ 0 \end{bmatrix} \qquad (14)$$

with $W \in \mathbb{R}^{M \times 2}$ stacking the RBF weights, $A \in \mathbb{R}^{3 \times 2}$ the affine coefficients, and $Q \in \mathbb{R}^{M \times 2}$ the target control points. The transformation parameters are then obtained by solving the linear system $L\theta = b$.
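The linear system above can be assembled and solved directly in NumPy. The kernel is assumed here to be the thin-plate spline φ(r) = r² log r; substitute whichever φ Eq. 1 defines:

```python
import numpy as np

def tps_params(p, q):
    """Solve L·θ = b for 2-D landmark pairs p → q (each M×2), returning the
    RBF weights W (M×2) and affine coefficients A (3×2)."""
    M = len(p)
    r = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    K = np.where(r > 0, r**2 * np.log(r + 1e-12), 0.0)  # assumed φ(r) = r² log r
    P = np.hstack([np.ones((M, 1)), p])                  # rows [1, x_i, y_i]
    L = np.block([[K, P], [P.T, np.zeros((3, 3))]])
    b = np.vstack([q, np.zeros((3, 2))])
    theta = np.linalg.solve(L, b)
    return theta[:M], theta[M:]

def tps_apply(x, p, W, A):
    """Evaluate T (Eqs. 11-12) at query points x (Q×2)."""
    r = np.linalg.norm(x[:, None, :] - p[None, :, :], axis=-1)
    phi = np.where(r > 0, r**2 * np.log(r + 1e-12), 0.0)
    return phi @ W + np.hstack([np.ones((len(x), 1)), x]) @ A

rng = np.random.default_rng(0)
p = rng.uniform(0, 10, (6, 2))  # source control points
q = rng.uniform(0, 10, (6, 2))  # target control points
W, A = tps_params(p, q)
```

For 3D images, the polynomial block grows to four columns ([1, x, y, z]), giving 3M + 12 unknowns, but the structure of the system is unchanged.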
Appendix 0.B Architectural Details
The architecture of the landmark encoder used for all 2D experiments is shown in Figure 10 (bottom). It comprises four blocks, the first two with two convolutional layers each and the last two with four. Each block is followed by max pooling with factor 2. Each convolutional layer has a kernel size of . The activation function used throughout is ReLU, except at the output layer, where we use the hyperbolic tangent. The output points therefore lie in the range −1 to 1 and are then scaled to the original coordinates using the image dimensions.
For 3D, we use a smaller architecture due to memory constraints: four blocks with two convolutional layers each, every block followed by max pooling with factor 2. Each convolutional layer has a kernel size of . The activation function is leaky ReLU, except at the output layer, where we use the hyperbolic tangent.
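A PyTorch sketch of the 2D encoder as described: four blocks with (2, 2, 4, 4) convolutional layers, 2× max pooling after each block, ReLU activations, and a tanh output in [−1, 1]. The channel widths, the elided kernel size (taken as 3 here), and the pooling head after the conv blocks are assumptions, not details from the text:

```python
import torch
import torch.nn as nn

class LandmarkEncoder2D(nn.Module):
    """Sketch of the 2D landmark encoder; channel widths and head assumed."""
    def __init__(self, n_landmarks=80, ch=(16, 32, 64, 128)):
        super().__init__()
        layers, c_in = [], 1
        for n_conv, c_out in zip((2, 2, 4, 4), ch):
            for _ in range(n_conv):
                layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU()]
                c_in = c_out
            layers.append(nn.MaxPool2d(2))   # factor-2 pooling after each block
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(ch[-1], 2 * n_landmarks), nn.Tanh())
        self.n_landmarks = n_landmarks

    def forward(self, x):
        pts = self.head(self.features(x))
        return pts.view(-1, self.n_landmarks, 2)  # in [-1, 1]; scale by image size
```

`enc(torch.zeros(2, 1, 64, 64))` returns a `(2, n_landmarks, 2)` tensor of normalized coordinates, which are then rescaled to pixel coordinates using the image dimensions.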
References

[1] (2017) Learning deep features for automated placement of correspondence points on ensembles of complex shapes. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2017, Cham, pp. 185–193.
[2] (2017) Quantitative comparison of cortical bone thickness using correspondence-based shape modeling in patients with cam femoroacetabular impingement. Journal of Orthopaedic Research 35(8), pp. 1743–1753.
[3] (2019) VoxelMorph: a learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging 38, pp. 1788–1800.
[4] (2005) Computing large deformation metric mappings via geodesic flows of diffeomorphisms. International Journal of Computer Vision 61(2), pp. 139–157.
[5] (2020) Quantifying the severity of metopic craniosynostosis: a pilot study application of machine learning in craniofacial surgery. Journal of Craniofacial Surgery 31(3), pp. 697–701.
[6] (2018) DeepSSM: a deep learning framework for statistical shape modeling from raw images. In Shape in Medical Imaging: International Workshop, ShapeMI, MICCAI 2018, vol. 11167, pp. 244–257.
[7] (2019) A cooperative autoencoder for population-based regularization of CNN image registration. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, Cham, pp. 391–400.
[8] (2020) Self-supervised discovery of anatomical shape landmarks. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, Cham, pp. 627–638.
[9] (2018) Left atrial shape predicts recurrence after atrial fibrillation catheter ablation. Journal of Cardiovascular Electrophysiology.
[10] (2016) Learning shape correspondence with anisotropic convolutional neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp. 3197–3205.
[11] (2005) Hippocampal shape analysis using medial surfaces. NeuroImage 25(4), pp. 1077–1089.
[12] (2014) Computational shape models characterize shape change of the left atrium in atrial fibrillation. Clinical Medicine Insights: Cardiology 8, pp. CMC–S15710.
[13] (2017) ShapeWorks: particle-based shape correspondence and visualization software. In Statistical Shape and Deformation Analysis, pp. 257–298.
[14] (2007) Shape modeling and analysis with entropy-based particle systems. In Proceedings of Information Processing in Medical Imaging (IPMI), pp. 333–345.
[15] (2019) Learning conditional deformable templates with convolutional networks. In Advances in Neural Information Processing Systems, vol. 32.
[16] (2018) Unsupervised learning for fast probabilistic diffeomorphic registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 729–738.
[17] (2009) Particle based shape regression of open surfaces with applications to developmental neuroimaging. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 167–174.
[18] (2002) A minimum description length approach to statistical shape modeling. IEEE Transactions on Medical Imaging 21(5), pp. 525–537.
[19] (2018) SuperPoint: self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224–236.
[20] (1999) Diatom identification: a double challenge called ADIAC. In Proceedings of the 10th International Conference on Image Analysis and Processing, pp. 734–739.
[21] (2013) A point-correspondence approach to describing the distribution of image features on anatomical surfaces, with application to atrial fibrillation. In IEEE 10th International Symposium on Biomedical Imaging, pp. 226–229.
[22] (2001) Shape analysis of brain ventricles using SPHARM. In Proceedings of the IEEE Workshop on Mathematical Methods in Biomedical Image Analysis (MMBIA 2001), pp. 171–178.
[23] (2018) On the evaluation and validation of off-the-shelf statistical shape modeling tools: a clinical application. In International Workshop on Shape in Medical Imaging, pp. 14–27.
[24] (1991) Hands: A Pattern Theoretic Study of Biological Shapes. Springer, New York.
[25] (2013) Three-dimensional quantification of femoral head shape in controls and patients with cam-type femoroacetabular impingement. Annals of Biomedical Engineering 41(6), pp. 1162–1171.
[26] (2013) Statistical shape modeling of cam femoroacetabular impingement. Journal of Orthopaedic Research 31(10), pp. 1620–1626.
[27] (2007) Shape analysis using a point-based statistical shape model built on correspondence probabilities. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 959–967.
[28] (2016) Spatial transformer networks. arXiv:1506.02025.
[29] (2011) Craniosynostosis. European Journal of Human Genetics 19(4), pp. 369–376.
[30] (2004) Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage 23(Supplement 1), pp. S151–S160.
[31] (2012) Interfrontal angle for characterization of trigonocephaly: part 1: development and validation of a tool for diagnosis of metopic synostosis. Journal of Craniofacial Surgery 23(3), pp. 799–804.
[32] (2017) Adam: a method for stochastic optimization. arXiv:1412.6980.
[33] (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, vol. 2.
[34] (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), pp. 91–110.
[35] (2014) Personalized assessment of craniosynostosis via statistical shape modeling. Medical Image Analysis 18(4), pp. 635–646.
[36] (2017) Integrating statistical prior knowledge into convolutional neural networks. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2017, Cham, pp. 161–168.
[37] (2012) Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics 31(4), p. 30.
[38] (2015) Review: femoroacetabular impingement. Arthritis & Rheumatology 67(1), pp. 17–27.
[39] (2017) Convolutional neural network architecture for geometric matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6148–6157.
[40] (2000) Parametric estimate of intensity inhomogeneities applied to MRI. IEEE Transactions on Medical Imaging 19(3), pp. 153–165.
[41] (2006) Framework for the statistical shape analysis of brain structures using SPHARM-PDM. http://hdl.handle.net/1926/215.
[42] (2014) Why does mutual-information work for image registration? A deterministic explanation. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(6), pp. 1286–1296.
[43] (1917) On Growth and Form. Cambridge University Press.
[44] (2007) A tutorial on spectral clustering. Statistics and Computing 17(4), pp. 395–416.
[45] (2020) Adversarial uni- and multi-modal stream networks for multimodal image registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 222–232.