Characteristics of FG – direction, intensity, and duration - can be critical in therapy of congenital malformations and malocclusions. From the clinical perspective, FG can be viewed as favorable, neutral, or unfavorable. Favorable growth occurs when components of the face grow in a direction or with intensity facilitating treatment, while unfavorable growth takes place when growth characteristics hinders achievement of optimal results of treatment. Neutral growth is located between these extremes.
Over the last several decades, FG was studied primarily with two-dimensional (2D) radiography. To this end ”growth studies” i.e., collections of girls and boys in whom profile 2D radiographs of the head had been taken annually, were initiated in the USA in the 1930’s. Until the late 1980’s when it became gradually clear that exposure of humans to ionizing radiation only for scientific purposes was unethical, thus unacceptable, more than 2000 children and teenagers had been followed.
Characterization of FG on 2D radiographs still poses a significant challenge even to an experienced investigator/clinician. It is due to paucity of stable reference structures (all parts of the face grow and remodel) and temporally and spatially diversified growth of individual components of the face (i.e. different parts of the face can grow at maximum velocity at different ages and different parts of the face can grow at different directions at different ages, respectively). Furthermore, direction of overall FG can vary depending on the age or developmental period.
Another challenge in assessment of FG is the lack of coordinate system within which the direction of growth can be unambiguously determined. Both in research and clinical practice, different angular or linear measurements or proportions are used for facial classification. Unfortunately, the facial morphology or growth direction categorized as ’X’ by a given measurement (labeling) can fall into a different category when alternative measurement (labeling) is applied.
The main contribution of this work is threefold:
addressing the problem of prediction of the growth of the face with an application of machine learning (ML) methods, which, to the best of our knowledge, has not been analyzed before;
performing a comprehensive data analysis which revealed the complexity of the problem and showed why a prediction based on 2D X-ray images is truly challenging;
applying and comparing a broad set of ML algorithms, best of which achieved 71% to 75% classification accuracy.
2 Related literature
First attempts of prediction of FG were made in 1960’s and 1970’s – the authors of [bjork1969prediction] proposed a structural method consisting in qualitative assessment of seven morphological features on a single 2D radiograph, while in [bhatia1979proposed] statistical modeling was used for classification of facial types, hence FG at 9 and 17 years of age. Later, the authors of [skieller1984prediction] proposed a multivariate regression model with four morphological features as having
predictive estimate of mandibular growth rotation. In[buschang1990mandibular], the authors developed a prediction system by adding mean annual velocities with predictions derived from a polynomial model of the population’s growth curve, while the authors of [rudolph1998multivariate] created prediction equations for identification of favorable vs. unfavorable growth patterns. Unfortunately, a possibility of effective FG prediction was questioned by others ([baumrind1984prediction], [lux1999evaluation], [kolodziej2002evaluation]) who observed that when previously proposed methods were used in different patient samples, their predictive power was limited and clinically irrelevant.
An interesting alternative in the search for the method of prediction of FG can be application of machine learning (ML), which has not been used to this end yet. Until now, methods of ML have been recently applied for automated identification of cephalometric landmarks on 2D radiographs ([arik2017fully], [park2019automated], [hwang2020automated]). In [arik2017fully]
, the authors used deep convolutional neural networks (CNNs) for fully automated quantitative cephalometric analysis, the authors of[park2019automated]
, compared two newly developed deep learning algorithms for automatic identification of cephalometric landmarks annotated on 2D radiographs – You-Only-Look-Once version 3 (YOLOv3) and Single Shot Multibox Detector (SSD), while in[hwang2020automated] performance of human experts and YOLOv3 algorithm for automatic identification of cephalometric landmarks was compared.
In summary, application of ML methods in clinical disciplines related to treatment of craniofacial anomalies is growing fast in recent years. Currently, some methods of automatic identification of cephalometric landmarks can be as accurate as those of human experts. However, to our knowledge, no research on the possibility of FG prediction with ML methods has been presented yet.
Assessment of FG requires a collection of longitudinal (i.e., taken at fixed intervals) images of the same subject. The images should comprehensively capture 3D facial morphology (3D images are preferred to 2D images). Then, images should be made in a standardized fashion ensuring comparability ’along’ individual subject and across a sample and they should be made at standardized ages or developmental period. Fulfilling these requirements is challenging not only because of practical issues such as frequent drop-outs of subjects from the studies but also because of ethical issues (eg., taking 3D radiographic images at fixed intervals in healthy subjects is considered unethical and forbidden; as a result, only already existing 2D radiographs can be studied). Therefore, participants from Norwegian ”Nittedal Growth Study” [el1994longitudinal] and AAOF Craniofacial Growth Legacy Collection (collections included: Bolton-Brush, Burlington, Denver, Fels Longitudinal, Forsyth Twin, Iowa, Mathews, Michigan, Oregon) [american1996growth] were included in this investigation. In summary, children born between the 1950’s and mid-1970’s residing within a specific geographic area of the respective Growth Study reported for a clinical examination and 2D radiographic registration at predetermined ages (intervals). It should be emphasized that imaging technique and age at radiographic imaging were attempted to be standardized only within the individual Growth Study but not between the Studies. As a result, radiographs taken at different locations can differ in image magnification or the age of imaging.
When preparing the dataset, we aimed at having data from 9- and 12-year-olds as an input to predict the change of particular measurements between the 9th and 18th year of age. However, due to a lack of sufficient data, we decided to make these assumptions less strict (to be included in the group of 9-year-olds, one is not obliged to be exactly 9 years old, etc.) to gather enough instances. The total number of samples amounts to 639. Characteristics of particular groups present Table 1.
4 Data analysis
Instances of FG direction are in the form of tabular data. The feature set is composed of cephalometric variables or different coordinates described later. Irrespective of the feature format, an orthodontic expert has to identify about twenty characteristic landmarks on each radiograph of the face. The location of some landmarks is ambiguous and it is impossible to mark such points very precisely [perillo2000effect]. Thus, the data gathering process poses a source of noise.
Although there is no widely accepted set of measurements to assess FG direction, orthodontists can utilize the following three central measurements: SN/MP, FA, and PN-AN. Each of them corresponds to a slightly different aspect of growth. SN/MP is an angle between Sella-Nasion and Menton-Gonion Inferior lines. FA (facial axis) is an angle formed by Basion-Nasion and Gnathion-PTM. PN-AN is defined by the difference of distances from Point A and Pogonion to the line perpendicular to Porion-Orbitale and going through Nasion. FG direction can be described by one of three categories: horizontal, vertical, and mixed. Fig. 1a depicts sample X-ray photographs with landmarks and measurements used to create the predicted variables, while Fig. 1b and Fig. 1c some examples of horizontal and vertical growth.
4.2 Face size
The range of coordinates is extensive. Fig. 2 depicts clouds of landmark coordinates coming from all patients at the age of 9, 12, and 18. For the sake of readability, we limited the number of presented points to five and rotated them to make Sella-Nasion line vertical. The visualization shows that the images were taken at least in three or four different scales. It is not possible to retrieve any absolute measurement values from images. It means that one objectively smaller face can be larger in image than the other one and thus larger in the coordinates space. It may concern images of different people but, worse still, images of one person which confuse a model. Precise scaling is not possible since we do not know which value of the scaling factor to apply.
4.3 Comparison of patients groups
Next, we determined how patients marked as having horizontal growth differ visually from those labeled as having vertical growth. We selected patients that, according to all three measurements, belonged to the first or third class (for labeling details, please refer to Section 5). There were only 31 (out of 639) patients in both categories satisfying all criteria, which shows how ambiguous horizontal and vertical growth terms are. Fig. 2(a) and Fig. 2(b) present how the average point locations changed along with face growth in both groups.
Two main conclusions can be drawn. Firstly, the growth directions of all five points belonging to the chin are significantly different for both groups. Secondly, the location of points in the age of 12 is very close to the line set by landmarks in the age of 9 and 18. These two observations looked promising: two classes were well separable and there was high predictive potential thanks to the high 9–12 and 9–18 direction similarity. Naturally, the angle between, for example, the horizontal line and the growth vector between 9 and 18 became a good candidate for a new growth measurement. Unfortunately, when analyzing particular patients, the aforementioned similarity between 9–12 and 9–18 directions is far lower than that of averaged over all patients in a group. Fig.2(c) and 2(d) present some examples.
4.4 Comparison of growth periods
Having spotted that the growth direction between 9–12 can vary significantly from that between 9–18, we decided to mark additional images of 15-year-olds and measure Pearson correlation of the changes of predicted measurements in particular periods. Table 2 presents the results. Two main conclusions can be drawn. Firstly, there is almost no correlation (Pearson coefficient ranges from -0.37 to 0.05) between three-year periods, which may suggest that growths in particular periods are quite independent of each other. Secondly, the correlation between 9–12 and 9–18 amounts to 0.50, 0.50, and 0.52 for the three predicted measurements.
4.5 Dataset size and class imbalance
The number of features ranges from 16 to 82, depending on the experiment, while the number of all samples that have to be split into training, validation, and test sets amounts to 639. Both the low number of instances and the high ratio of features to instances make any deep architecture inapplicable due to overfitting. Thus, we are limited to shallow networks and standard ML methods. Class imbalance, described in more detail in the next section, makes the learning process even more difficult.
Predicted variables are defined for each of the three measurements, SN/MP, FA, and PN-AN, by subtracting the value of a particular measurement at the age of 9 from the corresponding value at the age of 18. Each variable is categorized in the following way:
first class is created by instances whose value is lower than one standard deviation from the mean (horizontal growth);
second, most numerous, class contains samples located not further than one standard deviation from the mean (mixed growth);
third class constitutes instances greater than one standard deviation from the mean (vertical growth).
As a result, for SN-MP, FA, and PN-AN, the majority class contains , , and instances, respectively. Let us denote the predicted variables as SN-MP(18-9), FA(18-9), and PN-AN(18-9).
5.1 Models parametrization
For the sake of brevity of the presentation, let us introduce the following notation: MLP, MLP(), MLP(
) denote a perceptron with no hidden layer, one hidden layer containingneurons, and two hidden layers with and neurons, respectively. NN(
) relates to nearest neighbors classifier consideringneighbors. XGB(
) refers to the XGBoost algorithm withboosting rounds. RF(
) pertains to random forest consisting of
trees. SVM corresponds to Support Vector Machine classifier. LR concerns logistic regression with the maximum number of iterations set to
. DT denotes a decision tree.
In all experimental setups defined by the input features (described below) and predicted measurement, the following models were tested: MLP, MLP(20), MLP(50), MLP(100), MLP(50, 10), MLP(50, 20), MLP(50, 50), XGB(100), XGB(300), RF(100), RF(300), SVM, LR, DT, NN(3), NN(5).
Each hidden layer in neural networks applied ReLU while the last layer – softmax. The batch size was equal to the number of training instances and we set the maximum number of epochs to 10 000. 20% of training samples served as a validation set. In the case of 50 consecutive epochs with no improvement on the validation loss function, training was stopped and weights from the best epoch were restored. The Adam optimizer[kingma2014adam]
was used to optimize the cross-entropy loss function. The remaining hyperparameters were set to their default values used in Keras 2.3.1 in terms of MLP and in the scikit-learn 0.23.2 in the case of any other model. For each configuration, 5-fold stratified cross-validation was repeated 20 times yielding 100 results altogether.
5.2 Experimental setup
During our experiments, we considered three types of input data: cephalometric (ceph), Procrustes (proc), and transformed coordinates (trans). All of them have a common base which are cephalometric landmarks marked on the 2D X-ray images. Cephalometric data contains mainly angles. Since raw landmark coordinates come from images in different scales, variously rotated and translated, we normalized them so that each meets the following requirements: (1) a centroid of all landmarks is located at (0, 0); (2) a sum of distances between (0, 0) and all transformed coordinates is equal to one; (3) a sum of squared distances between a particular landmark and its average location is minimized across all landmarks and images.
Landmarks formed in the above way are called Procrustes coordinates and are commonly used in biological sciences[zelditch2012geometric]. However, the potential weakness of such representation is the lack of an anchor (all coordinates are relative). Thus, we proposed the third type of feature input which we called transformed coordinates. First, all raw coordinates are moved so that the Sella landmark is located at (0, 0). Then rotation was performed to make Sella-Nasion line vertical. The main reason to unify the Sella-Nasion location is its far distance from the jaw, which presumably allows us to better notice changes during its growth process. Fig. 4 depicts raw landmark coordinates and transformed landmarks according to the aforementioned rules along with their names. The number of features related to cephalometric, Procrustes, and transformed data from one timestamp are 15, 40, and 40, respectively. Moreover, the age value is always appended to the feature set.
One of the aims of our experiments was to check whether and how much do we gain by employing data from both the 9th or 12th year of age in comparison to data from only one timestamp (age of 9 or 12). Thus, we created a set of five input feature variants: data at the age of 9 (9), data at the age of 12 (12), difference between the values at the age of 12 and 9 (12-9), data at the ages of 9 and 12 (9, 12), data at the age of 9 and a difference between values at the age of 12 and 9 (9, 12-9).
As mentioned earlier, we predicted three different growth measurements: the change of SN-MP, FA, and PN-AN, each of them between the age of 9 and 18. Tying this all together, 45 experiment scenarios (three types of data, five different periods, and three predicted variables) were formed, for which we run models defined in Section 5.1.
Table 3 presents five models along with the utilized feature type that obtained the best classification results for each predicted variable and input period. Exceeding the percentage of the most frequent class (MFC) became a challenge in the case of many configurations. We decided to treat MFC as a baseline (Zero Rule). To examine whether the mean accuracy was statistically higher than the MFC, we ran One Sample t-Test and marked with green background results for which p-value was less than . We denote such results as statistically significant scores (SSS).
|9||12||12-9||9, 12||9, 12-9|
|No.||Features, model||Features, model||Features, model||Features, model||Features, model|
|Accuracy [%]||Accuracy [%]||Accuracy [%]||Accuracy [%]||Accuracy [%]|
|Prediction of SN-MP(18-9)|
|1||ceph, SVM||ceph, SVM||ceph, LR||ceph, LR||ceph, RF(300)|
|2||proc, SVM||proc, SVM||ceph, MLP||ceph, SVM||ceph, RF(100)|
|3||trans, SVM||trans, SVM||ceph, RF(300)||proc, SVM||ceph, XGB(100)|
|4||proc, MLP||proc, LR||ceph, RF(100)||trans, SVM||ceph, XGB(300)|
|5||proc, MLP(50, 50)||proc, MLP||proc, XGB(100)||proc, RF(300)||ceph, LR|
|Prediction of FA(18-9)|
|1||proc, MLP(50, 50)||proc, MLP||proc, MLP(50, 20)||proc, MLP(20)||proc, MLP(50, 10)|
|2||ceph, SVM||ceph, SVM||proc, MLP(50, 50)||ceph, SVM||proc, MLP(50, 20)|
|3||proc, SVM||proc, SVM||proc, MLP(20)||proc, SVM||proc, MLP(50)|
|4||trans, SVM||trans, SVM||proc, MLP(50)||trans, SVM||proc, MLP(20)|
|5||proc, LR||proc, LR||ceph, LR||proc, LR||proc, MLP(100)|
|Prediction of PN-AN(18-9)|
|1||ceph, LR||ceph, LR||ceph, LR||trans, RF(300)||ceph, SVM|
|2||trans, RF(300)||trans, RF(300)||ceph, MLP||trans, RF(100)||proc, SVM|
|3||trans, RF(100)||ceph, SVM||proc, SVM||ceph, LR||proc, LR|
|4||ceph, SVM||proc, SVM||proc, LR||ceph, SVM||trans, RF(300)|
|5||proc, SVM||proc, LR||trans, SVM||proc, LR||ceph, LR|
The first observation is that among all configurations with input from one timestamp, only two models achieved SSS and it was just pp and pp above MFC, respectively. It leads us to the vital conclusion that data coming from both timestamps, the ages of 9 and 12 (or their difference), are essential to predict face growth.
Second, the difference between the corresponding values at the age of 12 and 9 constitutes the best feature set. For both predicted variables, SN-MP(18-9) and FA(18-9), all five models built on such features achieved SSS. Since those models obtained better accuracy than models with additional features being values at the age of 9 or models built on values from the age of 9 and 12, we inferred that crucial for the final prediction is a change between 9 and 12 rather than values at particular timestamps themselves.
Third, all models that achieved SSS were built on cephalometric or Procrustes features. We presume that the main reason behind the lack of a model based on transformed coordinates is the non-uniform image scale.
Next, PN-AN(18-9) is the hardest to predict. Only two algorithms achieved SSS, and , for that variable. There is one more model with a bit higher mean accuracy than the latter, 75.14%, but due to substantial standard deviation it is considered as not statistically significant.
The final observation is that all SSS were obtained by logistic regression, some types of multilayer perceptron or tree ensemble – random forest or XGBoost.
MFC was slightly exceeded by our models for all three predicted measurements. It could seem to be a trivial achievement. However, one should be aware that the more problem is imbalanced, the harder is to beat MFC. For the sake of the next experiment, we artificially relabeled our data so that each class constitutes one-third of the total number of samples (instances with the lowest and highest value of the predicted variable from the majority class were tagged as a first or third class). Best models achieved an accuracy ranging from to , depending on the predicted variable. It showed that in the case of evenly distributed classes, our models significantly exceeded MFC, which amounted to in this setting.
6 Conclusions and future work
To our knowledge, this is the first paper that handles the problem of FG prediction with ML methods. The task turned out to be quite challenging. Among the root factors that make the problem so difficult, we pointed different sources of data, different tools taking X-ray photographs, diversified image scale, which is not possible to determine, ambiguous landmarking, low number of instances, and class imbalance. Finally, we were limited to 2D cephalograms since we could not utilize 3D radiographic images due to ethical issues.
Data analysis showed that on a statistical level, the growth directions in periods 9–12, 12–15, and 15–18 are very lowly correlated. Nevertheless, our models achieved statistically significant higher accuracy than the percentage of the most frequent class. Overcoming the aforesaid issues may presumably increase model performance. However, given the number and scope of conducted experiments, we hypothesise that the information essential to predict FG lies beyond the landmark coordinates of 9- and 12-year-olds.
We identified two main lines of inquiry worth pursuing in the future. The first one pertains to further work on tabular data. In this scope, we plan to apply some data augmentation methods to enlarge the size and variety of the dataset along with some feature selection that serves as a denoiser. Moreover, we want to analyze some other face normalization methods. Finally, together with orthodontic experts, we plan to think of new FG measurements that would be easier to predict.
The second line of further research relates to processing raw images. Certainly, due to the nuanced character of the problem, we do not expect it to be an easy task as well. Nevertheless, we hope that some models, especially CNNs (presumably shallow due to a low number of training instances), which are widely used in many computer vision tasks, will be able to find some patterns that are imperceptible for the human eye.
Studies were funded by BIOTECHMED-1 project granted by Warsaw University of Technology under the program Excellence Initiative: Research University (ID-UB).