The analysis of childhood growth curves has played a vital role in estimating the growth trajectory of populations as well as identifying critical factors corresponding to various shapes of those trajectories, such as sex. Indeed, early origins of growth curve modeling utilized cross-sectional growth curves, in which the population data were used to derive the growth patterns for various age and gender groups. The two widely known references of using cross-sectional data are the CDC growth chart (kuczmarski20022000) and WHO growth standards (world2006child), which are mainstays in clinical care. More recently, growth curve analyses based on longitudinal data have allowed more accurate identification of growth patterns, since the longitudinal data allows incorporating within- and between-subject effects simultaneously (willemsen2015multivariate).
Earlier work to fit growth curves based on one of two parametric assumptions namely logarithmic or exponential. Logarithmic curves assume a quick growth increase at the beginning, but the gains slowly disappear as time passes. Logarithmic growth curves broadly applied in bacterial growth (zwietering1990modeling; barry1995bayesian), biodegradation (schmidt1985models), fitness and strength training, and learning ability. By contrast, the exponential curve assumes that growth is slower at the beginning with gains that are more rapid over time. (karlberg1987modelling) explained growth curve modeling through exponential curves.jenss1937mathematical proposed a different model adding a linear term which was fitted by (berkey1982comparison) and found a poor fit of the data based on the systematic variation of the residual (beath2007infant). Moreover,(berkey1986nonlinear) also analyzed the model proposed by Count (count1943growth) and found an exponential model as a better model. Different parametric growth curve models proposed over the last two decades, such as the linear model, reciprocal model, logistic model, Gompertz model, and the Weibull model. (milani1989individual), (wingerd1970relation) explored other approaches of growth curves and found a poor fit.(wishart1938growth) Explained growth data analysis using polynomial regression. One limitation of polynomial regression that it requires a higher degree of polynomial to provide an adequate fit with the resulting coefficients without having any significant interpretation. Alternative approaches to analyze growth curve data were proposed by (geva1993longitudinal) , (ong2002size) (newell2003height) ,(rao1958some). Nonparametric models using regression splines to model the underlying shape function have been shown to decrease bias, thereby improving the estimation of subject-specific effects (viele2006self). Furthermore, regression splines, such as natural splines, have been shown to provide a better-localized fit to the mean response, compared to global polynomials (lambert2001analysis).
Equally important to finding an appropriate model to depict growth patterns is understanding the risk factors that contribute to adverse growth. The growth model is essential for designing an intervention trial or increase public health awareness surrounding potential benefits and adverse effects of various environmental exposures and growth patterns. Furthermore, having interpretable estimates for growth characteristics, such as peak height velocity, can ameliorate confounding in epidemiologic studies.(simpkin2017modelling)
. Other examples, which have motivated statistical developments ranging from hierarchical linear models to multivariate analysis methods, include psychological change over time(hertzog2003assessing), cardiovascular study (llabre2004applying), associations between adolescent moderate-vigorous physical activity and depressive symptoms in young adulthood (brunet2013association), associations among timing of sexual victimization and timing of drinking behavior(griffin2013recent), examine longitudinal associations among cognition, function, and depression in Alzheimer’s Disease patients (zahodne2013coupled), describing change in personality trait (jackson2012can)
In recent years, shape invariant modeling (SIM) has become an active area of research for non-parametric growth curve modeling, where a single function (or, curve) is transformed by scaling and shifting it to fit each subject usually through affine transformations.(lawton1972self), Who first proposed SIM called it self-modeling regression; in their approach, the function for the underlying shape illustrated for various parametric functions. Later,(beath2007infant) developed a model to explain longitudinal growth patterns and extended the SIM to include time-dependent covariates.(cole2010sitar) Extended the model by changing the sign of the velocity parameter and named it SITAR (Superimposition by Translation and Rotation). As a type of SIM, the regression spline expressed as a basis function consisting of a different set of knots; the resulting structure fitted as a nonlinear mixed effects model and parameters are typically estimated using maximum likelihood. This allows estimating the parameters for the between-subjects variation. Based on the underlying pattern of the data, various shape invariant models have proposed. When the data has logarithmic growth, SIM proposed for the log-transformed data.
Bayesian growth curve modeling have also seen similar progress with many applications to real datasets as well as longitudinal growth datasets (barry1995bayesian),(arjas1997prediction), (dunson2001commentary), (pinheiro2000linear), (shohoji1991prediction)
. One of the main advantages of a Bayesian approach is that it generates the uncertainty estimates (i.e., the estimate for the variance) for all unknown parameters naturally since each parameter explained by a probability distribution. Other advantages include the use of prior probability distributions to assimilate information from previous studies or experts opinion and allows to control confounding; having posterior probabilities as easily interpretable alternatives to p-values; in hierarchical modeling, incorporating latent variables such as an individual’s true disease status in the presence of a diagnostic error. Moreover, MCMC methodology facilitates the implementation of Bayesian analyses of complex data sets containing missing observations and multidimensional outcomes(dunson2001commentary). Due to this flexibility and better prediction of the exposure-outcome relationship, researchers are becoming more interested in Bayesian modeling. Notable work in Bayesian growth curve modeling includes the multivariate extension by (willemsen2015multivariate) the original SIM model.
A brief outline of the paper is as follows. In Section 2, a brief description of the original SIM model and the interpretation of various model parameters provided. In a subsection, we provided some descriptions for the spline function used in SIM and how this model connected to GAM. Section 3 described the Bayesian implementation of the SIM model with and without subject-specific covariates, and also the DAG representations of these models. Section 4 illustrated the MCMC implementation; the specification for the prior distributions; the full conditional distribution and the posterior distribution for each model parameter; and how the assessment of model performance. The full derivation of posterior distributions given in the Appendix. Applications with the real data are provided in Section 5, and with simulated data in Section 6. The Final Section includes the discussion and some proposals to future extension.
2 Real Data Example: ADHD Children
Longitudinal data from retrospective chart review for heights and weights for 197 ADHD (Attention-deficit Hyperactivity Disorder) children who visited community-based pediatric primary care practices in Cincinnati (Ohio, USA) collected. The children were in the age range: 1-17 years, and under stimulant medications to effectively reduce symptoms of ADHD. The objective of the original study was to evaluate how the age at start of stimulant medication may have impacted child growth trajectories. The original ADHD data had 6,134-time points recorded on 197 patients, among them only 3084 height measures were available for the analysis. The longitudinal study design is appeared to be unbalanced. Figure 3 shows the growth heights of each patient at various ages. The study enrolled 138 males . For the 197 unique patients the age range was (1.37, 16.76) with the mean age 9.3 years, and the height range (76.2, 183.2) with the mean height 134.4 cm (Table 1). Each patient has prescribed a stimulant medication at a certain visit. The mean age at stimulant medication start was 7.9 years and ranged from 4.2 to 12.3 years.
tableSummary statistics for the 197 ADHD patients Variable Age 1.363 7.249 9.443 9.263 11.369 16.764 3.001 Height 76.2 122.6 134.6 134.4 147.3 183.2 18.86 Wight 9.072 21.32 28.12 31.71 38.33 114.3 31.51 Start age of stimulant medication 4.17 6.75 7.86 7.949 9.120 12.330 1.6
We applied the Bayesian SIM model to the ADHD data and sought to describe the size, tempo, and velocity parameters relevant to the height growth of ADHD patients. And simultaneously, we describe the model fitting procedure, and its relative performances in comparison to the frequentist SIM model in details.
3 Shape invariant model
Shape invariant model (SIM) is a non-parametric model where a single function is transformed by shifting and scaling to fit each subject. In SIM a regression spline is used as the function, with log transformation of the data. For this study we followed the notation from(beath2007infant), the SIM model can be expressed as:
Here, the is the growth measure of child at the time points which corresponds to age (in years) in our motivating example.Subject-specific coefficients enable each individual’s growth trajectory to be aligned to a common growth curve, , via transformations to the and axes.In this formulation, we will estimate using a spline function and let be measurement error.
The goal is to estimate the subject-specific vectorsuch that the corresponding individual growth curve from deviations from the average curve . Following previously described work by Cole and others in equation (1), is termed as Size. can also be interpreted as subject-specific shift up or down in the spline curve along the response axis. is a random intercept term; when the response measure is height, is larger for taller children and smaller for shorter children. termed Tempo, which is a random time intercept and corresponds to differences in the timing of the growth spurt. This subject-specific left-right shift in the growth curves positive for late puberty and negative for early. The scaling factor within the spline function, is termed Velocity and corresponds to differences in the duration of the growth spurt between individuals. The Velocity parameter shrinks or stretches the time scale (cole2010sitar).
4 Bayesian Implementation of Shape invariant model
The above SIM model presented in equation (1) can be written with basis representation as follows;
is the basis of the natural cubic spline, evaluated at . Thus, is a vector of length , and is the regression coefficient vector of same length. Here, is the number of inner knots and 2 is for the boundary knots. so natural cubic spline has independent coefficients. Subject-specific is assumed to have multivariate normal of order 3 with mean vector and variance-covariance matrix.We assume that
is independently normally distributed with meanand variance , and also independent of .
The other distributions for such as Student’s-t can also be considered depending on the type of growth data.
With subject-specific covariates: In many growth curves analyses, the needs for the inclusion of covariates for better predictability, as well as for better explanation of growth mechanisms are essential. For example, it may be of interest to know how the gender difference affects the growth patterns, or how the medication at early age for a specific disease condition affects the growth at later ages specifically by size, tempo, and velocity. The subject-specific covariates can be included in the model specified for with a non-zero mean vector. If we assume (p-1) subject-specific covariates, the mean and the variance of becomes,
The first column of the regression coefficient matrix is for the intercept, and the remaining columns are for each covariate coefficients. Similarly, the first column of the design matrix contains the value .
Directed acyclic graph (DAG): DAG is a graphical representation of a hierarchical model which shows how the observed data and the unobserved parameters are conditionally depending on each others. In the graph, the circle indicates the stochastic node (or, the unobserved parameters that need to be estimated), and the rectangle indicates observed data or the hyper-parameters where they were assigned to fixed values apriori. When the Bayesian hierarchical model has a complex dependency structure, DAG helps in better visualization of the model as a whole, and the derivation of the posterior distribution for each stochastic node. Figure 2 shows the DAG of SIM model without covariates (a), and with subject-specific covariates (b).
5 MCMC Implementation
5.1 Prior distributions :
The prior distribution for all model parameters are assumed to be independent apriori, and follow uninformative flat probability distribution in general. The residual variance was assumed to follow inverse gamma distribution with fixed shape and scale parameters such that. Alternative prior distributions for residual variance parameter can also be used following (gelman2004prior). For covariate coefficients and basis coefficient, we have assumed and , respectively. The vector was defined after stacking the A matrix. The variance-covariance matrix for
is assumed to follow an inverse Wishart distribution with 3 degree of freedom and 0.01 scale parameter,.
5.2 Full conditional distributions :
In Bayesian framework the joint posterior distribution is proportional to the product of the likelihood function and the prior distributions. Therefore, the full posterior distribution for the model in (2) with subject-specific covariates is as follows;
We follow the block update procedures in MCMC iterations whenever the posterior distribution has a standard form. It improves the convergence and mixing as well, and also saves computational time. Following the Gibbs sampling procedure, the updates for each parameters are as shown below.For the sake of simplicity, covariate effects and subject-specific notation are omitted for portions of the updates.
where, , and is a vector of stacked .
Updating (without covariate)
where, is degrees of freedom, and is the scale matrix.
Updating (with covariate)
where, is degrees of freedom and is the scale matrix.
Moreover, for the full conditional of the subject-specific parameter, is given by,
For updating , we used a random walk Metropolis-Hastings (M-H) algorithm to generate posterior samples for subject-specific effect. The candidate samples are generated from multivariate distribution with 5 degrees of freedom and mean at the current value. The variance parameter of multivariate was appropriately tuned to ensure that the acceptance rate of candidate samples to posterior samples in M-H step is around -%.
5.3 Convergence, Mixing, and Identifiability
Before summarizing the MCMC samples (or, posterior samples) as posterior mean, median or highest posterior region; it is important to check that the posterior samples for each parameter are converging, mixing well and has less auto-correlation. Convergence of MCMC samples indicates how close we are to the true posterior distribution and mixing indicates how well the parameter space explored. There are different ways of checking these namely, trace plot, autocorrelation plot, QQ plot, Brooks plot, Gelman-Rubin test, etc. A simple exploration of the trace plot gives an insight into the characteristics of the MCMC samples. Trace plots are produced for each parameter and checked whether different starting values led to better mixing and convergence, saving the computational time as well.
The SIM model requires to specify the Spline function with a specific number of knots. We used both B-spline and the natural cubic spline functions for checking their relative performances. As for specifying the number of knots, we used five equally spaced knots for both spline function, three interior and two exteriors. The number of knots was determined based on a compromise between optimizing a fit criterion and the computational burden. The prior distributions specified according to Section 4.1. The MCMC implementation followed a block update procedure with a mix of Gibbs and M-H algorithm. In an implementation, the inverse-Wishart distribution for the posterior distribution for redefined as a scaled inverse-Wishart distribution for correct estimation of correlation matrix as well as for quick convergence.
We ran multiple chain with a relatively longer burn-in period. The model run for 500,000 iterations with 90% burn-in samples. Then we have 50,000 samples for posterior inference. We further reduced the posterior samples to size 5,000 after thinning by parameter 10 for reducing the autocorrelation in posterior samples. All the results reported in this manuscript derived from this 5,000 posterior samples. Convergence was checked by examining the trace plot of the posterior samples for each parameter and by using (gelman1992inference) test. We have also checked the autocorrelation from the autocorrelation plot, and it showed a much faster rate of decreasing towards zero with the increasing lag values. The variance-covariance matrix of the proposal density in the M-H algorithm was tuned accordingly so that the acceptance rate was approximately .
Mixing well of posterior samples is an important property to ensure that there is no specific trend in posterior samples among the parameters and they are independent as much as possible. We randomly selected three patients and their posterior samples for a horizontal shift (), and stretch () parameters plotted in Figure 3
. The plot shows there is no specific trend in posterior samples for these two parameters and they scattered around the center indicating low correlation.
6 Simulated Data Example
The simulated experiment was designed to assess the relative performances of the Bayesian SIM model concerning its frequentist counterpart. An R package "AGD (Analysis of Growth Data)" Version 0.35 was used to simulate the height data using the LMS method to obtain normalized growth. Using the growth charts data for the reference population, AGD computes heights in cm conditioning on age, sex, and the growth percentile. The current version of ’AGD’ includes data for three reference populations: the United States, The Netherlands, and WHO data from multiple countries. We used the United States population as our reference population in simulating heights. To ensure wide variability and some levels of grouping patterns existed in the simulated heights, we generated data from three groups. In the first group, growth percentile range was set to (70 - 90)%; growth percentile range for the second group set to (40 - 60)%; and the third group set to (10 - 30)%. In each group, the proportions for males and females were equal.
The steps involved in the data generation process are:
Generate the height, using AGD package in R with the United States as a reference population. Generate 150 subjects such that: a) 50 from each group, and b) 25 males and 25 females in each group
All subjects height data were simulated for ages 5-20 years with six months increase
The simulated data in steps 1 and 2 was used in SIM package in R to get estimates of , , , and
Using these estimates, a new for height was generated following the equation:
Where, the random noise generated from a normal distribution with mean and variance , and ’BS’ is for the B-spline function with 5 knots. This random noise can interpreted as a measurement error or the errors that are still unaccounted for by the SIM parameters. Step 4 can be run many times depending on the desired number of realizations achieved for the simulated dataset. We run it five times to generate five realizations. Figure ** a) and b) plots the data generated from the AGD package and one realization after adding the random noise, respectively.
In the evaluation of model performance for the simulated datasets, we will use the estimates from SIM package at Step 3: , , , and , as true estimates of all model parameters.
As similar to the real data example, we implemented both the original SIM model and the Bayesian SIM model with Two spline functions, B-spline and natural cubic spline, on the simulated data. The number of knots was assigned to five. We checked the DIC and the RMSPE values for the most parsimonious model and for checking the predictive model ability, respectively. It appeared that the Bayesian SIM model with natural cubic spline was performing better in both of these criteria.
In this section, we present the results of our analysis. We compared the frequentist SIM model with Bayesian SIM model applying on ADHD patient data. Here, age at peak velocity (APV) is 12.38 years, and peak velocity(PV) is 8.76 cm
In below we reported the results from the Bayesian SIM model with basis spline. Subject-specific posterior mean estimates of plotted in Figure 5. It indicates that the correlation between the horizontal and vertical shift is negative and the correlation between the horizontal effect and the stretch effect are negative.
We were interested to see the pattern of the growth velocity of the ADHD patient children and the effect of medication on their growth. The scatter plot of the centered posterior mean effect, which indicates the deviation from the centered effect.
8 Model Selection and Predictive ability:
We implemented both the original SIM model and the Bayesian SIM model on the ADHD data. Two spline functions, B-spline and natural cubic spline, used in Bayesian SIM model. In all spline functions, five knots used. We checked the DIC values for the two spline functions for the Bayesian SIM model to check the model adequacy. Smaller the DIC is indicative of a parsimonious model. We calculated the root mean square prediction error (RMSPE) for both the SIM model and the Bayesian SIM model to check their predictive ability. It appeared that the Bayesian SIM model with natural cubic spline was performing better in both of these criteria (Table 2).
For checking the predictive ability of Bayesian SIM model, we adopted the cross-validation method where four patients with two early start ages of stimulant medication from each sex and two late start ages from each sex were randomly left out from the analysis as a validation sample. The remaining 193 patients constructed the derivation sample were analyzed using the Bayesian SIM model to generate the mean predictive curve. Figure 6 plots the mean predictive curve with 95% credible interval from the derivation sample. The four patients from the validation sample plotted in the same figure with red color for females and blue for males. The results ensure that the randomly picked four patients with various start ages of stimulant medication and sex to lie within 95% credible interval of the mean predictive curve.
8.1 Sensitivity Analysis
In Bayesian inference, it is recommended to check the validity of the posterior estimates for the choice of other prior distributions, or other hyper-parameters values. Sensitivity analysis ensures this purpose. Even though we assigned noninformative flat prior distributions to all parameters, we checked the sensitivity of the estimates by changing the values for hyper-parameters. We also did the sensitivity analysis following the method proposed by (kass1989approximate) called perturbation function. We perturbed the model with different spline function (with and without covariates) and analyzed the sensitivity of the model. Two spline functions: basis spline and natural cubic spline functions considered for this purpose.
The original SIM model is effective in explaining subject-specific deviations in terms of size, tempo, and velocity from the underlying shape curve. An R package "SIM" was made available to implement this model. The model is currently unable to produce standard error estimates for subject-specific deviations for size, tempo, and velocity parameters. Although bootstrap methods are often used to estimate the standard error, this is a post-hoc analysis and often fails to estimate the true uncertainty specifically when the model has a complex structure, i.e., a non-linear mixed effects model. Estimating the standard error for estimates of subject-specific deviations for size, tempo, and velocity in Bayesian SIM model comes naturally from the posterior samples for the subject-specific parameter (). Although we considered vague prior distributions in our analyses, the model is flexible to assimilate information from previous studies or experts opinion through specifying informative prior distributions.
We applied the Bayesian SIM model with two spline functions and with an optimum number of knots (i.e., 5) to real and simulated data sets, and compared the results with the original SIM model. We observed that for both data sets, Bayesian SIM model with natural cubic spline function has better predictive ability than the original SIM model. With the real data application.
In our future works, we plan to extend the Bayesian SIM model in finding clustering patterns in shape invariant parameters tempo (), size (), and velocity (). In many applications of shape invariant models in growth curve modeling, finding the group of subjects with similar growth patterns in terms of size, tempo and velocity have real significance. Another extension can include but not limited to, extending the model to free knot natural cubic spline, where we will utilize the Bayesian adaptive regression spline (BARS). It shown earlier that BARS provide a parsimonious fits (dimatteo2001bayesian).
The project described was supported by the National Center for Advancing Translational Sciences of the National Institutes of Health, under Award Number 5UL1TR001425-04. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. Coauthor (RS) supported by grant K25 HL125954 from the National Heart, Lung and Blood Institute of the National Institutes of Health.