1 Introduction
Mixed effects models are often adopted to take into account the inter-individual variability within a population (see [25] and the references therein). Consider a study with
individuals from the same population. The vector of observations
associated with each individual is assumed to be a realisation of a random variable which depends on a vector of random individual parameters
. Then, inference on the individual parameters amounts to estimating their conditional distribution given the observed data . When the model is a linear (mixed effects) Gaussian model, this conditional distribution is a normal distribution that can be computed explicitly
[46]. For more complex distributions and models, Monte Carlo methods must be used to approximate this conditional distribution. Most often, direct sampling from this conditional distribution is inefficient and it is necessary to resort to a Markov chain Monte Carlo (MCMC) method to obtain random samples from this distribution. Yet, MCMC requires a tractable likelihood in order to compute the acceptance ratio. When this computation is impossible, a pseudo-marginal Metropolis-Hastings (PMMH) algorithm, developed in
[6], can be used; it consists in replacing the posterior distribution evaluated in the MH acceptance rate by an unbiased approximation. An extension of the PMMH is the particle MCMC method, introduced in [7], where a Sequential Monte Carlo sampler [19] is used to approximate the intractable likelihood at each iteration. For instance, this method is relevant when the model is SDE-based (see [18]). In a fully Bayesian setting, an approximation of the posterior of the global parameters can be used to approximate the posterior of the individual parameters using the Integrated Nested Laplace Approximation (INLA) introduced in [41]. When the goal is to do approximate inference, this method has shown great performance, mainly because it approximates each marginal separately as a univariate Gaussian distribution. In this paper, we focus on developing a method to perform exact inference and do not treat the case of approximate inference algorithms such as the Laplace EM or the First Order Conditional Estimation methods
[48] that can introduce bias in the resulting parameters. Note that generating random samples from is useful for several tasks and avoids resorting to approximations of the model such as linearisation or the Laplace method. Such tasks include the estimation of the population parameters of the model by a maximum likelihood approach, i.e., by maximising the observed incomplete-data likelihood using the Stochastic Approximation EM (SAEM) algorithm combined with an MCMC procedure [24]. Lastly, sampling from the conditional distributions is also known to be useful for model building. Indeed, in [26], the authors argue that methods for model assessment and model validation, whether graphical or based on statistical tests, must use samples of the conditional distribution to avoid bias.
Designing a fast-mixing sampler for these distributions is therefore of utmost importance to perform Maximum Likelihood Estimation (MLE) using the SAEM algorithm. The most common MCMC method for nonlinear mixed effects (NLME) models is the random walk Metropolis (RWM) algorithm [36, 38, 25]. This method is implemented in software tools such as Monolix, NONMEM, the saemix R package [15] and the nlmefitsa Matlab function. Despite its simplicity, it has been successfully used in many classical examples of pharmacometrics. Nevertheless, it can show its limitations when the dependency structure of the individual parameters is complex. Moreover, maintaining an optimal acceptance rate (as advocated in [37]) most often implies very small moves, and therefore a very large number of iterations in medium and high dimensions, since no information about the geometry of the target distribution is used.
The Metropolis-adjusted Langevin algorithm (MALA) uses evaluations of the gradient of the target density to propose new states, which are accepted or rejected using the Metropolis-Hastings rule [40, 44]. Hamiltonian Monte Carlo (HMC) is another MCMC algorithm that exploits information about the geometry of the target distribution in order to efficiently explore the space by selecting transitions that can follow contours of high probability mass
[10]. The No-U-Turn Sampler (NUTS) is an extension of HMC that allows an automatic and optimal selection of some of the settings required by the algorithm [11, 22]. Nevertheless, these methods may be difficult to use in practice and are computationally involved, in particular when the structural model is a complex ODE-based model. The algorithm we propose is an independent Metropolis-Hastings (IMH) algorithm, but one for which the proposal is a Gaussian approximation of the target distribution. For general data models (i.e., categorical, count or time-to-event data models, as well as continuous data models), the Laplace approximation of the incomplete pdf leads to a Gaussian approximation of the conditional distribution . In the special case of continuous data, linearisation of the model leads, by definition, to a Gaussian linear model for which the conditional distribution of the individual parameter given the data is a multidimensional normal distribution that can be computed. Therefore, we design an independent sampler using this multivariate Gaussian distribution to sample from the target conditional distribution and embed this procedure in an exact inference algorithm, the SAEM, to speed up the convergence of the vector of estimates of the global parameters .
The paper is organised as follows. Mixed effects models for continuous and non-continuous data are presented in Section 2. The standard MH for NLME models is described in Section 3. The proposed method, called the nlme-IMH, is introduced in Section 4, as well as the f-SAEM, a combination of this new method with the SAEM algorithm for estimating the population parameters of the model. Numerical examples illustrate, in Section 5, the practical performance of the proposed method on both a continuous pharmacokinetics (PK) model and a time-to-event example. A Monte Carlo study confirms that this new SAEM algorithm exhibits faster convergence to the maximum likelihood estimate.
2 Mixed Effect Models
2.1 Population approach and hierarchical models
In the sequel, we adopt a population approach, where we consider individuals and observations per individual . The set of observed data is where are the observations for individual . For the sake of clarity, we assume that each observation takes its values in some subset of . The distribution of the vector of observations depends on a vector of individual parameters that takes its values in a subset of .
We assume that the pairs
are mutually independent and consider a parametric framework: the joint distribution of
is denoted by , where is the vector of parameters of the model. A natural decomposition of this joint distribution reads: (1)
where is the conditional distribution of the observations given the individual parameters, and where is the socalled population distribution used to describe the distribution of the individual parameters within the population.
A particular case of this general framework consists in describing each individual parameter as the sum of a typical value and a vector of individual random effects :
(2) 
In the sequel, we assume that the random effects are distributed according to a multivariate Gaussian distribution: . Extensions of this general model are detailed in Appendix A.
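As an illustration, the hierarchical model above is easy to simulate. The sketch below draws individual parameters as the sum of a typical value and Gaussian random effects; all numerical values (dimension, typical values, variances) are hypothetical placeholders, not taken from the paper:

```python
import numpy as np

def sample_individual_parameters(mu, omega, n_individuals, rng):
    """Draw theta_i = mu + eta_i with eta_i ~ N(0, Omega): mu is the
    typical value (length-d array) and Omega the d x d covariance
    matrix of the random effects."""
    eta = rng.multivariate_normal(np.zeros(len(mu)), omega, size=n_individuals)
    return mu + eta

# Hypothetical population values for a 3-dimensional individual parameter
rng = np.random.default_rng(0)
mu = np.array([1.0, 8.0, 0.1])
omega = np.diag([0.25, 0.04, 0.01])
theta = sample_individual_parameters(mu, omega, 1000, rng)
```

Each row of `theta` is one individual's parameter vector; the empirical mean and covariance of the rows recover `mu` and `omega` as the number of individuals grows.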
2.2 Continuous data models
A regression model is used to express the link between continuous observations and individual parameters:
(3) 
where is the th observation for individual measured at index , is the residual error. It is assumed that for any index , is twice differentiable in .
We start by assuming that the residual errors are independent and normally distributed with zero mean and a constant variance
. Let be the vector of observation indices for individual . Then, the model for the observations reads: (4)
If we assume that , then the parameters of the model are .
Remark 1.
An extension of this model consists in assuming that the variance of the residual errors is not constant over time, i.e., . Such an extension includes proportional error models () and combined error models () [25], but the proposed method remains the same whatever the residual error model is.
2.3 Non-continuous data models
Non-continuous data models include categorical data models [42, 1], time-to-event data models [28, 3], and count data models [42]. A categorical outcome takes its value in a set of categories. Then, the model is defined by the conditional probabilities , which depend on the vector of individual parameters and may be a function of the time .
In a time-to-event data model, the observations are the times at which events occur. An event may be one-off (e.g., death, hardware failure) or repeated (e.g., epileptic seizures, mechanical incidents). To begin with, we consider a model for a one-off event. The survival function gives the probability that the event happens after time :
(5) 
where is called the hazard function. In a population approach, we consider a parametric and individual hazard function . The random variable representing the timetoevent for individual is typically written and may possibly be rightcensored. Then, the observation for individual is
(6) 
where is the censoring time and is the information that the event occurred after the censoring time.
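The construction of a right-censored observation described above can be encoded in a few lines (a minimal sketch: the observation is the pair of the observed time and the event indicator):

```python
def censor(event_time, tau_c):
    """Right-censoring at time tau_c: we observe min(T_i, tau_c)
    together with the indicator that the event occurred before tau_c."""
    return min(event_time, tau_c), event_time <= tau_c
```

For example, an event at time 3.2 with censoring time 10 is fully observed, while an event at time 12.5 is recorded as the censoring time 10 with the information that the event had not yet occurred.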
For repeated event models, times when events occur for individual are random times for which conditional survival functions can be defined:
(7) 
Here, is the observed value of the random time . If the last event is right-censored, then the last observation for individual is the information that the censoring time has been reached . The conditional pdf of reads (see [25] for more details)
(8) 
3 Sampling from conditional distributions
3.1 The conditional distribution of the individual parameters
Once the conditional distribution of the observations and the marginal distribution of the individual parameters are defined, the joint distribution and the conditional distribution are implicitly specified. This conditional distribution plays a crucial role for inference in NLME models.
One of the main tasks is to compute the maximum likelihood (ML) estimate of
(9) 
where . In NLME models, this optimization is solved by using a surrogate function defined as the conditional expectation of the complete-data log-likelihood [29]. The SAEM is an iterative procedure for ML estimation that requires generating one or several samples from this conditional distribution at each iteration of the algorithm. Once the ML estimate has been computed, the observed Fisher information matrix, noted , can be derived thanks to the Louis formula [27], which expresses in terms of the conditional expectation and covariance of the complete-data log-likelihood. Such a procedure also requires sampling from the conditional distributions for all .
Samples from the conditional distributions might also be used to define several statistical tests and diagnostic plots for model assessment. It is advocated in [26] that such samples should be preferred to the modes of these distributions (also called the Empirical Bayes Estimate (EBE), or Maximum a Posteriori estimate), in order to provide unbiased tests and plots. For instance, a strong bias can be observed when the EBEs are used for testing the distribution of the parameters or the correlation between random effects.
In short, being able to sample individual parameters from their conditional distribution is essential in nonlinear mixed models. It is therefore necessary to design an efficient method to sample from this distribution.
3.2 The Metropolis-Hastings Algorithm
The Metropolis-Hastings (MH) algorithm is a powerful MCMC procedure widely used for sampling from complex distributions [11]. To simplify the notations, we remove the dependency on . For a given individual , the MH algorithm for sampling from the conditional distribution is described as follows:
Initialization: initialize the chain by sampling from some initial distribution .
Iteration k: given the current state of the chain :

Sample a candidate from a proposal distribution .

Compute the MH ratio:
(10) 
Set with probability (otherwise, keep ).
Under weak conditions, is an ergodic Markov chain whose distribution converges to the target [11].
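The iteration above can be sketched as a generic MH loop. In the sketch below, the target and proposal are illustrative placeholders (a symmetric random walk targeting a standard normal), not the NLME conditional distribution:

```python
import numpy as np

def metropolis_hastings(log_target, propose, log_q, theta0, n_iter, rng):
    """Generic MH sampler. log_target: unnormalised log density of the
    target; propose(theta, rng): draws a candidate; log_q(to, frm): log
    proposal density (it cancels for symmetric random-walk proposals)."""
    chain = [theta0]
    theta = theta0
    for _ in range(n_iter):
        cand = propose(theta, rng)
        log_alpha = (log_target(cand) - log_target(theta)
                     + log_q(theta, cand) - log_q(cand, theta))
        if np.log(rng.uniform()) < log_alpha:
            theta = cand            # accept the candidate
        chain.append(theta)         # otherwise keep the current state
    return np.array(chain)

# Illustration: random-walk MH targeting a standard normal
rng = np.random.default_rng(1)
chain = metropolis_hastings(
    log_target=lambda x: -0.5 * x**2,
    propose=lambda x, rng: x + 0.5 * rng.standard_normal(),
    log_q=lambda to, frm: 0.0,      # symmetric proposal: term cancels
    theta0=0.0, n_iter=5000, rng=rng)
```

The empirical mean and standard deviation of `chain` approach 0 and 1, the moments of the placeholder target.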
Current implementations of the SAEM algorithm in Monolix [14], saemix (R package) [15], nlmefitsa (Matlab) and NONMEM [9] mainly use the same combination of proposals. The first proposal is an independent MH algorithm which consists in sampling the candidate state directly from the prior distribution of the individual parameter . The MH ratio then reduces to for this proposal.
The other proposals are componentwise and blockwise random walk procedures [31] that update different components of using univariate and multivariate Gaussian proposal distributions. These proposals are centered at the current state with a diagonal variance-covariance matrix; the variance terms are adaptively adjusted at each iteration in order to reach some target acceptance rate [8, 25]. Nevertheless, these proposals fail to take into account the nonlinear dependence structure of the individual parameters.
A way to alleviate these problems is to use a proposal distribution derived from a discretised Langevin diffusion whose drift term is the gradient of the logarithm of the target density, leading to the Metropolis Adjusted Langevin Algorithm (MALA). The MALA proposal is a multivariate Gaussian with the following mean and covariance matrix :
(11) 
where is a positive stepsize and is the identity matrix in . These methods appear to behave well for complex models but still do not take into consideration the multidimensional structure of the individual parameters. Recent works include efforts in that direction, such as the Anisotropic MALA, for which the covariance matrix of the proposal depends on the gradient of the target measure [2], the Tamed Unadjusted Langevin Algorithm [12], based on coordinatewise taming of superlinear drift coefficients, and a multidimensional extension of the Adaptive Metropolis algorithm [21] that simultaneously estimates the covariance of the target measure and coerces the acceptance rate, see [47]. The MALA algorithm is a special instance of Hybrid Monte Carlo (HMC), introduced in [33]; see [11] and the references therein. HMC consists in augmenting the state space with an auxiliary variable , known as the velocity in Hamiltonian dynamics, and belongs to the class of data augmentation methods: the potential energy is augmented with a kinetic energy that is a function of the added auxiliary variable, and the MCMC procedure then samples from this augmented posterior distribution. All these methods aim at finding the proposal that accelerates the convergence of the chain. Unfortunately, they are computationally involved (even in small and medium dimensional settings, computing the gradient or the Hessian can be overwhelming) and can be difficult to implement (stepsizes and numerical derivatives need to be tuned and implemented).
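For concreteness, one MALA proposal step can be sketched as follows, assuming the usual drifted-Gaussian form (mean theta + (step/2) * grad log pi(theta), covariance step * I); the MH correction then evaluates this same Gaussian density in both directions:

```python
import numpy as np

def mala_proposal(theta, grad_log_pi, step, rng):
    """Draw one MALA candidate: a Gaussian centred at
    theta + (step / 2) * grad log pi(theta) with covariance step * I."""
    mean = theta + 0.5 * step * grad_log_pi(theta)
    return mean + np.sqrt(step) * rng.standard_normal(theta.shape)

# Illustration with target N(0, I), for which grad log pi(theta) = -theta:
# from theta = 2 the proposal mean is 2 * (1 - step / 2) = 1.9 for step = 0.1
rng = np.random.default_rng(0)
draws = np.array([mala_proposal(np.array([2.0]), lambda t: -t, 0.1, rng)
                  for _ in range(2000)])
```

The drift pulls the candidate towards regions of higher target density, which is exactly the gradient information the RWM ignores.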
We see in the next section how to define a multivariate Gaussian proposal for both continuous and noncontinuous data models, that is easy to implement and that takes into account the multidimensional structure of the individual parameters in order to accelerate the MCMC procedure.
4 The nlme-IMH and the f-SAEM
In this section, we assume that the individual parameters are independent and normally distributed with mean and covariance . The MAP estimate, for individual , is the value of that maximizes the conditional distribution :
(12) 
4.1 Proposal based on Laplace approximation
For both continuous and non-continuous data models, the goal is to find a simple proposal, a multivariate Gaussian distribution in our case, that approximates the target distribution . For general MCMC samplers, it is shown in [39] that the mixing rate in total variation depends on the expectation of the acceptance ratio under the proposal distribution, which is also directly related to the ratio of the proposal to the target in the special case of independent samplers (see [30, 39]). This observation naturally suggests finding a proposal that approximates the target. [16] advocates the use of a multivariate Gaussian distribution whose parameters are obtained by minimising the Kullback-Leibler divergence between a multivariate Gaussian variational candidate distribution and the target distribution. In
[4] and the references therein, an adaptive Metropolis algorithm is studied and reconciled with a KL divergence minimisation problem, where the resulting multivariate Gaussian distribution can be used as a proposal in an IMH algorithm. The authors note that although this proposal might be a sensible choice when it approximates the target well, it can fail when the parametric form of the proposal is not sufficiently rich. Thus, other parametric forms can be considered, and it is suggested in [5] to consider mixtures, finite or infinite, of distributions belonging to the exponential family. In general, this optimisation step is difficult and computationally expensive since it requires approximating (using Monte Carlo integration for instance) the integral of the log-likelihood with respect to the variational candidate distribution.
Independent proposal 1.
We suggest a Laplace approximation of this conditional distribution, as described in [41], which is the multivariate Gaussian distribution with mean and variance-covariance
(13) 
where is the Hessian of evaluated at .
Mathematical details for computing this proposal are postponed to Appendix B. We use this multivariate Gaussian distribution as a proposal in our IMH algorithm introduced in the next section, for both continuous and noncontinuous data models.
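A numerical sketch of this Laplace proposal is given below, for a generic negative log posterior: the mean is the MAP, found here by Newton iterations with finite-difference derivatives, and the covariance is the inverse Hessian at the MAP. A real implementation would use analytic or automatic derivatives rather than finite differences:

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    """Central finite-difference gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def num_hess(f, x, h=1e-4):
    """Central finite-difference Hessian of f at x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = h
            ej = np.zeros(d); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

def laplace_proposal(neg_log_post, x0, n_newton=20):
    """Laplace (Gaussian) approximation of a posterior: mean = MAP
    (found by Newton iterations), covariance = inverse Hessian of the
    negative log posterior at the MAP."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_newton):
        x = x - np.linalg.solve(num_hess(neg_log_post, x),
                                num_grad(neg_log_post, x))
    return x, np.linalg.inv(num_hess(neg_log_post, x))
```

On a Gaussian target the approximation is exact: for a quadratic negative log posterior 0.5 (x - m)ᵀ A (x - m), the returned mean is m and the returned covariance is A⁻¹.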
Remark 2.
Note that the resulting proposal distribution is based on the assumption that, in model (2), the random effects are normally distributed. When this assumption does not hold, our method exploits the same Gaussian proposal, where the variance in (13) is calculated explicitly. Consider the following example: the random effects in (2) are no longer distributed according to a multivariate Gaussian distribution but according to a multivariate Student distribution with degrees of freedom, zero mean and a prior shape matrix such that . Then the vector of parameters of the model is , where is the prior covariance matrix. In that case, our method uses the Independent proposal 1 and computes the MH acceptance ratio (10) with the corresponding multivariate Student density .
We shall now see another method to derive a Gaussian proposal distribution in the specific case of continuous data models (see (3)).
4.2 Nonlinear continuous data models
When the model is described by (3), the approximation of the target distribution can be done in two ways: either by using the Laplace approximation, as explained above, or by linearising the structural model for any individual of the population. Using (3) and (12), the MAP estimate can thus be derived as:
(14) 
where is defined by (4) and is the transpose of the matrix .
We linearize the structural model around the MAP estimate :
(15) 
where is the Jacobian of evaluated at . Defining , this expansion yields the following linear model:
(16) 
We can directly use the definition of the conditional distribution under a linear model (see (45) in Appendix C) to get an expression of the conditional covariance of given under (16):
(17) 
Using (14) and the definition of the conditional distribution under a linear model, we obtain that (see Appendix D for details). We note that the mode of the conditional distribution of in the nonlinear model (3) is also the mode and the mean of the conditional distribution of in the linear model (16).
Independent proposal 2.
In the case of continuous data models, we propose to use the multivariate Gaussian distribution with mean and variance-covariance matrix defined by (17) as the proposal for an independent MH algorithm, avoiding the computation of a Hessian matrix.
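Assuming the conditional covariance takes the usual Gaussian linear-model form (JᵀJ / σ² + Ω⁻¹)⁻¹, with J the Jacobian of the structural model at the MAP (the paper's equation (17); the exact expression there may differ in notation), the proposal covariance can be sketched as:

```python
import numpy as np

def linearized_proposal_cov(jacobian, sigma2, omega):
    """Covariance of the Gaussian proposal obtained by linearising the
    structural model at the MAP, assuming the standard linear-model form
    (J^T J / sigma^2 + Omega^{-1})^{-1}.  jacobian: (n_obs, d) Jacobian
    of f at the MAP; sigma2: residual variance; omega: (d, d) covariance
    of the individual parameters."""
    precision = jacobian.T @ jacobian / sigma2 + np.linalg.inv(omega)
    return np.linalg.inv(precision)
```

Note that only a Jacobian of the structural model is needed, which is the point of this proposal: no Hessian computation is required.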
We can note that linearizing the structural model is equivalent to using the Laplace approximation with the expected information matrix. Indeed:
(18) 
Remark 3.
When the model is linear, the probability of accepting a candidate generated with this proposal is equal to 1.
Remark 4.
If we consider a more general error model, that may depend on the individual parameters and the observation times , then the conditional variancecovariance matrix reads:
(19) 
Remark 5.
In the model (33), the transformed variable follows a normal distribution. A candidate is then drawn from the multivariate Gaussian proposal with parameters:
(20)  
(21) 
where and finally the candidate vector of individual parameters is set to
These approximations of the conditional distribution lead to our nlme-IMH algorithm, an independent Metropolis-Hastings (IMH) algorithm for NLME models. For all individuals , the algorithm is defined as follows:
Initialization: initialize the chain by sampling from some initial distribution .
Iteration t: Given the current state of the chain :

Compute the MAP estimate:
(22) 
Sample a candidate from the independent proposal denoted .

Compute the MH ratio:
(23) 
Set with probability (otherwise, keep ).
This method shares some similarities with [45], which suggests performing a Taylor expansion of around the current state of the chain, leaving unchanged.
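The independent MH loop at the heart of this procedure can be sketched as follows. The Gaussian proposal is passed in directly (in the nlme-IMH it would be built from the MAP and the Laplace or linearisation covariance), and the toy target below is a placeholder chosen so that the proposal matches the target exactly, in which case every candidate is accepted:

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def independent_mh(log_target, prop_mean, prop_cov, theta0, n_iter, rng):
    """Independent MH with a fixed Gaussian proposal: the acceptance
    ratio only involves the target-to-proposal log ratio at the
    candidate and at the current state."""
    L = np.linalg.cholesky(prop_cov)
    theta = theta0
    lp = log_target(theta) - mvn_logpdf(theta, prop_mean, prop_cov)
    chain = [theta]
    for _ in range(n_iter):
        cand = prop_mean + L @ rng.standard_normal(len(prop_mean))
        lp_cand = log_target(cand) - mvn_logpdf(cand, prop_mean, prop_cov)
        if np.log(rng.uniform()) < lp_cand - lp:
            theta, lp = cand, lp_cand
        chain.append(theta)
    return np.array(chain)

# Placeholder target N(m, I) with a proposal that matches it exactly
rng = np.random.default_rng(3)
m = np.array([1.0, -2.0])
chain = independent_mh(lambda x: -0.5 * np.sum((x - m)**2),
                       m, np.eye(2), np.zeros(2), 2000, rng)
```

The better the Gaussian proposal approximates the target, the closer the acceptance probability is to 1, which is why the MAP-centred approximations of the previous subsections are natural proposal choices.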
Remark 6.
Although a multivariate Gaussian proposal is used in our presentation of the nlme-IMH, other types of distributions could be adopted. For instance, when the target distribution presents heavy tails, a Student distribution with a well-chosen number of degrees of freedom could improve the performance of the independent sampler. In such a case, the parameters of the Gaussian proposal are used to shift and scale the Student proposal distribution, and the acceptance rate (23) needs to be modified accordingly. The numerical applications in Section 5 are performed using a Gaussian proposal, but comparisons with a Student proposal distribution are given in Appendix E.1.
4.3 Maximum Likelihood Estimation
The ML estimator defined by (9) is computed using the Stochastic Approximation of the EM algorithm (SAEM) [17]. The SAEM algorithm is described as follows:
Initialization: , an initial parameter estimate and , the number of MCMC iterations.
Iteration k: given the current model parameter estimate :

Simulation step: for , draw vectors of individual parameters after transitions of Markov kernels
which admit as unique limiting distributions the conditional distributions , 
Stochastic approximation step: update the approximation of the conditional expectation :
(24) where is a sequence of decreasing stepsizes with .

Maximisation step: Update the model parameter estimate:
(25)
The SAEM algorithm is implemented in most software tools for NLME models and its convergence is studied in [17, 24, 2]. The practical performance of SAEM is closely linked to its settings. In particular, the choice of the transition kernel plays a key role. The transition kernel is directly defined by the proposal(s) used for the MH algorithm.
We propose a fast version of the SAEM algorithm, called the f-SAEM, using our resulting independent proposal distribution. The simulation step of the f-SAEM is achieved using the nlme-IMH algorithm (see Algorithm 2) for all individuals, while the other steps remain unchanged. In practice, the number of transitions is small since the convergence of the SAEM does not require the convergence of the MCMC at each iteration [24]. In the sequel, we carry out numerous numerical experiments to compare our nlme-IMH algorithm to state-of-the-art samplers and to assess its relevance within an MLE algorithm such as the SAEM.
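To fix ideas, the sketch below runs SAEM on a toy linear model where the simulation step can be done exactly (in NLME models it is replaced by a few MCMC transitions, e.g. of the nlme-IMH kernel). All numerical values are hypothetical, and only the population mean is estimated:

```python
import numpy as np

def saem_mu(y, omega2, sigma2, n_iter, burn_in, rng):
    """Toy SAEM estimating the population mean mu in the linear model
    y_i = theta_i + eps_i, theta_i ~ N(mu, omega2), eps_i ~ N(0, sigma2).
    Here theta_i | y_i is Gaussian, so the simulation step is exact."""
    mu = 0.0
    n = len(y)
    for k in range(n_iter):
        # Simulation step: draw theta_i from its conditional distribution
        post_var = 1.0 / (1.0 / omega2 + 1.0 / sigma2)
        post_mean = post_var * (mu / omega2 + y / sigma2)
        theta = post_mean + np.sqrt(post_var) * rng.standard_normal(n)
        # Stochastic approximation + maximisation steps
        # (step size 1 during burn-in, then decreasing as 1/k)
        gamma = 1.0 if k < burn_in else 1.0 / (k - burn_in + 1)
        mu = mu + gamma * (theta.mean() - mu)
    return mu

rng = np.random.default_rng(2)
y = 3.0 + np.sqrt(1.0 + 0.25) * rng.standard_normal(500)   # true mu = 3
mu_hat = saem_mu(y, omega2=1.0, sigma2=0.25, n_iter=200, burn_in=100, rng=rng)
```

In this toy model the ML estimate of the mean is the sample mean of the data, and the SAEM iterates converge to it; the two-phase step-size schedule mirrors the standard SAEM practice of a constant step followed by a decreasing one.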
5 Numerical Examples
5.1 A pharmacokinetic example
5.1.1 Data and model
32 healthy volunteers received a 1.5 mg/kg single oral dose of warfarin, an anticoagulant normally used in the prevention of thrombosis [34]. Figure 1 shows the warfarin plasmatic concentration measured at different times for these patients (the single dose was given at time 0 for all the patients).
We consider a onecompartment pharmacokinetics (PK) model for oral administration, assuming firstorder absorption and linear elimination processes:
(26) 
where is the absorption rate constant, the volume of distribution, the elimination rate constant, and the dose of drug administered. Here, , and are PK parameters that can vary from one individual to another. Let be the vector of individual PK parameters for individual . The model for the th measured concentration, noted , for individual is written:
(27) 
We assume in this example that the residual errors are independent and normally distributed with mean 0 and variance . Lognormal distributions are used for the three PK parameters:
(28) 
This is a specific instance of the nonlinear mixed effects model for continuous data described in Section 2.2. We thus use the multivariate Gaussian proposal whose mean and covariance are defined by (47) and (17). In such a case the gradient can be computed explicitly. Nevertheless, for the method to be easily extended to any structural model, the gradient is calculated using Automatic Differentiation [20], implemented in the R package "Madness" [35].
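Under these dosing assumptions, the structural model (26) has a standard closed form, sketched below (the paper's exact parameterisation of (26) may differ):

```python
import numpy as np

def pk_oral_first_order(t, ka, V, ke, dose):
    """One-compartment model with first-order absorption and linear
    elimination: C(t) = D*ka / (V*(ka - ke)) * (exp(-ke*t) - exp(-ka*t)).
    Valid for ka != ke."""
    return dose * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))
```

The concentration is zero at the dosing time, rises during absorption, and decays at the elimination rate; for example, with the hypothetical values ka = 1, V = 8, ke = 0.1 and dose = 100, the concentration one time unit after dosing is about 7.46.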
5.1.2 MCMC Convergence Diagnostic
We study in this section the behaviour of the MH algorithm used to sample individual parameters from the conditional distribution . We consider only one of the 32 individuals for this study and fix close to the ML estimate obtained with the SAEM algorithm, implemented in the saemix R package [15]: , , , , , and .
We run the classical version of MH implemented in the saemix package, for which different transition kernels are used successively at each iteration: independent proposals from the marginal distribution , componentwise random walk and blockwise random walk. We compare it to our proposed Algorithm 2.
We run iterations of these two algorithms and evaluate their convergence by looking at the convergence of the median for the three components of . We see in Figure 2 that, for parameter , the sequences of empirical medians obtained with the two algorithms converge to the same value, which is assumed to be the theoretical median of the conditional distribution. It is interesting to note that the empirical median of the nlme-IMH converges very rapidly. This matters in the population approach framework because it is mainly the median values of each conditional distribution that are used to infer the population distribution. The autocorrelation plots in Figure 2 highlight the slower mixing of the RWM, whereas samples from the nlme-IMH can be considered independent a few iterations after the chain has been initialized. A comparison for all three dimensions of the individual parameter using a Student proposal distribution can be found in Appendix E.1.
The Mean Square Jump Distance (MSJD) as well as the Effective Sample Size (ESS) of the two methods are reported in Table 5. The MSJD is a measure used to diagnose the mixing of the chain; it is calculated as the mean of the squared Euclidean distances between successive points. This quantity indicates whether the chain is moving enough or getting stuck in some region, while the ESS estimates the number of independent samples effectively obtained from a chain. The larger values of these two quantities for our method show the better performance of this sampler in comparison with the RWM.
MSJD  ESS  

RWM  0.009  0.002  0.006  1728  3414  3784 
nlmeIMH  0.061  0.004  0.018  13694  14907  19976 
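The two diagnostics in the table above can be computed as sketched below. The ESS estimator shown truncates the autocorrelation sum at the first non-positive lag, a common but simplified choice (packages such as coda or ArviZ implement refinements of this idea):

```python
import numpy as np

def msjd(chain):
    """Mean squared jump distance: mean squared Euclidean distance
    between successive states of a (n_iter, d) chain. Larger values
    indicate better mixing."""
    jumps = np.diff(chain, axis=0)
    return np.mean(np.sum(jumps**2, axis=1))

def ess(x, max_lag=1000):
    """Effective sample size of a 1-D chain: n / (1 + 2 * sum of
    autocorrelations), truncating at the first non-positive lag."""
    x = np.asarray(x) - np.mean(x)
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / np.sum(x**2)
    s = 0.0
    for lag in range(1, min(max_lag, n)):
        if acf[lag] <= 0:
            break
        s += acf[lag]
    return n / (1 + 2 * s)

# For an i.i.d. chain the ESS should be close to the chain length
rng = np.random.default_rng(4)
iid_chain = rng.standard_normal(5000)
```

For a strongly autocorrelated chain (such as a slowly mixing RWM), the truncated sum grows and the ESS drops well below the number of iterations, which is exactly the contrast the table reports.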
Comparison with state-of-the-art methods: We then compare our new approach to the three following samplers: an independent sampler that uses a variational approximation as proposal distribution [16], the MALA [40] and the No-U-Turn Sampler [22].
The same design and settings (dataset, model parameter estimate, individual) as in Section 5.1.2 are used throughout the following experiments.
5.1.2.a Variational MCMC algorithm
The Variational MCMC algorithm [16] is an MCMC algorithm with an independent proposal. The proposal distribution is a multivariate Gaussian distribution whose parameters are obtained by a variational approach that consists in minimising the Kullback-Leibler divergence between a multivariate Gaussian distribution and the target distribution , for a given model parameter estimate noted . This problem boils down to maximising the so-called Evidence Lower Bound, defined as: (29)
threshold. The means and standard deviations of our nlme-IMH and the Variational MCMC proposals compare with the posterior mean (calculated using the NUTS
[22]) as follows:

Means  Stds  

Variational proposal  0.90  7.93  0.48  0.14  0.03  0.07 
Laplace proposal  0.88  7.93  0.52  0.18  0.04  0.09 
NUTS (ground truth)  0.91  7.93  0.51  0.18  0.05  0.09 
5.1.2.b Metropolis Adjusted Langevin Algorithm (MALA) and No-U-Turn Sampler (NUTS)
We now compare our method to the MALA, whose proposal is defined by (11). The gradient of the log posterior distribution is also calculated by Automatic Differentiation. In this numerical example, the MALA has been initialized at the MAP and the stepsize () is tuned such that an acceptance rate of is reached [37]. We also compare the implementation of the NUTS [22, 13] in the RStan package to our method in Figure 4. Figure 4 highlights good convergence of a well-tuned MALA and the NUTS. The mixing properties of the nlme-IMH and the NUTS, judging from the autocorrelation plots in Figure 4, seem to be similar and much better than those of all the other methods. Table 3 presents a benchmark of these methods regarding MSJD and ESS. Both the nlme-IMH and the NUTS show better performance here. For parameters and , the ESS of the NUTS, presented as a gold-standard sampler for this kind of problem, is slightly higher than that of our proposed method.

MSJD  ESS  

RWM  0.009  0.002  0.006  1728  3414  3784 
nlmeIMH  0.061  0.004  0.018  13694  14907  19976 
MALA  0.024  0.002  0.006  3458  3786  3688 
NUTS  0.063  0.004  0.018  18684  19327  19083 
ADVI  0.037  0.002  0.010  2499  1944  2649 
5.1.3 Comparison of the chains for the first 500 iterations
We produce independent runs of the RWM, the nlme-IMH, the MALA and the NUTS for iterations. The boxplots of the samples drawn at a given iteration threshold (three different thresholds are used) are presented in Figure 5 against the ground truth for the parameter ka. The ground truth has been calculated by running the NUTS for iterations.
For the three numbers of iterations (, , ) considered in Figure 5, the medians of the nlme-IMH and NUTS samples are closer to the ground truth. Figure 5 also highlights that all these methods succeed in sampling from the whole distribution after iterations.
We now use the RWM, the nlme-IMH and the MALA in the SAEM algorithm and observe the convergence of the resulting sequences of parameters.
5.1.4 Maximum likelihood estimation
We use the SAEM algorithm to estimate the population PK parameters , and , the standard deviations of the random effects , and and the residual variance .
The stepsize is set to 1 during the first 100 iterations and then decreases as where during the next 100 iterations.
Here we compare the standard SAEM algorithm, as implemented in the saemix R package, with the f-SAEM algorithm and the SAEM using the MALA sampler. In this example, the nlme-IMH and the MALA are only used during the first iterations of the SAEM; the standard MH algorithm is used thereafter.
Figure 6 shows the estimates of and computed at each iteration of these three variants of SAEM, starting from three different initial values. First of all, we notice that, whatever the initialisation and the sampling algorithm used, all runs converge towards the maximum likelihood estimate. It is also very clear that the f-SAEM converges faster than the standard algorithm. The SAEM using the MALA algorithm for sampling from the individual conditional distributions presents a convergence behaviour similar to that of the reference.
We can conclude, for this example, that sampling around the MAP of each individual conditional distribution is the key to a fast convergence of the SAEM during the first iterations.
5.1.5 Monte Carlo study
We conduct a Monte Carlo study to confirm the properties of the fSAEM algorithm for computing the ML estimates.
Datasets have been simulated using the PK model previously used for fitting the warfarin PK data, with fixed parameter values. The same original design, with a total of 251 PK measurements, was used for all the simulated datasets. Since all the simulated datasets are different, the value of the ML estimator varies from one simulation to another. Denote by $\hat\theta^{(k)}_m$ the estimate obtained after $k$ iterations of SAEM on the $m$th simulated dataset, and by $\hat\theta^{\star}_m$ the corresponding ML estimate. To investigate how fast $\hat\theta^{(k)}_m$ converges to $\hat\theta^{\star}_m$, we can define, at each iteration $k$ and for each component $\ell$ of the parameter, the mean square distance over the $M$ replicates

(30)  $E_k(\ell) = \frac{1}{M}\sum_{m=1}^{M}\big(\hat\theta^{(k)}_{\ell,m}-\hat\theta^{\star}_{\ell,m}\big)^2.$
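For illustration, the mean square distance in (30) can be computed with a short Python function; the array layout (replicates × iterations × components) is an assumption of this sketch.

```python
import numpy as np

def mean_square_distance(estimates, ml_estimates):
    """Mean square distance (30), per iteration and per parameter component.

    estimates: shape (M, K, P), the parameter estimate at each of K SAEM
        iterations for each of M replicated datasets.
    ml_estimates: shape (M, P), the ML estimate for each replicate.
    Returns shape (K, P): the average squared error over the M replicates."""
    est = np.asarray(estimates, dtype=float)
    ml = np.asarray(ml_estimates, dtype=float)
    return ((est - ml[:, None, :]) ** 2).mean(axis=0)
```

Plotting each column of the returned array against the iteration index reproduces the kind of convergence curves shown in Figure 7.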
Figure 7 shows that using the new proposal leads to a much faster convergence towards the maximum likelihood estimate: fewer than 10 iterations are required with the fSAEM on this example, instead of 50 with the original version. It should also be noted that the distance decreases monotonically: the sequence of estimates approaches the target at each iteration, whereas the standard algorithm makes detours before converging.
5.2 Time-to-event Data Model
5.2.1 The model
In this section, we consider a Weibull model for time-to-event data [25, 49]. For individual $i$, the hazard function of this model is:

(31)  $h(t,\psi_i) = \frac{\beta_i}{\lambda_i}\Big(\frac{t}{\lambda_i}\Big)^{\beta_i - 1}.$

Here, the vector of individual parameters is $\psi_i = (\lambda_i, \beta_i)$. These two parameters are assumed to be independent and log-normally distributed:

(32)  $\log(\lambda_i) \sim \mathcal{N}\big(\log(\lambda_{\mathrm{pop}}), \omega_\lambda^2\big), \qquad \log(\beta_i) \sim \mathcal{N}\big(\log(\beta_{\mathrm{pop}}), \omega_\beta^2\big).$

Then, the vector of population parameters is $\theta = (\lambda_{\mathrm{pop}}, \beta_{\mathrm{pop}}, \omega_\lambda, \omega_\beta)$.
Repeated events were generated for a population of individuals using the Weibull model (31), with fixed population parameters and assuming a right-censoring time.
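For illustration, such repeated event data can be simulated in Python by inverting the Weibull cumulative hazard, assuming the events form a counting process with the individual Weibull intensity; this scheme and the parameter values in the test below are assumptions of the sketch, not the paper's exact simulation setup.

```python
import numpy as np

def simulate_individual(lam, beta, t_censor, rng):
    """Simulate repeated event times on (0, t_censor], assuming a counting
    process with intensity h(t) = (beta/lam) * (t/lam)**(beta - 1), i.e.
    cumulative hazard H(t) = (t/lam)**beta. Successive event times are
    obtained by inverting H (a modelling assumption of this sketch)."""
    times, t = [], 0.0
    while True:
        # next event time solves H(t_next) = H(t) - log(U), U ~ Uniform(0, 1)
        t = lam * ((t / lam) ** beta - np.log(rng.random())) ** (1.0 / beta)
        if t > t_censor:
            break  # right censoring
        times.append(t)
    return times

def simulate_population(n, lam_pop, beta_pop, om_lam, om_beta, t_censor, rng):
    """Draw log-normal individual parameters as in (32), then simulate each individual."""
    lam = np.exp(rng.normal(np.log(lam_pop), om_lam, size=n))
    beta = np.exp(rng.normal(np.log(beta_pop), om_beta, size=n))
    return [simulate_individual(l, b, t_censor, rng) for l, b in zip(lam, beta)]
```

Each individual's record is then the list of observed event times, with the censoring time closing the observation window.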
5.2.2 MCMC Convergence Diagnostic
Similarly to the previous section, we start by looking at the behaviour of the MCMC procedure used for sampling from the conditional distribution for a given individual, assuming that the population parameter is known. We use the generating model parameters in these experiments.
We run iterations of the reference MH algorithm and of the nlmeIMH to estimate the median of the posterior distribution. We see in Figure 8 that the sequences of empirical medians obtained with the two procedures converge to the same value, but the new algorithm converges faster than the standard MH algorithm. The autocorrelation plots in Figure 8 also clearly show the advantage of the new sampler: the chain obtained with the nlmeIMH mixes almost ten times faster than the reference sampler.
          MSJD              ESS
RWM       0.055   0.093     3061   1115
nlmeIMH   0.095   0.467     8759   8417
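The two diagnostics reported in the table, mean squared jump distance (MSJD) and effective sample size (ESS), can be computed for a scalar chain as follows. The ESS estimator below, which truncates the autocorrelation sum at the first non-positive lag, is one common simple variant, not necessarily the one used for the reported figures.

```python
import numpy as np

def msjd(chain):
    """Mean squared jump distance: average of (x_{k+1} - x_k)^2 along the chain."""
    chain = np.asarray(chain, dtype=float)
    return float(np.mean(np.diff(chain) ** 2))

def ess(chain):
    """Effective sample size N / (1 + 2 * sum_k rho_k), with the autocorrelation
    sum truncated at the first non-positive lag (a common, simple variant)."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n  # empirical autocovariances
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, n):
        if rho[k] <= 0.0:
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)

# Illustration: an i.i.d. chain mixes far better than a sticky AR(1) chain.
rng = np.random.default_rng(0)
iid_chain = rng.normal(size=5000)
ar_chain = np.zeros(5000)
for i in range(1, 5000):
    ar_chain[i] = 0.95 * ar_chain[i - 1] + rng.normal()
```

A higher MSJD and a higher ESS both indicate a better-mixing chain, which is the pattern the table shows for the nlmeIMH against the RWM.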
Plots for the other parameter can be found in Appendix E.2. Comparisons with state-of-the-art methods were conducted as in the previous section and led to the same conclusions as for the previous continuous data model, both in the asymptotic and non-asymptotic regimes.
5.2.3 Maximum likelihood estimation of the population parameters
We run the standard SAEM algorithm implemented in the saemix package (an extension of this package for non-continuous data models is available on GitHub: https://github.com/belhal/saemix) and the fSAEM on the generated dataset.
Figure 9 shows the estimates computed at each iteration of the two versions of the SAEM, starting from three different initial values. The same behaviour is observed as in the continuous case: regardless of the initial values and the algorithm, all runs converge to the same solution, but convergence is much faster with the proposed method. The same comment applies to the two other parameters.
5.2.4 Monte Carlo study
We now conduct a Monte Carlo study to confirm the good properties of the new version of the SAEM algorithm for estimating the population parameters of a time-to-event data model. Synthetic datasets are generated using the same design as above. Figure 10 shows the convergence of the mean square distances defined in (30). All these distances converge monotonically to 0, which means that both algorithms properly converge to the maximum likelihood estimate, but very few iterations are required with the new version, while about thirty are needed with the standard SAEM.
6 Conclusion and discussion
We present in this article an independent Metropolis-Hastings procedure for sampling random effects from their conditional distributions, and a fast maximum likelihood estimation algorithm, called the fSAEM, for nonlinear mixed effects models.
The idea of the method is to approximate each individual conditional distribution by a multivariate normal distribution. A Laplace approximation makes it possible to consider any type of data, but we have shown that, in the case of continuous data, this approximation is equivalent to linearizing the structural model around the conditional mode of the random effects.
The numerical experiments demonstrate that the proposed nlmeIMH sampler converges faster to the target distribution than a standard random walk Metropolis. This practical behaviour is partly explained by the fact that the conditional mode of the random effects in the linearized model coincides with the conditional mode of the random effects in the original model; the proposal distribution is therefore a normal distribution centered around this MAP. Moreover, the dependency structure of the conditional distribution of the random effects is well approximated by the covariance structure of the Gaussian proposal. So far, we have mainly applied our method to standard problems encountered in pharmacometrics, for which the number of random effects remains small. It would nevertheless be interesting to see how this method behaves in higher dimensions, and to compare it with methods adapted to such settings, such as MALA or HMC. Lastly, we have shown that this new IMH algorithm can easily be embedded in the SAEM algorithm for maximum likelihood estimation of the population parameters. Our numerical studies have shown empirically that the new transition kernel is effective during the very first iterations. It would be of great interest to determine, automatically and in an adaptive way, an optimal scheme of kernel transitions combining this new proposal with the blockwise random walk Metropolis.
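The two ingredients above, a Gaussian proposal built from the MAP and the curvature at the MAP, and an independence Metropolis-Hastings step using it, can be sketched in Python for a generic log joint density. This is an illustrative sketch, not the paper's nlmeIMH implementation: the function names and the Newton-based MAP search are assumptions of the sketch.

```python
import numpy as np

def laplace_proposal(grad, hess, psi0, n_steps=50):
    """Locate the MAP by Newton iterations on the log joint density and return
    the Gaussian proposal N(psi_MAP, (-H)^{-1}) built from the curvature there.
    (Illustrative: any optimiser finding the MAP would do.)"""
    psi = np.asarray(psi0, dtype=float).copy()
    for _ in range(n_steps):
        psi = psi - np.linalg.solve(hess(psi), grad(psi))  # Newton step
    return psi, np.linalg.inv(-hess(psi))

def imh(log_joint, mean, cov, n_iter, rng):
    """Independence Metropolis-Hastings with a fixed Gaussian proposal."""
    L = np.linalg.cholesky(cov)
    prec = np.linalg.inv(cov)
    log_q = lambda x: -0.5 * (x - mean) @ prec @ (x - mean)  # up to a constant
    x = mean.copy()
    lw_x = log_joint(x) - log_q(x)  # log importance weight of current state
    samples = []
    for _ in range(n_iter):
        y = mean + L @ rng.normal(size=len(mean))
        lw_y = log_joint(y) - log_q(y)
        if np.log(rng.random()) < lw_y - lw_x:  # accept with prob min(1, w_y/w_x)
            x, lw_x = y, lw_y
        samples.append(x.copy())
    return np.array(samples)
```

When the target is close to Gaussian, as for the individual conditional distributions considered here, the proposal nearly matches the target and the acceptance rate is high, which is what drives the fast mixing observed in the experiments.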
References
 Agresti [1990] Alan Agresti. Categorical data analysis. A WileyInterscience publication. Wiley, New York, 1990.
 Allassonniere and Kuhn [2013] Stéphanie Allassonniere and Estelle Kuhn. Convergent Stochastic Expectation Maximization algorithm with efficient sampling in high dimension. Application to deformable template model estimation. arXiv preprint arXiv:1207.5938, 2013.
 Andersen [2006] P. K. Andersen. Survival Analysis. Wiley Reference Series in Biostatistics, 2006.
 Andrieu and Thoms [2008] Christophe Andrieu and Johannes Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, 2008.
 Andrieu et al. [2006] Christophe Andrieu, Éric Moulines, et al. On the ergodicity properties of some adaptive MCMC algorithms. The Annals of Applied Probability, 16(3):1462–1505, 2006.
 Andrieu et al. [2009] Christophe Andrieu, Gareth O. Roberts, et al. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
 Andrieu et al. [2010] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
 Atchadé and Rosenthal [2005] Yves F. Atchadé and Jeffrey S. Rosenthal. On adaptive Markov chain Monte Carlo algorithms. Bernoulli, 11(5):815–828, 10 2005. doi: 10.3150/bj/1130077595.
 Beal and Sheiner [1980] Stuart Beal and Lewis Sheiner. The NONMEM system. The American Statistician, 34(2):118–119, 1980.
 Betancourt [2017] Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.
 Brooks et al. [2011] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.
 Brosse et al. [2017] Nicolas Brosse, Alain Durmus, Éric Moulines, and Sotirios Sabanis. The tamed unadjusted Langevin algorithm. arXiv preprint arXiv:1710.05559, 2017.
 Carpenter et al. [2017] Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Goodrich Ben, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.
 Chan et al. [2011] P. L. S. Chan, P. Jacqmin, M. Lavielle, L. McFadyen, and B. Weatherley. The use of the SAEM algorithm in MONOLIX software for estimation of population pharmacokineticpharmacodynamicviral dynamics parameters of maraviroc in asymptomatic HIV subjects. Journal of Pharmacokinetics and Pharmacodynamics, 38(1):41–61, 2011.
 Comets et al. [2017] Emmanuelle Comets, Audrey Lavenu, and Marc Lavielle. Parameter estimation in nonlinear mixed effect models using saemix, an R implementation of the SAEM algorithm. Journal of Statistical Software, 80(3):1–42, 2017.

 de Freitas et al. [2001] Nando de Freitas, Pedro Højen-Sørensen, Michael I. Jordan, and Stuart Russell. Variational MCMC. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 120–127, 2001.
 Delyon et al. [1999] Bernard Delyon, Marc Lavielle, and Eric Moulines. Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist., 27(1):94–128, 1999. doi: 10.1214/aos/1018031103.
 Donnet and Samson [2013] Sophie Donnet and Adeline Samson. Using PMCMC in EM algorithm for stochastic mixed models: theoretical and practical issues. Journal de la Société Française de Statistique, 155(1):49–72, 2013.
 Doucet et al. [2000] Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.
 Griewank and Walther [2008] Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation, volume 105. SIAM, 2008.
 Haario et al. [2001] Heikki Haario, Eero Saksman, Johanna Tamminen, et al. An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242, 2001.

 Hoffman and Gelman [2014] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
 Kucukelbir et al. [2015] Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, and David Blei. Automatic variational inference in Stan. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 568–576. Curran Associates, Inc., 2015.
 Kuhn and Lavielle [2004] Estelle Kuhn and Marc Lavielle. Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM: Probability and Statistics, 8:115–131, 2004.
 Lavielle [2014] Marc Lavielle. Mixed effects models for the population approach: models, tasks, methods and tools. CRC press, 2014.
 Lavielle and Ribba [2016] Marc Lavielle and Benjamin Ribba. Enhanced method for diagnosing pharmacometric models: random sampling from conditional distributions. Pharmaceutical research, 33(12):2979–2988, 2016.
 Louis [1982] Thomas A. Louis. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B: Methodological, 44:226–233, 1982.
 Mbogning et al. [2015] Cyprien Mbogning, Kevin Bleakley, and Marc Lavielle. Joint modeling of longitudinal and repeated timetoevent data using nonlinear mixedeffects models and the SAEM algorithm. Journal of Statistical Computation and Simulation, 85(8):1512–1528, 2015. doi: 10.1080/00949655.2013.878938.
 McLachlan and Krishnan [2007] Geoffrey McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.
 Mengersen and Tweedie [1996] K. L. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist., 24(1):101–121, 1996. doi: 10.1214/aos/1033066201.
 Metropolis et al. [1953] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953. doi: 10.1063/1.1699114.
 Migon et al. [2014] H.S. Migon, D. Gamerman, and F. Louzada. Statistical Inference: An Integrated Approach, Second Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2014. ISBN 9781439878804.
 Neal et al. [2011] Radford M. Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
 O’Reilly and Aggeler [1968] Robert A. O’Reilly and Paul M. Aggeler. Studies on coumarin anticoagulant drugs initiation of warfarin therapy without a loading dose. Circulation, 38(1):169–177, 1968.
 Pav [2016] Steven E Pav. Madness: a package for multivariate automatic differentiation. 2016.
 Robert and Casella [2010] Christian P. Robert and George Casella. Metropolis–Hastings Algorithms, pages 167–197. Springer New York, New York, NY, 2010.
 Roberts and Rosenthal [1997] G. O. Roberts and J. S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. J. R. Statist. Soc. B, 60:255–268, 1997.
 Roberts et al. [1997] G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab., 7(1):110–120, 1997. doi: 10.1214/aoap/1034625254.
 Roberts and Rosenthal [2011] Gareth O. Roberts and Jeffrey S. Rosenthal. Quantitative non-geometric convergence bounds for independence samplers. Methodology and Computing in Applied Probability, 13(2):391–403, 2011.
 Roberts and Tweedie [1996] Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.

Rue et al. [2009]
Håvard Rue, Sara Martino, and Nicolas Chopin.
Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations.
Journal of the royal statistical society: Series b (statistical methodology), 71(2):319–392, 2009.  Savic et al. [2011] R. M. Savic, F. Mentré, and M. Lavielle. Implementation and evaluation of the SAEM algorithm for longitudinal ordered categorical data with an illustration in pharmacokineticspharmacodynamics. The AAPS Journal, 13(1):44–53, 2011.
 Stan Development Team [2018] Stan Development Team. RStan: the R interface to Stan, 2018. URL http://mc-stan.org/. R package version 2.17.3.
 Stramer and Tweedie [1999] O. Stramer and R. L. Tweedie. Langevin-type models I: Diffusions with given stationary distributions and their discretizations. Methodology and Computing in Applied Probability, 1(3):283–306, 1999. ISSN 1573-7713. doi: 10.1023/A:1010086427957.
 Titsias and Papaspiliopoulos [2018] Michalis K. Titsias and Omiros Papaspiliopoulos. Auxiliary gradient‐based sampling algorithms. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 0(0), 2018. doi: 10.1111/rssb.12269.
 Verbeke [1997] Geert Verbeke. Linear mixed models for longitudinal data. Springer, 1997.
 Vihola [2012] Matti Vihola. Robust adaptive metropolis algorithm with coerced acceptance rate. Statistics and Computing, 22(5):997–1008, 2012.
 Wang [2007] Yaning Wang. Derivation of various nonmem estimation methods. Journal of Pharmacokinetics and pharmacodynamics, 34(5):575–593, 2007.
 Zhang [2016] Z. Zhang. Parametric regression model for survival data: Weibull regression model as an example. Ann Transl Med., 24, 2016.
Appendix A Extensions of model (2)
Several extensions of model (2) are also possible. We can assume for instance that the transformed individual parameters are normally distributed:

(33)  $u(\psi_i) \sim \mathcal{N}\big(u(\psi_{\mathrm{pop}}), \Omega\big),$

where $u$ is a strictly monotonic transformation applied to the individual parameters $\psi_i$. Examples of such transformations are the logarithm (in which case the components of $\psi_i$ are log-normally distributed), the logit and the probit transformations [25]. In the following, we either use the original parameters $\psi_i$ or the Gaussian transformed parameters $u(\psi_i)$. Another extension of model (2) consists in introducing individual covariates in order to explain part of the inter-individual variability:

(34)  $u(\psi_i) \sim \mathcal{N}\big(u(\psi_{\mathrm{pop}}) + C_i\beta, \Omega\big),$

where $C_i$ is a matrix of individual covariates. Here, the fixed effects are the vector of coefficients $\beta$ and the vector of typical parameters $\psi_{\mathrm{pop}}$.
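As an illustration of the covariate model, with the logarithm as transformation (an assumption of this sketch, matching the log-normal case mentioned above), individual parameters can be drawn in Python as follows; the shapes and the name `omega_chol` are illustrative.

```python
import numpy as np

def draw_individual_parameters(c_i, beta, psi_pop, omega_chol, rng):
    """Draw one individual's parameter vector under the covariate model,
    taking u = log, so the parameters are log-normal (an assumption of this
    sketch). c_i: covariate matrix (d x q); beta: coefficients (q,);
    psi_pop: typical parameters (d,); omega_chol: Cholesky factor of Omega."""
    eta = omega_chol @ rng.normal(size=len(psi_pop))  # individual random effects
    return np.exp(np.log(psi_pop) + c_i @ beta + eta)
```

With zero covariate effects and zero random-effect variance, every individual receives the typical parameters, which is a convenient sanity check.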
Appendix B Calculus of the proposal in the non-continuous case
The Laplace approximation (see [32]) consists in approximating an integral of the form

(35)  $\int_{\mathbb{R}^d} e^{-f(x)}\,\mathrm{d}x,$

where $f$ is at least twice differentiable. The following second-order Taylor expansion of the function $f$ around a point $x_0$,

(36)  $f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0) + \tfrac{1}{2}(x - x_0)^\top H_f(x_0)(x - x_0),$

where $H_f(x_0)$ denotes the Hessian of $f$ at $x_0$, provides an approximation of the integral (consider a multivariate Gaussian probability density function, whose integral sums to 1): assuming $H_f(x_0)$ is positive definite,

(37)  $\int_{\mathbb{R}^d} e^{-f(x)}\,\mathrm{d}x \approx e^{-f(x_0) + \frac{1}{2}\nabla f(x_0)^\top H_f(x_0)^{-1}\nabla f(x_0)}\,(2\pi)^{d/2}\,\det\big(H_f(x_0)\big)^{-1/2}.$
In our context, we can write the marginal pdf that we aim to approximate as

(38)  $p(y_i) = \int p(y_i, \psi_i)\,\mathrm{d}\psi_i$
(39)  $\qquad\;\; = \int e^{\log p(y_i, \psi_i)}\,\mathrm{d}\psi_i.$

Then, let

(40)  $f(\psi_i) = -\log p(y_i, \psi_i)$
(41)  $\hat\psi_i = \arg\min_{\psi_i} f(\psi_i)$

and compute the Taylor expansion of $f$ around the MAP $\hat\psi_i$. We have by definition that $\nabla f(\hat\psi_i) = 0$, which leads to the following Laplace approximation of $\log p(y_i)$:

$\log p(y_i) \approx -f(\hat\psi_i) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det\big(H_f(\hat\psi_i)\big).$

We thus obtain the following approximation of the logarithm of the conditional pdf of $\psi_i$ given $y_i$, evaluated at $\psi_i$:

$\log p(\psi_i \mid y_i) = \log p(y_i, \psi_i) - \log p(y_i) \approx -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det(\Gamma_i) - \frac{1}{2}(\psi_i - \hat\psi_i)^\top \Gamma_i^{-1} (\psi_i - \hat\psi_i),$

which is precisely the log-pdf of a multivariate Gaussian distribution with mean $\hat\psi_i$ and variance-covariance matrix $\Gamma_i$ with:

$\Gamma_i = H_f(\hat\psi_i)^{-1} = \Big(-\nabla^2_{\psi_i}\log p(y_i, \psi_i)\big|_{\psi_i = \hat\psi_i}\Big)^{-1}.$