 # f-SAEM: A fast Stochastic Approximation of the EM algorithm for nonlinear mixed effects models

The ability to generate samples of the random effects from their conditional distributions is fundamental for inference in mixed effects models. Random walk Metropolis is widely used to perform such sampling, but this method is known to converge slowly for medium dimensional problems, or when the joint structure of the distributions to sample is spatially heterogeneous. The main contribution consists of an independent Metropolis-Hastings (MH) algorithm based on a multidimensional Gaussian proposal that takes into account the joint conditional distribution of the random effects and does not require any tuning. Indeed, this distribution is automatically obtained thanks to a Laplace approximation of the incomplete data model. Such approximation is shown to be equivalent to linearizing the structural model in the case of continuous data. Numerical experiments based on simulated and real data illustrate the performance of the proposed methods. For fitting nonlinear mixed effects models, the suggested MH algorithm is efficiently combined with a stochastic approximation version of the EM algorithm for maximum likelihood estimation of the global parameters.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Mixed effects models are often adopted to take into account the inter-individual variability within a population (see  and the references therein). Consider a study with

individuals from a same population. The vector of observations

associated to each individual

is assumed to be a realisation of a random variable which depends on a vector of random individual parameters

. Then, inference on the individual parameter amounts to estimate its conditional distribution given the observed data .

When the model is a linear (mixed effects) Gaussian model, then this conditional distribution is a normal distribution that can explicitly be computed



. For more complex distributions and models, Monte Carlo methods must be used to approximate this conditional distribution. Most often, direct sampling from this conditional distribution is inefficient and it is necessary to resort to a Markov chain Monte Carlo (MCMC) method for obtaining random samples from this distribution. Yet, MCMC requires a tractable likelihood in order to compute the acceptance ratio. When this computation is impossible, a pseudo-marginal Metropolis Hastings (PMMH) has been developed in

 and consists in replacing the posterior distribution evaluated in the MH acceptance rate by an unbiased approximation. An extension of the PMMH is the particle MCMC method, introduced in , where a Sequential Monte Carlo sampler  is used to approximate the intractable likelihood at each iteration. For instance, this method is relevant when the model is SDE-based (see ). In a fully Bayesian setting, approximation of the posterior of the global parameters can be used to approximate the posterior of the individual parameters using Integrated Nested Laplace Approximation (INLA) introduced in 

. When the goal is to do approximate inference, this method has shown great performances mainly because it approximates each marginal separately as univariate Gaussian distribution. In this paper, we focus on developing a method to perform exact inference and do not treat the case of approximate inference algorithms such as the Laplace EM or the First Order Conditional Estimation methods

 that can introduce bias in the resulting parameters.

Note that generating random samples from is useful for several tasks to avoid approximation of the model, such as linearisation or Laplace method. Such tasks include the estimation of the population parameters of the model by a maximum likelihood approach, i.e. by maximizing the observed incomplete data likelihood using the Stochastic Approximation of the EM algorithm (SAEM) algorithm combined with a MCMC procedure . Lastly, sampling from the conditional distributions is also known to be useful for model building. Indeed, in , the authors argue that methods for model assessment and model validation, whether graphical or based on statistical tests, must use samples of the conditional distribution to avoid bias.

Designing a fast mixing sampler for these distributions is therefore of utmost importance to perform Maximum Likelihood Estimation (MLE) using the SAEM algorithm. The most common MCMC method for nonlinear mixed effects (NLME) models is the random walk Metropolis (RWM) algorithm [36, 38, 25]. This method is implemented in software tools such as Monolix, NONMEM, the saemix R package  and the nlmefitsa Matlab function. Despite its simplicity, it has been successfully used in many classical examples of pharmacometrics. Nevertheless, it can show its limitations when the dependency structure of the individual parameters is complex. Yet, maintaining an optimal acceptance rate (advocated in ) most often implies very small moves and therefore a very large number of iterations in medium and high dimensions since no information of the geometry of the target distribution is used.

The Metropolis-adjusted Langevin algorithm (MALA) uses evaluations of the gradient of the target density for proposing new states which are accepted or rejected using the Metropolis-Hastings algorithm [40, 44]

. Hamiltonian Monte Carlo (HMC) is another MCMC algorithm that exploits information about the geometry of the target distribution in order to efficiently explore the space by selecting transitions that can follow contours of high probability mass

. The No-U-Turn Sampler (NUTS) is an extension to HMC that allows an automatic and optimal selection of some of the settings required by the algorithm, [11, 22]. Nevertheless, these methods may be difficult to use in practice, and are computationally involved, in particular when the structural model is a complex ODE based model. The algorithm we propose is an independent Metropolis-Hastings (IMH) algorithm, but for which the proposal is a Gaussian approximation of the target distribution. For general data model (i.e. categorical, count or time-to-event data models or continuous data models), the Laplace approximation of the incomplete pdf leads to a Gaussian approximation of the conditional distribution .

In the special case of continuous data, linearisation of the model leads, by definition, to a Gaussian linear model for which the conditional distribution of the individual parameter given the data is a multidimensional normal distribution that can be computed. Therefore, we design an independent sampler using this multivariate Gaussian distribution to sample from the target conditional distribution and embed this procedure in an exact inference algorithm, the SAEM, to speed the convergence of the vector of estimations of the global parameters .

The paper is organised as follows. Mixed effects models for continuous and noncontinuous data are presented in Section 2. The standard MH for NLME models is described in Section 3. The proposed method, called the nlme-IMH, is introduced in Section 4 as well as the f-SAEM, a combination of this new method with the SAEM algorithm for estimating the population parameters of the model. Numerical examples illustrate, in Section 5, the practical performances of the proposed method, both on a continuous pharmacokinetics (PK) model and a time-to-event example. A Monte Carlo study confirms that this new SAEM algorithm shows a faster convergence to the maximum likelihood estimate.

## 2 Mixed Effect Models

### 2.1 Population approach and hierarchical models

In the sequel, we adopt a population approach, where we consider individuals and observations per individual . The set of observed data is where are the observations for individual . For the sake of clarity, we assume that each observation takes its values in some subset of . The distribution of the vector of observations depends on a vector of individual parameters that takes its values in a subset of .

We assume that the pairs

are mutually independent and consider a parametric framework: the joint distribution of

is denoted by , where is the vector of parameters of the model. A natural decomposition of this joint distribution reads

 pi(yi,ψi;θ)=pi(yi|ψi;θ)pi(ψi;θ), (1)

where is the conditional distribution of the observations given the individual parameters, and where is the so-called population distribution used to describe the distribution of the individual parameters within the population.

A particular case of this general framework consists in describing each individual parameter as the sum of a typical value and a vector of individual random effects :

 ψi=ψpop+ηi. (2)

In the sequel, we assume that the random effects are distributed according to a multivariate Gaussian distribution: . Extensions of this general model are detailed in Appendix A.

### 2.2 Continuous data models

A regression model is used to express the link between continuous observations and individual parameters:

 yij=fi(tij,ψi)+εij, (3)

where is the -th observation for individual measured at index , is the residual error. It is assumed that for any index , is twice differentiable in .

We start by assuming that the residual errors are independent and normally distributed with zero-mean and a constant variance

. Let be the vector of observation indices for individual . Then, the model for the observations reads:

 yi|ψi∼N(fi(ψi),σ2Idni×ni)wherefi(ψi)=(fi(ti,1,ψi),…,fi(ti,ni,ψi)). (4)

If we assume that , then the parameters of the model are .

###### Remark 1.

An extension of this model consists in assuming that the variance of the residual errors is not constant over time, i.e., . Such extension includes proportional error models () and combined error models ()  but the proposed method remains the same whatever the residual error model is.

### 2.3 Noncontinuous data models

Noncontinuous data models include categorical data models [42, 1], time-to-event data models [28, 3], or count data models . A categorical outcome takes its value in a set of categories. Then, the model is defined by the conditional probabilities , that depend on the vector of individual parameters and may be a function of the time .

In a time-to-event data model, the observations are the times at which events occur. An event may be one-off (e.g., death, hardware failure) or repeated (e.g., epileptic seizures, mechanical incidents). To begin with, we consider a model for a one-off event. The survival function gives the probability that the event happens after time :

 S(t)≜P(T>t)=exp{−∫t0h(u)du}, (5)

where is called the hazard function. In a population approach, we consider a parametric and individual hazard function . The random variable representing the time-to-event for individual is typically written and may possibly be right-censored. Then, the observation for individual is

 yi={Tiif Ti≤τc"Ti>τc"otherwise, (6)

where is the censoring time and is the information that the event occurred after the censoring time.

For repeated event models, times when events occur for individual are random times for which conditional survival functions can be defined:

 P(Tij>t|Ti(j−1)=ti(j−1))=exp{−∫tti(j−1)h(u,ψi)du}. (7)

Here, is the observed value of the random time . If the last event is right censored, then the last observation for individual is the information that the censoring time has been reached . The conditional pdf of reads (see  for more details)

 pi(yi|ψi)=exp{−∫τc0h(u,ψi)du}ni−1∏j=1h(tij,ψi). (8)

## 3 Sampling from conditional distributions

### 3.1 The conditional distribution of the individual parameters

Once the conditional distribution of the observations and the marginal distribution of the individual parameters are defined, the joint distribution and the conditional distribution are implicitly specified. This conditional distribution plays a crucial role for inference in NLME models.

One of the main task is to compute the maximum likelihood (ML) estimate of

 ^θML=argmaxθ∈RdL(θ,y), (9)

where . In NLME models, this optimization is solved by using a surrogate function defined as the conditional expectation of the complete data log-likelihood . The SAEM is an iterative procedure for ML estimation that requires to generate one or several samples from this conditional distribution at each iteration of the algorithm. Once the ML estimate has been computed, the observed Fisher information matrix noted can be derived thanks to the Louis formula  which expresses in terms of the conditional expectation and covariance of the complete data log-likelihood. Such procedure also requires to sample from the conditional distributions for all .

Samples from the conditional distributions might also be used to define several statistical tests and diagnostic plots for models assessment. It is advocated in  that such samples should be preferred to the modes of these distributions (also called Empirical Bayes Estimate(EBE), or Maximum a Posteriori Estimate), in order to provide unbiased tests and plots. For instance, a strong bias can be observed when the EBEs are used for testing the distribution of the parameters or the correlation between random effects.

In short, being able to sample individual parameters from their conditional distribution is essential in nonlinear mixed models. It is therefore necessary to design an efficient method to sample from this distribution.

### 3.2 The Metropolis-Hastings Algorithm

Metropolis-Hasting (MH) algorithm is a powerful MCMC procedure widely used for sampling from a complex distribution . To simplify the notations, we remove the dependency on . For a given individual , the MH algorithm, to sample from the conditional distribution , is described as:

Under weak conditions, is an ergodic Markov chain whose distribution converges to the target .

Current implementations of the SAEM algorithm in Monolix , saemix (R package) , nlmefitsa (Matlab) and NONMEM  mainly use the same combination of proposals. The first proposal is an independent MH algorithm which consists in sampling the candidate state directly from the prior distribution of the individual parameter . The MH ratio then reduces to for this proposal.

The other proposals are component-wise and block-wise random walk procedures  that updates different components of using univariate and multivariate Gaussian proposal distributions. These proposals are centered at the current state with a diagonal variance-covariance matrix; the variance terms are adaptively adjusted at each iteration in order to reach some target acceptance rate [8, 25]. Nevertheless, those proposals fail to take into account the nonlinear dependence structure of the individual parameters.

A way to alleviate these problems is to use a proposal distribution derived from a discretised Langevin diffusion whose drift term is the gradient of the logarithm of the target density leading to the Metropolis Adjusted Langevin Algorithm (MALA). The MALA proposal is a multivariate Gaussian with the following mean and covariance matrix :

 μ(k)i,MALA=ψ(k)i+γ∇ψilogpi(ψ(k)i|yi)andΓMALA=2γIp (11)

where is a positive stepsize and

is the identity matrix in

. These methods appear to behave well for complex models but still do not take into consideration the multidimensional structure of the individual parameters. Recent works include efforts in that direction, such as the Anisotropic MALA for which the covariance matrix of the proposal depends on the gradient of the target measure , the Tamed Unadjusted Langevin Algorithm  based on the coordinate-wise taming of superlinear drift coefficients and a multidimensional extension of the Adaptative Metropolis algorithm  simultaneously estimating the covariance of the target measure and coercing the acceptance rate, see .

The MALA algorithm is a special instance of the Hybrid Monte Carlo (HMC), introduced in ; see  and the references therein, and consists in augmenting the state space with an auxiliary variable , known as the velocity in Hamiltonian dynamics. This algorithm belongs to the class of data augmentation methods. Indeed, the potential energy is augmented with a kinetic energy, function of an added auxiliary variable. The MCMC procedure then consists in sampling from this augmented posterior distribution. All those methods aim at finding the proposal that accelerates the convergence of the chain. Unfortunately they are computationally involved (even in small and medium dimension settings, the computation of the gradient or the Hessian can be overwhelming) and can be difficult to implement (stepsizes and numerical derivatives need to be tuned and implemented).

We see in the next section how to define a multivariate Gaussian proposal for both continuous and noncontinuous data models, that is easy to implement and that takes into account the multidimensional structure of the individual parameters in order to accelerate the MCMC procedure.

## 4 The nlme-IMH and the f-SAEM

In this section, we assume that the individual parameters are independent and normally distributed with mean and covariance . The MAP estimate, for individual , is the value of that maximizes the conditional distribution :

 ^ψi=argmaxψi∈Rppi(ψi|yi)=argmaxψi∈Rppi(yi|ψi)pi(ψi). (12)

### 4.1 Proposal based on Laplace approximation

For both continuous and noncontinuous data models, the goal is to find a simple proposal, a multivariate Gaussian distribution in our case, that approximates the target distribution . For general MCMC samplers, it is shown in  that the mixing rate in total variation depends on the expectation of the acceptance ratio under the proposal distribution which is also directly related to the ratio of the proposal to the target in the special case of independent samplers (see [30, 39]). This observation naturally suggests to find a proposal which approximates the target. 

advocates the use a multivariate Gaussian distribution whose parameters are obtained by minimizing the Kullback-Leibler divergence between a multivariate Gaussian variational candidate distribution and the target distribution. In

 and the references therein, an adaptative Metropolis algorithm is studied and reconciled to a KL divergence minimisation problem where the resulting multivariate Gaussian distribution can be used as a proposal in a IMH algorithm. Authors note that although this proposal might be a sensible choice when it approximates well the target, it can fail when the parametric form of the proposal is not sufficiently rich. Thus, other parametric forms can be considered and it is suggested in  to consider mixtures, finite or infinite, of distributions belonging to the exponential family.

In general, this optimization step is difficult and computationally expensive since it requires to approximate (using Monte Carlo integration for instance) the integral of the log-likelihood with respect to the variational candidate distribution.

###### Independent proposal 1.

We suggest a Laplace approximation of this conditional distribution as described in  which is the multivariate Gaussian distribution with mean and variance-covariance

 Γi=(−H^ψi+Ω−1)−1, (13)

where is the Hessian of evaluated at .

Mathematical details for computing this proposal are postponed to Appendix B. We use this multivariate Gaussian distribution as a proposal in our IMH algorithm introduced in the next section, for both continuous and noncontinuous data models.

###### Remark 2.

Note that the resulting proposal distribution is based on the assumption that, in model (2), the random effects are normally distributed. When this assumption does not hold, our method exploits the same Gaussian proposal, where the variance in (13) is calculated explicitely. Consider the following example: the random effects in (2) are no longer distributed according to a multivariate Gaussian distribution but a multivariate Student distribution with degrees of freedom, zero mean and a prior shape matrix such that . Then the vector of parameters of the model is where is the prior covariance matrix. In that case, our method uses the Independent proposal 1 and computes the MH acceptance ratio (10) with the corresponding multivariate Student density .

We shall now see another method to derive a Gaussian proposal distribution in the specific case of continuous data models (see (3)).

### 4.2 Nonlinear continuous data models

When the model is described by (3), the approximation of the target distribution can be done twofold: either by using the Laplace approximation, as explained above, or by linearizing the structural model for any individual of the population. using (3) and (12), the MAP estimate can thus be derived as:

 ^ψi=argminψi∈Rp(1σ2∥yi−fi(ψi)∥2+(ψi−mi)′Ω−1(ψi−mi)). (14)

where is defined by (4) and is the transpose of the matrix .

We linearize the structural model around the MAP estimate :

 fi(ψi)≈fi(^ψi)+Jfi(^ψi)(ψi−^ψi), (15)

where is the Jacobian of evaluated at . Defining , this expansion yields the following linear model:

 zi=Jfi(^ψi)ψi+εi. (16)

We can directly use the definition of the conditional distribution under a linear model (see (45) in Appendix C) to get an expression of the conditional covariance of given under (16):

 Γi=⎛⎝J′fi(^ψi)Jfi(^ψi)σ2+Ω−1⎞⎠−1. (17)

Using (14) and the definition of the conditional distribution under a linear model we obtain that (See Appendix D for details). We note that the mode of the conditional distribution of in the nonlinear model (3) is also the mode and the mean of the conditional distribution of in the linear model (16).

###### Independent proposal 2.

In the case of continuous data models, we propose to use the multivariate Gaussian distribution, with mean and variance-covariance matrix defined by (17) as a proposal for an independent MH algorithm avoiding the computation of an Hessian matrix.

We can note that linearizing the structural model is equivalent to using the Laplace approximation with the expected information matrix. Indeed:

 Eyi|^ψi(−Hl(^ψi))=J′fi(^ψi)Jfi(^ψi)σ2. (18)
###### Remark 3.

When the model is linear, the probability of accepting a candidate generated with this proposal is equal to 1.

###### Remark 4.

If we consider a more general error model, that may depend on the individual parameters and the observation times , then the conditional variance-covariance matrix reads:

 Γi=(J′fi(^ψi)Σ(ti,^ψi)−1Jfi(^ψi)+Ω−1)−1. (19)
###### Remark 5.

In the model (33), the transformed variable follows a normal distribution. Then a candidate is drawn from the multivariate Gaussian proposal with parameters:

 μi=^ϕi, (20) Γi=⎛⎝J′fi(u−1(^ϕi))Jfi(u−1(^ϕi))σ2+Ω−1⎞⎠−1, (21)

where and finally the candidate vector of individual parameters is set to

These approximations of the conditional distribution lead to our nlme-IMH algorithm, an Independent Metropolis-Hastings (IMH) algorithm for NLME models. For all individuals , the algorithm is defined as:

This method shares some similiarities with  that suggests to perform a Taylor expansion of around the current state of the chain, leaving unchanged.

###### Remark 6.

Although a multivariate Gaussian proposal is used in our presentation of the nlme-IMH, other type of distributions could be adopted. For instance, when the target distribution presents heavy tails, a Student distribution with a well-chosen degree of freedom could improve the performance of the independent sampler. In such case, the parameters of the Gaussian proposal are used to shift and scale the Student proposal distribution and the acceptance rate (23) needs to be modified accordingly. The numerical applications in Section 5 are performed using a Gaussian proposal but comparisons with a Student proposal distribution are given in Appendix E.1.

### 4.3 Maximum Likelihood Estimation

The ML estimator defined by (9) is computed using the Stochastic Approximation of the EM algorithm (SAEM) . The SAEM algorithm is described as follows:

The SAEM algorithm is implemented in most sofware tools for NLME models and its convergence is studied in [17, 24, 2]. The practical performances of SAEM are closely linked to the settings of SAEM. In particular, the choice of the transition kernel plays a key role. The transition kernel is directly defined by the proposal(s) used for the MH algorithm.

We propose a fast version of the SAEM algorithm using our resulting independent proposal distribution called the f-SAEM. The simulation step of the f-SAEM is achieved using the nlme-IMH algorithm (see algorithm 2) for all individuals and the next steps remain unchanged. In practice, the number of transitions is small since the convergence of the SAEM does not require the convergence of the MCMC at each iteration . In the sequel, we carry out numerous numerical experiments to compare our nlme-IMH algorithm to state-of-the-art samplers and assess its relevance in a MLE algorithm such as the SAEM.

## 5 Numerical Examples

### 5.1 A pharmacokinetic example

#### 5.1.1 Data and model

32 healthy volunteers received a 1.5 mg/kg single oral dose of warfarin, an anticoagulant normally used in the prevention of thrombosis . Figure 1 shows the warfarin plasmatic concentration measured at different times for these patients (the single dose was given at time 0 for all the patients). Figure 1: Warfarin concentration (mg/l) over time (h) for 32 subjects

We consider a one-compartment pharmacokinetics (PK) model for oral administration, assuming first-order absorption and linear elimination processes:

 f(t,ka,V,k)=DkaV(ka−k)(e−kat−e−kt), (26)

where is the absorption rate constant, the volume of distribution , the elimination rate constant, and the dose of drug administered. Here, , and are PK parameters that can change from one individual to another. Let be the vector of individual PK parameters for individual . The model for the -th measured concentration, noted , for individual writes:

 yij=f(tij,ψi)+εij. (27)

We assume in this example that the residual errors are independent and normally distributed with mean 0 and variance . Lognormal distributions are used for the three PK parameters:

 log(kai)∼N(log(kapop),ω2ka),log(Vi)∼N(log(Vpop),ω2V),log(ki)∼N(log(kpop),ω2k). (28)

This is a specific instance of the nonlinear mixed effects model for continuous data described in Section 2.2. We thus use the multivariate Gaussian proposal whose mean and covariance are defined by (47) and (17). In such case the gradient can be explicitly computed. Nevertheless, for the method to be easily extended to any structural model, the gradient is calculated using Automatic Differentiation  implemented in the R package “Madness” .

#### 5.1.2 MCMC Convergence Diagnostic

We study in this section the behaviour of the MH algorithm used to sample individual parameters from the conditional distribution . We consider only one of the 32 individuals for this study and fix close to the ML estimate obtained with the SAEM algorithm, implemented in the saemix R package : , , , , , and .

We run the classical version of MH implemented in the saemix package and for which different transition kernels are used successively at each iteration: independent proposals from the marginal distribution , component-wise random walk and block-wise random walk. We compare it to our proposed algorithm 2.

We run iterations of these two algorithms and evaluate their convergence by looking at the convergence of the median for the three components of . We see Figure 2 that, for parameter , the sequences of empirical median obtained with the two algorithms converge to the same value, which is supposed to be the theoretical median of the conditional distribution. It is interesting to note that the empirical median with the nlme-IMH converge very rapidly. This is interesting in the population approach framework because it is mainly the median values of each conditional distribution that are used to infer the population distribution. Autocorrelation plots, Figure 2, highlight slower mixing of the RWM whereas samples from the nlme-IMH can be considered independent few iterations after the chain has been initialized. Comparison for all three dimensions of the individual parameter using a Student proposal distribution can be found in Appendix E.1.

The Mean Square Jump Distance (MSJD) as well as the Effective Sample Size (ESS) of the two methods are reported in Table 5. MSJD is a measure used to diagnose the mixing of the chain. It is calculated as the mean of the squared euclidean distance between every point and its previous point. Usually, this quantity indicates if the chain is moving enough or getting stuck at some region and the ESS is a quantity that estimates the number of independent samples obtained from a chain. Larger values of those two quantities for our method show greater performance of the sampler in comparison with the RWM. Figure 2: Modelling of the warfarin PK data. Top plot: convergence of the empirical medians of pi(ki|yi;θ) for a single individual. Comparison between the reference MH algorithm (blue) and the nlme-IMH (red). Bottom plot: Autocorrelation plots of the MCMC samplers for parameter ki.

Comparison with state-of-the-art methods: We then compare our new approach to the three following samplers: an independent sampler that uses variational approximation as proposal distribution , the MALA  and the No-U-Turn Sampler .

The same design and settings (dataset, model parameter estimate, individual) as in section 5.1.2 are used throughout the following experiments.

5.1.2.a
##### 5.1.2.a Variational MCMC algorithm
The Variational MCMC algorithm is a MCMC algorithm with independent proposal. The proposal distribution is a multivariate Gaussian distribution whose parameters are obtained by a variational approach that consists in minimising the Kullback Leibler divergence between a multivariate Gaussian distribution , and the target distribution for a given model parameter estimate noted . This problem boils down to maximizing the so-called Evidence Lower Bound defined as:
 ELBO(δ)≜∫q(ψi,δ)(log%pi(yi,ψi,θ)−logq(ψi,δ))dψi. (29)
We use the Automatic Differentiation Variational Inference (ADVI) implemented in RStan (R Package ) to obtain the vector of parameters noted defined as:
 δVI≜argmaxδ∈Rp×Rp×pELBO(δ).
The algorithm stops when the variation of the median of the objective function falls below the

threshold. The means and standard deviations of our nlme-IMH and the Variational MCMC proposals compare with the posterior mean (calculated using the NUTS

) as follows: We observe that the mean of the variational approximation is slightly shifted from the estimated posterior mode (see table 2for comparison) whereas a considerable difference lies in the numerical value of the covariance matrix obtained with ADVI. The empirical standard deviation of the Variational MCMC proposal is much smaller than our new proposal defined by (17) (see table 2), which slows down the MCMC convergence. Figure 3shows the proposals marginals and the marginal posterior distribution for the individual parameters and . Biplot of the samples drawn from the two multivariate Gaussian proposals (our independent proposal and the variational MCMC proposal) as well as samples drawn from the posterior distribution (using the NUTS) are also presented in this figure. We conclude that both marginal and bivariate posterior distributions are better approximated by our independent proposal than the one resulting from a KL divergence optimization. Besides similar marginal variances, both our independent proposal and the true posterior share a strong anisotropic nature, confirmed by the similar correlation values of table 6(see Appendix E.1). Same characteristics are observed for the other parameters. Those highlighted properties leads to a better performance of the nlme-IMH versus the ADVI sampler as reflected in Figure 4. Larger jumps of the chain and bigger ESS show how effective the nlme-IMH is compared to the ADVI (see Table 3). Figure 3: Modelling of the warfarin PK data: Comparison between the proposals of the nlme-IMH (blue), the Variational MCMC (green) and the empirical target distribution sampled using the NUTS (red). Marginals and biplots of the conditional distributions ki|yi and Vi|yi for a single individual. Ellipses containing 90% of the data points are represented on the main plot.5.1.2.b
###### 5.1.2.b Metropolis Adjusted Langevin Algorithm (MALA) and No-U-Turn Sampler (NUTS)
We now compare our method to the MALA, which proposal is defined by (11). The gradient of the log posterior distribution is also calculated by Automatic Differentiation. In this numerical example, the MALA has been initialized at the MAP and the stepsize () is tuned such that the acceptance rate of is reached . We also compare the implementation of NUTS [22, 13]in the RStan package to our method in Figure 4. Figure 4highlights good convergence of a well-tuned MALA and the NUTS. nlme-IMH and NUTS mixing properties, from autocorrelation plots in Figure 4seem to be similar and much better than all of the other methods. Table 3presents a benchmark of those methods regarding MSJD and ESS. Both nlme-IMH and NUTS have better performances here. For parameters and , the ESS of the NUTS, presented as a gold standard sampler for this king of problem, are slightly higher than our proposed method. Figure 4: Modelling of the warfarin PK data: Autocorrelation plots of the MCMC samplers for parameter ki. In practice, those three methods imply tuning phases that are computationally involved , warming up the chain and a careful initialisation whereas our independent sampler is automatic and fast to implement. Investigating the asymptotic convergence behavior of those methods highlights the competitive properties of our IMH algorithm to sample from the target distribution. Since our goal is to embed those samplers into a MLE algorithm such as the SAEM, we shall now study how they behave in the very first iterations of the MCMC procedure. Recall that the SAEM requires only few iterations of MCMC sampling under the current model parameter estimate. We present this non asymptotic study in the following section.

#### 5.1.3 Comparison of the chains for the first 500 iterations

We produce independent runs of the RWM, the nlme-IMH, the MALA and the NUTS for iterations. The boxplots of the samples drawn at a given iteration threshold (three different thresholds are used) are presented Figure 5 against the ground truth for the parameter ka. The ground truth has been calculated by running the NUTS for iterations.

For the three numbers of iteration (,,) considered in Figure 5, the median of the nlme-IMH and NUTS samples are closer to the ground truth. Figure 5 also highlights that all those methods succeed in sampling from the whole distribution after iterations. Figure 5: Modelling of the warfarin PK data: Boxplots for the RWM, the nlme-IMH, the MALA and the NUTS algorithm, averaged over 100 independent runs. The groundtruth median, 0.25 and 0.75 percentiles are plotted as a dashed purple line and its maximum and minimum as a dashed grey line.

We now use the RWM, the nlme-IMH and the MALA in the SAEM algorithm and observe the convergence of the resulting sequences of parameters.

#### 5.1.4 Maximum likelihood estimation

We use the SAEM algorithm to estimate the population PK parameters , and , the standard deviations of the random effects , and and the residual variance .

The stepsize is set to 1 during the first 100 iterations and then decreases as where during the next 100 iterations.

Here we compare the standard SAEM algorithm, as implemented in the saemix R package, with the f-SAEM algorithm and the SAEM using the MALA sampler. In this example, the nlme-IMH and the MALA are only used during the first iterations of the SAEM. The standard MH algorithm is then used.

Figure 6 shows the estimates of and computed at each iteration of these three variants of SAEM and starting from three different initial values. First of all, we notice that, whatever the initialisation and the sampling algorithm used, all the runs converge towards the maximum likelihood estimate. It is then very clear that the f-SAEM converges faster than the standard algorithm. The SAEM using the MALA algorithm for sampling from the individual conditional distribution presents a similar convergence behavior as the reference.

We can conclude, for this example, that sampling around the MAP of each individual conditional distribution is the key to a fast convergence of the SAEM during the first iterations. Figure 6: Estimation of the population PK parameters for the warfarin data: convergence of the sequences of estimates (^Vpop,k,1≤k≤200) and (^ωV,k,1≤k≤200) obtained with SAEM and three different initial values using the reference MH algorithm (blue), the f-SAEM (red) and the SAEM using the MALA sampler (black).

#### 5.1.5 Monte Carlo study

We conduct a Monte Carlo study to confirm the properties of the f-SAEM algorithm for computing the ML estimates.

datasets have been simulated using the PK model previously used for fitting the warfarin PK data with the following parameter values: , , , , , and . The same original design with patients and a total number of 251 PK measurements were used for all the simulated datasets. Since all the simulated data are different, the value of the ML estimator varies from one simulation to another. If we run iterations of SAEM, the last element of the sequence is the estimate obtained from the -th simulated dataset. To investigate how fast converges to we study how fast goes to 0. For a given sequence of estimates, we can then define, at each iteration and for each component of the parameter, the mean square distance over the replicates

 Ek(ℓ)=1MM∑m=1(θ(m)k(ℓ)−θ(m)K(ℓ))2. (30)

Figure 7 shows using the new proposal leads to a much faster convergence towards the maximum likelihood estimate. Less than 10 iterations are required to converge with the f-SAEM on this example, instead of 50 with the original version. It should also be noted that the distance decreases monotonically. The sequence of estimates approaches the target at each iteration, compared to the standard algorithm which makes twists and turns before converging. Figure 7: Convergence of the sequences of mean square distances (Ek(Vpop),1≤k≤200) and (Ek(ωV),1≤k≤200) for Vpop and ωV obtained with SAEM on M=50 synthetic datasets using the reference MH algorithm (blue) and the f-SAEM (red).

### 5.2 Time-to-event Data Model

#### 5.2.1 The model

In this section, we consider a Weibull model for time-to-event data [25, 49]. For individual , the hazard function of this model is:

 h(t,ψi)=βiλi(tλi)βi−1. (31)

Here, the vector of individual parameters is . These two parameters are assumed to be independent and lognormally distributed:

 log(λi)∼N(log(λpop),ω2λ),log(βi)∼N(log(βpop),ω2β). (32)

Then, the vector of population parameters is .

Repeated events were generated, for individuals, using the Weibull model (31) with , , and and assuming a right censoring time .

#### 5.2.2 MCMC Convergence Diagnostic

Similarly to the previous section, we start by looking at the behaviour of the MCMC procedure used for sampling from the conditional distribution for a given individual and assuming that is known. We use the generating model parameter in these experiments ().

We run iterations of the reference MH algorithm the nlme-IMH to estimate the median of the posterior distribution of . We see Figure 8 that the sequences of empirical medians obtained with the two procedures converge to the same value but the new algorithm converges faster than the standard MH algorithm. Autocorrelation plots, Figure 8, are also significantly showing the advantage of the new sampler as the chain obtained with the nlme-IMH is mixing almost ten times faster than the reference sampler. Figure 8: Time-to-event data modelling. Top plot: convergence of the empirical medians of pi(λi|yi;θ) for a single individual. Comparison between the reference MH algorithm (blue) and the nlme-IMH (red). Bottom plot: Autocorrelation plots of the MCMC samplers for parameter λi.

Plots for the other parameter can be found in Appendix E.2. Comparisons with state-of-the-art methods were conducted as in the previous section. These comparisons led us to the same remarks as those made for the previous continuous data model both on the asymptotic and non asymptotic regimes.

#### 5.2.3 Maximum likelihood estimation of the population parameters

We run the standard SAEM algorithm implemented in the saemix package (extension of this package for noncontinuous data models is available on GitHub: https://github.com/belhal/saemix) and the f-SAEM on the generated dataset.

Figure 9 shows the estimates of and computed at each iteration of the two versions of the SAEM and starting from three different initial values. The same behaviour is observed as in the continuous case: regardless the initial values and the algorithm, all the runs converge to the same solution but convergence is much faster with the proposed method. The same comment applies for the two other parameters and . Figure 9: Population parameter estimation in time-to-event-data models: convergence of the sequences of estimates (^λpop,k,1≤k≤200) and (^ωλ,k,1≤k≤200) obtained with SAEM and three different initial values using the reference MH algorithm (blue) and the f-SAEM (red).

#### 5.2.4 Monte Carlo study

We now conduct a Monte Carlo study in order to confirm the good properties of the new version of the SAEM algorithm for estimating the population parameters of a time-to-event data model. synthetic datasets are generated using the same design as above. Figure 10 shows the convergence of the mean square distances defined in (30) for and . All these distances converge monotonically to 0 which means that both algorithms properly converge to the maximum likelihood estimate, but very few iterations are required with the new version to converge while about thirty iterations are needed with the SAEM. Figure 10: Convergence of the sequences of mean square distances (Ek(λpop),1≤k≤200) and (Ek(ωλ),1≤k≤200) for λpop and ωλ obtained with SAEM from M=50 synthetic datasets using the reference MH algorithm (blue) and the f-SAEM (red).

## 6 Conclusion and discussion

We present in this article an independent Metropolis-Hastings procedure for sampling random effects from their conditional distributions and a fast MLE algorithm, called the f-SAEM, in nonlinear mixed effects models.

The idea of the method is to approximate each individual conditional distribution by a multivariate normal distribution. A Laplace approximation makes it possible to consider any type of data, but we have shown that, in the case of continuous data, this approximation is equivalent to linearizing the structural model around the conditional mode of the random effects.

The numerical experiments demonstrate that the proposed nlme-IMH sampler converges faster to the target distribution than a standard random walk Metropolis. This practical behaviour is partly explained by the fact that the conditional mode of the random effects in the linearized model coincides with the conditional mode of the random effects in the original model. The proposal distribution is therefore a normal distribution centered around this MAP. On the other hand, the dependency structure in the conditional distribution of the random effects is well approximated by the covariance structure of the Gaussian proposal. So far, we have mainly applied our method to standard problems encountered in pharmacometrics and for which the number of random effects remains small. It can nevertheless be interesting to see how this method behaves in higher dimension and compare it with methods adapted to such situations such as MALA or HMC. Lastly, we have shown that this new IMH algorithm can easily be embedded in the SAEM algorithm for maximum likelihood estimation of the population parameters. Our numerical studies have shown empirically that the new transition kernel is effective in the very first iterations. It is of great interest to determine automatically and in an adaptive way an optimal scheme of kernel transitions combining this new proposal with the block-wise random walk Metropolis.

## References

• Agresti  Alan Agresti. Categorical data analysis. A Wiley-Interscience publication. Wiley, New York, 1990.
• Allassonniere and Kuhn  Stéphanie Allassonniere and Estelle Kuhn. Convergent Stochastic Expectation Maximization algorithm with efficient sampling in high dimension. Application to deformable template model estimation. arXiv preprint arXiv:1207.5938, 2013.
• Andersen  P. K. Andersen. Survival Analysis. Wiley Reference Series in Biostatistics, 2006.
• Andrieu and Thoms  Christophe Andrieu and Johannes Thoms. A tutorial on adaptive mcmc. Statistics and computing, 18(4):343–373, 2008.
• Andrieu et al.  Christophe Andrieu, Éric Moulines, et al. On the ergodicity properties of some adaptive mcmc algorithms. The Annals of Applied Probability, 16(3):1462–1505, 2006.
• Andrieu et al.  Christophe Andrieu, Gareth O Roberts, et al. The pseudo-marginal approach for efficient monte carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
• Andrieu et al.  Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle markov chain monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
• Atchadé and Rosenthal  Yves F. Atchadé and Jeffrey S. Rosenthal. On adaptive Markov chain Monte Carlo algorithms. Bernoulli, 11(5):815–828, 10 2005. doi: 10.3150/bj/1130077595.
• Beal and Sheiner  Stuart Beal and Lewis Sheiner. The NONMEM system. The American Statistician, 34(2):118–119, 1980.
• Betancourt  Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.
• Brooks et al.  Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of markov chain monte carlo. CRC press, 2011.
• Brosse et al.  Nicolas Brosse, Alain Durmus, Éric Moulines, and Sotirios Sabanis. The tamed unadjusted langevin algorithm. arXiv preprint arXiv:1710.05559, 2017.
• Carpenter et al.  Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Goodrich Ben, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.
• Chan et al.  P. L. S. Chan, P. Jacqmin, M. Lavielle, L. McFadyen, and B. Weatherley. The use of the SAEM algorithm in MONOLIX software for estimation of population pharmacokinetic-pharmacodynamic-viral dynamics parameters of maraviroc in asymptomatic HIV subjects. Journal of Pharmacokinetics and Pharmacodynamics, 38(1):41–61, 2011.
• Comets et al.  Emmanuelle Comets, Audrey Lavenu, and Marc Lavielle. Parameter estimation in nonlinear mixed effect models using saemix, an r implementation of the saem algorithm. Journal of Statistical Software, 80(3):1–42, 2017.
• de Freitas et al.  Nando de Freitas, Pedro Højen-Sørensen, Michael I Jordan, and Stuart Russell. Variational mcmc.

Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence

, pages 120–127, 2001.
• Delyon et al.  Bernard Delyon, Marc Lavielle, and Eric Moulines. Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist., 27(1):94–128, 03 1999. doi: 10.1214/aos/1018031103.
• Donnet and Samson  Sophie Donnet and Adeline Samson. Using pmcmc in em algorithm for stochastic mixed models: theoretical and practical issues. Journal de la Société Française de Statistique, 155(1):49–72, 2013.
• Doucet et al.  Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential monte carlo sampling methods for bayesian filtering. Statistics and computing, 10(3):197–208, 2000.
• Griewank and Walther  Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation, volume 105. Siam, 2008.
• Haario et al.  Heikki Haario, Eero Saksman, Johanna Tamminen, et al. An adaptive metropolis algorithm. Bernoulli, 7(2):223–242, 2001.
• Hoffman and Gelman  Matthew D Hoffman and Andrew Gelman. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo.

Journal of Machine Learning Research

, 15(1):1593–1623, 2014.
• Kucukelbir et al.  Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, and David Blei. Automatic variational inference in stan. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 568–576. Curran Associates, Inc., 2015.
• Kuhn and Lavielle  Estelle Kuhn and Marc Lavielle. Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM: Probability and Statistics, 8:115–131, 2004.
• Lavielle  Marc Lavielle. Mixed effects models for the population approach: models, tasks, methods and tools. CRC press, 2014.
• Lavielle and Ribba  Marc Lavielle and Benjamin Ribba. Enhanced method for diagnosing pharmacometric models: random sampling from conditional distributions. Pharmaceutical research, 33(12):2979–2988, 2016.
• Louis  Thomas A. Louis. Finding the observed information matrix when using the em algorithm. Journal of the Royal Statistical Society, Series B: Methodological, 44:226–233, 1982.
• Mbogning et al.  Cyprien Mbogning, Kevin Bleakley, and Marc Lavielle. Joint modeling of longitudinal and repeated time-to-event data using nonlinear mixed-effects models and the SAEM algorithm. Journal of Statistical Computation and Simulation, 85(8):1512–1528, 2015. doi: 10.1080/00949655.2013.878938.
• McLachlan and Krishnan  Geoffrey McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.
• Mengersen and Tweedie  K. L. Mengersen and R. L. Tweedie. Rates of convergence of the hastings and metropolis algorithms. Ann. Statist., 24(1):101–121, 02 1996. doi: 10.1214/aos/1033066201.
• Metropolis et al.  Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953. doi: 10.1063/1.1699114.
• Migon et al.  H.S. Migon, D. Gamerman, and F. Louzada. Statistical Inference: An Integrated Approach, Second Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2014. ISBN 9781439878804.
• Neal et al.  Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
• O’Reilly and Aggeler  Robert A. O’Reilly and Paul M. Aggeler. Studies on coumarin anticoagulant drugs initiation of warfarin therapy without a loading dose. Circulation, 38(1):169–177, 1968.
• Pav  Steven E Pav. Madness: a package for multivariate automatic differentiation. 2016.
• Robert and Casella  Christian P. Robert and George Casella. Metropolis–Hastings Algorithms, pages 167–197. Springer New York, New York, NY, 2010.
• Roberts and Rosenthal  G. O. Roberts and J. S. Rosenthal. Optimal scaling of discrete approximations to langevin diffusions. J. R. Statist. Soc. B, 60:255–268, 1997.
• Roberts et al.  G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk metropolis algorithms. Ann. Appl. Probab., 7(1):110–120, 02 1997. doi: 10.1214/aoap/1034625254.
• Roberts and Rosenthal  Gareth O Roberts and Jeffrey S Rosenthal. Quantitative non-geometric convergence bounds for independence samplers. Methodology and Computing in Applied Probability, 13(2):391–403, 2011.
• Roberts and Tweedie  Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 12 1996.
• Rue et al.  Håvard Rue, Sara Martino, and Nicolas Chopin.

Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations.

Journal of the royal statistical society: Series b (statistical methodology), 71(2):319–392, 2009.
• Savic et al.  R. M. Savic, F. Mentré, and M. Lavielle. Implementation and evaluation of the SAEM algorithm for longitudinal ordered categorical data with an illustration in pharmacokinetics-pharmacodynamics. The AAPS Journal, 13(1):44–53, 2011.
• Stan Development Team  Stan Development Team. RStan: the R interface to Stan, 2018. URL http://mc-stan.org/. R package version 2.17.3.
• Stramer and Tweedie  O. Stramer and R. L. Tweedie. Langevin-type models i: Diffusions with given stationary distributions and their discretizations*. Methodology And Computing In Applied Probability, 1(3):283–306, Oct 1999. ISSN 1573-7713. doi: 10.1023/A:1010086427957.
• Titsias and Papaspiliopoulos  Michalis K. Titsias and Omiros Papaspiliopoulos. Auxiliary gradient‐based sampling algorithms. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 0(0), 2018. doi: 10.1111/rssb.12269.
• Verbeke  Geert Verbeke. Linear mixed models for longitudinal data. Springer, 1997.
• Vihola  Matti Vihola. Robust adaptive metropolis algorithm with coerced acceptance rate. Statistics and Computing, 22(5):997–1008, 2012.
• Wang  Yaning Wang. Derivation of various nonmem estimation methods. Journal of Pharmacokinetics and pharmacodynamics, 34(5):575–593, 2007.
• Zhang  Z. Zhang. Parametric regression model for survival data: Weibull regression model as an example. Ann Transl Med., 24, 2016.

## Appendix A Extensions of model (2)

Several extensions of model (2) are also possible. We can assume for instance that the transformed individual parameters are normally distributed:

 u(ψi)=u(ψpop)+ηi, (33)

where is a strictly monotonic transformation applied on the individual parameters . Examples of such transformation are the logarithmic function (in which case the components of

are log-normally distributed), the logit and the probit transformations

. In the following, we either use the original parameter or the Gaussian transformed parameter .

Another extension of model (2) consists in introducing individual covariates in order to explain part of the inter-individual variability:

 u(ψi)=u(ψpop)+Ciβ+ηi, (34)

where is a matrix of individual covariates. Here, the fixed effects are the vector of coefficients and the vector of typical parameters .

## Appendix B Calculus of the proposal in the noncontinuous case

Laplace approximation (see ) consists in approximating an integral of the form

 I:=∫ev(x)dx, (35)

where is at least twice differentiable.

The following second order Taylor expansion of the function around a point

 v(x)≈v(x0)+∇v(x0)(x−x0)+12(x−x0)∇2v(x0)(x−x0), (36)

provides an approximation of the integral

(consider a multivariate Gaussian probability distribution function which integral sums to 1):

 I≈ev(x0)√(2π)p|−∇2v(x0)|exp{−12∇v(x0)′∇2v(x0)−1∇v(x0)}. (37)

In our context, we can write the marginal pdf that we aim to approximate as

 pi(yi) =∫pi(yi,ψi)dψi (38) =∫elogpi(yi,ψi)dψi. (39)

Then, let

 v(ψi) :=logpi(yi,ψi) (40) =l(ψi)+logpi(ψi), (41)

and compute its Taylor expansion around the MAP . We have by definition that

 ∇logpi(yi,^ψi)=0,

which leads to the following Laplace approximation of :

 −2logpi(yi)≈−plog2π−2logpi(yi,^ψi)+log(∣∣−∇2logpi(yi,^ψi)∣∣).

We thus obtain the following approximation of the logarithm of the conditional pdf of given evaluated at :

 logpi(^ψi|yi)≈−p2log2π−12log(∣∣−∇2logpi(yi,^ψi)∣∣),

which is precisely the log-pdf of a multivariate Gaussian distribution with mean and variance-covariance with:

 ∇2logpi(yi,