1 Introduction
Bayesian inference offers a principled approach to learn about unknown variables from data using a probabilistic analysis. The conclusions we draw are based on the posterior distribution which, in all but the simplest cases, is intractable. We can however probe the posterior using a host of techniques, such as Markov chain Monte Carlo sampling and approximate Bayesian computation. Writing these algorithms is a tedious and error-prone endeavor, but fortunately modelers can often rely on existing software with efficient implementations.
In the field of pharmacometrics, statistical software such as NONMEM^{®} [Beal_undatedqu] and Monolix^{®} [monolix2021] support many routines to specify and analyze pharmacokinetic and pharmacodynamic population models. There also exist more general probabilistic programming languages such as BUGS [lunn2009bugs] and, more recently, Stan [Carpenter:2017], to name only a few examples. This tutorial focuses on Stan. Stan supports a rich library of probability densities, mathematical functions including matrix operations, and numerical solvers for differential equations. These features make for an expressive and flexible language; however, writing common pharmacometrics models can be tedious. Torsten extends Stan by providing a suite of functions to facilitate the specification of pharmacometrics models. These functions make it straightforward to model the event schedule of a clinical trial and to parallelize computation across patients for population models.
1.1 Why Stan?
We believe that Stan, coupled with Torsten, can be an important addition to the pharmacometrician’s toolkit, especially for Bayesian data analysis.
The most obvious strength of Stan is its flexibility: it is straightforward to specify priors, systems of ODEs, a broad range of measurement models, missing data models, and hierarchies (i.e. population models). Because of this flexibility, various data sources and their corresponding measurement models can be combined into one large model, over which full Bayesian inference can be performed [e.g. Weber:2018]. There are not many examples in pharmacometrics where the flexibility of Stan would be fully utilized, but we believe this is in part because such tools were not readily available to pharmacometricians in the past. The richness of the Stan language could well open the gate to new types of models.
Secondly, Stan supports state-of-the-art inference algorithms, most notably its adaptive Hamiltonian Monte Carlo sampler, a gradient-based Markov chain Monte Carlo algorithm [Betancourt:2018] based on the No-U-Turn sampler (NUTS) [Hoffman:2014]; automatic differentiation variational inference (ADVI) [Kucukelbir:2017]; and penalized maximum likelihood estimators. Stan's inference algorithms are supported by a modern automatic differentiation library that efficiently generates the requisite derivatives [Carpenter:2015]. It is worth pointing out that algorithms such as NUTS and ADVI were first developed and implemented in Stan, before being widely adopted by the applied statistics and machine learning communities. As of the writing of this article, new inference algorithms continue to be prototyped in Stan. Recent examples include adjoint-differentiated Laplace approximations [Margossian:2020], cross-chain warmup [Zhang:2020], and path finding for improved chain initialization [Zhang:2021].
Thirdly, Stan provides a rich set of diagnostics, including the detection of divergent transitions [Betancourt:2018] and the improved computation of effective sample sizes and scale reduction factors [Vehtari:2020], as well as detailed warning messages based on these diagnostics. The automated running of these diagnostics makes the platform more user-friendly and provides much guidance when troubleshooting our model and our inference.
Last but not least, both Stan and Torsten are open-source projects, meaning they are free and their source code can be examined and, if needed, scrutinized. The projects are under active development, with new features being added regularly.
1.2 Bayesian inference: notation, goals, and comments
Given observed data $y$ and latent variables $\theta$ from the parameter space $\Theta$, a Bayesian model is defined by the joint distribution $p(y, \theta)$. The latent variables can include model parameters, missing data, and more. In this tutorial, we are mostly concerned with estimating model parameters. The joint distribution admits a convenient decomposition,
$$p(y, \theta) = p(\theta) \, p(y \mid \theta),$$
with $p(\theta)$ the prior distribution and $p(y \mid \theta)$ the likelihood. The prior encodes information about the parameters, usually based on scientific expertise or results from previous analyses. The likelihood tells us how the data is distributed for a fixed parameter value and, per one interpretation, can be thought of as a "story of how the data is generated" [Gelman:2013b]. The Bayesian proposition is to base our inference on the posterior distribution, $p(\theta \mid y)$.
For typical pharmacometric applications, the full joint posterior density of the model parameters is an unfathomable object which lives in a high dimensional space. Usually we cannot even numerically evaluate the posterior density at any particular point! Instead we must probe the posterior distribution and learn the characteristics that interest us the most. In our experience, this often includes a measure of central tendency and a quantification of uncertainty, for example the mean and the variance, or the median and the 5^{th} and 95^{th} quantiles. For skewed or multimodal distributions, we may want a more refined analysis which looks at many quantiles. What we compute are estimates of these quantities. Most Bayesian inference involves calculations based on marginal posterior distributions. That typically requires integration over a high number of dimensions, an integration that is rarely tractable analytically or by numerical quadrature. One strategy is to generate approximate samples from the posterior distribution and then use the sample mean, sample variance, and sample quantiles as our estimators. Bayes' rule teaches us that
$$p(\theta \mid y) = \frac{p(y, \theta)}{p(y)} = \frac{p(y \mid \theta) \, p(\theta)}{p(y)}.$$
Typically we can evaluate the joint density $p(y, \theta)$ in the numerator but not the normalizing constant, $p(y)$, in the denominator. A useful method must therefore be able to generate samples using only the unnormalized posterior density, $p(y, \theta)$. Many Markov chain Monte Carlo (MCMC) algorithms are designed to do exactly this. Starting at an initial point, these chains explore the parameter space $\Theta$, one iteration at a time, to produce the desired samples. Hamiltonian Monte Carlo (HMC) is an MCMC method which uses the gradient to efficiently move across the parameter space [Neal:2012, Betancourt:2018]. Computationally, running HMC requires evaluating $\log p(y, \theta)$ and $\nabla_\theta \log p(y, \theta)$ many times across $\Theta$, i.e. for varying values of $\theta$ but fixed values of $y$. For this procedure to be well-defined, $\theta$ must be a continuous variable, else the requisite gradient does not exist. Discrete parameters require special treatment, which we will not discuss in this tutorial.
A Stan program specifies a method to evaluate $\log p(y, \theta)$. Thanks to automatic differentiation, this implicitly defines a procedure to compute $\nabla_\theta \log p(y, \theta)$ [Griewank:2008, Baydin:2018, Margossian:2019]. Together, these two objects provide all the relevant information about our model to run HMC sampling and other gradient-based inference algorithms.
1.3 Bayesian workflow
Bayesian inference is only one step of a broader modeling process, which we might call the Bayesian workflow [Betancourt:2018, Gabry:2019, Gelman:2020]. Once we fit the model, we need to check the inference and, if needed, fine-tune our algorithm or potentially change methods. And once we trust the inference, we naturally need to check the fitted model. Our goal is to understand the shortcomings of our model and motivate useful revisions. During the early stages of model development, this mostly comes down to troubleshooting our implementation; later, this "criticism" step can lead to deeper insights.
All through the tutorial, we will demonstrate how Stan and Torsten can be used to check our inference and our fitted model.
1.4 Setting up Stan and Torsten
Detailed instructions on installing Stan and Torsten can be found at https://github.com/metrumresearchgroup/Torsten. At its core, Stan is a C++ library, but it can be interfaced with one of many scripting languages, including R, Python, and Julia. We will use cmdstanr, a lightweight wrapper of Stan in R, along with the packages posterior [Bukner:2020], bayesplot [Gabry:2021], and loo [Gabry:2020]. We generate most of the figures in this paper using bayesplot, though at times we trade convenience for flexibility and fall back on ggplot2 [Wickham:2009].
The R and Stan code for all the examples are available at https://github.com/metrumresearchgroup/torsten_tutorial_1_supplementary.
1.5 Prerequisites and resources
Our aim is to provide a self-contained discussion, while trying to remain concise. The objectives of this tutorial are to describe basic model implementation and execution using Stan and Torsten, and the subsequent analysis of the results to generate MCMC diagnostics and limited model checking. Thus we illustrate key elements of a Bayesian workflow, but do not present complete Bayesian workflows for each example due to space limitations. We assume the reader is familiar with compartment models as they arise in pharmacokinetic and pharmacodynamic models, and has experience with data that describe a clinical event schedule. For the latter, we follow the convention established by NONMEM^{®}/NMTRAN^{®}/PREDPP^{®}
. Exposure to Bayesian statistics and inference algorithms is desirable, in particular an elementary understanding of standard MCMC methods. We expect the reader to know R but we don’t assume any background in Stan.
Helpful reads include the Stan User Manual [Stan:2021] and the Torsten User Manual [Torsten:2021]. Statistical Rethinking by McElreath (2020)[mcelreath2020statistical] provides an excellent tutorial on Bayesian analysis that may be used for selflearning. A comprehensive textbook on Bayesian modeling is Bayesian Data Analysis by Gelman et al (2013) [Gelman:2013b], with more recent insights on the Bayesian workflow provided by Gelman et al (2020) [Gelman:2020]. Betancourt (2018) [Betancourt:2018] offers an accessible discussion on MCMC methods, with an emphasis on HMC.
2 Two compartment model
As a starting example, we demonstrate the analysis of longitudinal plasma drug concentration data from a single individual using a linear 2 compartment model with first order absorption. The individual receives multiple doses at regular time intervals and the plasma drug concentration is recorded over time. Our goal is to estimate the posterior distribution of the parameters of the model describing the time course of the plasma drug concentrations in this individual.
2.1 Pharmacokinetic model and clinical event schedule
Let’s suppose an individual receives a drug treatment of 1200 mg boluses administered every 12 hours for a total of 14 doses (q12h × 14 doses). Drug concentrations are measured in plasma obtained from blood sampled at 0.083, 0.167, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, and 8 hours following the first, second, and final doses. In addition, we take measurements before each drug intake, as well as 12, 18, and 24 hours following the last dose. We analyze these data using a linear 2 compartment model with first order absorption:
(1a)  $y'_\text{gut} = -k_a \, y_\text{gut}$
(1b)  $y'_\text{cent} = k_a \, y_\text{gut} - \left( \frac{CL}{V_\text{cent}} + \frac{Q}{V_\text{cent}} \right) y_\text{cent} + \frac{Q}{V_\text{peri}} \, y_\text{peri}$
(1c)  $y'_\text{peri} = \frac{Q}{V_\text{cent}} \, y_\text{cent} - \frac{Q}{V_\text{peri}} \, y_\text{peri}$
with
$(y_\text{gut}, y_\text{cent}, y_\text{peri})$: drug mass in each compartment (mg),
$k_a$: absorption rate constant (h^{-1}),
$CL$: elimination clearance from the central compartment (L/h),
$Q$: intercompartmental clearance (L/h),
$V_\text{cent}$: volume of the central compartment (L),
$V_\text{peri}$: volume of the peripheral compartment (L).
Both intervention and measurement events are described by the event schedule, which follows the convention established by NONMEM. This means the user must provide, for each event, the following variables: cmt, evid, addl, ss, amt, time, rate, and ii. Table 1 provides an overview of the role of each variable; more details can be found in the Torsten User Manual.
Variable  Description 
cmt  Compartment in which event occurs. 
evid  Type of event: (0) measurement, (1) dosing. 
addl  For dosing events, number of additional doses. 
ss  Steady state indicator: (0) no, (1) yes. 
amt  Amount of drug administered. 
time  Time of the event. 
rate  For dosing by infusion, rate of infusion. 
ii  For events with multiple dosing, interdose interval. 
2.2 Statistical model
Given a treatment, $x$, and the physiological parameters, $\theta = (k_a, CL, Q, V_\text{cent}, V_\text{peri})$, we compute the drug plasma concentration, $\hat{c}(x, \theta)$, by solving the relevant ODEs. Our measurements, $c_\text{obs}$, are a perturbation of $\hat{c}$. This is to account for the fact that our model is not perfect and that our measurements have finite precision. We model this residual error using a lognormal distribution, that is
$$c_\text{obs} \sim \text{LogNormal}(\log \hat{c}, \sigma),$$
where $\sigma$ is a scale parameter we wish to estimate. The deterministic computation of $\hat{c}$, along with the measurement model, defines our likelihood function $p(c_\text{obs} \mid \theta, x)$, where $c_\text{obs}$ and $x$ are input data, i.e. the clinical event schedule. Note that we are not limited to the above simple model. Stan supports many distributions and allows users to write more complex residual models, with, for instance, proportional and additive error variance [Stan_func_ref:2021].
It remains to define a prior distribution, $p(\theta)$. Our prior should allocate probability mass to every plausible parameter value and exclude patently absurd values. For example, the volume of the central compartment is on the order of ten liters, but it cannot be the size of the Sun. In this simulated example, our priors for the individual parameters are based on population estimates from previous (hypothetical) studies.
Suggestions for building priors can be found in references [Gabry:2019, Betancourt:2020] and at https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations.
2.3 Specifying a model in Stan
We can now specify our statistical model using a Stan file, which is divided into coding blocks, each with a specific role. From R, we then run inference algorithms which take this Stan file as an input.
2.3.1 Data and parameters block
To define a model, we need a procedure which returns the log joint distribution, $\log p(y, \theta)$. Our first task is to declare the data, $y$, and the parameters, $\theta$, using the coding blocks data and parameters. It is important to distinguish the two. The data is fixed. By contrast, the parameter values change as HMC explores the parameter space, and gradients of the joint density are computed with respect to $\theta$, but not $y$.
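As a sketch of what these two blocks might look like for the two compartment model (the variable names are our own choices and need not match the supplementary code):

```stan
data {
  int<lower = 1> nEvent;        // number of events
  int<lower = 1> nObs;          // number of observations
  int<lower = 1> iObs[nObs];    // indices of observation events
  // NONMEM-style event schedule (see Table 1)
  real time[nEvent];
  real amt[nEvent];
  real rate[nEvent];
  real ii[nEvent];
  int evid[nEvent];
  int cmt[nEvent];
  int addl[nEvent];
  int ss[nEvent];
  vector<lower = 0>[nObs] cObs; // measured plasma concentrations
}

parameters {
  real<lower = 0> CL;
  real<lower = 0> Q;
  real<lower = 0> VC;
  real<lower = 0> VP;
  real<lower = 0> ka;
  real<lower = 0> sigma;
}
```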
For each variable we introduce, we must declare a type and, for containers such as arrays, vectors, and matrices, the size of the container
[Stan_users_guide:2021, Ch. 5]. In addition, each statement ends with a semicolon. It is possible to specify constraints on the parameters, using the keywords lower and upper. If one of these constraints is violated, Stan returns an error message. More importantly, constrained parameters are transformed into unconstrained parameters – for instance, positive variables are put on the log scale – which greatly improves computation.
2.3.2 Model block
Next, the model block allows us to modify the variable target, which Stan recognizes as the log joint distribution. The following statement increments target using the prior on $\sigma$, which is a normal density, truncated at 0 to only put mass on positive values.
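Such a statement might read as follows (a sketch; the scale of 1 is an illustrative value, not necessarily the one used in the supplementary code):

```stan
target += normal_lpdf(sigma | 0, 1);
```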
The truncation is implied by the fact that $\sigma$ is declared as lower-bounded by 0 in the parameters block. An alternative syntax is the following:
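That is, keeping the same illustrative half-normal prior:

```stan
sigma ~ normal(0, 1);
```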
This statement now looks like our statistical formulation and makes the code more readable. But we should be mindful that this is not a sampling statement, but rather instructions on how to increment target. We now give the full model block:
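The listing below is a sketch of what such a block might look like; the lognormal prior locations and scales are illustrative placeholders, and cObs, iObs, and concentrationHat follow the names used in the text:

```stan
model {
  // priors (hyperparameter values are illustrative)
  CL ~ lognormal(log(10), 0.25);
  Q ~ lognormal(log(15), 0.5);
  VC ~ lognormal(log(35), 0.25);
  VP ~ lognormal(log(105), 0.5);
  ka ~ lognormal(log(2.5), 1);
  sigma ~ normal(0, 1);

  // likelihood, restricted to observation events via iObs
  cObs ~ lognormal(log(concentrationHat[iObs]), sigma);
}
```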
The likelihood statement involves a crucial term we have not defined yet: concentrationHat. Additional variables can be created using the transformed data and transformed parameters blocks. We will take advantage of these to compute the drug concentration in the central compartment for each event. Note that for the likelihood, we only use the concentration during observation events, hence the indexing [iObs].
2.3.3 Transformed data and transformed parameters block
In transformed data, we can construct variables which only depend on the data. For this model, we simply specify the number of compartments in our model (including the gut), nCmt, and the numbers of pharmacokinetic parameters, nTheta, two variables which will come in handy shortly.
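A minimal sketch of this block (the compartment and parameter counts follow the model of Section 2.1):

```stan
transformed data {
  int nCmt = 3;    // number of compartments: gut, central, peripheral
  int nTheta = 5;  // number of PK parameters: CL, Q, VC, VP, ka
}
```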
Because the data is fixed, this operation is only computed once. By contrast, operations in the transformed parameters block need to be performed (and differentiated) for each new parameter value.
To compute concentrationHat we need to solve the relevant ODE within the clinical event schedule. Torsten provides a function which returns the drug mass in each compartment at each time point of the event schedule.
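Schematically, a call to this routine – here pmx_solve_twocpt, Torsten's analytical two compartment solver – might look as follows, with the event schedule variables named as in Table 1 (a sketch; nEvent is our name for the number of events):

```stan
// drug mass in each compartment (rows) at each event time (columns)
matrix<lower = 0>[nCmt, nEvent] mass
  = pmx_solve_twocpt(time, amt, rate, ii, evid, cmt, addl, ss, theta);
```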
The first eight arguments define the event schedule and the last argument, theta, is an array containing the pharmacokinetic parameters, defined as follows:
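Assuming the parameter ordering expected by pmx_solve_twocpt is CL, Q, VC, VP, ka (check the Torsten User Manual to confirm), the definition might read:

```stan
real theta[nTheta] = {CL, Q, VC, VP, ka};
```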
It is also possible to have theta change between events, and specify lag times and bioavailability fractions, although we will not take advantage of these features in the example at hand.
The Torsten function we have chosen to use solves the ODEs analytically. Other routines use a matrix exponential, a numerical solver, or a combination of analytical and numerical methods [Margossian:2017]. It now remains to compute the concentration in the central compartment at the relevant times. The full transformed parameters block is as follows:
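Putting the pieces together, a sketch of this block (variable names are ours; the drug mass in the central compartment, stored in the second row of mass, is divided by the central volume to obtain a concentration):

```stan
transformed parameters {
  real theta[nTheta] = {CL, Q, VC, VP, ka};
  row_vector<lower = 0>[nEvent] concentrationHat;
  matrix<lower = 0>[nCmt, nEvent] mass;

  mass = pmx_solve_twocpt(time, amt, rate, ii, evid, cmt, addl, ss, theta);
  concentrationHat = mass[2, ] / VC;  // concentration in the central compartment
}
```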
The Stan file contains all the coding blocks in the following order: data, transformed data, parameters, transformed parameters, model. The full Stan code can be found in the Supplementary Material.
2.4 Calling Stan from R
The R package cmdstanr allows us to run a number of algorithms on a model defined in a Stan file. An excellent place to get started with the package is https://mcstan.org/cmdstanr/articles/cmdstanr.html.
The first step is to “transpile” the file – call it twocpt.stan – that is, translate the file into C++ and then compile it.
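With cmdstanr, this step might look as follows (assuming twocpt.stan sits in the working directory):

```r
library(cmdstanr)
mod <- cmdstan_model("twocpt.stan")  # translate to C++ and compile
```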
We can then run Stan’s HMC sampler by passing in the requisite data and providing other tuning parameters. Here: (i) the number of Markov chains (which we run in parallel), (ii) the initial value for each chain, (iii) the number of warmup iterations, and (iv) the number of sampling iterations.
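A sketch of such a call, where data and init are placeholders for the data list and the initialization function; the choice of 4 chains with 500 warmup and 500 sampling iterations each is illustrative:

```r
fit <- mod$sample(data = data, init = init,
                  chains = 4, parallel_chains = 4,
                  iter_warmup = 500, iter_sampling = 500)
```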
There are several other arguments we can pass to the sampler and which we will take advantage of throughout the tutorial. For applications in pharmacometrics, we recommend specifying the initial starting points via the init argument, as the defaults may not be appropriate. In this tutorial, we draw the initial points from their priors by defining an appropriate R function.
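For instance, an initialization function drawing from illustrative priors might be (the hyperparameter values are placeholders):

```r
init <- function() {
  list(CL = rlnorm(1, log(10), 0.25),
       Q = rlnorm(1, log(15), 0.5),
       VC = rlnorm(1, log(35), 0.25),
       VP = rlnorm(1, log(105), 0.5),
       ka = rlnorm(1, log(2.5), 1),
       sigma = abs(rnorm(1, 0, 1)))
}
```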
The resulting fit object stores the samples generated by HMC, from which we can deduce the sample mean, sample variance, and sample quantiles of our posterior distribution. This information is readily accessible using fit$summary() and is summarized in Table 2. We could also extract the samples and perform any number of operations on them.
        mean   median  sd      mad     q5     q95    Rhat  ESS_bulk  ESS_tail
CL      10.0   10.0    0.378   0.367   9.39   10.6   1.00  1580      1348
Q       19.8   19.5    4.00    4.01    13.8   26.8   1.00  985       1235
V_cent  41.2   40.8    9.71    9.96    25.6   57.7   1.00  732       1120
V_peri  124    123     18.0    18.0    97.1   155    1.00  1877      1279
k_a     1.73   1.67    0.523   0.522   1.01   2.68   1.00  762       1108
sigma   0.224  0.222   0.0244  0.0232  0.187  0.269  1.01  1549      1083
The first columns return sample estimates of the posterior mean, median, standard deviation, median absolute deviation, and 5^{th} and 95^{th} quantiles, based on our approximate samples. The next three columns return the $\hat{R}$ statistic and the effective sample sizes for bulk and tail estimates, and can be used to identify problems with our inference.
2.5 Checking our inference
Unfortunately there is no guarantee that a particular algorithm will work across all the applications we will encounter. We can however make sure that certain necessary conditions are met. Much of the MCMC literature focuses on estimating expectation values, and we will use these results to develop some intuition.
2.5.1 Central limit theorem
Many common MCMC concepts, such as the effective sample size, are amenable to intuitive interpretations. But to really grasp their meaning and take advantage of them, we must examine them in the context of central limit theorems.
For any function $f$ of our latent parameters $\theta$, we define the posterior mean to be
$$\mathbb{E}[f(\theta) \mid y] = \int_\Theta f(\theta) \, p(\theta \mid y) \, \mathrm{d}\theta,$$
a quantity also termed the expectation value. If we were able to generate exact independent samples, $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n)} \sim p(\theta \mid y)$, one sensible estimator would be the sample mean,
$$\bar{f}_n = \frac{1}{n} \sum_{i = 1}^n f\big(\theta^{(i)}\big).$$
Then, provided the variance of $f(\theta)$ is finite, the central limit theorem teaches us that the sample mean converges, as the sample size increases, to a normal distribution. We will not dwell on the technical details of the theorem and go straight to its practical implication, namely:
$$\bar{f}_n \overset{\text{approx}}{\sim} \text{Normal}\left(\mathbb{E}[f(\theta) \mid y], \ \sqrt{\frac{\mathrm{Var}[f(\theta) \mid y]}{n}}\right),$$
where the deviation from the approximating normal has order $\mathcal{O}(1/n)$. This means that even for a moderate sample size, the approximation tends to be very good. This is a powerful result for two reasons: first, it tells us that our estimator is unbiased; more importantly, the expected squared error is driven by the variance of $f(\theta)$ divided by our sample size, $n$.
Unfortunately, estimates constructed with MCMC samples will, in general, neither be unbiased, nor will their variance decrease at a $1/n$ rate. For our estimators to be useful, we must therefore check that our samples are unbiased and then use a corrected central limit theorem.
2.5.2 Checking for bias with $\hat{R}$
MCMC samples are biased for several reasons. Perhaps the most obvious one is that Markov chains generate correlated samples, meaning any sample has some correlation with the initial point. If we run the algorithm for enough iterations, the correlation to the initial point becomes negligible and the chain “forgets” its starting point. But what constitutes enough iterations? It isn’t hard to construct examples where removing the initial bias in any reasonable time is a hopeless endeavor.
To identify this bias, we run multiple Markov chains, each started at a different point, and check that they all converge to the same region of the parameter space. More precisely, we shouldn’t be able to distinguish the Markov chains based on the samples alone. One way to check this is to compute the $\hat{R}$ statistic, for which we provide an intuitive definition:
$$\hat{R} \approx \sqrt{\frac{\text{variance of the samples pooled across all chains}}{\text{average within-chain variance}}}.$$
If the chains are mixing properly, then $\hat{R}$ converges to 1 as the number of samples increases. Moreover, we want $\hat{R} \lesssim 1.01$, as is the case in Table 2. Stan uses an improved $\hat{R}$ statistic described in a recent paper by Vehtari et al (2020) [Vehtari:2020]. Further discussion on how to interpret $\hat{R}$ and what it measures can be found in Margossian et al (2021) [Margossian:2021, Section 2]. In addition to $\hat{R}$, we can visually check that the chains are properly mixing using a trace plot (Figure 1).
If $\hat{R} \gg 1$ and, more generally, if the chains were not mixing, this would be cause for concern and an invitation to adjust our inference method. Even when $\hat{R} \approx 1$, we should entertain the possibility that all the chains suffer from the same bias. Stan offers additional diagnostics to identify sampling biases, notably by reporting divergent transitions of the HMC sampler, a topic we will discuss when we fit more sophisticated models.
2.5.3 Controlling the variance of our estimator
Let’s assume that our samples are indeed unbiased. The expected error of our estimator is now determined by the variance. Under certain regularity conditions, our estimator follows an MCMC central limit theorem,
$$\bar{f}_n \overset{\text{approx}}{\sim} \text{Normal}\left(\mathbb{E}[f(\theta) \mid y], \ \sqrt{\frac{\mathrm{Var}[f(\theta) \mid y]}{n_\text{eff}}}\right),$$
where the key difference with the standard central limit theorem is that $\mathrm{Var}[f(\theta) \mid y]$ is now divided by the effective sample size, $n_\text{eff}$, rather than the sample size. This change accounts for the fact that our samples are not independent: their correlation induces a loss in information, which increases the variance of our estimator. For $CL$, we have 2,000 samples, but the effective sample size is 1,580 (Table 2). If $n_\text{eff}$ is low, our estimator may not be precise enough and we should generate more samples.
The effective sample size is only formally defined in the context of estimators for expectation values. We may also be interested in tail quantities, such as extreme quantiles, which are much more difficult to estimate and require many more samples to achieve a desired precision. Vehtari et al (2020) [Vehtari:2020] propose a generalization of the effective sample size for such quantities, and introduce the tail effective sample size. This is to be distinguished from the traditional effective sample size, henceforth the bulk effective sample size. Both quantities are reported by Stan.
2.6 Checking the model: posterior predictive checks
Once we develop enough confidence in our inference, we still want to check our fitted model. There are many ways of doing this. We may look at the posterior distribution of an interpretable parameter and see if it suggests implausible values. Or we may evaluate the model’s ability to perform a certain task, e.g. classification or prediction, as is often done in machine learning. In practice, we find it useful to do posterior predictive checks (PPC), that is simulate data from the fitted model and compare the simulation to the observed data [Gelman:2013, chapter 6]. Mechanically, the procedure is straightforward:

1. Draw the parameters from their posterior, $\tilde{\theta} \sim p(\theta \mid y)$.

2. Draw new observations from the likelihood, conditional on the drawn parameters, $\tilde{y} \sim p(y \mid \tilde{\theta})$.

This amounts to drawing observations from their posterior predictive distribution, that is $\tilde{y} \sim p(\tilde{y} \mid y)$. Both the uncertainty due to our estimation and the uncertainty due to our measurement model propagate to our predictions.
Stan provides a generated quantities block, which allows us to compute values, based on sampled parameters. In our two compartment model example, the following code draws new observations from the likelihood:
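A sketch of such a block, reusing the names introduced above (concentrationObsPred is our name for the simulated observations):

```stan
generated quantities {
  real concentrationObsPred[nObs];
  for (i in 1:nObs)
    concentrationObsPred[i]
      = lognormal_rng(log(concentrationHat[iObs[i]]), sigma);
}
```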
We generate predictions at the observed points for each sampled parameter, $\tilde{\theta}^{(i)}$. This gives us a sample of predictions, and we can use the 5^{th} and 95^{th} quantiles to construct a credible interval. We may then plot the observations and the credible intervals (Figure 2) and see that, indeed, the data generated by the model is consistent with the observations.
2.7 Comparing models: leave-one-out cross-validation
Beyond model criticism, we may be interested in model comparison. Continuing our running example, we compare our two compartment model to a one compartment model, which is also supported by Torsten via the pmx_solve_onecpt routine. The corresponding posterior predictive checks are shown in Figure 3.
There are several ways of comparing models, and which method is appropriate crucially depends on the insights we wish to gain. If our goal is to assess a model’s ability to make good out-of-sample predictions, we may consider Bayesian leave-one-out (LOO) cross-validation. The premise of cross-validation is to exclude a point, $(x_i, y_i)$, from the training set, i.e. the set of data to which we fit the model. Here $x_i$ denotes the covariate, in our example the relevant row in the event schedule. We denote the reduced data set $y_{-i}$. We then generate a prediction, $\tilde{y}_i$, using the fitted model, and compare it to $y_i$. A classic metric for this comparison is the squared error, $(y_i - \tilde{y}_i)^2$.
Another approach is to use the LOO estimate of out-of-sample predictive fit:
$$\text{elpd}_\text{loo} = \sum_{i = 1}^{n} \log p(y_i \mid y_{-i}).$$
Here, no prediction is made. We instead examine how consistent an “unobserved” data point is with our fitted model. Computing this estimator is expensive, since it requires fitting the model to $n$ different training sets in order to evaluate each term in the sum.
Vehtari et al (2016) [Vehtari:2016] propose an estimator of $\text{elpd}_\text{loo}$ which uses Pareto smoothed importance sampling and only requires a single model fit. The premise is to compute $\log p(y_i \mid y)$ and correct this value, using importance sampling, to estimate $\log p(y_i \mid y_{-i})$. Naturally this estimator may be inaccurate. What makes this tool so useful is that we can use the Pareto shape parameter, $\hat{k}$, to assess how reliable the estimate is. In particular, if $\hat{k} > 0.7$, the estimate shouldn’t be trusted. The estimator is implemented in the R package loo. See [Gabry:2020] for more details, including its connection and comparison to the widely applicable information criterion (WAIC).
Conveniently, we can compute $\log p(y_i \mid \theta)$ in Stan’s generated quantities block.
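A sketch of such a block; log_lik is the variable name the loo package looks for by default:

```stan
generated quantities {
  real log_lik[nObs];  // pointwise log likelihood, extracted by loo
  for (i in 1:nObs)
    log_lik[i]
      = lognormal_lpdf(cObs[i] | log(concentrationHat[iObs[i]]), sigma);
}
```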
These results can then be extracted and fed into loo to compute $\text{elpd}_\text{loo}$. The file twoCpt.r in the Supplementary Material shows exactly how to do this. Figure 4 plots the estimated $\text{elpd}_\text{loo}$ for each model, along with a standard deviation, and shows that the two compartment model has better out-of-sample predictive capabilities.
3 Two compartment population model
We now consider the scenario where we have data from multiple patients and fit a population model. Population models are a powerful tool to capture the heterogeneity between patients, while also recognizing similarities. Building the right prior allows us to pool information between patients, the idea being that what we learn from one patient teaches us something – though not everything – about the other patients. In practice, such models can frustrate inference algorithms and need to be implemented with care [Betancourt:2013]. We start with an example where the interaction between the model and our MCMC sampler is well behaved. In Part II of this tutorial, we will examine a more difficult case, for which we will leverage Stan’s diagnostic capabilities in order to run reliable inference.
3.1 Statistical model
Let $\theta$ be the 2D array of body-weight-normalized pharmacokinetic parameters for each patient, with $\theta_j$ the parameters of the $j^\text{th}$ patient. We construct a population model by introducing random variation to describe otherwise unexplained interindividual variability. In a Bayesian context this is sometimes referred to as a prior distribution for the individual parameters,
$$\log \theta_j \sim \text{Normal}(\log \theta_\text{pop}, \Omega).$$
As before, we work on the log scale to account for the fact that the pharmacokinetic parameters are constrained to be positive. $\theta_\text{pop}$ is the population mean (on the logarithmic scale) and $\Omega$ the population covariance matrix. Both $\theta_\text{pop}$ and $\Omega$ are estimated. In this example, we start with the simple case where $\Omega$ is diagonal. For our example we will also use conventional allometric scaling to adjust the clearance and volume parameters for body weight.
The likelihood remains mostly unchanged, with the caveat that it must now be computed for each patient. Putting this all together, we have the following model, as specified by the joint distribution,
$$p(y, \theta, \theta_\text{pop}, \Omega, \sigma) = p(\theta_\text{pop}) \, p(\Omega) \, p(\sigma) \prod_{j} p(\theta_j \mid \theta_\text{pop}, \Omega) \, p(y_j \mid \theta_j, \sigma).$$
3.2 Specifying the model in Stan
We begin by adjusting our parameters block:
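A sketch of what the adjusted block might contain; the names, prior structure, and the exact form of the $k_a$ constraint are our own simplifications:

```stan
parameters {
  // population means
  real<lower = 0> CLHat;
  real<lower = 0> QHat;
  real<lower = 0> VCHat;
  real<lower = 0> VPHat;
  // a constraint expressed with other model variables: forcing the
  // absorption rate above the elimination rate avoids "flip-flop"
  real<lower = CLHat / VCHat> kaHat;
  // population standard deviations (Omega is diagonal here)
  vector<lower = 0>[nTheta] omega;
  real<lower = 0> sigma;
  // individual, body-weight-normalized parameters
  vector<lower = 0>[nTheta] theta[nSubjects];
}
```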
The declaration for $k_a$ illustrates that constraints may be expressions involving other variables in the model. In this case $k_a$ is constrained to avoid identifiability problems due to “flip-flop”.
The variable $\theta_\text{pop}$ is introduced in transformed parameters, mostly for convenience purposes:
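For example, collecting the population means into an array that mirrors the ordering of the individual parameters (names are ours):

```stan
real thetaPop[nTheta] = {CLHat, QHat, VCHat, VPHat, kaHat};
```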
The model block reflects our statistical formulation:
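A sketch of this block, with illustrative prior hyperparameters and the names introduced above; the lognormal statements on theta[j] implement $\log \theta_j \sim \text{Normal}(\log \theta_\text{pop}, \Omega)$ with diagonal $\Omega$:

```stan
model {
  // priors on the population parameters (values illustrative)
  CLHat ~ lognormal(log(10), 0.25);
  QHat ~ lognormal(log(15), 0.5);
  VCHat ~ lognormal(log(35), 0.25);
  VPHat ~ lognormal(log(105), 0.5);
  kaHat ~ lognormal(log(2.5), 1);
  omega ~ lognormal(log(0.25), 0.5);
  sigma ~ normal(0, 1);

  // population model
  for (j in 1:nSubjects)
    theta[j] ~ lognormal(log(to_vector(thetaPop)), omega);

  // likelihood over the observed events
  cObs ~ lognormal(log(concentrationObs[iObs]), sigma);
}
```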
In the transformed parameters block we also declare and calculate the individual parameters, given the population parameters and any relevant covariates – body weight in this case.
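A sketch of this computation, assuming allometric exponents of 0.75 for clearances and 1 for volumes, a 70 kg reference weight, and our own variable names:

```stan
// allometric scaling of the body-weight-normalized parameters
for (j in 1:nSubjects) {
  CL[j] = theta[j, 1] * (weight[j] / 70)^0.75;
  Q[j]  = theta[j, 2] * (weight[j] / 70)^0.75;
  VC[j] = theta[j, 3] * (weight[j] / 70);
  VP[j] = theta[j, 4] * (weight[j] / 70);
  ka[j] = theta[j, 5];
}
```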
It remains to compute concentrationObs. There are several ways to do this and, depending on the computational resources available, we may compute the concentration for each patient either sequentially or in parallel. For now, we take the simpler sequential approach. In the upcoming Part II of this tutorial, we examine how Torsten offers easy-to-use parallelization for population models.
Sequentially computing the concentration is a simple matter of bookkeeping. In transformed parameters we loop through the patients using a for loop. The code is identical to what we used in Section 2.3.3, with the caveat that the arguments to pmx_solve_twocpt are now indexed to indicate for which patient we compute the drug mass. For example, assuming the time schedule is ordered by patient, the event times corresponding to the $j^\text{th}$ patient are given by
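In Stan's slicing notation this reads:

```stan
time[start[j]:end[j]]
```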
where start[j] and end[j] contain the indices of the first and last events for the $j^\text{th}$ patient, and the syntax for indexing is as in R. The full for loop is then
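A sketch of such a loop, assuming the individual parameters CL[j], Q[j], VC[j], VP[j], and ka[j] have been computed from the covariates as above:

```stan
for (j in 1:nSubjects) {
  mass[, start[j]:end[j]]
    = pmx_solve_twocpt(time[start[j]:end[j]], amt[start[j]:end[j]],
                       rate[start[j]:end[j]], ii[start[j]:end[j]],
                       evid[start[j]:end[j]], cmt[start[j]:end[j]],
                       addl[start[j]:end[j]], ss[start[j]:end[j]],
                       {CL[j], Q[j], VC[j], VP[j], ka[j]});
  concentrationObs[start[j]:end[j]] = mass[2, start[j]:end[j]] / VC[j];
}
```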
Note that the last array argument in pmx_solve_twocpt is generated using {} syntax.
Once we have written our Stan model, we can apply the same methods for inference and diagnostics as we did in the previous section.
3.3 Posterior predictive checks
We follow the exact same procedure as in Section 2.6, even using the same lines of code, to simulate new observations for the patients we analyzed. Figure 5 plots posterior predictions for each individual patient. In addition, we simulate new observations for hypothetical new patients by: (i) drawing pharmacokinetic parameters from our population distribution, (ii) solving the ODEs with these simulated parameters, and (iii) using our measurement model to simulate new observations. These predictions are also shown in Figure 5 for each individual. Figure 6 depicts a composite posterior predictive check for all individuals. The generated quantities block then looks as follows:
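The sketch below implements steps (i)-(iii); it assumes logThetaHat is available from transformed parameters and that iObs indexes the observed events (names are illustrative):

```stan
generated quantities {
  array[nObs] real cObsCond;  // replicated data for the fitted individuals
  array[nObs] real cObsPred;  // replicated data for hypothetical new individuals
  row_vector[nt] concentrationPred;
  {
    matrix[3, nt] xPred;
    for (j in 1:nSubjects) {
      // (i) draw new individual parameters from the population distribution
      vector[5] logThetaPred = to_vector(normal_rng(logThetaHat, omega));
      real CLPred = exp(logThetaPred[1]) * (weight[j] / 70)^0.75;
      real QPred  = exp(logThetaPred[2]) * (weight[j] / 70)^0.75;
      real VCPred = exp(logThetaPred[3]) * (weight[j] / 70);
      real VPPred = exp(logThetaPred[4]) * (weight[j] / 70);
      real kaPred = exp(logThetaPred[5]);
      // (ii) solve the ODEs over the patient's event schedule
      xPred[, start[j]:end[j]]
        = pmx_solve_twocpt(time[start[j]:end[j]], amt[start[j]:end[j]],
                           rate[start[j]:end[j]], ii[start[j]:end[j]],
                           evid[start[j]:end[j]], cmt[start[j]:end[j]],
                           addl[start[j]:end[j]], ss[start[j]:end[j]],
                           {CLPred, QPred, VCPred, VPPred, kaPred});
      concentrationPred[start[j]:end[j]] = xPred[2, start[j]:end[j]] / VCPred;
    }
    // (iii) simulate new observations with the measurement model
    for (i in 1:nObs) {
      cObsCond[i] = lognormal_rng(log(concentrationObs[iObs[i]]), sigma);
      cObsPred[i] = lognormal_rng(log(concentrationPred[iObs[i]]), sigma);
    }
  }
}
```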
It is worth noting that the computational cost of running operations in generated quantities is relatively small. These operations are executed once per iteration in order to generate posterior samples of the generated quantities; by contrast, the operations in the transformed parameters and model blocks are run and differentiated multiple times per iteration, meaning they amply dominate the computation. Hence the cost of posterior predictive checks, even when they involve solving ODEs, is marginal. The computational scaling of Stan, notably for ODE-based models, is discussed by Grinsztajn et al. (2021) [Grinsztajn:2021].
4 Nonlinear pharmacokinetic / pharmacodynamic model
Now let us consider a PK-PD model described in terms of a nonlinear ODE system that requires the use of a numerical solver. The patient receives multiple doses at regular time intervals, and the drug plasma concentration is recorded over time.
4.1 Nonlinear pharmacokinetic / pharmacodynamic model
In this last example, we return to the single-patient two-compartment model and augment it with a PD model. Specifically, we examine the Friberg-Karlsson semi-mechanistic model for drug-induced myelosuppression [3181, 2364, 2518, 3187, 3188, 3537], with the goal of modeling the relation between neutrophil counts and drug exposure. The model describes a delayed feedback mechanism that keeps the absolute neutrophil count (ANC) at the baseline $\text{Circ}_0$ in a circulatory compartment ($\text{Circ}$), as well as the drug effect that perturbs this mechanism. The delay between proliferative cells ($\text{Prol}$) and $\text{Circ}$ is modeled by $n = 3$ transit compartments with mean transit time
$$\text{MTT} = \frac{n + 1}{k_\text{tr}}, \quad (2)$$
where $k_\text{tr}$ is the transit rate constant. Figure 7 summarizes the model (see also [3181, Figure 2]).
The PD likelihood is
$$\text{ANC} \sim \text{LogNormal}(\log \text{Circ}, \sigma_\text{ANC}^2),$$
where $c$ is the drug concentration calculated from the PK model, and $(\text{Prol}, \text{Transit}_1, \text{Transit}_2, \text{Transit}_3, \text{Circ})$ solves the nonlinear ODE:
\begin{align}
\frac{\mathrm{d}\,\text{Prol}}{\mathrm{d}t} &= k_\text{tr} \text{Prol} \left[ (1 - E_\text{drug}) \left( \frac{\text{Circ}_0}{\text{Circ}} \right)^\gamma - 1 \right], \tag{3a} \\
\frac{\mathrm{d}\,\text{Transit}_1}{\mathrm{d}t} &= k_\text{tr} (\text{Prol} - \text{Transit}_1), \tag{3b} \\
\frac{\mathrm{d}\,\text{Transit}_2}{\mathrm{d}t} &= k_\text{tr} (\text{Transit}_1 - \text{Transit}_2), \tag{3c} \\
\frac{\mathrm{d}\,\text{Transit}_3}{\mathrm{d}t} &= k_\text{tr} (\text{Transit}_2 - \text{Transit}_3), \tag{3d} \\
\frac{\mathrm{d}\,\text{Circ}}{\mathrm{d}t} &= k_\text{tr} (\text{Transit}_3 - \text{Circ}). \tag{3e}
\end{align}
We use
$$E_\text{drug} = \min(\alpha c, 1)$$
to model the linear effect of the drug once it has been absorbed into the central compartment. This effect reduces the proliferation rate and induces a reduction in the neutrophil count. The upper bound of 1 on $E_\text{drug}$ excludes the scenario where the feedback loop becomes negative for large $c$. While we expect that $\alpha c < 1$ for any reasonable parameter values, we should anticipate the possibility that our Markov chains encounter less well-behaved values as they explore the parameter space. Encoding such constraints can lead to improved numerical stability when solving the ODE.
We obtain the complete ODE system for the PK-PD model by coupling equations (1) and (3). Because the system is nonlinear, we can no longer resort to analytical solutions as we have done in the previous sections.
4.1.1 Numerically solving ODEs
To solve an ODE numerically in Stan we first need to define a function that returns the right-hand side of the ODE, i.e. the time derivative of the solution, in the functions block. The functions block allows users to define functions and is written at the top of the Stan file before the data block.
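Assuming the Torsten ODE signature with a real-data and integer-data argument, and a parameter ordering of {CL, Q, VC, VP, ka, mtt, circ0, gamma, alpha} (both assumptions for illustration), the function might look as follows:

```stan
functions {
  array[] real twoCptNeutModelODE(real t, array[] real y, array[] real theta,
                                  array[] real rdata, array[] int idata) {
    real CL = theta[1];
    real Q = theta[2];
    real VC = theta[3];
    real VP = theta[4];
    real ka = theta[5];
    real mtt = theta[6];
    real circ0 = theta[7];
    real gamma = theta[8];
    real alpha = theta[9];

    real k10 = CL / VC;
    real k12 = Q / VC;
    real k21 = Q / VP;
    real ktr = 4 / mtt;  // three transit compartments: ktr = (n + 1) / MTT

    real conc = y[2] / VC;
    real EDrug = fmin(alpha * conc, 1.0);

    // PD states are stored as deviations from the baseline circ0
    real prol = y[4] + circ0;
    real transit1 = y[5] + circ0;
    real transit2 = y[6] + circ0;
    real transit3 = y[7] + circ0;
    real circ = fmax(machine_precision(), y[8] + circ0);

    array[8] real dydt;
    // PK: two-compartment model with first-order absorption
    dydt[1] = -ka * y[1];
    dydt[2] = ka * y[1] - (k10 + k12) * y[2] + k21 * y[3];
    dydt[3] = k12 * y[2] - k21 * y[3];
    // PD: Friberg-Karlsson feedback model
    dydt[4] = ktr * prol * ((1 - EDrug) * (circ0 / circ)^gamma - 1);
    dydt[5] = ktr * (prol - transit1);
    dydt[6] = ktr * (transit1 - transit2);
    dydt[7] = ktr * (transit2 - transit3);
    dydt[8] = ktr * (transit3 - circ);
    return dydt;
  }
}
```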
The above function is an almost direct translation of Eqs. (1) and (3). The first three components of dydt describe the PK. The next five components of dydt describe the PD as deviations from the baseline $\text{Circ}_0$. Writing the ODE as a difference from the baseline means the initial PD condition is 0, as opposed to a parameter-dependent value. This results in better computation, because derivatives of the ODE solution with respect to the initial conditions no longer need to be computed; for more details, see [Grinsztajn:2021, Section 5.2]. In addition, we encode a constraint on the circulatory compartment,
$$\text{Circ} = \max(\text{Circ}, \varepsilon),$$
where $\varepsilon$ is the machine precision and can be interpreted as the smallest nonzero number the computer can handle. This improves numerical stability, especially during the early stages of MCMC exploration, when we may need to handle somewhat implausible parameter values.
Stan and Torsten provide several numerical solvers. In this example we use the Runge-Kutta solver pmx_solve_rk45 [Torsten:2021, Section 3.4]. The signature of pmx_solve_rk45 is a bit more sophisticated than that of pmx_solve_twocpt and requires the following arguments:
- the name of the user-defined ODE function (twoCptNeutModelODE),
- the number of states/compartments in the ODE,
- the event schedule,
- the bioavailability fraction, $F$, and the dosing lag time, $t_\text{lag}$, for each compartment (optional),
- the tuning parameters for the ODE solver (optional).
Because arguments are nameless in Stan, we can only pass the ODE tuning parameters if we also pass $F$ and $t_\text{lag}$. By setting $F$ to 1 and $t_\text{lag}$ to 0 for each compartment, we effectively ignore their effect. This is best done in the transformed data block:
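For instance (the variable names are illustrative):

```stan
transformed data {
  int nCmt = 8;  // 3 PK states + 5 PD states
  array[nCmt] real biovar = rep_array(1.0, nCmt);  // F = 1: full bioavailability
  array[nCmt] real tlag = rep_array(0.0, nCmt);    // t_lag = 0: no dosing lag
}
```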
Numerical solvers in Stan and Torsten admit three tuning parameters:
- rtol: relative tolerance to determine solution convergence,
- atol: absolute tolerance to determine solution convergence,
- max_num_step: maximum number of steps allowed.
Though Stan and Torsten provide default values, we highly recommend that the user define the ODE solver control parameters in the data block:
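A sketch of the corresponding declarations:

```stan
data {
  // ... event schedule and measurements, as before ...
  real<lower = 0> rtol;         // relative tolerance
  real<lower = 0> atol;         // absolute tolerance
  int<lower = 1> max_num_step;  // maximum number of solver steps
}
```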
Users should make problem-dependent decisions on rtol and atol, according to the expected scale of the unknowns, so that the numerical error does not affect our inference. For example, when an unknown can be neglected below a certain threshold without affecting the rest of the dynamic system, setting atol greater than that threshold avoids spurious and error-prone computation. For more details, see [Stan_users_guide:2021, Chapter 13] and [Torsten:2021, Section 3.7.5] and references therein.
As before, we solve the ODE within the event schedule in the transformed parameters block:
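A sketch of this call, assuming theta collects the nine PK-PD parameters in the order expected by the ODE function (an assumption of this sketch):

```stan
transformed parameters {
  row_vector[nt] concentration;
  row_vector[nt] neutrophil;
  matrix[8, nt] x;
  array[9] real theta = {CL, Q, VC, VP, ka, mtt, circ0, gamma, alpha};

  x = pmx_solve_rk45(twoCptNeutModelODE, 8,
                     time, amt, rate, ii, evid, cmt, addl, ss,
                     theta, biovar, tlag,
                     rtol, atol, max_num_step);
  concentration = x[2] / VC;
  neutrophil = x[8] + circ0;  // add the baseline back to the deviation
}
```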
4.1.2 Solving PK-PD ODEs as a coupled system
The approach in the last section applies to all models that involve ODE solutions, but we will not use it here. An astute observer may have noticed that this PK-PD model exhibits a particular one-way coupling structure: the PD (Eq. 3) depends on the PK (Eq. 1) through the drug concentration, but not the other way around, such that the PK can be solved independently from the PD. This is what motivates Torsten's coupled solvers, which analytically solve the PK ODEs before passing the PK solution to the PD ODE. The PD ODE is then solved numerically. Since the dimension of the numerically solved ODE is reduced, this coupled strategy is in general more efficient than numerically solving the full ODE system. To see it in action, let us apply the coupled solver pmx_solve_twocpt_rk45 [Torsten:2021, Section 3.5] to the same model. We need only make two changes. First, we modify the ODE function to reflect that only the PD states are to be solved numerically.
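A sketch of the reduced ODE function, assuming the coupled-solver signature in which the analytically solved PK states are passed as an extra argument (yPK), and the same parameter ordering as before:

```stan
functions {
  array[] real twoCptNeutODE(real t, array[] real y, array[] real yPK,
                             array[] real theta, array[] real rdata,
                             array[] int idata) {
    real VC = theta[3];
    real mtt = theta[6];
    real circ0 = theta[7];
    real gamma = theta[8];
    real alpha = theta[9];
    real ktr = 4 / mtt;

    real conc = yPK[2] / VC;  // PK states are solved analytically and passed in
    real EDrug = fmin(alpha * conc, 1.0);

    real prol = y[1] + circ0;
    real transit1 = y[2] + circ0;
    real transit2 = y[3] + circ0;
    real transit3 = y[4] + circ0;
    real circ = fmax(machine_precision(), y[5] + circ0);

    array[5] real dydt;
    dydt[1] = ktr * prol * ((1 - EDrug) * (circ0 / circ)^gamma - 1);
    dydt[2] = ktr * (prol - transit1);
    dydt[3] = ktr * (transit1 - transit2);
    dydt[4] = ktr * (transit2 - transit3);
    dydt[5] = ktr * (transit3 - circ);
    return dydt;
  }
}
```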
Note that we pass the PD and PK states as separate arguments. The above function only returns the PD derivatives, while the PK states are solved internally using an analytical solution, meaning users do not need to explicitly call pmx_solve_twocpt.
Then we replace the pmx_solve_rk45 call with a call to pmx_solve_twocpt_rk45.
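For example (a sketch; the first two arguments are now the reduced ODE function and the number of numerically solved PD states):

```stan
x = pmx_solve_twocpt_rk45(twoCptNeutODE, 5,
                          time, amt, rate, ii, evid, cmt, addl, ss,
                          theta, biovar, tlag,
                          rtol, atol, max_num_step);
```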
4.2 Building the remaining coding blocks
We omit the data block, but note that it is similar to the one we constructed in previous sections. The key difference is that we now include measurements of the absolute neutrophil count. The parameters block now contains the PD variables:
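A sketch of what this block might look like (names are illustrative):

```stan
parameters {
  // PK parameters, as in the earlier sections
  real<lower = 0> CL;
  real<lower = 0> Q;
  real<lower = 0> VC;
  real<lower = 0> VP;
  real<lower = 0> ka;
  real<lower = 0> sigma;      // residual scale, drug concentration
  // PD parameters of the Friberg-Karlsson model
  real<lower = 0> mtt;        // mean transit time
  real<lower = 0> circ0;      // baseline neutrophil count
  real<lower = 0> gamma;      // feedback exponent
  real<lower = 0> alpha;      // linear drug-effect coefficient
  real<lower = 0> sigmaNeut;  // residual scale, neutrophil count
}
```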
The model block is similar to that in Section 2.1:
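A sketch of this block follows; the prior values are illustrative, and iObsPK and iObsPD are assumed to index the events at which PK and PD measurements were taken:

```stan
model {
  // priors (values are illustrative, not a recommendation)
  CL ~ lognormal(log(10), 0.25);
  Q ~ lognormal(log(15), 0.5);
  VC ~ lognormal(log(35), 0.25);
  VP ~ lognormal(log(105), 0.5);
  ka ~ lognormal(log(2.5), 1);
  mtt ~ lognormal(log(125), 0.2);
  circ0 ~ lognormal(log(5), 0.2);
  gamma ~ lognormal(log(0.17), 0.2);
  alpha ~ lognormal(log(3e-4), 1);
  sigma ~ normal(0, 1);
  sigmaNeut ~ normal(0, 1);

  // likelihoods for the PK and PD measurements
  cObs ~ lognormal(log(concentration[iObsPK]), sigma);
  neutObs ~ lognormal(log(neutrophil[iObsPD]), sigmaNeut);
}
```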
4.2.1 Posterior predictive checks
We hope that by now the reader has developed the habit of performing posterior predictive checks on every model. Since we have both PK (drug concentration) and PD (neutrophil count) observations, the PPC should be conducted on both.
It is possible to run only the generated quantities block, based on a fitted model, using cmdstanr's generate_quantities routine. This is useful when we change the generated quantities but not the rest of a model we have already fitted. The compiled model and the fit are stored in the mod and fit objects in R, respectively. We then run:
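For example (stan_data, the data list used for the original fit, is an illustrative name):

```r
# run only the generated quantities block, reusing the posterior draws in `fit`
gq <- mod$generate_quantities(fitted_params = fit, data = stan_data)
```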
5 Discussion
Stan provides an expressive language to build models, state-of-the-art algorithms to fit these models, and a host of easy-to-use diagnostics. Torsten complements Stan with a suite of routines that solve ODEs within the context of clinical event schedules. Together, Stan and Torsten are potent tools when working through the tangled steps of a Bayesian workflow for PK-PD modeling.
5.1 Current and potential role for Stan and Torsten for pharmacometrics applications
We can apply Stan/Torsten to a large palette of generative models, both for inference and simulation. Applications range from simple linear regression to complex multiscale Quantitative Systems Pharmacology models. Compared to specialized pharmacometrics tools such as NONMEM^{®}, Stan/Torsten is particularly well-suited for cases where more flexibility is desired. This includes models with:
- random effects distributions other than normal,
- prior distributions other than the limited set available in existing pharmacometrics tools,
- multiple submodels with different random effect structures.
It is important to recognize that MCMC, including the HMC scheme used by Stan/Torsten, can be computationally intensive, notably when fitting hierarchical models that require us to numerically solve ODEs. This can be especially frustrating during the initial model exploration stage of a project. For such exploratory analyses, access to a rapid approximate Bayesian inference engine may be desirable. Stan/Torsten includes two optimization-based inference engines, one for estimation of posterior modes and one for variational inference. These algorithms attempt to simultaneously optimize over the entire joint posterior distribution of all model parameters. This process can be relatively slow and error-prone when trying to optimize over the large number of population and individual-level parameters of a typical population pharmacometrics model. This contrasts with typical mixed effects modeling programs, which use algorithms specialized for a more limited range of models, usually employing an alternating sequence of lower-dimensional optimization problems.
For applications that may be implemented with typical pharmacometrics tools, the choice between those and Stan/Torsten comes down to the tradeoffs between flexibility, doing accurate Bayesian inference and computation time.
We would also like to point out that Stan is not the only probabilistic programming language under active development. PyMC3 [salvatier_probabilistic_2016], TensorFlow Probability [Dillon:2017, Lao:2020], and Turing [ge_turing_2018], among others, provide similar modeling capabilities. A full review and comparison of these languages is, however, beyond the scope of this paper.
5.2 Preview of part 2
In part 2 of this tutorial, we plan to build on the material we have covered thus far and tackle more advanced topics, including:
- Improving the performance of HMC, using within-chain parallelization for population models and Torsten's dedicated group solvers.
- Advanced diagnostic tools, notably divergent transitions, which can flag bias in our posterior samples. Stan makes these diagnostics readily available.
- Fake data simulation and analysis, in particular prior predictive checks as a way to understand and build priors, fitting the model to fake data as an imperfect tool to troubleshoot Bayesian inference, and an overview of the more sophisticated but computationally demanding simulation-based calibration [Talts:2020].
- Performance tuning of ODE models, such as solver selection and accuracy control, as well as stability issues.
We will dive into these subjects by examining more advanced models and utilizing techniques such as reparameterization, within-chain parallelization, and pooling multiple data sources. We will also discuss ongoing developments in Stan and Torsten, such as tools to handle larger-scale ODEs and plans to further leverage parallelization.