Flexible and efficient Bayesian pharmacometrics modeling using Stan and Torsten, Part I

09/21/2021
by Charles C. Margossian, et al.

Stan is an open-source probabilistic programming language, primarily designed for Bayesian data analysis. Its main inference algorithm is an adaptive Hamiltonian Monte Carlo sampler, supported by state-of-the-art gradient computation. Stan's strengths include efficient computation, an expressive language which offers a great deal of flexibility, and numerous diagnostics that allow modelers to check whether the inference is reliable. Torsten extends Stan with a suite of functions that facilitate the specification of pharmacokinetic and pharmacodynamic models, and makes it straightforward to specify a clinical event schedule. Part I of this tutorial demonstrates how to build, fit, and criticize standard pharmacokinetic and pharmacodynamic models using Stan and Torsten.


1 Introduction

Bayesian inference offers a principled approach to learn about unknown variables from data using a probabilistic analysis. The conclusions we draw are based on the posterior distribution which, in all but the simplest cases, is intractable. We can however probe the posterior using a host of techniques, such as Markov chain Monte Carlo sampling and approximate Bayesian computation. Writing these algorithms is a tedious and error-prone endeavor, but fortunately modelers can often rely on existing software with efficient implementations.

In the field of pharmacometrics, statistical software such as NONMEM® [Beal_undated-qu] and Monolix® [monolix2021] support many routines to specify and analyze pharmacokinetic and pharmacodynamic population models. There also exist more general probabilistic programming languages, such as BUGS [lunn2009bugs] and, more recently, Stan [Carpenter:2017], to name only a few examples. This tutorial focuses on Stan. Stan supports a rich library of probability densities and mathematical functions, including matrix operations and numerical solvers for differential equations. These features make for an expressive and flexible language; however, writing common pharmacometrics models can be tedious. Torsten extends Stan by providing a suite of functions to facilitate the specification of pharmacometrics models. These functions make it straightforward to model the event schedule of a clinical trial and to parallelize computation across patients for population models.

1.1 Why Stan?

We believe that Stan, coupled with Torsten, can be an important addition to the pharmacometrician’s toolkit, especially for Bayesian data analysis.

The most obvious strength of Stan is its flexibility: it is straightforward to specify priors, systems of ODEs, a broad range of measurement models, missing data models, and hierarchies (i.e. population models). Because of this flexibility, various data sources and their corresponding measurement models can be combined into one large model, over which full Bayesian inference can be performed [e.g. Weber:2018]. There are not many examples in pharmacometrics where the flexibility of Stan would be fully utilized, but we believe this is in part because such tools were not readily available to pharmacometricians in the past. The richness of the Stan language could well open the gate to new types of models.

Secondly, Stan supports state-of-the-art inference algorithms, most notably its adaptive Hamiltonian Monte Carlo sampler, a gradient-based Markov chain Monte Carlo algorithm [Betancourt:2018] based on the No-U-Turn sampler (NUTS) [Hoffman:2014], automatic differentiation variational inference (ADVI) [Kucukelbir:2017], and penalized maximum likelihood estimators. Stan's inference algorithms are supported by a modern automatic differentiation library that efficiently generates the requisite derivatives [Carpenter:2015]. It is worth pointing out that algorithms such as NUTS and ADVI were first developed and implemented in Stan, before being widely adopted by the applied statistics and machine learning communities. As of the writing of this article, new inference algorithms continue to be prototyped in Stan. Recent examples include adjoint-differentiated Laplace approximations [Margossian:2020], cross-chain warmup [Zhang:2020], and path finding for improved chain initialization [Zhang:2021].

Thirdly, Stan provides a rich set of diagnostics, including the detection of divergent transitions [Betancourt:2018] and improved computation of effective sample sizes and scale reduction factors, $\widehat{R}$ [Vehtari:2020], as well as detailed warning messages based on these diagnostics. The automated running of these diagnostics makes the platform more user-friendly and provides much guidance when troubleshooting our model and our inference.

Last but not least, both Stan and Torsten are open-source projects, meaning they are free and their source code can be examined and, if needed, scrutinized. The projects are under active development with new features being added regularly.

1.2 Bayesian inference: notation, goals, and comments

Given observed data $y$ and latent variables $\theta$ from the parameter space $\Theta$, a Bayesian model is defined by the joint distribution $p(y, \theta)$. The latent variables can include model parameters, missing data, and more. In this tutorial, we are mostly concerned with estimating model parameters.

The joint distribution admits a convenient decomposition,

$$p(y, \theta) = p(\theta) \, p(y \mid \theta),$$

with $p(\theta)$ the prior distribution and $p(y \mid \theta)$ the likelihood. The prior encodes information about the parameters, usually based on scientific expertise or results from previous analyses. The likelihood tells us how the data are distributed for a fixed parameter value and, per one interpretation, can be thought of as a "story of how the data is generated" [Gelman:2013b]. The Bayesian proposition is to base our inference on the posterior distribution, $p(\theta \mid y)$.

For typical pharmacometric applications, the full joint posterior density of the model parameters is an unfathomable object which lives in a high-dimensional space. Usually we cannot even numerically evaluate the posterior density at any particular point! Instead we must probe the posterior distribution and learn the characteristics that interest us the most. In our experience, this often includes a measure of central tendency and a quantification of uncertainty, for example the mean and the variance, or the median and the 5th and 95th quantiles. For skewed or multimodal distributions, we may want a more refined analysis which looks at many quantiles. What we compute are estimates of these quantities. Most Bayesian inference involves calculations based on marginal posterior distributions. That typically requires integration over a high number of dimensions—an integration that is rarely tractable analytically or by numerical quadrature. One strategy is to generate approximate samples from the posterior distribution and then use the sample mean, sample variance, and sample quantiles as our estimators.

Bayes' rule teaches us that

$$p(\theta \mid y) = \frac{p(y, \theta)}{p(y)} = \frac{p(y \mid \theta) \, p(\theta)}{p(y)}.$$

Typically we can evaluate the joint density in the numerator but not the normalizing constant, $p(y)$, in the denominator. A useful method must therefore be able to generate samples using the unnormalized posterior density, $p(y, \theta)$. Many Markov chain Monte Carlo (MCMC) algorithms are designed to do exactly this. Starting at an initial point, these chains explore the parameter space $\Theta$, one iteration at a time, to produce the desired samples. Hamiltonian Monte Carlo (HMC) is an MCMC method which uses the gradient $\nabla_\theta \log p(y, \theta)$ to efficiently move across the parameter space [Neal:2012, Betancourt:2018]. Computationally, running HMC requires evaluating $\log p(y, \theta)$ and $\nabla_\theta \log p(y, \theta)$ many times across $\Theta$, i.e. for varying values of $\theta$ but fixed values of $y$. For this procedure to be well defined, $\theta$ must be a continuous variable, else the requisite gradient does not exist. Discrete parameters require a special treatment, which we will not discuss in this tutorial.

A Stan program specifies a method to evaluate $\log p(y, \theta)$. Thanks to automatic differentiation, this implicitly defines a procedure to compute $\nabla_\theta \log p(y, \theta)$ [Griewank:2008, Baydin:2018, Margossian:2019]. Together, these two objects provide all the relevant information about our model to run HMC sampling and other gradient-based inference algorithms.

1.3 Bayesian workflow

Bayesian inference is only one step of a broader modeling process, which we might call the Bayesian workflow [Betancourt:2018, Gabry:2019, Gelman:2020]. Once we fit the model, we need to check the inference and, if needed, fine-tune our algorithm or potentially change methods. And once we trust the inference, we naturally need to check the fitted model. Our goal is to understand the shortcomings of our model and motivate useful revisions. During the early stages of model development, this mostly comes down to troubleshooting our implementation; later, this "criticism" step can lead to deeper insights.

All through the tutorial, we will demonstrate how Stan and Torsten can be used to check our inference and our fitted model.

1.4 Setting up Stan and Torsten

Detailed instructions on installing Stan and Torsten can be found at https://github.com/metrumresearchgroup/Torsten. At its core, Stan is a C++ library, but it can be interfaced with one of many scripting languages, including R, Python, and Julia. We will use cmdstanr, a lightweight wrapper of Stan in R, along with the packages posterior [Bukner:2020], bayesplot [Gabry:2021], and loo [Gabry:2020]. We generate most of the figures in this paper using bayesplot, though at times we trade convenience for flexibility and fall back to ggplot2 [Wickham:2009].
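As a concrete starting point, the following R commands sketch one way to install and load these packages. The repository URL for cmdstanr and the exact installation steps are assumptions; the Torsten GitHub page above remains the authoritative installation guide, in particular for the Torsten-enabled CmdStan backend.

# A sketch of the R-side setup (assumed commands; see the Torsten GitHub
# page for the authoritative installation instructions).
install.packages("cmdstanr",
                 repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
install.packages(c("posterior", "bayesplot", "loo", "ggplot2"))

library(cmdstanr)
library(posterior)
library(bayesplot)
library(loo)
library(ggplot2)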

The R and Stan code for all the examples are available at https://github.com/metrumresearchgroup/torsten_tutorial_1_supplementary.

1.5 Prerequisites and resources

Our aim is to provide a self-contained discussion, while trying to remain concise. The objectives of this tutorial are to describe basic model implementation and execution using Stan and Torsten, and the subsequent analysis of the results to generate MCMC diagnostics and limited model checking. Thus we illustrate key elements of a Bayesian workflow, but do not present complete Bayesian workflows for each example due to space limitations. We assume the reader is familiar with compartment models as they arise in pharmacokinetic and pharmacodynamic modeling, and has experience with data that describe a clinical event schedule. For the latter, we follow the convention established by NONMEM®/NMTRAN®/PREDPP®. Exposure to Bayesian statistics and inference algorithms is desirable, in particular an elementary understanding of standard MCMC methods. We expect the reader to know R, but we don't assume any background in Stan.

Helpful reads include the Stan User Manual [Stan:2021] and the Torsten User Manual [Torsten:2021]. Statistical Rethinking by McElreath (2020)[mcelreath2020statistical] provides an excellent tutorial on Bayesian analysis that may be used for self-learning. A comprehensive textbook on Bayesian modeling is Bayesian Data Analysis by Gelman et al (2013) [Gelman:2013b], with more recent insights on the Bayesian workflow provided by Gelman et al (2020) [Gelman:2020]. Betancourt (2018) [Betancourt:2018] offers an accessible discussion on MCMC methods, with an emphasis on HMC.

2 Two compartment model

As a starting example, we demonstrate the analysis of longitudinal plasma drug concentration data from a single individual using a linear 2 compartment model with first order absorption. The individual receives multiple doses at regular time intervals and the plasma drug concentration is recorded over time. Our goal is to estimate the posterior distribution of the parameters of the model describing the time course of the plasma drug concentrations in this individual.

2.1 Pharmacokinetic model and clinical event schedule

Let's suppose an individual receives a drug treatment of 1200 mg boluses administered every 12 hours (q12h) for a total of 14 doses. Drug concentrations are measured in plasma obtained from blood sampled at 0.083, 0.167, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, and 8 hours following the first, second, and final dose. In addition, we take measurements before each drug intake, as well as 12, 18, and 24 hours following the last dose. We analyze the data using a linear two compartment model with first-order absorption:

$$\frac{\mathrm{d} y_{\mathrm{gut}}}{\mathrm{d}t} = - k_a \, y_{\mathrm{gut}}, \qquad (1a)$$
$$\frac{\mathrm{d} y_{\mathrm{cent}}}{\mathrm{d}t} = k_a \, y_{\mathrm{gut}} - \left(\frac{CL}{V_{\mathrm{cent}}} + \frac{Q}{V_{\mathrm{cent}}}\right) y_{\mathrm{cent}} + \frac{Q}{V_{\mathrm{peri}}} \, y_{\mathrm{peri}}, \qquad (1b)$$
$$\frac{\mathrm{d} y_{\mathrm{peri}}}{\mathrm{d}t} = \frac{Q}{V_{\mathrm{cent}}} \, y_{\mathrm{cent}} - \frac{Q}{V_{\mathrm{peri}}} \, y_{\mathrm{peri}}, \qquad (1c)$$

with

  • $y = (y_{\mathrm{gut}}, y_{\mathrm{cent}}, y_{\mathrm{peri}})$: drug mass in each compartment (mg),

  • $k_a$: absorption rate constant (h$^{-1}$),

  • $CL$: elimination clearance from the central compartment (L/h),

  • $Q$: intercompartmental clearance (L/h),

  • $V_{\mathrm{cent}}$: volume of the central compartment (L),

  • $V_{\mathrm{peri}}$: volume of the peripheral compartment (L).

Both intervention and measurement events are described by the event schedule, which follows the convention established by NONMEM®. This means the user must provide, for each event, the following variables: cmt, evid, addl, ss, amt, time, rate, and ii. Table 1 provides an overview of the role of each variable; more details can be found in the Torsten User Manual.

Variable Description
cmt Compartment in which event occurs.
evid Type of event: (0) measurement, (1) dosing.
addl For dosing events, number of additional doses.
ss Steady state indicator: (0) no, (1) yes.
amt Amount of drug administered.
time Time of the event.
rate For dosing by infusion, rate of infusion.
ii For events with multiple dosing, inter-dose interval.
Table 1: Torsten variables to specify an event schedule.
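To make the table concrete, here is a hypothetical, simplified sketch of how such an event schedule could be assembled in R as the data list passed to Stan. The variable names mirror the data block shown below; cObs stands for the vector of observed concentrations, and only the observations following the first dose are included here for brevity.

# Hypothetical sketch: one dosing row (1200 mg bolus into the gut,
# 13 additional doses every 12 h) followed by observation rows.
obs_time <- c(0.083, 0.167, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, 8)
time   <- c(0, obs_time)
nEvent <- length(time)
evid <- c(1, rep(0, nEvent - 1))   # 1 = dosing event, 0 = observation
cmt  <- c(1, rep(2, nEvent - 1))   # dose into gut (1), observe central (2)
amt  <- c(1200, rep(0, nEvent - 1))
addl <- c(13, rep(0, nEvent - 1))  # 13 additional doses
ii   <- c(12, rep(0, nEvent - 1))  # inter-dose interval of 12 h
rate <- rep(0, nEvent)             # boluses, so no infusion rate
ss   <- rep(0, nEvent)             # no steady-state dosing records
iObs <- which(evid == 0)           # indices of observation events
data <- list(nEvent = nEvent, nObs = length(iObs), iObs = iObs,
             time = time, amt = amt, rate = rate, ii = ii,
             evid = evid, cmt = cmt, addl = addl, ss = ss,
             cObs = cObs)          # cObs: observed drug concentrations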

2.2 Statistical model

Given a treatment, $x$, and the physiological parameters, $\theta = (CL, Q, V_{\mathrm{cent}}, V_{\mathrm{peri}}, k_a)$, we compute the drug plasma concentration over time, $\hat{c}(t)$, by solving the relevant ODEs. Our measurements, $c_{\mathrm{obs}}$, are a perturbation of $\hat{c}$. This accounts for the fact that our model is not perfect and that our measurements have finite precision. We model this residual error using a lognormal distribution, that is

$$c_{\mathrm{obs}} \sim \mathrm{LogNormal}\left(\log \hat{c}(t), \sigma\right),$$

where $\sigma$ is a scale parameter we wish to estimate. The deterministic computation of $\hat{c}$, along with the measurement model, defines our likelihood function $p(c_{\mathrm{obs}} \mid \theta, \sigma, x)$, where $c_{\mathrm{obs}}$ and $x$ are input data, i.e. the measurements and the clinical event schedule. Note that we are not limited to the above simple model. Stan supports many distributions and allows users to write more complex residual error models, with, for instance, proportional and additive error components [Stan_func_ref:2021].

It remains to define a prior distribution, $p(\theta, \sigma)$. Our prior should allocate probability mass to every plausible parameter value and exclude patently absurd values. For example, the volume of the central compartment is on the order of ten liters, but it cannot be the size of the Sun. In this simulated example, our priors for the individual parameters are based on population estimates from previous (hypothetical) studies.

Suggestions for building priors can be found in references [Gabry:2019, Betancourt:2020] and at https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations.

2.3 Specifying a model in Stan

We can now specify our statistical model using a Stan file, which is divided into coding blocks, each with a specific role. From R, we then run inference algorithms which take this Stan file as an input.

2.3.1 Data and parameters block

To define a model, we need a procedure which returns the log joint distribution, $\log p(y, \theta)$. Our first task is to declare the data, $y$, and the parameters, $\theta$, using the coding blocks data and parameters. It is important to distinguish the two: the data is fixed, whereas the parameter values change as HMC explores the parameter space, and gradients of the joint density are computed with respect to $\theta$ but not $y$.

For each variable we introduce, we must declare a type and, for containers such as arrays, vectors, and matrices, the size of the container [Stan_users_guide:2021, Ch. 5]. In addition, each statement ends with a semi-colon. It is possible to specify constraints on the parameters using the keywords lower and upper. If one of these constraints is violated, Stan returns an error message. More importantly, constrained parameters are transformed into unconstrained parameters – for instance, positive variables are put on the log scale – which greatly improves computation.

  data {
    int<lower = 1> nEvent;      // number of events
    int<lower = 1> nObs;        // number of observations
    int<lower = 1> iObs[nObs];  // index of events which
                                // are observations.
    // Event schedule
    int<lower = 1> cmt[nEvent];
    int evid[nEvent];
    int addl[nEvent];
    int ss[nEvent];
    real amt[nEvent];
    real time[nEvent];
    real rate[nEvent];
    real ii[nEvent];
    // observed drug concentration
    vector<lower = 0>[nObs] cObs;
  }
  parameters {
    real<lower = 0> CL;
    real<lower = 0> Q;
    real<lower = 0> VC;
    real<lower = 0> VP;
    real<lower = 0> ka;
    real<lower = 0> sigma;
  }

2.3.2 model block

Next, the model block allows us to modify the variable target, which Stan recognizes as the log joint distribution. The following statement increments target using the prior on $\sigma$, which is a normal density truncated at 0 to only put mass on positive values.

target += normal_lpdf(sigma | 0, 1);

The truncation is implied by the fact that sigma is declared as lower-bounded by 0 in the parameters block. An alternative syntax is the following:

sigma ~ normal(0, 1);

This statement now looks like our statistical formulation and makes the code more readable. But we should be mindful that this is not a sampling statement; rather, it is an instruction on how to increment target. We now give the full model block:

model {
  // priors
  CL ~ lognormal(log(10), 0.25);
  Q ~ lognormal(log(15), 0.5);
  VC ~ lognormal(log(35), 0.25);
  VP ~ lognormal(log(105), 0.5);
  ka ~ lognormal(log(2.5), 1);
  sigma ~ normal(0, 1);
  // likelihood
  cObs ~ lognormal(log(concentrationHat[iObs]), sigma);
}

The likelihood statement involves a crucial term we have not defined yet: concentrationHat. Additional variables can be created using the transformed data and transformed parameters blocks. We will take advantage of these to compute the drug concentration in the central compartment for each event. Note that for the likelihood, we only use the concentration during observation events, hence the indexing [iObs].

2.3.3 Transformed data and transformed parameters block

In transformed data, we can construct variables which only depend on the data. For this model, we simply specify the number of compartments in our model (including the gut), nCmt, and the numbers of pharmacokinetic parameters, nTheta, two variables which will come in handy shortly.

  transformed data {
    int nCmt = 3;
    int nTheta = 5;
  }

Because the data is fixed, this operation is only computed once. By contrast, operations in the transformed parameters block need to be performed (and differentiated) for each new parameter value.

To compute concentrationHat we need to solve the relevant ODE within the clinical event schedule. Torsten provides a function which returns the drug mass in each compartment at each time point of the event schedule.

matrix<lower = 0>[nCmt, nEvent]
  mass = pmx_solve_twocpt(time, amt, rate, ii, evid,
                          cmt, addl, ss, theta);

The first eight arguments define the event schedule, and the last argument, theta, is an array containing the pharmacokinetic parameters, defined as follows:

real theta[nTheta] = {CL, Q, VC, VP, ka};

It is also possible to have theta change between events, and specify lag times and bioavailability fractions, although we will not take advantage of these features in the example at hand.

The Torsten function we have chosen to use solves the ODEs analytically. Other routines use a matrix exponential, a numerical solver, or a combination of analytical and numerical methods [Margossian:2017]. It now remains to compute the concentration in the central compartment at the relevant times. The full transformed parameters block is as follows:

transformed parameters {
  real theta[nTheta] = {CL, Q, VC, VP, ka};
  row_vector<lower = 0>[nEvent] concentrationHat;
  matrix<lower = 0>[nCmt, nEvent] mass;
  mass = pmx_solve_twocpt(time, amt, rate, ii, evid,
                          cmt, addl, ss, theta);
  // Extract mass in central compartment and divide
  // by central volume.
  concentrationHat = mass[2, ] ./ VC;
}

The Stan file contains all the coding blocks in the following order: data, transformed data, parameters, transformed parameters, model. The full Stan code can be found in the Supplementary Material.

2.4 Calling Stan from R

The R package cmdstanr allows us to run a number of algorithms on a model defined in a Stan file. An excellent place to get started with the package is https://mc-stan.org/cmdstanr/articles/cmdstanr.html.

The first step is to “transpile” the file – call it twocpt.stan –, that is translate the file into C++ and then compile it.

mod <- cmdstan_model("twocpt.stan")

We can then run Stan’s HMC sampler by passing in the requisite data and providing other tuning parameters. Here: (i) the number of Markov chains (which we run in parallel), (ii) the initial value for each chain, (iii) the number of warmup iterations, and (iv) the number of sampling iterations.

fit <- mod$sample(data = data, chains = n_chains,
                  parallel_chains = n_chains,
                  init = init,
                  iter_warmup = 500,
                  iter_sampling = 500)

There are several other arguments we can pass to the sampler, which we will take advantage of throughout the tutorial. For applications in pharmacometrics, we recommend specifying the initial starting points via the init argument, as the defaults may not be appropriate. In this tutorial, we draw the initial points from their priors by defining an appropriate R function.
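A minimal sketch of such a function, assuming the priors used in the model block above, could look as follows; cmdstanr calls it once per chain to obtain a named list of initial values.

# Draw initial values from the priors (a sketch; the supplementary R code
# is the reference implementation).
init <- function() {
  list(CL = rlnorm(1, log(10), 0.25),
       Q  = rlnorm(1, log(15), 0.5),
       VC = rlnorm(1, log(35), 0.25),
       VP = rlnorm(1, log(105), 0.5),
       ka = rlnorm(1, log(2.5), 1),
       sigma = abs(rnorm(1, 0, 1)))
}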

The resulting fit object stores the samples generated by HMC, from which we can deduce the sample mean, sample variance, and sample quantiles of our posterior distribution. This information is readily accessible using fit$summary() and is summarized in Table 2. We could also extract the samples and perform any number of operations on them.

variable  mean   median  sd      mad     q5     q95    $\widehat{R}$  ESS_bulk  ESS_tail
CL        10.0   10.0    0.378   0.367   9.39   10.6   1.00     1580      1348
Q         19.8   19.5    4.00    4.01    13.8   26.8   1.00     985       1235
VC        41.2   40.8    9.71    9.96    25.6   57.7   1.00     732       1120
VP        124    123     18.0    18.0    97.1   155    1.00     1877      1279
ka        1.73   1.67    0.523   0.522   1.01   2.68   1.00     762       1108
sigma     0.224  0.222   0.0244  0.0232  0.187  0.269  1.01     1549      1083
Table 2: Summary of results when fitting the two compartment model.

The first columns return sample estimates of the posterior mean, median, standard deviation, median absolute deviation, and 5th and 95th quantiles, based on our approximate samples. The next three columns return the $\widehat{R}$ statistic and the effective sample sizes for bulk and tail estimates, and can be used to identify problems with our inference.

2.5 Checking our inference

Unfortunately there is no guarantee that a particular algorithm will work across all the applications we will encounter. We can however make sure that certain necessary conditions are met. Much of the MCMC literature focuses on estimating expectation values, and we will use these results to develop some intuition.

2.5.1 Central limit theorem

Many common MCMC concepts, such as effective sample size, are amenable to intuitive interpretations. But to really grasp their meaning and take advantage of them, we must examine them in the context of central limit theorems.

For any function $f$ of our latent parameters $\theta$, we define the posterior mean to be

$$\mathbb{E}[f(\theta) \mid y] = \int_\Theta f(\theta) \, p(\theta \mid y) \, \mathrm{d}\theta,$$

a quantity also termed the expectation value. If we were able to generate exact independent samples, $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(N)} \sim p(\theta \mid y)$, one sensible estimator would be the sample mean,

$$\bar{f} = \frac{1}{N} \sum_{n = 1}^{N} f\left(\theta^{(n)}\right).$$

Then, provided the variance of $f(\theta)$ is finite, the central limit theorem teaches us that the sample mean converges, as the sample size increases, to a normal distribution. We will not dwell on the technical details of the theorem and go straight to its practical implication, namely:

$$\bar{f} \approx \mathrm{Normal}\left(\mathbb{E}[f(\theta) \mid y], \ \sqrt{\frac{\mathrm{Var}[f(\theta) \mid y]}{N}}\right),$$

where the deviation from the approximating normal has order $\mathcal{O}(1/\sqrt{N})$. This means that even for a moderate sample size, the approximation tends to be very good. This is a powerful result for two reasons: first, it tells us that our estimator is unbiased and, more importantly, that the expected squared error is driven by the variance of $f(\theta)$ divided by our sample size, $N$.

Unfortunately, estimates constructed with MCMC samples will, in general, neither be unbiased, nor will their variance decrease at rate $1/N$. For our estimators to be useful, we must therefore check that our samples are (approximately) unbiased and then use a corrected central limit theorem.

2.5.2 Checking for bias with $\widehat{R}$

MCMC samples are biased for several reasons. Perhaps the most obvious one is that Markov chains generate correlated samples, meaning any sample has some correlation with the initial point. If we run the algorithm for enough iterations, the correlation to the initial point becomes negligible and the chain “forgets” its starting point. But what constitutes enough iterations? It isn’t hard to construct examples where removing the initial bias in any reasonable time is a hopeless endeavor.

To identify this bias, we run multiple Markov chains, each started at a different point, and check that they all converge to the same region of the parameter space. More precisely, we shouldn't be able to distinguish the Markov chains based on the samples alone. One way to check this is to compute the $\widehat{R}$ statistic, for which we provide an intuitive definition:

$$\widehat{R} \approx \sqrt{\frac{\text{variance of the samples pooled across all chains}}{\text{average within-chain variance}}}.$$

If the chains are mixing properly, then $\widehat{R}$ converges to 1.0 as the number of iterations increases. Moreover, we want $\widehat{R} \le 1.01$, as is the case in Table 2. Stan uses an improved $\widehat{R}$ statistic described in a recent paper by Vehtari et al (2020) [Vehtari:2020]. Further discussion on how to interpret $\widehat{R}$ and what it measures can be found in Margossian et al (2021) [Margossian:2021, Section 2]. In addition to $\widehat{R}$, we can visually check that the chains are properly mixing using a trace plot (Figure 1).

Figure 1: Trace plots. The sampled values for each parameter are plotted against the iterations during the sampling phase. Multiple Markov chains were initialized at different points. However, once in the sampling phase, we cannot distinguish the chains.
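A plot like Figure 1 can, for example, be produced with bayesplot (an assumption about the plotting tool; the supplementary R code is the reference):

mcmc_trace(fit$draws(c("CL", "Q", "VC", "VP", "ka", "sigma")))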

If $\widehat{R} > 1.01$ and, more generally, if the chains were not mixing, this would be cause for concern and an invitation to adjust our inference method. Even when $\widehat{R} \approx 1$, we should entertain the possibility that all the chains suffer from the same bias. Stan offers additional diagnostics to identify sampling biases, notably by reporting divergent transitions of the HMC sampler, a topic we will discuss when we fit more sophisticated models.

2.5.3 Controlling the variance of our estimator

Let's assume that our samples are indeed unbiased. The expected error of our estimator is now determined by its variance. Under certain regularity conditions, our estimator follows an MCMC central limit theorem,

$$\bar{f} \approx \mathrm{Normal}\left(\mathbb{E}[f(\theta) \mid y], \ \sqrt{\frac{\mathrm{Var}[f(\theta) \mid y]}{N_{\mathrm{eff}}}}\right),$$

where the key difference with the standard central limit theorem is that $\mathrm{Var}[f(\theta) \mid y]$ is now divided by the effective sample size, $N_{\mathrm{eff}}$, rather than the sample size. This change accounts for the fact that our samples are not independent: their correlation induces a loss of information, which increases the variance of our estimator. For $CL$, we have 2,000 samples, but the effective sample size is 1,580 (Table 2). If $N_{\mathrm{eff}}$ is low, our estimator may not be precise enough and we should generate more samples.

The effective sample size is only formally defined in the context of estimators for expectation values. We may also be interested in tail quantities, such as extreme quantiles, which are much more difficult to estimate and require many more samples to achieve a desired precision. Vehtari et al (2020) [Vehtari:2020] propose a generalization of the effective sample size for such quantities, and introduce the tail effective sample size. This is to be distinguished from the traditional effective sample size, henceforth the bulk effective sample size. Both quantities are reported by Stan.
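Both diagnostics, along with $\widehat{R}$, can be pulled directly from the fit with the posterior package (one possible way; fit$summary() reports the same quantities):

# Rank-normalized Rhat and bulk / tail effective sample sizes.
summarise_draws(fit$draws(c("CL", "Q", "VC", "VP", "ka", "sigma")),
                default_convergence_measures())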

2.6 Checking the model: posterior predictive checks

Once we develop enough confidence in our inference, we still want to check our fitted model. There are many ways of doing this. We may look at the posterior distribution of an interpretable parameter and see if it suggests implausible values. Or we may evaluate the model's ability to perform a certain task, e.g. classification or prediction, as is often done in machine learning. In practice, we find it useful to do posterior predictive checks (PPC), that is, simulate data from the fitted model and compare the simulations to the observed data [Gelman:2013, chapter 6]. Mechanically, the procedure is straightforward:

  1. Draw parameters from their posterior distribution, $\tilde{\theta} \sim p(\theta \mid y)$.

  2. Draw new observations from the likelihood, conditional on the drawn parameters, $\tilde{y} \sim p(y \mid \tilde{\theta})$.

This amounts to drawing observations from their posterior predictive distribution, $p(\tilde{y} \mid y)$. Both the uncertainty due to our estimation and the uncertainty due to our measurement model propagate to our predictions.
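Written out, the two steps draw from the posterior predictive distribution,

$$p(\tilde{y} \mid y) = \int_\Theta p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, \mathrm{d}\theta,$$

which averages the measurement model over the posterior uncertainty in $\theta$.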

Stan provides a generated quantities block, which allows us to compute values, based on sampled parameters. In our two compartment model example, the following code draws new observations from the likelihood:

generated quantities {
  real concentrationObsPred[nObs]
    = lognormal_rng(log(concentrationHat[iObs]), sigma);
}

We generate predictions at the observed points for each sampled parameter value, $\tilde{\theta}$. This gives us a sample of predictions, and we can use its quantiles, for example the 5th and 95th quantiles, to construct a credible interval. We may then plot the observations and the credible intervals (Figure 2) and see that, indeed, the data generated by the model is consistent with the observations.

Figure 2: Posterior predictive checks for the two compartment model. The circles represent the observed data and the shaded areas the credible intervals based on posterior draws.
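One possible way to produce a figure like Figure 2 in R (an assumption; the supplementary code is the reference) is to extract the draws of concentrationObsPred and hand them to bayesplot, with data denoting the list passed to Stan:

cObsPred <- fit$draws("concentrationObsPred", format = "draws_matrix")
ppc_ribbon(y = data$cObs, yrep = cObsPred, x = data$time[data$iObs]) +
  ggplot2::labs(x = "time (h)", y = "drug plasma concentration")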

2.7 Comparing models: leave-one-out cross validation

Beyond model criticism, we may be interested in model comparison. Continuing our running example, we compare our two compartment model to a one compartment model, which is also supported by Torsten via the pmx_solve_onecpt routine. The corresponding posterior predictive checks are shown in Figure 3.

Figure 3: Posterior predictive checks for the one compartment model. The circles represent the observed data and the shaded areas the credible intervals based on posterior draws. A graphical inspection suggests the credible intervals are wider for the one compartment model than they are for the two compartment model.

There are several ways of comparing models, and which method is appropriate crucially depends on the insights we wish to gain. If our goal is to assess a model's ability to make good out-of-sample predictions, we may consider Bayesian leave-one-out (LOO) cross-validation. The premise of cross-validation is to exclude a point, $(x_i, y_i)$, from the training set, i.e. the set of data to which we fit the model. Here $x_i$ denotes the covariate, in our example the relevant row in the event schedule. We denote the reduced data set $y_{-i}$. We then generate a prediction $\tilde{y}_i$ using the fitted model and compare it to $y_i$. A classic metric to make this comparison is the squared error, $(y_i - \tilde{y}_i)^2$.

Another approach is to use the LOO estimate of out-of-sample predictive fit,

$$\mathrm{elpd}_{\mathrm{loo}} = \sum_{i = 1}^{n} \log p(y_i \mid y_{-i}).$$

Here, no prediction is made. We instead examine how consistent an "unobserved" data point is with our fitted model. Computing this estimator is expensive, since it requires fitting the model to $n$ different training sets in order to evaluate each term in the sum.

Vehtari et al (2016) [Vehtari:2016] propose an estimator of $\mathrm{elpd}_{\mathrm{loo}}$ which uses Pareto smoothed importance sampling and only requires a single model fit. The premise is to compute $\log p(y_i \mid y)$ and correct this value, using importance sampling, to estimate $\log p(y_i \mid y_{-i})$. Naturally this estimator may be inaccurate. What makes this tool so useful is that we can use the Pareto shape parameter, $\hat{k}$, to assess how reliable the estimate is; in particular, if $\hat{k} > 0.7$, the estimate shouldn't be trusted. The estimator is implemented in the R package loo. See [Gabry:2020] for more details, including its connection and comparison to the widely applicable information criterion (WAIC).

Conveniently, we can compute the pointwise log likelihood, $\log p(y_i \mid \theta)$, in Stan's generated quantities block.

vector[nObs] log_lik;
for (i in 1:nObs)
  log_lik[i] =
   lognormal_lpdf(cObs[i] | log(concentrationHat[iObs[i]]),
                  sigma);

These results can then be extracted and fed into loo to compute $\mathrm{elpd}_{\mathrm{loo}}$. The file twoCpt.r in the Supplementary Material shows exactly how to do this. Figure 4 plots the estimated $\mathrm{elpd}_{\mathrm{loo}}$, along with its standard error, and shows that the two compartment model has better out-of-sample predictive capabilities.

Figure 4: Leave-one-out estimate of out-of-sample predictive fit. Plotted is the estimate, $\mathrm{elpd}_{\mathrm{loo}}$, for the one and two compartment models. Clearly, the two compartment model has superior predictive capabilities.
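A sketch of that computation in R, assuming a second fit object (here called fitOneCpt, a hypothetical name) for the one compartment model:

# Extract the pointwise log likelihood and run Pareto smoothed importance
# sampling LOO for each model, then compare.
logLikTwo <- fit$draws("log_lik")
looTwoCpt <- loo(logLikTwo, r_eff = relative_eff(exp(logLikTwo)))
logLikOne <- fitOneCpt$draws("log_lik")
looOneCpt <- loo(logLikOne, r_eff = relative_eff(exp(logLikOne)))
print(looTwoCpt)               # elpd_loo, its standard error, Pareto k diagnostics
loo_compare(looTwoCpt, looOneCpt)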

3 Two compartment population model

We now consider the scenario where we have data from multiple patients and fit a population model. Population models are a powerful tool to capture the heterogeneity between patients, while also recognizing similarities. Building the right prior allows us to pool information between patients, the idea being that what we learn from one patient teaches us something – though not everything – about the other patients. In practice, such models can frustrate inference algorithms and need to be implemented with care [Betancourt:2013]. We start with an example where the interaction between the model and our MCMC sampler is well behaved. In Part II of this tutorial, we will examine a more difficult case, for which we will leverage Stan’s diagnostic capabilities in order to run reliable inference.

3.1 Statistical model

Let $\theta = (\theta_1, \dots, \theta_J)$ be the 2D array of body weight-normalized pharmacokinetic parameters for each patient, with $\theta_j$ the parameters for the $j^{\mathrm{th}}$ patient. We construct a population model by introducing random variation to describe otherwise unexplained inter-individual variability. In a Bayesian context this is sometimes referred to as a prior distribution for the individual parameters,

$$\theta_j \sim \mathrm{LogNormal}\left(\log \theta_{\mathrm{pop}}, \Omega\right).$$

As before, we work on the log scale to account for the fact that the pharmacokinetic parameters are constrained to be positive. Here $\theta_{\mathrm{pop}}$ is the population mean (on the logarithmic scale) and $\Omega$ the population covariance matrix; both are estimated. In this example, we start with the simple case where $\Omega$ is diagonal. For our example we will also use conventional allometric scaling to adjust the clearance and volume parameters for body weight, as shown below.
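Concretely (matching the Stan code given later in this section), the individual clearances and volumes are scaled with fixed allometric exponents, using a reference body weight of 70 kg:

$$CL_j = \theta_{j, CL} \left(\frac{\mathrm{weight}_j}{70}\right)^{0.75}, \qquad V_{\mathrm{cent}, j} = \theta_{j, V_{\mathrm{cent}}} \left(\frac{\mathrm{weight}_j}{70}\right),$$

and similarly for $Q$ (exponent 0.75) and $V_{\mathrm{peri}}$ (exponent 1).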

The likelihood remains mostly unchanged, with the caveat that it must now be computed for each patient. Putting this all together, we have the following model, as specified by the joint distribution,

$$p(c_{\mathrm{obs}}, \theta, \theta_{\mathrm{pop}}, \Omega, \sigma) = p(\theta_{\mathrm{pop}}, \Omega, \sigma) \prod_{j = 1}^{J} p(\theta_j \mid \theta_{\mathrm{pop}}, \Omega) \, p(c_{\mathrm{obs}, j} \mid \theta_j, \sigma),$$

where $c_{\mathrm{obs}, j}$ denotes the concentration measurements for the $j^{\mathrm{th}}$ patient.

3.2 Specifying the model in Stan

We begin by adjusting our parameters block:

parameters {
  // Population parameters
  real<lower = 0> CL_pop;
  real<lower = 0> Q_pop;
  real<lower = 0> VC_pop;
  real<lower = 0> VP_pop;
// Constrain ka_pop > lambda_1
  real<lower = (CL_pop / VC_pop + Q_pop / VC_pop + Q_pop / VP_pop +
                sqrt((CL_pop / VC_pop + Q_pop / VC_pop +
                  Q_pop / VP_pop)^2 -
                4 * CL_pop / VC_pop * Q_pop / VP_pop)) / 2 >
                   ka_pop;
  // Inter-individual variability
  vector<lower = 0>[nIIV] omega;
  real<lower = 0> theta[nSubjects, nTheta];
  // Residual variability
  real<lower = 0> sigma;
}

The declaration of ka_pop illustrates that constraints may be expressions involving other variables in the model. In this case ka_pop is constrained to be greater than $\lambda_1$, the faster of the two disposition rate constants, in order to avoid identifiability problems due to "flip-flop".

The variable theta_pop is introduced in transformed parameters, mostly for convenience:

vector<lower = 0>[nTheta]
  theta_pop = to_vector({CL_pop, Q_pop, VC_pop, VP_pop,
                         ka_pop});

The model block reflects our statistical formulation:

model {
  // prior on population parameters
  CL_pop ~ lognormal(log(10), 0.25);
  Q_pop ~ lognormal(log(15), 0.5);
  VC_pop ~ lognormal(log(35), 0.25);
  VP_pop ~ lognormal(log(105), 0.5);
  ka_pop ~ lognormal(log(2.5), 1);
  // independent lognormal priors on each element of omega.
  omega ~ lognormal(0.25, 0.1);
  sigma ~ normal(0, 1);
  // interindividual variability
  for (j in 1:nSubjects)
    theta[j, ] ~ lognormal(log(theta_pop), omega);
  // likelihood
  cObs ~ lognormal(log(concentrationObs), sigma);
}

In the transformed parameters block we also declare and calculate the individual parameters given theta and any relevant covariates—body weight in this case.

  // Individual parameters
  vector<lower = 0>[nSubjects] CL = to_vector(theta[, 1]) .* exp(0.75 * log(weight / 70 ));
  vector<lower = 0>[nSubjects] Q = to_vector(theta[, 2]) .* exp(0.75 * log(weight / 70 ));
  vector<lower = 0>[nSubjects] VC = to_vector(theta[, 3]) .* (weight / 70);
  vector<lower = 0>[nSubjects] VP = to_vector(theta[, 4]) .* (weight / 70);
  vector<lower = 0>[nSubjects] ka = to_vector(theta[, 5]);

It remains to compute concentrationObs. There are several ways to do this and, depending on the computational resources available, we may compute the concentration for each patient either sequentially or in parallel. For now, we take the simpler sequential approach. In the upcoming Part II of this tutorial, we examine how Torsten offers easy-to-use parallelization for population models.

Sequentially computing the concentration is a simple matter of bookkeeping. In transformed parameters we loop through the patients using a for loop. The code is identical to what we used in Section 2.3.3, with the caveat that the arguments to pmx_solve_twocpt are now indexed to indicate for which patient we compute the drug mass. For example, assuming the event schedule is ordered by patient, the event times corresponding to the $j^{\mathrm{th}}$ patient are given by

time[start[j]:end[j]]

where start[j] and end[j] contain the indices of the first and last event for the $j^{\mathrm{th}}$ patient, and the syntax for indexing is as in R. The full for loop is then

for (j in 1:nSubjects) {
  mass[, start[j]:end[j]] =
       pmx_solve_twocpt(time[start[j]:end[j]],
                        amt[start[j]:end[j]],
                        rate[start[j]:end[j]],
                        ii[start[j]:end[j]],
                        evid[start[j]:end[j]],
                        cmt[start[j]:end[j]],
                        addl[start[j]:end[j]],
                        ss[start[j]:end[j]],
                        {CL[j], Q[j], VC[j], VP[j], ka[j]});
  concentration[start[j]:end[j]] =
                    mass[2, start[j]:end[j]] / VC[j];
}

Note that the last array argument in pmx_solve_twocpt is generated using {} syntax.
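On the R side, the start and end arrays are part of the data. A hypothetical sketch of this bookkeeping, assuming a vector subject holding the subject ID of each event row and events sorted by subject:

start <- sapply(unique(subject), function(id) min(which(subject == id)))
end   <- sapply(unique(subject), function(id) max(which(subject == id)))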

Once we have written our Stan model, we can apply the same methods for inference and diagnostics as we did in the previous section.

3.3 Posterior predictive checks

We follow the exact same procedure as in Section 2.6 – even using the same line of code – to simulate new observations for the same patients we analyzed. Figure 5 plots posterior predictions for each individual patient. In addition, we simulate new observations for hypothetical new patients by: (i) drawing pharmacokinetic parameters from our population distribution, (ii) solving the ODEs with these simulated parameters, and (iii) using our measurement model to simulate new observations. Those predictions are also shown in Figure 5 for each individual. Figure 6 depicts a composite posterior predictive check for all individuals. The generated quantities block then looks as follows:

generated quantities {
  real concentrationObsPred[nObs]
    = exp(normal_rng(log(concentration[iObs]), sigma));
  real cObsNewPred[nObs];
  matrix<lower = 0>[nCmt, nEvent] massNew;
  real thetaNew[nSubjects, nTheta];
  row_vector<lower = 0>[nEvent] concentrationNew;
  vector<lower = 0>[nSubjects] CLNew;
  vector<lower = 0>[nSubjects] QNew;
  vector<lower = 0>[nSubjects] VCNew;
  vector<lower = 0>[nSubjects] VPNew;
  vector<lower = 0>[nSubjects] kaNew;
  for (j in 1:nSubjects)
    thetaNew[j, ] = lognormal_rng(log(theta_pop), omega);
  CLNew = to_vector(thetaNew[, 1]) .* exp(0.75 * log(weight / 70));
  QNew = to_vector(thetaNew[, 2]) .* exp(0.75 * log(weight / 70));
  VCNew = to_vector(thetaNew[, 3]) .* (weight / 70);
  VPNew = to_vector(thetaNew[, 4]) .* (weight / 70);
  kaNew = to_vector(thetaNew[, 5]);
  for (j in 1:nSubjects) {
    massNew[, start[j]:end[j]]
      = pmx_solve_twocpt(time[start[j]:end[j]],
                         amt[start[j]:end[j]],
                         rate[start[j]:end[j]],
                         ii[start[j]:end[j]],
                         evid[start[j]:end[j]],
                         cmt[start[j]:end[j]],
                         addl[start[j]:end[j]],
                         ss[start[j]:end[j]],
                         {CLNew[j], QNew[j], VCNew[j], VPNew[j], kaNew[j]});
    concentrationNew[start[j]:end[j]]
      = massNew[2, start[j]:end[j]] / VCNew[j];
  }
  cObsNewPred = lognormal_rng(log(concentrationNew[iObs]), sigma);
}
Figure 5: Population two compartment model: Posterior predictive checks for each individual. Key: Black dots = observed data; red curve and shaded area = posterior median and 90% credible intervals for prediction of new observations in the same individual; blue curve and shaded area = posterior median and 90% credible intervals for prediction of new observations in a hypothetical new individual with the same body weight.
Figure 6: Population two compartment model: Posterior predictive checks for all individuals. Key: Black circles = observed data; blue curves and shaded areas = posterior median and 80% credible intervals for the population median; red curves and shaded areas = posterior medians and 80% credible intervals for the outer population percentiles.

It is worth noting that the computational cost of running operations in the generated quantities block is relatively small. These operations are executed only once per iteration in order to generate posterior samples of the generated quantities, whereas operations in the transformed parameters and model blocks are run and differentiated multiple times per iteration, meaning the latter amply dominate the computation. Hence the cost of doing posterior predictive checks, even when it involves solving ODEs, is marginal. The computational scaling of Stan, notably for ODE-based models, is discussed in the article by Grinsztajn et al (2021) [Grinsztajn:2021].

4 Nonlinear pharmacokinetic / pharmacodynamic model

Now let us consider a PKPD model described in terms of a nonlinear ODE system that requires the use of a numerical solver. The patient receives multiple doses at regular time intervals and the drug plasma concentration is recorded over time.

4.1 Nonlinear pharmacokinetic / pharmacodynamic model

In this, the last example, we go back to the single patient two compartment model and append to it a PD model. Specifically, we examine the Friberg-Karlsson semi-mechanistic model for drug-induced myelosuppression [3181, 2364, 2518, 3187, 3188, 3537], with the goal of modeling the relation between neutrophil counts and drug exposure. The model describes a delayed feedback mechanism that keeps the absolute neutrophil count (ANC) at the baseline $\mathrm{Circ}_0$ in a circulatory compartment ($\mathrm{Circ}$), as well as the drug effect that perturbs this mechanism. The delay between the proliferative cells ($\mathrm{Prol}$) and $\mathrm{Circ}$ is modeled by three transit compartments with mean transit time

$$\mathrm{MTT} = \frac{4}{k_{\mathrm{tr}}}, \qquad (2)$$

where $k_{\mathrm{tr}}$ is the transit rate constant. Figure 7 summarizes the model (see also [3181, Figure 2]).

Figure 7: Friberg-Karlsson semi-mechanistic Model.

The PD likelihood is

$$\mathrm{ANC}_{\mathrm{obs}} \sim \mathrm{LogNormal}\left(\log \mathrm{Circ}(t), \sigma_{\mathrm{ANC}}\right),$$

where $c$ is the drug concentration calculated from the PK model and $\mathrm{Circ}$ solves the nonlinear ODE:

$$\frac{\mathrm{d}\,\mathrm{Prol}}{\mathrm{d}t} = k_{\mathrm{tr}} \, \mathrm{Prol} \left[(1 - E_{\mathrm{drug}}) \left(\frac{\mathrm{Circ}_0}{\mathrm{Circ}}\right)^{\gamma} - 1\right], \qquad (3a)$$
$$\frac{\mathrm{d}\,\mathrm{Transit}_1}{\mathrm{d}t} = k_{\mathrm{tr}} \, (\mathrm{Prol} - \mathrm{Transit}_1), \qquad (3b)$$
$$\frac{\mathrm{d}\,\mathrm{Transit}_2}{\mathrm{d}t} = k_{\mathrm{tr}} \, (\mathrm{Transit}_1 - \mathrm{Transit}_2), \qquad (3c)$$
$$\frac{\mathrm{d}\,\mathrm{Transit}_3}{\mathrm{d}t} = k_{\mathrm{tr}} \, (\mathrm{Transit}_2 - \mathrm{Transit}_3), \qquad (3d)$$
$$\frac{\mathrm{d}\,\mathrm{Circ}}{\mathrm{d}t} = k_{\mathrm{tr}} \, (\mathrm{Transit}_3 - \mathrm{Circ}). \qquad (3e)$$

We use

$$E_{\mathrm{drug}} = \min(1, \alpha \, c)$$

to model the linear effect of the drug once it has been absorbed into the central compartment. This effect reduces the proliferation rate and induces a reduction in the neutrophil count. The upper bound of 1 on $E_{\mathrm{drug}}$ excludes the scenario where the feedback term $(1 - E_{\mathrm{drug}})$ becomes negative for large $c$. While we expect that, for any reasonable parameter values, $\alpha c < 1$, we should anticipate the possibility that our Markov chains may encounter less well-behaved values as they explore the parameter space. Encoding such constraints can lead to improved numerical stability when solving the ODE.

We obtain the complete ODE system for the PKPD model by coupling equations (1) and (3). Because the system is nonlinear, we can no longer resort to analytical solutions as we did in the previous sections.

4.1.1 Numerically solving ODEs

To solve an ODE numerically in Stan we first need to define a function that returns the right-hand-side of the ODE, i.e. the derivative of the solution, in the functions block. The functions block allows users to define functions and is written at the top of the Stan file before the data block.

functions {
  vector twoCptNeutModelODE(real t, vector y, real[] parms, real[] rdummy, int[] idummy){
    real CL = parms[1];
    real Q = parms[2];
    real V1 = parms[3];
    real V2 = parms[4];
    real ka = parms[5];
    real mtt = parms[6];
    real circ0 = parms[7];
    real gamma = parms[8];
    real alpha = parms[9];
    real k10 = CL / V1;
    real k12 = Q / V1;
    real k21 = Q / V2;
    real ktr = 4 / mtt;
    vector[8] dydt;
    real conc = y[2] / V1;
    real EDrug = fmin(1.0, alpha * conc);
    real prol = y[4] + circ0;
    real transit1 = y[5] + circ0;
    real transit2 = y[6] + circ0;
    real transit3 = y[7] + circ0;
    real circ = fmax(machine_precision(), y[8] + circ0);
    dydt[1] = -ka * y[1];
    dydt[2] = ka * y[1] - (k10 + k12) * y[2] + k21 * y[3];
    dydt[3] = k12 * y[2] - k21 * y[3];
    // y[4], y[5], y[6], y[7] and y[8] are differences from circ0.
    dydt[4] = ktr * prol * ((1 - EDrug) * ((circ0 / circ)^gamma) - 1);
    dydt[5] = ktr * (prol - transit1);
    dydt[6] = ktr * (transit1 - transit2);
    dydt[7] = ktr * (transit2 - transit3);
    dydt[8] = ktr * (transit3 - circ);
    return dydt;
  }
}

The above function is an almost direct translation of Eq. (1) and (3). The first three components of dydt describe the PK. The next five components of dydt describe the PD minus the baseline $\mathrm{Circ}_0$. Writing the ODE as a difference from the baseline means the initial PD condition is 0, as opposed to a parameter-dependent value. This results in better computation, because derivatives of the ODE solution with respect to the initial conditions no longer need to be computed; for more details, see [Grinsztajn:2021, Section 5.2]. In addition, we encode a constraint on the circulatory compartment,

$$\mathrm{Circ} = \max\left(\epsilon_{\mathrm{machine}}, \, y_8 + \mathrm{Circ}_0\right),$$

where $\epsilon_{\mathrm{machine}}$ is the machine precision and can be interpreted as the smallest non-zero number the computer can handle. This is to improve numerical stability, especially during the early stages of MCMC exploration, when we may need to handle somewhat implausible parameter values.

Stan and Torsten provide several numerical solvers. In this example we use the Runge-Kutta solver pmx_solve_rk45 [Torsten:2021, Section 3.4]. The signature of pmx_solve_rk45 is a bit more sophisticated than that of pmx_solve_twocpt and requires the following arguments:

  1. the name of the user-defined ODE function (twoCptNeutModelODE)

  2. the number of states/compartments in the ODE

  3. the event schedule

  4. the bioavailability fraction, $F$, and the dosing lag time, $t_{\mathrm{lag}}$, for each compartment (optional)

  5. the tuning parameters for the ODE solver (optional)

Because arguments are nameless in Stan, we can only pass the ODE tuning parameters if we also pass $F$ and $t_{\mathrm{lag}}$. By setting $F$ to 1 and $t_{\mathrm{lag}}$ to 0 for each compartment, we essentially ignore their effect. This is best done in the transformed data block:

  int<lower = 1> nCmt = 8;
  real F[nCmt] = rep_array(1.0, nCmt);
  real tLag[nCmt] = rep_array(0.0, nCmt);

Numerical solvers in Stan and Torsten admit three tuning parameters:

  • rtol: relative tolerance to determine solution convergence,

  • atol: absolute tolerance to determine solution convergence,

  • max_num_step: maximum number of steps allowed.

Though Stan and Torsten provide default values, we highly recommend that the user define the ODE solver control parameters in the data block:

  real<lower = 0> rtol;
  real<lower = 0> atol;
  int<lower = 0> max_num_step;

Users should make problem-dependent decisions on rtol and atol, according to the expected scale of the unknowns, so that the error does not affect our inference. For example, when an unknown can be neglected below a certain threshold without affecting the rest of the dynamic system, setting atol greater than that threshold avoids spurious and error-prone computation. For more details, see [Stan_users_guide:2021, Chapter 13] and [Torsten:2021, Section 3.7.5] and references therein.
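For instance, on the R side one might append the tuning parameters to the data list; the values below are placeholders, not recommendations, and should be chosen for the problem at hand:

data$rtol <- 1e-6            # example value; tighten or relax per problem
data$atol <- 1e-6
data$max_num_step <- 10000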

As before, we solve the ODE within the event schedule in the transformed parameters block:

transformed parameters{
  vector[nt] cHat;
  vector[nObsPK] cHatObs;
  vector[nt] neutHat;
  vector[nObsPD] neutHatObs;
  matrix[nCmt, nt] y;
  real<lower = 0> parms[9];
  parms = {CL, Q, V1, V2, ka, mtt, circ0, gamma, alpha};
  y = pmx_solve_rk45(twoCptNeutModelODE, nCmt, time, amt, rate, ii, evid, cmt, addl, ss, parms, F, tLag, rtol, atol, max_num_step);
  cHat = y[2, ]' / V1;
  neutHat = y[8, ]' + circ0;
  cHatObs = cHat[iObsPK]; // predictions for observed data records
  neutHatObs = neutHat[iObsPD]; // predictions for observed data records
}

4.1.2 Solving PKPD ODEs as a coupled system

The approach in the last section applies to all models that involve ODE solutions, but we will not use it here. An astute observer may have noticed that the PKPD model here exhibits a particular one-way coupling structure: the drug concentration $c$ from the PK (Eq. 1) enters the PD (Eq. 3) through the proliferation compartment, while the PK can be solved independently from the PD. This is what motivates Torsten's coupled solvers, which analytically solve the PK ODEs before passing the PK solution to the PD ODE. The PD ODE is then solved numerically. Since the dimension of the numerically solved system is reduced, this coupled strategy is in general more efficient than numerically solving the full ODE system. To see it in action, let us apply the coupled solver pmx_solve_twocpt_rk45 [Torsten:2021, Section 3.5] to the same model. We need only make two changes. First, we modify the ODE function to reflect that only the PD states are to be solved numerically.

functions{
  vector twoCptNeutModelODE(real t, vector y, vector y_pk, real[] theta, real[] rdummy, int[] idummy){
    /* PK variables */
    real V1 = theta[3];
    /* PD variable */
    real mtt      = theta[6];
    real circ0    = theta[7];
    real gamma    = theta[8];
    real alpha    = theta[9];
    real ktr      = 4.0 / mtt;
    real prol     = y[1] + circ0;
    real transit1 = y[2] + circ0;
    real transit2 = y[3] + circ0;
    real transit3 = y[4] + circ0;
    real circ     = fmax(machine_precision(), y[5] + circ0);
    real conc     = y_pk[2] / V1;
    real EDrug    = alpha * conc;
    vector[5] dydt;
    dydt[1] = ktr * prol * ((1 - EDrug) * ((circ0 / circ)^gamma) - 1);
    dydt[2] = ktr * (prol - transit1);
    dydt[3] = ktr * (transit1 - transit2);
    dydt[4] = ktr * (transit2 - transit3);
    dydt[5] = ktr * (transit3 - circ);
    return dydt;
  }
}

Note that we pass in the PD and PK states as separate arguments, y and y_pk respectively. The above function only returns the time derivatives of the PD states, while the PK is solved internally using an analytical solution, meaning users do not need to explicitly call pmx_solve_twocpt.

Then we replace the pmx_solve_rk45 call with a pmx_solve_twocpt_rk45 call:

  x = pmx_solve_twocpt_rk45(twoCptNeutModelODE, nOde, time, amt, rate, ii, evid, cmt, addl, ss, parms, biovar, tlag, rtol, atol, max_num_step);

4.2 Building the remaining coding blocks

We omit the data block, but note that it is similar to the one we constructed in previous sections. The key difference is we now include measurements for the absolute neutrophil count. The parameters block now contains the PD variables:

parameters{
  real<lower = 0> CL;
  real<lower = 0> Q;
  real<lower = 0> V1;
  real<lower = 0> V2;
  //  real<lower = 0> ka; // ka unconstrained
  real<lower = (CL / V1 + Q / V1 + Q / V2 +
                sqrt((CL / V1 + Q / V1 + Q / V2)^2 -
                     4 * CL / V1 * Q / V2)) / 2> ka; // ka > lambda_1
  real<lower = 0> mtt;
  real<lower = 0> circ0;
  real<lower = 0> alpha;
  real<lower = 0> gamma;
  real<lower = 0> sigma;
  real<lower = 0> sigmaNeut;
}

The model block is similar to that in Section 2.3.2:

model{
  CL ~ lognormal(log(CLPrior), CLPriorCV);
  Q ~ lognormal(log(QPrior), QPriorCV);
  V1 ~ lognormal(log(V1Prior), V1PriorCV);
  V2 ~ lognormal(log(V2Prior), V2PriorCV);
  ka ~ lognormal(log(kaPrior), kaPriorCV);
  sigma ~ normal(0, 1);
  mtt ~ lognormal(log(mttPrior), mttPriorCV);
  circ0 ~ lognormal(log(circ0Prior), circ0PriorCV);
  alpha ~ lognormal(log(alphaPrior), alphaPriorCV);
  gamma ~ lognormal(log(gammaPrior), gammaPriorCV);
  sigmaNeut ~ normal(0, 1);
  cObs ~ lognormal(log(cHatObs), sigma); // observed data likelihood
  neutObs ~ lognormal(log(neutHatObs), sigmaNeut);
}

4.2.1 Posterior predictive checks

We hope that by now the reader has developed the habit of performing PPCs on every model. Since we have both PK (drug concentration) and PD (neutrophil count) observations, the PPC should be conducted on both.

generated quantities{
  real cObsPred[nt];
  real neutObsPred[nt];
  for(i in 1:nt){
    if(time[i] == 0){
      cObsPred[i] = 0;
    }else{
      cObsPred[i] = lognormal_rng(log(cHat[i]), sigma);
    }
    neutObsPred[i] = lognormal_rng(log(neutHat[i]), sigmaNeut);
  }
}

It is possible to run only the generated quantities block, based on an already fitted model, using cmdstanr's generate_quantities routine. This is useful when we change the generated quantities but not the rest of a model we have already fitted. The compiled model and the fit are stored in the mod and fit objects in R, respectively. We then run:

fit.gq <- mod$generate_quantities(fit)

and use the results for the PPC (Figures 8(a) and 8(b)).

(a) PK: drug concentration
(b) PD: neutrophil count
Figure 8: Posterior predictive checks for the PKPD model

5 Discussion

Stan provides an expressive language to build models, state-of-the-art algorithms to fit these models, and a host of easy-to-use diagnostics. Torsten complements Stan with a suite of routines which solve ODEs within the context of a clinical event schedules. Together, Stan and Torsten are potent tools when working through the tangled steps of a Bayesian workflow for PKPD modeling.

5.1 Current and potential role for Stan and Torsten for pharmacometrics applications

We can apply Stan/Torsten to a large palette of generative models, both for inference and simulation. Applications range from simple linear regression to complex multi-scale Quantitative Systems Pharmacology models. Compared to specialized pharmacometrics tools such as NONMEM®, Stan/Torsten is particularly well suited for cases where more flexibility is desired. This includes models with

  • random effects distributions other than normal,

  • prior distributions other than the limited set available in existing pharmacometrics tools,

  • multiple submodels with different random effect structures.

It’s important to recognize that MCMC, including the HMC scheme used by Stan/Torsten, can be computationally intensive, notably when fitting hierarchical models which require us to numerically solve ODEs. This can be especially frustrating during the initial model exploration stage of a project. For such exploratory analyses access to a rapid approximate Bayesian inference engine may be desirable. Stan/Torsten includes two optimization-based inference engines, one for estimation of posterior modes and one for variational inference. These algorithms attempt to simultaneously optimize over the entire joint posterior distribution of all model parameters. This process can be relatively slow and error prone when trying to optimize over the large number of population and individual–level parameters of a typical population pharmacometrics model. This contrasts with typical mixed effects modeling programs that use algorithms specialized for a more limited range of models—usually employing an alternating sequence of lower dimensional optimization problems.

For applications that may be implemented with typical pharmacometrics tools, the choice between those and Stan/Torsten comes down to the trade-offs between flexibility, doing accurate Bayesian inference and computation time.

We would also like to point out that Stan is not the only probabilistic programming language under active development. PyMC3 [salvatier_probabilistic_2016], TensorFlow Probability [Dillon:2017, Lao:2020], and Turing [ge_turing_2018], among others, provide similar modeling capabilities. A full review and comparison of these languages is however beyond the scope of this paper.

5.2 Preview of part 2

In part 2 of this tutorial, we plan to build on the material we have covered thus far and tackle more advanced topics, including:

  • Improving the performance of HMC, using within-chain parallelization for population models and Torsten’s dedicated group solvers.

  • Advanced diagnostic tools, namely divergent transitions which can flag bias in our posterior samples. Stan makes these diagnostics readily available.

  • Fake data simulation and analysis, in particular prior predictive checks as a way to understand and build priors, fitting the model to fake data as an imperfect tool to troubleshoot Bayesian inference, and an overview of the more sophisticated but computationally demanding simulation-based calibration [Talts:2020].

  • Performance tuning of ODE models, such as solver selection, accuracy control, and stability issues.

We will dive into these subjects by examining more advanced models and utilizing techniques such as reparameterization, within-chain parallelization, and pooling multiple data sources. We will also discuss ongoing developments with Stan and Torsten, such as tools to handle larger scale ODEs and plans to leverage parallelization.

References