1 Introduction
In drug development the comparison of two different formulations of the same drug is a frequently addressed issue. In this regard bioequivalence studies investigating the difference between two treatments are performed. According to current guidelines by the U.S. Food and Drug Administration (2003) and the EMA (2014) this question is commonly addressed by comparing the ratios of the geometric means of the pharmacokinetic (PK) parameters area under the curve () and the maximal concentration () to a prespecified threshold. More precisely, bioequivalence is established if the boundaries of the confidence intervals for these ratios fall between and which is equivalent to performing two onesided tests (TOST) proposed by Schuirmann (1987).
As the data are usually logtransformed, we consider the logratio (also defined as the treatment effect) and hence the commonly used equivalence margin is given by .
When performing bioequivalence studies, the classical approach to analyze PK data is given by noncompartmental analysis (NCA), see for example Gabrielsson and Weiner (2001), followed by a linear mixed effect analysis of the AUC or Cmax.
The advantage of this approach is that it is very simple and comes without any further assumptions or knowledge of the data.
However, it requires a sufficiently large number of samples and subjects which cannot be provided in each trial.
As pointed out by Dubois et al. (2011) and Hu et al. (2004) the estimates obtained by NCA are biased
if these conditions are not fulfilled.
Further, in numerous studies a sufficiently large number of samples cannot be guaranteed. For instance, in pediatric research, ethical considerations lead to difficulties in the planning of studies which are therefore typically very small in size (for an example see Mentré et al. (2001)).
But also in other areas where patients are especially frail, as for example in cancer research, these requirements are often not met and therefore methods for sparse designs are required.
In such situations the Nonlinear Mixed Effects Models (NLMEM) have become very popular for analyzing pharmacokinetic data (see Sheiner and Wakefield (1999)). NLMEM turned out to be a promising alternative to the classical approach as the estimation of individual effects allows for incorporating variabilities, as the Betweensubjectvariability (BSV) and the Withinsubjectvariability (WSV), for a detailed comparison see Pentikis et al. (1996); Combrink et al. (1997); Panhard and Mentré (2005).
Consequently the main advantage of the NLMEM consists in the improved accuracy of the estimates in particular when dealing with sparse designs (see also Hu et al. (2004)).
In order to assess bioequivalence between two products typically the two onesided tests (TOST) proposed by Schuirmann (1987) is performed, where two level tests are combined for testing two seperate subhypotheses. This method is based on the IntersectionUnion Principle (see Berger (1982)) and one concludes bioequivalence if for both
onesided tests the null hypotheses can be rejected. Due to its simplicity, this approach which is still recommended in the FDA guidelines has become very popular and is common practice nowadays (see for example Bristol (1993), Brown et al. (1997) and Midha and McKay (2009) among many others).
However, it was demonstrated by Phillips (1990) and Tsai et al. (2014) that for a small number of individuals, high variability in the data or only few samples per patient this method is
rather conservative and suffers from a lack of power.
The present paper addresses this problem. Here we propose a new modelbased approach for the assessment of bioequivalence which turns out to have always more power than the corresponding TOST. The superiority of the new approach is particularly visible in situations with a large variability in the data in parallel designs. The motivation of the new methodology is given by the uniformly most powerful test for normally distributed data with known variance, which can be found in many text books on mathematical statistics (see for example,
Lehmann and Romano (2006), or Wellek (2010)). We argue that the superiority of this methodology for NCA also carries over to modelbased inference for reasonable large sample sizes and demonstrate this fact by means of a simulation study.This paper is organized as follows. In Section 2 we present the classical problem of bioequivalence
and review two tests for this problem, including the commonly used TOST for NCAbased inference. In Section 3 we introduce the NLMEM, then we present the model based TOST as first introduced by Panhard and Mentré (2005) and Dubois et al. (2011) and after that the new model based approach.
Subsequently, these tests are compared by means of a simulation study in Section 4 with NCAbased tests both for parallel and crossover designs varying BSV and WSV. In particular
we demonstrate that the new approach (model and NCAbased) usually yields larger power than methodology
based on the TOST. Some theoretical arguments for these finding can be found in the Appendix, where we review properties of both
methods in the problem of comparing the means from two normal distributions with known variance. This
scenario corresponds to some kind of asymptotic regime for the problems considered in practice, if the sample sizes are
reasonably large.
Summarizing, the new approach introduced in the present paper improves the commonly used TOST for bioequivalence testing based on NCA or modelbased inference.
It has never lower power than this test, but substantially larger power in scenarios with a large variability in parallel designs.
2 Review of bioequivalence tests
In this section we will briefly review a commonly used approach for bioequivalence testing, which is based on the wellknown two onesided test (TOST) introduced by Schuirmann (1987). We further present another more powerful method for testing bioequivalence (see for example Wellek (2010)) which gives the motivation for the newly developed modelbased test in Section 3.3. For the sake of simplicity, both methods are described here in the case of a two groups parallel design, but can be applied to crossover design, more standard in BE.
In a bioavailability/bioequivalence study a test (T) and a reference product (R) are administered and it is investigated whether the two formulations of the drug have similar properties with respect to average bioavailabilty in the population. Exposure, in this context, is usually characterized by blood concentration profile variables and summarized by the area under the time concentration curve (AUC) and the maximum concentration (). More precisely, let and denote the average means of the test and reference product for or , then the common testing problem in bioequivalence is defined by the hypotheses
(2.1) 
where is a given threshold. For example, according to the rule considered in the guidelines by EMA (2014) and U.S. Food and Drug Administration (2003) the threshold is given by .
For the problem of testing for PK bioequivalence the metrics of interest are given by and , which means that we consider
(2.2) 
in (2.1), where and are the treatment effects on and respectively.
2.1 The two onesided Tests (TOST)
We consider the following subhypotheses of as described in (2.1) given by
(2.3) 
The idea of the TOST consists in testing each of these hypotheses separately by a onesided test. The global null hypothesis
in (2.1) is rejected with a type I error if both onesided hypotheses are rejected with a type I error . To be precise let and denote the samples from the test (T) and a reference product (R) respectively and denote by () the mean measured endpoints (over all individuals for the two treatments). Under the assumption that the random variables
are independent and normally distributed with a common (but unknown) variance , that is ; , ; we have for the corresponding means(2.4) 
In applications and usually represent and , , which are typically assumed to be lognormally distributed (see Lacey et al. (1997)). We denote by the pooled variance and by the difference between the expectations of the reference and the treatment group. This yields for the difference of the means
(2.5) 
The unknown variance is estimated by
(2.6) 
where
Consequently the null hypothesis in (2.1) is rejected if
(2.7) 
where is the
quantile of the
distribution with degrees of freedom (see for example Chow and Liu (1992)). This method is equivalent to constructing a confidence interval for and concluding bioequivalence if its completely contained in the equivalence interval (see Schuirmann (1987)).2.2 An efficient alternative to TOST
In this section we will review an alternative test which is (asymptotically) the most powerful test in this setting. In the case of known variances this property is well known in the literature on mathematical statistics (see, for example, Romano et al. (2005)), and, for the sake of completeness, a proof of optimality will be given in the Appendix A.2, where we also review some aspects of the power of the TOST. These considerations motivate the modelbased method, which we will propose in the following Section 3.3.
To be precise, let denote the folded normal distribution with parameters , that is the distribution of the random variable , where . Due to (2.5) we have for the absolute difference
This result motivates the choice of the quantile determining the decision rule of the test, which is described in the following algorithm.
Algorithm 2.1.
(A more powerful test)

Estimate the parameters of interest and by and (for instance by noncompartmental analysis) and estimate the variance of the difference by the statistic defined in (2.6).

Reject the null hypothesis, whenever
(2.8) where is the quantile of the folded normal distribution .
The quantile can be calculated solving the equation
Alternatively, it can directly be obtained by using statistical software, as for example the package by Yee (2015) in R.
The approach presented in Algorithm 2.1 is extended in Algorithm 3.1 for modelbased bioequivalence inference, where we will estimate the parameters of interest and by fitting
a nonlinear mixed model to the data.
2.3 Noncompartmental analysis
If we are testing for PK bioequivalence considering the hypotheses in (2) we need to calculate estimates of and directly from the data. In this regard the classical approach is given by NCA, as described for example in Gabrielsson and Weiner (2001). More precisely, is directly obtained from the data, whereas is approximated by the linear trapezoidal rule. This means that the total area under the curve is obtained by separating it into several smaller trapezoids and summing up these areas. Of course the accuracy of this approach strongly depends on the number of measurements as this gives the number of trapezoids but it does not require a model assumption and is widely applicable. As these methods do not take the profile of the blood concentrationtime curve into account, we call them NCAbased methods throughout this paper and they will be discussed in more detail in Section 3.
3 Modelbased Bioequivalence Tests
Classical NCAbased tests are a useful tool to establish bioequivalence if the blood concentration profile variables and can be calculated with a reasonable precision without using information about the form of the concentration profiles. For this purpose one usually needs a relatively dense design to determine the area under the curve or the maximum of the profile. However, there are many situations, where only a sparse design is available (for some examples see Hu et al. (2004)) and the NCAbased calculation of and might be misleading as the estimates are biased in this case (see Dubois et al. (2011)). In such situations where NCA is not reliable a modelbased approach as proposed for the TOST by Panhard and Mentré (2005) and Dubois et al. (2011) might have important advantages.
Roughly speaking they proposed to use nonlinear mixed effects models (NLMEM) to describe the blood concentration profile and derive and estimates. These quantities are then further analyzed using the methodology introduced in Section 2. By this approach they were able to increase the accuracy of bioequivalence tests in the case of sparse designs.
We will use the same methodology to extend the approach presented in Section 2.1 to situations with sparse designs. This new test achieves more power and simultaneously controls the type I error.
3.1 Nonlinear mixed effects models (NLMEM)
We first consider crossover trials with periods and subjects, investigating the difference between a test and a reference treatment. A classical situation is given by the (balanced) twoperiod, twosequence crossover design (), where the patients receive treatment in the first period and treatment in the second one while the other patients
receive the treatments in the reverse order.
For each subject concentrations of the drug are measured in all periods and at different sampling points.
In order to represent the dependence of the concentration on time for one subject we follow Dubois et al. (2011) and use a nonlinear function, say in order to fit one global model to the data, that is
(4.1) 
where denotes the concentration of the th subject () at sampling time () of period (). In (4.1) the residual errors are independent and standardnormally distributed random variables and the function
is used to model heteroscedasticity. In particular we consider a combined error model with
(4.2) 
where the parameters account for the additive and the proportional part of the error respectively. This gives for the variance of the errors in (4.1)
The individual parameters (of length ) are defined by
(4.3) 
where
denotes a vector of fixed effects,
, and the (known) vectors of treatment, period and sequence covariates respectively and , and the vectors of coefficients of treatment, period and sequence effects. In order to account for the variability between individuals, denoted as betweensubjectvariability (BSV), and the variability of one subject between two periods respectively, that is the withinsubjectvariability (WSV), we introduce random effects and . More precisely, the random effect represents the BSV of subject and the WSV of subject at period respectively. Throughout this section we assume that the random effects are normal distributed, that is(4.4) 
with dimensional covariance matrices and and denote the diagonal elements of these matrices by and , respectively. Finally, the vector of all parameters in model (4.1) is given by
(4.5) 
For biologics with a long halflife, such as monoclonal antibodies, a parallel group design, that is each individual receives only the test or the reference treatment, may be necessary (Dubois et al. (2012)). In that case, we consider only one period and the WSV can be omitted and (4.3) simplifies to
(4.6) 
Note that in this case we do not assume any period or sequence effects and hence the vector in (4.5) simplifies to . For the sake of simplicity we now introduce a vector which is defined by in case of parallel designs and for crossover designs. Consequently we can write for the vector of all parameters in model (4.1) , where disappears in case of parallel design.
Considering now the hypotheses in (2) the treatment effects and on and respectively can be directly obtained from the parameters of the global NLMEM. In other words, there exist functions, , , such that
(4.7) 
By this we obtain an estimate for its variance using the delta method (Oehlert (1992)), which has been proposed by Panhard et al. (2007). With these notations the hypotheses in (2.1) can be rewritten as
(4.8) 
where we do the same for and .
3.2 Modelbased TOST
A modelbased version introduced by Panhard and Mentré (2005); Panhard et al. (2007) and Dubois et al. (2011) of the TOST for bioequivalence can be obtained by fitting the NLMEM (4.1) to the data and calculate the estimate of the treatment effect , . We can assume from the theory of mixed effects modeling (see for example Demidenko (2013)) that this estimate is asymptotically normal distributed and following the discussion in Section 2.1 the null hypothesis in (4.8) is rejected whenever
(4.9) 
where is the quantile of the standard normal distribution and
is an estimate of the standard error of the estimate
.We obtain by using an asymptotic approximation based on the estimated covariance matrix of the fixed effects (given by a submatrix of the inverse of the Fisher information matrix) and the Deltamethod (see Oehlert (1992) and Dubois et al. (2011) for the concrete calculation). More precisely, considering (4.7) and denoting the estimated covariance matrix of the fixed effects by , we have
(4.10) 
where denotes the gradient of the function , expressing as a function of the model parameters ( or ). As the functions and are known, all quantities of the rejection rule given in (4.9) can be directly obtained from the estimates of the parameters in model (4.1).
3.3 Modelbased optimal Bioequivalence Test
In this section we extend the bioequivalence test described in Section 2.2 to NLMEM. It will be shown in Section 4 that the new method significantly improves currently used tests for bioequivalence of concentration curves measured by the pharmacokinetic parameters and as it can also be applied in the case of sparse designs. Further this test turns out to be more powerful than the modelbased TOST described in Section 3.2, in particular for small sample sizes or data with high variability. The adaption of Algorithm 2.1 to modelbased bioequivalence is very straight forward and is summarized in the following algorithm:
Algorithm 3.1.
(A modelbased optimal bioequivalence test on and )

Estimate a NLMEM to the data, resulting in the parameter estimate . This can be done for example for parallel designs using the package by Comets et al. (2011)
. The test statistic can be directly calculated as secondary parameter of the model parameters (see (
4.7)) and is given byApproximate the standard error of the estimate by using the DeltaMethod as describred in (4.10).

Reject the null hypothesis, whenever
(4.11) where is the quantile of the folded normal distribution .
Finite sample properties of this method are given in Section 4.
4 Numerical comparison of NCA and modelbased approaches
In this section we investigate the finite sample properties of the different methods by means of a simulation study. For this purpose we consider eight different scenarios for parallel designs and for twoperiodstwosequencecrossover studies respectively. Note that the latter represent the standard design for bioequivalence trials. More precisely, we will use the models as described in Section 3.1 in order to simulate pharmacokinetic (PK) data using a population PK model with several scenarios varying the study design, the number of sampling times per subject and the magnitude of BSV and WSV (for the crossover designs). The threshold for bioequivalence in (2.1) is as explained in Section 2 chosen as in all cases under consideration.
4.1 Settings
We use the same PK model as described in Dubois et al. (2011), which describes concentrations () of the antiasthmatic drug theophylline, for both reference and test group. More precisely, we consider a onecompartment model with firstorder absorption and firstorder elimination and hence the pharmacokinetic function in (4.1) is defined by
(5.1) 
where is the dose, the bioavailability, the absorption rate constant, the clearance of the drug, and the volume of distribution and hence is composed of , and .
The value for the residual error model in (4.2) were set to mg/l and .
The dose is fixed to mg for all subjects, and the fixed effects for the reference treatment group are , , and .
The variancecovariance matrices and were chosen to be diagonal and
we investigate two different levels of variability for the parallel and crossover design as specified in Table 1.
To evaluate the type I error of the approaches, we simulate a treatment effect on parameters and given by , which affects the and similarly, that is The power of the bioequivalence test will be evaluated for .
We will study two sampling time designs

Rich design: , samples taken at times hours after dosing,

Sparse design: , samples taken at times hours after dosing
as described in Dubois et al. (2011), where all subjects have the same vector of sampling times.
Note that in this situation the sparse design reflects the most critical case as three sampling points are the minimum required for estimating a model with three parameters given in (5.1).
For each scenario, we simulate data sets. For the estimation of the model parameters we use the SAEM algorithm (see Kuhn and Lavielle (2005)). More precisely, in case of parallel designs, we use the R package developed by Comets et al. (2011) with chains and iterations. For crossover studies, we used Monolix 2018 R2 developed by Lixoft (2018) to fit the model to the data with the same number of chains and interations as for parallel designs.
For the standard NCA analysis (see for example Gabrielsson and Weiner (2001)) we used the R package developed by Ekstrom (2019). As this technique is not appropriate for sparse samples we only report results for NCAbased methods based on rich design.
We start considering a parallel design. Twotreatments parallel trials are simulated, that is subjects receive the reference treatment R and the other subjects are allocated to the test treatment T. Illustrations of the simulated concentrations in groups R and T under and in (4.8) are presented in Figure . Secondly, we observe a twoperiods twosequences crossover design. For each trial, the subjects allocated to the first sequence receive the reference treatment first and then the test treatment. The other subjects allocated to the second sequence receive treatments in the reverse order. Table 1 displays all variabilities under consideration.
Design  Variability Scenario  
Parallel  Low BSV  22  11  22  NA  NA  NA 
High BSV  52  NA  NA  NA  
Crossover 
Low  
High  50  15 
4.2 Results
4.2.1 Type I error
In Table 2 we show the results for all tests proposed in Sections 2 and 3. For parallel designs it becomes obvious that both the NCAbased and the modelbased TOST are conservative in settings with a high variability, while the new approach yields a very accurate approximation of the level. This corresponds to the empirical findings in Section 2
and the theoretical arguments given in the Appendix. However, we observe a slightly increased type I error for the sparse design with low variability for both modelbased methods, probably due to standard error underestimation as mentioned by
Dubois et al. (2011). For rich samples and low variability all four tests under consideration perform well and yield an accurate approximation of the nominal level at boundary of the hypotheses, that is .In the case of crossover designs the approximation of the level is very precise for all four tests under consideration, even in the case of high variability. This can be explained by the fact that each individual receives a test and a reference treatment and hence we have twice as much data as for the parallel designs data. However, there is a slight type I error inflation () for the modelbased TOST considering a sparse design with high variability. Concluding, the type I error rates are close to in almost all scenarios under consideration. For increasing variances both versions of the TOST become very conservative whereas the new approach approximates the level still very precisely.
Study Design  Parallel  Crossover  
Sampling time  Rich  Sparse  Rich  Sparse  
Variability  Low  High  Low  High  Low  High  Low  High  
NCATOST  AUC  0.052  0.022      0.046  0.042     
0.062  0.012      0.062  0.070      
NCABOT  AUC  0.052  0.054      0.046  0.042     
0.062  0.052      0.062  0.070      
MBTOST  AUC  0.056  0.004  0.076  0.006  0.056  0.042  0.038  0.050 
0.058  0.008  0.066  0.002  0.064  0.070  0.044  0.078  
MBBOT  AUC  0.056  0.064  0.076  0.034  0.056  0.044  0.038  0.056 
0.070  0.060  0.070  0.058  0.064  0.054  0.044  0.056 
4.2.2 Power
In order to investigate the power of the proposed methods we consider the scenarios summarized in Table 1 with a treatment effect of and . In Table 3 we display the results for the four tests under consideration. In the case of parallel designs we observe that a sparse design does not affect the performance of the tests as much as the level of variability, which when high leads to a huge loss of power for all methods. Although in these settings the power is only close to for the new modelbased approach, a noticeable improvement compared to the modelbased TOST is visible, as for this test the power is practically zero. For low variability the modelbased tests perform very similarly, which confirms again the empirical findings in Section 2 and some theoretical explanation for these observations is given in the appendix. When considering rich designs the NCAbased methods achieve more power than the modelbased ones but the difference turns out to be quite small. However, for sparse designs NCAbased methods are not applicable and in case of low variability we obtain a very high power for both modelbased approaches. For the crossover designs all tests under consideration yield a power of one, irrespective of the sampling time, design and variability. This effect can again be explained by the larger sample size and each individual receiving both treatments.
Study Design  Parallel Design  Crossover Design  
Sampling Time  Rich  Sparse  Rich  Sparse  
Variability  Low  High  Low  High  Low  High  Low  High  
NCATOST  AUC  0.998  0.132      1.000  1.000     
0.998  0.056      1.000  1.000      
NCABOT  AUC  0.998  0.228      1.000  1.000     
0.998  0.154      1.000  1.000      
MBTOST  AUC  0.830  0.008  0.804  0.004  1.000  1.000  1.000  0.998 
1.000  0.024  1.000  0.016  1.000  1.000  1.000  1.000  
MBBOT  AUC  0.838  0.140  0.808  0.132  1.000  1.000  1.000  1.000 
1.000  0.138  1.000  0.116  1.000  1.000  1.000  1.000 
5 Conclusions
In this paper we addressed the problem of sparse designs and high variability in bioequivalence studies. As described by Phillips (1990) and Tsai et al. (2014) we demonstrated that in general for data with high variability methods based on the TOST suffer from a lack of power. To address this problem we introduced a new method using quantiles of the folded normal distribution, which we called bioequivalence optimal testing in this paper. In the case of known variances we proved in the Appendix that this test is uniformly most powerful in this setting and has consequently more power than the TOST. These arguments can be transferred to general bioequivalence testing using NCA or NLMEM if the sample variances can be estimated with reasonable accuracy.
By means of a simulation study we compared the new procedure to the TOST, considering them both based on NCA and NLMEM. We demonstrated that bioequivalence testing based on the new approach is a more powerful alternative to the commonly used TOST if the and are obtained by NCA. This superiority is also observed if these parameters are obtained by fitting an NLMEM, in particular for data with large variability.
Acknowledgements
This work has also been supported in part by the Collaborative Research Center “Statistical modeling of nonlinear dynamic processes” (SFB 823, Teilprojekt T1) of the German Research Foundation (DFG) and by the Food and Drug Administration (FDA) under contract 10110C. The authors would also like to thank Dr. Martin Fink and Dr. Frank Bretz for their helpful discussions and comments on an earlier version of this paper.
Disclaimer
This article reflects the views of the authors and should not be construed to represent FDA’s views or policies.
References
 Berger (1982) Berger, R. L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24:295–300.
 Bristol (1993) Bristol, D. R. (1993). Probabilities and sample sizes for the two onesided tests procedure. Communications in StatisticsTheory and Methods, 22(7):1953–1961.
 Brown et al. (1997) Brown, L. D., Hwang, J. G., and Munk, A. (1997). An unbiased test for the bioequivalence problem. The annals of Statistics, pages 2345–2367.
 Chow and Liu (1992) Chow, S.C. and Liu, P.J. (1992). Design and Analysis of Bioavailability and Bioequivalence Studies. Marcel Dekker, New York.
 Combrink et al. (1997) Combrink, M., McFadyen, M., and Miller, R. (1997). A comparison of the standard approach and the nonmem approach in the estimation of bioavailability in man. Journal of pharmacy and pharmacology, 49(7):731–733.
 Comets et al. (2011) Comets, E., Lavenu, A., and Lavielle, M. (2011). Saemix, an r version of the saem algorithm. In 20th meeting of the Population Approach Group in Europe, Athens, Greece. Abstr, volume 2173.
 Demidenko (2013) Demidenko, E. (2013). Mixed models: theory and applications with R. John Wiley & Sons.
 Dubois et al. (2012) Dubois, A., Gsteiger, S., Balser, S., Pigeolet, E., Steimer, J. L., Pillai, G., and Mentré, F. (2012). Pharmacokinetic similarity of biologics: analysis using nonlinear mixedeffects modeling. Clin. Pharmacol. Ther., 91(2):234–242.
 Dubois et al. (2011) Dubois, A., Lavielle, M., Gsteiger, S., Pigeolet, E., and Mentré, F. (2011). Modelbased analyses of bioequivalence crossover trials using the stochastic approximation expectation maximisation algorithm. Statistics in medicine, 30(21):2582–2600.
 Ekstrom (2019) Ekstrom, C. (2019). Mess: Miscellaneous esoteric statistical scripts. R package version 0.5.5, available at https://CRAN.Rproject.org/package=MESS.
 EMA (2014) EMA (2014). Guideline on the investigation of bioequivalence. available at http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2010/01/WC500070039.pdf.
 Gabrielsson and Weiner (2001) Gabrielsson, J. and Weiner, D. (2001). Pharmacokinetic and pharmacodynamic data analysis: concepts and applications, volume 2. CRC Press.
 Hu et al. (2004) Hu, C., Moore, K. H., Kim, Y. H., and Sale, M. E. (2004). Statistical issues in a modeling approach to assessing bioequivalence or pk similarity with presence of sparsely sampled subjects. Journal of pharmacokinetics and pharmacodynamics, 31(4):321–339.
 Kuhn and Lavielle (2005) Kuhn, E. and Lavielle, M. (2005). Maximum likelihood estimation in nonlinear mixed effects models. Computational Statistics & Data Analysis, 49(4):1020–1038.

Lacey et al. (1997)
Lacey, L., Keene, O., Pritchard, J., and Bye, A. (1997).
Common noncompartmental pharmacokinetic variables: are they normally or lognormally distributed?
Journal of biopharmaceutical statistics, 7(1):171–178.  Lehmann and Romano (2006) Lehmann, E. L. and Romano, J. P. (2006). Testing statistical hypotheses. Springer Science & Business Media.
 Lixoft (2018) Lixoft (2018). The monolix software, version r2.
 Mentré et al. (2001) Mentré, F., Dubruc, C., and Thénot, J.P. (2001). Population pharmacokinetic analysis and optimization of the experimental design for mizolastine solution in children. Journal of pharmacokinetics and pharmacodynamics, 28(3):299–319.
 Midha and McKay (2009) Midha, K. K. and McKay, G. (2009). Bioequivalence; its history, practice, and future.
 Oehlert (1992) Oehlert, G. W. (1992). A note on the delta method. The American Statistician, 46(1):27–29.
 Panhard and Mentré (2005) Panhard, X. and Mentré, F. (2005). Evaluation by simulation of tests based on nonlinear mixedeffects models in pharmacokinetic interaction and bioequivalence crossover trials. Statistics in medicine, 24(10):1509–1524.
 Panhard et al. (2007) Panhard, X., Taburet, A.M., Piketti, C., and Mentré, F. (2007). Impact of modelling intrasubject variability on tests based on nonlinear mixedeffects models in crossover pharmacokinetic trials with application to the interaction of tenofovir on atazanavir in hiv patients. Statistics in medicine, 26(6):1268–1284.
 Pentikis et al. (1996) Pentikis, H. S., Henderson, J. D., Tran, N. L., and Ludden, T. M. (1996). Bioequivalence: individual and population compartmental modeling compared to the noncompartmental approach. Pharmaceutical research, 13(7):1116–1121.
 Phillips (1990) Phillips, K. F. (1990). Power of the two onesided tests procedure in bioequivalence. Journal of pharmacokinetics and biopharmaceutics, 18(2):137–144.
 Romano et al. (2005) Romano, J. P. et al. (2005). Optimal testing of equivalence hypotheses. The Annals of Statistics, 33(3):1036–1047.
 Schuirmann (1987) Schuirmann, D. J. (1987). A comparison of the two onesided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of pharmacokinetics and biopharmaceutics, 15(6):657–680.
 Sheiner and Wakefield (1999) Sheiner, L. and Wakefield, J. (1999). Population modelling in drug development. Statistical methods in medical research, 8(3):183–193.
 Tsai et al. (2014) Tsai, C.A., Huang, C.Y., and Liu, J.p. (2014). An approximate approach to sample size determination in bioequivalence testing with multiple pharmacokinetic responses. Statistics in medicine, 33(19):3300–3317.
 U.S. Food and Drug Administration (2003) U.S. Food and Drug Administration (2003). Guidance for industry: bioavailability and bioequivalence studies for orally administered drug productsgeneral considerations. Food and Drug Administration, Washington, DC. available at http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm070124.pdf.
 Wellek (2010) Wellek, S. (2010). Testing statistical hypotheses of equivalence and noninferiority. CRC Press.
 Yee (2015) Yee, T. W. (2015). Vector Generalized Linear and Additive Models: With an Implementation in R. Springer, New York, USA.
Appendix A Theoretical comparison of tests for bioequivalence
In this section we provide some theoretical explanation, why the approach presented in Section 2.2 has more power than the TOST. For this purpose we now assume that variances in the reference and treatment group are known. In this case the quantiles of the distribution in (2.7) can be replaced by those of a normal distribution and the power functions of all tests can be calculated explicitly. We also note that this assumption is very well justified, if the sample sizes in both groups are sufficiently large. In other words: all arguments presented in this section can be applied to the NCAbased tests discussed in Section 2.1 and 2.2 provided that the sample sizes are sufficiently large. A similar comment applies to the modelbased test for bioequivalence introduced in Section 3. We begin with a discussion of the TOST.
a.1 The two onesided Test (TOST)
Consider the rejection rule of the TOST defined in (2.7), where we replace the estimate of the (pooled) variance by its true value and the quantile by the quantile of the standard normal distribution denoted by . If the probability of rejection is (because the conditions in (2.7) are contradicting). On the other hand, and more importantly, if the probability of rejection for the test (2.7) is given by
(3.1)  
where denotes the distribution function of the standardnormal distribution. From this formula we draw the following conclusions (if ):

The test (2.7) controls its level. For example, if we have
and with a similar argument the same inequality can be derived for .

At the ”boundary” of the null hypothesis (that is ) we have
As converges to if converges to infinity, we expect that the level of the test (2.7) is close to at the ”boundary” of the null hypothesis, if is small. This happens, for example, if the variance (and hence the pooled variance ) is small or, alternatively, if the sample sizes and in both groups are very large. On the other hand the test (2.7) is conservative if the variance is large. In the extreme case we have
a.2 The uniformly most powerful approach
Similar to the TOST the test proposed in Section 2.2 simplifies under the additional assumption of a known variance. As the variance is assumed to be known, the null hypothesis is rejected, whenever,
(3.2) 
where denotes the quantile of the folded normal distribution . The following result shows that the test defined by (2.8) is the uniformly most powerful test for the hypotheses (2.1). It is well known in the mathematical statistics literature and we present a proof here for the sake of completeness (see also Lehmann and Romano (2006), Romano et al. (2005) or Wellek (2010)).
Theorem A.1.
Proof: In order to prove optimality recall (2.5), that is , and note that the hypotheses in (2.1) can be rewritten as
(3.3) 
The test (2.8) rejects the null hypothesis whenever , where is the quantile of the folded normal distribution with parameters , which is defined by
(3.4) 
The probability of rejection is now given by
(3.5)  
where denotes the distribution function of the standardnormal distribution.
On the other hand the uniformly most powerful test for the problem (3.3) is well known, see for example Theorem 6 in Section 3.7 of Lehmann and Romano (2006) or Example 1.1 in Romano et al. (2005) This test reject the null hypothesis in (3.3), whenever
where the constant is the unique solution of the equation
(3.6) 
[see Example 1.1 in Romano et al. (2005)]. As the equations (3.4) and (3.6) coincide, it follows that and the test (2.8) coincides with the UMP test for the hypotheses (2.1).
As a consequence of Theorem A.1 the test proposed in Section 2.2 has always more power than the test defined by (2.7). This is indicated in Figure 2, where we display the power of both tests in different scenarios (, ). The left panel shows the power curves for . In this case the curves basically coincide (although the power of the test (2.8) is slightly larger as stated in Theorem A.1). For increasing variance () it becomes obvious that the power of the test (2.8) is much higher than that for the TOST. This effect becomes even clearer in the right panel (), where the power curve of the TOST is identical to zero.