1 Introduction
In many problems, including the application (see Section 2) that motivated this paper, clustered data are collected via a rotating sampling plan. Such a plan provides administrative convenience and informativeness (Nijman et al., 1991) and is a common tool in sampling practice (Visagie, 2019). Under rotating plans, the observations in the same clusters are correlated, and observations on the same unit collected on different occasions are also correlated. Ignoring this correlation structure may lead to invalid inference procedures. One must develop specific statistical theory and methods to correctly and effectively draw information from the data.
However, in the above context, population parameters such as percentiles of practical interest e.g., in structural engineering, lie outside the domain of applicability of classical statistical methods. Clustered data prove particularly challenging, and ignoring their structure leads to inflated type I errors (Verrill et al., 2015; Datta and Satten, 2008). The result will be unacceptably high frequencies of false rejections, i.e., false alarms. It is difficult to build defensible and easytouse parametric models that accommodate complex cluster structures. Some recent results such as those of Chen et al. (2016) are limited to handling independent crosssectional samples from multiple populations with or without clusters. The goal of this paper is to develop a method capable of handling the longitudinal random effects implied by the rotating sampling plan.
We develop a method for testing a null hypothesis
concerning population parameters against an alternative hypothesis. However, we diverge from the conventional Neyman–Pearson (NP) approach of developing a test statistic and then finding its rejection region and thereby choosing the appropriate action in a decisiontheoretical framework, selecting either the null or alternative hypothesis. The complexity of the inferential problems forces us to seek an alternative approach. We begin by borrowing from the Fisherian alternative to the NP approach: significance testing
(Johnstone, 1987). Thus, we first define a test statistic to partner with the null hypothesis but with the alternative in mind. Given the data, we then generate a pvalue for that statistic. We depart from significance testing by using our pvalue merely as a way to find the rejection region for the NP test, since an analytical determination of that region is generally not feasible except possibly by invoking asymptotic theory.Finding the pvalue presents its own challenges, if we follow the conventional route of specifying a sampling distribution and then its cousin, the likelihood function, and so on. So we again borrow from Fisher, who argued in favor of separating the sampling model from the inferential model. Thus, his socalled exact test for contingency tables was born. Our approach uses a permutation scheme for hypothesis testing that follows from the assumption that the sequences of samples are exchangeable over time or space in data collected via a rotating sampling plan. Surprisingly, this modest assumption enables us to empirically compare the performance of each member of a rich class of possible test statistics. Moreover, the approach proves to make modest computational demands.
In fact, the permutation is a general approach that plays an active role in modern statistical practice (Pesarin and Salmaso, 2010; Hemerik and Goeman, 2018; Hemerik et al., 2019). To provide the stochastic foundation for the test, we use the semiparametric density ratio model or DRM (Anderson, 1979; Qin and Zhang, 1997) to accommodate the multiple population structure. Then we use the empirical likelihood (EL; Owen, 2001) to construct nonparametric test statistics. A simulation study indicates that the tests proposed in this paper firmly control the type I errors whether or not the data are clustered. The use of the DRM improves the power of the tests. An investigation of the influence of the choice of basis function in the DRM suggests that the efficiency gain is widely observed.
The paper is organized as follows. In the next section, we describe the rotating sampling plan and its implied random effects. We then reveal the structural symmetry in multiple populations and their clusters. Section 4 gives details of the proposed generic permutation test. Section 5 proposes a number of test statistics to be used for the permutation tests, including the straightforward and classical tstatistic and Wilcoxon ranksum statistic. Their type I errors become well controlled when their pvalues are computed through a permutation scheme, rather than through asymptotic theory developed for independent and identically distributed (IID) observations. The simulation experiments in Section 6
show that the classical ttest and Wilcoxon ranksum test have inflated type I errors if the pvalues are computed by the asymptotic theory developed for IID observations. In contrast, the permutation tests based on various test statistics, including the tstatistics and Wilcoxon statistics, are shown to have wellcontrolled type I errors. A combination of the DRM and the EL leads to tests for percentiles that have much improved powers. This advantage obtains even when the basis function underlying the DRM is mildly misspecified. The paper concludes with a discussion and an appendix that gives technical details for the numerical methods employed in the simulation experiments.
2 Motivating application
Lumber is manufactured from the trees of a forest. The trees are cut down and trimmed to get logs that are transported to mills where they are sawn in an optimal way to get pieces of lumber. These pieces are classified into grades, primarily according to their strength for engineering applications. Each grade has a published design value (DV) for each type of strength, notably under stretching (ultimate tensile strength or UTS), compression, or bending (modulus of rupture or MOR). Its stiffness (modulus of elasticity or MOE), which is related to all these other characteristics, is, unlike MOR, not measured by destructive testing.
The DV is a specified quantile of the strength distribution, commonly a median or the fifth percentile. Thus, the grade of a piece of lumber for engineering applications depends on its intended use. The top grade is both strong and expensive. The development of the modern grading system has been a triumph of structural engineering since it has standardized lumber properties. Thus, wood, a heterogeneous material unlike say aluminum, can be used with the assurance that the lumber made from it has a low probability of failure when used for its intended purpose.
Given the changing climate and its consequences such as changes in the way trees grow, forest fires, and insect infestations, the quality of wood may vary across regions and over time. Along with this has come changes in processing techniques. Thus, concern has arisen about a possible decline in the strength of lumber. This has led to the need to estimate and hence monitor DVs over time. The first such longterm monitoring program was established in 1994 in the southeastern United States. Crosssectional samples were taken annually using a stratifiedbyregion sampling plan. The number of mills in each region was determined, and the primary sampling units (PSUs) within a region were chosen by simple random sampling. One or two bundles, i.e., secondary sampling units (SSUs), of about
pieces each were selected. From each, a “lot” with ten pieces was chosen in a prescribed way and its MOE was measured. Since the MOE does not involve destructive testing, the sampling plan was relatively inexpensive. Moreover, the MOE was predictive of the other strengths. The subsequent analyses used paired ttests, since the MOE has an approximately normal distribution.
Canada also established a longterm monitoring program. Planning for a pilot program, also measuring the MOE, began in 2005; a preliminary analysis showed a substantial variation between mills, within mills, and between lots. The goal at the time was to measure temporal trends in the MOE, and so a rotating panel design was selected for a specified grade of lumber, with a sixyear rotation. This plan limited the mill response burden, made a consistent random mill effect over time (six years) plausible, and refreshed the sample to maintain some degree of crosssectional validity. On the other hand, as for the paired ttest, changes over time could with confidence be ascribed to changes in strength rather than merely changes in the sample of mills. It soon became apparent that the plan established for the MOE could be used for monitoring the MOR as well.
This led to new challenges: the statistical theory needed to assess trends in the MOR under a rotating sampling plan did not exist. The Forest Products Stochastic Modelling Group (FPSMG), based at the University of British Columbia, was therefore established. It was cofunded by FPInnovations, a nonprofit industrial research lab, and the Natural Sciences and Engineering Research Council of Canada. The FPSMG, which has involved engineers and wood scientists at FPInnovations working in collaboration with statistical faculty and students, is in its tenth year at the time of writing. It has made numerous contributions to the theory and practice of strength measurement and monitoring for forest products: see for example Zidek and Lum (2018), Cai et al. (2017), Chen et al. (2016), and Chen and Liu (2013).
Meanwhile, for reasons beyond the scope of this paper, a separate North American longterm monitoring program has been specified in a revision of an ASTM standards document (D1990). It assumes a crosssectional sample once every five years and specifies among other things that a Wilcoxin test is to be used to assess change in the fifth percentile of the MOR. The document ignores both the PSU and SSU cluster effects induced by their random effects. In a companion article to this one, an alternative method has been proposed for use in the new ASTM monitoring plan. This paper addresses the assessment of trends in strength percentiles for rotating panel designs where samples are taken every year. Its genesis lies in the need for a method that can handle the cluster effect across time and across space.
3 Rotating sampling plans and their random effects
Our research focuses on the analysis of data from rotating sampling plans. At its foundation lies a grand population made up of PSUs. Each such unit is made up of a number of SSUs. The size of the population is so large that it can be regarded as infinite for practical purposes. The grand population itself remains stable in terms of its PSU composition. Multiple samples are formed by data collected from SSUs whose responses evolve over time, space, or other dimensions. For definiteness and expository simplicity, we will assume that the dimension is time.
3.1 Rotating sampling plan
In a rotating sampling plan, we first randomly select (primary) sampling units from the population. For instance, in year , we may sample schools as the PSUs. We then collect data from students at each of the sampled schools. In year we refresh the sample by selecting new schools while retaining of the original schools. Again, students are selected from each school. This process continues until all of the original schools are replaced, which occurs at year .
Since the populations in applications are very large, we may consider the sampling at this stage to be done with replacement. This is well justified in the survey context (Rao and Shao, 1992). In general, on each new occasion, units from the rest of the population are randomly selected to replace units in the original sample. Conceptually, this procedure will continue forever. After or more occasions, all the units in the original sample are flushed out.
The adoption of the rotational sampling plan leads to both longitudinal and crosssectional clustering structures in the data. On each occasion, we collect data on only ultimate sample units in the th sampling unit. The response values can therefore be denoted . For instance, we may collect data from students at each sampled school, and each school will stay in the sample for a specific number of years. In the motivating example, 10 wood pieces from each sampled mill may have their mechanical strength measured. For ease of presentation, we further simplify the notation and let with where is the set of primary units in the sample on occasion .
The components within each (for fixed ) are naturally dependent. This leads to withinpopulation cluster/random effects. When , and are data collected from the same PSU on two occasions. They are therefore naturally dependent, leading to longitudinal cluster/random effects.
Let
be the response variable of a randomly selected SSU on occasion
. We denote its population distribution by , for . Since the population size is very large, for many types of response variables, we may regard as a continuous distribution. This is also true when the values of in the finite population are random samples from a superpopulation with continuous . In some applications, is naturally discrete and there is no need to regard it as continuous. In this case, our subsequent discussion remains applicable.3.2 Population and sampling plan
Let be the indices of the PSUs included in the th sample (). To fix the idea, we highlight the following properties of the population and data from rotating sampling plans:

The multiple samples are collected on several occasions from the same grand population via a rotating sampling plan, and the response values for the same unit may evolve.

Each cluster
forms a vectorvalued time series in response
over . The time series formed by different clusters are mutually independent. 
When , the joint distributions of and are identical for any . In other words, for fixed , the distribution of is exchangeable for the th and th entries of the time series.
The properties above, except No. 4, are not too technical and are plausible in the targeted applications. The DRM assumption in No. 4 is also reasonable: its validity mostly relies on the nonradical evolution of the population characteristics. Using this model leads to improved efficiency when it is approximately satisfied. The efficiency gain remains when this assumption is mildly violated, as we will show in the simulation section.
We note that is the distribution of the response value of a single SSU randomly selected from the th population. In this paper, we propose a permutation test for hypotheses concerning functionals of based on multiple samples collected via the rotating sampling plan described above.
4 Permutation tests
Let be the datagenerating distribution and a test statistic designed to test a null hypothesis against a specific alternative hypothesis: and . We assume that a larger supports . To construct a test of size , we look for a constant such that
(1) 
Let the observed value of be . The test rejects if . One may equivalently compute a value
(2) 
and reject when , for that will imply and hence rejection by the NP hypothesistesting criterion.
In view of the above, the ultimate task of developing a valid test is to find an effective statistic and a way to compute the resulting value while bypassing the need to specify explicitly. In the context of tests based on multiple samples from a rotating sampling plan, let
be the test statistic of choice, with the subindex added to highlight its dependence on the sample size. Suppose the population distribution does not change from occasion 0 to occasion 1: namely, . Then and have the same distribution for all . Taking advantage of this exchangeability, we design a permutation procedure as follows:
 Step I.

For each , generate a random permutation
, independent of all other random variables, such that
(3) and let . Let and for and respectively.
 Step I+.

Permute to create clustered observations and .
 Step II.

Form a permuted multiplesample , , and for for .
We now present the proposed permutation test.
Permutation test. For each permuted multiplesample, compute the value of the test statistic
Generate permutation samples repeatedly and independently, say times. Compute the permutation test pvalue
(4) 
where is the realization of . Reject the null hypothesis if where is the nominal level of the test.
In applications, the practitioner conducts the test on a single data set whereas in research projects analyses may be done with thousands of simulated data sets. Hence, it is computationally affordable to choose a large in applications. The margin of error of with the currently recommended is about . Allowing
to be an odd number helps to avoid minor operational issues. In our simulation study, we use a much smaller
to allow for a large number of simulation repetitions. Our reliance on the average performance of the tests, rather than on accurate approximations in each repetition, validates our choice of a smaller .Theorem 1.
Let be a permutation multiplesample obtained via Steps I and II above. Assume that the null hypothesis is true and the model assumptions specified in the summary subsection hold. Then we have the following results:
(a) has the same distribution as .
Proof: (a) When the null hypothesis holds, the joint distribution of is the same as that of for all including all . At the same time, with different ’s are mutually independent. Therefore, the permutation Step I results in a new data set whose joint distribution remains the same as that of . Therefore, has the same distribution as .
(b) The permutation prescribed in Equation (3) ensures that every permutation outcome has an equal probability. Hence, has a uniform distribution on these possible values. This argument ignores rare but possible ties among these values. In such cases, we interpret the uniform distribution as a distribution with probabilities proportional to the cardinality of each distinct permutation outcome. ∎
Remark 1.
The observed value of may be regarded as one random outcome of . The conclusions in the above theorem hence ensure that the type I error of the permutation test equals the nominal level, excluding the roundoff error.
Remark 2.
The alternative hypothesis does not appear relevant in the proof or theorem statement, but it matters for the actual test. It determines the choice of the test statistic . We choose the that is the most sensitive to the departure of the distribution in the direction of , rather than arbitrary departures from the null hypothesis. For this reason, the stochastic size of should increase when the datagenerating distribution is conceptually deep in and far from . In our target application, for example, if states that the population mean of is larger than that of , then an effective choice of is the difference in the two sample means (namely ). A larger difference in the population means leads to a stochastically larger difference in the sample means. If one chooses
to be the difference in the two sample variances, the resulting test may also suggest that
(unequal mean) should be rejected, but for the wrong reason.Remark 3.
Step I+ permutes the units in . The conclusion in Theorem 1 breaks down when Step I+ is included: namely, may have a slightly different distribution from under . However, under the null hypothesis, the difference introduced by this extra step is minor. At the same time, the units in contain crucial information when is true. Hence, we recommend that Step I+ be included. Our simulation study shows that the type I errors are not affected.
Remark 4.
In applications, things may not go as planned. A few primary units may drop out from the rotating sampling plan. A small modification is needed: permute only units sampled on both occasions.
5 Statistics of choice in permutation tests
In this section, we propose some promising statistics for the permutation test. The choice of affects the statistical efficiency but not the validity of the test.
5.1 Straightforward choices of test statistics
Let the null hypothesis be and the alternative be with being the mean, the quantile at some level of , or another population parameter.
Two immediate choices are the classical and Wilcoxon ranksum statistics with the cluster structure in the data ignored:
(5) 
Here are the sample means, is the pooled sample variance ignoring the cluster structure, and
(6) 
where
is the indicator function and the summation is over all observations on occasions 0 and 1. The Wilcoxon statistic is usually normalized in order to use the central limit theorem, but this is unnecessary when the permutation approach is applied.
These two tests were originally designed to handle IID data. The test further requires that the data are from a normal distribution, and it detects the difference in the population means. The Wilcoxon ranksum test is nonparametric and primarily used to detect a location shift in two distributions, although in theory it works only on the size of . Such limitations are often overlooked in applications, and the tests serve general purposes surprisingly well. However, this is not true for clustered data. For such data, the tests have inflated sizes (higher type I errors) if applied directly without the proposed permutation procedure. The generalization of the Wilcoxon test to independent clusters can be found in Datta and Satten (2008), Datta and Satten (2005), and Rosner et al. (2006). Their results are not applicable to clustered data with longitudinal random effects.
Let and be the distributions fitted by any reasonable method. We may use a straightforward statistic for the permutation test:
(7) 
Obvious choices for and are the empirical distributions ignoring the cluster structure based on samples from and . Another possibility will be given in the next section. We are most interested in this type of statistic for population percentiles.
5.2 DRMassisted choices
Under rotating sampling plans, the multiplesamples are collected from closely related populations. They naturally share some intrinsic latent structure. Accounting for this structure leads to more efficient estimates of and and therefore more powerful permutation tests. We recommend the DRM introduced by Anderson (1979); we believe that it fits a broad range of situations. The DRM has been successfully used by many researchers, including Qin and Zhang (1997), Qin (1998), and Keziou and LeoniAubin (2008).
The DRM links the population distributions by
for some prespecified basis function and parameter . Note that when is chosen as the base distribution. We require the first component of to be to make the first component of a normalization parameter. We use the EL of Owen (2001) as the platform for the inference. In the spirit of the EL, we require to have the form . We construct the composite log likelihood function
(8) 
with the summation over all possible indices . The DRM assumption implies the constraints
(9) 
for all . The log likelihood is “composite” because the observations involved are dependent. See Lindsay (1988) and Varin et al. (2011) for an introduction to and a general discussion of the composite likelihood.
Given , maximizing with respect to leads to the profile log empirical likelihood function (in the same notation):
(10)  
Suppose are maximum EL estimators under DRM. The corresponding fitted distribution functions are
(11) 
where The distribution function estimators can then be used in (7) to form statistics for the permutation tests. We give some specific statistics next.
Detecting changes in percentiles under DRM
Let be the percentile of with and being and respectively. The solution for the twosided alternative follows the same principle.
Under the DRM assumption, we give two choices. The first choice is to let
(12) 
where and are the fitted distribution functions given in (11).
The second choice is the empirical likelihood ratio statistic with a computationally friendly alternation. We first pool the samples from and to obtain the th sample percentile: . We then compute the profile constrained composite empirical likelihood
(13)  
The recommended statistic for a permutation test is then
We use an Rfunction called RootSolve to solve the optimization problem. It solves equations formed by the Lagrange multiplier method for constrained maximization. With the corresponding derivative functions provided, this Rfunction works well. The details are given in the Appendix.
5.3 Populations satisfying model assumptions
To support use of the proposed permutation test under the DRM with clustered data and as preparation for a meaningful simulation study, we consider the following examples.
Example 1.
Normal Data. Let , for ; ; be IID standard normal random variables. Let be IID standard normal random variables and another set of IID standard normal random variables, where these are mutually independent of . Let
for some nonrandom constants and .
Based on this construction, the random variables with fixed are not independent but are identically and normally distributed. Their joint distribution is exchangeable within the cluster indexed by . Furthermore, observations on the units in the same cluster taken on different occasions, e.g., and , , are correlated through the shared random effect . Given , the random variables over and have identical marginal distributions. We denote this distribution by . It is easy to verify that satisfy the DRM conditions with the basis function .
In this model, is the nonrandom effect specific to the population on occasion . The random effect is specific to the th cluster and shared over different occasions through the moderator . The random effect is specific to cluster and independent over different occasions. The response value of the th unit in the th cluster on occasion is given by .
In summary, the longitudinal random effects are and the crosssectional random effects are . The marginal distributions satisfy the DRM with the basis function . ∎
Example 2.
Gamma Data.
A oneparameter Gamma distribution has a degree of freedom parameter
with density functionwhere is the wellknown Gamma function.
Let be a vector and and two real numbers. We denote the vector comprised of as . With this convention, we create a complex cluster structure through the operation for and : . The elements of the stochastic models for are specified as follows:

The are independent with distribution . Given cluster , its value remains the same for all , so this term leads to a longitudinal random effect.

The are independent with distribution . It is shared by the entries in cluster on occasion , and this design leads to the crosssectional random effect.

is a vector of independent random variables with distribution where is the degrees of freedom of occasion . They contribute most of the variations in the response vector . The difference in leads to changes in the marginal distribution.

introduces additional scale fluctuations over the occasions.
The marginal distributions of are specific and denoted by . Because of the independence between , , and , and the property of the Gamma distribution, is also a Gamma distribution with rate parameter and degrees of freedom . Gamma distributions satisfy the DRM specified in (5.2) with .
In summary, by generating multiple samples from this model, we obtain with both longitudinal and crosssectional random effects as described in Section 3. ∎
Example 3.
General data. Consider a population made of a large number of realized values of a random sample from a superpopulation . Denote these as with
carrying no structural information at the moment. Let
, where
is an rdimensional vectorvalued function, symmetric in ;

: are IID;

are independent for different , and they are identically distributed given .
In this general setting, the multiple samples have the crosssectional and longitudinal random effects described in Section 3. In addition, when for some , exchanging and for any subset of in does not change the joint distribution of the multiple sample. At the same time, the population distributions clearly share some general properties. A DRM with an appropriately rich basis function , such as when takes positive values, will be a good approximation for the population distributions . ∎
6 Simulation
In this section, we present simulation results to illustrate the effectiveness and necessity of the proposed permutation test. We reveal the longitudinal random effects on the type I errors for the ttest and Wilcoxon test (wtest).
6.1 Data with normal distributions
We generate data from the normal model as described in the last section. The specific model parameters are chosen as follows:

The number of occasions/populations is .

The number of units per cluster is either or .

The population means vector is one of: , , , and with unspecified means randomly generated on each repetition as .

The number of clusters (primary units) in each sample is either or . The rotating sample plan replaces clusters on each occasion.
The above choices lead to distinct settings. Compounded with the permutations, this leads to a computationally challenging analysis. We must reduce the computational burden. Since the overall sample size increases either with more clusters or with larger cluster sizes, in the simulation we avoid the option of increasing both sample size and cluster size. These settings cover a broad range of qualitatively different situations:

With different values of , we learn the performance of these tests for both relatively weak and strong crosssectional cluster effects.

We learn if DRMbased methods benefit from their ability to borrow strength compared with methods that use only information in the samples from the populations of interest.

With different cluster sizes or numbers of clusters, we learn about the consistency of these tests. That is, the power increases to 1 when the sample size goes to infinity.
We consider the problem of testing whether the second population has a smaller mean/percentile than the first population. When the population means vector is set to , the first two population distributions are identical. The rejection rate of a test in this case reflects its size. When the vector is , the population mean of the second year is higher. The tests should therefore have rejection rates that are lower than the nominal level. To control the amount of computation, this setting is done only for and . The rejection rate of any effective test should be higher when the population means vector changes to or , where the alternative hypothesis holds.
In the simulations, we set the nominal level to , the number of repetitions to , and the number of permutations to . We recommend a much larger number of permutations in applications. In the simulations, the rejection rates are averages of 1000 repetitions. The precision of the individual values has little impact on the overall performance of the permutation test.
6.1.1 To permute or not to permute
We first demonstrate that the classical ttest and wtest ignoring cluster structure indeed have inflated type I errors, and that their desirable sizes are restored with permutation. Here claims that the first two populations have equal means, and the test is onesided so claims that the second mean is smaller. We compute the values of the ttest and the wtest using the inapplicable asymptotic results that ignore the cluster structure, as well as the permutation approach. Table 1 gives the simulation results; the rejection rates are in the NonPerm and Permutation columns.
()  () = (1, 1, 2)  () = (1, 2, 3)  

t.test  w.test  t.test  w.test  t.test  w.test  t.test  w.test  
NonPerm  Permutation  NonPerm  Permutation  
(8.0, 8.4)  0.3  0.6  0.1  0.1  4.4  4.8  1.7  1.6 
(8.0, 8.0)  9.0  9.6  5.1  4.9  14.2  13.8  6.2  6.4 
(8.0, 7.6)  45.8  44.3  28.8  28.9  31.6  30.9  16.5  16.3 
(8.0, 7.2)  85.2  84.2  73.4  72.7  61.4  60.3  37.5  37.6 
(8.0, 8.0)  18.6  17.4  5.1  5.3  21.8  20.9  4.4  4.0 
(8.0, 7.6)  64.3  62.4  39.8  39.4  46.6  45.7  18.8  18.8 
(8.0, 7.2)  95.8  95.6  84.6  82.7  75.2  74.7  45.0  44.9 
(8.0, 8.0)  9.9  8.8  4.5  5.0  13.3  13 5  5.3  5.3 
(8.0, 7.6)  56.3  55.2  39.9  38.5  38.2  36.6  20.2  19.7 
(8.0, 7.2)  94.1  93.7  87.4  86.0  69.7  68.4  48.2  46.7 
The setting with lies on the boundary of the null hypothesis. When and , these two tests have rejection rates as high as %, %, %, and %, if the cluster structure is ignored (NonPerm). These rates are much closer to the nominal 5% when the cluster structure is handled via the permutation paradigm. The worst case is a null rejection rate of 6.4%, which is still in the range of the simulation error given the 1000 repetitions. Other entries for and show that when , which makes the setting an interior point of , the rejection rates are below the nominal level for all the tests. Likewise, all the rejection rates increase when and increase further when . These additional results are as expected and therefore give general support to the validity of our simulation experiments.
Likewise, the results for , and for , are in the expected range. We get the same message that ignoring the cluster structure leads to inflated type I errors for classical tests. Our permutation procedure is an effective way to handle the clustering induced by the rotating sampling plan.
6.1.2 Percentiles
Percentiles are of particular interest in many applications, but neither the ttest nor the wtest are designed to detect their changes. In this section, we examine permutation tests based on the following statistics introduced earlier:
(14)  
(15)  
(16) 
The first is based on the empirical distributions and . The second is based on the fitted distributions and under the DRM with the basis function vector since we know that the marginal distributions are normal. The third is the likelihood ratio statistic under the DRM. No corresponding asymptotic theory is available, but the permutationbased methods do not rely on asymptotic theory. We denote these tests by EM, EL, and ELR in Table 2.
We consider two null hypotheses: the first two populations have the same percentile or the same percentile. The alternative hypotheses claim that the second population has lower percentiles. The rejection rates of these tests are presented in Table 2.
Clearly, the permutation tests tightly control the sizes of these tests. When the first two populations are identical, the rejection rates range from 4.1 to 6.1, with the vast majority between 4.5 and 5.5. When the first population has larger percentiles than the second one (first line of the first block), the type I errors are much smaller than the nominal level, as expected. When the second population has lower percentiles, the powers are higher than the nominal level, as they must be. The power increases further when the difference gets larger, and it increases when either the cluster size or the sample size increases.
A comparison of the results in the left and right halves of the table shows that the sizes of the permutation tests are not affected by the strength of the random effects. When the random effects are strong, the data contain less information. Hence, the powers on the right half of the table are generally lower. The powers increase when the cluster size increases or the number of clusters increases.
We expected the DRMbased tests to have higher powers, and this is clearly true. Both EL and ELR are better than EM. The improvement is more apparent when the overall sample size is large and for the hypothesis regarding the percentile. The simulation results in this table show that the performance of the ELR is about the same as that of EL.
Finally, it is clear that change in a lower percentile is harder to detect than change in the median under a nonparametric model assumption. This explains the power differences for testing the changes in the and percentiles. The simulation results are consistent with this intuition, and they also serve as a sanity check. One may also conclude from the results in Tables 1 and 2 that detecting changes in the median ( percentile) is harder than detecting those in the mean.
() = (1, 1, 2)  () = (1, 2, 3)  
percentile  percentile  percentile  percentile  
EM  EL  ELR  EM  EL  ELR  EM  EL  ELR  EM  EL  ELR  
1  0.8  0.6  0.8  0.3  0.1  0.1  1.7  1.9  2.2  1.5  1.8  1.9 
2  5.4  4.5  4.7  4.5  5.0  5.2  5.6  5.2  5.0  6.0  6.1  6.1 
3  17.7  19.2  19.1  25.6  29.0  29.5  12.0  11.7  11.4  15.2  16.7  16.3 
4  41.8  49.5  48.4  65.9  75.5  75.6  22.5  26.2  26.0  33.3  37.3  37.4 
2  5.7  5.6  5.9  5.2  5.4  5.4  5.3  5.0  5.1  4.2  4.2  4.1 
3  13.2  13.9  13.8  18.8  18.8  18.9  21.8  25.3  24.6  36.5  39.9  39.7 
4  52.1  59.7  58.2  79.0  85.0  84.6  27.1  30.5  30.2  42.0  44.9  44.9 
2  5.0  4.2  4.2  4.1  4.5  4.6  5.3  5.1  4.9  4.7  4.9  5.2 
3  21.4  24.9  24.6  34.6  39.8  39.9  12.8  14.4  14.1  16.6  19.9  19.4 
4  49.3  60.9  59.4  79.7  87.4  87.2  25.5  31.5  31.2  41.1  47.9  47.7 
6.2 Data with Gamma distributions
In this section, we generate data from the Gamma model with the parameters chosen as follows:

The degrees of freedom vector , , , or with unspecified entries randomly generated in each repetition from .

The degrees of freedom vector is or .

The scale parameter is with unspecified entries randomly generated in each repetition as , being a uniform[0, 1] random variable.
Similar considerations apply to this case. The above settings enable us to examine the performance of these tests in a broad range of situations. We use under the DRM assumption. The other specifications are the same as those in the section on normal data.
6.2.1 To permute or not to permute
We now mimic the simulation conducted with the normal data. The null hypothesis is that the first two populations have the same mean, and the alternative is that the second population has a smaller mean. The qualitative findings from the results in Table 3 generally mirror those in Table 1. We again see the inflated type I errors for the nonpermutation tests and wellcontrolled type I errors for the permutation tests. The powers of the permutation tests increase with increased cluster size or increased number of clusters. The difference between the left and right sides of the table is not remarkable, and it is not easy to decide which side has stronger cluster effects.
()  () = (2, 1.5)  () = (1.5, 2.0)  

t.test  w.test  t.test  w.test  t.test  w.test  t.test  w.test  
NonPerm  Permutation  NonPerm  Permutation  
(8.0, 8.4)  1.1  0.9  0.3  0.4  1.9  1.6  0.7  0.9 
(8.0, 8.0)  8.4  8.5  3.9  4.2  7.7  7.3  3.2  3.3 
(8.0, 7.6)  30.6  31.0  17.9  19.0  32.5  32.7  23.1  22.5 
(8.0, 7.2)  66.2  67.1  48.8  50.1  69.0  68.2  56.5  56.2 
(8.0, 8.0)  16.2  15.3  4.9  5.3  16.4  16.4  4.5  4.9 
(8.0, 7.6)  48.2  48.4  22.5  23.6  50.3  50.2  26.9  26.7 
(8.0, 7.2)  85.7  85.1  61.3  61.0  69.0  68.2  56.5  56.2 
(8.0, 8.0)  11.7  11.9  4.5  4.8  9.5  9.0  5.0  5.3 
(8.0, 7.6)  42.9  42.4  26.3  26.5  37.2  36.3  26.5  26.4 
(8.0, 7.2)  77.2  77.1  64.1  63.0  79.0  78.4  56.5  56.2 
6.2.2 Percentiles
We now move to testing hypotheses specifying equal or percentiles for the first two populations in the multiple samples. The alternative hypotheses are onesided: . The simulation results are in Table 4. We could almost repeat here our earlier comments for the simulation results based on normal data. The type I errors of these tests are well controlled. In fact, they are slightly low when and . They are closer to the nominal level as either the cluster size or the number of clusters increases.
In all cases, the powers of these tests increase when either the cluster size or the number of clusters increases. The EL and ELR have very similar powers. They improve markedly over the EM test that does not use model information in the DRM assumption. The differences between the left and right halves do not give much more information, and both support the general claims in this paper. In summary, the simulation results are as expected.
() = (2, 1.5)  () = (1.5, 2.0)  
percentile  percentile  percentile  percentile  
EM  EL  ELR  EM  EL  ELR  EM  EL  ELR  EM  EL  ELR  
1  0.9  0.6  0.6  0.5  0.2  0.3  1.3  0.9  1.0  0.6  0.8  0.9 
2  4.2  4.8  4.5  4.5  3.6  3.9  4.8  4.3  4.2  3.5  4.1  3.7 
3  13.4  16.7  16.7  16.2  19.0  18.5  14.4  16.7  16.2  18.1  22.8  22.8 
4  29.6  34.9  34.7  42.8  50.0  49.5  29.6  37.3  36.3  46.8  56.6  56.8 
2  4.5  5.3  5.4  5.7  5.2  5.4  6.7  5.7  5.7  5.4  4.8  4.8 
3  18.4  19.9  19.6  21.1  24.4  23.9  18.8  21.9  21.7  25.2  26.9  26.8 
4  39.5  46.6  46.4  55.7  63.2  63.4  43.3  51.4  50.5  61.6  68.9  69.0 
2  4.1  5.5  5.3  3.8  5.1  5.2  5.3  5.9  5.7  5.3  5.1  5.1 
3  15.9  20.0  19.3  22.8  27.3  27.2  14.6  19.8  18.7  23.2  27.3  27.2 
4  38.6  47.9  46.8  56.0  64.7  65.0  35.8  46.1  46.1  56.5  68.3  68.1 
6.3 Data from noname distributions
In many applications, historical data sets of the same nature are available. This paper recommends the use of the DRM to extract latent information from multiple samples to enhance efficiency. When applying the DRM we must choose a basis function. In simulations, we usually generate data from classical distributions, and so an appropriate basis function is readily available. In a parallel research project on a dataadaptive choice of the basis function, we have found that the performance is enhanced under the DRM assumption with . We demonstrate this point in this section: the permutation tests still have wellcontrolled type I errors, and they gain some efficiency under the DRM assumption.
We generate clustered data with all the features under a rotational sampling plan. We ensure that the population distributions share some latent features, but simple basis functions are not available. Nevertheless, we complete the simulation as in the last two sections for EM, EL, and ELR with when the DRM is assumed. Specifically, the data are generated as follows:

Form a finite population having a considerable size, based on data from a realworld application.

Randomly generate , from the standard uniform distribution. Let for some positive constants and .
Sample values, , from a Gamma distribution with degrees of freedom and scale parameter . Randomly draw values from with probability proportional to to form a cluster .

Form multiple samples from a rotational sampling plan as
Here are some explanations. Step 1 creates a grand population, and Step 2 uses a biased sample technique to mimic the evolution of the population distribution over occasions . The random numbers and induce longitudinal and crosssectional random effects; we use and to adjust the strength of these effects. The values are introduced to avoid identical observed values in multiple samples. Since these values have mean 1 and small variance, this does not change the expected value from the case where and is not random.
A large value of leads to stochastically large sampled values. Hence, the hypotheses where the two populations have equal percentiles (at some level) and so on can be formulated through their values. In this section, we consider testing only for equal or percentiles. The alternative hypotheses remain onesided: .
We simulated data with both and to examine the influence of the strength of the longitudinal random effects. We successively set , , , and with unspecified entries generated in each repetition from , where is a uniform [0, 1] random variable. These are labeled as 1, 2, 3, 4 in the first column of the table reporting the results.
We defined the base case to have , cluster size , and number of clusters . We then repeated the simulation with an increased cluster size in one setting and with an increased number of clusters in another setting.
The finite population in this simulation is formed from data collected by students running experiments in a lab located at FPInnovations, Vancouver (Cai et al., 2016). It is made up of 825 observed values of the MOR of a specific type of wood product. The sample mean of this data set is and the sample variance is .
The rest of the simulation settings are the same as before. We do not include tests for changes in the population means because they do not involve the DRM assumption. The simulation results are given in Table 5.
The results confirm the points we intended to make, and they provide routine sanity checks. First, inference with nontailored basis functions under the DRM assumption remains solid. The type I error fluctuates around the nominal level in a low range. The powers of all the tests increase with the cluster size and with the number of clusters. When the datagenerating distributions are not on the boundary of the null hypothesis, the type I errors are below the nominal level, as shown in the first row of the table. Moreover, even when the data are from distributions that do not fully conform to the DRM specifications, the inferences made under the DRM assumptions remain valid; the power comparison to EM remains favorable, although not decisively so. Thus, we recommend the use of the DRM without reservation.
= 2  = 4  
percentile  percentile  percentile  percentile  
EM  EL  ELR  EM  EL  ELR  EM  EL  ELR  EM  EL  ELR  
1  1.4  1.6  2.0  2.1  1.7  1.8  1.9  2.8  2.8  2.3  2.2  2.2 
2  5.7  5.5  5.9  4.2  4.6  4.6  4.8  5.0  5.2  5.1  4.8  4.6 
3  8.3  9.2  9.4  11.2  10.8  10.8  9.0  9.1  9.0  11.2  11.6  11.6 
4  17.4  18.2  17.7  24.2  24.3  24.6  16.9  17.9  17.9  23.8  24.8  24.2 
2  4.8  5.4  5.6  4.4  4.6  4.4  6.2  5.6  5.5  6.7  6.4  6.6 
3  10.7  11.4  11.2  13.3  13.0  13.3  9.1  8.8  8.6  9.4  9.9  9.9 
4  16.4  17.1  17.0  23.7  25.2  24.6  18.4  18.9  18.7  22.7  22.4  22.6 
2  4.9  5.2  5.4  5.0  5.7  5.9  4.1  5.2  5.0  5.9  5.3  5.4 
3  11.8  11.7  11.4  14.3  14.6  14.3  10.1  10.1  9.8  12.0  12.0  12.3 
4  19.3  20.9  20.6  29.4  32.2  31.7  19.4  20.1  19.1  27.6  29.0  28.5 
Acknowledgements. The research was funded by FPInnovations and a Collaborative Research and Development Grant from the Natural Sciences and Research Council of Canada. We are indebted to Mr. Conroy Lum.
References
 Anderson (1979) Anderson, J. (1979). Multivariate logistic compounds. Biometrika 66(1), 17–26.
 Cai et al. (2017) Cai, S., J. Chen, and J. V. Zidek (2017). Hypothesis testing in the presence of multiple samples under density ratio models. Statistica Sinica 27, 716–783.
 Cai et al. (2016) Cai, Y., J. Cai, J. Chen, S. Golchi, M. Guan, M. E. Karim, Y. Liu, J. Tomal, C. Xiong, Y. Zhai, et al. (2016). An empirical experiment to assess the relationship between the tensile and bending strengths of lumber. Technical Report 276, The University of British Columbia Department of Statistics.
 Chen et al. (2016) Chen, J., P. Li, Y. Liu, and J. V. Zidek (2016). Monitoring test under nonparametric random effects model. arXiv preprint arXiv:1610.05809.
 Chen and Liu (2013) Chen, J. and Y. Liu (2013). Quantile and quantilefunction estimations under density ratio model. The Annals of Statistics 41(3), 1669–1692.
 Chen et al. (2008) Chen, J., A. M. Variyath, and B. Abraham (2008). Adjusted empirical likelihood and its properties. Journal of Computational and Graphical Statistics 17(2), 426–443.
 Datta and Satten (2005) Datta, S. and G. A. Satten (2005). Ranksum tests for clustered data. Journal of the American Statistical Association 100(471), 908–915.
 Datta and Satten (2008) Datta, S. and G. A. Satten (2008). A signedrank test for clustered data. Biometrics 64(2), 501–507.
 Hemerik and Goeman (2018) Hemerik, J. and J. Goeman (2018). Exact testing with random permutations. TEST 27(4), 811–825.
 Hemerik et al. (2019) Hemerik, J., A. Solari, and J. Goeman (2019). Permutationbased simultaneous confidence bounds for the false discovery proportion. Biometrika 106(3), 635–649.
 Johnstone (1987) Johnstone, D. (1987). Tests of significance following RA Fisher. The British Journal for the Philosophy of Science 38(4), 481–499.
 Keziou and LeoniAubin (2008) Keziou, A. and S. LeoniAubin (2008). On empirical likelihood for semiparametric twosample density ratio models. Journal of Statistical Planning and Inference 138(4), 915–928.
 Lindsay (1988) Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics 80(1), 221–239.
 Nijman et al. (1991) Nijman, T., M. Verbeek, and A. van Soest (1991). The efficiency of rotatingpanel designs in an analysisofvariance model. Journal of Econometrics 49(3), 373–399.
 Owen (2001) Owen, A. (2001). Empirical Likelihood. Chapman & Hall/CRC, New York.
 Owen (2013) Owen, A. B. (2013). Selfconcordance for empirical likelihood. Canadian Journal of Statistics 41(3), 387–397.
 Pesarin and Salmaso (2010) Pesarin, F. and L. Salmaso (2010). Permutation Tests for Complex Data: Theory, Applications and Software. John Wiley & Sons.
 Qin (1998) Qin, J. (1998). Inferences for casecontrol and semiparametric twosample density ratio models. Biometrika 85(3), 619–630.

Qin and
Zhang (1997)
Qin, J. and B. Zhang (1997).
A goodnessoffit test for logistic regression models based on casecontrol data.
Biometrika 84(3), 609–618. 
Rao and
Shao (1992)
Rao, J. N. K. and J. Shao (1992).
Jackknife variance estimation with survey data under hot deck imputation.
Biometrika 79(4), 811–822.  Rosner et al. (2006) Rosner, B., R. J. Glynn, and M.L. T. Lee (2006). The Wilcoxon signed rank test for paired comparisons of clustered data. Biometrics 62(1), 185–192.

Soetaert (2009)
Soetaert, K. (2009).
rootSolve: Nonlinear root finding, equilibrium and steadystate analysis of ordinary differential equations
. R package 1.6.  Soetaert and Herman (2009) Soetaert, K. and P. M. Herman (2009). A Practical Guide to Ecological Modelling: Using R as a Simulation Platform. Springer. ISBN 9781402086236.
 Varin et al. (2011) Varin, C., N. Reid, and D. Firth (2011). An overview of composite likelihood methods. Statistica Sinica 21, 5–42.
 Verrill et al. (2015) Verrill, S., D. E. Kretschmann, and J. W. Evans (2015). Simulations of strength property monitoring tests. Unpublished manuscript. Forest Products Laboratory, Madison, Wisconsin. Available at http://www1.fpl.fs.fed.us/monit.pdf.
 Visagie (2019) Visagie, J. (2019). Measuring regional labour markets in South Africa: How robust are subnational estimates from the Quarterly Labour Force Survey? Development Southern Africa 36(4), 461–475.
 Zidek and Lum (2018) Zidek, J. V. and C. Lum (2018). Statistical challenges in assessing the engineering properties of forest products. Annual Review of Statistics and its Application 5(1), 237–264.
Appendix: Computational issues
The numerical implementation of most of our proposed permutation tests is straightforward. The implementation of the ELR is conceptually simple but involves some tedious steps. To compute defined following (13), we must solve the optimization problem . The constraints in the definition of can be rewritten as
Given , there always exists a , the vector formed by , that solves the above equation system provided vector is an interior point of the convex hull of
Comments
There are no comments yet.