1 Introduction
Consider a randomized clinical trial, where each subject is randomized to one of several groups. Frequently, multiple outcomes (e.g., one primary and one co-primary outcome, several secondary outcomes) need to be compared between the groups. One straightforward way of accomplishing this goal would be to conduct univariate tests (e.g., t tests) and adjust for multiplicity. Nevertheless, this approach has the drawback of not taking potential correlations between the outcomes into account (e.g., due to physiological reasons, there might be correlations between laboratory measurements). Moreover, when analyzing global assessment variables, further safety and efficacy outcomes, or objective variables related to the global assessment, should be analyzed [International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use, 1998]
. In these instances, employing a multivariate analysis of variance model would be an attractive option. However, the estimators of the treatment effects might be biased, and the power of the test might be low, if covariates (i.e., variables that are thought to influence the outcomes) are not included in the model. Therefore, a multivariate analysis of covariance (MANCOVA) model might be a better choice. Indeed, MANCOVA models have been used in several recent publications in medical research [Roldan-Valadez et al., 2013, Jackson et al., 2018, Setyowibowo et al., 2018, Tournikioti et al., 2018] and psychology [Memarmoghaddam et al., 2016, Lyndon et al., 2017, Freire et al., 2018, Hyams et al., 2018]. In those papers, the “classical” multivariate tests, which are based on the determinant or the eigenvalues of certain matrices (
e.g., Wilks’ Lambda, Roy’s largest root test; see Rencher and Christensen [2002], Anderson [2003]), were used. Analogously to their univariate counterparts, they rely on the assumption that the error vectors are i.i.d. multivariate normal, with expectation zero
and identical covariance matrix across groups (i.e., homoskedasticity). However, assessing multivariate normality and homoskedasticity is difficult, in particular when the sample sizes are small. Moreover, even with reasonable sample sizes, there might be good reasons for assuming non-normal errors or heterogeneous covariance matrices. In contrast to the classical multivariate tests discussed in standard textbooks [Rencher and Christensen, 2002, Huitema, 2011], there are only a few methods available that are potentially applicable in a more general setting. Recently, Fan and Zhang [2017] have proposed a rank-based repeated measures analysis of covariance method; however, only limited simulation evidence regarding its performance in small and possibly unbalanced sample size settings is provided. Apart from that, despite the very broad applicability of the approach of Fan and Zhang [2017], which renders this method useful for analyzing ordinal data, too, it cannot be regarded as generalizing the scope of classical mean-based MANCOVA, because adjusted mean vectors are not used as effect measures. As an alternative, the sandwich variance estimation technique that was proposed in the context of regression models with clustered errors [Arellano, 1987] and later examined in combination with various bootstrap procedures [Cameron et al., 2008] stays within a semiparametric framework, but allows for heteroskedasticity. Although this considerably extends the scope of applications beyond the classical assumptions mentioned before, some requirements might still be critical in a considerable number of practically relevant cases: Firstly, the performance in MANCOVA models with possibly small and unequal group sizes combined with heteroskedasticity and/or non-normality has not been sufficiently examined so far.
Secondly, the model under consideration in the cluster-robust estimation approach [Cameron et al., 2008] assumes that the regression parameter vector is one and the same across all subjects and clusters. In the context of regression with clustered errors, this is perfectly fine. When translated to the MANCOVA setting, however, this means that the model does not allow for regression coefficients that vary between the components of the outcome vector. Yet, most likely, the association between a particular covariate and the outcome is not the same across all components of the outcome. Thirdly, from a theoretical perspective, it should be noted that the existing methods are based on Wald-type statistics and, therefore, require the covariance matrices to be positive definite. Nevertheless, this assumption might be violated in several ways:
Especially in psychology, but also in medical research (e.g., quality of life assessments [Hyams et al., 2018, Setyowibowo et al., 2018]), groups of subjects are often compared with respect to a score. Provided that it is sensible to interpret the score as a metric variable – which requires careful thought case by case – employing an ANCOVA model would be appropriate. Most likely, a multivariate ANCOVA would be used, in order to allow for between-group comparisons of, for example, the overall score (i.e., a linear combination of the subscores) as well as the subscales on which the score is based.
However, the covariance matrix would then be singular. A similar problem might arise in epilepsy research: Consider the situation of examining whether a particular antiepileptic drug is more efficacious than placebo with respect to reducing the number of seizures compared to baseline. In this setting, the change from baseline represents the outcome, and the baseline seizure frequency is regarded as a covariate. However, since it might be sensible to consider not only the overall seizure frequency, but also the reduction in the number of focal and generalized seizures, respectively, we would have a trivariate outcome, and again, the covariance matrix would be singular. Apart from singularity due to linear dependencies between the components of the outcome vector, computational issues can also lead to singular covariance structures.
In order to overcome the potential problems mentioned above, we propose a modified ANCOVA ANOVA-type statistic (MANCATS), which is inspired by the modified ANOVA-type statistic proposed for heteroskedastic multivariate analysis of variance without covariates [Friedrich and Pauly, 2018]. The main feature of this test statistic is that the “full” covariance matrix estimator, which is used in the Wald-type statistic, is replaced by a diagonal matrix that contains only the group-specific variances on the diagonal. Hence, the “full” covariance matrix estimator is not used, yet it is not removed completely either, as it would be in the ANOVA-type approach (see, for example, Brunner et al. [1997]). Consequently, our proposed method shares the advantages of both the Wald-type and the ANOVA-type statistic, since the MANCATS is invariant with respect to the scales of the outcome variables, while not requiring the assumption of positive definite covariance matrices within the groups. Moreover, it should be noted that, in contrast to repeated measures ANCOVA models, the outcomes are allowed to be measured on different scales. In order to approximate the distribution of the MANCATS and improve its finite-sample performance, we apply two different bootstrap procedures.
The present manuscript is organized as follows: In Section 2, we introduce some notation and set up the model as well as the assumptions that are required to ensure the asymptotic validity of the multivariate Wald-type statistic and the two bootstrap MANCATS approaches, respectively. It turns out that those assumptions are quite weak, which renders our proposed method applicable to a broad range of practically relevant settings. Then, in Section 3, we describe how the two proposed bootstrap procedures work. The finite-sample performance in terms of type I error rates and power is investigated in an extensive simulation study discussed in Section 4. In order to further substantiate the findings of the previous section and to illustrate the aforementioned practical applicability, simulations that are based on real-life data from standardized student achievement tests are discussed in Section 5. Finally, Section 6 contains some closing remarks and ideas for future research. The proofs of the theorems stated in the main body of the manuscript as well as additional simulation results are included in the supplementary material.
2 The general multivariate analysis of covariance (MANCOVA) model
In the sequel, let $I_d$ denote the $d$-dimensional identity matrix, and let $\oplus$ and $\otimes$ denote the direct sum and the Kronecker product of matrices, respectively. Let $Y_{ik}$ and $x_{ik}$ denote the $p$- and $q$-dimensional outcome and covariate vectors of subject $k$ in group $i$, $k = 1, \dots, n_i$, $i = 1, \dots, a$. We assume that the covariates are fixed, whereas the outcomes are random variables, satisfying the model equation
(1) $Y_{ik} = m_i + B\,x_{ik} + \epsilon_{ik}$
Thereby, $m_i$ denotes the vector of adjusted means in group $i$, and $B = (\beta_1, \dots, \beta_q)$, where $\beta_j \in \mathbb{R}^p$ contains the regression coefficients modelling the association between the $j$th covariate and the $p$ components of the outcome, $j = 1, \dots, q$. It should be noted that model (1) allows for unequal regression coefficients for different components of the outcome. However, the coefficients are assumed to be the same across groups, which corresponds to the classical assumption of equal regression slopes in the simple univariate ANCOVA setting. Regarding the errors, we assume that the $\epsilon_{ik}$ are independent, with $E(\epsilon_{ik}) = 0$ and group-specific covariance matrix $Cov(\epsilon_{ik}) = \Sigma_i$, $i = 1, \dots, a$.
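To make the model structure concrete, the following minimal Python sketch simulates data from a model of this form. All names, group sizes, and numeric values are illustrative (not taken from the manuscript); the key features are one coefficient vector per covariate, shared across groups but varying across outcome components, plus group-specific error scales.

```python
import numpy as np

rng = np.random.default_rng(0)
a, p, q = 2, 2, 2            # number of groups, outcome dimension, number of covariates
n = [15, 20]                 # illustrative (unbalanced) group sizes
m = [np.array([0.0, 0.0]),   # adjusted mean vectors, one per group
     np.array([0.5, 1.0])]
# One coefficient vector per covariate: column j links covariate j to all
# p outcome components; coefficients differ across components, not across groups.
B = np.array([[0.8, 0.2],
              [0.1, 0.6]])   # shape p x q

data = []
for i in range(a):
    for k in range(n[i]):
        x = rng.normal(size=q)                # covariates are fixed in the model; drawn here for brevity
        eps = rng.normal(size=p) * (1.0 + i)  # group-specific error scale (heteroskedasticity)
        y = m[i] + B @ x + eps                # the model equation sketched above
        data.append((i, y, x))
```

This is only a data-generating sketch; the estimation and testing steps are developed in the remainder of the section.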
Model (1) can be expressed in a more compact form by using matrix notation: Let , , , and . Moreover, let denote the dimensional vector containing all ’s. Then, model (1) is equivalent to
(2) 
where , and . So, in particular,
can be estimated by the ordinary least squares estimator
, that is, $\hat{m}_i = \bar{Y}_{i\cdot} - \hat{B}\,\bar{x}_{i\cdot}$, where the dot notation indicates averaging over all subjects in the respective group, and $\hat{B}$ denotes the OLS estimator of $B$. We are interested in testing hypotheses about the adjusted mean vectors of the form $Hm = 0$, where $H$ denotes a contrast matrix (i.e., its rows sum to zero) of full row rank. In the present manuscript, however, we specify the hypothesis matrix in its projector form, that is, we use $T = H'(HH')^{+}H$. Thereby, $(HH')^{+}$ denotes the Moore-Penrose inverse of $HH'$. Note that $T$ is uniquely defined, symmetric, and idempotent. For example, the hypothesis of no group effect, $m_1 = \dots = m_a$, is equivalent to $Tm = 0$ with $T = P_a \otimes I_p$. Thereby, $P_a = I_a - \frac{1}{a}J_a$, where $J_a$ denotes the quadratic $a$-dimensional matrix containing only $1$'s. A more detailed discussion of various multifactorial and hierarchically nested designs, which are frequently used in practice, can be found in Konietschke et al. [2015].
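For illustration, the projector form of the hypothesis matrix can be verified numerically. The NumPy sketch below uses a hypothetical two-group contrast (an assumption for the example, not a matrix from the manuscript) and checks that $T = H'(HH')^{+}H$ is symmetric and idempotent, hence uniquely determined by the hypothesis.

```python
import numpy as np

# Illustrative contrast comparing a = 2 adjusted mean vectors of dimension
# p = 2: H tests equality of the two group mean vectors.
p = 2
H = np.kron(np.array([[1.0, -1.0]]), np.eye(p))  # p x (2p) contrast matrix

# Projector form T = H' (H H')^+ H, using the Moore-Penrose inverse.
T = H.T @ np.linalg.pinv(H @ H.T) @ H

# T is symmetric and idempotent, and its rank equals the row rank of H.
assert np.allclose(T, T.T)
assert np.allclose(T @ T, T)
```

Any contrast matrix spanning the same row space yields the same projector, which is the practical appeal of the projector formulation.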
Now, for testing the hypothesis $Tm = 0$, one may consider the Wald-type statistic (WTS)
(3) $Q_N = N\,\hat{m}'\,T\,(T\hat{\Sigma}_N T)^{+}\,T\,\hat{m},$
where $\hat{\Sigma}_N$ denotes the upper-left block of a block-diagonal covariance matrix estimator that is composed of group-wise White-type estimators. Observe that this covariance matrix estimator basically represents the multivariate generalization of the approach proposed by White [1980] for univariate regression models. Since the publication of this seminal work, several rescaled estimators have been proposed, with the aim of improving the performance in moderate and small samples [MacKinnon and White, 1985, Cribari-Neto, 2004]. Regarding the theorems stated in the present manuscript, modifications of this type are covered as well, because the scaling factors converge uniformly to 1. For example, in our simulation studies discussed in Section 4, we propose a multivariate generalization of the so-called HC4 version, which is obtained by multiplying the residual-based estimator with a subject-specific scaling factor, where $h_{ik}$ denotes the diagonal element of the hat matrix corresponding to individual $k$ in group $i$. This adjustment can be interpreted as a natural generalization of the univariate HC4 estimator proposed by Cribari-Neto [2004], because for $p = 1$, the covariance matrix estimator actually reduces to one single value (i.e., the estimator of the subject-specific variance in the univariate case), and the scaling factor does not depend on the dimension of the outcome.
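As a univariate illustration of this type of rescaling, the following Python sketch computes an HC4-style sandwich estimator for an ordinary regression. The data are simulated and the scaling exponent follows Cribari-Neto's univariate proposal; this is a sketch of the underlying idea, not the manuscript's multivariate estimator itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
# Heteroskedastic errors: spread depends on the first covariate.
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
# Diagonal of the hat matrix X (X'X)^{-1} X'.
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)

# HC4 rescaling of the squared residuals (Cribari-Neto, 2004):
# delta_i = min(4, h_i / mean(h)); omega_i = e_i^2 / (1 - h_i)^delta_i.
delta = np.minimum(4.0, n * h / k)
omega = resid**2 / (1 - h) ** delta

# Sandwich estimator: (X'X)^{-1} X' diag(omega) X (X'X)^{-1}.
XtX_inv = np.linalg.inv(X.T @ X)
V_hc4 = XtX_inv @ (X.T * omega) @ X @ XtX_inv
```

The leverage-dependent exponent inflates the weight of high-leverage residuals, which is what improves the small-sample behavior relative to the plain White (HC0) estimator.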
In order to derive the asymptotic distribution of the WTS under the null hypothesis, the following assumptions are required, where all convergences are understood for $N \to \infty$:

(M1) The errors $\epsilon_{ik}$ are independent, with $E(\epsilon_{ik}) = 0$, $Cov(\epsilon_{ik}) = \Sigma_i$, and uniformly bounded moments of sufficiently high order.
(M2) The covariance matrices $\Sigma_i$ are positive definite for all $i$.
(M3) $n_i / N \to \kappa_i \in (0, 1)$ for all $i$.
(M4) The columns of the covariate matrix are linearly independent of each other and of the columns of the design matrix.
(M5) The group-wise averages of the covariates converge for all $i$.
(M6) The group-wise empirical covariance matrices of the covariates converge for all $i$.
The following theorem establishes the basic result concerning the asymptotic distribution of the Wald-type statistic:
Theorem 1. Under assumptions (M1)–(M6) and the null hypothesis, the Wald-type statistic converges in distribution to a central chi-squared distribution with $\operatorname{rank}(T)$ degrees of freedom.
The assumptions (M1) and (M3) represent standard requirements in an asymptotic framework. Likewise, a violation of (M4) would mean that collinearity was present; in such a case, the validity of the results would be questionable anyway. Conditions (M5) and (M6) are most likely met in virtually any practical application, too, because an unstable average and/or varying spread of the covariates “in the long run” (i.e., as more and more subjects are enrolled) would indicate a serious flaw in the design or the conduct of the study. However, (M2) might be violated in a considerable number of practically relevant settings, as outlined in Section 1. This restricts the scope of potential applications of the asymptotic Wald-type test, since it requires positive definiteness of the covariance matrices. Therefore, we propose a different approach, allowing for possibly singular covariance matrices: Analogously to the modified ANOVA-type statistic (MATS) [Friedrich and Pauly, 2018], we consider a modified ANCOVA ANOVA-type statistic (MANCATS), which is defined as
(4) $Q_N^{M} = N\,\hat{m}'\,T\,(T\hat{D}_N T)^{+}\,T\,\hat{m}.$
Thereby, $\hat{D}_N$ denotes the diagonal matrix containing the group-specific variance estimators
(5) $\hat{\sigma}_{is}^2 = \frac{1}{n_i - 1} \sum_{k=1}^{n_i} \hat{\epsilon}_{iks}^2,$
where the sum runs over all subjects in group $i$,
and $\hat{\epsilon}_{iks}$ is the residual of outcome component $s$ of subject $k$ in group $i$. It should be noted that if no covariates were included in the model, the residuals would reduce to the centered observations, so that (5) would equal the empirical variance estimator used in Friedrich and Pauly [2018]. So, (5) is a natural generalization of the MATS approach to the MANCOVA setting. As already mentioned before, when using the diagonal matrix instead of the full covariance matrix estimator, assumption (M2) can be replaced by the weaker requirement

(M2′) $0 < \sigma_{is}^2 < \infty$ for all $i$ and $s$, that is, the variances of all components of the error vectors are positive and bounded.
This assumption is supposed to be met in a very broad range of practically relevant settings, excluding only cases where, for example, at least one component of the outcome vector is a discrete variable with very few distinct values.
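The practical benefit of this weaker requirement can be seen in a small numerical example: when one component of the outcome is a linear combination of the others (as for an overall score built from subscores), the empirical covariance matrix is singular, while the diagonal matrix of component variances remains invertible. A minimal NumPy sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 25
sub1 = rng.normal(size=n)
sub2 = rng.normal(size=n)
# Third component: overall score = sum of the two subscores.
Y = np.column_stack([sub1, sub2, sub1 + sub2])

S = np.cov(Y, rowvar=False)   # "full" empirical covariance matrix
D = np.diag(np.diag(S))       # diagonal matrix of component variances only

# S is singular (the third row/column is a linear combination of the
# first two), so a Wald-type statistic based on inverting S is not defined ...
assert np.linalg.matrix_rank(S) == 2
# ... whereas the diagonal matrix used in the MATS/MANCATS construction
# is invertible as long as all component variances are positive.
assert np.all(np.diag(D) > 0)
D_inv = np.linalg.inv(D)
```

This mirrors the overall-score and seizure-count examples from Section 1, where singularity arises by construction rather than by accident.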
Since the “full” covariance matrix estimator is replaced by the diagonal matrix $\hat{D}_N$ in the MANCATS, the asymptotic chi-squared limit distribution from Theorem 1 does not carry over. Nevertheless, there is still at least a formal result regarding the asymptotic distribution of the MANCATS.
Theorem 2. Under assumptions (M1), (M2′) and (M3)–(M6) and the null hypothesis, the MANCATS converges in distribution to a weighted sum of chi-squared random variables.
However, the weights in Theorem 2 cannot be calculated, because they represent the eigenvalues of a matrix that contains unknown quantities. Therefore, we will discuss two bootstrap-based approximations in the next section.
3 Bootstrapping the MANCATS
Firstly, we consider a so-called wild (or multiplier) bootstrap approach, which has been developed and put forward by the work of Wu [1986], Liu [1988] and Mammen [1993]. Wild bootstrap techniques have already been applied to Wald-type test statistics that are based on White-type covariance matrix estimation techniques in heteroskedastic univariate settings [Cribari-Neto, 2004, Zimmermann et al., 2019], in regression models with clustered errors [Cameron et al., 2008] and in multivariate factorial designs [Friedrich et al., 2017, Friedrich and Pauly, 2018], in order to improve the small-sample performance of the respective tests. In the present work, we use the following procedure: We generate i.i.d. random variables $W_{ik}$ independently of the data, with $E(W_{ik}) = 0$ and $Var(W_{ik}) = 1$. Then, the wild bootstrap observations are defined as
where $h_{ik}$ denotes the diagonal element of the hat matrix corresponding to individual $k$ in group $i$, and $\hat{\epsilon}_{ik}$ is the $p$-dimensional residual vector of individual $k$ in group $i$. The scaling factor $(1 - h_{ik})^{-1/2}$ has been introduced by Wu [1986] for the univariate case. Observe that we only generate one single random variable per subject, because coordinate-wise bootstrapping would potentially destroy the dependence structure within subjects. Moreover, it should be mentioned that our proof of Theorem 3 works for any particular choice of the multiplier distribution, as long as the fundamental moment conditions mentioned above are met.
Once the bootstrap observations have been generated, the bootstrap least squares estimator, the bootstrap covariance matrix estimator and, by plugging in the bootstrap versions instead of the original estimators in (4), the bootstrap analogue of the MANCATS test statistic are calculated. The following theorem states that using the conditional $(1 - \alpha)$ quantile of the empirical distribution of the wild bootstrap test statistic as the critical value indeed yields an asymptotically valid test.
Theorem 3.
For any parameter vector and any hypothesis of the considered form, we have that, given the data, the conditional distribution of the wild bootstrap MANCATS converges weakly, in probability, to the null limit distribution of the MANCATS, as $N \to \infty$, provided that (M1), (M2′) and (M3)–(M6) are fulfilled.
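To fix ideas, the following simplified Python sketch runs a wild bootstrap with Rademacher multipliers and Wu's leverage-based rescaling for a univariate regression. The MANCATS itself is replaced here by a simple slope deviation purely for brevity, so only the scaffold (one multiplier per subject, refitting, quantile extraction) carries over; all data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n, B = 40, 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta
resid = y - fitted
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)
scaled = resid / np.sqrt(1 - h)      # Wu's leverage correction

stats = np.empty(B)
for b in range(B):
    # One Rademacher multiplier per subject (P(w = +1) = P(w = -1) = 1/2);
    # in the multivariate case this preserves within-subject dependence.
    w = rng.choice([-1.0, 1.0], size=n)
    y_star = fitted + w * scaled
    beta_star = np.linalg.lstsq(X, y_star, rcond=None)[0]
    stats[b] = beta_star[1] - beta[1]  # bootstrap deviation of the slope

# The conditional (1 - alpha) quantile of the bootstrap statistics
# serves as the critical value of the test.
crit = np.quantile(np.abs(stats), 0.95)
```

In the manuscript's procedure, the loop body would compute the bootstrap MANCATS from the bootstrap observations instead of a single slope.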
Secondly, we propose a parametric bootstrap approach, which follows an idea similar to existing methods for multivariate analysis of variance models [Konietschke et al., 2015, Friedrich and Pauly, 2018]. Given the observed data, we impose a group-wise i.i.d. structure in the bootstrap world by drawing
where
Once the bootstrap observations have been generated, the parametric bootstrap version of the MANCATS is obtained analogously to the procedure for the wild bootstrap that was outlined above. Again, the bootstrap test is asymptotically valid, in complete analogy to Theorem 3.
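The drawing step can be sketched as follows in Python; the residuals are mock data and the plain empirical covariance matrix stands in for the manuscript's group-specific estimator.

```python
import numpy as np

rng = np.random.default_rng(7)
# Mock residuals of one group: n_i subjects, p = 2 outcome components.
n_i, p = 20, 2
resid = rng.normal(size=(n_i, p)) @ np.array([[1.0, 0.5],
                                              [0.0, 1.0]])

# Group-specific covariance matrix estimate obtained from the residuals.
Sigma_hat = np.cov(resid, rowvar=False)

# Bootstrap world: draw i.i.d. multivariate normal errors with the
# estimated covariance matrix; bootstrap outcomes are then formed by
# adding these errors to the fitted values of the original model.
eps_star = rng.multivariate_normal(np.zeros(p), Sigma_hat, size=n_i)
```

Unlike the wild bootstrap, the parametric bootstrap imposes normality within each group, but it still reflects the group-specific covariance structures.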
4 Simulations
In order to investigate the finite-sample performance of the three methods discussed in Sections 2 and 3, we conducted simulations for a broad range of settings, using R version 3.5.1 [R Development Core Team, 2018]. In addition to the three test statistics under consideration, we also included the Wilks’ Lambda test for comparison. We chose Rademacher variables as wild bootstrap weights, which means that, independently of the data, we generated i.i.d. random variables $W_{ik}$ with $P(W_{ik} = 1) = P(W_{ik} = -1) = 1/2$. These weights have already been applied successfully with respect to the performance of the corresponding tests in the univariate ANCOVA setting [Zimmermann et al., 2019]. For each scenario, the data generation process was repeated a large number of times, and bootstrap iterations were performed within each simulation run. We considered data from two groups, with a bivariate outcome (i.e., $p = 2$) and two fixed covariates for each subject. More precisely, the values of the first covariate were equally spaced over an interval, whereas the first and second half of the components of the second covariate vector were equally spaced over two respective intervals, sorted in descending order. In order to generate the error vectors, at first we drew a random sample of independent observations from one out of several distributions (normal, chi-squared(5), lognormal, double exponential). Then, we transformed them to standardized random variables and subsequently calculated the products
where $\Sigma_i^{1/2}$ denotes the matrix square root of the covariance matrix $\Sigma_i$ of group $i$. Doing so, it is ensured that the error vectors have expectation $0$ and covariance matrix $\Sigma_i$. For each of the aforementioned error distributions, we considered three different choices of the group-specific covariance matrices, namely settings (I), (II), and (III); observe that the covariance matrix in the latter setting is singular. Finally, the observations were obtained from model (1) with fixed regression coefficient vectors. In order to check whether these specifications are sensible, for some scenarios considered in this section, we exemplarily investigated the significance of the conditional associations between the covariates and each of the two outcomes in univariate linear regression models. More precisely, we simulated data for two groups of equal size, assuming lognormal or normal errors, and covariance matrix scenarios I and II. In each of the 4 settings, all simulation runs yielded univariate conditional associations between the components of the outcome and the covariates that were significant at the 5 percent level. Hence, when comparing the means between the groups, an adjustment for covariates is definitely warranted. For the singular scenario III, the regression coefficients were chosen such that they reflect the positive linear dependence between the two components of the outcome vector. For each of the 12 combinations of distribution and covariance structure, 4 different sample size scenarios were simulated. As mentioned before, for estimating the respective variances and covariances, we multiplied the residual vector of subject $k$ in group $i$ with an HC4-type scaling factor based on the diagonal element $h_{ik}$ of the hat matrix. The adjustment was carried out for all variance and covariance estimators that were used in the three test statistics under consideration. At first, we investigated the empirical type I error rates; without loss of generality, the adjusted mean vectors were set equal across groups. The results are displayed in Table 1.
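The error-generation recipe of this section (draw i.i.d. variables, standardize them, then multiply by the matrix square root of the target covariance matrix) can be sketched as follows. The target matrix is illustrative, and standard normal draws are used in place of the four distributions of the study:

```python
import numpy as np

rng = np.random.default_rng(3)
n_i, p = 5000, 2
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])   # illustrative target covariance matrix

# Symmetric matrix square root via the eigendecomposition of Sigma.
vals, vecs = np.linalg.eigh(Sigma)
Sigma_sqrt = vecs @ np.diag(np.sqrt(vals)) @ vecs.T

# Draw i.i.d. variables, standardize to sample mean 0 and variance 1,
# then transform so that the error vectors have covariance matrix Sigma.
raw = rng.normal(size=(n_i, p))
std = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
eps = std @ Sigma_sqrt            # each row is one error vector

# Sanity check: the empirical covariance is close to the target.
assert np.allclose(np.cov(eps, rowvar=False), Sigma, atol=0.15)
```

For the non-normal distributions of the study, only the `raw` draws change; the standardization and transformation steps stay the same.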
[Table 1: Empirical type I error rates of the Wilks’ Lambda test (WI), the asymptotic Wald-type test (WT), and the wild bootstrap (MW) and parametric bootstrap (MP) MANCATS, for covariance settings I–III and errors following a standard normal, chi-squared(5), standard lognormal, or double exponential distribution.]
The Wilks’ Lambda test showed its well-known conservative behavior in case of positive pairing (i.e., the smallest group has the “smallest” variance) and, on the other hand, exceeded the target 5 percent level if negative pairing (i.e., the smallest group has the “largest” variance) was present. Although the asymptotic Wald-type test might be considered as a remedy, it tended to be either liberal (covariance settings I and II) or conservative (covariance setting III), even under standard assumptions (i.e., normality and homoskedasticity). As indicated by the results displayed in Table S3 in the supplementary material, quite large sample sizes are required in order to achieve accurate type I error control for the Wald-type test. By contrast, the wild bootstrap version of the MANCATS showed a good performance also in small samples, yet it was somewhat more liberal than the parametric bootstrap MANCATS, which maintained the target level very well in all cases. Neither for the asymptotic nor for the bootstrap approaches did positive or negative pairing (i.e., setting II with unbalanced group sizes) affect the results substantially, which could be expected, because the White estimator and its modified versions were designed as “heteroskedasticity-consistent” estimators [White, 1980, MacKinnon, 2012]. However, the results might depend on the error distribution, since all approaches yielded slightly smaller type I error rates in the lognormal case compared to the other scenarios.
Secondly, we simulated the empirical power of the three tests under consideration by staying with the setup described at the beginning of this section, but introducing a shift between the adjusted mean vectors of the two groups. For the singular setting III, the shift was chosen such that it reflects the positive linear dependence between the components of the outcome vector. Although it was difficult to find scenarios where all approaches provided sufficient type I error control, we tried to cover different distributions and covariance structures. The results are displayed in Figure 1. Additionally, in order to account for differences in the type I error rates, we provide a plot of the “achieved power” (i.e., the difference between empirical power and the scenario-specific type I error rate) in the supplementary material (Figure S1). The two bootstrap MANCATS approaches showed a similar performance, yet they had lower power than the Wald-type test. It should be noted, however, that in case of an underlying normal distribution, the Wald-type approach tended to be liberal (Table 1). Therefore, the power comparison must be interpreted with caution. Still, in the lognormal case, there was a remarkable difference, especially for the homoskedastic setting. However, for the two lognormal scenarios and commonly used target power values, the average power loss of the bootstrap methods compared to the asymptotic Wald-type approach amounted to only a few percentage points. The average decrease in achieved power was even smaller. So, summing up, we admit that some caution is needed when applying our proposed bootstrap methods in cases with potentially skewed errors; nevertheless, the price to pay in terms of power might still be acceptable, given the generality of the MANCATS approach. In case of singular covariance matrices, the Wald-type approach should not be used at all, because it had considerably lower power (Figure
2). In these scenarios, the bootstrap-based tests showed a similar performance, with the wild bootstrap approach being somewhat more powerful, which might be due to its slightly liberal behavior (see Figure S2 in the supplementary material).
[Table 2: Empirical type I error rates for four groups and a bivariate outcome (columns WI, WT, MW, MP), for covariance settings I–III under normal, chi-squared(5), lognormal, and double exponential errors.]
Furthermore, we conducted several additional simulations, with the aim of investigating whether or not our main findings change substantially under an even broader range of settings. Firstly, all simulation parameters were specified as described above, except that the number of groups was increased to four, with four different group size settings. The results are reported in Table 2. Compared to the results for the two-group setting, the Wald-type test was even slightly more liberal for covariance matrix scenarios I and II, whereas the wild bootstrap MANCATS yielded type I error rates that were closer to the target 5 percent level. The parametric bootstrap MANCATS again performed well in all scenarios. The most prominent finding was, however, that the conservative behavior of the Wald-type test was even more severe than for two groups, with very low empirical type I error rates for some scenarios. By contrast, the MANCATS-based approaches were close to the target 5 percent level.
Secondly, we stayed with the two-group setting, but increased the number of dimensions to $p = 4$ and modified the three covariance matrix settings I–III accordingly, with setting III again being singular. The vectors modelling the associations between a particular covariate and the components of the outcome were adapted to the increased dimension, and for the singular setting III, they were again chosen such that they reflect the linear dependence. Apart from that, all other specifications were the same as before. From Table 3, it is obvious that in particular the liberality of Wilks’ Lambda and the asymptotic Wald-type test was more pronounced than for $p = 2$ dimensions. By contrast, the parametric bootstrap MANCATS again yielded type I error rates that were very close to the prespecified 5 percent level. Moreover, interestingly, the conservative behavior of the asymptotic Wald-type test in case of singular covariance matrices (scenario III) was slightly ameliorated. When the dimension was increased even further, the test even became liberal in small and unbalanced group size settings, yet it was quite close to the target level for larger samples (see Table S1 in the supplementary material). These results might indicate that for very small sample sizes, the asymptotic Wald-type test becomes more and more liberal for higher dimensions, but performs reasonably well for moderate sample sizes. Anyway, the parametric bootstrap MANCATS showed very good type I error control regardless of the dimension.
Finally, in order to examine whether the results are sensitive to the choice of the covariates, we considered the aforementioned settings for two groups and a bivariate outcome once more, but replaced the values of the second covariate by a random sample from either the standard normal or the standard lognormal distribution, respectively. The results of the simulations for the homoskedastic singular setting III are displayed in Table S2 in the supplementary material. Compared to the corresponding part of Table 1, the results were similar. The parametric bootstrap MANCATS appeared to perform even slightly better, whereas the asymptotic Wald-type test was somewhat more conservative.
[Table 3: Empirical type I error rates for two groups and a four-dimensional outcome (columns WI, WT, MW, MP), for covariance settings I–III under normal, chi-squared(5), lognormal, and double exponential errors.]