Consider a randomized clinical trial, where each subject is randomized to one out of groups. Frequently, multiple outcomes (e.g., one primary and one co-primary outcome, several secondary outcomes) need to be compared between the groups. One straightforward way of accomplishing this goal would be to conduct univariate tests (e.g., tests) and adjust for multiplicity. Nevertheless, this approach has the drawback of not taking potential correlations between the outcomes (e.g., due to physiological reasons, there might be correlations between laboratory measurements) into account. Moreover, when analyzing global assessment variables, further safety and efficacy outcomes, or objective variables related to the global assessment, should be analyzed [International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use, 1998]
. In these instances, employing a multivariate analysis of variance model would be an attractive option. Still, however, the estimators of the treatment effects might be biased, and the power of the test might be low, if covariates (i.e., variables that are thought to influence the outcomes) are not included in the model. Therefore, a multivariate analysis of covariance (MANCOVA) model might be a better choice. Indeed, MANCOVA models have been used in several recent publications in medical research [Roldan-Valadez et al., 2013, Jackson et al., 2018, Setyowibowo et al., 2018, Tournikioti et al., 2018] and psychology [Memarmoghaddam et al., 2016, Lyndon et al., 2017, Freire et al., 2018, Hyams et al., 2018].
In those papers, the “classical” multivariate tests, which are based on the determinant or the eigenvalues of certain matrices (e.g., Wilks’ Lambda, Roy’s largest root test, see Rencher and Christensen , Anderson 
), were used. Analogously to their univariate counterparts, they rely on the assumption that the error vectors are i.i.d. multivariate normal, with expectationand identical covariance matrix across groups (i.e., homoskedasticity). However, assessing multivariate normality and homoskedasticity is difficult, in particular when the sample sizes are small. Moreover, even with reasonable sample sizes, there might be good reasons for assuming non-normal errors or heterogeneous covariance matrices. In contrast to the classical multivariate tests discussed in standard textbooks [Rencher and Christensen, 2002, Huitema, 2011], there are only few methods available that are potentially applicable in a more general setting. Recently, Fan and Zhang  have proposed a rank repeated measures analysis of covariance method; however, only limited simulation evidence regarding the performance in small and possibly unbalanced sample size settings is provided. Apart from that, despite the very broad applicability of the approach of Fan and Zhang , which renders this method useful for analyzing ordinal data, too, it cannot be regarded as generalizing the scope of classical mean-based MANCOVA because adjusted mean vectors are not used as effect measures. As an alternative, the sandwich variance estimation technique that was proposed in the context of regression models with clustered errors [Arellano, 1987] and later examined in combination with various bootstrap procedures [Cameron et al., 2008] stays within a semiparametric framework, but allows for heteroskedasticity. Although this considerably extends the scope of applications beyond the classical assumptions that were mentioned before, some requirements might still be crucial in a considerable number of practically relevant cases:
Firstly, the performance in MANCOVA models with possibly small and unequal group sizes combined with heteroskedasticity and/or non-normality has not been sufficiently examined so far. Secondly, the model under consideration in the cluster-robust estimation approach [Cameron et al., 2008] assumes that the regression parameter vector is one and the same across all subjects and clusters. In the context of regression with clustered errors, this is perfectly fine. However, when being translated to the MANCOVA setting, this means that the model does not allow for regression coefficients, which might vary between the respective coordinates of the outcome vector. But, most likely, the association between a particular covariate and the outcomes might not be the same across all components of the outcome. Thirdly, from a theoretical perspective, it should be noted that the existing methods are based on Wald-type statistics and, therefore, require the covariance matrices to be positive definite. Nevertheless, this assumption might be violated in several ways:
Especially in psychology, but also in medical research (e.g., quality of life assessments [Hyams et al., 2018, Setyowibowo et al., 2018]), groups of subjects are often compared with respect to a score. Provided that it is sensible to interpret the score as a metric variable – which requires careful thoughts case-by-case – employing an ANCOVA model would be appropriate. Most likely, a multivariate ANCOVA would be used, in order to allow for between-group comparisons of, for example, the overall score (i.e., a linear combination of the sub-scores) as well as the subscales on which the score is based.
However, the covariance matrix would be singular, then. A similar problem might arise in epilepsy research: Consider the situation of examining whether a particular antiepileptic drug is more efficient than placebo with respect to reducing the number of seizures compared to baseline. In this setting, the change from baseline represents the outcome, and the baseline seizure frequency is regarded as a covariate. However, since it might be sensible to consider not only the overall seizure frequency, but also the reduction in the number of focal and generalized seizures, respectively, we would have a trivariate outcome, and again, the covariance matrix would be singular, then. Apart from singularity due to linear dependencies between the components of the outcome vector, computational issues can also lead to singular covariance structures.
In order to overcome the potential problems mentioned above, we propose a modified ANCOVA ANOVA-type statistic (MANCATS), which is inspired by the modified ANOVA-type statistic proposed for heteroskedastic multivariate analysis of variance without covariates [Friedrich and Pauly, 2018]
. The main feature of this test statistic is that the “full” covariance matrix estimator, which is used in the Wald-type statistic, is replaced by a diagonal matrix that contains only the group-specific variances on the diagonal. So, the “full” covariance matrix estimator is not used, yet not being removed completely, as in the ANOVA-type approach (see, for example,Brunner et al. ). Consequently, our proposed method shares the advantages of both the Wald-type and the ANOVA-type statistic, since the MANCATS is invariant with respect to the scales of the outcome variables, yet not requiring the assumption of positive definite covariance matrices within the groups. Moreover, it should be noted that, in contrast to repeated measures ANCOVA models, the outcomes are allowed to be measured on different scales. In order to approximate the distribution of the MANCATS and improve its finite-sample performance, we apply two different bootstrap procedures.
The present manuscript is organized as follows: In Section 2, we introduce some notations and set up the model as well as the assumptions that are required to ensure the asymptotic validity of the multivariate Wald-type statistic and the two bootstrap MANCATS approaches, respectively. Actually, it turns out that those assumptions are quite weak, which renders our proposed method applicable to a broad range of practically relevant settings. Then, in Section 3, we describe how the two proposed bootstrap procedures work. The finite-sample performance in terms of type I error rates and power is investigated in an extensive simulation study discussed in Section 4. In order to further substantiate the findings of the previous section and to illustrate the aforementioned practical applicability, simulations that are based on real-life data from standardized student achievement tests are discussed in Section 5. Finally, Section 6 contains some closing remarks and ideas for future research. The proofs of the theorems stated in the main body of the manuscript as well as additional simulation results are included in the supplementary material.
2 The general multivariate analysis of covariance (MANCOVA) model
In the sequel, let denote the
-dimensional identity matrix, and letand denote the direct sum and the Kronecker product of matrices, respectively. Let and denote the - and -dimensional outcome and covariate vectors of subject in group , ,
. We assume that the covariates are fixed, whereas the outcomes are random variables, satisfying the model equation
Thereby, denotes the vector of adjusted means in group , and , where contains the regression coefficients modelling the association between the -th covariate and the components of the outcome, . It should be noted that model (1) allows for unequal regression coefficients for different components of the outcome. However, the coefficients are assumed to be the same across groups, which corresponds to the classical assumption of equal regression slopes in the simple univariate ANCOVA setting. Regarding the errors, we assume that the are independent, with and group-specific covariance matrix , .
where , and . So, in particular,
can be estimated by the ordinary least squares estimator, that is,
where the dot notation indicates averaging over all subjects in the respective group, and denotes the OLS estimator of , . We are interested in testing hypotheses about the adjusted mean vectors of the form , where denotes a contrast matrix (i.e.,) of full row rank . In the present manuscript, however, we specify the hypothesis matrix in its projector form, that is, we use . Thereby, denotes the Moore-Penrose inverse of . Note that is uniquely defined, and . For example, the hypothesis is equivalent to , with . Thereby, , where denotes the quadratic -dimensional matrix containing all ’s. A more detailed discussion of various multi-factorial and hierarchically nested designs, which are frequently used in practice, can be found in Konietschke et al. .
Now, for testing the hypothesis , one may consider the Wald-type statistic (WTS)
where denotes the upper-left block of the block diagonal matrix . Thereby, , and (i.e., is a block-diagonal matrix with matrices ,, on the diagonal), and , , , . Observe that this covariance matrix estimator basically represents the multivariate generalization of the approach proposed in White  for univariate regression models. Since the publication of this seminal work, several re-scaled estimators have been proposed, with the aim of improving the performance in moderate and small samples [MacKinnon and White, 1985, Cribari-Neto, 2004]. Regarding the theorems stated in the present manuscript, modifications of this type are covered as well, because the scaling factors converge uniformly to . For example, in our simulation studies discussed in Section 4, we propose a multivariate generalization of the so-called HC4 version of , by multiplying the latter with , , where denotes the diagonal element of the hat matrix corresponding to individual in group , . Thereby, . This adjustment can be interpreted as a natural generalization of the univariate HC4 estimator, which was proposed by Cribari-Neto [Cribari-Neto, 2004] because for , the covariance matrix estimator actually reduces to one single value (i.e., the estimator of the subject-specific variance in the univariate case), and the scaling factor defined above does not depend on the dimension of the outcome.
In order to derive the asymptotic distribution of under , the following assumptions are required, where all convergences are understood for :
The errors are independent, with , , and , uniformly for all .
for all .
For all .
The columns of are linearly independent of each other and of the columns of .
For all .
For all .
The following theorem establishes the basic result concerning the asymptotic distribution of the Wald-type statistic :
The assumptions (M1) and (M3) represent standard requirements in an asymptotic framework. Likewise, a violation of (M4) would mean that collinearity was present; in such a case, the validity of the results would be questionable anyway. Conditions (M5) and (M6) are most likely met in virtually any practical application, too, because an unstable average and/or varying spread of the covariates “on the long run” (i.e., as more and more subjects are enrolled) would indicate a serious flaw in the design or the conduct of the study. However, (M2) might be violated in a considerable number of practically relevant settings, as outlined in Section 1. This restricts the scope of potential applications of the asymptotic Wald-type test statistic, since it requires positive definiteness. Therefore, we propose a different approach, allowing for possibly singular covariance matrices: Analogously to the modified ANOVA-type statistic (MATS) [Friedrich and Pauly, 2018], we consider a modified ANCOVA ANOVA-type statistic (MANCATS), which is defined as
and is the residual of outcome in group . It should be noted that if no covariates were included in the model, we would have , resulting in . Consequently, would be equal to the empirical variance estimator, which was used in Friedrich and Pauly . So, (5) is a natural generalization of the MATS approach to the MANCOVA setting. As already mentioned before, when using instead of , assumption (M2) can be replaced by the weaker requirement
and , for all and .
This assumption is supposed to be met in a very broad range of practically relevant settings, excluding only cases where, for example, at least one component of the outcome vector is a discrete variable with very few distinct values.
Since the “full” covariance matrix estimator is replaced by in the MANCATS, the asymptotic chi-squared limit distribution from Theorem 1 will not hold any more. Nevertheless, there is still at least a formal result regarding the asymptotic distribution of .
However, the weights in Theorem 2 cannot be calculated, because they represent the eigenvalues of a matrix that contains unknown quantities. Therefore, we will discuss two bootstrap-based approximations in the next section.
3 Bootstrapping the MANCATS
Firstly, we consider a so-called wild (or multiplier) bootstrap approach, which has been developed and put forward by the work of Wu [Wu, 1986], Liu [Liu, 1988] and Mammen [Mammen, 1993]. Wild bootstrap techniques have already been applied to Wald-type test statistics that are based on White-type covariance matrix estimation techniques in heteroskedastic univariate settings [Cribari-Neto, 2004, Zimmermann et al., 2019], in regression models with clustered errors [Cameron et al., 2008] and in multivariate factorial designs [Friedrich et al., 2017, Friedrich and Pauly, 2018], in order to improve the small-sample performance of the respective tests. In the present work, we use the following procedure: We generate i.i.d. random variables independently from the data, with , , and , . Then, the wild bootstrap observations are defined as
where denotes the diagonal element of the hat matrix corresponding to individual in group , and is the -dimensional residual vector of individual in group , , . The scaling factor has been introduced by Wu  in case of . Observe that we only generate one single random variable per subject, because coordinate-wise bootstrapping would potentially destroy the dependence structure within subjects. Moreover, it should be mentioned that our proof of Theorem 3 works for any particular choice of
, as long as the fundamental moment conditions mentioned above are met.
Once the bootstrap observations have been generated, the bootstrap least squares estimator , the bootstrap covariance matrix estimator and, by plugging in the bootstrap versions instead of the original estimators in (4), the bootstrap analogon of the MANCATS test statistic are calculated. The following theorem states that using the conditional
-quantile of the empirical distribution of the wild bootstrap test statisticas the critical value indeed yields an asymptotically valid test.
Secondly, we propose a parametric bootstrap approach, which follows an idea that is similar to existing methods for multivariate analysis of variance models [Konietschke et al., 2015, Friedrich and Pauly, 2018]. Given the observed data, we impose a group-wise i.i.d. structure in the bootstrap world by drawing
Once the bootstrap observations have been generated, the parametric bootstrap version of the MANCATS is obtained analogously to the procedure for the wild bootstrap that was outlined above. Again, the bootstrap test is asymptotically valid:
In order to investigate the finite-sample performance of the three methods, which have been discussed in Sections 2 and 3, we have conducted simulations for a broad range of settings, using R version 3.5.1 [R Development Core Team, 2018]. In addition to the three test statistics under consideration, we also included the Wilks’ Lambda test for comparison. We chose Rademacher variables as wild bootstrap weights, which means that independently from the data, we generated i.i.d. random variables with . These weights have already been applied successfully with respect to the performance of the corresponding tests in the univariate ANCOVA setting [Zimmermann et al., 2019]. For each scenario, the data generation process was repeated times, and within each simulation run, bootstrap iterations were performed. We considered data from groups, with a bivariate outcome (i.e., ) and fixed covariates for each subject. More precisely, the values of the first covariate were equally spaced between and , whereas the first and second half of the components of the second covariate vector were equally spaced in and , respectively, sorted in descending order. In order to generate the error vectors , at first we drew a random sample of independent observations from one out of several distributions (normal, , lognormal, double exponential). Then, we transformed them to standardized random variables and subsequently calculated the products
where denotes the matrix square root of the covariance matrix of group , and , . Doing so, it is ensured that and . For each of the aforementioned error distributions, we considered three different choices of and , namely (I), (II), and (III). Observe that the latter covariance matrix is singular. Finally, the observations were obtained as , where
. In order to check whether these specifications are sensible, for some scenarios considered in this section, we exemplarily investigated the significance of the conditional associations between the covariates and each of the two outcomes in univariate linear regression models. Exactly speaking, we simulated data for two groups of sizeeach, assuming lognormal or normal errors, and covariance matrix scenarios I and II. In each of the 4 settings, all simulation runs yielded univariate conditional associations between the components of the outcome and the covariates that were significant at the 5 percent level. Hence, when comparing the means between the groups, an adjustment for covariates is definitely warranted. For the singular scenario III, we set , in order to reflect the positive linear dependence between the two components of the outcome vector (observe that ). For each of the 12 combinations of distribution and covariance-structure, 4 different sample size scenarios were simulated. As mentioned before, for estimating the respective variances and covariances, we multiplied the residual vector of subject in group with . Thereby, , where denotes the diagonal element of the hat matrix corresponding to individual in group , . The adjustment was carried out for all variance and covariance estimators that were used in the three test statistics under consideration.
At first, we investigated the empirical type I error rates. Without loss of generality, we set , . The results are displayed in Table LABEL:Table:TypeIbasic.
, standard lognormal, or double exponential distribution.
The Wilks’ Lambda test showed its well-known conservative behavior in case of positive pairing (i.e., the smallest group has the “smallest” variance) and, on the other hand, exceeded the target 5 percent level if negative pairing (i.e., the smallest group has the “largest” variance) was present. Although the asymptotic Wald-type test might be considered as a remedy, it tended to be either liberal (covariance settings I and II) or conservative (covariance setting III), even under standard assumptions (i.e., normality and homoskedasticity). As indicated by the results that are displayed in Table S3 in the supplementary material, quite large sample sizes are required in order to achieve accurate type I error control for the Wald-type test. By contrast, the wild bootstrap version of the MANCATS showed a good performance also in small samples, yet being more liberal than the parametric bootstrap MANCATS, which maintained the target level very well in all cases. For neither the asymptotic nor the bootstrap approaches, positive and negative pairing (i.e., setting II with and , respectively) affected the results substantially, which could be expected due to the fact that the White estimator and its modified versions have been intended to be “heteroskedasticity-consistent” estimators [White, 1980, MacKinnon, 2012]. However, the results might be dependent on the error distribution, since we see that all approaches yielded slightly smaller type I error rates in the lognormal case compared to the other scenarios.
Secondly, we simulated the empirical power of the three tests under consideration by staying with the setup that was described at the beginning of this section, but specifying and , with . For the singular setting III, we set , in order to reflect the positive linear dependence between the components of the outcome vector. Although it was difficult to find scenarios where all approaches provided sufficient type I error level control, we tried to cover different distributions and covariance structures. The results are displayed in Figure 1. Additionally, in order to account for differences in the type I error rates, we provide a plot of the “achieved power” (i.e.
, the difference between empirical power and the scenario-specific type I error rate) in the supplementary material (Figure S1). The two bootstrap MANCATS approaches showed a similar performance, yet having lower power than the Wald-type test. It should be noted, however, that in case of an underlying normal distribution, the Wald-type approach tended to be liberal (TableLABEL:Table:TypeIbasic). Therefore, the power comparison must be interpreted with caution. Still, in the lognormal case, there was a remarkable difference especially for the homoskedastic setting. However, for the two lognormal scenarios and commonly used target power values around and percent (i.e., ), the average power loss of the bootstrap methods compared to the asymptotic Wald-type approach was about percentage points. The average decrease in achieved power was even lower (about
percentages points). So, summing up, we admit that some caution is needed when applying our proposed bootstrap methods in cases with potentially skewed errors; nevertheless, the price to pay in terms of power might still be acceptable, given the generality of the MANCATS approach. In case of singular covariance matrices, the Wald-type approach should not be used at all, because it had considerably lower power (Figure2). In these scenarios, the bootstrap-based tests showed a similar performance, with the wild bootstrap approach being somewhat more powerful, which might be due to its slightly liberal behavior (see Figure S2 in the supplementary material).
Furthermore, we conducted several additional simulations, with the aim of investigating whether or not our main findings change substantially under an even broader range of settings. Firstly, all simulation parameters were specified as described above, except from increasing the number of groups to , with group size settings , , , and . The results are reported in Table LABEL:Table:TypeI4gr2dim. Compared to the results for the two-group setting, the Wald-type test was even slightly more liberal for covariance matrix scenarios I and II, whereas the wild bootstrap MANCATS yielded type I error rates that were closer to the target 5 percent level. The parametric bootstrap MANCATS again performed well in all scenarios. The most prominent finding was, however, that the conservative behavior of the Wald-type test was even more severe than for groups, with very
low empirical type I error rates for some scenarios. By contrast, the MANCATS-based approaches were close to the target 5 percent level.
Secondly, we stayed with the two-group setting, but increased the number of dimensions to . We modified the covariance matrices accordingly, so, I: , II: , and III: , where , , . The vectors modelling the associations between a particular covariate and the components of the outcome were set to and , respectively. For the singular setting III, we set and . Apart from that, all other specifications were the same as before. From Table LABEL:Table:TypeI2gr4dim, it is obvious that in particular the liberality of Wilks’ Lambda and the asymptotic Wald-type test was more pronounced than for dimensions. By contrast, the parametric bootstrap MANCATS again yielded type I error rates that were very close to the pre-specified 5 percent level. Moreover, interestingly, the conservative behavior of the asymptotic Wald-type test in case of singular covariance matrices (scenario III) was slightly ameliorated. When further increasing the dimension to , the test even became liberal in small and unbalanced group size settings, yet being quite close to the target level for (see Table S1 in the supplementary material). These results might indicate that for very small sample sizes, the asymptotic Wald-type test is getting more and more liberal for higher dimensions, but performs reasonably well for moderate sample sizes. Anyway, the parametric bootstrap MANCATS showed very good type I error control regardless of the dimension .
Finally, in order to examine whether the results are sensitive to the choice of the covariates, we considered the aforementioned settings for dimension and groups once more, but replaced the values of the second covariate by a random sample from either the standard normal or the standard lognormal distribution, respectively. The results of the simulations for the homoskedastic singular setting III are displayed in Table S2 in the supplementary material. Compared to the corresponding part of Table LABEL:Table:TypeIbasic, the results were similar. The parametric bootstrap MANCATS appeared to perform even slightly better, whereas the asymptotic Wald-type test was somewhat more conservative.