To estimate a regression when the errors have a non-identity covariance matrix, we usually turn first to generalized least squares (GLS). Somewhat surprisingly, GLS proves to be computationally challenging in the very simple setting of the unbalanced crossed random effects models that we study here. For that problem, the cost to compute the GLS estimate on $N$ data points grows at best like $N^{3/2}$ under the usual algorithms. If we additionally assume Gaussian errors, then Gao and Owen (2019) show that even evaluating the likelihood one time costs at least a multiple of $N^{3/2}$. These costs make the usual algorithms for GLS infeasible for large data sets such as those arising in electronic commerce.
In this paper, we present an iterative algorithm based on a backfitting approach from Buja et al. (1989). This algorithm is known to converge to the GLS solution. The cost of each iteration is $O(N)$ and so we also study how the number of iterations grows with $N$.
The crossed random effects model we consider has
$$Y_{ij} = x_{ij}^\top\beta + a_i + b_j + e_{ij}$$
for random effects $a_i$ and $b_j$ and an error $e_{ij}$, with a fixed effects regression parameter $\beta \in \mathbb{R}^p$ for the covariates $x_{ij} \in \mathbb{R}^p$. We assume that the $a_i$, $b_j$, and $e_{ij}$ are all independent, with variances $\sigma_A^2$, $\sigma_B^2$ and $\sigma_E^2$ respectively. It is thus a mixed effects model in which the random portion has a crossed structure. The GLS estimate is also the MLE, when $a_i$, $b_j$ and $e_{ij}$ are Gaussian. Because we assume that $p$ is fixed as $N$ grows, we often leave $p$ out of our cost estimates, giving instead the complexity in $N$.
The GLS estimate for crossed random effects can be efficiently computed if all $R\times C$ values $Y_{ij}$ are available. Our motivating examples involve ratings data where $R$ people rate $C$ items, and then it is usual that the data are very unbalanced, with a haphazard observational pattern in which only $N \ll RC$ of the $(i,j)$ pairs are observed. The crossed random effects setting is significantly more difficult than a hierarchical model with just $a_i$ but no $b_j$ term. Then the observations for index $j$ are ‘nested within’ those for each level of index $i$. The result is that the covariance matrix of all observed $Y_{ij}$ values has a block diagonal structure allowing GLS to be computed in $O(N)$ time.
Hierarchical models are very well suited to Bayesian computation (Gelman and Hill, 2006). Crossed random effects are a much greater challenge. Gao and Owen (2017) find that the Gibbs sampler can take $O(N^{1/2})$ iterations to converge to stationarity, with each iteration costing $O(N)$, leading once again to an $O(N^{3/2})$ cost. For more examples where the costs of Bayesian and frequentist approaches have the same rate, see Fox (2013). Other Markov chain Monte Carlo (MCMC) algorithms studied in Gao and Owen (2017) had similar problems. As further evidence of the difficulty of this problem, the Gibbs sampler was one of nine MCMC algorithms that Gao and Owen (2017) found to be unsatisfactory, and Bates et al. (2015) removed the MCMC option from the R package lme4 because it was considered unreliable.
Papaspiliopoulos et al. (2020) present an exception to the high cost of a Bayesian approach for crossed random effects. They propose a collapsed Gibbs sampler that can potentially mix in $O(1)$ iterations. To prove this rate, they make an extremely stringent assumption that every row index $i$ appears in the same number $N/R$ of observed data points and similarly every column index $j$ appears in $N/C$ of them. Such a condition is tantamount to requiring a designed experiment for the data and it is much stronger than what their algorithm seems to need in practice. Under that condition their mixing rate asymptotes to a quantity, described in our discussion section, that in favorable circumstances is bounded away from one. They find empirically that their sampler has a cost that scales well in many data sets where their balance condition does not hold.
In this paper we study an iterative linear operation, known as backfitting, for GLS. Each iteration costs $O(N)$. The speed of convergence depends on a certain matrix norm of that iteration, which we exhibit below. If the norm remains bounded strictly below $1$ as $N\to\infty$, then the number of iterations to convergence is $O(1)$. We are able to show that the matrix norm is bounded strictly below $1$
with probability tending to one, under conditions where the number of observations per row (or per column) is random and even the expected row or column counts may vary, though in a narrow range. While this is a substantial weakening of the conditions in Papaspiliopoulos et al. (2020), it still fails to cover many interesting cases. Like them, we find empirically that our algorithm scales much more broadly than under the conditions for which scaling is proved.
We suspect that the computational infeasibility of GLS leads many users to use ordinary least squares (OLS) instead. OLS has two severe problems. First, it is inefficient, with $\mathrm{Var}(\hat\beta_{\mathrm{OLS}})$ larger than $\mathrm{Var}(\hat\beta_{\mathrm{GLS}})$. This is equivalent to OLS ignoring some possibly large fraction of the information in the data. Perhaps more seriously, OLS is naive. It produces an estimate of $\mathrm{Var}(\hat\beta_{\mathrm{OLS}})$ that can be too small by a large factor. That amounts to overestimating the quantity of information behind $\hat\beta_{\mathrm{OLS}}$, also by a potentially large factor.
The naivete of OLS can be countered by using better variance estimates. One can bootstrap it by resampling the row and column entities as in Owen (2007). There is also a version of Huber-White variance estimation for this case in econometrics. See for instance Cameron et al. (2011). While these methods counter the naivete of OLS, the inefficiency of OLS remains.
The algorithm in Gao and Owen (2019) gets consistent asymptotically normal estimates of $\sigma_A^2$, $\sigma_B^2$, and $\sigma_E^2$. The proof for asymptotic normality of the variance component estimates is in the dissertation of Gao (2017). The estimate $\hat\beta$ in Gao and Owen (2019) was from a GLS model that accounted only for $e_{ij}$ and just one of $a_i$ or $b_j$, and so by the Gauss-Markov theorem it was not efficient. The estimate of $\mathrm{Var}(\hat\beta)$ in that paper did take account of all three variance components, so it was not naive. In this paper we get a GLS estimate $\hat\beta_{\mathrm{GLS}}$ that takes account of all three variance components as well as an estimate of $\mathrm{Var}(\hat\beta_{\mathrm{GLS}})$ that accounts for all three, so our estimate is efficient and not naive.
The rest of this paper is organized as follows. Section 2 introduces our notation and assumptions for missing data. Section 3 presents the backfitting algorithm from Buja et al. (1989). That algorithm was defined for smoothers, but we are able to cast the estimation of random effect parameters as a special kind of smoother. Section 4 proves our result about backfitting being convergent with a probability tending to one as the problem size increases. Section 5 shows numerical measures of the matrix norm of the backfitting operator. It remains bounded below one, and away from it, under more conditions than our theory covers. We find that the lmer function in the lme4 package (Bates et al., 2015) has a cost that grows superlinearly in $N$, at different rates in a denser and a sparser setting. The backfitting algorithm has cost $O(N)$ in both of these cases. Section 6 illustrates our GLS algorithm on some data provided to us by Stitch Fix. These are customer ratings of items of clothing on a ten point scale. Section 7 has a discussion of these results.
We adopt the notation from Gao and Owen (2019). We let $Z_{ij}$ take the value $1$ if $(i,j)$ is observed and $0$ otherwise, for $i = 1,\dots,R$ and $j = 1,\dots,C$. In many of the contexts we consider, the missingness is not at random and is potentially informative. Handling such problems is outside the scope of this paper, apart from a brief discussion in Section 7. It is already a sufficient challenge to work without informative missingness.
The matrix $Z \in \{0,1\}^{R\times C}$, with elements $Z_{ij}$, has $N_{i\bullet} = \sum_j Z_{ij}$ observations in ‘row $i$’ and $N_{\bullet j} = \sum_i Z_{ij}$ observations in ‘column $j$’. We often drop the limits of summation so that $i$ is always summed over $1,\dots,R$ and $j$ over $1,\dots,C$. When we need additional symbols for row and column indices we use $r$ for rows and $s$ for columns. The total sample size is $N = \sum_i\sum_j Z_{ij}$.
There are two co-observation matrices, $Z^\top Z$ and $ZZ^\top$. Here $(Z^\top Z)_{jj'}$ gives the number of rows $i$ in which data from both columns $j$ and $j'$ were observed, while $(ZZ^\top)_{ii'}$ gives the number of columns $j$ in which data from both rows $i$ and $i'$ were observed.
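As a concrete toy illustration (our own small example, not data from the paper), both co-observation matrices can be formed directly from $Z$:

```python
# Toy example (ours): co-observation counts from a small pattern Z
# with R = 3 rows and C = 2 columns.
R, C = 3, 2
Z = [[1, 1],
     [1, 0],
     [0, 1]]

# (Z^T Z)_{jj'}: number of rows i observed in both columns j and j'.
ZtZ = [[sum(Z[i][j] * Z[i][jp] for i in range(R)) for jp in range(C)]
       for j in range(C)]

# (Z Z^T)_{ii'}: number of columns j observed in both rows i and i'.
ZZt = [[sum(Z[i][j] * Z[ip][j] for j in range(C)) for ip in range(R)]
       for i in range(R)]
```

The diagonal of $Z^\top Z$ recovers the column counts $N_{\bullet j}$ and the diagonal of $ZZ^\top$ the row counts $N_{i\bullet}$.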
In our regression models, we treat $Z_{ij}$ as nonrandom. We are conditioning on the actual pattern of observations in our data. When we study the rate at which our backfitting algorithm converges, we consider $Z_{ij}$ drawn at random. That is, the analyst is solving a GLS problem conditionally on the pattern of observations and missingness, while we study the convergence rates that analyst will see for data drawn from a given missingness mechanism.
If we place all of the $Y_{ij}$ into a vector $\mathcal{Y} \in \mathbb{R}^N$ and compatibly place the $x_{ij}$ into a matrix $X \in \mathbb{R}^{N\times p}$, then the GLS estimate of $\beta$ is
$$\hat\beta_{\mathrm{GLS}} = (X^\top\mathcal{V}^{-1}X)^{-1}X^\top\mathcal{V}^{-1}\mathcal{Y},$$
where $\mathcal{V} = \mathrm{Cov}(\mathcal{Y}) \in \mathbb{R}^{N\times N}$ contains all of the $\mathrm{Cov}(Y_{ij}, Y_{i'j'})$ in an ordering compatible with $\mathcal{Y}$ and $X$. A naive algorithm costs $O(N^3)$ to solve for $\hat\beta_{\mathrm{GLS}}$. It can actually be solved through a Cholesky decomposition of an $(R+C)\times(R+C)$ matrix (Searle et al., 1992). That has cost $O((R+C)^3)$. Now $N \le RC$, with equality only for completely observed data. Therefore $R + C \ge 2\sqrt{RC} \ge 2N^{1/2}$, and so $(R+C)^3 \ge 8N^{3/2}$. When the data are sparsely enough observed it is possible that $(R+C)^3$ grows more rapidly than $N^{3/2}$, and in a numerical example in Section 5 that is what happens. In a hierarchical model, with $a_i$ but no $b_j$, we would find $\mathcal{V}$ to be block diagonal and then $\hat\beta_{\mathrm{GLS}}$ could be computed in $O(N)$ work.
We can write our crossed effects model as
$$\mathcal{Y} = X\beta + Z_Aa + Z_Bb + e \qquad (3)$$
for matrices $Z_A \in \{0,1\}^{N\times R}$ and $Z_B \in \{0,1\}^{N\times C}$. The $i$’th column of $Z_A$ has ones for all of the observations that come from row $i$ and zeroes elsewhere. The definition of $Z_B$ is analogous. The observation matrix can be written $Z = Z_A^\top Z_B$. The vector $e \in \mathbb{R}^N$ has all $N$ values of $e_{ij}$ in compatible order. Vectors $a \in \mathbb{R}^R$ and $b \in \mathbb{R}^C$ contain the row and column random effects $a_i$ and $b_j$. In this notation
$$\mathcal{V} = \mathrm{Cov}(\mathcal{Y}) = \sigma_A^2Z_AZ_A^\top + \sigma_B^2Z_BZ_B^\top + \sigma_E^2I_N. \qquad (4)$$
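To make the notation concrete, here is a small sketch (with made-up factor vectors and variance components, ours) that builds $Z_A$, $Z_B$ and the covariance matrix of the observations:

```python
# Sketch with made-up data (ours): build Z_A, Z_B from factor vectors
# and form V = sA2 * Z_A Z_A^T + sB2 * Z_B Z_B^T + sE2 * I_N.
fA = [0, 0, 1]                 # row level of each of N = 3 observations
fB = [0, 1, 1]                 # column level of each observation
N, R, C = 3, 2, 2
sA2, sB2, sE2 = 1.0, 2.0, 0.5  # sigma_A^2, sigma_B^2, sigma_E^2

ZA = [[1.0 if fA[k] == i else 0.0 for i in range(R)] for k in range(N)]
ZB = [[1.0 if fB[k] == j else 0.0 for j in range(C)] for k in range(N)]

def gram(M):  # M M^T for a list-of-rows matrix
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in M] for r1 in M]

GA, GB = gram(ZA), gram(ZB)
V = [[sA2 * GA[k][l] + sB2 * GB[k][l] + (sE2 if k == l else 0.0)
      for l in range(N)] for k in range(N)]
```

Two observations share a $\sigma_A^2$ term when they lie in the same row, a $\sigma_B^2$ term when in the same column, and the diagonal entries are $\sigma_A^2 + \sigma_B^2 + \sigma_E^2$.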
Our main computational problem is to get a value for $\mathcal{V}^{-1}X$ and $\mathcal{V}^{-1}\mathcal{Y}$. To do that we iterate towards a solution $u$ of $\mathcal{V}u = w$, where $w$ is one of the columns of $X$ or the vector $\mathcal{Y}$. After that, finding
$$\hat\beta_{\mathrm{GLS}} = (X^\top\mathcal{V}^{-1}X)^{-1}X^\top\mathcal{V}^{-1}\mathcal{Y}$$
is not expensive, because $X^\top\mathcal{V}^{-1}X$ is only $p\times p$ and we suppose that $p$ is not large.
If the data ordering in $\mathcal{Y}$ and elsewhere sorts by index $i$, breaking ties by index $j$, then $Z_AZ_A^\top$ is a block matrix with $R$ blocks of ones of size $N_{i\bullet}\times N_{i\bullet}$ along the diagonal and zeroes elsewhere. The matrix $Z_BZ_B^\top$ will not be block diagonal in that ordering. Instead $PZ_BZ_B^\top P^\top$ will be block diagonal with $C$ blocks of ones of size $N_{\bullet j}\times N_{\bullet j}$ on the diagonal, for a suitable permutation matrix $P$.
3 Backfitting algorithms
Our first goal is to develop computationally efficient ways to solve the GLS problem (5) for the linear mixed model (3). We use the backfitting algorithm that Hastie and Tibshirani (1990) and Buja et al. (1989) use to fit additive models. We write $\mathcal{V}$ in (4) as $\mathcal{V} = \sigma_E^2\Sigma$ with $\lambda_A = \sigma_E^2/\sigma_A^2$ and $\lambda_B = \sigma_E^2/\sigma_B^2$, and define $\Sigma = I_N + Z_AZ_A^\top/\lambda_A + Z_BZ_B^\top/\lambda_B$. Then the GLS estimate of $\beta$ is
$$\hat\beta_{\mathrm{GLS}} = (X^\top\Sigma^{-1}X)^{-1}X^\top\Sigma^{-1}\mathcal{Y}. \qquad (6)$$
It is well known (e.g., Robinson (1991)) that we can obtain $\hat\beta_{\mathrm{GLS}}$ by solving the following penalized least-squares problem:
$$\min_{\beta,a,b}\ \|\mathcal{Y} - X\beta - Z_Aa - Z_Bb\|^2 + \lambda_A\|a\|^2 + \lambda_B\|b\|^2.$$
Then $\hat\beta = \hat\beta_{\mathrm{GLS}}$, and $\hat a$ and $\hat b$ are the BLUP “estimates” for the random effects. This derivation works for any number of factors, but it is instructive to carry it through initially for one.
3.1 One factor
For a single factor, we simply drop the $Z_Bb$ term from (3) to get
$$\mathcal{Y} = X\beta + Z_Aa + e.$$
Then $\mathcal{V} = \sigma_E^2\Sigma$ with $\Sigma = I_N + Z_AZ_A^\top/\lambda_A$, and $\lambda_A = \sigma_E^2/\sigma_A^2$ as before. The penalized least squares problem is to solve
$$\min_{\beta,a}\ \|\mathcal{Y} - X\beta - Z_Aa\|^2 + \lambda_A\|a\|^2. \qquad (8)$$
We show the details as we need them for a later derivation.
The normal equations from (8) yield
$$X^\top(\mathcal{Y} - X\hat\beta - Z_A\hat a) = 0, \qquad (9)$$
$$Z_A^\top(\mathcal{Y} - X\hat\beta - Z_A\hat a) = \lambda_A\hat a. \qquad (10)$$
Solving (10) for $\hat a$ and multiplying the solution by $Z_A$ yields
$$Z_A\hat a = Z_A(Z_A^\top Z_A + \lambda_AI_R)^{-1}Z_A^\top(\mathcal{Y} - X\hat\beta) = S_A(\mathcal{Y} - X\hat\beta),$$
where $S_A = Z_A(Z_A^\top Z_A + \lambda_AI_R)^{-1}Z_A^\top$ is a ridge regression smoothing matrix. Substituting into (9) then gives
$$\hat\beta = \bigl(X^\top(I - S_A)X\bigr)^{-1}X^\top(I - S_A)\mathcal{Y}. \qquad (11)$$
Using the Sherman-Morrison-Woodbury (SMW) identity, one can show that $\Sigma^{-1} = I - S_A$ and hence $\hat\beta$ in (11) equals $\hat\beta_{\mathrm{GLS}}$ from (6). This is not in itself a new discovery; see for example Robinson (1991) or Hastie and Tibshirani (1990) (Section 5.3.3).
To compute the solution in (11), we need to compute $S_A\mathcal{Y}$ and $S_AX$. The heart of the computation in $S_A\mathcal{Y}$ is $(Z_A^\top Z_A + \lambda_AI_R)^{-1}Z_A^\top\mathcal{Y}$. But $Z_A^\top Z_A = \mathrm{diag}(N_{1\bullet},\dots,N_{R\bullet})$ and we see that all we are doing is computing an $R$-vector of shrunken means of the elements of $\mathcal{Y}$ at each level of the factor $A$; the $i$th element is $\sum_j Z_{ij}Y_{ij}/(N_{i\bullet} + \lambda_A)$. This involves a single pass through the $N$ elements of $\mathcal{Y}$, accumulating the sums in $R$ registers, followed by an elementwise scaling of the $R$ components. Post multiplication by $Z_A$ simply puts these shrunken means back into an $N$-vector in the appropriate positions. The total cost is $O(N)$. Likewise $S_AX$ does the same separately for each of the $p$ columns of $X$. Hence the entire computational cost for (11) is $O(Np^2)$, the same order as regression on $X$.
What is also clear is that the indicator matrix $Z_A$ is not actually needed here; instead all we need to carry out these computations is the factor vector that records the level of factor $A$ for each of the $N$ observations. In the R language (R Core Team, 2015) the following pair of operations does the computation:
hat_a = tapply(y, fA, sum) / (table(fA) + lambdaA)
hat_y = hat_a[fA]
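For readers who prefer Python, a direct analogue of the two R statements is the following sketch (function name and toy inputs are ours):

```python
# Python analogue (ours) of the R snippet: shrunken means per level of
# factor A, then broadcast back to an N-vector.
def shrunken_fit(y, fA, lamA, R):
    sums, counts = [0.0] * R, [0] * R
    for yk, i in zip(y, fA):            # one pass through the N values
        sums[i] += yk
        counts[i] += 1
    hat_a = [s / (n + lamA) for s, n in zip(sums, counts)]
    hat_y = [hat_a[i] for i in fA]      # place means back, per position
    return hat_a, hat_y

hat_a, hat_y = shrunken_fit([1.0, 3.0, 5.0], fA=[0, 0, 1], lamA=1.0, R=2)
# hat_a[0] = (1 + 3) / (2 + 1), hat_a[1] = 5 / (1 + 1)
```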
3.2 Two factors
A derivation similar to that used in the one-factor case gives
$$\hat\beta = \bigl(X^\top(I - S)X\bigr)^{-1}X^\top(I - S)\mathcal{Y}, \qquad (13)$$
where the hat matrix is written in terms of a smoother matrix
$$S = \mathcal{Z}(\mathcal{Z}^\top\mathcal{Z} + \Lambda)^{-1}\mathcal{Z}^\top, \quad\text{with}\quad \mathcal{Z} = [Z_A\ Z_B] \quad\text{and}\quad \Lambda = \begin{pmatrix}\lambda_AI_R & 0\\ 0 & \lambda_BI_C\end{pmatrix}.$$
We can again use SMW to show that $\Sigma^{-1} = I - S$, and hence (13) equals the solution in (6). But in applying $S$ we do not enjoy the computational simplifications that occurred in the one-factor case, because $\mathcal{Z}^\top\mathcal{Z}$ is no longer diagonal: its off-diagonal block $Z_A^\top Z_B$ records the co-observation pattern.
Rather than group $Z_A$ and $Z_B$, we keep them separate, and develop an algorithm to apply the operator $I - S$ efficiently. Consider a generic response vector $y$ and the optimization problem
$$\min_{a,b}\ \|y - Z_Aa - Z_Bb\|^2 + \lambda_A\|a\|^2 + \lambda_B\|b\|^2. \qquad (15)$$
Its normal equations can be written
$$Z_A\hat a = S_A(y - Z_B\hat b) \quad\text{and}\quad Z_B\hat b = S_B(y - Z_A\hat a), \qquad (16)$$
where $S_A$ is again the ridge regression smoothing matrix for row effects and similarly $S_B$ the smoothing matrix for column effects. We solve these equations iteratively by block coordinate descent, also known as backfitting. The iterations converge to the solution of (15) (Buja et al., 1989; Hastie and Tibshirani, 1990).
Here the simplifications we enjoyed in the one-factor case once again apply. Each step applies its operator to a vector (the terms in parentheses on the right hand side in (16)). For both $S_A$ and $S_B$ these are simply the shrunken-mean operations described for the one-factor case, separately for factor $A$ and factor $B$ each time. As before, we do not need to actually construct $Z_A$ or $Z_B$, but simply use factor vectors that record the levels of factors $A$ and $B$ for each of the $N$ observations.
This was for a generic response $y$; we apply the same algorithm (in parallel) to each column of $X$ and to $\mathcal{Y}$ to obtain $\hat\beta$ in (13). Now solving (13) is $O(Np^2)$. These computations deliver $(I - S)\mathcal{Y}$ and $(I - S)X$; if the BLUP estimates $\hat a$ and $\hat b$ are also required, the same algorithm can be applied to the response $\mathcal{Y} - X\hat\beta$, retaining the $\hat a$ and $\hat b$ at the final iteration. We can also write
$$\mathrm{Var}(\hat\beta_{\mathrm{GLS}}) = \sigma_E^2\bigl(X^\top(I - S)X\bigr)^{-1}. \qquad (17)$$
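A minimal sketch of the backfitting iteration for a generic response (our own toy data; no intercept or centering) simply alternates the two shrunken-mean updates until the effects stabilize:

```python
# Backfitting sketch (ours): alternate shrunken-mean updates for the
# row effects a and column effects b of a generic response y.
def backfit(y, fA, fB, R, C, lamA, lamB, iters=200):
    a, b = [0.0] * R, [0.0] * C
    NA = [fA.count(i) for i in range(R)]
    NB = [fB.count(j) for j in range(C)]
    for _ in range(iters):
        sums = [0.0] * R                 # a_i <- shrunken mean of y - b
        for k, yk in enumerate(y):
            sums[fA[k]] += yk - b[fB[k]]
        a = [sums[i] / (NA[i] + lamA) for i in range(R)]
        sums = [0.0] * C                 # b_j <- shrunken mean of y - a
        for k, yk in enumerate(y):
            sums[fB[k]] += yk - a[fA[k]]
        b = [sums[j] / (NB[j] + lamB) for j in range(C)]
    return a, b

y, fA, fB = [1.0, 2.0, 4.0, 3.0], [0, 0, 1, 1], [0, 1, 0, 1]
a, b = backfit(y, fA, fB, R=2, C=2, lamA=1.0, lamB=1.0)
```

At the fixed point the penalized normal equations hold: the within-level residual sums equal $\lambda_A a_i$ and $\lambda_B b_j$.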
It is also clear that we can trivially extend this approach to accommodate any number of factors.
3.3 Centered operators
The matrices $Z_A$ and $Z_B$ both have row sums all equal to one, since they are factor indicator matrices (“one-hot encoders”). This creates a nontrivial intersection between their column spaces, and that of $X$, since we always include an intercept, and that intersection can cause backfitting to converge more slowly. In this section we show how to counter this intersection of column spaces to speed convergence. We work with this two-factor model:
$$\mathcal{Y} = X\beta + Z_Aa + Z_Bb + e. \qquad (18)$$
If $X$ in model (18) includes a column of ones (intercept), and $\lambda_A > 0$ and $\lambda_B > 0$, then the solutions for $\hat a$ and $\hat b$ satisfy $\mathbf{1}^\top\hat a = 0$ and $\mathbf{1}^\top\hat b = 0$.
It suffices to show this for one factor, and with $\lambda_A > 0$. The objective is now
$$\min_{\beta,a}\ \|\mathcal{Y} - X\beta - Z_Aa\|^2 + \lambda_A\|a\|^2. \qquad (19)$$
Notice that for any candidate solution $(\beta, a)$, the alternative solution $(\tilde\beta, \tilde a)$, with $\tilde a = a - c\mathbf{1}$ and $\tilde\beta$ equal to $\beta$ with $c$ added to the intercept coefficient, leaves the loss part of (19) unchanged, since the row sums of $Z_A$ are all one and hence $Z_A\mathbf{1} = \mathbf{1}_N$. Hence if $\mathbf{1}^\top a \ne 0$, we would always improve by picking $c$ to minimize the penalty term $\|a - c\mathbf{1}\|^2$, or $c = \mathbf{1}^\top a/R$, after which $\mathbf{1}^\top\tilde a = 0$. ∎
It is natural then to solve for $\hat a$ and $\hat b$ with these constraints enforced, rather than waiting for the iterative algorithm to discover them.
Consider the generic optimization problem
$$\min_{a}\ \|y - Z_Aa\|^2 + \lambda_A\|a\|^2 \quad\text{subject to}\quad \mathbf{1}^\top a = 0.$$
Define the partial sum vector $t \in \mathbb{R}^R$ with components $t_i = \sum_j Z_{ij}y_{ij}$, and let
$$w_i = \frac{1}{N_{i\bullet} + \lambda_A} \quad\text{and}\quad \bar t_w = \frac{\sum_i w_it_i}{\sum_i w_i}.$$
Then the solution is given by
$$\hat a_i = w_i(t_i - \bar t_w), \qquad i = 1,\dots,R.$$
Moreover, the fit is given by
$$Z_A\hat a = \tilde S_A\,y,$$
where $\tilde S_A = Z_AWZ_A^\top - (\mathbf{1}^\top W\mathbf{1})^{-1}Z_AW\mathbf{1}\mathbf{1}^\top WZ_A^\top$, with $W = \mathrm{diag}(w_1,\dots,w_R)$, is a symmetric operator.
The computations are a simple modification of the non-centered case.
Let $Q \in \mathbb{R}^{R\times R}$ be an orthogonal matrix with first column $\mathbf{1}/\sqrt{R}$. Then with $a = Q\gamma$ we have $\mathbf{1}^\top a = \sqrt{R}\,\gamma_1$. So reparametrizing in this way leads to the equivalent problem
$$\min_{\gamma}\ \|y - Z_AQ\gamma\|^2 + \lambda_A\|\gamma\|^2 \quad\text{subject to}\quad \gamma_1 = 0. \qquad (21)$$
To solve (21), we simply drop the first column of $Z_AQ$. Let $\tilde Z = Z_AQ_{-1}$, where $Q_{-1}$ is the matrix $Q$ omitting the first column, and $\tilde\gamma$ the corresponding subvector of $\gamma$ having $R - 1$ components. We now solve
$$\min_{\tilde\gamma}\ \|y - \tilde Z\tilde\gamma\|^2 + \lambda_A\|\tilde\gamma\|^2$$
with no constraints, and solution $\hat{\tilde\gamma} = (\tilde Z^\top\tilde Z + \lambda_AI_{R-1})^{-1}\tilde Z^\top y$. The fit is given by $\tilde Z\hat{\tilde\gamma} = \tilde Z(\tilde Z^\top\tilde Z + \lambda_AI_{R-1})^{-1}\tilde Z^\top y$, and is clearly a symmetric operator.
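As a numerical sanity check (our own toy data), the constrained one-factor solution can be verified directly against the Lagrangian conditions: with $w_i = 1/(N_{i\bullet}+\lambda)$ and per-level sums $t_i$, the candidate $\hat a_i = w_i(t_i - \bar t_w)$, where $\bar t_w$ is the $w$-weighted mean of $t$, sums to zero and makes $t_i - (N_{i\bullet}+\lambda)\hat a_i$ constant in $i$, as stationarity requires.

```python
# Toy check (ours) of the centered ridge solution for one factor:
# min ||y - Z_A a||^2 + lam * ||a||^2  subject to  sum(a) = 0.
y  = [1.0, 2.0, 4.0, 0.5]
fA = [0, 0, 1, 2]
R, lam = 3, 1.0

Ni = [fA.count(i) for i in range(R)]                  # level counts
t  = [sum(yk for yk, i in zip(y, fA) if i == lev) for lev in range(R)]
w  = [1.0 / (n + lam) for n in Ni]
tbar_w = sum(wi * ti for wi, ti in zip(w, t)) / sum(w)
a = [wi * (ti - tbar_w) for wi, ti in zip(w, t)]      # centered solution

# mu/2 = t_i - (N_i + lam) a_i should be the same for every level i.
mu2 = [t[i] - (Ni[i] + lam) * a[i] for i in range(R)]
```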
3.4 Covariance matrix for $\hat\beta_{\mathrm{GLS}}$ with centered operators
In Section 3.2 we saw in (17) that we get a simple expression for $\mathrm{Var}(\hat\beta_{\mathrm{GLS}})$. This simplicity relies on the fact that $\Sigma^{-1} = I - S$, and the usual cancelation occurs when we use the sandwich formula to compute this covariance. When we backfit with our centered smoothers we get a modified residual operator $I - \tilde S$ such that the analog of (13) still gives us the required coefficient estimate:
$$\hat\beta = \bigl(X^\top(I - \tilde S)X\bigr)^{-1}X^\top(I - \tilde S)\mathcal{Y}.$$
However, $I - \tilde S \ne \Sigma^{-1}$, and so now we need to resort to the sandwich formula with $\hat\beta$ from (13). Expanding this we find that $\mathrm{Var}(\hat\beta)$ equals
$$\sigma_E^2\bigl(X^\top(I - \tilde S)X\bigr)^{-1}X^\top(I - \tilde S)\,\Sigma\,(I - \tilde S)^\top X\bigl(X^\top(I - \tilde S)^\top X\bigr)^{-1}. \qquad (26)$$
While this expression might appear daunting, the computations are simple. Let $\tilde X = (I - \tilde S)X$, the residual matrix after backfitting each column of $X$ using these centered operators. Then because $\tilde S$ is symmetric, we have
$$\mathrm{Var}(\hat\beta) = \sigma_E^2(X^\top\tilde X)^{-1}\,\tilde X^\top\Sigma\tilde X\,(\tilde X^\top X)^{-1}.$$
Since $\Sigma = I_N + Z_AZ_A^\top/\lambda_A + Z_BZ_B^\top/\lambda_B$ (two low-rank matrices plus the identity), we can compute $\Sigma\tilde X$ very efficiently, and hence also the covariance matrix in (26).
4 Convergence of the matrix norm
In this section we prove a bound on the norm of the matrix that implements backfitting for our random effects $a_i$ and $b_j$. To focus on essentials we do not consider the updates for $\beta$ here. It was necessary to take account of intercept adjustments, because the intercept is in the space spanned by these random effects.
Let the matrix $M$ update the vector $b \in \mathbb{R}^C$ of column effects at one iteration of backfitting. The updates take the form
$$b \leftarrow Mb + \eta$$
for some $\eta \in \mathbb{R}^C$ that does not depend on $b$. They will converge if $\|M\| \le \delta < 1$ holds for some matrix norm $\|\cdot\|$, where
$$\|M\| = \sup_{x\ne 0}\frac{\|Mx\|}{\|x\|}.$$
We already know from Buja et al. (1989) that backfitting will converge. However, we want more. We want to avoid having the number of iterations required grow with $N$. We can write the solution as
$$b^{(\infty)} = \sum_{k=0}^{\infty} M^k\eta,$$
and in computations we truncate this sum after $K$ steps. If $\|M\|$ approaches $1$ as $N\to\infty$, then the number of iterations required to make the truncation error negligible can grow with $N$. We want $\|M\| \le \delta < 1$ for some fixed $\delta$ as $N\to\infty$. For our induced norms, $\|M\|$ is no smaller than $\rho(M)$, the spectral radius of $M$. For symmetric $M$ the spectral radius equals $\|M\|_2$. Our update matrices are not necessarily symmetric. It can be shown, via Gelfand’s formula $\rho(M) = \lim_{k\to\infty}\|M^k\|^{1/k}$, that if $\rho(M) < 1$, then the iteration will converge.
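The effect of the norm on the iteration count can be seen in a scalar caricature (our own toy numbers): the error of the truncated series decays like $\|M\|^K$, so the iterations needed for a fixed tolerance behave like $\log(\epsilon)/\log(\|M\|)$.

```python
# Scalar caricature (ours) of the truncated iteration u <- M u + eta:
# after K steps the error is ||M||^K times the limiting solution.
def iterate(m, eta, K):
    u = 0.0
    for _ in range(K):
        u = m * u + eta
    return u

m, eta = 0.5, 1.0
exact = eta / (1.0 - m)          # limit of the Neumann series
err20 = abs(iterate(m, eta, 20) - exact)
err40 = abs(iterate(m, eta, 40) - exact)
```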
We seek conditions under which $\|M\| < 1$ holds with probability tending to one. The empirically most favorable norm is the spectral norm $\|M\|_2$, but the most tractable one for our theoretical approach is
$$\|M\|_\infty = \max_{j}\sum_{j'}|M_{jj'}|.$$
The update for $\hat\beta$ is linear in $(\hat a, \hat b)$, so if $(\hat a, \hat b)$ converges, then we could take a single step for $\hat\beta$. In practice we alternate those updates.
Recall that $Z \in \{0,1\}^{R\times C}$ describes the pattern of observations. In a model with no intercept we could use the following update:
$$a_i \leftarrow \frac{\sum_j Z_{ij}(Y_{ij} - b_j)}{N_{i\bullet} + \lambda_A}, \qquad b_j \leftarrow \frac{\sum_i Z_{ij}(Y_{ij} - a_i)}{N_{\bullet j} + \lambda_B}.$$
For this update, we have $b \leftarrow M_1b + \eta_1$ for $M_1 \in \mathbb{R}^{C\times C}$ where
$$M_{1,jj'} = \frac{1}{N_{\bullet j} + \lambda_B}\sum_i \frac{Z_{ij}Z_{ij'}}{N_{i\bullet} + \lambda_A}.$$
The update alternates shrinkage estimates for $a_i$ and $b_j$ but does no centering.
In the presence of an intercept, we know that $\sum_i \hat a_i = 0$ should hold at the solution and we can impose this by centering the $a_i$, taking
$$a_i \leftarrow \frac{\sum_j Z_{ij}(Y_{ij} - b_j)}{N_{i\bullet} + \lambda_A} - \frac{1}{R}\sum_r \frac{\sum_j Z_{rj}(Y_{rj} - b_j)}{N_{r\bullet} + \lambda_A}.$$
The intercept estimate will then be the subtracted average, which we can subtract from $\mathcal{Y}$ upon convergence. Our iteration has the update matrix $M_2$ with
$$M_{2,jj'} = \frac{1}{N_{\bullet j} + \lambda_B}\sum_i Z_{ij}\Bigl(\frac{Z_{ij'}}{N_{i\bullet} + \lambda_A} - \frac{1}{R}\sum_r \frac{Z_{rj'}}{N_{r\bullet} + \lambda_A}\Bigr),$$
after replacing a sum over $i$ by an equivalent one over $r$.
In practice, we prefer to use the weighted centering from Section 3.3 to center the $a_i$. For the generic response $y$ there, we take a vector of weights with values $w_i = 1/(N_{i\bullet} + \lambda_A)$, and so, writing $t_i = \sum_j Z_{ij}(Y_{ij} - b_j)$, the updated $a_i$ is
$$a_i = w_i\Bigl(t_i - \frac{\sum_r w_rt_r}{\sum_r w_r}\Bigr).$$
Using shrunken averages of $Y_{ij} - a_i$, the new $b_j$ are
$$b_j = \frac{\sum_i Z_{ij}(Y_{ij} - a_i)}{N_{\bullet j} + \lambda_B}.$$
Now $b \leftarrow M_3b + \eta_3$ for $M_3 \in \mathbb{R}^{C\times C}$, where
$$M_{3,jj'} = \frac{1}{N_{\bullet j} + \lambda_B}\sum_i Z_{ij}\Bigl(w_iZ_{ij'} - \frac{w_i\sum_r w_rZ_{rj'}}{\sum_r w_r}\Bigr).$$
As above, we will leave the intercept in, as the average of the $b_j$, because a second centering produces unnecessarily unwieldy expressions.
4.2 Model for $Z_{ij}$
We will state conditions on $Z$ under which the update norm $\|M_2\|_\infty$ is bounded below one with probability tending to one, as the problem size grows. Similarly derived conditions for $\|M_3\|_\infty$ are somewhat less restrictive. We need the following exponential inequalities.
If $X \sim \mathrm{Bin}(n, p)$, then for any $t \ge 0$,
$$\Pr\bigl(|X - np| \ge t\bigr) \le 2\exp(-2t^2/n).$$
This follows from Hoeffding’s theorem. ∎
Let $X_k \sim \mathrm{Bin}(n_k, p_k)$ for $k = 1,\dots,K$, not necessarily independent. Then for any $t \ge 0$,
$$\Pr\Bigl(\max_{1\le k\le K}|X_k - n_kp_k| \ge t\Bigr) \le 2\sum_{k=1}^{K}\exp(-2t^2/n_k).$$
This is from the union bound applied to Lemma 2. ∎
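As a quick numeric sanity check (our own numbers), the bound in Lemma 2 can be compared with an exact binomial tail:

```python
# Sanity check (ours): for X ~ Bin(n, p), the exact two-sided tail
# P(|X - np| >= t) lies below the Hoeffding bound 2 exp(-2 t^2 / n).
from math import comb, exp

def two_sided_tail(n, p, t):
    mu = n * p
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if abs(k - mu) >= t)

n, p, t = 30, 0.3, 6.0
tail = two_sided_tail(n, p, t)
bound = 2.0 * exp(-2.0 * t * t / n)
```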
Here is our sampling model. We index the size of our problem by $S$. The sample size $N$ will satisfy $E(N) = S$. The number of rows and columns in the data set are
$$R = S^{\rho} \quad\text{and}\quad C = S^{\kappa},$$
respectively, for positive numbers $\rho$ and $\kappa$. Because our application domain has $N \ll RC$, we assume that $\rho + \kappa > 1$. We ignore that $R$ and $C$ above are not necessarily integers.
In our model, $Z_{ij} \sim \mathrm{Bern}(p_{ij})$ independently. That is, $\Pr(Z_{ij} = 1) = p_{ij}$ and the $Z_{ij}$ are mutually independent. Letting $p_{ij}$ depend on $i$ and $j$ allows the probability model to capture stylistic preferences affecting the missingness pattern in the ratings data.
4.3 Bounds for row and column size
Letting $X \preceq Y$ mean that $X$ is stochastically smaller than $Y$, we know that