Backfitting for large scale crossed random effects regressions

07/21/2020 ∙ by Swarnadip Ghosh, et al.

Regression models with crossed random effect errors can be very expensive to compute. The cost of both generalized least squares and Gibbs sampling can easily grow as N^3/2 (or worse) for N observations. Papaspiliopoulos et al. (2020) present a collapsed Gibbs sampler that costs O(N), but under an extremely stringent sampling model. We propose a backfitting algorithm to compute a generalized least squares estimate and prove that it costs O(N) under greatly relaxed though still strict sampling assumptions. Empirically, the backfitting algorithm costs O(N) under further relaxed assumptions. We illustrate the new algorithm on a ratings data set from Stitch Fix.


1 Introduction

To estimate a regression when the errors have a non-identity covariance matrix, we usually turn first to generalized least squares (GLS). Somewhat surprisingly, GLS proves to be computationally challenging in the very simple setting of the unbalanced crossed random effects models that we study here. For that problem, the cost to compute the GLS estimate on $N$ data points grows at best like $N^{3/2}$ under the usual algorithms. If we additionally assume Gaussian errors, then Gao and Owen (2019) show that even evaluating the likelihood one time costs at least a multiple of $N^{3/2}$. These costs make the usual algorithms for GLS infeasible for large data sets such as those arising in electronic commerce.

In this paper, we present an iterative algorithm based on a backfitting approach from Buja et al. (1989). This algorithm is known to converge to the GLS solution. The cost of each iteration is $O(N)$, and so we also study how the number of iterations grows with $N$.

The crossed random effects model we consider has

$$Y_{ij} = x_{ij}^\mathsf{T}\beta + a_i + b_j + e_{ij} \tag{1}$$

for random effects $a_i \sim (0,\sigma^2_A)$ and $b_j \sim (0,\sigma^2_B)$ and an error $e_{ij} \sim (0,\sigma^2_E)$, with a fixed effects regression parameter $\beta \in \mathbb{R}^p$ for the covariates $x_{ij} \in \mathbb{R}^p$. We assume that the $a_i$, $b_j$, and $e_{ij}$ are all independent. It is thus a mixed effects model in which the random portion has a crossed structure. The GLS estimate is also the MLE when $a_i$, $b_j$ and $e_{ij}$ are Gaussian. Because we assume that $p$ is fixed as $N$ grows, we often leave $p$ out of our cost estimates, giving instead the complexity in $N$.
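
To make the setup concrete, here is a minimal R sketch that simulates data from model (1). The Gaussian distributions, sizes, variances and 20% observation rate are illustrative assumptions on our part, not values from the paper.

  # Minimal sketch: simulate from model (1) with Gaussian random effects.
  # All sizes and variance values here are made up for illustration.
  set.seed(1)
  R <- 50; C <- 40                         # numbers of row and column levels
  sigA <- 1; sigB <- 0.7; sigE <- 2        # sd of a_i, b_j and e_ij
  beta <- c(2, -1)                         # intercept and one slope
  ij   <- expand.grid(i = seq_len(R), j = seq_len(C))
  ij   <- ij[runif(nrow(ij)) < 0.2, ]      # observe a haphazard ~20% of pairs
  x    <- rnorm(nrow(ij))
  a    <- rnorm(R, 0, sigA)
  b    <- rnorm(C, 0, sigB)
  y    <- beta[1] + beta[2] * x + a[ij$i] + b[ij$j] + rnorm(nrow(ij), 0, sigE)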

The GLS estimate for crossed random effects can be computed efficiently if all $R\times C$ values $Y_{ij}$ are available. Our motivating examples involve ratings data where people rate items, and there it is usual that the data are very unbalanced, with a haphazard observational pattern in which only $N \ll RC$ of the $(i,j)$ pairs are observed. The crossed random effects setting is significantly more difficult than a hierarchical model with just $a_i$ but no $b_j$ term. There the observations for index $j$ are 'nested within' those for each level of index $i$. The result is that the covariance matrix of all observed values has a block diagonal structure allowing GLS to be computed in $O(N)$ time.

Hierarchical models are very well suited to Bayesian computation (Gelman and Hill, 2006). Crossed random effects are a much greater challenge. Gao and Owen (2017) find that the Gibbs sampler can take $O(N^{1/2})$ iterations to converge to stationarity, with each iteration costing $O(N)$, leading once again to an $O(N^{3/2})$ cost. For more examples where the costs of Bayesian and frequentist approaches have the same rate, see Fox (2013). Other Markov chain Monte Carlo (MCMC) algorithms studied in Gao and Owen (2017) had similar problems. As further evidence of the difficulty of this problem, the Gibbs sampler was one of nine MCMC algorithms that Gao and Owen (2017) found to be unsatisfactory, and Bates et al. (2015) removed the MCMC option from the R package lme4 because it was considered unreliable.

Papaspiliopoulos et al. (2020) present an exception to the high cost of a Bayesian approach for crossed random effects. They propose a collapsed Gibbs sampler that can potentially mix in $O(1)$ iterations. To prove this rate, they make an extremely stringent assumption: every row index $i$ appears in the same number of observed data points and, similarly, every column index $j$ appears in the same number of data points. Such a condition is tantamount to requiring a designed experiment for the data, and it is much stronger than what their algorithm seems to need in practice. Under that condition their mixing rate asymptotes to a quantity, described in our discussion section, that is favorable in many circumstances. They find empirically that their sampler has a cost that scales well in many data sets where their balance condition does not hold.

In this paper we study an iterative linear operation, known as backfitting, for GLS. Each iteration costs $O(N)$. The speed of convergence depends on a certain matrix norm of that iteration, which we exhibit below. If the norm remains bounded strictly below one as $N\to\infty$, then the number of iterations to convergence is $O(1)$. We are able to show that the matrix norm is bounded below one with probability tending to one, under conditions where the number of observations per row (or per column) is random and even the expected row or column counts may vary, though in a narrow range. While this is a substantial weakening of the conditions in Papaspiliopoulos et al. (2020), it still fails to cover many interesting cases. Like them, we find empirically that our algorithm scales much more broadly than under the conditions for which $O(N)$ scaling is proved.

We suspect that the computational infeasibility of GLS leads many users to use ordinary least squares (OLS) instead. OLS has two severe problems. First, it is inefficient: the variance of $\hat\beta_{\mathrm{OLS}}$ is larger than that of $\hat\beta_{\mathrm{GLS}}$. This is equivalent to OLS ignoring some possibly large fraction of the information in the data. Perhaps more seriously, OLS is naive. It produces an estimate of $\mathrm{var}(\hat\beta_{\mathrm{OLS}})$ that can be too small by a large factor. That amounts to overestimating the quantity of information behind $\hat\beta_{\mathrm{OLS}}$, also by a potentially large factor.

The naivete of OLS can be countered by using better variance estimates. One can bootstrap it by resampling the row and column entities as in Owen (2007). There is also a version of Huber-White variance estimation for this case in econometrics; see for instance Cameron et al. (2011). While these methods counter the naivete of OLS, the inefficiency of OLS remains.

The algorithm in Gao and Owen (2019) gets consistent asymptotically normal estimates of $\sigma^2_A$, $\sigma^2_B$, and $\sigma^2_E$. The proof of asymptotic normality for the variance component estimates is in the dissertation of Gao (2017). The estimate of $\beta$ in Gao and Owen (2019) was from a GLS model that accounted only for $\sigma^2_E$ and just one of $\sigma^2_A$ or $\sigma^2_B$, and so by the Gauss-Markov theorem it was not efficient. The estimate of $\mathrm{var}(\hat\beta)$ in that paper did take account of all three variance components, so it was not naive. In this paper we get a GLS estimate of $\beta$ that takes account of all three variance components, as well as an estimate of its variance that accounts for all three, so our estimate is efficient and not naive.

The rest of this paper is organized as follows. Section 2 introduces our notation and assumptions for missing data. Section 3 presents the backfitting algorithm from Buja et al. (1989). That algorithm was defined for smoothers, but we are able to cast the estimation of random effect parameters as a special kind of smoother. Section 4 proves our result about backfitting being convergent with a probability tending to one as the problem size increases. Section 5 shows numerical measures of the matrix norm of the backfitting operator. It remains bounded below one, and away from it, under more conditions than our theory shows. We find that the lmer function in the lme4 package (Bates et al., 2015) has a cost that grows faster than linearly in $N$, both in one setting and in another, sparser one. The backfitting algorithm has $O(N)$ cost in both of these cases. Section 6 illustrates our GLS algorithm on some data provided to us by Stitch Fix. These are customer ratings of items of clothing on a ten point scale. Section 7 has a discussion of these results.

2 Missingness

We adopt the notation from Gao and Owen (2019). We let $Z_{ij}$ take the value $1$ if $Y_{ij}$ is observed and $0$ otherwise, for $i = 1,\dots,R$ and $j = 1,\dots,C$. In many of the contexts we consider, the missingness is not at random and is potentially informative. Handling such problems is outside the scope of this paper, apart from a brief discussion in Section 7. It is already a sufficient challenge to work without informative missingness.

The matrix $Z \in \{0,1\}^{R\times C}$, with elements $Z_{ij}$, has $N_{i\bullet} = \sum_j Z_{ij}$ observations in 'row $i$' and $N_{\bullet j} = \sum_i Z_{ij}$ observations in 'column $j$'. We often drop the limits of summation, so that $i$ is always summed over $1,\dots,R$ and $j$ over $1,\dots,C$. When we need additional symbols for row and column indices we use $r$ for rows and $s$ for columns. The total sample size is $N = \sum_i\sum_j Z_{ij}$.

There are two co-observation matrices, $Z^\mathsf{T}Z$ and $ZZ^\mathsf{T}$. Here $(Z^\mathsf{T}Z)_{js}$ gives the number of rows $i$ in which data from both columns $j$ and $s$ were observed, while $(ZZ^\mathsf{T})_{ir}$ gives the number of columns $j$ in which data from both rows $i$ and $r$ were observed.
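
A small self-contained R sketch of these quantities, using a hypothetical $4\times 3$ observation pattern:

  # Sketch: row counts, column counts and the two co-observation matrices
  # for a small hypothetical observation pattern Z.
  Z <- rbind(c(1, 1, 0),
             c(1, 0, 1),
             c(0, 1, 1),
             c(1, 1, 1))
  Ni  <- rowSums(Z)    # N_i., observations in each row i
  Nj  <- colSums(Z)    # N_.j, observations in each column j
  N   <- sum(Z)        # total sample size
  ZtZ <- t(Z) %*% Z    # (j, s) entry: rows observing both columns j and s
  ZZt <- Z %*% t(Z)    # (i, r) entry: columns observing both rows i and r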

In our regression models, we treat $Z_{ij}$ as nonrandom. We are conditioning on the actual pattern of observations in our data. When we study the rate at which our backfitting algorithm converges, we consider $Z_{ij}$ drawn at random. That is, the analyst solves a GLS problem conditionally on the pattern of observations and missingness, while we study the convergence rates that the analyst will see for data drawn from a given missingness mechanism.

If we place all of the $Y_{ij}$ into a vector $\mathcal{Y} \in \mathbb{R}^N$ and compatibly place the $x_{ij}$ into a matrix $X \in \mathbb{R}^{N\times p}$, then

$$\hat\beta_{\mathrm{GLS}} = \bigl(X^\mathsf{T}\mathcal{V}^{-1}X\bigr)^{-1}X^\mathsf{T}\mathcal{V}^{-1}\mathcal{Y}, \tag{2}$$

where $\mathcal{V}$ contains all of the $\mathrm{cov}(Y_{ij}, Y_{rs})$ values in an ordering compatible with $\mathcal{Y}$ and $X$. A naive algorithm costs $O(N^3)$ to solve for $\hat\beta_{\mathrm{GLS}}$. It can actually be solved through a Cholesky decomposition of an $(R+C)\times(R+C)$ matrix (Searle et al., 1992). That has cost $O((R+C)^3)$. Now $RC \ge N$, with equality only for completely observed data. Therefore $\max(R,C) \ge N^{1/2}$, and so the cost is at least proportional to $N^{3/2}$. When the data are sparsely enough observed it is possible that $R+C$ grows more rapidly than $N^{1/2}$; a numerical example in Section 5 exhibits this behavior. In a hierarchical model, with $a_i$ but no $b_j$, we would find $\mathcal{V}$ to be block diagonal and then $\hat\beta_{\mathrm{GLS}}$ could be computed in $O(N)$ work.
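
For a small problem, the naive dense route through (2) can be spelled out directly. The function below is our own illustration (the name naive_gls is hypothetical) of exactly the $O(N^3)$ computation that becomes infeasible at scale.

  # Sketch of the naive dense GLS computation in (2); feasible only for small N.
  naive_gls <- function(X, Y, V) {
    Vi <- solve(V)                                   # O(N^3) dense inverse
    solve(t(X) %*% Vi %*% X, t(X) %*% Vi %*% Y)      # (X' V^{-1} X)^{-1} X' V^{-1} Y
  }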

We can write our crossed effects model as

$$\mathcal{Y} = X\beta + Z_Aa + Z_Bb + e \tag{3}$$

for indicator matrices $Z_A \in \{0,1\}^{N\times R}$ and $Z_B \in \{0,1\}^{N\times C}$. The $i$'th column of $Z_A$ has ones for all of the observations that come from row $i$ and zeroes elsewhere. The definition of $Z_B$ is analogous. The observation matrix can be written $Z = Z_A^\mathsf{T}Z_B$. The vector $\mathcal{Y}$ has all $N$ values of $Y_{ij}$ in a compatible order. Vectors $a \in \mathbb{R}^R$ and $b \in \mathbb{R}^C$ contain the row and column random effects $a_i$ and $b_j$. In this notation

$$\mathcal{V} = \mathrm{cov}(\mathcal{Y}) = \sigma^2_AZ_AZ_A^\mathsf{T} + \sigma^2_BZ_BZ_B^\mathsf{T} + \sigma^2_EI_N. \tag{4}$$
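
A short R sketch of these matrices for a small, hypothetical layout; the factor-based construction and the variance values are illustrative only.

  # Sketch: build Z_A, Z_B, the observation matrix Z = Z_A' Z_B, and V in (4)
  # from row/column factors of a small hypothetical data layout.
  set.seed(2)
  R <- 6; C <- 5
  obs <- expand.grid(i = seq_len(R), j = seq_len(C))
  obs <- obs[runif(nrow(obs)) < 0.5, ]              # a haphazard subset of pairs
  fA  <- factor(obs$i, levels = seq_len(R))
  fB  <- factor(obs$j, levels = seq_len(C))
  ZA  <- model.matrix(~ fA - 1)                     # N x R indicator matrix
  ZB  <- model.matrix(~ fB - 1)                     # N x C indicator matrix
  Zobs <- t(ZA) %*% ZB                              # the R x C observation matrix
  sigA2 <- 1; sigB2 <- 0.5; sigE2 <- 2              # illustrative variance components
  V <- sigA2 * ZA %*% t(ZA) + sigB2 * ZB %*% t(ZB) + sigE2 * diag(nrow(obs))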

Our main computational problem is to get a value for $\mathcal{V}^{-1}X$. To do that we iterate towards a solution of $\mathcal{V}u = x$, where $x$ is one of the $p$ columns of $X$. After that, finding

$$\hat\beta_{\mathrm{GLS}} = \bigl(X^\mathsf{T}\mathcal{V}^{-1}X\bigr)^{-1}X^\mathsf{T}\mathcal{V}^{-1}\mathcal{Y} \tag{5}$$

is not expensive, because $X^\mathsf{T}\mathcal{V}^{-1}X \in \mathbb{R}^{p\times p}$ and we suppose that $p$ is not large.

If the data ordering in $\mathcal{Y}$ and elsewhere sorts by the index $i$, breaking ties by the index $j$, then $Z_AZ_A^\mathsf{T}$ is a block matrix with $R$ blocks of ones of size $N_{i\bullet}\times N_{i\bullet}$ along the diagonal and zeroes elsewhere. The matrix $Z_BZ_B^\mathsf{T}$ will not be block diagonal in that ordering. Instead $PZ_BZ_B^\mathsf{T}P^\mathsf{T}$ will be block diagonal with $C$ blocks of ones on the diagonal, for a suitable permutation matrix $P$.

3 Backfitting algorithms

Our first goal is to develop computationally efficient ways to solve the GLS problem (5) for the linear mixed model (3). We use the backfitting algorithm that Hastie and Tibshirani (1990) and Buja et al. (1989) use to fit additive models. We write $\mathcal{V}$ in (4) in terms of the ratios $\lambda_A = \sigma^2_E/\sigma^2_A$ and $\lambda_B = \sigma^2_E/\sigma^2_B$. Then the GLS estimate of $\beta$ is

$$\hat\beta_{\mathrm{GLS}} = \bigl(X^\mathsf{T}\mathcal{V}^{-1}X\bigr)^{-1}X^\mathsf{T}\mathcal{V}^{-1}\mathcal{Y} \tag{6}$$

and $\mathrm{cov}(\hat\beta_{\mathrm{GLS}}) = \bigl(X^\mathsf{T}\mathcal{V}^{-1}X\bigr)^{-1}$.

It is well known (e.g., Robinson (1991)) that we can obtain $\hat\beta_{\mathrm{GLS}}$ by solving the following penalized least-squares problem:

$$\min_{\beta,\,a,\,b}\ \|\mathcal{Y} - X\beta - Z_Aa - Z_Bb\|^2 + \lambda_A\|a\|^2 + \lambda_B\|b\|^2. \tag{7}$$

Then $\hat\beta = \hat\beta_{\mathrm{GLS}}$, and $\hat a$ and $\hat b$ are the BLUP "estimates" for the random effects. This derivation works for any number of factors, but it is instructive to carry it through initially for one.

3.1 One factor

For a single factor, we simply drop the $Z_Bb$ term from (3) to get $\mathcal{Y} = X\beta + Z_Aa + e$. Then $\mathcal{V} = \sigma^2_AZ_AZ_A^\mathsf{T} + \sigma^2_EI_N$, and $\lambda_A = \sigma^2_E/\sigma^2_A$ as before. The penalized least squares problem is to solve

$$\min_{\beta,\,a}\ \|\mathcal{Y} - X\beta - Z_Aa\|^2 + \lambda_A\|a\|^2. \tag{8}$$

We show the details as we need them for a later derivation.

The normal equations from (8) yield

$$X^\mathsf{T}(\mathcal{Y} - X\beta - Z_Aa) = 0, \quad\text{and} \tag{9}$$
$$\bigl(Z_A^\mathsf{T}Z_A + \lambda_AI_R\bigr)a = Z_A^\mathsf{T}(\mathcal{Y} - X\beta). \tag{10}$$

Solving (10) for $a$ and multiplying the solution by $Z_A$ yields

$$Z_A\hat a = S_A(\mathcal{Y} - X\beta)$$

for an $N\times N$ ridge regression "smoother matrix" $S_A = Z_A(Z_A^\mathsf{T}Z_A + \lambda_AI_R)^{-1}Z_A^\mathsf{T}$. Substituting this value of $Z_A\hat a$ into equation (9) yields

$$\hat\beta = \bigl(X^\mathsf{T}(I_N - S_A)X\bigr)^{-1}X^\mathsf{T}(I_N - S_A)\mathcal{Y}. \tag{11}$$

Using the Sherman-Morrison-Woodbury (SMW) identity, one can show that $I_N - S_A = \sigma^2_E\mathcal{V}^{-1}$ and hence $\hat\beta$ above equals $\hat\beta_{\mathrm{GLS}}$ from (6). This is not in itself a new discovery; see for example Robinson (1991) or Hastie and Tibshirani (1990) (Section 5.3.3).

To compute the solution in (11), we need to compute $S_A\mathcal{Y}$ and $S_AX$. The heart of the computation in $S_A\mathcal{Y}$ is $(Z_A^\mathsf{T}Z_A + \lambda_AI_R)^{-1}Z_A^\mathsf{T}\mathcal{Y}$. But $Z_A^\mathsf{T}Z_A = \mathrm{diag}(N_{1\bullet},\dots,N_{R\bullet})$, and we see that all we are doing is computing an $R$-vector of shrunken means of the elements of $\mathcal{Y}$ at each level of the factor $A$; the $i$th element is $\sum_jZ_{ij}Y_{ij}/(N_{i\bullet} + \lambda_A)$. This involves a single pass through the $N$ elements of $\mathcal{Y}$, accumulating the sums in $R$ registers, followed by an elementwise scaling of the $R$ components. Multiplication by $Z_A$ simply puts these shrunken means back into an $N$-vector in the appropriate positions. The total cost is $O(N)$. Likewise $S_AX$ does the same separately for each of the $p$ columns of $X$. Hence the entire computational cost for (11) is $O(Np^2)$, the same order as regression on $X$.

What is also clear is that the indicator matrix $Z_A$ is not actually needed here; instead all we need to carry out these computations is the factor vector fA that records the level of factor $A$ for each of the $N$ observations. In the R language (R Core Team, 2015) the following pair of operations does the computation:

  hat_a = tapply(y, fA, sum) / (table(fA) + lambdaA)  # shrunken mean for each level of fA
  hat_y = hat_a[fA]                                   # expand back into an N-vector
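
For concreteness, here is a self-contained toy use of those two lines; the data, the factor fA and the value of lambdaA are all made up.

  # Toy illustration of the two operations above with made-up inputs.
  set.seed(3)
  fA      <- factor(sample(letters[1:4], 20, replace = TRUE))   # factor of levels
  y       <- rnorm(20) + as.integer(fA)                         # response
  lambdaA <- 2                                                  # ridge parameter
  hat_a <- tapply(y, fA, sum) / (table(fA) + lambdaA)           # shrunken mean per level
  hat_y <- hat_a[fA]                                            # back to an N-vector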

3.2 Two factors

With two factors we face the problem of incompatible block diagonal matrices discussed in Section 2. Define the combined indicator matrix $\mathcal{Z} = (Z_A\;\;Z_B)$ (with $R+C$ columns), the combined coefficient vector $\theta = (a^\mathsf{T}, b^\mathsf{T})^\mathsf{T}$, and the penalty matrix $D = \mathrm{blockdiag}(\lambda_AI_R,\ \lambda_BI_C)$. Then solving (7) is equivalent to

$$\min_{\beta,\,\theta}\ \|\mathcal{Y} - X\beta - \mathcal{Z}\theta\|^2 + \theta^\mathsf{T}D\theta. \tag{12}$$

A derivation similar to that used in the one-factor case gives

$$\hat\beta = \bigl(X^\mathsf{T}(I_N - S)X\bigr)^{-1}X^\mathsf{T}(I_N - S)\mathcal{Y}, \tag{13}$$

where the hat matrix is written in terms of a smoother matrix

$$S = \mathcal{Z}\bigl(\mathcal{Z}^\mathsf{T}\mathcal{Z} + D\bigr)^{-1}\mathcal{Z}^\mathsf{T}. \tag{14}$$

We can again use SMW to show that $I_N - S = \sigma^2_E\mathcal{V}^{-1}$ and hence $\hat\beta$ equals the solution in (6). But in applying $S$ we do not enjoy the computational simplifications that occurred in the one factor case, because

$$\mathcal{Z}^\mathsf{T}\mathcal{Z} + D = \begin{pmatrix} Z_A^\mathsf{T}Z_A + \lambda_AI_R & Z_A^\mathsf{T}Z_B \\ Z_B^\mathsf{T}Z_A & Z_B^\mathsf{T}Z_B + \lambda_BI_C \end{pmatrix},$$

where $Z_A^\mathsf{T}Z_B = Z$ is the observation matrix, which has no special structure. Therefore we need to invert an $(R+C)\times(R+C)$ matrix to apply $S$ and hence to solve (13), at a cost of at least $N^{3/2}$ (see Section 2).

Rather than group $Z_A$ and $Z_B$, we keep them separate, and develop an algorithm to apply the operator $S$ (and hence $I_N - S$) efficiently. Consider a generic response vector $y$ and the optimization problem

$$\min_{a,\,b}\ \|y - Z_Aa - Z_Bb\|^2 + \lambda_A\|a\|^2 + \lambda_B\|b\|^2. \tag{15}$$

From (12) and (14) it is clear that the fitted values are given by $\hat y = Z_A\hat a + Z_B\hat b = Sy$. Solving (15) would result in two blocks of estimating equations similar to (9)–(10). These can be written

$$Z_A\hat a = S_A\bigl(y - Z_B\hat b\bigr), \qquad Z_B\hat b = S_B\bigl(y - Z_A\hat a\bigr), \tag{16}$$

where $S_A = Z_A(Z_A^\mathsf{T}Z_A + \lambda_AI_R)^{-1}Z_A^\mathsf{T}$ is again the ridge regression smoothing matrix for the row effects and $S_B = Z_B(Z_B^\mathsf{T}Z_B + \lambda_BI_C)^{-1}Z_B^\mathsf{T}$ is similarly the smoothing matrix for the column effects. We solve these equations iteratively by block coordinate descent, also known as backfitting. The iterations converge to the solution of (15) (Buja et al., 1989; Hastie and Tibshirani, 1990).

Here the simplifications we enjoyed in the one-factor case once again apply. Each step applies its operator to a vector (the terms in parentheses on the right hand side of (16)). For both $S_A$ and $S_B$ these are simply the shrunken-mean operations described for the one-factor case, applied separately for factor $A$ and factor $B$ each time. As before, we do not need to actually construct $Z_A$ or $Z_B$, but simply use a factor that records the level of factor $A$ (or $B$) for each of the $N$ observations.

This was for a generic response $y$; we apply the same algorithm (in parallel) to each column of $X$ to obtain $SX$ in (13). Now solving (13) is $O(N)$ work for fixed $p$. These computations deliver $\hat\beta_{\mathrm{GLS}}$; if the BLUP estimates $\hat a$ and $\hat b$ are also required, the same algorithm can be applied to the response $\mathcal{Y} - X\hat\beta_{\mathrm{GLS}}$, retaining the $\hat a$ and $\hat b$ at the final iteration. We can also write

$$\mathrm{cov}(\hat\beta_{\mathrm{GLS}}) = \sigma^2_E\bigl(X^\mathsf{T}(I_N - S)X\bigr)^{-1}. \tag{17}$$

It is also clear that we can trivially extend this approach to accommodate any number of factors.
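
The following R sketch is our own illustration (not the authors' code) of the alternating shrunken-mean updates in (16) for a generic response, using only factor vectors fA and fB and ridge parameters lambdaA and lambdaB.

  # Sketch of the backfitting iteration (16) using only factor vectors and
  # shrunken means; illustrative code, not the authors' implementation.
  backfit_ab <- function(y, fA, fB, lambdaA, lambdaB, maxit = 100, tol = 1e-8) {
    fA <- droplevels(fA); fB <- droplevels(fB)     # assume every kept level is observed
    hat_a <- setNames(rep(0, nlevels(fA)), levels(fA))
    hat_b <- setNames(rep(0, nlevels(fB)), levels(fB))
    for (it in seq_len(maxit)) {
      old   <- c(hat_a, hat_b)
      hat_a <- tapply(y - hat_b[fB], fA, sum) / (table(fA) + lambdaA)  # row update
      hat_b <- tapply(y - hat_a[fA], fB, sum) / (table(fB) + lambdaB)  # column update
      if (max(abs(c(hat_a, hat_b) - old)) < tol) break
    }
    list(a = hat_a, b = hat_b, fit = hat_a[fA] + hat_b[fB], iterations = it)
  }

In practice the same routine would be run on each column of $X$ (and on $\mathcal{Y}$) to form the quantities needed in (13).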

3.3 Centered operators

The matrices $Z_A$ and $Z_B$ both have all row sums equal to one, since they are factor indicator matrices ("one-hot encoders"). This creates a nontrivial intersection between their column spaces, and with that of $X$, since we always include an intercept; this intersection can cause backfitting to converge more slowly. In this section we show how to counter this intersection of column spaces to speed convergence. We work with the two-factor model

$$\mathcal{Y} = X\beta + Z_Aa + Z_Bb + e. \tag{18}$$
Lemma 1.

If $X$ in model (18) includes a column of ones (intercept), and $\lambda_A > 0$ and $\lambda_B > 0$, then the solutions for $\hat a$ and $\hat b$ satisfy $\sum_i\hat a_i = 0$ and $\sum_j\hat b_j = 0$.

Proof.

It suffices to show this for one factor, with $\lambda_A > 0$. The objective is now

$$\min_{\beta,\,a}\ \|\mathcal{Y} - X\beta - Z_Aa\|^2 + \lambda_A\|a\|^2. \tag{19}$$

Notice that for any candidate solution $(\beta, a)$ and any constant $c$, the alternative solution that replaces $a$ by $a - c1_R$ and adds $c$ to the intercept leaves the loss part of (19) unchanged, since the row sums of $Z_A$ are all one. Hence if $\lambda_A > 0$, we would always improve by picking $c$ to minimize the penalty term $\lambda_A\|a - c1_R\|^2$, namely $c = \bar a = \frac{1}{R}\sum_i a_i$, which yields $\sum_i\hat a_i = 0$. ∎

It is natural then to solve for $\hat a$ and $\hat b$ with these constraints enforced, rather than waiting for the iterative algorithm to discover them.

Theorem 1.

Consider the generic optimization problem

(20)

Define the partial sum vector with components , and let

Then the solution is given by

Moreover, the fit is given by

where is a symmetric operator.

The computations are a simple modification of the non-centered case.

Proof.

Let be an orthogonal matrix with first column . Then . So reparametrizing in this way leads to the equivalent problem

(21)

To solve (21), we simply drop the first column of . Let where is the matrix omitting the first column, and the corresponding subvector of having components. We now solve

(22)

with no constraints, and solution . The fit is given by , and is clearly a symmetric operator.

To obtain the simplified expression for , we write

(23)

with . We write and , and let

(24)

Now (24) is a projection matrix in onto a dimensional subspace. Let Then , and so

Unraveling this expression we get

With in (23), this gives the expressions for each in the statement of the theorem. ∎

3.4 Covariance matrix for $\hat\beta$ with centered operators

In Section 3.2 we saw in (17) that we get a simple expression for $\mathrm{cov}(\hat\beta_{\mathrm{GLS}})$. This simplicity relies on the fact that $I_N - S = \sigma^2_E\mathcal{V}^{-1}$, and the usual cancelation occurs when we use the sandwich formula to compute this covariance. When we backfit with our centered smoothers we get a modified residual operator $I_N - \tilde S$ such that the analog of (13) still gives us the required coefficient estimate:

$$\hat\beta_{\mathrm{GLS}} = \bigl(X^\mathsf{T}(I_N - \tilde S)X\bigr)^{-1}X^\mathsf{T}(I_N - \tilde S)\mathcal{Y}. \tag{25}$$

However, $I_N - \tilde S$ is no longer a scalar multiple of $\mathcal{V}^{-1}$, and so now we need to resort to the sandwich formula for the estimator in (25). Expanding this we find that $\mathrm{cov}(\hat\beta_{\mathrm{GLS}})$ equals

$$\bigl(X^\mathsf{T}(I_N - \tilde S)X\bigr)^{-1}X^\mathsf{T}(I_N - \tilde S)\,\mathcal{V}\,(I_N - \tilde S)^\mathsf{T}X\bigl(X^\mathsf{T}(I_N - \tilde S)X\bigr)^{-1}.$$

While this expression might appear daunting, the computations are simple. Let $\widetilde X = (I_N - \tilde S)X$, the residual matrix after backfitting each column of $X$ using these centered operators. Then because $\tilde S$ is symmetric, we have

$$\mathrm{cov}(\hat\beta_{\mathrm{GLS}}) = \bigl(X^\mathsf{T}\widetilde X\bigr)^{-1}\widetilde X^\mathsf{T}\,\mathcal{V}\,\widetilde X\bigl(X^\mathsf{T}\widetilde X\bigr)^{-1}. \tag{26}$$

Since $\mathcal{V} = \sigma^2_AZ_AZ_A^\mathsf{T} + \sigma^2_BZ_BZ_B^\mathsf{T} + \sigma^2_EI_N$ (two low-rank matrices plus a multiple of the identity), we can compute $\mathcal{V}\widetilde X$ very efficiently, and hence also the covariance matrix in (26).

4 Convergence of the matrix norm

In this section we prove a bound on the norm of the matrix that implements backfitting for our random effects $a_i$ and $b_j$. To focus on essentials we do not consider the updates for $\beta$ here. It was necessary to take account of intercept adjustments, because the intercept is in the space spanned by these random effects.

Let the matrix $M$ update the vector of random effect estimates at one iteration of backfitting. The updates take the form $x \leftarrow Mx + c$ for some vector $c$ that does not change across iterations. They will converge if $\|M\| < 1$ holds for a suitable matrix norm.

We already know from Buja et al. (1989) that backfitting will converge. However, we want more: we want to avoid having the number of iterations required grow with $N$. We can write the solution as $\sum_{k=0}^{\infty}M^kc$, and in computations we truncate this sum after a finite number of steps. If $\|M\|$ approaches one as $N\to\infty$, then the number of iterations required to make the truncation error negligible can grow with $N$. We want $\|M\| \le 1 - \epsilon$ for some $\epsilon > 0$ as $N\to\infty$. For our induced norms, $\|M\|$ is no smaller than $\rho(M)$, the spectral radius of $M$. For symmetric $M$ the spectral radius equals $\|M\|_2$. Our update matrices are not necessarily symmetric. It can be shown, via Gelfand's formula $\rho(M) = \lim_{k\to\infty}\|M^k\|^{1/k}$, that if $\rho(M) < 1$, then the iteration will converge.
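
As a small numerical sanity check of this criterion (our own sketch; it requires forming the update matrix explicitly, which is practical only at modest sizes):

  # Sketch: check rho(M) < 1 for an explicitly formed update matrix M and
  # bound the number of iterations needed for a given truncation tolerance.
  spectral_radius <- function(M) max(Mod(eigen(M, only.values = TRUE)$values))
  iterations_needed <- function(M, eps = 1e-8) {
    rho <- spectral_radius(M)
    stopifnot(rho < 1)                  # otherwise the iteration need not converge
    ceiling(log(eps) / log(rho))        # smallest k with rho^k <= eps
  }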

We seek conditions under which $\|M\| < 1$ with probability tending to one. The empirically most favorable norm is the spectral norm, but the most tractable one for our theoretical approach is the induced $\ell_\infty$ norm (the maximum absolute row sum). The update for $\beta$ is linear in the random effects, so if they converge, then we could take a single step for $\beta$. In practice we alternate those updates.

4.1 Updates

Recall that $Z_{ij} \in \{0,1\}$ describes the pattern of observations. In a model with no intercept we could use the following update:

For this update, we have for where

The update alternates shrinkage estimates for $a$ and $b$ but does no centering.

In the presence of an intercept, we know from Lemma 1 that the random effect estimates should sum to zero at the solution, and we can impose this by centering them, taking

The intercept estimate can then be recovered and subtracted upon convergence. Our iteration has the update matrix $M$ with

(27)

after replacing a sum over by an equivalent one over .

In practice, we prefer to use the weighted centering from Section 3.3 to center the . For the generic response there, we take a vector of values , and so Then and the updated is

Using shrunken averages of , the new are

Now for , where

(28)

As above, we will leave the intercept in as the average of the because a second centering produces unnecessarily unwieldy expressions.

4.2 Model for $Z_{ij}$

We will state conditions on $Z_{ij}$ under which $\|M\|_\infty$ is bounded below one with probability tending to one, as the problem size grows. Similarly derived conditions for $\|M\|_1$ are somewhat less restrictive. We need the following exponential inequalities.

Lemma 2.

If $X \sim \mathrm{Bin}(n, p)$, then for any $t > 0$,

$$\Pr\bigl(|X - np| \ge t\bigr) \le 2\exp\bigl(-2t^2/n\bigr).$$

Proof.

This follows from Hoeffding’s theorem. ∎

Lemma 3.

Let $X_k \sim \mathrm{Bin}(n_k, p_k)$ for $k = 1,\dots,K$, not necessarily independent. Then for any $t > 0$,

$$\Pr\Bigl(\max_{1\le k\le K}\,|X_k - n_kp_k| \ge t\Bigr) \le 2\sum_{k=1}^K\exp\bigl(-2t^2/n_k\bigr).$$

Proof.

This is from the union bound applied to Lemma 2. ∎

Here is our sampling model. We index the size of our problem by . The sample size will satisfy . The number of rows and columns in the data set are

respectively, for positive numbers and . Because our application domain has , we assume that . We ignore that and above are not necessarily integers.

In our model, the $Z_{ij}$ are independent with

(29)

Letting the observation probability depend on $i$ and $j$ allows the probability model to capture stylistic preferences affecting the missingness pattern in the ratings data.

4.3 Bounds for row and column size

Letting mean that is stochastically smaller than , we know that