Scalable inference for crossed random effects models

03/26/2018 ∙ by Omiros Papaspiliopoulos, et al. ∙ Universitat Pompeu Fabra Università Bocconi University of Warwick 0

We analyze the complexity of Gibbs samplers for inference in crossed random effect models used in modern analysis of variance. We demonstrate that for certain designs the plain vanilla Gibbs sampler is not scalable, in the sense that its complexity is worse than proportional to the number of parameters and data. We thus propose a simple modification leading to a collapsed Gibbs sampler that is provably scalable. Although our theory requires some balancedness assumptions on the data designs, we demonstrate in simulated and real datasets that the rates it predicts match remarkably the correct rates in cases where the assumptions are violated. We also show that the collapsed Gibbs sampler, extended to sample further unknown hyperparameters, outperforms significantly alternative state of the art algorithms.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Crossed random effect models are additive models that relate a response variable to categorical predictors. In the literature they appear under various names, e.g. crossclassified data, variance component models or multiway analysis of variance. They provide the canonical framework for understanding the relative importance of different sources of variation in a data set as argued in

Gelman (2005). For the purposes of this article we focus on linear models according to which

(1)

where

is a variance component, i.e., the vector of

levels, , for the -th categorical factor; is a global mean with level. These variance components might correspond to both main and interaction effects in categorical data analysis. We work with exchangeable Gaussian random effects, , for , which is by far the most standard choice, although interesting alternative priors exist as in Volfovsky & Hoff (2014). We call a data design one with balanced levels if the same number of observations are made at each level of each factor, but this number can vary with factor, see Section 2.1

for a mathematical definition. The design has balanced cells if the same number of observations are available for each combination of factor levels, i.e., at each cell of the contingency table defined by the categorical predictors. By construction, a design with balanced cells has also balanced levels. In the notation of (

1) we allow , which corresponds to empty cells. The total number of factor levels, hence of regression parameters, is denoted by , and that of number of observations by . Crossed random effect models adapt naturally to modern high-dimensional but sparse data. For example, they are used in the context of recommender systems where in the simplest setup there are two factors, customers and products, and the response is a rating; the examples in Gao & Owen (2017) are such that .

Likelihood-based inference for such models requires a marginalisation over the factors. An exact marginalisation is possible in the linear model due to the joint Gaussian distribution of responses and factors. However, this involves matrix operations the cost of which are

, which is prohibitively large in modern applications. For example in the case of recommendation this cost this is typically , hence infeasible for large datasets. The precision matrices of the Gaussian distributions involved can be computed efficienty, e.g., Wilkinson & Yeung (2004), and may be sparse, hence black-box sparse linear algebra algorithms can be used for the matrix operations in the hope of reducing the complexity, but it has never been established that this is actually achieved in crossed random effect models.

Alternatively, Markov chain Monte Carlo can be used to carry out the integration (and the inference more generally). The most popular and convenient algorithm in this context is the Gibbs sampler, which samples the factors

iteratively from their full conditional distributions. Recently, Gao & Owen (2017) sketched an argument that suggests that the complexity of this algorithm, in the special case that is assumed known, in the context of recommendation where with balanced cell design, has complexity

. The complexity of a Markov chain Monte Carlo algorithm can be defined as the product of the computational time per iteration and the number of iterations the algorithm needs to mix. The heuristic argument in

Gao & Owen (2017) suggested that the Gibbs sampler is not scalable for crossed random effects models due to its superlinear cost in the number of observations.

In this article we develop the theory for analysing the complexity of the Gibbs sampler for crossed random effect models under different designs. We propose a small modification of the basic algorithm, the collapsed Gibbs sampler, which we analyse too, and establish rigorously its superior performance and scalability. We obtain rigorous results on the mixing times of the algorithms in Section 5 and analyse their computational cost in the Appendix. The careful analysis also shows that to an extent the scalability of the Gibbs sampler depends on the design. Our theory is useful beyond the specific designs that have been assumed to derive it. The essence of the methodology we develop in this article is shown in Figure 1, details of which are given in Section 6.1.

Figure 1: Mixing time over number of datapoints for , and

. Data are missing at completely random with probability of missingness 0.9. The exact rates and those predicted by our theory (which, however, does not apply to these designs) are shown.

The figure highlights different aspects of our results; we consider modern big data asymptotic regimes where both the number of parameters and observations grow, for factors; only a small fraction of the cells of the contingency table made by the outer product of the two factors is observed; the mixing time of the Gibbs sampler and the collapsed Gibbs sampler can be computed numerically and are plotted versus the size of the datasets, and it is evident the slowing down of the Gibbs sampler and the improvement of the collapsed Gibbs sampler with increasing data sizes; our theory is not applicable in these cases since the resultant designs, which have been generated randomly, are not balanced levels, still the rate that our theory predicts matches quite remarkably the correct rates. We obtain comparable results in a well-known real dataset of student evaluations with 5 factors in Section 6.2. For the student evaluations dataset we consider state of the art Markov chain Monte Carlo algorithms that sample the factor levels and the precision parameters in the crossed effect models, including parameter expansion and Hamiltonian Monte Carlo algorithms. We find that the collapsed Gibbs sampler we propose, appropriately extended to sample the precisions, has far superior performance. This is again a setting where our theory has not been developed yet (unbalanced designs, unknown precisions) but where the intuition gained from the simpler settings suggests practically useful algorithms. Therefore, our theory leads to generic guidelines for practitioners.

The theory is based upon a multigrid decomposition of the Markov chain generated by the sampler, which allows us to identify the slowest mixing components, and capitalises on existing theory for the convergence of Gaussian Markov chains. The multigrid decomposition of a Markov chains is a powerful theoretical tool for studying its mixing time, since it provides its decomposition into independent processes. Identifying such decomposition is a kind of art; a previously successful example is in Zanella & Roberts (2017) in the context of multilevel nested linear models. We point out that for nested hierarchical models scalable Bayesian computation can also be achieved with deterministic algorithms, such as belief propagation, as in Papaspiliopoulos & Zanella (2017).

The article closes with some conjectures that dictate our future research.

2 Decompositions of the posterior distribution

2.1 Notation

The statistical model we work with is described in (1). In accordance with standard practice Gaussian priors are used for the factor levels, , and an improper prior for the global mean, . When convenient we write , which is is the same as . We allow , which corresponds to empty cells in the contingency table defined by the outer product of the categorical factors. With we denote the data incidence array, a multidimensional array with elements . Two-dimensional marginal tables extracted from the data incidence matrix are denoted by and have elements , which is the total number of observations on level of factor and of factor ; margins of this table are denoted by , and are vectors of size and elements , which is the total number of observations with level on the -th factor. By definition , where is the total number of observations. A data design has balanced levels if for every and , and balanced cells if for all combinations of factor levels.

Averages of vectors are denoted by an overline, e.g., ; weighted averages are denoted by a tilde, e.g., . The vector of all factor averages is denoted by , the first element of which is trivially . Negative superscripts in factors denote the vector of factor levels for all factors but the one with the index whose negative value is used in the superscript, e.g., includes all factor levels except those of . Similarly, negative subscripts in factors denote the vector of levels of the given factor except the level whose negative value is used in the subscript, e.g., includes all levels of factor except the -th level; denotes the vector of all levels of all factors. We define to be a residual operator that when applied to a vector returns the difference of its elements from their sample average, e.g., has elements and is referred to as the factor’s level increments; denotes the vector of all such increments (except which is 0 trivially).

The law of a random variable

is denoted by , e.g., , and that of conditionally on by

. When a joint distribution has been specified for

and other random variables, denotes the full conditional distribution of conditionally on the rest.

2.2 Full conditional distributions

Fairly standard Bayesian linear model calculations yield that

(2)

where the LHS is an example of the abridged notation we shall adopt for the full conditional distribution (in this case of given all other parameters and data). With balanced levels this simplifies to

(3)

Similarly we obtain that for

(4)

where is the weighted average of all observations for which their level on factor is . With balanced cells this simplifies to

(5)

2.3 Factorisations

In balanced levels designs the posterior distribution of regression parameters admits certains factorisation, which are collected together in the following Proposition.

Proposition 1.

For balanced levels designs

and

(6)

For balanced cells designs we have further

The factorisation in (6) is particularly relevant to the collapsed Gibbs sampler we introduce later in the article. A sketch of the proof of Proposition 1 is the following. For the first factorisation, directly from (4) with the assumption of balanced levels we obtain that

(7)

We use the fact that global and local Markovian properties are equivalent, see, e.g., Section 3 of Besag (1974). This yields the independence stated in the lemma. The proof of (6) follows by similar arguments using . The third factorisation is argued in the same way noting that (5) implies that

3 Gibbs samplers for inference

We consider two main algorithms in this paper. The first is a block Gibbs sampler that updates in a single block the levels of a given factor conditionally on everything else. Due to the dependence structure in the model, the levels of a given factor conditionally on the rest are independent, hence in practice the sampling is done separately for each factor level, i.e., iteratively from , for , and ; these distributions are specified in Section 2.2. We refer to this algorithm as the Gibbs sampler, although it should be understood that it is just one implementation of the scheme.

We also consider the collapsed version of this algorithm that samples from , i.e., the algorithm that is obtained by first analytically integrating out the global mean , and then sampling in blocks the levels of each of the remaining factors; we term this algorithm the collapsed Gibbs sampler. In practice, we implement this algorithm by sampling iteratively from , for . In this implementation we first sample , and then for as in the previous scheme. The implementation of the collapsed Gibbs sampler relies on the following result.

Proposition 2.

Denoting , then

(8)

The reason why we prefer to present the collapsed Gibbs sampler in this way where is updated together with each block as opposed to being integrated out before sampling starts, is because our preferred version is still realisable in more elaborate models, e.g., generalised linear crossed random effects models. In such extensions exact sampling from might not be feasible, but a Metropolis-Hastings step can be used instead. Additionally, it requires a minimal modification of the Gibbs sampler code to implement.

4 Multigrid decomposition of the Gibbs samplers

4.1 Notation

For the stochastic processes generated by Markov chain Monte Carlo the time index corresponds to iteration, which is generically denoted by , and it is included in parentheses, e.g., ; in such a case the stochastic process over iterations is denoted by ; we write when ; we write to denote a stochastic process that at each time takes as value the vector composed by and . We say that the stochastic process is a timewise transformation of another if there is a function such that for all .

4.2 Main results

The results we derive in this paper stem from the following result, the proof of which is given in the Appendix.

Theorem 1.

(Multigrid decomposition) Let be the Markov chain generated either by the Gibbs sampler or the collapsed Gibbs sampler for balanced levels designs. Then, the timewise transformations and obtained from are each a Markov chain and they are independent of each other.

A crucial point here is that the fact that the posterior distribution of and factorise for balanced levels, as shown in Proposition 1, does not imply that the corresponding chains and are independent of each other. The following very simple example that makes this point clear. Consider a Gibbs sampler that targets a bivariate Gaussian for with correlation and standard Gaussian marginals. Then the transformation and orthogonalises the target, but the corresponding stochastic processes and obtained by timewise transformation of the original chain are not independent Markov chains, see, e.g., the cross-correlogram in Figure 2.

Figure 2: Cross correlation between and , where , and is the Gibbs sampler Markov chain on a bivariate Gaussian with correlation .

Although this is a toy example, there are many instances where an independence factorisation of the target distribution does not imply that of the MCMC algorithm adopted (for example in Hamiltonian Monte Carlo, population Markov chain Monte Carlo and piecewise deterministic Monte Carlo algorithms such as the zig-zag and bouncy particle sampler). There are subtle and deep reasons why the factorisation in Proposition 1 extends to the independence of the Markov chains obtained as timewise transformations.

In Section 5 we use Theorem 1 in conjunction with two others to characterise the complexity of the two samplers. The first of the additional results is about convergence rates of Markov chains, and shows how to relate the rate of convergence of to that of the timewise transformations and . The second is about the rate of convergence of each of the Markov chains and .

5 Complexity analysis

5.1 Complexity of Markov chain Monte Carlo

In this article we focus on convergence, which relies on functional analytic concepts, a very high level description of which are given below. For a given target distribution defined on a state space , we define to be the space of complex-valued functions that are square-integrable with respect to . We define the inner product in this space such that the associated norm of a function is . For a Markov chain defined on with transition kernel that is invariant with respect to , we view as an integral operator on , and we say that it converges geometrically fast to in

norm (also known as operator norm), if and only if the second largest in absolute value eigenvalue of

, known as its geometric rate of convergence, is less than 1. The spectral gap of is defined as the difference between 1 and the rate of convergence, hence a Markov chain converges in norm if and only it has positive spectral gap. All this is fairly standard functional analysis theory applied to Markov chains on general state spaces.

For our purposes, we define the mixing time of a Markov chain to be the inverse of its spectral gap; this can be interpreted as the number of iterations needed to subsample the Markov chain so that the resultant draws are roughly independent of each other. The complexity of a Markov chain Monte Carlo algorithm can be defined as the product of the mixing time and the cost per iteration.

5.2 Timewise transformations and convergence of Markov chains

The multigrid decomposition in Theorem 1 identifies two timewise transformations of the Markov chain produced by either of the algorithms considered in this article, each of which evolves independently of each other as a Markov chain. We can relate the rate of convergence of the Markov chains involved in this decomposition using the following two technical lemmata that are proved in the Appendix.

Lemma 1.

Let be a Markov chain with invariant distribution and be a timewise transformation given by , where is an injective function. Then is a Markov chain with the same rate of convergence as .

Lemma 2.

Let be a Markov chain with state space and target distribution . If the stochastic processes and obtained by projection on the and components are two independent Markov chains, then the rate of convergence of equals the supremum between the rates of convergence of and .

Therefore, for balanced levels designs the rate of convergence of the Markov chain , generated either by the Gibbs sampler or the collapsed Gibbs sampler, is the larger of the rates of the two chains and . Each of these chains is amenable to analysis using the theory summarised in Section 5.3 below.

5.3 The spectral gap of the Gibbs sampler on Gaussian distributions

The Markov chain generated by a Gibbs Sampler targeting a Gaussian multivariate distribution is a Gaussian autoregressive process evolving as , see for example Lemma 1 in Roberts & Sahu (1997). The details of the Gibbs sampler (e.g., the order that its components are updated or blocked together) are reflected in the precise form of . There is a generic recipe how to obtain described in Lemma 1 in Roberts & Sahu (1997), but sometimes it is easier to work it out directly from first principles, as for example we do in Propositions 3 and 4 and below. This representation implies that the rate of convergence of the Gibbs sampler is , the largest absolute eigenvalue of the matrix , see Theorem 1 of Roberts & Sahu (1997). This characterisation of the rate of convergence is immensely useful and has provided invaluable insights into the performance of the Gibbs sampler and has lead to much more efficient modifications of the basic algorithm, see for example Papaspiliopoulos et al. (2003, 2007). However, in high-dimensional scenarios it is often very challenging to compute explicitly as a function of the important parameters of the model (e.g., and in the crossed effects models considered here). Hence as a tool for understanding the complexity of the Gibbs sampler in difficult problems this approach has limited scope. In this article we will make it useful by combining it with the multigrid decomposition of Theorem 1, which collapses the problem to studying the spectral gaps of the Gaussian subchains and that turn out to be amenable to direct analysis.

5.4 Complexity analysis for balanced cells designs

The most substantial result of this section is Proposition 3 below, which actually holds for balanced levels designs too, hence used also in Section 5.5. It characterises the rate of convergence of one of the two timewise transformations involved in the multigrid decomposition.

Proposition 3.

For balanced levels designs the rate of convergence of the Markov chain defined in Theorem 1 equals for the Gibbs Sampler and for the collapsed Gibbs Sampler, and this rate is the same for any order that the different blocks are updated.

Proof.

For the Gibbs Sampler, the subchain is a Gaussian Gibbs Sampler, with one-dimensional components. We can explicitly work out that its autoregressive matrix takes the form

(9)

where is a lower triangular matrix with diagonal elements equal to , with

(10)

We check (9) verifying directly that . From equation (3) we have , which implies that the first row of is as in (9). To conclude the proof of (9) we need to show that

(11)

for some and . Using (7), we have

When the latter implies , meaning that (11) holds for . By induction we have that (11) holds for all . In fact if (11) holds for up to , we have

meaning that (11) holds also for . Therefore has a form as in (9).

Since is a lower triangular matrix its spectrum coincides with its diagonal elements . For each , let

be the eigenvector with eigenvalue

. It is easy to check that the -dimensional vector is an eigenvector of B with eigenvalue . Thus are also eigenvalues of . Finally note that is an eigenvector of with eigenvalue 0. With these ingredients the proof of the claim for the Gibbs sampler follows immediately.

For the collapsed Gibbs Sampler, is obtained from by simulating from

for . By Proposition 1, the latter procedure produces independent and identically distributed draws from , or equivalently if is jointly updated with .

These rates do not depend on the order that the different components are updated. This is trivially true for the collapsed Gibbs since the components are independent. For the Gibbs sampler the argument is as follows. The Gibbs Sampler rate of convergence is invariant with respect to cyclic permutations of the order of update of the components, see e.g. Roberts & Sahu (1997, p.297). Thus we can always assume to be the first component to be updated. Then the result follows by relabeling the components to according to their update order and replicating the argument developed in the previous paragraphs. ∎

The main result of this section follows rather easily from Proposition 3.

Theorem 2.

For balanced cells designs, the mixing time of the Gibbs Sampler is , and that of the collapsed Gibbs Sampler is 1, i.e., it produces independent and identically distributed draws from the target, and these rates do not depend on the order that different components are updated.

of Theorem 2.

Let be the Markov chain generated by the Gibbs Sampler or its collapsed version. Lemma 1 implies that is a Markov chain with the same rate of convergence as . Thus, by means of Theorem 1 and Lemma 2, the rate of convergence of equals the maximum between the rate of convergence of and the one of . Proposition 1 implies that performs independent sampling from and thus its rate of convergence is 0 and the rate of convergence of equals the one of . To conclude, Proposition 3 and the definition of mixing times as inverse of the spectral gap imply the statement to be proved. ∎

The theorem completely characterises the mixing time of the Gibbs sampler and the collapsed Gibbs sampler for balanced cells designs. Considering the computational cost of the algorithms, we find that each of the algorithms requires an computation at initialisation to precompute data averages. In the Appendix we show that both algorithms have the same cost per iteration, which is proportional to the number of parameters, . Therefore, the collapsed Gibbs sampler is an implementation of exact sampling from the posterior.

We now consider asymptotic regimes. The more classical asymptotic regime, which we will refer to as infill asymptotics, keeps the number of factors and levels fixed, hence and fixed, and increases the number of observations per cell, hence grows. The other more modern asymptotic regime, which we will refer to as outfill asymptotics, increases with , e.g. considering the observations per cell bounded and increasing the number or levels and/or factors. It is this type of asymptotic that it is more interesting in recommendation applications.

Regardless of the asymptotic regime considered the mixing time of the collapsed Gibbs sampler is . On the other hand, that of the Gibbs sampler depends on the regime considered. In infill asymptotics Theorem 2 implies that the mixing time of the algorithm is . An intuition for this deterioration of the algorithm with increasing data size can be obtained by considering the analysis of non-centered parameterisations for hierarchical models in Section 2 of Papaspiliopoulos et al. (2007); the parameterisation of the crossed effect model is non-centred and the infill asymptotics regime makes the data increasingly informative per random effect, hence we should anticipate the deterioration. Therefore, in this regime the complexity of both algorithms is but in practice the collapsed will be much more efficient. In outfill asymptotics, both and the number of factor levels ’s are growing, hence by Theorem 2 the mixing time of the Gibbs sampler is no worse than but no better than . The lower bound on the mixing time can be deduced from the balanced cells design assumption, which implies and ; the bound is achievable when . On the other hand, the number of parameters can grow as different powers of . For example, if the number of levels for all but one factor are fixed and those of the remaining factor are increasing (e.g., fixed number of customers, increasing number of products) then is and the mixing time of the Gibbs sampler is also , resulting in a Gibbs Sampler complexity of , whereas the collapsed Gibbs sampler is .

5.5 Complexity analysis for balanced levels designs

The strategy for obtaining complexity results for balanced levels designs is the same as for balanced cells and Proposition 3 is as instrumental. However, in this case the analysis is much more complicated since the second timewise transformation, , does not sample anymore independently from its invariant distribution; in fact its invariant distribution does not factorise as in the case of balanced cells. On the other hand, Lemma 2 and Proposition 3 imply immediately lower bound on the mixing time of the Gibbs sampler.

Theorem 3.

For balanced levels designs, the mixing time of the Gibbs Sampler is at least .

From Proposition 3 we also know that the rate of the blocked Gibbs sampler is that of . Therefore, obtaining explicit rates of convergence for is the step needed for characterising the mixing time of both algorithms in balanced levels designs. We are able to do this for in Proposition 4 below. Our theory is based on an auxiliary process with discrete state space that evolves according to a two component Gibbs Sampler, iteratively updating and , with invariant distribution .

Proposition 4.

For balanced levels designs with , the rate of convergence of the Markov chain is

where is the rate of convergence of the auxiliary Gibbs sampler .

Proof.

The chain is a two-component Gibbs Sampler that alternates updates from the conditional distributions and . Thus, is marginally a Markov chain and its rate of convergence equals the one of , see e.g. Roberts & Rosenthal (2001). Let and defined by and . It is then a simple computation that is a Gaussian autoregressive process with autoregression matrix . Since for balanced levels design it holds , it can be deduced from (4) and (7) that , where is a matrix being the transition kernel of the update of the auxiliary process. Similarly, one can show , where is a matrix being the transition kernel of the update of the auxiliary process. Hence, the autoregressive matrix of is , where is the transition kernel of the auxiliary Gibbs sampler . Consequently, the spectrum of the autoregressive matrix is , where are the eigenvalues of . The largest is of course 1 since

is a stochastic matrix. However, since

is constrained to have zero sum, by Lemma 3 in the Appendix the rate of convergence of is not given by the largest modulus eigenvalue of the autoregressive matrix, but the largest modulus eigenvalue whose eigenvector has zero sum, i.e., we need to consider only the subspace orthogonal to the vector of 1’s. Therefore, the rate of convergence of equals times the second largest modulus eigenvalue of , which is by definition. ∎

With Proposition 4 in place, the main result of this Section on the mixing time of the algorithms follows immediately.

Theorem 4.

For balanced levels designs with , the rate of convergence of the Gibbs Sampler and the collapsed Gibbs Sampler are given by, respectively,

where is the rate of convergence of the auxiliary Gibbs sampler with invariant distribution .

Note that if the design is in fact balanced cells, the rates given in Theorem 4 match those of Theorem 3, as they should, since in this case.

A corollary to this Theorem is that the mixing time of the Gibbs sampler is and that of the collapsed Gibbs sampler is no larger than , where is the mixing time of the auxiliary process . An implication of this is that the collapsed Gibbs Sampler is never slower than the standard Gibbs Sampler and it is has good mixing both when the amount of data per level is low and high. To see this, note first the ratios and coincide with the number of datapoints per column and row, respectively, in the data incidence matrix with entries and thus their value increases as the amount of data per level increases. On the contrary the mixing time of the auxiliary process tends to decrease as the amount of data per level increases because the latter corresponds to adding more edges in the conditional independence graph, hence larger connectivity in the state space of the auxiliary process. Unfortunately, it is not true in general that the minimum across , and is uniformly bounded over . Consider for example a design where users and items are split into two communities of equal size, and users inside each community have rated all items from their community and no item from the other community. In this case the random walk is reducible. Therefore and, provided both and go to infinity, the mixing time of the collapsed Gibbs Sampler diverges as goes to infinity.

We now address the case of number of factors that Theorem 4 does not cover. A conjecture we make in this paper is that is the mixing time of the Gibbs sampler also for . We have experimented numerically quite extensively, since for specific examples we can compute the mixing time by computing numerically the largest eigenvalue of an explicit matrix, and we have not been able to find a counter-example. The missing step for a generic result would be to show that always mixes faster than . Such a result would also immediately prove, due to Proposition 3, that the collapsed Gibbs sampler has lower mixing time than the Gibbs sampler for arbitrary number of factors for balanced levels designs. On the other hand, numerical experimentation has also showed that certain extensions of Theorem 4 are not true. We know that the convergence rate of the collapsed Gibbs sampler can be larger than when ; we also know that the rate will depend on the order that the different components are updated. We return to these points in the Discussion.

We close the section with some asymptotic considerations on the complexity. The following arguments assume that the mixing time of the Gibbs sampler is the conjectured ; we will not consider the collapsed Gibbs sampler in the following considerations since we do not have conjecture for its rate when . The asymptotic behaviour of the Gibbs sampler mixing time depends on the regime under consideration as it was for balanced cells designs. The mixing time can be as bad as , for example if the number of levels of at least one factor is fixed as grows; it can be in the regime where and ; but it can also be in the sparse observation regime where . The Appendix discuss the computational cost per iteration, which for these designs can grow quadratically with the number of parameters, as opposed to linearly in the case of balanced cells. In terms of its growth with the observations, this can be , in infill asymptotics regimes where the number of levels of factors does not grow with ; it can be when and ; but it can also be in the sparse regime . Connecting now to the observation in Gao & Owen (2017), we obtain that for when and , the complexity of the Gibbs sampler is , hence the algorithm is not scalable.

6 Simulation Studies

6.1 Simulated data with missingness completely at random

First we consider simulated data with and . We assume data to be missing completely at random, where for each combination of factors we observe a datapoint, i.e., , with probability 0.1 independently of the rest, and otherwise we have a missing observation, i.e., . Since the mixing time of the samplers under consideration does not depend on the the value of the observations , but only on their presence or absence, we can set without affecting the computed convergence rates. In this context our theory does not apply directly because the designs under consideration are not balanced in general. However, we can still compute numerically the convergence rate of the Gibbs Sampler and its collapsed version in the context of known precisions, using the results discussed in Section 5.3, to explore to which extent the qualitative findings of our theory still apply. Figure 1 displays the behaviour of the mixing time of the Gibbs Sampler and its collapsed version in an outfill asymptotic regime, where both the number of datapoints and factor levels increase. For the simulations we fixed the precision terms to 1 and take in the set . The results suggest that the mixing time of the Gibbs Sampler diverges with , while the mixing time of its collapsed version converges to 1 as increases. This is coherent with the theoretical results of previous section. In fact, we can compare the mixing times that we computed numerically with the theoretical values computed as if the design were balanced levels, which of course it is not here. The figure shows an extremely close match, which showcases the use of our theory beyond the specific designs that have facilitated the analysis. This suggests that the theory previously developed is relevant beyond cases that strictly satisfy balanced levels. Since the cost per iteration of both samplers is , the results in Figure 1 suggest that, for the asymptotic regime considered in this section, the computational complexity of the Gibbs Sampler is and the one of the collapsed Gibbs Sampler is .

6.2 ETH Instructor Evaluations dataset

We now consider a real dataset containing university lecture evaluations by students at ETH Zurich. The dataset is freely available from the R package lme4 (Bates et al., 2015) under the name InstEval. It contains observations, each corresponding to a score ranging from 1 to 5, assigned to a lecture together with 6 factors potentially impacting such score, such as identity of the student giving the rating or department that offers the course. See the lme4 help material for more details on the dataset. We fit model (1) to the InstEval dataset. Following the notation in (1), we have , and . Clearly, a categorical response calls for a generalised linear model extension of (1), however the point of this analysis is to test the algorithms, and (1) is not an outright unreasonable model to fit for this dataset.

First we consider the known precision case, where the values are assumed to be known. In this context our theory does not apply directly because the design of the dataset is not balanced. However, we can still compute numerically the convergence rate of the Gibbs Sampler and its collapsed version, using the results discussed in Section 5.3. Consider first a two-factor case, by restricting our attention to the first two factors. In this case, setting , the mixing times of the Gibbs Sampler and its collapsed version are, respectively, and . Such values are numerical approximations obtained by computing the autoregressive matrix explicitly and then using the power method to approximate the size of its largest eigenvalue. If instead we consider the first and the last factor, thus having and , the mixing times of the Gibbs Sampler and its collapsed version become, respectively, and . This is coherent with our theory, which suggests that the presence of a factor with a small number of levels should severely slow down the Gibbs Sampler while not affecting the collapsed version. We can compute the mixing time implied by Theorem 4, even if the design is not balanced levels; the numbers we obtain for the Gibbs sampler are and and the one for the collapsed Gibbs Sampler are and , depending on whether or , respectively. All values match closely the values obtained numerically. This suggests that the theory previously developed can be highly informative also for unbalanced cases, provided the level of unbalancedness in the design is moderate. Finally, if we fit the whole dataset, with and for , the mixing times of the Gibbs Sampler and its collapsed version are, respectively, and . The mixing time of the Gibbs sampler implied by our theory, which again does not apply in this design, is , which is accurate again. In this case the mixing time of the collapsed Gibbs Sampler, despite being orders of magnitude smaller than the non-collapsed version, is moderately large, suggesting that the residual chain mixes slower than in the other examples.

Next consider the case of unknown precisions, where the hyperparameters are given a prior distribution and the posterior of interest is the joint distribution of and . We consider five Markov chain Monte Carlo schemes. The first two schemes alternate sampling from the conditional distribution and updating

with the Gibbs Sampler and its collapsed version, respectively. These are the most straightforward extensions of the samplers studied above to the unknown precisions case. Provided conjugate priors are used, the update

is trivial as the precision terms are conditionally independent given . The third and fourth schemes combine the first and second schemes, respectively, with the parameter expanded data augmentation methodology (Liu & Wu, 1999; Meng & Van Dyk, 1999). In the context under consideration, the parameter expanded methodology seeks to avoid issues related to potential correlation between the two blocks and by introducing appropriate auxiliary parameters, see Gelman et al. (2008) for more discussion. Finally, the fifth scheme is the No U-Turn sampler (Hoffman & Gelman, 2014), a state-of-the-art Hamiltonian Monte Carlo scheme implemented in the R package RStan (Stan Development Team, 2018). For the precision parameters, we used a standard flat prior , mainly to facilitate the implementation of parameter expanded methodologies. In order to avoid potential issues related to using flat priors with a very low number of factor levels, we excluded the factor with only two levels from the analysis, resulting in and .

Scheme time per Effective Sample Size / time (1/s)
1000 iter.
GS 13.2s (0.07, 11.0, 2.12, 0.16, 0.21, 0.87) (60.9, 15.9, 36.8, 3.56, 2.53, 2.14)
cGS 14.2s (65.9, 42.1, 18.1, 70.5, 62.3, 35.0) (55.1, 14.7, 34.5, 17.7, 33.9, 2.51)
GS+PX 13.5s (0.06, 10.7, 2.02, 0.08, 0.11, 0.95) (59.6, 41.2, 43.9, 0.85, 0.58, 2.33)
cGS+PX 14.4s (62.5, 44.1, 19.9, 62.9, 63.2, 34.6) (55.1, 38.0, 41.9, 19.2, 33.0, 2.96)
HMC 1112.6s (0.11, 0.78, 0.19, 0.08, 0.25, 0.68) (0.99, 0.51, 0.99, 0.10, 0.41, 0.18)
Table 1: Comparison of sampling schemes on the InstEval data, where . GS and cGS refer to the Gibbs Sampler and the collapsed version with precision updates, while +PX indicates combination with the parameter expanded methodology. HMC referes to the RStan implementation. Numbers are averaged over 10 runs of 10000 iterations for each scheme, discarding the first 1000 samples as burn-in.

Table 1 and Figure 3 report runtimes for the five schemes together with effective sample sizes and autocorrelation functions. It can be seen that the first four schemes have similar runtimes, but the ones using the collapsed methodology proposed in this paper induce a much faster mixing compared to the others. The use of the parameter expansion methodology provides a further, very limited in this case, improvement.

Figure 3: Autocorrelation functions of and , for , on the InstEval dataset. Lines are averaged over 10 runs of 10000 iterations for each scheme.

On the other hand, Hamiltonian Monte Carlo has a cost per iteration that is two orders of magnitude larger than the other schemes, resulting in the lowest effective sample sizes per unit of computation time. Finally, to obtain a higher level sense of the practicality of the approach we pursue in this article, we also fit the same crossed effect model in a frequentist fashion using the R package lme4, which took seconds to run. All computations were performed on the same desktop computer with 16GB of RAM and an Intel processor. It is worth noting that the first four schemes were directly implemented using a high level language such as , so we would expect significant further speed-ups by using a low-level language and use of distributed computing for the precomputations needed for the Gibbs samplers.

7 Discussion

There are many directions this work can move forward. We highlight the two that are most imminent. First is to investigate the conjecture made in Section 5.5 that the mixing time of the Gibbs sampler for balanced levels designs is . If this is true we also obtain that the collapsed Gibbs sampler has always smaller rate for such designs. The second is to obtain a characterisation of the rate of the collapsed Gibbs sampler for such designs when . From numerical experimentation we know that the natural extension of the expression of Theorem 4 is not true for , hence a different line of attack is needed.

Acknowledgement

The authors would like to acknowledge helpful discussions with Art B. Owen. Papaspiliopoulos acknowledges financial support from the Spanish Ministry of Economics via a research grant. Zanella was supported by the European Research Council (ERC) throught the “New Directions in Bayesian NonParameterics” StG 306406.

Appendix

Proof of Theorems and auxiliary results

of Theorem 1.

For concreteness and without affecting the validity of the argument we assume that the algorithm updates factors and their levels in ascending order, i.e., first simulates , then , , and so on and so forth. We first establish the result for the Gibbs sampler,i.e., part 1. Note that due to the conditional independence structure the algorithm can be equivalently represented as one that samples in blocks according to the conditional laws . For each iteration , each such draw, can be transformed to and . Proposition 1 establishes that the

(12)

Appealing to the equivalence of local and global Markov properties, as in Section 3 of Besag (1974), we obtain that the processes and , obtained as functions of , are each a Markov chain with respect to its own filtration, and independent of each other.

The collapsed Gibbs Sampler case is analogous. Here the sampler iterates the updates of for . It can be easily deduced from Proposition 1 that . Therefore, transforming each draw to and , we obtain