Fitting a deeply-nested hierarchical model to a large book review dataset using a moment-based estimator

06/01/2018 ∙ by Ningshan Zhang, et al. ∙ NYU

We consider a particular instance of a common problem in recommender systems: using a database of book reviews to inform user-targeted recommendations. In our dataset, books are categorized into genres and sub-genres. To exploit this nested taxonomy, we use a hierarchical model that enables information pooling across similar items at many levels within the genre hierarchy. The main challenge in deploying this model is computational: the data sizes are large, and fitting the model at scale using off-the-shelf maximum likelihood procedures is prohibitive. To get around this computational bottleneck, we extend a moment-based fitting procedure proposed for fitting single-level hierarchical models to the general case of arbitrarily deep hierarchies. This extension is an order of magnitude faster than standard maximum likelihood procedures. The fitting method can be deployed beyond recommender systems to general contexts with deeply-nested hierarchical generalized linear mixed models.




1 Introduction

Given a dataset of books, users, and user reviews of those books, consider the problem of recommending books to users. This problem is a specific instance of the recommender system problem common in commercial applications (Adomavicius and Tuzhilin, 2005). In our context the following data are available:

  • A collection of 38,659 books, each with an author, title, genre, subgenre, and sub-subgenre, a taxonomy scraped by McAuley, Pandey and Leskovec (2015). Fig. 1 shows the first two levels of the book genre hierarchy.

  • A set of 157,638 ratings of the books made by 38,085 users, taken from The Book Crossing Dataset, an anonymized collection of book reviews harvested by Ziegler et al. (2005).

  • User age and location (continent), included with the Book Crossing Dataset.

Figure 1: First two levels of the book genre hierarchy.

Appendix A gives descriptive statistics for the dataset, including descriptions of how the ratings are distributed between the user age groups and continents. Most ratings are from users in the United States, aged 20–40 years.

Others have built recommender systems for the Book Crossing Dataset (Weng et al., 2008; Agarwal and Chen, 2010; Zhang, Cao and Yeung, 2010). Our application is unique in that we will attempt to leverage the book genre hierarchy to improve recommendations.

Rather than solving the book recommendation problem directly, we will attempt to solve a proxy problem: for each book-user pair, predict whether the user would like the book if he or she rated it. We can use the solution to the proxy problem to solve the original problem by recommending books with the highest predicted “like” probabilities. This proxy approach to the recommendation problem is common (Ansari, Essegaier and Kohli, 2000; Adomavicius and Tuzhilin, 2005); it ignores information inherent in a user’s selection of which books to review, but despite this it often gives reasonable downstream results.

One strategy for solving a recommender system problem is the content-based approach, using user attributes together with book-specific parameters to make recommendations. Another strategy is the collaborative approach, recommending books that are liked by similar users. We prefer instead a variant of the hierarchical-model-based approach advocated by Condliff, Lewis and Madigan (1999) and Ansari, Essegaier and Kohli (2000) that combines the content-based and collaborative strategies.

The simplest form of the hierarchical model approach uses a flat item hierarchy, with each item sitting directly under the root. In our context, the flat model would assign a random effect vector to each book that relates the popularity of the book to a user- and context-specific covariate vector, along with a fixed effect vector that relates book popularity to another covariate vector (possibly the same). For a particular user review of a book, the response is a binary indicator of whether the user liked the book, and each review carries the user-context covariate vectors associated with the fixed and random effects, respectively. The link between the effects, the covariates, and the response is


The random effect vectors are independent multivariate normal random vectors with mean zero and a common covariance matrix:


Eq. (1) demonstrates the content-based aspect of the model, where user attributes ( and ) are linked to preferences. Eq. (2) introduces the collaborative-based features: books with abundant data will have strongly-identified random effects ; others will have posterior means (conditional on the available review data) determined in part by the effects of similar items, through the covariance matrix .

In our application, we have a richer hierarchy, with books nested under author, sub-subgenre, subgenre, and genre. We can exploit this hierarchy in the model formulation by allowing for a random effect vector at each node in the hierarchy, not just at the leaves. Doing so allows for information pooling in random effect estimates across the levels in the hierarchy. We elaborate on this benefit in Section 2.

Despite its appeal, the hierarchical modeling and more general mixed modelling approaches to recommender systems have long been considered infeasible at commercial scale due to the high computational demands of fitting the model (Agarwal, 2008; Naik et al., 2008). Recent progress has expanded the scope of application of these models: Gao and Owen (2016a, b) proposed a moment-based approach for estimating the parameters of a crossed effects model; Perry (2017) proposed a moment-based approach for fitting a flat hierarchical model; Tan et al. (2018) proposed a kernel-based approach for fitting a linear flat hierarchical model; and Zhang et al. (2016) developed a parallelized maximum likelihood fitting algorithm that can exploit multiple computing cores.

To fully exploit the deeply-nested book hierarchy in a computationally efficient manner, we will in the sequel develop a moment-based fitting method for hierarchical models of arbitrary depth. Before doing so, in Section 2 we elaborate on the benefits of using a hierarchical model in our context. Next, in Section 3, we introduce the details of our model, using a framework suitable for describing general hierarchical models. Our fitting method proceeds in two passes. In the first pass, we use the available data to get initial parameter estimates at the leaves of the tree, and then we propagate information in these estimates up to the root. In the second pass, we use the accumulated information to refine the estimates, working back down from the root to the leaves. We describe these procedures in Section 4 and Section 5, respectively. After investigating the performance of our method in simulations in Section 6, we apply the procedure to our dataset in Sections 7 and 8. We conclude with a short discussion in Section 9.

Our fitting procedure is implemented in the mbest R package, available at

2 Local and global approaches

This hierarchical modeling approach interpolates two extremes: a “global” and a “local” approach. The global approach would have a single parameter vector shared by all books, fit by lumping the reviews for all books together. The local approach would have a different parameter vector for each book, fit in isolation using only the reviews of the corresponding book. The global approach corresponds to setting the random-effect covariance matrix to zero for all books; the local approach corresponds to treating the book-specific effects as nonrandom.

The appeal of the hierarchical model can be demonstrated by comparing the performance of a global and a local model. We perform this comparison at the author level: the global model will have a single parameter vector shared by all authors; the local model will have a different parameter vector for each author. Both models use the same set of covariates, described later in Section 3. We fit both models on a training set, then evaluate on a test set. In the global model the probability that a user likes an item is described as

where is the vector of user covariates. The predictions from this model are the same for all authors. In the local model, the probability that the user likes book  is described as

This takes a similar form to the global model, but the coefficients depend on the book author .

Table 1 shows the test set misclassification rates for the two fitted models, grouped by the number of ratings per author. For authors with 100 or fewer ratings, the global model performs better on average than the local model. Only in the group of authors with a large number of ratings (more than 100) does the local model perform better.

# Author Ratings   Local Error (%)   Global Error (%)   Local − Global (%)
[10,20]            38.52 (0.53)      35.16 (0.52)        3.36
(20,50]            37.15 (0.42)      34.91 (0.41)        2.24
(50,100]           34.76 (0.48)      34.20 (0.48)        0.55
(100,1000]         32.15 (0.40)      32.32 (0.40)       -0.16

Table 1: Average misclassification rates of the local (author-specific) and global models. (Standard deviations in parentheses.)

What is needed is an adaptive model that can interpolate between these two extremes: for authors with abundant data, use the local model; for authors with little or no data, use the global model; for others, use some combination of the two models.

The hierarchical model achieves the interpolation between local and global models automatically. For items  with abundant data, the posterior distribution (conditional on the data) for the random effect vector  is concentrated around the local coefficient estimate that uses only the reviews for item . For items with no data, the posterior distribution for is diffuse, with mean zero; the predictions for item  are determined mostly by the fixed effect vector  shared globally by all items. As the number of reviews for item  increases, the posterior distribution for and the corresponding predictions for item  interpolate between these local and global extremes.

The flat item hierarchy shares information across all items. In our application, we have a deep hierarchy of books. In Section 3, we will show how to exploit this deep hierarchy by using a model with a random effect vector at each node in the hierarchy. Predictions for the items at the leaves involve the random effects on the nodes on the path from the root of the hierarchy to the item. Information pooling occurs at all levels of the hierarchy, with siblings in a subtree pooled to estimate the posterior distribution of their random effects. A hierarchy node whose subtree has abundant data will have an estimated random effect close to what a separately fitted local model would produce. For other nodes, the estimate will involve information-pooling across other nodes at the same level in the tree.
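The interpolation between the local and global extremes can be illustrated with a one-dimensional Gaussian random-intercept sketch. This is a deliberate simplification of our multivariate logistic model, and the function and parameter names are ours, not the paper's:

```python
def posterior_item_mean(ybar, n, beta, sigma2_u, sigma2_e):
    """Posterior mean of the item-level mean beta + u_j, given n
    observations with sample mean ybar, prior u_j ~ N(0, sigma2_u),
    and observation noise variance sigma2_e."""
    if n == 0:
        return beta                              # no data: global prediction
    w = sigma2_u / (sigma2_u + sigma2_e / n)     # shrinkage weight in [0, 1)
    return beta + w * (ybar - beta)

beta, s2u, s2e = 0.0, 1.0, 4.0
print(posterior_item_mean(2.0, 0, beta, s2u, s2e))     # 0.0 (global extreme)
print(posterior_item_mean(2.0, 1, beta, s2u, s2e))     # 0.4 (a blend)
print(posterior_item_mean(2.0, 1000, beta, s2u, s2e))  # near 2.0 (local extreme)
```

As the number of observations grows, the weight on the local sample mean tends to one, recovering the local model; with no observations the prediction falls back to the global mean.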

3 Modeling framework

For our proxy problem, the goal is to estimate, for a given book and user, the probability that the user would like the book conditional on the user rating the book. We have a set of user and context covariates for each review, and a response indicating whether the user liked the book. We also have a deeply nested hierarchy of books nested under author, sub-subgenre, subgenre, and genre. In what follows, we describe a model that leverages the book hierarchy in a way that facilitates information pooling for the estimates across the levels in the hierarchy.

To introduce the model, we first need to be more precise about what we mean by a hierarchy in the context of our problem and other similar settings. For us, a “depth- hierarchy” is a tree where all leaves have depth . In such a hierarchy, we label the nodes of the tree by unique strings of natural numbers:

  • the root of the hierarchy gets labeled by the empty string, denoted ;

  • the children of the root get labeled by the length-1 strings ;

  • in general, if is the label of a node, we let denote its number of children; we label these children by the strings gotten by concatenating the label with the child identifiers: .

In this labeling scheme, each node other than the root has a label that can be represented as , where is the node’s parent and is a natural number in the range .

For a node , we denote its depth by , equal to its distance from the root; this is also equal to the length of its label. For any depth in the range , we let denote the set of nodes with depth . Finally, for node of depth , and for , we let denote its ancestor at depth  in the hierarchy, setting .
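A minimal sketch of this labeling scheme, representing each node label as a tuple of child indices (the helper names are ours, not the paper's):

```python
# Node labels are tuples of natural numbers; the root is the empty tuple.

def depth(label):
    """Depth of a node equals the length of its label."""
    return len(label)

def children(label, n_children):
    """Labels of a node's children: concatenate the child identifier."""
    return [label + (r,) for r in range(1, n_children + 1)]

def ancestor(label, k):
    """Ancestor at depth k (0 <= k <= depth); k = depth gives the node itself."""
    return label[:k]

root = ()
print(children(root, 3))     # [(1,), (2,), (3,)]
node = (2, 1, 3)             # a depth-3 node
print(depth(node))           # 3
print(ancestor(node, 1))     # (2,)
```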

In the context of our application, the leaves of the hierarchy are authors. The internal nodes are genres and subgenres. We observe data for each author (ratings for that author by users); our goal is to relate these ratings to the user covariates, using the structure of the book hierarchy to inform our predictions.

In general terms, the hierarchical model supposes that at each leaf node we observe a response vector of length . The behavior of this response vector is linked to observable covariates through a vector of nonrandom fixed effects identified with the root of the hierarchy and a set of random effects identified with the nodes in the hierarchy on the path from the root to the leaf . Different levels of the hierarchy may use different sets of user and book features to predict the user’s probability of liking the book. We denote these features by , where , a matrix of dimension , contains the features used by level- of the hierarchical model to predict the response vector . To link these features to the response, we posit the existence of a fixed effect vector of dimension identified with the root of the hierarchy, along with a random effect vector at every other node in the tree such that has dimension when is at depth . The distribution of the response is determined by some function of the linear predictor , defined as


This predictor involves the fixed effect vector and the random effects of all nodes on the path from the leaf to the root.
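As a concrete sketch, the linear predictor can be computed by walking the ancestors of the leaf. The dimensions and names below are hypothetical, chosen only to illustrate the sum over the path:

```python
import numpy as np

def linear_predictor(x, beta, u, leaf):
    """x[k]  : covariate vector used at level k (k = 0 is the root),
    beta : fixed effect vector (root),
    u    : dict mapping node labels (tuples) to random effect vectors,
    leaf : tuple label of the leaf."""
    eta = x[0] @ beta                  # fixed-effect contribution
    for k in range(1, len(leaf) + 1):
        eta += x[k] @ u[leaf[:k]]      # random effect of the ancestor at depth k
    return eta

beta = np.array([1.0, -1.0])
u = {(1,): np.array([0.5]), (1, 2): np.array([-0.25])}
x = [np.array([1.0, 2.0]), np.array([2.0]), np.array([4.0])]
print(linear_predictor(x, beta, u, (1, 2)))   # -1 + 1 - 1 = -1.0
```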

In our application, we have a binary response vector with entries indicating whether the user liked the book. We use the canonical logistic link, supposing that for , this predictor relates to the response as


where without subscript denotes the collection of all random effects. We further suppose that the components of the vector are independent of each other conditional on . In an application with a continuous response vector we would instead typically specify that

has independent Gaussian components with mean and variance given by


for and for some dispersion parameter . The hierarchical model is not limited to these two settings, and in principle a modeler could specify any link between the linear predictor and the mean of the response .

To endow our model with a mechanism that allows borrowing strength across similar items in the hierarchy, we model the populations of random effects at the levels of the hierarchy. We treat these populations as independent. For the population of level- random effects, we suppose that each item is an independent draw from a multivariate normal distribution with mean zero and some common covariance matrix:


We further suppose that all random effects are independent of each other.

In the sequel, we discuss estimation for the depth- hierarchical model. That estimation proceeds in two stages: first, estimate the model parameters and . Next, use the model parameters to get empirical Bayes estimates of the random effects . The empirical Bayes estimation procedure is the part of the model estimation that leverages information across different levels of the hierarchy. Our estimates of the covariance matrices

allow us to impute components of particular random effects vectors when we only have information about a subset of their components.

4 Fitting procedure

Frequentist hierarchical models like the one described in the previous section often get fit via maximum likelihood (Bates et al., 2013). However, these fitting algorithms can be prohibitively slow for large data sets like those that appear in commercial-scale settings (Agarwal, 2008; Naik et al., 2008). Perry (2017) overcame the computational hurdle in a depth-1 hierarchical model by using a moment-based estimation procedure, adapted from an earlier procedure due to Cochran (1937). Here, we will extend Perry's (2017) procedure to handle hierarchies of arbitrary depth.

Throughout the section we will assume that the model described in (3) and (6) is in force. For the response data and the leaf nodes, we will allow for both the logistic regression case from (4) and the normal response case from (5). Our estimators extend naturally to any generalized linear model at the leaves with a shared dispersion parameter.

The estimation procedure is easiest to describe if we reparametrize. To do so, for any node , let denote the vector of fixed and random effects on the path from the root up to and including . Specifically, set and for node with parent define recursively

For depth , let be the total number of fixed and random effects up to and including depth :

if , then has components.

Now, for leaf node define the matrix gotten by concatenating the columns of feature matrices by , so that has dimension . In this reparametrized form, the linear predictor at leaf node  is


This form makes the hierarchical model look somewhat like a standard generalized linear model, but the effect vector includes both nonrandom and random components: the fixed effect vector and the random effects on the path from the root to the leaf .

The estimation procedure for the hierarchical model works by repeatedly pruning the tree, reducing the leaves to a set of estimates at their parents. A high-level description of the procedure is as follows:

  1. Produce estimates of at each leaf node . Set .

  2. We have at hand estimates of for each node . For each , combine the child estimates, for , to produce an estimate of and an estimate of .

  3. Combine estimates for to produce a final estimate .

  4. If , set to be the final estimate of the fixed effects and stop. Otherwise, go to Step 2 with the level decreased to .

In settings with a dispersion parameter , we handle this parameter analogously to .

When the fitting procedure terminates we will have final estimates and of the fixed effects and the random effect covariance matrices. The rest of this section is devoted to detailing the individual steps of the fitting procedure.
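The control flow of Steps 1 through 4 can be sketched as follows, with simple scalar averages standing in for the actual moment-based estimators. All names and the toy "estimators" here are illustrative only; the point is the recursion from the leaves up to the root:

```python
from statistics import mean, pvariance

def fit_bottom_up(levels, leaf_data):
    """levels[t] maps each depth-t node (a tuple label) to its children;
    leaf_data maps each leaf to its local estimate (Step 1)."""
    est = dict(leaf_data)                       # Step 1: leaf estimates
    cov = {}
    t = len(levels)                             # current depth of the estimates
    while t > 0:
        sigmas = []
        for v, kids in levels[t - 1].items():   # Step 2: prune one level
            est[v] = mean(est[c] for c in kids)
            sigmas.append(pvariance([est[c] for c in kids]))
        cov[t] = mean(sigmas)                   # Step 3: combine level-t estimates
        t -= 1                                  # Step 4: recurse or stop
    return est, cov                             # est[()] plays the fixed-effect role

levels = [{(): [(1,), (2,)]},
          {(1,): [(1, 1), (1, 2)], (2,): [(2, 1), (2, 2)]}]
leaves = {(1, 1): 1.0, (1, 2): 3.0, (2, 1): 5.0, (2, 2): 9.0}
est, cov = fit_bottom_up(levels, leaves)
print(est[()])                                  # 4.5
print(cov)                                      # {2: 2.5, 1: 6.25}
```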

4.1 Step 1: Estimate parameters at the leaves

The first step in the estimation procedure is to use the data at each leaf to produce an estimate of , the vector of fixed and random effects on the path from the root to the leaf. Recall that . We will explicitly handle cases where the combined predictor matrix is rank-degenerate. We will only require that, conditional on , the estimate has negligible bias outside the null space of and is approximately normally distributed with known covariance matrix.

First we handle the normal model (5), where for the components of the response satisfy for a mean-zero Gaussian error vector with independent components for having unknown variance . In this case we set

where denotes pseudo-inverse. When has full rank, the estimate is the unique least-squares estimate of ; otherwise, the least-squares estimate is not unique and is one of the vectors minimizing the squared Euclidean norm .

To define the estimate of the dispersion parameter , we let denote the rank of . When we set

otherwise, we set . We combine the estimates across all the leaves to get a single estimate for the dispersion parameter:

Next we derive the properties of the estimate . First, let

denote a compact singular value decomposition where

is a diagonal matrix of dimension with positive diagonal entries. Then


where is a mean-zero Gaussian random vector of independent components, each with variance . If we set , then the quantity is approximately mean-zero normal with identity covariance matrix.

For the logistic regression model (4), we proceed analogously, but we use Firth's bias-reduced estimator (Firth, 1993) in place of the least squares estimator for . This is a refinement of the maximum likelihood estimator that is well-defined even when the responses are perfectly separated by a linear combination of the predictors. In cases where is rank-degenerate, there are multiple such estimators; we arbitrarily take to be one of them. The properties of the estimator are like those of the maximum likelihood estimator: as the sample size increases, the estimator is asymptotically unbiased with covariance equal to the inverse information matrix. In the case of a rank-deficient feature matrix , the information matrix takes the form where is the matrix of right singular vectors of . In this case if we set , then is approximately mean-zero normal with identity covariance matrix.

In both the normal and the logistic regression case, we can find an estimator and a matrix with full column rank such that conditional on , the quantity is approximately normal with identity covariance. In the logistic regression case, the quality of the normal approximation depends on the sample size being large.
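A sketch of the normal-response leaf step: the minimum-norm least-squares estimate via pseudo-inverse, a per-leaf dispersion estimate, and a pooling of the leaf dispersions weighted by residual degrees of freedom. The weighting shown is one natural choice, not necessarily the paper's exact scheme, and the leaf values in the pooling example are hypothetical:

```python
import numpy as np

def leaf_estimate(X, y):
    beta_hat = np.linalg.pinv(X) @ y        # minimum-norm least squares
    r = np.linalg.matrix_rank(X)
    n = len(y)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - r) if n > r else None
    return beta_hat, s2, n - r

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=20)
b, s2, df = leaf_estimate(X, y)
print(b)                # close to the true coefficients [1, 2, -1]

# Pool dispersion estimates across leaves, weighting by residual df:
leaf_disp = [(0.9, 10), (1.1, 30)]          # hypothetical (s2_j, df_j) pairs
s2_pooled = sum(s * d for s, d in leaf_disp) / sum(d for _, d in leaf_disp)
print(s2_pooled)        # 1.05
```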

4.2 Step 2: Combine the estimates at level

We now suppose that for some level , for each node we have a matrix of full row rank , such that conditional on ,

For linear models, these two conditions hold exactly; in nonlinear models these will only hold approximately, with the quality of the approximation depending on the size of the sample used to estimate . We will show how to combine estimates to get an estimate of and an estimate of .

Recall that for each node , where is the random effects on level , and are of length respectively. Let be a compact singular value decomposition. When the context is clear, for simplicity we denote as the first rows of , and denote as the last rows of .

We have the following (unconditional) moment equations:


The moment equations (8) and (9) hold for any node; therefore, following the standard moment-matching method, we would like to take the empirical mean of the terms on the left-hand side and set the parameters on the right-hand side to match it. However, we cannot do this directly, since the dimension, or equivalently the rank , may vary across nodes. To overcome this, we augment the moment equations so that they have the same dimension across nodes. In particular, with any choice of symmetric positive-definite matrix for each node, we have


We use the semi-weighted scheme for choosing this matrix, as described by Perry (2017).

Now, the moment equations have consistent dimension across all nodes: for every node , equation (10) has dimension , and (11) has dimension . Based on equation (10), we define the moment-based estimator as

where denotes pseudo-inverse. Based on equation (11), the moment-based estimator should satisfy

In practice, we do not have access to the true values needed to compute the above equation; instead, we use the empirical estimate. If the result is not positive semidefinite, we project it onto the cone of positive semidefinite matrices and take the projection as the final estimate.
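The projection onto the positive semidefinite cone can be carried out by zeroing out the negative eigenvalues in an eigendecomposition; a small sketch:

```python
import numpy as np

def project_psd(S):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    S = (S + S.T) / 2                       # symmetrize against numerical error
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

S = np.array([[1.0, 2.0], [2.0, 1.0]])      # eigenvalues 3 and -1: not PSD
S_psd = project_psd(S)
print(np.round(S_psd, 2))                   # [[1.5 1.5]
                                            #  [1.5 1.5]]
print(np.linalg.eigvalsh(S_psd) >= -1e-12)  # [ True  True]
```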

Let denote the eigendecomposition of the positive semidefinite matrix . Let denote the symmetric square root of . Perry's (2017) results imply that, subject to assumptions on the sample size, conditional on , the quantity is approximately normally distributed with

The error in the approximation tends to zero as the number of child nodes increases. Using these facts, we can rewrite the above results as


Perry (2017) details the precise assumptions required for these results along with the quality of the approximations.

4.3 Step 3: Combine the level- covariance estimates

At the end of Step 2 in the procedure we have an estimate of for each node at this level. In Step 3, we combine these estimates to produce a final estimate by taking a weighted average over all nodes:

Nodes with higher numbers of children get more weight.
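A sketch of this combining step, with weights proportional to each node's number of children. This weighting is an assumption consistent with the description above; see Perry (2017) for the precise scheme:

```python
import numpy as np

def combine_covariances(estimates):
    """estimates: list of (Sigma_hat_v, n_children_v) pairs.
    Returns the child-count-weighted average of the covariance estimates."""
    total = sum(n for _, n in estimates)
    return sum(n * S for S, n in estimates) / total

S1 = np.eye(2)          # estimate from a node with 1 child
S2 = 3 * np.eye(2)      # estimate from a node with 3 children
print(combine_covariances([(S1, 1), (S2, 3)]))   # (1*1 + 3*3)/4 = 2.5 times I
```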

4.4 Step 4: Recurse or Stop

If we are at the root of the tree, so that and we have an estimate , then we terminate the estimation procedure by setting our final estimate of the fixed effects to . Otherwise we decrement the level and return to Step 2, using the estimates from equation (12).

5 Empirical Bayes random effect estimates

At the end of the fitting procedure described in Sec. 4 we have an estimate of the fixed effect vector and estimates of the random effect covariance matrices. We also have at each node in the hierarchy a preliminary estimate of the fixed and random effects on the path from the root to that node. These preliminary estimates do not share information across the hierarchy; each estimate is determined only from the data at the leaves descending from its node. We can improve the estimates by replacing each with an empirical Bayes estimate that pools information across the hierarchy.

The information-pooling algorithm works top-down from the root. It starts by setting . Then at depth-1 nodes , the procedure uses and together with to get a refined estimate . This process repeats, level by level, until we get refined estimates at the leaves.

The full procedure is as follows:

  1. Set and set .

  2. If , stop.

  3. For each node we have a refined estimate . For each child for , we have a preliminary estimate . Use together with and to produce a refined estimate .

  4. Increment to and go to Step 2.

After applying this procedure, we have a refined estimate at each node in the tree. The estimates at each leaf can be used to make refined estimates of the linear predictors, or they can be used to make predictions for new data.

To complete the description of the procedure we need to explain Step 3 in more detail. In this step we have at our disposal , , and for node and its children. Further, we have a matrix of full column rank such that is approximately distributed as a multivariate normal with identity covariance matrix.

By definition, where is the random effect vector for node , and are of length respectively. We denote as the first columns of , and denote as the last columns of . Conditional on , we have the following (approximate) Bayesian linear regression model: for any node


The empirical Bayes estimate is an estimate of the posterior mean of conditional on the observed data , gotten by using plug-in estimates and for and .

To derive the posterior distribution, define , noting that the conditional distribution is the same as that of . By Bayes' rule, then, the posterior density of satisfies

The posterior distribution, then, is that of a multivariate Gaussian with expected value given by

The empirical Bayes estimate of comes from using this expression in conjunction with plug-in estimates for and :

The refined estimate of , then, is .
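The plug-in computation has the form of a standard Bayesian linear regression posterior mean. A generic sketch, under an assumed N(0, Sigma) prior observed through a design matrix F with identity noise covariance (the names and dimensions are ours, standing in for the paper's symbol-elided expression):

```python
import numpy as np

def posterior_mean(F, z, Sigma):
    """Posterior mean of u ~ N(0, Sigma) given z = F u + e, e ~ N(0, I):
    (Sigma^{-1} + F'F)^{-1} F'z."""
    A = np.linalg.inv(Sigma) + F.T @ F
    return np.linalg.solve(A, F.T @ z)

Sigma = np.eye(2)
F = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
u_true = np.array([0.5, -0.5])
z = F @ u_true                  # noiseless observation, for illustration
print(np.round(posterior_mean(F, z, Sigma), 3))   # [ 0.294 -0.382]
```

The result is shrunk toward the prior mean of zero relative to the generating vector [0.5, -0.5], mirroring the information pooling described above.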

6 Simulation

To evaluate our proposed estimation method, we compare its performance with three other procedures:

  • glmer, a maximum likelihood procedure, implemented as part of the lme4 R package (Bates et al., 2013).

  • glmer.split, a data-splitting estimation procedure, which randomly splits the data set into 10 subsets, computes estimates on each of them separately using glmer, and then combines the estimates by averaging them. We implemented the procedure ourselves in R; the algorithm is based on procedures proposed by Huang and Gelman (2005), Gebregziabher et al. (2012), and Scott et al. (2013).

  • sgd, which uses stochastic gradient descent to maximize a regularized version of the likelihood. We implemented the procedure in a combination of C and R; the algorithm is based on procedures proposed by Koren, Bell and Volinsky (2009) and Dror, Koenigstein and Koren (2011). We choose the regularization parameters by cross-validation.

In evaluating the methods, we look at both the quality of their estimates and the time it takes to compute them. We do not include the tuning parameter cross-validation time in the timing results.

We perform two sets of simulations: one for a two-level logistic regression model, and one for a two-level linear regression model. The setup and results for both simulations are similar, so we only include the logistic regression results here. Appendix B contains the linear regression results.

Following the notation in Section 3, we set the number of groups on the first level to , and the number of groups on the second level (the leaves) to . We simulate samples with ranging from to . We set the dimensions of the fixed and random effect vectors to . For each value of we draw replicates according to the following procedure.

For each replicate, we draw a -dimensional fixed effect vector with components drawn independently from a heavy-tailed Student's t-distribution with 4 degrees of freedom. We draw random effect covariance matrices and independently from an inverse Wishart distribution with shape and 10 degrees of freedom, scaled by .

We allocate the samples to the groups and subgroups in a way that approximates the highly skewed hierarchies in the Book Crossing dataset. In each replicate, we first draw sampling rates from a Pareto distribution, with scale and shape parameters set to 1. Then, we allocate the samples to the leaf nodes by drawing from a multinomial distribution with probability vector . Similarly we allocate the leaf nodes to groups using the same Pareto distribution and sampling scheme.
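This skewed allocation scheme can be sketched directly; the leaf count and sample size below are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(42)
n_leaves, N = 100, 10_000
rates = rng.pareto(1.0, size=n_leaves) + 1.0   # classical Pareto, scale 1, shape 1
p = rates / rates.sum()                        # normalize into a probability vector
counts = rng.multinomial(N, p)                 # allocate N samples to the leaves
print(counts.sum())                            # 10000
print(float(counts.max() / counts.mean()))     # ratio well above 1: heavy skew
```

Note that NumPy's `pareto` draws from the Lomax (shifted Pareto) distribution, so adding 1 recovers the classical Pareto with scale parameter 1.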

For every group node in the first level of the hierarchy, , we draw a -dimensional random effect vector from multivariate Gaussian with mean zero and covariance matrix ; for every leaf node , we draw a -dimensional random effect vector from multivariate Gaussian with mean zero and covariance . Then we randomly draw fixed effect predictor vectors for sample point , with independent elements taking values  and  with probability each. We use the same procedure to randomly draw random effect predictors for every sample point , and let the two levels of the hierarchy share the same random effect predictors. Finally, for every sample in leaf node , we draw response as Bernoulli with success probability

To evaluate the quality of the estimators, we use the following loss functions:

  • Fixed Effect Loss: ;

  • Random Effect Level- Covariance Loss: ;

  • Random Effect Level- Loss:

  • Prediction Loss:

    where for sample  in leaf .

We also measure the overall computation time for each method, excluding the cross-validation time for tuning parameter selection.

Figure 2: Performance for the multilevel logistic regression model. Circle radii indicate one standard error along the vertical axis (absent when smaller than the line width).

We compare our method (mhglm) with the three other methods described above: glmer, glmer.split, and sgd. Figure 2 shows the mean performance for each method, averaged over replicates, with circle radii indicating standard errors along the vertical axes. For moderate to large sample sizes, there is a noticeable difference between the proposed method and the other maximum-likelihood-based estimators. However, the proposed method still appears to be consistent, in the sense that its estimates improve with more samples. In terms of prediction loss, our proposed method outperforms both sgd and glmer.split and is only slightly worse than glmer.

The bottom panel compares the computation time for all methods. For large sample sizes, our proposed method is much faster than the other procedures, by a factor ranging from to , and the factor appears to grow as sample sizes increase.

In the context of this simulation, our proposed method trades a modest loss in prediction performance for a dramatic decrease in computation time. This suggests that our procedure will scale well to our book recommendation context and to commercial recommendation settings generally.

7 Application

Having developed an estimation procedure for deeply-nested hierarchical models in Secs. 4 and 5, and having established its suitability in Sec. 6, we now return to our main application: fitting a model that predicts whether a user would like a book, had he or she rated it.

Recall from Sec. 1 that our dataset consists of two parts: a set of user ratings of books, and a hierarchy of these books. We treat each rating as an observation containing book and user identifiers along with a numerical score between 1 and 10. To smooth over differences between user-specific rating scales, we binarize the ratings, treating numerical scores of 8 or above as “positive” and scores below this as “negative”. We model these binarized ratings using a hierarchical logistic regression model, using the user demographic features together with the rating context to construct candidate predictors, linked to the response through fixed and random effects. We use a subset of the book hierarchy for the structure of the hierarchical model.
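The binarization step can be sketched directly from the threshold stated above:

```python
def binarize(score):
    """Map a 1-10 numerical rating to a binary label:
    scores of 8 or above are 'positive' (1), below are 'negative' (0)."""
    return 1 if score >= 8 else 0
```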

7.1 Candidate predictors

The first set of candidate predictors is derived from user demographic data. We bin users’ ages into 5 groups: (0,26], (26,32], (32,38], (38,47], and (47,101], where each group has approximately the same number of ratings. Each age group is represented by one indicator variable; if age is missing, all five indicators are zero. We aggregate the geographic feature into 6 groups by continent: North America, Europe, Oceania, Asia, South America, and Africa, with each group again represented by one indicator variable. This gives a total of 11 demographic predictors.
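The demographic encoding above can be sketched as follows; the bin edges and continent list come from the text, while the function names are ours:

```python
# Age bins (lo, hi] and continent groups as described in the text.
AGE_BINS = [(0, 26), (26, 32), (32, 38), (38, 47), (47, 101)]
CONTINENTS = ["North America", "Europe", "Oceania", "Asia",
              "South America", "Africa"]

def age_indicators(age):
    """5-component indicator vector for the age bins;
    all zeros when age is missing (None)."""
    if age is None:
        return [0] * len(AGE_BINS)
    return [1 if lo < age <= hi else 0 for lo, hi in AGE_BINS]

def continent_indicators(continent):
    """6-component indicator vector for the user's continent."""
    return [1 if continent == c else 0 for c in CONTINENTS]
```

Together these give the 11 demographic predictors (5 age indicators plus 6 continent indicators).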

Our second set of candidate predictors consists of time-varying predictors defined as functions of past user behavior. The first predictor, prev, is a user-specific indicator of whether the user’s previous rating was positive, designed to capture a user’s propensity to make positive ratings. The second predictor, dist, is user-category specific, computed as the smoothed log proportion of past ratings that the user gives in each category. This is designed to capture a user’s tendency to rate each category, revealing their relative preference among all categories. Table 2 gives detailed descriptions of all the predictors.

Predictor Description
Age User-specific features: a 5-component indicator vector for age range (0,26],(26,32],(32,38],(38,47],(47,101]
Geographic User-specific features: a 6-component indicator vector for continent Africa, Asia, Europe, North America, Oceania, South America
Previous User-specific feature: a smoothed estimate of the log proportion of positive ratings from the user: , where  and  are the number of positive ratings () and the total number of ratings from the user.
Distribution User-book-specific feature: a smoothed estimate of the log proportion of the user’s ratings in the book’s genre: , where  is the number of ratings the user gives in the book’s genre,  is the total number of ratings from the user, and  is the total number of genres.
Table 2: Predictors associated with one observation from a user on a book.
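The two behavioral predictors can be sketched as smoothed log proportions. The add-one (and add-K) smoothing constants below are assumptions for illustration; the paper's exact smoothing is not reproduced:

```python
import math

def prev_feature(n_positive, n_total):
    # Smoothed log proportion of a user's past positive ratings.
    # Add-one smoothing is an assumed choice, not the paper's exact formula.
    return math.log((n_positive + 1) / (n_total + 2))

def dist_feature(n_in_genre, n_total, n_genres):
    # Smoothed log proportion of the user's past ratings that fall in the
    # book's genre, with n_genres possible genres (assumed add-one smoothing).
    return math.log((n_in_genre + 1) / (n_total + n_genres))
```

With no history, both features fall back to a uniform prior: a new user gets `log(1/2)` for prev and `log(1/K)` for dist over K genres.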

7.2 Model selection

We perform two forms of model selection. First, we need to choose which parts of the book hierarchy to use. Second, we need to choose which predictors to use. To carry out the model selection, we randomly partition the data into 80/10/10 percent chunks for training/development/testing sets. We train candidate models on the training set, select the model with the best performance on the development set, and finally compare the chosen model with other fitting methods on the test set. During data processing, we use only the training data to construct the new features.
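The 80/10/10 partition can be sketched as a simple shuffle-and-slice (the seed and function name are ours):

```python
import random

def train_dev_test_split(records, seed=0):
    """Randomly partition records into 80/10/10 train/dev/test chunks."""
    rng = random.Random(seed)
    shuffled = records[:]          # leave the caller's list untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])
```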

We use all predictors as fixed effects, which we can fit reliably given the large volume of data. However, we do not use all of these predictors at every random effect level. Fitting the random effects is much more difficult, because it relies on ratings specific to a particular position in the hierarchy. The population structure of the random effects mitigates some of this data sparsity, but there will still be situations where using a coarser hierarchy makes the model less susceptible to over-fitting. To guard against overfitting in the random effect terms of the model, we perform model selection using out-of-sample prediction performance on the development set: we fit all possible combinations of random effect predictors at all levels of the model and select the model with the lowest misclassification rate on the development set.

We start with depth-1 models, fitting all five possible grouping factors: genre, subgenre, sub-subgenre, author, and book. Table 3 lists the best one-level model for each grouping factor, sorted by misclassification error on the development set. Using author as the grouping factor with demographic information as the random effect features gives the best prediction performance on the development set. Note that we do not get an additional performance improvement from a book-specific random effect model, which suggests that using too many groups could over-fit the data.

Group Features Error
author geo 0.3212
book age, geo 0.3235
sub-subgenre prev, dist, age, geo 0.3282
subgenre prev 0.3288
genre 1 0.3291
Table 3: Best performing model for each choice of grouping factor for the one-level model. The standard deviations of the listed model errors are below

To further exploit the nested taxonomy, we also consider depth-2 models. Using the lme4 modeling notation, we fit all models of the following form:

where grouping level  is nested under , and  and  are predictor matrices with columns taken from the candidate predictors. The notation indicates that the model has fixed effects corresponding to an intercept and the predictors age, geo, prev, and dist, random effect predictors  at the first level, and random effect predictors  at the second level.

Group 1 Group 2 Level-1 Features Level-2 Features Error
subgenre author dist, age age, geo 0.3177
sub-subgenre author age dist, geo 0.3184
genre author dist, geo age, geo 0.3189
author book geo dist 0.3210
sub-subgenre book dist, age, geo geo 0.3212
subgenre book prev, dist, age age, geo 0.3218
genre book age, geo age 0.3226
subgenre sub-subgenre age, geo dist 0.3266
genre sub-subgenre dist, age dist, geo 0.3269
genre subgenre prev, dist prev 0.3287
Table 4: Best performing model for all choices of grouping factors for two-level model. The standard deviations of the listed model prediction errors are below

We list the best performing depth-2 model for every combination of grouping factors in Table 4, sorted by misclassification error on the development set. The feature 1 denotes the all-ones feature (i.e., the intercept term). Note that adding a subgenre hierarchy on top of author decreases the misclassification rate from 0.3212 to 0.3177. This improvement in predictive performance may seem small, but in practice such improvements can translate to big impacts when the corresponding models are deployed in commercial-scale recommender system applications (Kramer, Guillory and Hancock, 2014; Kohavi et al., 2014). Thus, the small improvement of the two-level model over the one-level model can be meaningful.

The best two-level model uses subgenre and author as the two grouping factors; at the subgenre level it uses dist and age as random effect features, and at the author level it uses age and geo. It is a relatively simple model with competitive performance, and we focus on this model throughout the rest of the paper.

8 Results

8.1 Performance

In Sec. 7.2, the model that gave the best prediction performance on the development set used two levels of hierarchy, corresponding to “author” and “subgenre,” with author nested within subgenre. For fixed effect predictors, the model used an intercept along with age, geo, prev, and dist. For random effect predictors at the first level in the hierarchy (subgenre), the model used an intercept along with dist and age; at the second level (author), the model used an intercept along with age and geo. Having selected the model, we now evaluate its performance on the held-out test set.

We fit the model to the training data set using our proposed moment-based procedure mhglm along with two competing methods described in Sec. 6, the glmer maximum likelihood procedure and the sgd stochastic gradient descent likelihood-based procedure, and we compare their prediction performance on the held-out test set. We do not include the glmer.split method, because glmer fails on the randomly split subsets due to sparsity.

Fitting Method Error Error 95% Confidence Interval Time (seconds)
mhglm 55.14
glmer 44790.23
sgd 2022.35
Table 5: Misclassification Error and Running Time for the Three Fitting Methods

Table 5 lists the misclassification error, its 95% confidence interval, and the running time for all three methods. All methods have comparable prediction performance, with a misclassification rate of about 32.5%. Our proposed procedure mhglm slightly outperforms the other two methods in overall misclassification error, but the difference is not statistically significant. In terms of running time, however, mhglm is faster than sgd by a factor of 45, and faster than glmer by a factor of 1000. Fitting the model using our proposed method took under a minute; fitting using sgd took 33 minutes; fitting using glmer took 12.4 hours.

The results in Table 5 demonstrate two features of the mhglm fitting procedure. First, its prediction performance is comparable to that of the more established likelihood-based procedures. Second, mhglm is faster than these methods by at least an order of magnitude. This reduction in computation time enabled us to perform an exhaustive model selection search over all 1- and 2-level models. For the four predictors and the intercept, and for the 5 grouping levels, there were  1-level models and  2-level models. Extrapolating from the timing results in Table 5, performing the search over these models using mhglm took approximately 40 hours; using sgd or glmer, the same search would have taken approximately 60 days or 3.75 years, respectively.
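The search-size arithmetic can be reconstructed under the assumption that the intercept is always included and any subset of the four predictors may appear at each random effect level; the resulting counts are consistent with the per-fit times in Table 5:

```python
# Assumed combinatorics: intercept always present, so each grouping level
# can pair with any of the 2**4 = 16 subsets of {prev, dist, age, geo}.
n_levels = 5
n_subsets = 2 ** 4

one_level = n_levels * n_subsets                   # 5 * 16 = 80 models
nested_pairs = n_levels * (n_levels - 1) // 2      # 10 nested pairings (Table 4)
two_level = nested_pairs * n_subsets * n_subsets   # 10 * 16 * 16 = 2560 models
total = one_level + two_level                      # 2640 fits in all

# Extrapolate from the per-fit times in Table 5 (seconds per fit).
for name, per_fit in [("mhglm", 55.14), ("sgd", 2022.35), ("glmer", 44790.23)]:
    days = total * per_fit / 86400
    print(name, round(days, 1), "days")
```

Under these assumptions, mhglm's 2,640 fits take about 40 hours, while sgd and glmer extrapolate to roughly 62 days and 3.75 years, matching the figures quoted in the text.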

Our proposed fitting method has enabled us to perform an exhaustive search over all 1- and 2-level hierarchical logistic regression models without sacrificing prediction performance.

8.2 Fitted model

To gain some insight into the predictions made by our fitted model, we investigate the empirical Bayes random effect estimates. Specifically, we investigate the age random effects at the subgenre and author levels.

Figure 3: Given the book’s subgenre, the increase in log odds of liking the book if the user is young (left panel) or old (right panel), with everything else held constant. Error bars show the  estimated posterior standard deviation. Both panels show the 50 subgenres with the most ratings.

In the context of the fitted model, given a book’s subgenre we can compute the increase in log odds of a user liking the book when we change the user’s age from “missing” to known while keeping all other predictors constant. In Fig. 3 we show the change in log odds ( estimated posterior standard deviation) for the young and old age groups, for the subgenres with the most ratings. For the old age group (47–101 years), the estimates have large estimated posterior standard deviations across the subgenres listed, making it difficult to identify a clear pattern. For the young age group (0–26 years) there is some weak but meaningful signal: there is a clear pattern in which common subgenres these users like and dislike. “Romance” is their favorite subgenre, with significantly higher random effects than “literature & fiction genre fiction”, their least favorite.
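A minimal sketch of this quantity: with a missing age encoded as all-zero indicators, the change in log odds from learning the user's age group is the sum of the age coefficients that the known group activates. Only one level of random effects is shown here (the author-level effect would be added analogously), and all coefficient values are hypothetical:

```python
def log_odds_increase(age_group_index, fixed_age_coefs, random_age_coefs):
    """Change in log odds when age moves from missing (all indicators 0)
    to a known age group, with every other predictor held constant."""
    return fixed_age_coefs[age_group_index] + random_age_coefs[age_group_index]

# Hypothetical coefficients for the 5 age bins (young ... old).
fixed_age = [0.3, 0.1, 0.0, -0.1, -0.2]     # made-up fixed effects
subgenre_age = [0.5, 0.0, 0.0, 0.0, -0.4]   # made-up subgenre random effects

young_effect = log_odds_increase(0, fixed_age, subgenre_age)   # 0.3 + 0.5
old_effect = log_odds_increase(4, fixed_age, subgenre_age)     # -0.2 - 0.4
```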

Figure 4: Given the book’s author, the increase in log odds of liking the book if the user is young (left panel) or old (right panel), with everything else held constant. Error bars show the  estimated posterior standard deviation. Both panels show the 50 authors with the most ratings.

Next we perform a similar analysis at the second level of the hierarchy, “author.” For every author we compute the increase in log odds of liking the book when we change the user’s age from “missing” to known while everything else remains the same. In Figure 4 we show the increase in log odds ( estimated posterior standard deviation) for the young and old age groups, for the authors with the most ratings.

We observe a few interesting patterns:

  • The estimated posterior standard deviations are much larger for the random effects at the second level (author), for both the young and old age groups.

  • Some authors are consistent across age groups. For instance, one would want to recommend J. K. Rowling to both the young and old age groups, while Nick Hornby and Piers Anthony are liked by neither group.

  • Some authors behave quite differently across age groups. For instance, Danielle Steel has a positive log odds increase if we know the user is in the young age group, but a negative increase if the user is in the old age group. Elizabeth Berg is the opposite: the log odds decrease if the user is young but increase if the user is old.

The sizes of the estimated posterior standard deviations make clear that these associations are weak. Still, as demonstrated in Secs. 7.2 and 8.1, there is enough signal in them to translate to a meaningful reduction in misclassification rate on the held-out development and test sets.

9 Discussion

The appeal of the deeply-nested hierarchical model is that it facilitates information sharing across subtrees at all levels of the hierarchy. Nodes with abundant data effectively have their random effects estimated using only the data at the leaves descending from them. Nodes with little or moderate data, however, benefit by having their estimated coefficients (random effects) shrunk towards the global mean. In our book recommendation application, we demonstrated this advantage by showing that using two levels of hierarchy (author and subgenre) delivers better prediction performance than using one level or none.

The main hurdle in deploying hierarchical models in recommender systems applications like ours, and in other contexts of similar scale, is that the time required to fit these models can be prohibitive. Perry (2017) extended a method originally due to Cochran (1937) and proposed a partial solution to this problem, but his procedure is limited to single-level hierarchical models. Here, with our proposed mhglm method, we have shown how to fit a hierarchical model of arbitrary depth by repeatedly applying the single-level fitting procedure to prune the leaves of the hierarchy. We then showed how to propagate the estimates at the root of the hierarchy down through its nodes to refine the random effect estimates.

In our book recommendation application, our proposed fitting procedure was faster than stochastic gradient descent by a factor of 45, and faster than the likelihood-based glmer procedure by a factor of 1000. This increase in computational speed enabled us to perform an exhaustive model selection search over all one- and two-level models, reducing the overall computation time from about 60 days using sgd (or 3.75 years using glmer) to about 40 hours. As our simulations in Sec. 6 demonstrated, the tradeoff in deploying our method is reduced statistical efficiency and prediction performance. However, in our application, the loss in prediction performance was negligible.

Although our motivation was a book recommendation system, our proposed fitting procedure is general enough to handle hierarchical generalized linear models of arbitrary depth. We have incorporated our implementation of this procedure into the mbest R package, available on the Comprehensive R Archive Network (CRAN). The interface in this implementation is flexible enough to handle any deeply nested hierarchical generalized linear model.

Appendix A Data description

a.1 Overview

The Book-Crossing dataset introduced in Section 1 contains 433,671 numerical ratings of 185,973 books from 77,805 users (Ziegler et al., 2005). Each rating consists of a book id (ISBN), a user id, and a numerical score between 1 and 10, where 1 indicates extreme negative and 10 indicates extreme positive sentiment. We binarize the ratings so that scores of 8 or above are considered positive and scores below are considered negative; the threshold is chosen so that the two classes have comparable numbers of samples. We have user demographic information including age and location; Section A.2 reports some descriptive statistics about these features. We also know the book authors and titles.

We augment the book meta-data with a genre hierarchy scraped from Amazon by McAuley, Pandey and Leskovec (2015). In this meta-data, book titles are nested within authors, within sub-subgenres, within subgenres, within genres. If the same author writes titles in multiple sub-subgenres, we treat the author as multiple, separate entities. Section A.3 describes the hierarchy in the meta-data.

In the raw dataset, more than half of the ratings cannot be matched to Amazon meta-data. Dealing with this missing data is beyond the scope of the present treatment, so we remove samples with missing ratings or unmatched book ids from consideration. This leaves us with 157,638 ratings of 38,659 books from 38,085 users.

a.2 User demographic features

The reported user age is a continuous variable, ranging from 15 to 100. The mean and standard deviation of user age are 36.4 and 12.6 respectively. Table 6 shows the number of users and ratings from each age range.

Age Interval # Users # Ratings
≤ 20 2,664 8,752
(20,30] 6,011 30,820
(30,40] 5,906 32,252
(40,50] 3,765 20,539
(50,60] 2,609 11,961
(60,70] 952 2,997
> 70 297 992
Table 6: Number of Users & Ratings from Each Age Range.

Most of the ratings come from young or middle-aged users, which makes it easier to estimate and predict for users in those age ranges.

A user’s location is reported as city, state, country, and continent. Table 7 reports the number of users and ratings from the 10 most-represented countries, and Table 8 reports the same information for each continent.

Country # Users # Ratings
USA 29,042 120,201
Canada 3,619 14,592
United Kingdom 989 3,622
Australia 632 2,067
Portugal 181 1,490
Germany 381 1,189
Spain 187 1,008
Malaysia 111 964
Netherlands 178 638
New Zealand 148 563
Table 7: Top 10 Countries With Most Ratings.
Continent # Users # Ratings
North America 32,722 135,059
Europe 2,632 10,122
Oceania 780 2,630
Asia 430 2,272
South America 68 210
Africa 36 87
Table 8: Number of Users & Ratings from Each Continent.

We can see that the vast majority of the ratings () are from North America, with Europe, the next-most-represented continent, receiving only  of the ratings. This makes it quite difficult to accurately estimate and predict for users from continents other than these two.

a.3 Book hierarchy

Every book is nested under a deep hierarchy:

genre → subgenre → sub-subgenre → author → title.

For example, the book Harry Potter and the Chamber of Secrets is nested as Children’s Books → Literature → Science Fiction, Fantasy, Mystery & Horror → J. K. Rowling → Harry Potter and the Chamber of Secrets. Figure 1 displays the first two levels of the hierarchy.

For modeling purposes, we chose only two of the five levels, omitting the others and using the simplified hierarchy subgenre → author. Our first level, subgenre, has 1,344 groups, which captures a useful amount of diversity across books with a reasonable number of groups. We use author (nested under subgenre) as the second level, which has 27,360 groups. We use book author instead of book title as the second level because the Book-Crossing dataset is very sparse: most books have only a few ratings. Hierarchical models do not work well when most groups have very few samples, since this shrinks the overall results towards those of a simple “global” model.

Even for these carefully chosen levels, the distributions of subgroups and samples are still highly skewed. We can see this skewness in Figure 5, which plots the quantiles of the number of authors per subgenre (left panel) and the number of ratings per author (right panel); both plots are on log scale. 50% of subgenres have fewer than 17 authors, and 90% of subgenres have fewer than 261 authors. At the other extreme, the largest subgenre (Literature & Fiction General) has 2,893 authors. The distribution of ratings among authors is highly skewed as well: 50% of authors have only 1 rating, and 90% of authors have fewer than 9 ratings, while the most-rated author (Sue Grafton) received 1,183 ratings.

Figure 5: Left panel: Quantile Plot of Number of Authors within Subgenres. Right panel: Quantile Plot of Number of Ratings of Authors.

Hierarchical models gain their predictive power by pooling information across groups. The existence of large numbers of small groups makes learning model parameters, and making good predictions, difficult.
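The group-size quantiles behind Figure 5 can be computed with a sketch like the following; the tiny example data are made up:

```python
from collections import Counter

def size_quantile(counts, q):
    """q-quantile (0 <= q <= 1) of a list of group sizes,
    using the lower nearest-rank convention."""
    s = sorted(counts)
    return s[min(int(q * len(s)), len(s) - 1)]

# Made-up ratings stream: one author label per rating.
ratings_per_author = Counter()
for author in ["A", "A", "A", "B", "C", "C", "D", "E"]:
    ratings_per_author[author] += 1

sizes = list(ratings_per_author.values())
median_size = size_quantile(sizes, 0.5)   # median ratings per author
p90_size = size_quantile(sizes, 0.9)      # 90th percentile
```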

Appendix B Two-level linear model simulations

Here we perform a simulation study similar to the two-level logistic regression model study described in Sec. 6, but using a two-level linear regression model instead.

Figure 6: Performance for the multilevel linear regression model. Circle radii indicate one standard error along the vertical axis (absent when smaller than line width)

With all other simulation parameters drawn as described in Sec. 6, in the linear regression setup we draw the response  from a normal distribution with mean  and variance  whenever sample  belongs to leaf . We again compare our procedure with the same three methods, using the same losses for the fixed and random effects as well as for the random effect covariances. For the prediction loss, we use the mean squared error:

where and .

Figure 6 shows the mean loss, averaged over replicates, with circle radii indicating standard errors along the vertical axes. For moderate to large sample sizes, there is a noticeable but decreasing difference between the proposed method and the other maximum likelihood based estimators; however, the proposed method still appears to be consistent. In terms of computation time, our method again improves on the alternatives by a factor ranging from  to , and this factor appears to grow as sample sizes increase.


  • Adomavicius, G. and Tuzhilin, A. (2005). Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE T. Knowl. Data En. 17 734–749.
  • Agarwal, D. (2008). Statistical Challenges in Internet Advertising. In Statistical Methods in e-Commerce Research (W. Jank and G. Shmueli, eds.). Wiley.
  • Agarwal, D. and Chen, B.-C. (2010). fLDA: Matrix Factorization Through Latent Dirichlet Allocation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. WSDM ’10 91–100. ACM, New York, NY, USA. 10.1145/1718487.1718499
  • Ansari, A., Essegaier, S. and Kohli, R. (2000). Internet Recommendation Systems. Journal of Marketing Research 37 363–375.
  • Bates, D., Maechler, M., Bolker, B. and Walker, S. (2013). lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1-7.
  • Cochran, W. G. (1937). Problems Arising in the Analysis of a Series of Similar Experiments. Supplement to the Journal of the Royal Statistical Society 4 102–118.
  • Condliff, M. K., Lewis, D. D. and Madigan, D. (1999). Bayesian Mixed-Effects Models for Recommender Systems. In ACM SIGIR ’99 Workshop on Recommender Systems: Algorithms and Evaluation.
  • Dror, G., Koenigstein, N. and Koren, Y. (2011). Yahoo! Music Recommendations: Modeling Music Ratings with Temporal Dynamics and Item Taxonomy. In Proceedings of the Fifth ACM Conference on Recommender Systems 165–172. ACM.
  • Firth, D. (1993). Bias Reduction of Maximum Likelihood Estimates. Biometrika 80 27–38.
  • Gao, K. and Owen, A. B. (2016a). Efficient moment calculations for variance components in large unbalanced crossed random effects models. ArXiv e-prints.
  • Gao, K. and Owen, A. B. (2016b). Estimation and Inference for Very Large Linear Mixed Effects Models. ArXiv e-prints.
  • Gebregziabher, M., Egede, L., Gilbert, G. E., Hunt, K., Nietert, P. J. and Mauldin, P. (2012). Fitting Parametric Random Effects Models in Very Large Data Sets with Application to VHA National Data. BMC Medical Research Methodology 12 1–14.
  • Huang, Z. and Gelman, A. (2005). Sampling for Bayesian Computation with Large Datasets. Unpublished.
  • Kohavi, R., Deng, A., Longbotham, R. and Xu, Y. (2014). Seven rules of thumb for web site experimenters. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1857–1866. ACM.
  • Koren, Y., Bell, R. and Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. Computer 42 30–37.
  • Kramer, A. D. I., Guillory, J. E. and Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 111 8788–8790.
  • McAuley, J., Pandey, R. and Leskovec, J. (2015). Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794.
  • Naik, P., Wedel, M., Bacon, L., Bodapati, A., Bradlow, E., Kamakura, W., Kreulen, J., Lenk, P., Madigan, D. M. and Montgomery, A. (2008). Challenges and Opportunities in High-Dimensional Choice Data Analyses. Market. Lett. 19 201–213.
  • Perry, P. O. (2017). Fast Moment-Based Estimation for Hierarchical Models. Journal of the Royal Statistical Society (Series B) 79 267–291.
  • Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I. and McCulloch, R. E. (2013). Bayes and Big Data: The Consensus Monte Carlo Algorithm. In Bayes 250.
  • Tan, Z., Roche, K., Zhou, X. and Mukherjee, S. (2018). Scalable Algorithms for Learning High-Dimensional Linear Mixed Models. arXiv preprint arXiv:1803.04431.
  • Weng, L. T., Xu, Y., Li, Y. and Nayak, R. (2008). Exploiting Item Taxonomy for Solving Cold-Start Problem in Recommendation Making. In 2008 20th IEEE International Conference on Tools with Artificial Intelligence 2 113–120. 10.1109/ICTAI.2008.97
  • Zhang, Y., Cao, B. and Yeung, D.-Y. (2010). Multi-domain Collaborative Filtering. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence. UAI ’10 725–732. AUAI Press, Arlington, Virginia, United States.
  • Zhang, X., Zhou, Y., Ma, Y., Chen, B.-C., Zhang, L. and Agarwal, D. (2016). GLMix: Generalized Linear Mixed Models For Large-Scale Response Prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16 363–372. ACM, New York, NY, USA. 10.1145/2939672.2939684
  • Ziegler, C.-N., McNee, S. M., Konstan, J. A. and Lausen, G. (2005). Improving Recommendation Lists Through Topic Diversification. In Proceedings of the 14th International Conference on World Wide Web 22–32. ACM.