1 Introduction
Given a dataset of books, users, and user reviews of those books, consider the problem of recommending books to users. This problem is a specific instance of the recommender system problem common in commercial applications (Adomavicius and Tuzhilin, 2005). In our context the following data are available:

A set of 157,638 ratings of the books made by 38,085 users, taken from The Book Crossing Dataset, an anonymized collection of book reviews harvested from bookcrossing.com by Ziegler et al. (2005).

User age and location (continent), included with the Book Crossing Dataset.
Appendix A
gives descriptive statistics for the dataset, including descriptions of how the ratings are distributed between the user age groups and continents. Most ratings are from users in the United States, aged 20–40 years.
Others have built recommender systems for the Book Crossing Dataset (Weng et al., 2008; Agarwal and Chen, 2010; Zhang, Cao and Yeung, 2010). Our application is unique in that we will attempt to leverage the book genre hierarchy to improve recommendations.
Rather than solving the book recommendation problem directly, we will attempt to solve a proxy problem: for each bookuser pair, predict whether the user would like the book if he or she rated it. We can use the solution to the proxy problem to solve the original problem by recommending books with the highest predicted “like” probabilities. This proxy approach to the recommendation system is common
(Ansari, Essegaier and Kohli, 2000; Adomavicius and Tuzhilin, 2005); it ignores information inherent in a user’s selection of which books to review, but despite this it often gives reasonable downstream results.One strategy for solving a recommender system problem is the contentbased approach, using user attributes together with bookspecific parameters to make recommendations. Another strategy is the collaborative approach, recommending books that are liked by similar users. We prefer instead a variant of the hierarchicalmodelbased approach advocated by Condliff, Lewis and Madigan (1999) and Ansari, Essegaier and Kohli (2000) that combines the contentbased and collaborative strategies.
The simplest form of the hierarchical model approach is with a flat item hierarchy, with each item sitting directly under the root. In our context, the flat model would assign a random effect vector
to each book that relates the popularity of the book to user and contextspecific covariate vector, along with a fixed effect vector that relates book popularity to another covariate vector (possibly the same). For a particular user review of a book , let be a binary indicator of whether the user liked the book, and let and be the usercontext covariate vectors associate fixed and random effects, respectively. The link between the effects, the covariates and the response is(1) 
The random effect vectors are independent multivariate normal random vectors with mean and covariance matrix :
(2) 
Eq. (1) demonstrates the contentbased aspect of the model, where user attributes ( and ) are linked to preferences. Eq. (2) introduces the collaborativebased features: books with abundant data will have stronglyidentified random effects ; others will have posterior means (conditional on the available review data) determined in part by the effects of similar items, through the covariance matrix .
In our application, we have a richer hierarchy, with books nested under author, subsubgenre, subgenere, and genre. We can exploit this hierarchy in the model formulation by allowing for a random effect vector at each node in the hierarchy, not just at the leaves. Doing so allows for information pooling in random effect estimates across the levels in the hierarchy. We elaborate on this benefit in Section 2.
Despite its appeal, the hierarchical modeling and more general mixed modelling approaches to recommender systems have long considered infeasible at commercial scale due to the high computational demands of fitting the model (Agarwal, 2008; Naik et al., 2008). Recent progress has expanded the scope of application of these models: Gao and Owen (2016a, b) proposed a momentbased approach for estimating the parameters of a crossed effects model using a momentbased approach; Perry (2017) proposed a momentbased approach for fitting a flat hierarchical model; Tan et al. (2018) proposed a kernelbased approach for fitting a linear flat hierarchical model; and Zhang et al. (2016) developed a parallelized maximum likelihood fitting algorithm that can exploit multiple computing cores.
To fully exploit the deeplynested book hierarchy in a computationally efficient manner, we will in the sequel develop a momentbased fitting method for hierarchical models of arbitrary depth. Before doing so, in Section 2 we elaborate on the benefits of using a hierarchical model in our context. Next in Section 3 we introduce the details of our model, using framework suitable for describing general hierarchical models. Our fitting method proceeds in two passes. In the first pass, we use the available data to get initial parameter estimates at the leaves of the tree, and then we propagate information in these estimates up to the root. In the second pass, we use the accumulated information to refine the estimates back down from the root back to the leaves. We describe these procedures in Section 4 and Section 5, respectively. After investigating the performance of our method in simulations in Section 6, we apply the procedure to our dataset in Sections 7 and 8. We conclude with a short discussion in Section 9.
Our fitting procedure is implemented in the mbest R package, available at https://cran.rproject.org/package=mbest.
2 Local and global approaches
This hierarchical modeling approach interpolates two extremes: a “global” and a “local” approach. The global approach would have a single parameter vector shared by all books, fit by lumping the reviews for all books together. The local approach would have a different parameter vector for each book, fit in isolation using only the reviews of the corresponding book. The global approach corresponds to setting
for all books ; the local approach corresponds to treating as nonrandom.The appeal of the hierarchical model can be demonstrated by comparing the performance of a global and a local model. We perform this comparison at the author level: the global model will have a single parameter vector shared by all authors; the local model will have a different parameter vector for each author. Both models use the same set of covariates, described later in Section 3. We fit both models on a training set, then evaluate on a test set. In the global model the probability that a user likes an item is described as
where is the vector of user covariates. The predictions from this model are the same for all authors. In the local model, the probability that the user likes book is described as
This takes a similar form to the global model, but the coefficients depend on the book author .
Table 1 shows the test set misclassification rates for the two fitted models, grouped by the number of ratings per author. For authors with 100 or fewer ratings, the global model performs better on average than the local model. Only in the group of authors with large number of ratings (more than 100) does the local model perform better.
# Author Ratings  Error (%)  Error (%)  Error Error (%) 

[10,20]  38.52 (0.53)  35.16 (0.52)  3.36 
(20,50]  37.15 (0.42)  34.91 (0.41)  2.24 
(50,100]  34.76 (0.48)  34.20 (0.48)  0.55 
(100,1000]  32.15 (0.40)  32.32 (0.40)  0.16 
) Models. (Standard Deviations in Parentheses.)
What is needed is an adaptive model that can interpolate between these two extremes: for authors with abundant data, use the local model; for authors with little or no data, use the global model; for others, use some combination of the two models.
The hierarchical model achieves the interpolation between local and global models automatically. For items with abundant data, the posterior distribution (conditional on the data) for the random effect vector is concentrated around the local coefficient estimate that uses only the reviews for item . For items with no data, the posterior distribution for is diffuse, with mean zero; the predictions for item are determined mostly by the fixed effect vector shared globally by all items. As the number of reviews for item increases, the posterior distribution for and the corresponding predictions for item interpolate between these local and global extremes.
The flat item hierarchy shares information across all items. In our application, we have a deep hierarchy of books. In Section 3, we will show how to exploit this deep hierarchy by a using a model with a random effect vector at each node in the hierarchy. Predictions for the items at the leaves involve the random effects on the nodes on the path from the root of the hierarchy to the item. Information pooling occurs at all levels of the hierarchy, with siblings in a subtree pooled to estimate the posterior distribution of their random effects. A hierarchy node with a subtree of abundant data will have an estimated random effect close to what would come in a fitted model. For other nodes, the estimate will involve informationpooling across other nodes at the same level in the tree.
3 Modeling framework
For our proxy problem, the goal is to estimate, for a given book and user, the probability that the user would like the book conditional on the user rating the book. We have a set of user and context covariates for each review, and a response indicating whether the user liked the book. We also have a deeply nested hierarchy of books nested under author, subsubgenere, subgenre, and genre. In what follows, we describe a model that leverages the book hierarchy in a way that facilitates information pooling for the estimates across the levels in the hierarchy.
To introduce the model, we first need to be more precise about what we mean by a hierarchy in the context of our problem and other similar settings. For us, a “depth hierarchy” is a tree where all leaves have depth . In such a hierarchy, we label the nodes of the tree by unique strings of natural numbers:

the root of the hierarchy gets labeled by the empty string, denoted ;

the children of the root get labeled by the length1 strings ;

in general, if is the label of a node, we let denote its number of children; we label these children by the strings gotten by concatenating the label with the child identifiers: .
In this labeling scheme, each node other than the root has a label that can be represented as , where is the node’s parent and is a natural number in the range .
For a node , we denote its depth by , equal to its distance from the root; this is also equal to the length of its label. For any depth in the range , we let denote the set of nodes with depth . Finally, for node of depth , and for , we let denote its ancestor at depth in the hierarchy, setting .
In the context of our application, the leaves of the hierarchy are authors. The internal nodes are genres and subgenres. We observe data for each author (ratings for that author by users); our goal is to relate these ratings to the user covariates, using the structure of the book hierarchy to inform our predictions.
In general terms, the hierarchical model supposes that at each leaf node we observe a response vector of length . The behavior of this response vector is linked to observable covariates through a vector of nonrandom fixed effects identified with the root of the hierarchy and a set of random effects identified with the nodes in the hierarchy on the path from the root to the leaf . Different levels of the hierarchy may use different sets of user and book features to predict the user’s probability of liking the book. We denote these features by , where , a matrix of dimension , contains the features used by level of the hierarchical model to predict the response vector . To link these features to the response, we posit existence of a fixed effect vector of dimension identified with the root of the hierarchy, along with a random effect vector at every other node in the tree such that has dimension when is at depth . The distribution of the response is determined by some function of the linear predictor , defined as
(3) 
This predictor involves the fixed effect vector and the random effects of all nodes on the path from the leaf to the root.
In our application, we have a binary response vector with entries indicating whether the user liked the book. We use the canonical logistic link, supposing that for , this predictor relates to the response as
(4) 
where without subscript denotes the collection of all random effects. We further suppose that the components of the vector are independent of each other conditional on . In an application with a continuous response vector we would instead typically specify that
has independent Gaussian components with mean and variance given by
(5) 
for and for some dispersion parameter . The hierarchical model is not limited to these two settings, and in principle a modeler could specify any link between the linear predictor and the mean of the response .
To endow our model with a mechanism that allows borrowing strength across similar items in the hierarchy, we model the populations of random effects at the levels of the hierarchy. We treat these populations as independent. For the population of level random effects, , we suppose that each item
is an independent draw from a multivariate normal distribution with meanzero and covariance matrix
for some covariance matrix :(6) 
We further suppose that all random effects are independent of each other.
In the sequel, we discuss estimation for the depth hierarchical model. That estimation procedes in two stages: first, estimate the model parameters and . Next, use the model parameters to get empirical Bayes estimates of the random effects . The empirical Bayes estimation procedure is the part of the model estimation that leverages information across different levels of the hierarchy. Our estimates of the covariance matrices
allow us to impute components of particular random effects vectors when we only have information about a subset of their components.
4 Fitting procedure
Frequentist hierarchical models like the one described in the previous section often get fit via maximum likelihood (Bates et al., 2013). However, these fitting algorithms can be prohibitively slow for large data sets like those that appear in commercialscale settings (Agarwal, 2008; Naik et al., 2008). Perry (2017) got over the computational hurdle in a depth1 hierarchical model by using a momentbased estimation procedure, adapted from an earlier procedure due to Cochran (1937). Here, we will extend nmfmt@@posfmt@@swafalsectype@@@partrue ifstar@@fulltrue@@citetp@@fullfalse@@citetpPerr16 procedure to handle hierarchy of arbitrary depth.
Throughout the section we will assume that the model described in (3) and (6) is in force. For the response data and the leaf nodes
we will allow for both the logistic regression case from (
4), the normal response case from (5). Our estimators extend naturally to any generalized linear model at the leaves with shared dispersion parameter .The estimation procedure is easiest to describe if we reparametrize. To do so, for any node , let denote the vector of fixed and random effects on the path from the root up to and including . Specifically, set and for node with parent define recursively
For depth , let be the total number of fixed and random effects up to and including depth :
if , then has components.
Now, for leaf node define the matrix gotten by concatenating the columns of feature matrices by , so that has dimension . In this reparametrized form, the linear predictor at leaf node is
(7) 
This form makes the hierarchical model look somewhat like a standard generalized linear model, but the effect vector includes both nonrandom and random components: the fixed effect vector and the random effects on the path from the root to the leaf .
The estimation procedure for the hierarchical model is defined by repeatedly pruning the tree by reducing the leaves to a set of estimates at their parents. The high level description of the procedure is as follows:

Produce estimates of at each leaf node . Set .

We have at hand estimates of for each node . For each , combine the child estimates, for , to produce an estimate of and an estimate of .

Combine estimates for to produce a final estimate .

If , set to be the final estimate of the fixed effects and stop. Otherwise, go to Step 2 with the level decreased to .
In settings with a dispersion parameter , we handle this parameter analogously to .
When the fitting procedure terminates we will have final estimates and of the fixed effects and the random effect covariance matrices. The rest of this section is devoted to detailing the individual steps of the fitting procedure.
4.1 Step 1: Estimate parameters at the leaves
The first step in the estimation procedure is to use the data at each leaf to produce an estimate of , the vector of fixed and random effects on the path from the root to the leaf. Recall that . We will explicitly handle cases where the combined predictor matrix is rankdegenerate. We will only require that, conditional on , the estimate has negligible bias outside the null space of and is approximately normally distributed with known covariance matrix.
First we handle the normal model (5), where for the components of the response satisfy for a meanzero Gaussian error vector with independent components for having unknown variance . In this case we set
where denotes pseudoinverse. When has full rank, the estimate is the unique leastsquares estimate of ; otherwise, the leastsquares estimate is not unique and is one of the vectors minimizing the squared Euclidean norm .
To define the estimate of the dispersion parameter , we let denote the rank of . When we set
otherwise, we set . We combine the estimates across all the leaves to get a single estimate for the dispersion parameter:
Next we derive the properties of the estimate . First, let
denote a compact singular value decomposition where
is a diagonal matrix of dimension with positive diagonal entries. ThenThus,
where is a meanzero Gaussian random vector of independent components, each with variance . If we set , then the quantity is approximately meanzero normal with identity covariance matrix.
For the logistic regression model (4), we proceed analogously, but we use the Firth’s biasedreduced estimator (Firth, 1993) in place of the least squares estimator for . This is a refinement of the maximum likelihood estimator that is welldefined even when the responses are perfectly separated by a linear combination of the predictors. In cases where is rankdegenerate, there are multiple such estimators; we arbitrarily take to be one of them. The properties of the estimator are like those of the maximum likelihood: as the sample size increases, the estimator is asymptotically unbiased with covariance equal to the inverse information matrix. In the case of rankdeficient feature matrix , the information matrix takes the form where is the matrix of right singular vectors of . In this case if we set , then is approximately meanzero normal with identity covariance matrix.
In both the normal and the logistic regression case, we can find an estimator and a matrix with full column rank such that conditional on , the quantity is approximately normal with identity covariance. In the logistic regression case, the quality of the normal approximation depends on the sample size being large.
4.2 Step 2: Combine the estimates at level
We now suppose that for some level , for each node we have a matrix of full row rank , such that conditional on ,
For linear models, these two conditions hold exactly; in nonlinear models these will only hold approximately, with the quality of the approximation depending on the size of the sample used to estimate . We will show how to combine estimates to get an estimate of and an estimate of .
Recall that for each node , where is the random effects on level , and are of length respectively. Let be a compact singular value decomposition. When the context is clear, for simplicity we denote as the first rows of , and denote as the last rows of .
We have the following (unconditional) moment equations:
(8)  
(9) 
The moment equations (8) and (9) hold for any node , therefore by standard moment matching method, we want to take empirical mean of terms on the left hand side and set parameters on the right hand side to match it. However, we cannot do this right away, since the dimension of , or equivalently the rank , may vary by nodes . To overcome this, we augment the moment equations to have same dimension across nodes . In particular, with any choice of symmetric positivedefinite matrix for each node , we have
(10)  
(11) 
We use the semiweighted scheme for choosing as described by (Perry, 2017).
Now, the moment equations have consistent dimension across all nodes: for every node , equation (10) has dimension , and (11) has dimension . Based on equation (10), we define the momentbased estimator as
where denotes pseudoinverse. Based on equation (11), the momentbased estimator should satisfy
In practice, we do not have access to the true to compute in the above equation, instead we use the empirical estimate . If the result is not positive semidefinite, we project it onto the cone of positive semidefinite matrices and obtain the final estimate.
Let denote the eigendecomposition of the positive semidefinite matrix . Let denote the symmetric square root of . nmfmt@@posfmt@@swafalsectype@@@partrue ifstar@@fulltrue@@citetp@@fullfalse@@citetpPerr16 results imply that, subject to assumptions on the sample size, conditional on , the quantity is approximately normally distributed with
The error in the approximation tends to zero as the number of child nodes increases. In addition, since . Thus we can rewrite the above results as
(12) 
Perry (2017) details the precise assumptions required for these results along with the quality of the approximations.
4.3 Step 3: Combine the level covariance estimates
At the end of Step 2 in the procedure we have an estimates of for each . In Step 3, we combine these estimates to produce a final estimate by taking a weighted average: of over all nodes ,
Nodes with higher numbers of children get more weights.
4.4 Step 4: Recurse or Stop
If we are at the root of the tree, so that and we have an estimate , then we terminate the estimation procedure by setting our final estimate of the fixed effects to . Otherwise we decrement to , and go to Step 2 with that are used in equation (12).
5 Empirical Bayes random effect estimates
At the end of the fitting procedure described in Sec. 4 we have estimate of the fixed effect vector and estimates of the random effect covariance matrices. We also have at each node in the hierarchy a preliminary estimate of , the fixed and random effects on the path from the root to node . These preliminary estimates do not share information across the hierarchy; estimate is determined only from the data at the leaves descending from . We can improve the estimates by replacing each with an empirical Bayes estimate that pools information across the hierarchy.
The informationpooling algorithm works topdown from the root. It starts by setting . Then at depth1 nodes , the procedure uses and together with to get a refined estimate . This process repeats, level by level, until we get refined estimates at the leaves.
The full procedure is as follows:

Set and set .

If , stop.

For each node we have a refined estimate . For each child for , we have a preliminary estimate . Use together with and to produce a refined estimate .

Increment to and go to Step 2.
After applying this procedure, we have a refined estimate at each node in the tree. The estimates at each leaf can be used to make refined estimates of the linear predictors ( or they can be used to make predictions for new data.
To complete the description of the procedure we need to explain Step 3 in more detail. In this step we have at our disposal , , and for node and its children. Further, we have a matrix of full column rank such that is approximately distributed as a multivariate normal with identity covariance matrix.
By definition, where is the random effect vector for node , and are of length respectively. We denote as the first columns of , and denote as the last columns of . Conditional on
, we have the following (approximate) Bayesian linear regression model: for any node
,The empirical Bayes estimate is an estimate of the posterior mean of conditional on the observed data , gotten by using plugin estimates and for and .
To derive the posterior distribution define , noting that the conditional distribution is the same as that of . By Bayes rule, then, the posterior density of satisfies
The posterior distribution, then, is that of a multivariate Gaussian with expected value given by
The empirical Bayes estimate of comes from using this expression in conjunction with plugin estimates for and :
The refined estimate of , then, is .
6 Simulation
To evaluate our proposed estimation method, we compare its performance with three other procedures:

glmer, a maximum likelihood procedure, implemented as part of the lme4 R package (Bates et al., 2013).

glmer.split, a datasplitting estimation procedure, which randomly splits the data set into 10 subsets, computes estimates on each of them separately using glmer, and then combine the estimates by averaging them. We implemented the procedure ourselves in R; the algorithm is based on procedures proposed by Huang and Gelman (2005), Gebregziabher et al. (2012), and Scott et al. (2013).

sgd
, which uses stochastic gradient descent to maximize a regularized version of the
likelihood. We implemented the procedure in a combination of C and R; the algorithm is based on procedures proposed by Koren, Bell and Volinksy (2009) and Dror, Koenigstein and Koren (2011). We choose the regularization parameters by crossvalidation.
In evaluating the methods, we look at both the quality of their estimates and the time it takes to compute them. We do not include the tuning parameter crossvalidation time in the timing results.
We perform two sets of simulations: one for a twolevel logistic regression model, and one for a twolevel linear regression model. The setup and results for both simulations are similar, so we only include the logistic regression results here. Appendix B contains the linear regression results.
Following the notation in Section 3, we set the number of groups on the first level to , and number of groups on second level (the leaves) to . We simulate samples with ranging from to . We set the dimensions of fixed and random effect vectors to . For each value of we draw replicates according to the following procedure.
For each replicate, we draw a dimensional fixed effect vector with components drawn independently from a heavytailed student’s
distribution with 4 degrees of freedom. We draw random effect covariance matrices
and independently from an inverse Wishart distribution with shape and 10 degrees of freedom, scaled by .We allocate the samples to the groups and
subgroups, in a way that approximates the highly skewed hierarchies in the Book Crossing dataset. In each replicate, we first draw sampling rates
from a Pareto distribution, with scale and shape parameters set to 1. Then, we allocate the samples to the leaf nodes by drawing from a multinomial distribution with probability vector . Similarly we allocate the leaf nodes to groups using the same Pareto distribution and sampling scheme.For every group node in the first level of the hierarchy, , we draw a dimensional random effect vector from multivariate Gaussian with mean zero and covariance matrix ; for every leaf node , we draw a dimensional random effect vector from multivariate Gaussian with mean zero and covariance . Then we randomly draw fixed effect predictor vectors for sample point , with independent elements taking values and with probability each. We use the same procedure to randomly draw random effect predictors for every sample point , and let the two levels of the hierarchy share the same random effect predictors. Finally, for every sample in leaf node , we draw response as Bernoulli with success probability
To evaluate the quality of the estimators, we use the following loss functions:

Fixed Effect Loss: ;

Random Effect Level Covariance Loss: ;

Random Effect Level Loss:

Prediction Loss:
where for sample in leaf .
We also measure the overall computation time for each, excluding the crossvalidation time for tuning parameter selection.
We compare our method (mhglm) with the three other methods described above: glmer, glmer.split, and sgd. Figure 2 shows the mean performance for each method, averaged over replicates, with circle radii indicating standard errors along the vertical axes. For moderate to large sample sizes, there is a noticeable difference between the proposed method and other maximum likelihood based estimators. However the proposed method still appears to be consistent, in the sense that its estimators improve with more samples. In terms of prediction loss, our proposed method outperformed both sgd and glmer.split and is only slightly worse than glmer.
The bottom panel compares the computation time for all methods. For large sample sizes, our proposed method is much faster than the other procedures, by factor ranging from to , and the factor appears to grow exponentially as sample sizes increase.
In the context of this simulation, our proposed method is able to trade off a modest loss in prediction performance for a dramatic decrease in computation time. We can see that our proposed procedure will scale well to our book recommendation context and to commercial recommendation settings generally.
7 Application
Having developed an estimation procedure for deeplynested hierarchical models in Secs. 4 and 5, and having established its suitability in Sec. 6, we now return to our main application, fitting a model to data that allows us to predict whether or not a user would like a book if he or she had rated it.
Recall from Sec. 1
that our dataset consists of two parts: a set of user ratings of books, and a hierarchy of these books. We treat each rating as an observation containing book and user identifiers along with a numerical score between 1 and 10. To smooth differences between userspecific rating scales, we binarize the ratings, treating numerical scores of 8 or above as “positive” and ratings below this as “negative”. We will model these binarized ratings using a hierarchical logistic regression model. We use the user demographic features together with the rating context to construct candidate predictors in the model, linked to the response through fixed and random effects. We use a subset of the book hierarchy for the structure of the hierarchical model.
7.1 Candidate predictors
The first set of candidate predictors are converted from user demographic data. We bin users’ ages into 5 groups: (0,26], (26,32], (32,38], (38,47] and (47,101], where each group has approximately same number of ratings. Each age group is represented by one categorical variable. If age is missing, then all five indicators are zero. We aggregate the geographic feature into 6 groups by continent: North America, Europe, Oceania, Asia, South America, and Africa. Similarly, each group is represented by one categorical variable. We have a total of 11 demographic predictors.
Our second set of candidate predictors is the timevarying predictors defined as functions of past user behavior. The first predictor prev is a userspecific binary indicator of whether the user’s previous rating was positive. This is designed to capture a user’s propensity to make positive ratings. The second predictor, dist, is usercategory specific, computed as the smoothed log proportion of past ratings that the user gives in each category. This is designed to capture users’ tendency to give ratings to each category, revealing his or her relative preference among all categories. Table 2 gives detailed descriptions of all the predictors.
Predictor  Description 

Age  Userspecific features: a 5component indicator vector for age range (0,26],(26,32],(32,38],(38,47],(47,101] 
Geographic  Userspecific features: a 6component indicator vector for continent Africa, Asia, Europe, North America, Oceania, South America 
Previous  Userspecific feature: a smoothed estimate of the log of proportion of positive ratings from user: , where and are the number of positive ratings () and total number of ratings from user. 
Distribution  Userbookspecific feature: a smoothed estimate of the log of proportion of ratings in book’s genre from user: , where is number of ratings user gives in book’s genre; is total number of ratings from user; and is total number of genres. 
7.2 Model selection
We perform two forms of model selection. First, we need to choose which parts of the book hierarchy to use. Second, we need to choose which predictors to use. To carry out the model selection, we randomly partition the data into 80/10/10 percent chunks, for training/development/testing sets. We train various models on the training set, select a model with best performance on the development set, and finally compare the chosen model with other fitting methods on the testing set. In data processing, we only use training data to construct the new features.
We use all predictors for fixed effects, and we can fit these reliably given the large volume of data. However, we do not use all of these predictors on all random effects levels. Fitting the random effects is much more difficult, because it relies on ratings specific to the particular position in the hierarchy. The population structure of the random effects mitigates against some of this data sparsity, but there will still be situations where using coarser hierarchy makes the model less susceptible to overfitting. To guard against overfitting in the random effects terms of the model, we perform model selection by using outofsample prediction performance on the development set. To choose the specific subset of the predictors to use as random effects, we fit all possible combinations at all levels of the model, selecting the model with the lowest misclassification rate on the development set.
We start with a depth1 model. Here we fit depth1 models with all five possible grouping factors: genre, subgenre, subsubgenre, author, book. Table 3 lists the best onelevel model using each grouping factor. We sort the performance by misclassification error on development dataset. We see that using author as the grouping factor, and demographic information as the random effect features gives the best prediction performance on development set. Note that we did not get additional performance improvement by using a bookspecific random effect model, which suggest that we could potentially overfit the data by using too many groups.
Group  Features  Error 

author  geo  0.3212 
book  age, geo  0.3235 
subsubgenre  prev, dist, age, geo  0.3282 
subgenre  prev  0.3288 
genre  1  0.3291 
To further take advantage of the five nested hierarchies, we also consider depth2 models. Using the lme4 modeling notation, we fit all models of the following form:
where grouping level is nested under , and and are predictor matrices with columns taken from the candidate predictors. The notation indicates that the model has fixed effects corresponding to an intercept and predictors age, geo, pref, and dist, random effect predictors at the first level, and random effect predictors at the second level.
Error  

subgenre  author  dist, age  age, geo  0.3177 
subsubgenre  author  age  dist, geo  0.3184 
genre  author  dist, geo  age, geo  0.3189 
author  book  geo  dist  0.3210 
subsubgenre  book  dist, age, geo  geo  0.3212 
subgenre  book  prev, dist, age  age, geo  0.3218 
genre  book  age, geo  age  0.3226 
subgenre  subsubgenre  age, geo  dist  0.3266 
genre  subsubgenre  dist, age  dist, geo  0.3269 
genre  subgenre  prev, dist  prev  0.3287 
We list the best performing depth2 model for every combination of () in Table 4, where we sort the performance by misclassification error on development dataset. The feature 1 indicates the feature of all ones (i.e. the intercept term). Note that we decrease the misclassification rate from 0.3212 to 0.3177 by adding an additional hierarchy subgenre on top of author. This improvement in predictive performance may seem small, but in practice such improvements can translate to big impacts when the corresponding models are deployed in commercial scale recommender system applications (Kramer, Guillory and Hancock, 2014; Kohavi et al., 2014). Thus, the small improvement of the twolevel model over the onelevel model can be meaningful.
The best twolevel model is using subgenre and author as the two grouping factors; on subgenre level it uses dist and age as random effects features; on author level it uses age and geo as random features. It is a relatively simple model with competitive performance, and we will focus on this model throughout the rest of the paper.
8 Results
8.1 Performance
In Sec. 7.2, the model that gave the best prediction performance on the development set used two levels of hierarchy, corresponding to “author” and “subgenere,” with author nested within subgenre. For fixed effect predictors, the model used an intercept along with age, geo, prev, and dist. For random effect predictors at the first level in the hierarchy (subgenre), the model used an intercept along with dist and age; at the second level in the hierarchy (author) the model used an intercept along with age and geo. Having selected the model, we will now evaluate its performance on the heldout test set.
We fit the model to the training data set using our proposed momentbased procedure mhglm along with two competing methods described in Sec. 6, the glmer maximum likelihood procedure and the sgd stochastic gradient descent likelihoodbased procedure , and we compare their prediction performances on the heldout test data set. We do not include the glmer.split method, because glmer fails on the randomly splitted subsets due to sparsity.
Fitting Method  Error  Error 95% Confidence Interval 
Time (seconds) 

mhglm  55.14  
glmer  44790.23  
sgd  2022.35 
Table 5 lists misclassification error, error’s 95% confidence intervals, and running time for all three methods. All methods have comparable prediction performance, with a misclassification rate of about 32.5%. Our proposed procedure mhglm slightly outperforms the other two methods in overall misclassification error, but the difference is not statistically significant. When we look at the running time, however, mhglm is faster than sgd by a factor of 45, and faster than glmer by a factor of 1000. Fitting the model using our proposed method took under a minute; fitting using sgd took 33 minutes; fitting using glmer took 12.4 hours.
The results in Table 5 demonstrate two features of the mhglm fitting procedure. First, the prediction performance is comparable to that of the more established likelihoodbased procedures. Second, mhglm is faster than these methods by at least an order of magnitude. This reduction in computation time enabled us to perform an exhaustive model selection search over all 1 and 2level models. For the four predictors and the intercept, and for the 5 grouping levels, there were 1level models and 2level models. Extrapolating from the timing results in Table 5, performing the search over these models using mhglm took approximately 40 hours; using sgd or glmer, the same search would take approximately 60 days or 3.75 years, respectively.
Our proposed fitting method has enabled us to perform an exhaustive search over all 1 and 2level hierarchical logistic regression models without sacrificing prediction performance.
8.2 Fitted model
To gain some insight into the predictions made by our fitted model, we investigate the empirical Bayes random effect estimates. Specifically, we investigate the age random effects at the subgenre and author levels.
In the context of the fitted model, given a book’s subgenre we can compute the increase in log odds of a user liking the book if we change the user’s age from “missing” to known while keeping all other predictors constant. In Fig. 3 we show the change in log odds ( estimated posterior standard deviation) for young and old age groups, for the subgenres that have most ratings. For the old age group (47–101 years), the estimates have large estimated posterior standard deviations across the subgenres listed, making it difficult to identify a clear patter. For the young age group (0–26 years) there is some weak but meaningful signal. In this age group, there is a clear pattern in which of the common subgenres the users like and don’t like: “romance” is their favorite subgenre, which has significantly higher random effects than “literature & fiction genre fiction”, their least favorite subgenre.
Next we perform a similar analysis, but on the second level of the hierarchy, “author.” For every author we compute the increase in log odds of liking the book if we change the users’ age from “missing” to known while everything else remain the same. In Figure 4 we show the increase in log odds ( estimated posterior standard deviation) for young and old age groups, for the authors that have most ratings.
We observe a few interesting patterns:

The estimated posterior standard deviations are much larger for random effects on the second level (author), for both young and old age groups.

Some authors are consistent across different age groups. For instance, one would want to recommend J. K. Rowling to both young and age groups. Meanwhile Nick Hornby and Piers Anthony are liked by neither groups.

Some authors have quite different behaviors across age groups. For instance, Danielle Steel has positive log odds increase if we know the user is within young age group, but negative log odds increase if user is among old age groups. An opposite example is Elizabeth Berg: we will suffer a decrease in log odds if user is young, meanwhile log odds will increase if the user is old.
9 Discussion
The appeal of the deeplynested hierarchical model is that it facilitates information sharing across subtrees at all levels of the hierarchy. Nodes with abundant data effectively have their random effects estimated using only data at the leaves descending from them. Nodes with little or moderate data, however, benefit by having their estimated coefficients (random effects) shrunk towards the global mean. In our book recommendation application, we have demonstrated this advantage by showing that using two levels of hierarchy (author and subgenre) delivers increased prediction performance than using one or no levels.
The main hurdle in deploying hierarchical models in recommender systems applications like ours and other contexts of similar scale is that the time required to fit these models can be prohibitive. Perry (2017) extended a method original due to Cochran (1937) and proposed a partial solution to this problem, but his procedure is limited to singlelevel hierarchical models. Here, with our proposed mglhm method we have shown how to fit a hierarchical model of arbitrary depth by repeatedly applying the singlelevel fitting procedure to prune the leaves of the hierarchy. We then showed how to propagate the estimates at the root of the hierarchy down through the nodes in the hierarchy to refine the random effect estimates.
In our book recommendation application, our proposed fitting procedure was faster than stochastic gradient descent by a factor of 45, and faster than the likelihoodbased glmer procedure by a factor of 1000. This increase in computational speed enabled us to perform an exhaustive model selection search over all one and twolevel models, reducing the overall computation time from about 60 days using sgd (or 3.75 years using glmer) to about 40 hours. As our simulations in Sec. 6 demonstrated, the tradeoff in deploying our method is reduced statistical efficiency and prediction performance. However, in our application, the loss in prediction performance was negligible.
Although our motivation was a book recommendation system, our proposed fitting procedure is general enough to handle hierarchical generalized linear models of arbitrary depth. We have incorporated our implementation of this procedure into the mbest R package, available on the Comprehensive R Archive Network (CRAN). The interface in this implementation is flexible enough to handle any deeply nested hierarchical generalized linear model.
Appendix A Data description
a.1 Overview
The BookCrossing dataset introduced in Section 1 contains 433,671 numerical ratings of 185,973 books from 77,805 users (Ziegler et al., 2005). Each rating consists of a book id (ISBN), a user id, and a numerical score between 1 and 10, where 1 indicates extreme negative and 10 indicates extreme positive sentiment. We binarize the ratings so that ratings equal or above are considered positive and ratings below are considered negative. The threshold is chosen such that the two classes have comparable number of samples. We have user demographic information including age and location; Section A.2 reports some descriptive statistics about these features. We also know the book authors and titles.
We augment the book metadata with a genre hierarchy scraped from Amazon.com by McAuley, Pandey and Leskovec (2015). In this metadata, book titles are nested within authors within subsubgenres within subgenres within genres. If the same author writes titles in multiple subsubgenres, we treat the author as multiple, separate entities. Section A.3 describes the hierarchy in metadata.
In the raw dataset, more than half of the ratings cannot be matched to Amazon metadata. Dealing with this missing data is beyond the scope of the present treatment, so we remove samples with missing ratings or unmatched book ids from consideration. This leaves us with 157,638 ratings of 38,659 books from 38,085 users.
a.2 User demographic features
The reported user age is a continuous variable, ranging from 15 to 100. The mean and standard deviation of user age are 36.4 and 12.6 respectively. Table 6 shows the number of users and ratings from each age range.
Age Interval  # Users  # Ratings 

20  2,664  8,752 
(20,30]  6,011  30,820 
(30,40]  5,906  32,252 
(40,50]  3,765  20,539 
(50,60]  2,609  11,961 
(60,70]  952  2,997 
70  297  992 
Most of the ratings comes from young or middleaged users, which makes it easier to estimate and predict for users from those age ranges.
User’s location information is reported as his city, state, country, and continent. Table 8 reports the number of users and ratings from the 10 mostrepresented countries, and Table 8 reports the same information for each continent.
Country  # Users  # Ratings 

USA  29,042  120,201 
Canada  3,619  14,592 
United Kingdom  989  3,622 
Australia  632  2,067 
Portugal  181  1,490 
Germany  381  1,189 
Spain  187  1,008 
Malaysia  111  964 
Netherlands  178  638 
New Zealand  148  563 
Continent  # Users  # Ratings 

North America  32,722  135,059 
Europe  2,632  10,122 
Oceania  780  2,630 
Asia  430  2,272 
South America  68  210 
Africa  36  87 
We can see that the vast majority of the ratings () are from North America, with Europe, the nextmostrepresented country, receiving only of ratings. This indicates that it’s quite difficult to accurately estimate and predict for users from other than these two continents.
a.3 Book hierarchy
Every book is nested under a deep hierarchy:
genre subgenre subsubgenre author title.
For example, the book Harry Potter and the Chamber of Secrets is nested as Children’s Books Literature Science Fiction, Fantasy, Mystery & Horror J. K. Rowling Harry Potter and the Chamber of Secrets. Figure 1 displays all hierarchies on the first two levels.
For modeling purposes, we only chose two out of five hierarchies. We omit the intermediate levels and use simplified hierarchy of subgenre author. Our first level of hierarchy subgenre has 1,344 groups, which captures the necessary amount of diversity across books using a reasonable amount of groups. We use author (nested under subgenre) as the second level of hierarchy, which has 27,360 groups. We use book author instead of book title as the second level hierarchy, since the Book Crossing dataset is very sparse, such that most books has only a few number of ratings. Hierarchical models will not work well if most groups have very few samples, which shrinks the overall results towards that of a simple “global” model.
Even for these carefully chosen hierarchies, the distribution of subgroups and samples are still highly skewed. We can see this skewness in Figure 5
, which plots the quantiles of the number of authors per subgenre (left panel) and the number of ratings per author (right panel). Both plots are on
scale. 50% of subgenres have fewer than 17 authors, and 90% of subgenres have fewer than 261 authors. At the other extreme, the largest subgenre (Literature & Fiction General) has 2893 authors. The distribution of ratings among authors are highly skewed as well: 50% of authors have only 1 rating, 90% of authors have less than 9 ratings, meanwhile the mostly rated author (Sue Grafton) received 1183 ratings.Hierarchical models gain its predictive power by pooling information across groups. The existence of large numbers of small groups will make learning model parameters as and making good predictions difficult.
Appendix B Twolevel linear model simulations
Here we perform a simulation study similar to the twolevel logistic regression model study described in Sec. 6, but using a twolevel linear regression model instead.
With all other simulation parameters drawn as described in Sec. 6, in the linear regression setup we draw response from a normal distribution with mean and variance whenever sample belongs to leaf . We again compare our procedure with those three methods. We use the same loss for fixed and random effects, as well as the random effect covariance. For prediction loss, we use the mean squared error:
where and .
Figure 6 shows the mean loss, averages over replicates, with circle radii indicating standard errors along the vertical axes. For moderate to large sample sizes, there is a noticeable but decreasing difference between the proposed method and other maximum likelihood based estimators. However the proposed method still appears to be consistent. In terms of computation time, this method again has improvement by factor ranging from to , and the factor appears to grow exponentially as sample sizes increase.
References
 Adomavicius and Tuzhilin (2005) [author] Adomavicius, G.G. Tuzhilin, A.A. (2005). Toward the Next Generation of Recommender Systems: A Survey of the StateoftheArt and Possible Extensions. IEEE T. Knowl. Data En. 17 734–749.
 Agarwal (2008) [author] Agarwal, D.D. (2008). Statistical Challenges in Internet Advertising. In Statistical Methods in eCommerce Research (W.W. Jank G.G. Shmueli, eds.) Wiley.
 Agarwal and Chen (2010) [author] Agarwal, DeepakD. Chen, BeeChungB.C. (2010). fLDA: Matrix Factorization Through Latent Dirichlet Allocation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. WSDM ’10 91–100. ACM, New York, NY, USA. 10.1145/1718487.1718499
 Ansari, Essegaier and Kohli (2000) [author] Ansari, AsimA., Essegaier, SkanderS. Kohli, RajeevR. (2000). Internet Recommendation Systems. Journal of Marketing Research 37 363375.
 Bates et al. (2013) [author] Bates, DouglasD., Maechler, MartinM., Bolker, BenB. Walker, StevenS. (2013). lme4: Linear mixedeffects models using Eigen and S4 R package version 1.17.
 Cochran (1937) [author] Cochran, W. G.W. G. (1937). Problems Arising in the Analysis of a Series of Similar Experiments. Supplement to the Journal of the Royal Statistical Society 4 102–118.
 Condliff, Lewis and Madigan (1999) [author] Condliff, Michelle KeimM. K., Lewis, David D.D. D. Madigan, DavidD. (1999). Bayesian MixedEffects Models for Recommender Systems. In In ACM SIGIR ’99 Workshop on Recommender Systems: Algorithms and Evaluation.
 Dror, Koenigstein and Koren (2011) [author] Dror, G.G., Koenigstein, N.N. Koren, Y.Y. (2011). Yahoo! Music Recommendations: Modeling Music Ratings with Temporal Dynamics and Item Taxonomy. In Proceedings of the fifth ACM conference on Recommender systems 165–172. ACM.
 Firth (1993) [author] Firth, D.D. (1993). Bias Reduction of Maximum Likelihood Estimates. Biometrika 80 27–38.
 Gao and Owen (2016a) [author] Gao, K.K. Owen, A. B.A. B. (2016a). Efficient moment calculations for variance components in large unbalanced crossed random effects models. ArXiv eprints.
 Gao and Owen (2016b) [author] Gao, K.K. Owen, A. B.A. B. (2016b). Estimation and Inference for Very Large Linear Mixed Effects Models. ArXiv eprints.
 Gebregziabher et al. (2012) [author] Gebregziabher, M.M., Egede, L.L., Gilbert, G. E.G. E., Hunt, K.K., Nietert, P. J.P. J. Mauldin, P.P. (2012). Fitting Parametric Random Effects Models in Very Large Data Sets with Application to VHA National Data. BMC Medical Research Methodology 12 1–14.
 Huang and Gelman (2005) [author] Huang, Z.Z. Gelman, A.A. (2005). Sampling for Bayesian Computation with Large Datasets. Unpublished.
 Kohavi et al. (2014) [author] Kohavi, RonR., Deng, AlexA., Longbotham, RogerR. Xu, YaY. (2014). Seven rules of thumb for web site experimenters. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining 1857–1866. ACM.
 Koren, Bell and Volinksy (2009) [author] Koren, Y.Y., Bell, R.R. Volinksy, C.C. (2009). Matrix Factorization Techniques for Recommender Systems. Computer 42 30–37.
 Kramer, Guillory and Hancock (2014) [author] Kramer, Adam DIA. D., Guillory, Jamie EJ. E. Hancock, Jeffrey TJ. T. (2014). Experimental evidence of massivescale emotional contagion through social networks. Proceedings of the National Academy of Sciences 111 8788–8790.
 McAuley, Pandey and Leskovec (2015) [author] McAuley, J.J., Pandey, R.R. Leskovec, J.J. (2015). Inferring networks of substitutable and complementary products. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794.
 Naik et al. (2008) [author] Naik, P.P., Wedel, M.M., Bacon, L.L., Bodapati, A.A., Bradlow, E.E., Kamakura, W.W., Kreulen, J.J., Lenk, P.P., Madigan, D. M.D. M. Montgomery, A.A. (2008). Challenges and Opportunities in HighDimensional Choice Data Analyses. Market. Lett. 19 201–213.
 Perry (2017) [author] Perry, P. O.P. O. (2017). Fast MomentBased Estimation for Hierarchical Models. Journal of the Royal Statistical Society (Series B) 79 267–291.
 Scott et al. (2013) [author] Scott, S. L.S. L., Blocker, A. W.A. W., Bonassi, F. V.F. V., Chipman, H. A.H. A., George, E. I.E. I. McCulloch, R. E.R. E. (2013). Bayes and Big Data: The Consensus Monte Carlo Algorithm. In Bayes 250.
 Tan et al. (2018) [author] Tan, ZilongZ., Roche, KimberlyK., Zhou, XiangX. Mukherjee, SayanS. (2018). Scalable Algorithms for Learning HighDimensional Linear Mixed Models. arXiv preprint arXiv:1803.04431.

Weng et al. (2008)
[author] Weng, L. T.L. T., Xu, Y.Y., Li, Y.Y. Nayak, R.R. (2008). Exploiting Item Taxonomy for Solving ColdStart Problem in Recommendation Making. In 2008 20th IEEE International Conference on Tools with Artificial Intelligence 2 113120. 10.1109/ICTAI.2008.97
 Zhang, Cao and Yeung (2010) [author] Zhang, YuY., Cao, BinB. Yeung, DitYanD.Y. (2010). Multidomain Collaborative Filtering. In Proceedings of the TwentySixth Conference on Uncertainty in Artificial Intelligence. UAI’10 725–732. AUAI Press, Arlington, Virginia, United States.
 Zhang et al. (2016) [author] Zhang, XianXingX., Zhou, YitongY., Ma, YimingY., Chen, BeeChungB.C., Zhang, LiangL. Agarwal, DeepakD. (2016). GLMix: Generalized Linear Mixed Models For LargeScale Response Prediction. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16 363–372. ACM, New York, NY, USA. 10.1145/2939672.2939684
 Ziegler et al. (2005) [author] Ziegler, CaiNicolasC.N., McNee, Sean M.S. M., Konstan, Joseph A.J. A. Lausen, GeorgG. (2005). Improving Recommendation Lists Through Topic Diversification. In Proceedings of the 14th international conference on World Wide Web 22–32. ACM.
Comments
There are no comments yet.