The problem of imputing missing values into possibly noisy matrices and tensors has wide-ranging applications. In missing data problems, practitioners use imputation to overcome the hurdle of non-response in service of building statistical models(rubin2004multiple; sengupta2017; sportisse2018imputation)
. In machine learning, various problems can be cast as missing data problems including collaborative filtering(keshavan2010matrix; su2009survey), as well as image and video denoising, reconstruction, and classification (yuan2018higher; ji2010robust).
Imputation requires some assumption about the relationship between the missing values and the values which are observed. By assuming that the matrix or tensor is low-rank, we can specify the missing values using only a small number of observations. We will begin by discussing matrices, and later generalize the discussion to tensors.
If the rank of the matrix is known, completing the matrix is a more straightforward problem (geweke1996bayesian). However, if the rank is unknown, simultaneously finding a suitable rank and completing the matrix is a non-convex and computationally complex problem (candes2010matrix; candes2009exact). A predominant frequentist approach yields a convex optimization problem by relaxing the low-rank assumption into a constraint on the nuclear norm of the matrix (recht2011simpler; candes2010matrix).
Meanwhile Bayesian methods use prior distributional information to encourage approximate rank sparsity in the matrix a posteriori. Let be an rank matrix with missing entries. can be decomposed into “component” matrices, () and () where , such that . The Bayesian approach to adaptive rank estimation is to set and place priors distributions on and which encourage posterior estimates of entire columns to be exactly or approximately .
The most common prior formulation (babacan2011low; alquier2014bayesian; song2016bayesian) used to induce near-sparsity in the columns of the component matrices is the normal-gamma prior, here defined as:
where for . Here
is the gamma distribution parametrized by shape and rate, respectively. By settingto be large, we can put posterior weight near for each encouraging sparse solutions.
For a solution which is unique up a factor of on the columns of the component matrices, one can control the length and orthogonality of the columns of the component matrices by restricting them to Stiefel manifolds (hoff2016equivariant; hoff2007model). However, it is computationally expensive to do so, because it introduces complex dependencies a posteriori between the all of the entries of the component matrices. By contrast, the unconstrained normal-gamma formulation does not specify unique posterior component matrices, but affords highly efficient block Gibbs sampling due to posterior row-wise independence in the component matrices. Uniquely specified component matrices are not always necessary as long as is estimated faithfully, and length and orthogonality constraints do not seem to bolster the estimation of in practice (song2016bayesian; alquier2015bayesian).
This problem of selecting a small number of columns to remain non-null is similar to the problem of variable selection in linear regression. In that setting, choosing the best subset of variables to include in a model is computationally complex. A classic solution is the Lasso method, in which we relax the subset-cardinality constraint into anconstraint (tibshirani1996regression). A Bayesian analogue to the Lasso is the application of the Laplace prior to model parameters (park2008bayesian). In this model, the maximum-a-posteriori (MAP) parameter estimates coincide with the estimates produced by the Lasso.
However, newly developed prior distributions have proven more effective in sparse regression than the Laplace prior, which fails to simultaneously induce sparsity while efficiently recovering non-null parameters (van2016conditions). Among these are “global-local priors” including the Horseshoe prior (carvalho2010horseshoe), the Horseshoe+ prior (bhadra2017horseshoe+), and most recently the Inverse-Gamma Gamma (IGG) prior (bai2017inverse).
Returning to the matrix completion problem, we see there is a connection between the normal-gamma prior and the Lasso. The Laplace distribution is an exponential scale-mixture of Gaussians, which is a special case of the normal-gamma prior. Indeed, MAP estimation in the normal-gamma case is equivalent to using a group-Lasso penalty on the columns of the component matrices (alquier2014bayesian). Having developed an instinctive aversion to all things Lasso and -penalized, we will instead adapt these global-local priors to the group-case and use them to model the columns of the component matrices and .
One advantage to these global-local priors is that they are tuning-parameter free. As we will demonstrate, the results using the normal-gamma prior are quite sensitive to the choice of , which corresponds to the tuning parameter in the Lasso. This parameter ought to specify the level of sparsity, which is usually unknown a priori. Learning the global shrinkage parameter with these global local priors avoids the need to use cross validation. Furthermore, the global-local priors result in sparser solutions and better estimates than the normal-gamma prior, both in simulations of low-rank incomplete matrices and in real data.
In this paper, We use the following notational conventions unless otherwise noted. Vectors are indicated by bolded lowercase letters and symbols, such asand . Matrices and tensors are indicated by bolded uppercase letters and symbols such as and . refers to a column vector formed by the entries of the th row of the matrix , and refers to a column vector formed by the entries of the th column of the matrix . refers to the column vector formed using the entries of a set of scalars, . For a length-n vector of indices , is Furthermore, refers to the matrix with th column , and refers to the matrix with th row .
For two column vectors and length m and n respectively, is their outer product, . By extension, for several vectors with lengths respectively, is the mode-K tensor such that
For two matrices of the same dimensions, and , indicates an element-wise matrix product.
We useto denote the K-dimensional multivariate normal distribution parametrized by mean and dispersion.
is the Cauchy distribution truncated to the positive part, parametrized by location and scale.is the gamma distribution parametrized by shape and rate. is the inverse gamma distribution parametrized by shape and scale.
is the inverse Gaussian distribution parametrized by mean and shape, andis the generalized inverse Gaussian distribution parametrized as in johnson1994continuous.
2 Rank-Sparse Matrix Completion with Global-Local Priors
Let be an random matrix with a potentially large number of missing entries. Le We will complete by assuming that it is the sum of a centered mean matrix, row and column intercepts, an overall intercept, and Gaussian noise, that is,
where is an mean matrix, and are vectors with lengths and respectively such that and , is a scalar, and is an error matrix with
We also assume, the rank of , , thus there exist (non-unique) and component matrices and such that
2.1 Prior Formulation
where and are and augmented component matrices, with columns. We choose so that we can assign continuous shrinkage priors to the columns of and to encourage all but columns to become small. We formulate the mixture priors as
, and to the the row and column intercept vectors and in (1) we assign independent improper uniform priors.
We can express all of the prior formulations we will use in this paper in terms of the distribution of for
as well as some fixed hyperparameters, , , , and . The standard priors are taken from alquier2014bayesian. Following alquier2014bayesian, we set shape parameter for the gamma prior to .
|Prior||Column Variance Distribution|
If the performance of a method is highly sensitive to the prior hyperparameters, these hyperparameters must be treated as tuning parameters. In the unlikely situation we know the true level of rank sparsity of the matrix to be completed, we can choose hyperparameters using this information. Otherwise, we will need to use a sample-data-dependent technique such as cross-validation. We find in simulations that the Gaussian and Gamma methods are sensitive to the choice of and , respectively.
However, the Horseshoe and Horseshoe+ methods contain no prior hyperparameters, as the hierarchical prior structure allows the mixture components of to adapt to the level of sparsity in the data. We therefore consider these methods “tuning parameter free,” and in practice practitioners can use these priors without first finding suitable hyperparameter values. The Gamma Inverse-Gamma prior has a similar property, as we will keep its prior hyperparameters constant. Due to its global-local construction, this posterior distributions corresponding to the Gamma Inverse-Gamma prior are not particularly sensitive to the choices of , , and .
2.2 Gibbs Sampler
We explore the posterior distribution of , , and using Gibbs sampling. Efficient block sampling is facilitated by the posterior independence of the rows of the component matrices, and . Define the sets of indices
Thus is a vector of column indices associated with non-missing entries in the th row of , and is a vector of row indices associated with non-missing entries in the th column of . Let be the diagonal matrix such that .
The Gibbs sampling algorithm is run by iteratively sampling each parameter or parameter block from its “full conditional distribution,” meaning its distribution conditional on the data, , as well as every other model parameter upon which it depends. For some model parameter , the expression indicates that is the full conditional distribution of . The full conditional distributions for the component matrices are:
with independence between rows of each component matrix. Therefore the rows of the component matrices can be sampled as blocks, even in parallel.
The full conditional distributions for the intercepts are:
Note that for large matrices, sampling from the constrained posterior distributions for and is computationally expensive, but in practice unnecessary. Sampling from the unconstrained Gaussians and stabilizing the sampling routine by recentering these vectors after each iteration achieves nearly identical results.
The full conditional distribution for the noise variance is:
Finally, only the way in which we sample the column variances depends on our choice of prior formulation.
Here the column variances are constant. Thus for all in .
Here, the gamma prior is a special case of the Generalized Inverse Gaussian distribution, which is conjugate to the Gaussian component column entries (johnson1994continuous). Given the careful choice of from alquier2014bayesian
, the posterior simplifies to a an Inverse Gaussian distributed random variable. Thus
Following makalic2016simple, we represent the positive-Cauchy-distributed variance terms as mixtures of inverse-gammas to facilitate sampling. For the Horseshoe formulation, a priori:
Thus a posteriori, the complete conditional distributions are:
For the Horseshoe+ prior we simply add in an extra pair of hierarchical variance terms, so we have
Thus a posteriori, the complete conditional distributions are:
Inverse-Gamma Gamma Priors
For the IGG formulation, the gamma distribution again enjoys conjugacy to the component column as a special case of the generalized inverse Gaussian distribution, and the inverse gamma distribution is itself conjugate to the component column. Therefore, a posteriori, the complete conditional distributions are:
As in the Gamma formulation, if we set , the generalized inverse Gaussian distribution posterior reduces to an inverse Gaussian posterior.
Using the representation of the positive-Cauchy distribution as an inverse-gamma mixture of inverse-gammas, we can see that all of these prior formulations are similar in that the component column variance terms are products of some number and combination of gamma and inverse-gamma distributed global (column-spanning) and local (column-specific) factors.
3 Extension to Tensor Completion
In the case of multiway data, we can easily extend this methodology to tensors. Many frequentist and Bayesian solutions rely on the CANDECOMP/PARAFAC or “CP” tensor decomposition (kolda2009tensor; zhao2016bayesian; bazerque2013rank). We notice that by extending the matrix formulation by adding additional component matrices, we achieve a CP-like decomposition, although, as in the matrix case, we will not enforce the orthogonality of the columns of the component matrices.
Suppose that is a -dimensional tensor with a large proportion of missing values. Now
Again, suppose the tensor is observed with Gaussian noise, thus
where is an mean matrix and is an error matrix with For notational simplicity, we will assume is centered and forego intercept terms in (3). However, the incorporation of intercepts corresponding to any dimensions is straightforward and analogous to the matrix case in 1.
The rank of , , thus there exist (non-unique) component matrices with dimensions such that
Again, we aim to learn the true rank of in (4). We will augment the component matrices to rows such that
where have dimensions . Now
independent across columns and component matrices. The prior formulations for are identical to the matrix case; see to Table 1. As before,
3.1 Gibbs Sampler
Again, efficient block sampling is facilitated by the posterior independence of the rows of the component matrices, . Define the indices
Thus is a vector of indices along dimension corresponding to the set of entries where . Now let
where is an elementwise matrix product. Let . Again, let be the diagonal matrix such that .
Thus, the full conditional distributions can be expressed as
for and . Furthermore,
All of the posterior sampling distributions from the component column variances can be preserved from the matrix case with only slight modification. Those terms which depend on columns of the component matrices and in the matrix case must be extended to depend on the corresponding columns from all of the component matrices in the tensor case. For example, the full conditional distributions in the case of the horseshoe formulation are:
4 Simulation Studies
In the following simulation studies, we compare the performance of the proposed global-local formulations to the Gaussian and Gamma priors. In the 2-dimensional (matrix) case, we also compare our results to a popular frequentist matrix completion algorithm, softImpute (hastie2015matrix), which is based on nuclear norm regularization. In each simulation, we generate random low-rank matrices and tensors using the following procedure. Given the desired dimensions , true rank and scalar column variances , , we generate component matrices in which each entry is an independent draw from . The simulated mean matrix or tensor is then
to which we add Gaussian noise, thus
We then select a proportion of entries to retain uniformly at random, with the condition that at least one entry from each row and column (and slice, etc.) remain.
The softImpute algorithm requires a tuning parameter, , which specifies the desired level of rank-sparsity. The Gamma formulation requires a prior hyperparameter, , which serves much the same function. In these simulations, we will be unusually charitable towards these algorithms by training their tuning parameters using oracle information for each combination of test parameters.
The Gaussian and IGG formulations also require prior hyperparameters, but we find ther results of these algorithms are not particularly sensitive to the choice of hyperparameters. For the Gaussian case, it is necessarily only to choose to be sufficiently large. In the IGG case, following bai2017inverse, we set , , and .
4.1 Changing the rank-sparsity
Here we study the performance of the various algorithms under changing levels of rank-sparsity.
4.1.1 Matrix Case
Here we simulate matrices, setting and for and keeping of observations.
For the Bayesian algorithms, we set . We also set the max rank setting in softImpute to . We attain Bayes estimates from the Gibbs samplers using 100 samples with a thinning factor of 5 after a burn-in of 500 iterations.
For a given estimate of , denoted
We calculate the standard error as:
The standard errors presented are averaged over 100 trials.
We find that even when competing algorithms are tuned with oracle information, the global-local priors still achieve the lowest SE, especially at high levels of sparsity. All of the Bayesian methods outperform softImpute at high levels of sparsity, but in high-dimensional cases where there is very little information about the missing entries, softImpute is competitive.
4.1.2 Order-3 Tensor Case
We find similar results for order-3 tensors. The tensors we generate have , for and dimensions , keeping only of the entries. For the Bayesian algorithms, we use the same settings as above and tune the gamma prior using oracle information.
Similar to above, the standard error is:
The standard errors presented are averaged over 100 iterations.
Again, we find that even when the gamma algorithm is tuned with oracle information, the global-local priors still achieve the same or lower SE, especially at high levels of sparsity.
4.2 Changing the proportion of missing entries
In this experiment, we vary the level of missing observations in matrices while keeping the true rank fixed at with column variances for . As above, for the Bayesian algorithms, we set . We also set the max rank setting in softImpute to . We attain Bayes estimates from the Gibbs samplers using 100 samples with a thinning factor of 5 after a burn-in of 500 iterations. Again, the standard errors are averaged over 100 iterations.
Again, the Horseshoe+ prior formulation generally achieves the best performance, although it is very similar to the performance of the Horseshoe. The Gamma and Gaussian priors perform particularly poorly when there are extremely few observations. The sparsity inducing structures in the global-local prior formulations seem essential in rank-sparse scenarios when there are too few observations for the standard Gaussian formulation to approximate the column variances using the observed data. Here, the performance of the softImpute algorithm largely mimics the performance of the Gaussian algorithm.
4.3 Rank-Sparsity Recovery
Here we showcase the tendency of the global-local Bayesian algorithms to faithfully identify the true underlying dimensionality of the matrix to be completed. We simulate matrices with rank and set column variances Here we retain only of the observations. This scree plot (with values averaged over 100 iterations) demonstrates the the fact that the output of the Bayesian algorithms concentrates more of the information about the signal in the first five singular vectors. The Gaussian algorithm and softImpute disperse more of the information over the full allotment of dimensions. Thus in rank-sparse cases, these sparser formulations may help more accurately identify the true underlying rank. (We omit the Horseshoe, Gamma, and IGG algorithm from this plot because their results nearly entirely overlap the Horseshoe+ results).
By examining Figure 1 the singular values resulting from the Horseshoe+ algorithm, we can easily identify the underlying rank-5 structure. However, we cannot identify this structure from examining the singular values of either the Gaussian or softImpute results.
5 Matrix Completion Application: Board Game Ratings
The preeminant website for hobby board game enthusiasts is www.boardgamegeek.com. The site catalogs information about every board game released, and has over one million registered users, who rate games on a continuous scale between 1 and 10. We compiled a data set of votes on the top 200 games by 5,000 randomly selected users who have rated least 5 of these games. (Users who had suspiciously rated over 200 games on the site were excluded). We represent this as a vote matrix with each row representing a user and each column representing a game.
Here we compare the relative performance of various matrix completion algorithms on this real data. We randomly select 20% of users and hold out one of each of their ratings to form a test set, . For the Gamma and softImpute algorithm, we use multiple values of and respectively. For each algorithm, we calculate the standard error of its predictions as:
where player ’s actual rating on game and is player ’s predicted rating on game .
We also compare the percentage of the total variation in rankings explained by the various models. We define this percentage as
where is the average rating across users and games in the test set.
In order to predict user ratings, we include in each of these models row-wise intercepts representing users’ levels of generosity, columnwise intercepts representing games’ overall levels of quality, and and overall intercept.
|Params||Test SE||% Explained|
|Gaussian||= 10||1.301||17.7 %|
The relative performances of the various algorithms are largely consistent with the simulations, but in this case the global local priors perform only slightly better than softImpute, as long as we choose a suitable correct tuning parameter. It seems the greatest advantage here to the Horseshoe, Horseshoe+ and IGG formulations over softImpute are the lack of tuning parameters, indicating their convenience as default methods. The Gamma formulation seems to be more robust to its hyperparameter in this situation, but still worse performing overall.
As long as we are modeling this user-game vote matrix in terms of its decomposition into a de-facto user and game component matrices, we cannot resist conducting an exploratory factor analysis. This decomposition models user votes as depending on up to latent factors, which may be recognizable as game properties. After normalizing the columns of the component matrices and , we can interpret each as the affinity of user for game property , and we can interpret each as the degree to which game exhibits property .
When we position games according to their first two factor scores, and for familiar patterns emerge.
We notice in Figure 2 that the coordinates seem to capture two aspects of games which are generally considered important by hobby gamers. Towards the bottom left are “lighter” games - they have shorter play time and easier to understand rules. To the top right of the horizontal axis are “heavier” games - these games appeal to “core” gamers who don’t mind learning complex rule-sets and playing for many hours. Orthogonal to this direction, we find an axis that seems to correspond to a game’s thematic content and flavor. To the bottom right, we find games with exciting and immersive themes and settings, and to the top left we find games with subtler and more traditionally European themes and settings.
The furthest game to the bottom left on the horizontal axis is Seiji Kanai’s Love Letter, a card game which only has a handful of different cards and only lasts a few minutes. Love Letter has a “Complexity” rating on boardgamegeek.com of 1.20/5, an average of community votes. On the lower left, we also find Uwe Rosenberg’s Bohnanza, Marc André’s Splendor, and Masao Suganuma’s Machi Koro, with complexity ratings of 1.67, 1.82, and 1.55 respectively. These games are all known for the light rule sets which make them suitable “gateway games” for newcomers to the hobby.
Meanwhile, to the top right we find notoriously formidable and lengthy games like Helge Ostertag’s Terra Mystica, Vlaada Chvátil’s Through the Ages, and Chad Jensen’s Dominant Species
, with complexity ratings of 3.95, 4.17, and 4.03 respectively. The seeming outlier to the right of the image plot is Vlaada Chvatil’sMage Knight, a game notorious for it’s long playtime and complex movement and fighting mechanisms. Mage Knight has a weight rating of 4.26 out of 5.
Looking to the bottom right we find highly thematic games like Jonathan Gilmour and Isaac Vega’s Dead of Winter, a zombie survival game, Fantasy Flight’s Star Wars Rebellion, a game attached to an expensive intellectual property, and Mansions of Madness, also by Fantasy Flight, one of many games set in a world of monsters inspired by H.P. Lovecraft. By contrast, at the top left are games like Andreas Seyfarth’s Puerto Rico, Uwe Rosenberg’s Agricola, and Bernd Brunnhofer’s Saint Petersburg. These games all take place in historical settings, in which players peacefully trade goods to maximize their economic returns. Hobby board game players sometimes call this style of game the “Eurogame,” as this type of economic resource-management game first become popular in Germany in the 1980s and 1990s, when Klaus Teuber’s Settlers of Catan helped bring about the modern renaissance in hobby games (woods2012eurogames). Various community members, depending on their preferences, are known either to heavily anticipate or sneer at “yet another game about trading spices in the Mediterranean.”
We present these interpretations with due modesty considering that the total amount of variation in ratings explained by all of the factors combined is less than - it seems that either the noise in user ratings is much greater than the signal, or else the signal has a structure that cannot be fully captured by this bilinear model. Furthermore, the positions of the various games along the third principal component score axis and beyond elude our interpretation.
6 Tensor Completion Application: Color Image Recovery
In this section, we test out the tensor completion functionality of our proposed algorithms. We use the representation of color images as tensors; is the size in pixels of the two-dimensional image, and there is one matrix slice for each of the red, green and blue color channels. By applying the low-rank tensor completion algorithm with intercepts for each dimension we can reasonably reconstruct the images even while most of the pixels are missing.
In Table 6, we experiment with five different images (all covers of various classic board games) with different percentages of pixels removed uniformly at random. For images with few missing pixels, the output for each algorithm is visually indistinguishable. However, when a high percentage of pixels are missing, the Gaussian algorithm performs poorly, and the Gamma algorithm depends heavily on the choice of hyperparameter. This highlights the advantage of the global-local algorithms as out-of-the-box solutions.
In Table 7, we break down the outputs for one particular image (the cover of the board game Acquire) into low rank reconstructions. We find the r-rank representations of the completed tensors by computing the 3-mode tensor Single Value Decomposition (SVD) of the form
where and are orthogonal 3-mode tensors, and is a 3-mode tensor whose matrix faces are diagonal matrices (kilmer2013third; li2018package). We then set the th entry of to 0 when .
From this breakdown we can see that an accurate reconstruction requires information from a large number of principal components - in this sense the images are themselves not particularly low-rank. Thus we may not expect substantial gains in performance from sparse methods.
In this paper, we have presented three new prior formulations for modelling the components of matrix and tensor decompositions. These models outperform previous algorithms in the task of matrix and tensor completion in simulations and in real data examples. In addition, these algorithms require no tuning parameters, and thus provide convenient out-of-the-box solutions for matrix and tensor completion which automatically adapt to the underlying level of rank-sparsity in the data.
Beyond collaborative filtering and image completion, these factorization techniques can be directly applied to modify existing missing data imputation techniques based on matrix factorization (sengupta2017). The framework in this paper assumes there is no systematic relationship between missingness and response, and thus is directly applicable in the Missing Completely at Random setting (rubin2004multiple). Therefore, it would be interesting to extend these algorithms to account for data Missing Not at Random, as in sportisse2018imputation.
Although in this paper we have focused on the completion of matrices and tensors, global-local priors also provide a flexible structure for automatic rank-detection in low-rank matrix and tensor decomposition problems in general. By elaborating on the structure of the priors on we can also encourage sparsity in the rows of the component columns and even the individual entries; see zhang2018abacus for an application of such an extension to the problem of compression and change-point detection in multivariate time series.
Most importantly, the connection between Bayesian variable selection in linear models and Bayesian low-rank matrix decomposition may point in additional promising directions. For example, in contrast to “one-group” priors including the global-local priors discussed in this paper, “two-group” or “spike and slab” priors (mitchell1988bayesian) may be considered for the singular values of the response matrix. In particular, Spike and slab priors using “non-local” slab-priors (rossell2017nonlocal) have increased power in finding null parameters in regression models, and thus may be useful in nullifying singular values. A wealth of results in sparse Bayesian linear models may yield analogous successes for low-rank models models in general.