I Introduction
Matrix completion is a fundamental problem in machine learning and signal processing, with a wide range of applications, from recommender systems [1, 2] to other domains [3, 4]. The problem is defined as follows: given a partially observed matrix $M \in \mathbb{R}^{m \times n}$ and the set $\Omega$ of indices of its known entries, recover the unknown entries of $M$. This can be achieved by solving for a dense prediction matrix $R \in \mathbb{R}^{m \times n}$ such that:

$$\min_{R} \; \left\| \mathcal{P}_\Omega(R - M) \right\|_F, \tag{1}$$

with $\mathcal{P}_\Omega$ an operator that selects the entries indexed by $\Omega$, and $\|\cdot\|_F$ the Frobenius norm. Among the most notable applications of matrix completion is collaborative filtering in recommender systems, where $M$ is the rating matrix with its rows and columns corresponding to users and items, and an entry $M_{ij}$ representing the rating or interaction between user $i$ and item $j$. Only a small number of entries are typically observed, since an average user rates only a small portion of the items. Due to this scarcity of observed entries, predicting the missing entries is highly challenging.
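As a minimal illustration of the objective in Problem (1), the sketch below evaluates the Frobenius norm of the prediction error restricted to the known entries. The function name, the list-of-lists matrix layout, and the toy values are assumptions made for this example only.

```python
import math

def masked_frobenius_error(prediction, observed, known_indices):
    """Frobenius norm of (prediction - observed), restricted to the known entries."""
    return math.sqrt(sum((prediction[i][j] - observed[i][j]) ** 2
                         for i, j in known_indices))

# Toy 2 x 2 example: only the diagonal entries are observed.
observed = [[5.0, 0.0], [0.0, 3.0]]    # zeros are placeholders for unknown entries
prediction = [[4.0, 2.0], [1.0, 3.0]]
known = [(0, 0), (1, 1)]
error = masked_frobenius_error(prediction, observed, known)
```

Entries outside the known index set do not contribute to the error, which is why the objective alone cannot determine them and additional structure is needed.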
Most existing matrix completion methods follow either a memory-based or a model-based approach [5]. Memory-based methods, also known as nearest-neighbor (NN) methods, predict the missing ratings by utilizing the relationships among the users and/or items: a missing rating of a user for an item is predicted using the known ratings of similar users for the same item (user-based method) or the known ratings of the same user for similar items (item-based method) [5, 6, 7]. A critical step in such methods is estimating how similar users or items are. This similarity is often estimated by evaluating predefined metrics, such as the cosine similarity or the Pearson correlation [6], on the common known entries. Hybrid memory-based methods fuse the user-based and item-based views [8, 9, 10], leading to more reliable predictions [9]. Memory-based models, in general, rely on a subset of the available information [5], and can be unreliable due to the data scarcity problem [9].

Model-based methods, on the other hand, predict the missing entries by regularizing Problem (1) using functions that impose underlying low-complexity characteristics on the data, i.e., low-rank [11, 12, 13], low-rank-plus-sparse [14, 15], or non-negativity characteristics [16, 17]. By solving the regularized optimization problems, such models learn latent representations from the data and often provide more accurate predictions than memory-based methods [5]. Recently, deep-learning-based methods, such as deep autoencoders [18, 19, 20, 21] and deep matrix factorization models [22, 23], learn nonlinear latent representations, leading to high performance. However, the low-complexity characteristics imposed by such approaches might not be present or might underfit the underlying structure in the data, resulting in performance loss.

Various methods have been proposed to combine the best of both the memory-based and model-based approaches. In this direction, geometric matrix completion (GMC) has lately received a lot of attention [24, 25, 26]. GMC refers to model-based methods that leverage the relationships among the set of users and items when making predictions [24, 25, 27, 28]. These relationships are often represented in the form of graphs, called structure graphs, with nodes representing users, items or entries, and edges encoding their similarities. Several methods leverage Markov random fields (MRF) to encode the relationships between the matrix entries [29, 30, 31, 32]. In these methods, the structure graphs were incorporated into the estimation of the MRF potentials. MRF-based methods can be highly flexible, in the sense that they can learn the structure graphs directly from the data and do not require specifying the edge weights beforehand [31]. Nevertheless, they rely on handcrafted features for estimating the potentials. Recently, several methods have been proposed to leverage graph deep learning techniques to learn the features from the data while utilizing the structure graphs [33, 34, 35]. These methods have achieved promising performance on various benchmark datasets [33, 34, 35]. However, they require fully defined user and item graphs to operate on. The user and item graphs were often built using predefined similarity metrics [27], which become unreliable in case of high data scarcity [9], or using side information [33, 35], which is not always available.
In this paper, we focus on the challenge of completing matrices from very few observations, without assuming access to side information (for example, user or item profiles). We propose a geometric matrix completion model that (i) leverages a deep neural network architecture to learn the latent representations in the data, and (ii) learns the structure graphs and the relationships between entries directly from the data. We consider matrix completion as a structured prediction problem in a conditional random field (CRF), which is characterized by a maximum a posteriori (MAP) inference problem. We employ the mean-field algorithm to approximately solve this MAP inference, and propose a mechanism to unfold the algorithm into neural network layers. This unfolding mechanism allows us to incorporate the mean-field inference on top of a deep neural network, resulting in the proposed deep conditional random fields model for matrix completion (DCMC). As such, the proposed model simultaneously carries the advantages of different state-of-the-art approaches: it learns the latent features in the data (advantage of deep learning); it learns the structure graph from the data (advantage of MRF-based matrix completion models).

Our main contributions are as follows:

We propose a deep CRF model for matrix completion, which simultaneously computes the CRF potentials, estimates the relationships between entries and performs mean-field inference in each forward pass. The proposed model can be trained end-to-end using only the known matrix entries.

We propose a method to supervise the learning of the similarities between entries by utilizing the known matrix entries. Using this method, the model effectively learns the structure graphs from the available data.

We perform comprehensive experiments on well-established benchmark datasets, which demonstrate (i) the gain in prediction accuracy that the proposed DCMC model brings over various state-of-the-art models, and (ii) the effectiveness of the learned similarities compared to those estimated using predefined metrics. The results corroborate that the improvements are more pronounced on datasets with very few observed entries.
The remainder of the paper is organized as follows. In Section II, we review the related work, and in Section III, we present our formulation of matrix completion as a MAP inference problem in CRF. In Section IV, we describe our deep geometric matrix completion model, and present the experimental settings and results in Section V. Finally, we draw the conclusion in Section VI.
II Related Work
II-A Hybrid Memory-based Matrix Completion
Memory-based, or NN, methods predict a missing rating $M_{ij}$ by (weighted) averaging the values of the entries most similar to it from a set of potential predicting entries. User-based and item-based NN methods consider potential predicting entries along the $i$-th row and the $j$-th column of the matrix $M$. An illustration of the relationships between matrix entries is given in Fig. 1.
With limited known entries, fewer potential predicting entries are available; as a result, the predictions of NN methods become unreliable [9]. To mitigate this problem, hybrid memory-based methods unify the user-based and item-based views to enrich the set of potential predicting ratings [9, 8]. In [9], for example, in order to predict a missing rating $M_{ij}$, the ratings of user $i$ for the other items (ratings along the $i$-th row), the ratings of the other users for item $j$ (ratings along the $j$-th column), and the ratings of other users for other items are considered.
In this work, we follow a model-based approach and propose a deep neural network which simultaneously (i) finds the latent factors in the data, and (ii) learns and leverages the relationships among the matrix entries. Our model incorporates the advantages of the hybrid memory-based methods [9, 8], as it considers all entries as potential predicting entries when making predictions; yet, it does not rely on predefined similarity metrics (e.g., the Pearson correlation), which become unreliable in case of high data scarcity [9].
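For concreteness, the following sketch shows a plain user-based NN predictor of the kind discussed above: the similarity between two users is the cosine similarity evaluated on their common known ratings, and the missing rating is a similarity-weighted average. It is an illustrative baseline under assumed conventions (unknown ratings stored as `None`, all names hypothetical), not the method proposed in this paper.

```python
import math

def cosine_on_common(a, b):
    """Cosine similarity over the positions where both rating vectors are known."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 0.0
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def predict_user_based(ratings, user, item):
    """Similarity-weighted average of the other users' known ratings for `item`."""
    numerator = denominator = 0.0
    for other, row in enumerate(ratings):
        if other == user or row[item] is None:
            continue
        weight = cosine_on_common(ratings[user], row)
        numerator += weight * row[item]
        denominator += abs(weight)
    # None signals that no usable neighbor exists for this entry.
    return numerator / denominator if denominator > 0 else None
```

With few known entries, the set of common ratings in `cosine_on_common` shrinks quickly, which is one way to see why predefined metrics become unreliable under high data scarcity.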
II-B Geometric Matrix Completion
Geometric matrix completion (GMC) methods incorporate structure graphs, which capture relationships between users, items or entries, into the prediction model. Structure graphs were used to regularize prediction models by enforcing a smoothness constraint on the latent user factors [24], on the rows and columns of the dense prediction matrix [27] or on both the latent user and item factors [28] via graph regularization techniques.
With the goal of exploiting the structure graphs, several GMC methods utilize random field models, which are powerful in modeling the dependencies between random variables. The Preference Network [29], for example, used a Markov random field (MRF) model with nodes representing entries and edges encoding their relationships. This model was later extended to handle ordinal rating values [30] and relative preferences, i.e., item rankings [32]. Alternatively, the item field model [36] built an MRF model on top of the item graphs. These models, however, require the structure graphs to be defined before constructing the models. Tran et al. alleviated this requirement and proposed an MRF-based model in which the edges in the graphs were parameterized and learned from the data [31]. Nevertheless, these MRF-based methods rely on handcrafted features to compute the potentials of the random fields.

More recent methods leverage geometric deep learning techniques [37] to learn the latent features using the structure graphs. Monti et al. proposed a multi-graph convolutional neural network to capture the spatial features from the user and item graphs [33], which were built using side information (e.g., information obtained from user profiles). Berg et al. employed a bipartite graph, constructed from the original matrix, with nodes corresponding to users and items, and edges corresponding to the known ratings. They proposed to use a convolutional graph encoder to encode the nodes into latent factors [34]. Wu et al. constructed the user and item graphs from side information (similar to [33]) and employed graph convolutional neural networks [38] to learn the latent factors. Geometric-deep-learning-based methods have achieved high performance on several benchmark datasets [33, 34, 35]; nevertheless, they require the structure graphs to be fully specified before training.

Our model belongs to the GMC category of models, in particular the deep-learning-based ones. Unlike existing deep-learning-based GMC models, our model learns the structure graph from the known matrix entries and incorporates it in a conditional random field. Unlike existing random-field-based models, our model learns the latent features of the data by leveraging a deep neural network architecture. Furthermore, it utilizes all relationships among all available entries, instead of using only the relationships between entries that share common users or items (as MRF-based methods [31] do).
II-C Deep Random Fields Models
Random field models have been successfully applied to solve various problems in natural language processing (NLP) [39, 40] and computer vision [41, 42]. Traditional methods followed a two-stage pipeline in which the random field models are used in a separate post-processing stage to enforce smoothness over the outputs of dependent nodes, given the potentials estimated at the first stage. Recent studies in the computer vision domain have shown that combining random fields and deep neural networks into a joint model can significantly boost the performance [43, 44, 45].

We draw inspiration from these models, and build a deep conditional random field (CRF) model for matrix completion by unfolding the inference step in the CRF into neural network layers. An inherent challenge that appears when applying this strategy in matrix completion is the lack of an explicit local neighborhood between matrix entries, as opposed to the neighborhood of pixels in visual data (e.g., the widely used 4-connected or 8-connected neighborhoods in images and videos). To overcome this challenge, we adopt a fully connected CRF where a node is connected to all the other nodes. This full connectivity requires us to develop a new unfolding mechanism for the inference in the CRF, and methods to learn reliable relationships among the nodes. To the best of our knowledge, this is the first work to successfully solve the matrix completion problem with a deep random field model.
III Matrix Completion as Structured Prediction in CRF
In this section, we formulate matrix completion as a MAP inference problem in a CRF and then describe the mean-field algorithm that can solve the specific inference problem.
Suppose for now that we have obtained the relationships between matrix entries. We describe in detail how we learn these relationships in Section IV-A2. Let us consider a CRF defined over an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes, with each node corresponding to an entry in the matrix $M$, and $\mathcal{E}$ is the set of edges whose weights encode the relationships between the nodes. In matrix completion, there is no explicit local neighborhood for an entry. For this reason, we opt to encode all the pairwise relationships between the nodes by making $\mathcal{G}$ fully connected. A downside of constructing $\mathcal{G}$ as a graph of entries is that the number of nodes becomes very large in applications involving high-dimensional matrices. We introduce several techniques to alleviate this problem within our model in Section IV-E and Section IV-F.
Denote by $N$ the number of nodes in $\mathcal{G}$, i.e., $N = |\mathcal{V}|$; for an $m \times n$ matrix, we have $N = mn$. A node $i \in \mathcal{V}$ is associated with a latent random variable $y_i$ representing the label (that is, the value) of the corresponding entry. The random variables $y_i$, $i = 1, \dots, N$, have domain $\mathcal{L}$. We consider discrete matrices, e.g., rating matrices; hence, $\mathcal{L}$ is a finite set with $|\mathcal{L}| = L$, where $L$ is the number of possible entry values. The edge between the nodes $i$ and $j$ encodes the statistical dependency between the random variables $y_i$ and $y_j$. In the rest of the paper, we refer to the nodes and the labels by their indices, namely $i \in \{1, \dots, N\}$ and $l \in \{1, \dots, L\}$.
We denote by $X$ the observations over the matrix $M$, i.e., the given entries, and by $Y$ a labeling operator that assigns to each node in the node set $\mathcal{V}$ a label in the label set $\mathcal{L}$. Each instantiation of $Y$ indicates a sequence of labels for the CRF’s nodes. By taking the labels for the missing entries from an instantiation of $Y$, one can complete the matrix $M$. Finding the best predictions for the missing entries’ values is, therefore, equivalent to finding the most probable instantiation of $Y$ given $X$. This procedure can be formulated as a MAP inference problem:

$$Y^\star = \arg\max_{Y} P(Y \mid X), \tag{2}$$

with $P(Y \mid X)$ the posterior in the CRF, which is given by:

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\{-E(Y \mid X)\}. \tag{3}$$

In (3), $Z(X)$ is the partition function ensuring a valid distribution, and $E(Y \mid X)$ is the energy of the CRF, which has the form:

$$E(Y \mid X) = \sum_{i \in \mathcal{V}} \psi_u(y_i) + \sum_{(i, j) \in \mathcal{E}} \psi_p(y_i, y_j). \tag{4}$$

The unary potential $\psi_u(y_i = l)$ in (4) measures the cost of assigning the label $l$ to the node $i$. This cost is computed for each node and each label. The computation of the unary potential can be done in a separate step before the CRF inference, e.g., by means of a prediction model. The pairwise potential $\psi_p(y_i = l, y_j = l')$ measures the cost of assigning to nodes $i$ and $j$ the labels $l$ and $l'$, respectively. $\mathcal{E}$ is the set of all connected pairs in the CRF; in our model, $\mathcal{E}$ contains every pair of distinct nodes. Intuitively, $\psi_p$ encodes the relationship between the two corresponding entries. Unlike existing MRF-based models for matrix completion (e.g., [31]), where the pairwise potentials were only computed for pairs of entries of the same users or the same items, the pairwise potentials in our model are computed for all pairs of matrix entries. Furthermore, as shown in Section IV-A, both the unary and pairwise potentials of our CRF are computed using a deep neural network.
As exactly computing the posterior $P(Y \mid X)$ is intractable, we employ the mean-field algorithm to approximate the posterior [46]. In what follows, we briefly describe this algorithm, the steps of which are interpreted as neural network layers within our model (see Section IV-C). The mean-field algorithm approximates $P(Y \mid X)$ by a simpler proposal distribution $Q(Y)$ belonging to the family of fully factorized distributions:

$$Q(Y) = \prod_{i=1}^{N} Q_i(y_i), \tag{5}$$

where $Q_i$ is the distribution over the variable $y_i$ and $Q_i(y_i = l)$ is the probability of labeling the node $i$ with the label $l$ according to the distribution $Q_i$. Then, the algorithm [46] tries to find the proposal distribution $Q$ that is as close as possible to the target distribution $P(Y \mid X)$, where the closeness is measured via the Kullback-Leibler divergence $D_{\mathrm{KL}}(Q \,\|\, P)$. For brevity, let us denote $Q_i(y_i = l)$ as $q_{i,l}$; the mean-field algorithm [46] estimates $q_{i,l}$, for all $i$ and $l$, by minimizing $D_{\mathrm{KL}}(Q \,\|\, P)$ with respect to each $q_{i,l}$, subject to the constraint $\sum_{l} q_{i,l} = 1$. This is done by means of the following generic mean-field update equation [46]:

$$q_{i,l} = \frac{1}{Z_i} \exp\Big\{ -\psi_u(y_i = l) - \sum_{j \in \mathcal{N}_i} \sum_{l' \in \mathcal{L}} q_{j,l'} \, \psi_p(y_i = l, y_j = l') \Big\}, \tag{6}$$

with $\mathcal{N}_i$ the set of nodes connected to the node $i$, and $Z_i$ the normalization factor to make $Q_i$ a valid probability distribution:

$$Z_i = \sum_{l \in \mathcal{L}} \exp\Big\{ -\psi_u(y_i = l) - \sum_{j \in \mathcal{N}_i} \sum_{l' \in \mathcal{L}} q_{j,l'} \, \psi_p(y_i = l, y_j = l') \Big\}. \tag{7}$$

The mean-field algorithm iteratively updates $q_{i,l}$ according to (6), for all $i$ and $l$, for a certain number of iterations, or until a convergence condition has been reached, e.g., the changes in all $q_{i,l}$ fall below a small tolerance value. The result is the proposal distribution $Q$ that best approximates $P(Y \mid X)$. As $Q$ is fully factorized, the solution to the MAP problem (2) can be found by taking for each node the label that maximizes the marginal distribution $Q_i$.
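The generic procedure above can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: labels are indexed from 0, the marginals are initialized from the unary costs alone, all nodes are updated synchronously from the previous iterate, and a fixed number of iterations replaces the convergence test. The function names and the callable `pairwise` interface are hypothetical.

```python
import math

def _softmax(logits):
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_field(unary, pairwise, neighbors, num_iters=10):
    """Approximate the CRF posterior with fully factorized marginals.

    unary[i][a]          : cost of giving node i the label a
    pairwise(i, j, a, b) : cost of giving nodes i and j the labels a and b
    neighbors[i]         : nodes connected to node i
    """
    num_labels = len(unary[0])
    q = [_softmax([-cost for cost in row]) for row in unary]
    for _ in range(num_iters):
        new_q = []
        for i in range(len(unary)):
            logits = []
            for a in range(num_labels):
                # Expected pairwise cost of label a under the neighbors' marginals.
                message = sum(q[j][b] * pairwise(i, j, a, b)
                              for j in neighbors[i] for b in range(num_labels))
                logits.append(-unary[i][a] - message)
            new_q.append(_softmax(logits))  # normalization by Z_i
        q = new_q
    return q

def map_labels(q):
    """Pick, for each node, the label with the largest marginal probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]
```

On a two-node example with a disagreement penalty, a node whose own unary costs are nearly tied adopts the label of a confident neighbor after a few iterations, which is the smoothing effect that the pairwise term is meant to provide.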
IV Deep CRF for Matrix Completion
In this section, we first describe our deep neural network that simultaneously estimates the similarities between entries and computes the unary and pairwise potentials of the CRF. We refer to this neural network as the base prediction network. Using the computed potentials, we derive our final mean-field update equation. After that, we present a method to perform the mean-field update using specially designed neural network layers, which we call mean-field layers. Stacking these mean-field layers on top of the base prediction network forms our Deep CRF model for matrix completion (DCMC). The architecture of the DCMC model is illustrated in Fig. 2. At the end of the section, we present methods to efficiently train and make predictions with the proposed model, and to effectively supervise the learning of the similarity between entries.
IV-A The Base Prediction Network
The architecture of the base prediction network is depicted in blue in Fig. 2. This architecture is inspired by our previous deep matrix factorization models in [22, 23]. The base prediction network has two branches, called the row and column branches, which consist of a configurable number of fully connected layers, each followed by a batch normalization layer [47]. All layers, except for the last ones in each branch, are followed by the Rectified Linear Unit (ReLU) activation function [48] and dropout regularization [49]. The network takes as inputs a batch of row vectors and a batch of column vectors from the original matrix $M$. Similar to [18, 22], we impute the missing entries with a constant value. The row and column branches transform these input vectors into embeddings in a $d$-dimensional latent space: given the $i$-th row vector and the $j$-th column vector of $M$, the two branches produce two embeddings $u_i, v_j \in \mathbb{R}^{d}$, respectively. Using these embeddings, the score $s_{ij}^{(l)}$ for the entry at position $(i, j)$ and label $l$ is calculated via a bilinear decoder:

$$s_{ij}^{(l)} = u_i^{\top} W_l \, v_j, \tag{8}$$

with $W_l \in \mathbb{R}^{d \times d}$ learnable weights for the label $l$. As there are $L$ labels in $\mathcal{L}$, the bilinear decoder consists of $L d^2$ parameters in total.
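IV-A1 Label Probability
Using the predicted scores $s_{ij}^{(l)}$, the predicted probability $p_{ij}^{(l)}$ that the entry at position $(i, j)$ takes the label $l$ is computed using the softmax function:

$$p_{ij}^{(l)} = \frac{\exp\big(s_{ij}^{(l)}\big)}{\sum_{l' \in \mathcal{L}} \exp\big(s_{ij}^{(l')}\big)}. \tag{9}$$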
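The bilinear decoder and the softmax in (8) and (9) can be illustrated as follows. This is a sketch with hypothetical names: `weights` is assumed to hold one square matrix per label, and the embeddings are plain Python lists rather than outputs of the row and column branches.

```python
import math

def bilinear_scores(u, v, weights):
    """One score per label: the bilinear form of u and v under that label's matrix."""
    return [sum(u[a] * w[a][b] * v[b]
                for a in range(len(u)) for b in range(len(v)))
            for w in weights]

def softmax(scores):
    """Turn per-label scores into a probability distribution over labels."""
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example with a 2-dimensional latent space and two labels.
u = [1.0, 0.0]
v = [0.0, 1.0]
weights = [[[0.0, 1.0], [0.0, 0.0]],
           [[0.0, 2.0], [0.0, 0.0]]]
scores = bilinear_scores(u, v, weights)
probabilities = softmax(scores)
```

Keeping a separate weight matrix per label lets each label weigh the interaction between user and item factors differently, at the cost of a parameter count that grows linearly with the number of labels.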
IV-A2 Computing the Entry Similarity
It is worth noting that we focus on the cases where at least one of the two entries is unknown, since calculating the similarity between two known entries is trivial. We denote by $f_u$ and $f_v$, respectively, the functions that compute the user and item similarities: $f_u(i, i')$ computes the similarity between users $i$ and $i'$, and $f_v(j, j')$ computes the similarity between items $j$ and $j'$. As the cosine similarity has been proven effective and robust in measuring similarities between high-dimensional vectors in learned latent spaces [50], we define $f_u$ and $f_v$ as the cosine similarity between the embeddings produced by the base prediction network, namely,

$$f_u(i, i') = \frac{u_i^{\top} u_{i'}}{\|u_i\|_2 \, \|u_{i'}\|_2}, \qquad f_v(j, j') = \frac{v_j^{\top} v_{j'}}{\|v_j\|_2 \, \|v_{j'}\|_2}, \tag{10}$$

with $u_i$, $v_j$ the embeddings of user $i$ and item $j$.
We model the similarity between two entries as the product of the corresponding user and item similarities. With the assumption that $f_u$ and $f_v$ are nonzero, if two users have similar preferences [that is, $f_u(i, i')$ is high], their ratings for similar items [that is, $f_v(j, j')$ is high] should be similar. Conversely, if two users have dissimilar preferences [that is, $f_u(i, i')$ is low], they are not expected to have similar ratings. Denote by $f_e$ the function that computes the entry similarity; the similarity between two matrix entries $M_{ij}$ and $M_{i'j'}$, $(i, j) \neq (i', j')$, is given by

$$f_e(M_{ij}, M_{i'j'}) = f_u(i, i') \, f_v(j, j'). \tag{11}$$

Using $f_u$ and $f_v$, which are defined in (10), the entry similarity is computed as

$$f_e(M_{ij}, M_{i'j'}) = \frac{u_i^{\top} u_{i'}}{\|u_i\|_2 \, \|u_{i'}\|_2} \cdot \frac{v_j^{\top} v_{j'}}{\|v_j\|_2 \, \|v_{j'}\|_2}. \tag{12}$$

As the cosine similarity has a range of $[-1, 1]$, we linearly scale $f_u$ and $f_v$ so that they lie in $[0, 1]$. The entry similarity, then, is also in the range $[0, 1]$.
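A small sketch of the entry similarity in (12) follows. The linear scaling is assumed here to map the cosine similarity from the interval between -1 and 1 onto the interval between 0 and 1; that target range, the function names, and the requirement that embeddings are nonzero vectors are assumptions of this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def scaled_cosine(a, b):
    """Cosine similarity mapped linearly from [-1, 1] to [0, 1] (assumed target range)."""
    return 0.5 * (cosine(a, b) + 1.0)

def entry_similarity(user_a, user_b, item_a, item_b):
    """Product of the scaled user similarity and the scaled item similarity."""
    return scaled_cosine(user_a, user_b) * scaled_cosine(item_a, item_b)
```

Because the two factors are multiplied, a low similarity on either the user side or the item side is enough to make two entries dissimilar, which matches the intuition given in the text.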
IV-B Modeling the Unary and Pairwise Terms
We now present how we compute the unary and pairwise potentials using the outputs of the base prediction network.
IV-B1 The Unary Potentials
The unary potential $\psi_u(y_i = l)$ measures the cost of assigning the label $l$ to a node $i$. We use the negative log-likelihood to compute $\psi_u$: $\psi_u(y_i = l)$ will be high if, for the node $i$, the label $l$ has a low score, and vice versa. Specifically, suppose that the node $i$ corresponds to the entry at position $(r, c)$; then, the unary term is computed as

$$\psi_u(y_i = l) = -\log p_{rc}^{(l)}, \tag{13}$$

where $p_{rc}^{(l)}$ is the predicted probability of the label $l$ for the entry at position $(r, c)$, which is computed using (9).
IV-B2 The Pairwise Potentials
The pairwise potentials measure the label disagreement cost between pairs of nodes in the model. We compute the cost of assigning the labels $l$ and $l'$ to the nodes $i$ and $j$ as

$$\psi_p(y_i = l, y_j = l') = w \, s_{ij} \, \mu(l, l'), \tag{14}$$

with $w$ a hyperparameter determining the weight of the pairwise term with respect to the unary term and $s_{ij}$ the estimated similarity between the nodes $i$ and $j$. Here, the nodes $i, j$ respectively correspond to the entries at the positions $(r, c)$ and $(r', c')$ in the matrix $M$, and the similarity is estimated according to (12). In (14), $\mu(l, l')$ is a function that computes the compatibility between the labels $l, l'$, which is often referred to as the compatibility function in the random fields literature. There are many forms of $\mu$ that have been used for CRF models; in this work, we employ the truncated quadratic function:

$$\mu(l, l') = \min\big\{ (v_l - v_{l'})^2, \; \tau \big\}, \tag{15}$$

with $v_l$ the entry value associated with the label $l$ and $\tau$ a predefined truncation threshold.
It can be seen from (14) that the pairwise potentials depend on the learned entry similarity; as such, in our model, both the unary and pairwise potentials are computed from the learned latent features for the users and items, which are produced by the base prediction network.
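To make the pairwise term concrete, the sketch below builds the label compatibility values with a truncated quadratic function and combines them with a similarity and a weight. It assumes that the truncation is applied to the squared difference of the entry values; the function names and the example values are hypothetical.

```python
def truncated_quadratic(value_a, value_b, threshold):
    """Squared difference between two entry values, capped at `threshold`."""
    return min((value_a - value_b) ** 2, threshold)

def compatibility_matrix(label_values, threshold):
    """Compatibility between every pair of labels, computed once offline."""
    return [[truncated_quadratic(a, b, threshold) for b in label_values]
            for a in label_values]

def pairwise_potential(weight, similarity, value_a, value_b, threshold):
    """Disagreement cost scaled by the pairwise weight and the entry similarity."""
    return weight * similarity * truncated_quadratic(value_a, value_b, threshold)

# Five rating levels with an illustrative threshold of 4.
compat = compatibility_matrix([1, 2, 3, 4, 5], threshold=4)
```

The truncation keeps the penalty bounded, so two similar entries with very different labels are penalized no more than the threshold allows, while dissimilar entries (similarity close to zero) contribute almost no cost regardless of their labels.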
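IV-B3 The Final Mean-field Update
Substituting the unary and pairwise potentials in (13) and (14) into (6), we derive the final mean-field update equation for our model as

$$q_{i,l} = \frac{1}{Z_i} \exp\Big\{ \log p_{i,l} - w \sum_{l' \in \mathcal{L}} \sum_{j \in \mathcal{N}_i} s_{ij} \, \mu(l, l') \, q_{j,l'} \Big\}, \tag{16}$$

or equivalently:

$$q_{i,l} = \frac{1}{Z_i} \exp\Big\{ \log p_{i,l} - w \sum_{j \in \mathcal{N}_i} s_{ij} \sum_{l' \in \mathcal{L}} \mu(l, l') \, q_{j,l'} \Big\}, \tag{17}$$

where $q_{i,l}$ is the mean-field estimate of the probability of the label $l$ at the node $i$, $p_{i,l}$ is the probability of the label $l$ predicted by the base prediction network for the entry corresponding to the node $i$, $s_{ij}$ is the similarity between the nodes $i$ and $j$, $\mu$ is the compatibility function, $w$ is the weight of the pairwise term, and $Z_i$ is the normalization factor. We refer to the inner term $\sum_{l'} \mu(l, l') \, q_{j,l'}$ in (17) as the compatibility transform, and to the outer term $\sum_{j \in \mathcal{N}_i} s_{ij} (\cdot)$, which involves the summation over all the nodes connected to the node $i$, as the message passing operation.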
IV-C Unfolding the Mean-field Algorithm
Let us suppose for now that we process all the entries in the matrix simultaneously in a full batch, namely, we use all the rows and columns of the given matrix as the inputs to the base prediction network. The outputs of the base prediction network then consist of: (i) the label probability matrix, denoted by $Q \in \mathbb{R}^{N \times L}$, with $q_{i,l} = p_{rc}^{(l)}$, where $(r, c)$ is the location of the entry corresponding to node $i$; (ii) the learned entry similarity matrix $S \in \mathbb{R}^{N \times N}$, of which each element $s_{ij}$ is the predicted similarity between the entries corresponding to the nodes $i$ and $j$, $i \neq j$; and (iii) the matrix of the unary terms $\Psi_u \in \mathbb{R}^{N \times L}$. Since we build a fully connected CRF model, $S$ is a dense matrix. We denote by $C \in \mathbb{R}^{L \times L}$ the label compatibility matrix, each element $c_{l,l'}$ of which corresponds to the compatibility $\mu(l, l')$ between two labels $l, l'$. The matrix $C$ can be calculated offline according to (15) on the possible entry values. In what follows, we describe how we unfold the mean-field update, taking the matrices $Q$, $S$, $\Psi_u$, and $C$ as input.
IV-C1 The Compatibility Transform Step
The compatibility transform can be performed via a 1D convolutional layer applied on the matrix $Q$. This convolutional layer has $L$ filters of kernel size $1 \times L$ whose weights are determined from $C$ as follows: the weights of the $l$-th filter are fixed equal to the values along the $l$-th row of $C$. We do not employ any padding and set the stride to $1$. The $l$-th filter slides vertically across $Q$, and calculates the inner product between its weights and the rows of $Q$. The output of this layer, which is denoted as $\hat{Q} \in \mathbb{R}^{N \times L}$, is given by

$$\hat{Q} = \mathrm{conv}(Q; C), \tag{18}$$

where $\mathrm{conv}(\cdot\,; C)$ denotes the operation of a convolutional layer on the input with filters constructed from $C$ as described above. An element $\hat{q}_{j,l}$ with $j \in \{1, \dots, N\}$ and $l \in \{1, \dots, L\}$ is expressed as

$$\hat{q}_{j,l} = \sum_{l' \in \mathcal{L}} \mu(l, l') \, q_{j,l'}. \tag{19}$$
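IV-C2 The Message Passing Step
After multiplying the similarity matrix $S$ with the output $\hat{Q}$ of the compatibility transform, we get $\tilde{Q} = S \hat{Q}$, where an element $\tilde{q}_{i,l}$ is given by

$$\tilde{q}_{i,l} = \sum_{j \neq i} s_{ij} \, \hat{q}_{j,l}. \tag{20}$$

After expanding $\hat{q}_{j,l}$ according to (19), (20) becomes

$$\tilde{q}_{i,l} = \sum_{j \neq i} s_{ij} \sum_{l' \in \mathcal{L}} \mu(l, l') \, q_{j,l'}. \tag{21}$$

An illustration of this operation is given in Fig. 2(b). As our graph of entries is fully connected, the set of nodes connected to the node $i$ is given by $\mathcal{N}_i = \mathcal{V} \setminus \{i\}$. Therefore, $\tilde{Q}$ is the result of the message passing step in (17).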
IV-C3 The Mean-field Layer
After the compatibility transform and message passing steps, the remaining operations involved in one mean-field iteration can be performed straightforwardly. Algorithm 1 summarizes one iteration of the unfolded mean-field update. The step that adds the unary potentials involves element-wise products and element-wise additions, and the update and normalization step can be performed simultaneously for all the nodes and labels using the softmax function. We can group all operations in a mean-field iteration and consider them as a specially designed neural network layer, called a mean-field layer.
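One unfolded mean-field iteration can be sketched with plain matrix products, as below. This is an illustration under stated assumptions rather than the reference implementation: the compatibility transform is written as a matrix product instead of a 1D convolution (the two give the same result when the filters are the rows of the compatibility matrix), the similarity matrix is assumed to have a zero diagonal, and all names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mean_field_layer(q, unary, similarity, compat, weight):
    """One mean-field iteration over all nodes and labels at once.

    q          : N x L current label probabilities
    unary      : N x L unary costs (negative log-probabilities)
    similarity : N x N entry similarities, zero on the diagonal (assumed)
    compat     : L x L label compatibility values
    weight     : scalar weight of the pairwise term
    """
    # Compatibility transform: q_hat[j][l] = sum over l' of compat[l][l'] * q[j][l'].
    compat_t = [list(col) for col in zip(*compat)]
    q_hat = matmul(q, compat_t)
    # Message passing: aggregate the transformed probabilities of the other nodes.
    q_tilde = matmul(similarity, q_hat)
    # Add the unary term, then normalize each row with a softmax.
    new_q = []
    for unary_row, message_row in zip(unary, q_tilde):
        logits = [-u - weight * m for u, m in zip(unary_row, message_row)]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        new_q.append([e / total for e in exps])
    return new_q
```

Stacking several calls, each fed with the output of the previous one while reusing the same `unary`, `similarity`, `compat` and `weight`, corresponds to several mean-field iterations; with `weight` set to zero the layer returns its input probabilities unchanged when `unary` holds their negative logarithms.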
IV-D The DCMC Model
Using the techniques presented in Section IV-C1 and Section IV-C2, we can then unfold a fixed number of iterations of the mean-field algorithm into the same number of mean-field layers stacked on top of each other; namely, a subsequent layer takes the output from its preceding layer as input. All mean-field layers share the same set of parameters. This stack of mean-field layers (illustrated in green in Fig. 2) can then be put on top of the base prediction network (illustrated in blue in Fig. 2), forming our deep CRF model for matrix completion (DCMC). Each forward pass of the model involves computing entry similarities, estimating the CRF potentials and performing the mean-field updates. As all operations in a mean-field layer are differentiable, we can backpropagate the gradients of the loss function through each mean-field layer. This allows us to train the DCMC model using gradient descent algorithms in an end-to-end manner. It is worth noting that a mean-field layer does not introduce any additional free parameters to the model; hence, it does not increase the risk of overfitting of the final model.
Integrating the mean-field update on top of the prediction network allows training the prediction network with feedback from the mean-field layers. Intuitively, this allows the prediction network to learn to adapt to the mean-field inference. This is an advantage of the proposed model compared to using a two-stage method, which first runs the base prediction network to compute the potentials and then applies the mean-field algorithm.
IV-E Training the DCMC Model
So far, we have assumed working on the whole CRF model, with one node per matrix entry. Nevertheless, in applications involving big matrices, this becomes impractical due to the high computation and memory consumption. We employ two techniques to mitigate this problem: (i) during training, we consider only the known entries as nodes in the CRF instead of all the matrix entries; and (ii) we train our model in mini-batches.
In a training iteration $t$, we randomly sample a batch of rows and a batch of columns from the original matrix. When evaluating the loss function, we only take into account the observed entries among all the sampled entries. We denote this set of observed entries by $\Omega_t$, with $\Omega_t \subset \Omega$. It should be noted that $\Omega_t$ is different in each mini-batch. Similarly, we only consider the nodes corresponding to the observed entries in $\Omega_t$ when constructing the graph for the CRF. Implementation-wise, from the probability matrix $Q$, the matrix of unary terms $\Psi_u$, and the similarity matrix $S$ produced by the base prediction network, we select submatrices $Q^{(t)}$, $\Psi_u^{(t)}$ and $S^{(t)}$ using the indices of the observed entries. $Q^{(t)}$ is used as the input to the first mean-field layer, while $\Psi_u^{(t)}$ and $S^{(t)}$ are shared among all the mean-field layers.
Due to the mini-batch sampling, an entry only gets connected to other entries in the same mini-batch; hence, not all the relationships among the entries are utilized. To remedy this problem, we sample the row and column vectors according to an ordering and randomly shuffle this ordering after each epoch. By training for long enough, we expect to cover most of the relationships among the entries. In our experiments, we empirically observed that sampling only a subset of the rows and columns during training does not affect the performance of the model.
Loss Function
We employ the cross-entropy loss to train the DCMC model, which is calculated as

$$\mathcal{L}_{\mathrm{ce}} = -\frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} \log \bar{q}_{i, l_i^\star}, \tag{22}$$

with $\bar{Q}$ the final probability matrix after the last mean-field layer, and $\bar{q}_{i, l_i^\star}$ the probability of assigning to the node $i$ its ground-truth label $l_i^\star$.
Supervising the Similarity Learning
Given two entries with known values, we can straightforwardly calculate their similarity, which can be used as ground-truth data to supervise the similarity learning. We employ the Gaussian similarity function [51] to obtain the ground-truth similarities between the entries. This function is bounded in the range $(0, 1]$, which is desired by our similarity modeling in (12). The ground-truth similarity between the nodes $i$ and $j$, which correspond to the entries $M_{rc}$ and $M_{r'c'}$, is calculated by

$$s_{ij}^\star = \exp\Big( -\frac{(M_{rc} - M_{r'c'})^2}{2\sigma^2} \Big), \tag{23}$$

where $\sigma$ is a hyperparameter. We use a loss term measuring the mean squared error between the predicted and the ground-truth node similarities:

$$\mathcal{L}_{\mathrm{s}} = \frac{1}{|\mathcal{E}_t|} \sum_{(i, j) \in \mathcal{E}_t} \big( s_{ij} - s_{ij}^\star \big)^2, \tag{24}$$

with $\mathcal{E}_t$ the set of connections between two observed entries in each mini-batch and $s_{ij}$ the predicted similarity. Applying this loss term on two entries of similar values pulls the embeddings of the corresponding users and items close together in the latent space, and pushes them apart otherwise. By applying the same loss on all pairs of observed entries, the model is expected to produce embeddings that minimize the similarity loss globally. We empirically observe that supervising the similarity learning systematically improves the quality of the learned similarities, and boosts the performance of the DCMC model.
Our final loss function is then a weighted combination of the cross-entropy and similarity losses:

$$\mathcal{L} = \mathcal{L}_{\mathrm{ce}} + \gamma \, \mathcal{L}_{\mathrm{s}}, \tag{25}$$

with $\gamma$ a parameter balancing the two loss terms. The loss function in (25) is optimized over the model’s parameters using stochastic gradient descent (SGD) with the Adam parameter update [52].
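The loss terms described above can be sketched as follows. Several details are assumptions of this example and are not fixed by the text: the cross-entropy is averaged over the observed nodes, the Gaussian similarity uses twice the squared hyperparameter in its denominator, the pairs are taken as all unordered pairs of observed nodes, and the two losses are combined additively with a single balancing weight. All names are hypothetical.

```python
import math

def cross_entropy(final_probs, true_labels):
    """Mean negative log-probability of the ground-truth label of each node."""
    return -sum(math.log(row[label])
                for row, label in zip(final_probs, true_labels)) / len(true_labels)

def gaussian_similarity(value_a, value_b, sigma):
    """Ground-truth similarity of two known entry values; lies in (0, 1]."""
    return math.exp(-((value_a - value_b) ** 2) / (2.0 * sigma ** 2))

def similarity_loss(predicted, values, sigma):
    """Mean squared error between predicted and ground-truth similarities."""
    n = len(values)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum((predicted[i][j] - gaussian_similarity(values[i], values[j], sigma)) ** 2
               for i, j in pairs) / len(pairs)

def total_loss(final_probs, true_labels, predicted, values, sigma, balance):
    """Cross-entropy plus the weighted similarity loss (assumed additive form)."""
    return (cross_entropy(final_probs, true_labels)
            + balance * similarity_loss(predicted, values, sigma))
```

Since both terms depend on the embeddings produced by the base prediction network, minimizing their sum trains the label predictions and the similarity estimates jointly.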
IV-F Testing the DCMC Model
At the testing phase, a CRF model with nodes corresponding to all the entries in the given matrix is constructed. After a forward pass of the model, we get the probability matrix $Q$, where $q_{i,l}$ is the probability of assigning the label $l$ to the node $i$. The continuous prediction for the node $i$ is given by

$$\hat{m}_i = \sum_{l \in \mathcal{L}} v_l \, q_{i,l}, \tag{26}$$

where $v_l$ denotes the entry value associated with the label $l$.
When dealing with matrices of high dimensions, we randomly divide their rows and columns into subsets and perform predictions according to (26) inside each subset separately in order to reduce the computation and memory requirements. This procedure can be performed many times to produce multiple predictions for an entry, each time considering a different random set of predicting entries. The final entry value prediction is then the average of these predictions.
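The prediction step can be illustrated with the short sketch below, which assumes that the continuous prediction is the expectation of the entry values under the final label probabilities and that repeated predictions are combined with a plain average. The function names and example values are hypothetical.

```python
def expected_value(probabilities, label_values):
    """Continuous prediction: probability-weighted sum of the possible entry values."""
    return sum(p * v for p, v in zip(probabilities, label_values))

def average_predictions(predictions):
    """Combine several predictions of the same entry obtained from different subsets."""
    return sum(predictions) / len(predictions)

# One entry with three possible values, predicted twice with different subsets.
first = expected_value([0.1, 0.2, 0.7], [1.0, 2.0, 3.0])
second = expected_value([0.0, 0.5, 0.5], [1.0, 2.0, 3.0])
final = average_predictions([first, second])
```

Taking the expectation instead of the most probable label yields fractional predictions, which is convenient when the evaluation metric measures the numerical distance to the true value.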
V Experiments
In this section, we present our experimental studies. We first explain our experimental settings and the hyperparameter sensitivity of the DCMC model. After that, we compare the DCMC model against stateoftheart deeplearningbased matrix completion models. Finally, we carry out experiments to justify the benefits of each component in the proposed model.
Va Experimental Settings
Five real-world datasets are employed in our experiments, namely, the MovieLens [53], Flixster [54], Douban [24], YahooMusic [55] and Epinions [56] datasets. These datasets vary in the number of users and items, rating levels and context (movie, music and general consumer ratings). For the first four datasets, we use the experimental configurations (including train/test splits) provided by [33]. Regarding the Epinions dataset, we randomly split the known ratings into for training, for validation and for testing. The details of the five datasets are given in Table I. It can be seen that the densities of the observed entries vary across the datasets, from and , respectively, on the MovieLens and Douban datasets to very low on the Flixster, YahooMusic and Epinions datasets (, and , respectively). It is worth mentioning that we do not employ any side information, e.g., user or item features, in our experiments.
Dataset  # Users  # Items  # Ratings  Rating levels 

MovieLens [53]  943  1,682  100,000  
Flixster [54]  3,000  3,000  26,173  
Douban [24]  3,000  3,000  136,891  
YahooMusic [55]  3,000  3,000  5,335  
Epinions [56]  40,163  139,738  664,824 
We compare the DCMC model with state-of-the-art deep-learning-based matrix completion models. The performance of the models is assessed using the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE),
RMSE  
MAE 
calculated over the entries reserved for testing (indexed by ). Smaller RMSE and MAE values indicate more accurate predictions.
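For reference, the two scores can be computed as in the following minimal sketch, which uses the standard definitions of RMSE and MAE restricted to the test entries. The function and argument names are illustrative and do not come from the paper.

```python
import math


def rmse_mae(truth, pred, test_idx):
    """Return (RMSE, MAE) over the entries listed in test_idx.

    truth and pred are nested lists (matrices) of the same shape;
    test_idx is an iterable of (row, column) pairs reserved for testing.
    """
    errors = [pred[i][j] - truth[i][j] for i, j in test_idx]
    if not errors:
        raise ValueError("test_idx must contain at least one entry")
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae
```

Because the errors are squared before averaging, the RMSE penalizes large individual errors more heavily than the MAE, which is one reason to report both.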
V-B Hyperparameter Selection
Several hyperparameters relate to the base prediction network, the mean-field layers, and the training stage. For the base prediction network, we follow [22, 23] and use two hidden layers in both the row and column branches. The numbers of hidden units in the first and second layers are set to 512 and 128, respectively. As the known entries available for training in the employed datasets are scarce, we use a high dropout rate of to mitigate overfitting. We empirically adapt some hyperparameters related to the mean-field layers to each dataset. Specifically, we set the truncation threshold for the quadratic compatibility function in (15) to on the YahooMusic dataset (values in the range ), and to on the other datasets (values in the range ). The value of used to calculate the ground-truth entries’ similarities in (23) is set to on the YahooMusic dataset and to on the other datasets. We set the number of training epochs to (we count one epoch each time all the rows or all the columns are sampled for training), and the learning rate to initially, which is reduced by a factor of every epochs.
We determine the values of the following hyperparameters by cross-validation: , which weights the importance of the similarity loss in (24) with respect to the prediction loss; , which balances the pairwise term against the unary term [see (14)]; and the number of mean-field iterations (equivalently, the number of mean-field layers), . We carry out cross-validation on a separate split of the MovieLens dataset ( training, validation and testing). This split is randomly generated and is different from that used to compare the proposed model against the other models.
V-B1 Weight of the Similarity Loss Term
To determine the best value for , we first fix to , with which the mean-field inference is likely to converge [57], and set empirically to . As both the computed and the ground-truth entry similarities lie in the range , the similarity loss is normally much smaller than the prediction loss. As a result, we experiment with small and large values of in . The results of this experiment are shown in Table II.
RMSE 

It can be seen that gives the best performance. Furthermore, when , the RMSE drops significantly compared to when . Recall that when , the similarity loss has no effect during training. This result supports the benefit of supervising the similarity learning using the proposed method.
V-B2 The Value of
We fix to , which is found in the previous experiment, to , and run the proposed model using different values of from to . The results of this experiment are shown in Table III.
RMSE 

As can be seen, yields the best performance. The predictions become less accurate when becomes large (beyond ), possibly because the pairwise terms start to dominate the unary terms.
V-B3 Number of Mean-Field Iterations
Fixing and , we then run the DCMC model with different numbers of mean-field iterations. The results of this experiment are summarized in Table IV.
RMSE 

It can be observed that the RMSE improves as increases. Even though we still observe improvements when is larger than , the differences are very small. Therefore, we select as it provides the best trade-off between accuracy and computational complexity.
V-C Comparison Against State-of-the-Art Models
After finding the most effective hyperparameter settings, we carry out experiments to compare the proposed model with reference models on the five real-world datasets. We select state-of-the-art deep-learning-based matrix completion models as references, including non-geometric models: the item-based and user-based autoencoders (IAutorec and UAutorec) [18], the deep user-based autoencoder (Deep UAutorec) [20], the deep matrix factorization model (NMC) [22], the manifold-learning-based regularized autoencoders (mIAutorec and mUAutorec) [21], and the CFNADE model [58]; and geometric models: the sRGCNN model [33] and the GCMC model [34]. For these reference models, we use the source code released by their authors. For the sRGCNN and GCMC models, the graphs are constructed from the observed ratings. We run each model five times and report the average RMSE and MAE values, together with their standard deviations.
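The reporting protocol above (mean and standard deviation over repeated runs) can be sketched as follows. The paper does not state whether the sample or the population standard deviation is reported; the sketch assumes the sample version, and the function name is illustrative.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, standard deviation) of a score over repeated runs.

    Assumption: the sample standard deviation (n - 1 in the denominator)
    is used; a single run yields a deviation of 0.0.
    """
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std
```

The same helper would be applied separately to the RMSE and the MAE values collected from the runs of each model.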
RMSE  MAE  

IAutorec [18]  
UAutorec [18]  
Deep UAutorec [20]  
mIAutorec [21]  
mUAutorec [21]  
NMC [22]  
CFNADE [58]  
sRGCNN [33]  
GCMC [34]  
DCMC (Ours) 
Table V and Table VI present the results for different models on the MovieLens dataset, and on the Flixster, Douban and YahooMusic datasets, respectively. On the MovieLens dataset, the proposed model outperforms all other models in both scores, followed by the mIAutorec [21] and the CFNADE [58] models. On the Flixster dataset, the IAutorec model yields the best performance, while our DCMC model is ranked second. On both the Douban and YahooMusic datasets, our model consistently outperforms the reference models. We do not include the results of the CFNADE model on the YahooMusic dataset, as it requires an excessive amount of memory, proportional to the number of rating levels ( in this case).
Model  Flixster  Douban  YahooMusic  

RMSE  MAE  RMSE  MAE  RMSE  MAE  
IAutorec [18]  
UAutorec [18]  
Deep UAutorec [20]  
mIAutorec [21]  
mUAutorec [21]  
NMC [22]  
CFNADE [58]  N/A  N/A  
sRGCNN [33]  
GCMC [34]  
DCMC (Ours) 
We further compare the performance of the models on the Epinions dataset [56], which is much larger than the other datasets used in the experiments. Another challenge is that, in this dataset, the given observations are highly scarce with respect to the large matrix dimensions. Table VII presents the results of different models on this dataset. We do not include the sRGCNN [33] and the CFNADE [58] models as they do not scale well to this dataset.
Model  RMSE  MAE 

IAutorec [18]  
UAutorec [18]  
Deep UAutorec [20]  
mIAutorec [21]  
mUAutorec [21]  
NMC [22]  
GCMC [34]  
DCMC (Ours) 
It can be seen that our model outperforms the reference models on this dataset, in terms of both the RMSE and MAE scores, whereas the UAutorec model has the second best performance.
As mentioned earlier, the design of the base prediction network in the DCMC model follows that of the NMC model [22]. Even though the NMC model performs relatively well on the MovieLens dataset, its performance deteriorates on the Flixster, YahooMusic and Epinions datasets, where the numbers of observed entries are highly limited. By effectively learning and leveraging the relationships among entries, the DCMC model significantly improves the accuracy over the NMC model on these datasets. Across the benchmark datasets, the DCMC model consistently reports low prediction errors and achieves the best overall performance among all the models. The performance gains brought by the DCMC model are more pronounced as the data becomes highly scarce (e.g., on the YahooMusic and Epinions datasets).
V-D Effects of Training the Base Prediction Network with the Mean-Field Inference
Testing  
RMSE  MAE  
w/o MF  with MF  w/o MF  with MF  
Training 
w/o MF  
with MF 
In Section IV-D, we argued the advantage of the proposed model over a two-stage method. To verify this argument, we perform an experiment comparing the results when using different training/testing variants. The first variant involves training and testing without the mean-field inference. This is equivalent to using only the base prediction network in both training and testing. The second variant involves training without and testing with the mean-field inference. This is equivalent to a two-stage approach: running the base prediction network to compute the CRF potentials and then running the mean-field algorithm. The third variant involves training with and testing without the mean-field inference. This variant allows us to see the effects of training the base prediction network with feedback from the mean-field inference. The last variant is our final DCMC model, which applies training and testing with inference in an end-to-end manner. The same set of hyperparameters is used for all the variants. We use the learned similarities for the variants with the mean-field inference in the testing phase.
The results of this experiment are summarized in Table VIII. It is clear that using the mean-field inference in testing improves the performance regardless of whether the model is trained with or without mean-field inference. This shows the benefit of using the mean-field inference with the learned similarities to gather information from the predicting entries when making a prediction for a missing entry. Training the base prediction network with feedback from the mean-field inference and then testing it without mean-field inference degrades the performance. However, training and testing with the mean-field inference (the DCMC model) yields the best performance. This shows the benefits of the proposed end-to-end training over the two-stage approach.
V-E Quality of the Learned Similarities
The DCMC model learns the similarities between users and items, and in turn computes the similarities between entries. In this subsection, we evaluate the capacity of the model to learn the entry similarities, since the quality of these learned similarities has a strong impact on the prediction accuracy.
We follow an indirect evaluation where we compare the prediction error of the benchmark NN method, specifically, its user-based and item-based variants, when using the learned user and item similarities (obtained by running our approach on the datasets) against that when using widely used similarity metrics. We select four similarity metrics for this comparison, namely, the cosine similarity (cosine), the mean square difference (msd), the Pearson correlation coefficient (pearson) [6], and the shrunk Pearson correlation coefficient (pearson_shrunk) [5]. We employ the implementations of the NN method and the predefined similarity metrics in the Surprise recommendation system library (https://surprise.readthedocs.io/), in which the NN method is called “KNNBasic”.
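To make the evaluation protocol concrete, the sketch below shows a user-based NN predictor in which the similarity function is a plug-in argument, so that a predefined metric and a table of learned similarities can be swapped without changing the predictor. It is a simplified, self-contained illustration and not the Surprise implementation: the function names, the restriction to positive similarities and the data layout (a dictionary mapping each user to a dictionary of item ratings) are assumptions made for the example.

```python
import math


def cosine_on_common(ratings, u, v):
    """Cosine similarity between users u and v over their co-rated items."""
    common = ratings[u].keys() & ratings[v].keys()
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(ratings, sim, u, item, k=2):
    """Similarity-weighted mean of the ratings that the k most similar
    users gave to item; returns None if no usable neighbour exists.

    sim(u, v) is any similarity function, e.g. a predefined metric or a
    lookup into a table of learned similarities (hypothetical here).
    """
    neighbours = [
        (sim(u, v), user_ratings[item])
        for v, user_ratings in ratings.items()
        if v != u and item in user_ratings
    ]
    # Assumption: only positively similar neighbours contribute.
    neighbours = [(s, r) for s, r in neighbours if s > 0]
    top = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    weight = sum(s for s, _ in top)
    if weight == 0:
        return None
    return sum(s * r for s, r in top) / weight
```

With this layout, passing `lambda u, v: cosine_on_common(ratings, u, v)` reproduces a predefined-metric baseline, whereas passing a lookup into learned user similarities evaluates those similarities with exactly the same predictor, which is the comparison carried out in this subsection.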


Fig. 4 shows the RMSE values obtained when using the user- and item-based NN methods with the five approaches to compute the user and item similarities, with varying in . Evidently, the proposed learned similarities lead to the best performance independently of the value. On the MovieLens dataset, the benefit of using the learned similarities is less evident than on the YahooMusic dataset. The reason is that fewer than of the entries on the YahooMusic dataset are observed. As such, all the predefined metrics become less reliable and the NN method suffers when using these metrics to calculate user and item similarities. We observe the same patterns when performing this experiment on the Flixster and Douban datasets. This shows the benefit of using the proposed model to learn the user and item similarities, especially from a very limited number of observations.
VI Conclusion
In this paper, we formulated matrix completion as a MAP inference problem in a CRF. The inference problem was solved using the mean-field algorithm. By unfolding the mean-field algorithm into specially designed neural network layers, we constructed a deep model that simultaneously computes the CRF potentials, learns the correlations among the nodes in the CRF and performs the mean-field inference in each forward pass. The model can be trained in an end-to-end manner, using a method to supervise the learning of the similarities between entries. Experimental studies using various real-world datasets showed that the proposed model consistently yields better performance than various state-of-the-art models, especially on datasets with a very limited number of observations, and justified the benefits of each of the proposed components.
References

Su and Khoshgoftaar [2009] X. Su and T. M. Khoshgoftaar, “A survey of collaborative filtering techniques,” Advances in Artificial Intelligence, vol. 2009, pp. 4:2–4:2, 2009.
Koren et al. [2009] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
Liang et al. [2012] X. Liang, X. Ren, Z. Zhang, and Y. Ma, “Repairing sparse low-rank texture,” in European Conference on Computer Vision (ECCV), 2012, pp. 482–495.
Hu et al. [2013] Y. Hu, D. Zhang, J. Ye, X. Li, and X. He, “Fast and accurate matrix completion via truncated nuclear norm regularization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 9, pp. 2117–2130, 2013.
Koren and Bell [2015] Y. Koren and R. Bell, “Advances in collaborative filtering,” in Recommender Systems Handbook, F. Ricci, L. Rokach, and B. Shapira, Eds. Boston, MA: Springer, 2015.
Sarwar et al. [2001] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation algorithms,” in International Conference on World Wide Web (WWW), 2001, pp. 285–295.
Sarwar et al. [2000] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Analysis of recommendation algorithms for e-commerce,” in ACM Conference on Electronic Commerce (EC), 2000, pp. 158–167.
Verstrepen and Goethals [2014] K. Verstrepen and B. Goethals, “Unifying nearest neighbors collaborative filtering,” in ACM Conference on Recommender Systems (RecSys), 2014, pp. 177–184.
Wang et al. [2006] J. Wang, A. P. de Vries, and M. J. T. Reinders, “Unifying user-based and item-based collaborative filtering approaches by similarity fusion,” in ACM International Conference on Research and Development in Information Retrieval (SIGIR), 2006, pp. 501–508.
Bell et al. [2007] R. Bell, Y. Koren, and C. Volinsky, “Modeling relationships at multiple scales to improve accuracy of large recommender systems,” in ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2007, pp. 95–104.
Candès and Recht [2009] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, p. 717, 2009.
Candès and Tao [2010] E. J. Candès and T. Tao, “The power of convex relaxation: Near-optimal matrix completion,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2053–2080, 2010.
Jain et al. [2013] P. Jain, P. Netrapalli, and S. Sanghavi, “Low-rank matrix completion using alternating minimization,” in ACM Symposium on Theory of Computing (STOC), 2013, pp. 665–674.
Candès et al. [2011] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM, vol. 58, no. 3, pp. 11:1–11:37, 2011.
Waters et al. [2011] A. E. Waters, A. C. Sankaranarayanan, and R. Baraniuk, “SpaRCS: Recovering low-rank and sparse matrices from compressive measurements,” in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 1089–1097.
Zhang et al. [2006] S. Zhang, W. Wang, J. Ford, and F. Makedon, “Learning from incomplete ratings using non-negative matrix factorization,” in SIAM Conference on Data Mining (SDM), 2006, pp. 549–553.
Lee et al. [2010] H. Lee, J. Yoo, and S. Choi, “Semi-supervised nonnegative matrix factorization,” IEEE Signal Processing Letters, vol. 17, no. 1, pp. 4–7, 2010.
Sedhain et al. [2015] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, “Autorec: Autoencoders meet collaborative filtering,” in International Conference on World Wide Web (WWW), 2015, pp. 111–112.
Strub et al. [2016] F. Strub, R. Gaudel, and J. Mary, “Hybrid recommender system based on autoencoders,” in 1st Workshop on Deep Learning for Recommender Systems (DLRS), 2016, pp. 11–16.
Kuchaiev and Ginsburg [2017] O. Kuchaiev and B. Ginsburg, “Training deep autoencoders for collaborative filtering,” arXiv e-prints, 2017.
Nguyen et al. [2018a] D. M. Nguyen, E. Tsiligianni, R. Calderbank, and N. Deligiannis, “Regularizing autoencoder-based matrix completion models via manifold learning,” in European Signal Processing Conference (EUSIPCO), 2018, pp. 1880–1884.
Nguyen et al. [2018b] D. M. Nguyen, E. Tsiligianni, and N. Deligiannis, “Extendable neural matrix completion,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6328–6332.
Nguyen et al. [2018c] D. M. Nguyen, E. Tsiligianni, and N. Deligiannis, “Learning discrete matrix factorization models,” IEEE Signal Processing Letters, vol. 25, no. 5, pp. 720–724, 2018.
Ma et al. [2011] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King, “Recommender systems with social regularization,” in ACM International Conference on Web Search and Data Mining (WSDM), 2011, pp. 287–296.
Dai et al. [2012] W. Dai, E. Kerman, and O. Milenkovic, “A geometric approach to low-rank matrix completion,” IEEE Transactions on Information Theory, vol. 58, no. 1, pp. 237–247, 2012.
Chouvardas et al. [2017] S. Chouvardas, M. A. Abdullah, L. Claude, and M. Draief, “Robust online matrix completion on graphs,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4019–4023.
Kalofolias et al. [2014] V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst, “Matrix completion on graphs,” in Advances in Neural Information Processing Systems Workshop “Out of the Box: Robustness in High Dimension” (NIPSW), 2014, pp. 1–9.
Rao et al. [2015] N. Rao, H.-F. Yu, P. K. Ravikumar, and I. S. Dhillon, “Collaborative filtering with graph information: Consistency and scalable methods,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2107–2115.
Tran et al. [2007] T. T. Tran, D. Q. Phung, and S. Venkatesh, “Preference networks: Probabilistic models for recommendation systems,” in Australasian Conference on Data Mining and Analytics, 2007, pp. 195–202.
Liu et al. [2015] S. Liu, T. Tran, and G. Li, “Ordinal random fields for recommender systems,” in Asian Conference on Machine Learning (ACML), 2015, pp. 283–298.
Tran et al. [2016] T. Tran, D. Phung, and S. Venkatesh, “Collaborative filtering via sparse Markov random fields,” Information Sciences, vol. 369, pp. 221–237, 2016.
Liu et al. [2017] S. Liu, G. Li, T. Tran, and Y. Jiang, “Preference relation-based Markov random fields for recommender systems,” Machine Learning, vol. 106, no. 4, pp. 523–546, 2017.
Monti et al. [2017] F. Monti, M. M. Bronstein, and X. Bresson, “Geometric matrix completion with recurrent multi-graph neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 3700–3710.
v. d. Berg et al. [2018] R. v. d. Berg, T. N. Kipf, and M. Welling, “Graph convolutional matrix completion,” in KDD Deep Learning Day, 2018, pp. 1–9.
Wu et al. [2018] Y. Wu, H. Liu, and Y. Yang, “Graph convolutional matrix completion for bipartite edge prediction,” in International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR), 2018, pp. 51–60.
Defazio and Caetano [2012] A. J. Defazio and T. S. Caetano, “A graphical model formulation of collaborative filtering neighbourhood methods with fast maximum entropy training,” in International Conference on Machine Learning (ICML), 2012, pp. 265–272.
Bronstein et al. [2017] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: Going beyond Euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
[38] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR).
McCallum [2003] A. McCallum, “Efficiently inducing features of conditional random fields,” in Conference on Uncertainty in Artificial Intelligence (UAI), 2003, pp. 403–410.
Sutton et al. [2007] C. Sutton, A. McCallum, and K. Rohanimanesh, “Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data,” Journal of Machine Learning Research, vol. 8, pp. 693–723, 2007.
Scharstein and Pal [2007] D. Scharstein and C. Pal, “Learning conditional random fields for stereo,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
Krähenbühl and Koltun [2011] P. Krähenbühl and V. Koltun, “Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials,” in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 109–117.
Zheng et al. [2015] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, “Conditional random fields as recurrent neural networks,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1529–1537.
Liu et al. [2018] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang, “Deep learning Markov random field for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1814–1828, 2018.
Arnab et al. [2018] A. Arnab, S. Zheng, S. Jayasumana, B. Romera-Paredes, M. Larsson, A. Kirillov, B. Savchynskyy, C. Rother, F. Kahl, and P. H. S. Torr, “Conditional random fields meet deep neural networks for semantic segmentation: Combining probabilistic graphical models with deep learning for structured prediction,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 37–52, 2018.
Koller and Friedman [2009] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
Ioffe and Szegedy [2015] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML), 2015, pp. 448–456.
Nair and Hinton [2010] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in International Conference on Machine Learning (ICML), 2010, pp. 807–814.
Srivastava et al. [2014] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
Wang et al. [2016] L. Wang, Y. Li, and S. Lazebnik, “Learning deep structure-preserving image-text embeddings,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5005–5013.
Ng et al. [2001] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in Neural Information Processing Systems (NIPS), 2001, pp. 849–856.
Kingma and Ba [2015] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
Harper and Konstan [2015] F. M. Harper and J. A. Konstan, “The MovieLens datasets: History and context,” ACM Transactions on Interactive Intelligent Systems, vol. 5, no. 4, pp. 19:1–19:19, 2015.
Jamali and Ester [2010] M. Jamali and M. Ester, “A matrix factorization technique with trust propagation for recommendation in social networks,” in ACM Conference on Recommender Systems (RecSys), 2010, pp. 135–142.
Dror et al. [2012] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer, “The Yahoo! Music Dataset and KDD-Cup’11,” in KDD Cup, 2012, pp. 3–18.
Richardson and Domingos [2002] M. Richardson and P. Domingos, “Mining knowledge-sharing sites for viral marketing,” in ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2002, pp. 61–70.
Krähenbühl and Koltun [2013] P. Krähenbühl and V. Koltun, “Parameter learning and convergent inference for dense random fields,” in International Conference on Machine Learning (ICML), 2013, pp. III–513–III–521.
Zheng et al. [2016] Y. Zheng, B. Tang, W. Ding, and H. Zhou, “A neural autoregressive approach to collaborative filtering,” in International Conference on Machine Learning (ICML), 2016, pp. 764–773.