We consider the problem of online matrix completion with side information. In our model, the learner is sequentially queried to predict entries of the matrix. After each query, the learner then receives the (current) matrix entry. The goal of the learner is to minimize prediction mistakes. To aid the learner, side information is associated with each row and column. For instance, in the classic “Netflix challenge” BL07 , the rows of the matrix correspond to viewers and the columns to movies, with entries representing movie ratings. It is natural to suppose that we have demographic information for each user, and metadata for the movies. In this work, we will consider both transductive and inductive models. In the former model, the side information associated with each row and column is specified completely in advance. For the inductive model, only a pair of kernel functions is specified, one for the rows and one for the columns. What is not specified is the mapping from the domain of the kernel functions to specific rows or columns, which is only revealed sequentially. In the Netflix example, the inductive model is especially useful if new users or movies are introduced during the learning process.
In Theorem 1, we give regret and mistake bounds for online binary matrix completion with side information. Although this theorem has broader applicability, our interpretation will focus on the case where the matrix has a latent block structure. Hartigan H72 introduced the idea of sorting a matrix by both the rows and columns into a few homogeneous blocks. This has since become known as co-clustering or bi-clustering. This same assumption has become the basis for probabilistic models which can then be used to “complete” a matrix with missing entries. The authors of GLMZ16 give some rate-optimal results for this problem in the batch setting and provide an overview of this literature. It is natural to compare this assumption to the dominant alternative, which assumes that there exists a low-rank decomposition of the matrix to be completed; see, for instance, CR12. Common to both approaches is that associated with each row and column there is an underlying latent factor, so that the given matrix entry is determined by a function of the appropriate row and column factors. The low-rank assumption is that the latent factors are real-valued vectors and that the function is the dot product. The latent block structure assumption is that the latent factors are instead categorical and that the function between factors is arbitrary.
In this work, we prove mistake bounds of the form . The term is a parameter of our algorithm which, when exactly tuned, is the squared margin complexity of the comparator matrix. The notion of margin complexity in machine learning was introduced in BD2003, where it was used to study the learnability of concept classes via linear embeddings. It was further studied in CMSM07, and in SrebroS05 a detailed study of margin complexity, trace complexity and rank in the context of statistical bounds for matrix completion was given. The squared margin complexity is upper bounded by rank. Furthermore, if our matrix has a latent block structure with homogeneous blocks (for an illustration, see Figure 1), then . The second term in our bound is the quasi-dimension which, to the best of our knowledge, is novel to this work. The quasi-dimension measures the extent to which the side information is “predictive” of the comparator matrix. In Theorem 3, we provide an upper bound on the quasi-dimension, which measures the predictiveness of the side information when the comparator matrix has a latent block structure. If there is no side information, then . However, if there is a latent block structure and the side information is predictive, then ; hence our nomenclature “quasi-dimension.” In this case, we then have that the mistake bound term , which we will later argue is optimal up to logarithmic factors. Although latent block structure may appear to be a “fragile” measure of matrix complexity, our regret bound implies that performance will scale smoothly in the case of adversarial noise.
The paper is organized as follows. First, we discuss related literature. We then introduce preliminary concepts in Section 2. In Section 3, we provide our main results for the transductive setting, followed by the results for the inductive setting in Section 4. Finally, in Section 5 we give some preliminary simulation experiments to illustrate the performance of our algorithm with noisy side information.
Matrix completion has been studied extensively in the batch setting; see, for example, Srebro2005; CT10; Maurer2013; Chiang2018 and references therein. Central to these approaches is the aim of finding a low-rank factorization by optimizing a convex proxy to rank, such as the trace norm Fazel2001. The papers abernethy2006; Xu2013; Kalofolias2014; RHRD15 are partially representative of methods that incorporate side information into the matrix completion task. The inductive setting for matrix completion has been studied in abernethy2006 through the use of tensor product kernels, and Zhang2018 takes a non-convex optimization approach. Some examples in the transductive setting include Xu2013; Kalofolias2014; RHRD15. The last two papers use graph Laplacians to model the side information, which is similar to our approach. To achieve this, two graph Laplacians are used to define regularization functionals for both the rows and the columns so that rows (columns) with similar side information tend to have the same values. In particular, RHRD15 resembles our approach by applying the Laplacian regularization functionals to the underlying row and column factors directly. An alternate approach is taken in Kalofolias2014, where the regularization is instead applied to the row space (column space) of the “surface” matrix.
Goldman et al. and Goldman and Warmuth proved mistake bounds for learning a binary relation, which can be viewed as a special case of matrix completion. In the regret setting, with minimal assumptions on the loss function, the regret of the learner is bounded in terms of the trace-norm of the underlying comparator matrix in CS11. The authors of HKSS12 provided tight upper and lower bounds in terms of a parameterized complexity class of matrices that includes the bounded-trace-norm and bounded-max-norm matrices as special cases. None of the above references considered the problem of side information. The results in gentile2013online; ourJMLR15; HPP16 are nearest in flavor to the results given here. In HPP16, a mistake bound of was given. Latent block structure was also introduced to the online setting in HPP16; however, it was treated in a limited fashion and without the use of side information. The papers gentile2013online; ourJMLR15 both used side information to predict a limited complexity class of matrices. In gentile2013online, side information was used to predict whether vertices in a graph are “similar”; in Section 3.2 we show how this result can be obtained as a special case of our more general bound. In ourJMLR15, a more general setting was considered, which as a special case addressed the problem of a switching graph labeling. The model in ourJMLR15 is considerably more limited in its scope than our Theorem 1. To obtain our technical results, we used an adaptation of the matrix exponentiated gradient algorithm tsuda2005matrix. The general form of our regret bound comes from a matricization of the regret bound proven in Sabato2015 for a Winnow-inspired algorithm litt88 for linear classification in the vector case. For a more detailed discussion, see Appendix A.2.
For any positive integer , we define . For any predicate if pred is true and equals 0 otherwise, and . We define the hinge loss as .
We denote the inner product of vectors as and the norm as . The coordinate -dimensional vector is denoted ; we will often abuse notation and use on the assumption that the space may be inferred. For vectors and we define to be the concatenation of and , which we regard as a column vector. Hence . We let be the set of all real-valued matrices. If then denotes the -th -dimensional row vector and the entry of is . We define and to be its pseudoinverse and transpose, respectively. The trace norm of a matrix is , where indicates the unique positive square root of a positive semi-definite matrix, and denotes the trace of a square matrix. This is given by for . The identity matrix is denoted . In addition, we define to be the set of symmetric matrices and let and be the subset of positive semidefinite and strictly positive definite matrices respectively. Recall that the set of symmetric matrices has the following partial ordering: for every , we say that if and only if . We also define the squared radius of as .
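As a quick numerical illustration of the trace norm described above, the following sketch computes the trace of the positive square root of X^T X and compares it with the sum of the singular values. The function name `trace_norm` and the random test matrix are assumptions made for this example only; they are not part of the paper's notation.

```python
import numpy as np

def trace_norm(X):
    """Trace norm computed as tr(sqrt(X^T X)) via the eigenvalues of X^T X."""
    eigvals = np.linalg.eigvalsh(X.T @ X)   # X^T X is symmetric positive semidefinite
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative round-off
    return float(np.sum(np.sqrt(eigvals)))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))             # hypothetical example matrix
singular_values = np.linalg.svd(X, compute_uv=False)

print(trace_norm(X), singular_values.sum())
```

The two printed values should agree up to floating-point error, since the eigenvalues of X^T X are the squared singular values of X.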
For every matrix , we define , the set of matrices which are sign consistent with . We also define , that is the set of matrices which are sign consistent with with a margin of at least one.
The max-norm (or norm CMSM07 ) of a matrix is defined by the formula
where the infimum is over all matrices and and every integer . The margin complexity of a matrix is
Observe that for , we have , where the lower bound follows from the right-hand side of (2) and the upper bound follows since, assuming w.l.o.g. , we may decompose . We denote the classes of row-normalized and block expansion matrices as and , respectively. Block expansion matrices may be seen as a generalization of permutation matrices, additionally duplicating rows (columns) by left (right) multiplication.
We now introduce notation specific to the graph setting. Let then be an -vertex connected, weighted and undirected graph with positive weights. Let be the matrix such that if and otherwise. Let be the diagonal matrix such that is the degree of vertex . The Laplacian, , of is defined as . Observe that if is connected, then is rank matrix with in its null space. From we define the (strictly) positive definite PDLaplacian . Observe that if then , and similarly, (see herbster2006prediction for details of this construction).
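The graph notation above can be made concrete with a small sketch. It assumes the standard combinatorial Laplacian, that is, the degree matrix minus the weighted adjacency matrix; the helper name `graph_laplacian` and the example weights are hypothetical. The sketch checks the two properties stated in the text for a connected graph: the all-ones vector lies in the null space, and the rank is one less than the number of vertices. It does not construct the PDLaplacian.

```python
import numpy as np

def graph_laplacian(A):
    """Combinatorial Laplacian D - A of a weighted, undirected graph."""
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))              # diagonal matrix of vertex degrees
    return D - A

# Hypothetical connected path graph on 4 vertices with positive edge weights.
A = np.array([[0, 2, 0, 0],
              [2, 0, 1, 0],
              [0, 1, 0, 3],
              [0, 0, 3, 0]], dtype=float)
L = graph_laplacian(A)
ones = np.ones(A.shape[0])
rank = np.linalg.matrix_rank(L)

print(L @ ones, rank)
```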
3 Transductive Matrix Completion
A regret bound of our algorithm is given in the following theorem. In the realizable case (with exact tuning), the mistakes are bounded by . The term evaluates the predictive quality of the side information provided to the algorithm. In order to evaluate more simply, we provide an upper bound in Theorem 3 that is easy to interpret. Examples are given in Sections 3.1 and 4.1, where Theorem 3 is applied to evaluate the quality of side information in idealized scenarios.
The hinge loss of Algorithm 1 with non-conservative updates and parameters , , , p.d. matrices and and for is bounded by
for all row-normalized , and any with where
The mistakes in the realizable case with the additional assumption that for all with conservative updates and parameters , and for are bounded by,
The quasi-dimension characterizes the quality of side information. Observe that in the case of no side information, that is and , then ; in this case the bound (6) recovers the bound of (HPP16, Theorem 3.1) when in the realizable case. The term is difficult to directly quantify. In the coming Theorem 3, we upper bound the quasi-dimension in terms of the latent block structure of . For our purposes, the notion of (binarized) latent block structure is captured by the following definition.
The class of -binary-biclustered matrices is defined as
More informally, a binary matrix is -biclustered if there exists some permutation of the rows and columns into a grid of blocks each uniformly labeled or , as illustrated in Figure 1.
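The informal description can be illustrated with a short sketch that builds a biclustered matrix from row and column cluster assignments. It assumes that block labels are -1 or +1 and that the matrix is obtained as R U* C^T from one-hot (block expansion) matrices; the names `block_expansion`, `row_clusters`, `col_clusters` and `U_star` are chosen for this example and are not the paper's notation.

```python
import numpy as np

def block_expansion(assignments, num_clusters):
    """One-hot matrix: row i has a single 1 in column assignments[i]."""
    R = np.zeros((len(assignments), num_clusters))
    R[np.arange(len(assignments)), assignments] = 1.0
    return R

rng = np.random.default_rng(1)
m, n, k, l = 8, 10, 3, 2
# Every cluster is used at least once; the remaining assignments are random.
row_clusters = np.concatenate([np.arange(k), rng.integers(0, k, size=m - k)])
col_clusters = np.concatenate([np.arange(l), rng.integers(0, l, size=n - l)])
U_star = rng.choice([-1.0, 1.0], size=(k, l))   # one label per block

R = block_expansion(row_clusters, k)
C = block_expansion(col_clusters, l)
U = R @ U_star @ C.T                            # entry (i, j) is the label of its block

# Sorting rows and columns by cluster reveals the grid of uniform blocks.
row_order = np.argsort(row_clusters, kind="stable")
col_order = np.argsort(col_clusters, kind="stable")
U_sorted = U[np.ix_(row_order, col_order)]
print(U_sorted)
```

Because U factors through a k-by-l matrix, its rank is at most min(k, l) regardless of m and n, which is one way to see why complexity measures of such matrices depend on the number of blocks rather than on the matrix dimensions.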
Many natural functions of matrix complexity are invariant to the presence of block structure. A function with respect to a class of matrices is block-invariant if for all with , , and we have that for any matrix . Thus margin complexity, rank and VC-dimension are all block-invariant. From the block-invariance of the margin complexity, we may conclude that for , . This follows since we may decompose for some , and and then use the observation in the preliminaries that the margin complexity of any matrix is bounded by .
In the following theorem we give a bound for . For the bound to be non-vacuous, two properties must hold. First, the comparator matrix must have a latent block structure, i.e., for some and . Second, in an informal sense, the side information matrices and must be at least “softly predictive” of the latent structure in . If only the first property holds, observe then that with an exact tuning. However, may be large if and are ill-conditioned, although in many cases . On the other hand, if only the second property holds, then the following theorem will be vacuous, although may still be small. However, there are cases where there is no latent block structure but both and are small.
The bound allows us to bound the quality of side information in terms of a hypothetical learning problem. Recall that is the upper bound on the mistakes per Novikoff’s theorem Novikoff62 for predicting the elements of vector with a kernel perceptron using as the kernel. Hence the term in (7) may be interpreted as a bound for a one-versus-all -class kernel perceptron where encodes a labeling from as one-hot vectors. We next show that with “ideal” side information that the bound .
3.1 Graph-based Side-Information
We may use separate graph Laplacians to represent the side information on the “rows” and the “columns.” A given row (column) corresponds to a vertex in the “row graph” (“column graph”). The weight of edge represents our prior belief that row (column) and row (column) share the same underlying factor. Thus in the ideal case, the rows that share factors have an edge between them and there are no other edges. Given factors, we then have a graph that consists of disjoint cliques. However, to meet the technical requirement that the side information matrix is positive definite, we need to connect the cliques in a minimal fashion. We achieve this by connecting the cliques like a “star” graph. Specifically, one clique is arbitrarily chosen as the center. From each of the other cliques, a vertex is chosen arbitrarily and connected to the central clique. Observe that a property of this construction is that there is a path of length between any pair of vertices. Now we can use the bound from Theorem 3,
to bound in the ideal case. We focus on the rows, as a parallel argument may be made for the side information on the columns. Consider the term , where is the PDLaplacian formed from a graph with Laplacian . Then using the observation from the preliminaries that , we have that . To evaluate this, we use the well-known equality . Observing that each of the rows of is a “one-hot” encoding of the corresponding factor, only the edges between classes then contribute to the sum of the norms, and thus by construction. We bound , using the fact that the graph diameter is a bound on (see (herbster2005online, Theorem 4.2)). Combining terms and assuming similar idealized side information on the columns, we obtain . Observe then that since the comparator matrix is -biclustered, we have in the realizable case (with exact tuning) that the mistakes of the algorithm are bounded by . This upper bound is tight up to logarithmic factors, as the VC-dimension of is lower-bounded by .
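The claim that only edges between classes contribute can be checked numerically. The sketch below builds disjoint cliques joined in a star with unit edge weights, forms the plain Laplacian (not the PDLaplacian), and compares the trace of R^T L R with the edge sum of squared differences of the one-hot rows. The function `star_of_cliques`, the choice of attaching every outer clique to the first vertex of the central clique, and the clique sizes are assumptions of this example.

```python
import numpy as np

def star_of_cliques(clique_sizes):
    """Adjacency matrix and class labels for disjoint cliques joined in a star.

    Clique 0 is the center; the first vertex of every other clique is joined
    to the first vertex of clique 0 (an arbitrary choice for this sketch).
    """
    m = sum(clique_sizes)
    A = np.zeros((m, m))
    labels = np.zeros(m, dtype=int)
    starts = np.cumsum([0] + list(clique_sizes[:-1]))
    for c, (s, size) in enumerate(zip(starts, clique_sizes)):
        A[s:s + size, s:s + size] = 1.0     # fully connect the clique
        labels[s:s + size] = c
    np.fill_diagonal(A, 0.0)                # no self-loops
    for s in starts[1:]:
        A[starts[0], s] = A[s, starts[0]] = 1.0
    return A, labels

clique_sizes = [4, 3, 5]
k = len(clique_sizes)
A, labels = star_of_cliques(clique_sizes)
L = np.diag(A.sum(axis=1)) - A
R = np.eye(k)[labels]                       # one-hot encoding of each row's factor

quad_form = np.trace(R.T @ L @ R)
edge_sum = sum(A[i, j] * np.sum((R[i] - R[j]) ** 2)
               for i in range(len(A)) for j in range(i + 1, len(A)))
print(quad_form, edge_sum)
```

In this construction there are k - 1 edges between cliques, and each contributes a squared difference of 2 between two distinct one-hot vectors, so both quantities equal 2(k - 1) here.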
3.2 Online Community Membership Prediction
A special case of matrix completion is the case where there are objects which are assumed to lie in classes. In this case, the underlying matrix is given by if and are in the same class and otherwise. Thus this may be viewed as an online version of community detection or “similarity” prediction. Observe that this is an example of a -biclustered matrix where and there exists such that . Since margin complexity is block-invariant, we have that . In the “worst-case”, . However in the case of “similarity prediction”, we have . This follows since we have a decomposition by with and , thus giving . This example also indicates the gap between rank and margin complexity as the rank of is (this gap between the margin complexity and rank was previously observed in CMSM07). Therefore, if the side-information matrix is the same PDLaplacian on both the rows and columns, we obtain a mistake bound of for this problem, which recovers the bound of (gentile2013online, Proposition 4). This work generalises the results in gentile2013online for similarity prediction, since we may now use general p.d. matrices in an inductive setting (see the following section) as well as refining the mistake bounds to regret bounds.
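A small sketch of the similarity-prediction matrix may help fix ideas. It assumes that entries are +1 for pairs in the same class and -1 otherwise, and it checks that the matrix has the same block expansion on rows and columns. The class assignment and the names `classes`, `R` and `U_star` are hypothetical.

```python
import numpy as np

# Hypothetical assignment of 8 objects to k = 4 classes.
classes = np.array([0, 0, 1, 2, 2, 2, 3, 1])
k = 4

# Similarity matrix: +1 if the two objects share a class, -1 otherwise.
U = np.where(classes[:, None] == classes[None, :], 1.0, -1.0)

# The same one-hot matrix R expands both rows and columns of a k-by-k core.
R = np.eye(k)[classes]
U_star = 2.0 * np.eye(k) - np.ones((k, k))  # +1 on the diagonal, -1 elsewhere

rank = np.linalg.matrix_rank(U)
print(rank)
```

In this instance the rank equals the number of classes, so a rank-based complexity measure grows with the number of classes.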
4 Inductive Matrix Completion
In the previous section, the learner was assumed to have complete foreknowledge of the side information through the matrices and . In the inductive setting, the learner instead has kernel side information functions and . With complete foreknowledge of the rows (columns) that will be observed, one may use () to compute (), which corresponds to an inverse of a submatrix of (). In the inductive setting, unlike the transductive one, we do not have this foreknowledge and thus cannot compute () in advance. Notice that the assumption of side information as kernel functions is not particularly limiting, as for instance the side information could be provided by vectors in and the kernel could be the positive definite linear kernel . On the other hand, despite the additional flexibility of the inductive setting versus the transductive one, there are two limitations. First, only in a technical sense will it be possible to model side information via a PDLaplacian, since can only be computed given knowledge of the graph in advance. Second, the bound of Theorem 3 on quasi-dimension gains additional multiplicative factors and . Nevertheless, we will observe in Section 4.1 that, for a given kernel for which the side information associated with a given row (column) cluster is “well-separated” from other clusters, we can show that .
The following algorithm is prediction-equivalent to Algorithm 1 up to the value of . In WKZ12 , the authors provide very general conditions for the “kernelization” of algorithms with an emphasis on “matrix” algorithms. They sketch a method to kernelize the Matrix Exponentiated Gradient algorithm based on the relationship between the eigensystems of the kernel matrix and the Gram matrix. We take a different (direct) approach in which we prove correctness via Proposition 4.
The intuition behind the algorithm is that, although we cannot efficiently embed the row and column kernel functions and as matrices since they are potentially infinite-dimensional, we may instead work with the embedding corresponding to the currently observed rows and columns, recompute the embedding on a per-trial basis and then “replay” all re-embedded past examples to create the current hypothesis matrix.
The computational complexity of the inductive algorithm exceeds that of the transductive algorithm. For our analysis, assume . On every trial (with an update), Algorithm 1 requires the computation of the SVD of an matrix and thus requires time. On the other hand, for every trial (with an update) in Algorithm 2, the complexity is instead dominated by the sum of up to (i.e., in the regret setting we can collapse terms from multiple observations of the same matrix entry) matrices of size up to and thus has per-trial complexity . The following is our proposition of equivalency, proven in Appendix C.
The inductive and transductive algorithms are equivalent up to and . Without loss of generality assume and . Define and . Assume that for the transductive algorithm, the matrices and are given whereas for the inductive algorithm, only the kernel functions and are provided. Then, if and , and if the algorithms receive the same label and index sequences then the predictions of the algorithms are the same.
Thus, the only case when the algorithms are different is when or . This is a minor inequivalency, as the only resultant difference is in the term . Alternatively, if one uses a normalized kernel such as the Gaussian, then . In the following subsection, we describe a simple scenario where the quasi-dimension scales quadratically with the number of distinct factors.
4.1 Bounding the quasi-dimension in the inductive setting
In the following, we show that for online side information in that is well-separated into boxes, there exists a kernel for which the quasi-dimension grows no more than quadratically with the number of factors. For simplicity we use the min kernel, which approximates functions by linear interpolation. In practice, similar results may be proven for other universal kernels, but the analysis with the min kernel has the advantage of simplicity.
Define the min kernel as . Define . A box in is a set defined by a pair of vectors .
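The following sketch shows one common form of a min kernel and checks numerically that its Gram matrix is positive semidefinite. The product-of-coordinate-wise-minima form and the restriction to strictly positive inputs are assumptions made for this example; they may differ from the exact definition used here. The function name `min_kernel` and the sample points are hypothetical.

```python
import numpy as np

def min_kernel(x, t):
    """Assumed form: product over coordinates of min(x_i, t_i), for positive inputs."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    return float(np.prod(np.minimum(x, t)))

rng = np.random.default_rng(2)
points = rng.uniform(0.1, 1.0, size=(6, 2))   # hypothetical side-information vectors
gram = np.array([[min_kernel(x, t) for t in points] for x in points])
min_eig = np.linalg.eigvalsh(gram).min()

print(min_eig)
```

Under these assumptions the smallest eigenvalue is nonnegative up to round-off: the one-dimensional min kernel on the positive reals is positive semidefinite, and an entrywise product of such Gram matrices remains positive semidefinite.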
Given boxes such that there exists a and , if and then .
Recall the bound on the quasi-dimension . If we assume that the side information on the rows (columns) lies in , then for the min kernel. Observe that the term . Thus by applying the proposition to each row of we have that . We observe then that, for an optimal tuning and well-separated side information on the rows and columns, the mistake bound for a -biclustered matrix in the inductive setting is of . However, our best lower bound in terms of and is just , as in the transductive setting, where the lower bound follows from the VC-dimension of . An open problem is to resolve this gap.
To illustrate the algorithm’s performance, preliminary experiments were performed in the transductive setting with graph side information. In particular, we took to be randomly generated square (9,9)-biclustered matrices with i.i.d. noise. A visualization of a noise-free example matrix can be found in Figure 1. The noise process flipped the label of each matrix entry independently with probability. The side information on the rows and columns was represented by PDLaplacian matrices, for which the underlying graphs were constructed in the manner described in Section 3.1. Varying levels of side information noise were applied. This was introduced by considering every pair of vertices independently from the constructed graph and flipping the state between edge/not-edge with probability .
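The two noise processes described above can be sketched as follows. This is an illustrative sketch only, not the experimental code: the function names `flip_labels` and `flip_edges`, the probabilities, and the small random inputs are all hypothetical, and the graph is assumed to be unweighted with 0/1 adjacency entries.

```python
import numpy as np

def flip_labels(U, p, rng):
    """Flip the sign of each entry of a +/-1 matrix independently with probability p."""
    flips = rng.random(U.shape) < p
    return np.where(flips, -U, U)

def flip_edges(A, p, rng):
    """Toggle edge/not-edge for each unordered vertex pair independently with probability p."""
    m = A.shape[0]
    upper = np.triu(rng.random((m, m)) < p, k=1)   # one coin per pair i < j
    flips = upper | upper.T                        # keep the graph undirected
    return np.where(flips, 1.0 - A, A)

rng = np.random.default_rng(3)
U = rng.choice([-1.0, 1.0], size=(6, 6))           # hypothetical label matrix
edges = np.triu(rng.random((6, 6)) < 0.3, k=1)
A = (edges | edges.T).astype(float)                # hypothetical 0/1 adjacency matrix

noisy_U = flip_labels(U, 0.1, rng)
noisy_A = flip_edges(A, 0.2, rng)
```

Note that flipping edges at random can disconnect the graph; this sketch does not check or restore connectivity.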
The parameters were chosen as follows. For the initialization of , was chosen conservatively to be . The learning rate was tuned through a grid search for a matrix of dimension , which resulted in the choice of . Each run of the algorithm consisted of predicting all matrix entries, sampled uniformly at random without replacement.
The per-trial mistake rate is shown in Figure 2 for matrix dimension , where each data point is averaged over 10 runs. We observe that with random side information, , although the term could lead to a vacuous bound, the algorithm’s error rate was in the range of , well below chance. With ideal side information, the performance improved dramatically, as predicted by the bounds, to an error rate in . The data used to generate the figure, as well as the link to the code and the data, can be found in Appendix D.
-  J. Bennett and S. Lanning. The Netflix prize. In Proceedings of the KDD Cup Workshop 2007, pages 3–6, New York, August 2007. ACM.
-  J. A. Hartigan. Direct Clustering of a Data Matrix. Journal of the American Statistical Association, 67(337):123–129, 1972.
-  C. Gao, Y. Lu, Z. Ma, and H. H. Zhou. Optimal estimation and completion of matrices with biclustering structures. Journal of Machine Learning Research, 17:161:1–161:29, 2016.
-  E. Candès and B. Recht. Exact matrix completion via convex optimization. Commun. ACM, 55(6):111–119, June 2012.
-  S. Ben-David, N. Eiron, and H. U. Simon. Limitations of learning via embeddings in Euclidean half spaces. Journal of Machine Learning Research, 3:441–461, 2003.
-  N. Linial, S. Mendelson, G. Schechtman, and A. Shraibman. Complexity measures of sign matrices. Combinatorica, 27(4):439–463, 2007.
-  N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pages 545–560, 2005.
-  N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems 17, 2005.
-  E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theor., 56(5):2053–2080, May 2010.
-  A. Maurer and M. Pontil. Excess risk bounds for multitask learning with trace norm regularization. In Proceedings of The 27th Conference on Learning Theory, pages 55–76, 2013.
-  Kai Yang Chiang, Inderjit S. Dhillon, and Cho Jui Hsieh. Using side information to reliably learn low-rank matrices from missing and corrupted observations. Journal of Machine Learning Research, 19, 2018.
-  M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. Proceedings of the American Control Conference, 2001.
-  J. Abernethy, F. Bach, T. Evgeniou, and J. Vert. Low-rank matrix factorization with attributes. arXiv preprint cs/0611124, 2006.
-  M. Xu, R. Jin, and Z. H. Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems, 2013.
-  V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst. Matrix completion on graphs. Technical report, EPFL, 2014.
-  N. Rao, H.-F. Yu, P. Ravikumar, and I. Dhillon. Collaborative Filtering with Graph Information: Consistency and Scalable Methods. In Advances in Neural Information Processing Systems, 2015.
-  X. Zhang, S. S. Du, and Q. Gu. Fast and Sample Efficient Inductive Matrix Completion via Multi-Phase Procrustes Flow. In Proceedings of Machine Learning Research, 2018.
-  S. A. Goldman, R. L. Rivest, and R. E. Schapire. Learning binary relations and total orders. SIAM J. Comput., 22(5), 1993.
-  S. A. Goldman and M. K. Warmuth. Learning binary relations using weighted majority voting. Proceedings of the 6th Annual Conference on Computational Learning Theory, pages 453–462, 1993.
-  N. Cesa-Bianchi and O. Shamir. Efficient online learning via randomized rounding. In Advances in Neural Information Processing Systems 24, pages 343–351, 2011.
-  E. Hazan, S. Kale, and S. Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. In Proceedings of the 23rd Annual Conference on Learning Theory, volume 23:38.1-38.13, 2012.
-  C. Gentile, M. Herbster, and S. Pasteris. Online similarity prediction of networked data from known and unknown graphs. In Proceedings of the 26th Annual Conference on Learning Theory, 2013.
-  M. Herbster, S. Pasteris, and S. Pontil. Predicting a switching sequence of graph labelings. Journal of Machine Learning Research, 16:2003–2022, 2015.
-  M. Herbster, S. Pasteris, and M. Pontil. Mistake bounds for binary matrix completion. In Advances in Neural Information Processing Systems 29, pages 3954–3962. 2016.
-  K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. Journal of Machine Learning Research, 6:995–1018, 2005.
-  N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, April 1988.
-  S. Sabato, S. Shalev-Shwartz, N. Srebro, Daniel J. Hsu, and T. Zhang. Learning sparse low-threshold linear classifiers. Journal of Machine Learning Research, 16:1275–1304, 2015.
-  M. Herbster and M. Pontil. Prediction on a graph with a perceptron. In Advances in Neural Information Processing Systems 19, pages 577–584, 2006.
-  A.B. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, pages 615–622, 1962.
-  M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. In Proceedings of the 22nd International Conference on Machine Learning, pages 305–312, 2005.
-  M. K. Warmuth, W. Kotłowski, and S. Zhou. Kernelization of matrix updates, when and how? In Algorithmic Learning Theory, pages 350–364, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
-  R. Bhatia. Matrix Analysis. Springer Verlag, New York, 1997.
-  S. Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2011.
-  M. Nielsen and I. L. Chuang. Quantum Computation and Quantum Information: 10th Anniversary Edition. Cambridge University Press, 2010.
-  M.K. Warmuth. Winnowing subspaces. In Proceedings of the 24th International Conference on Machine Learning, pages 999–1006, 2007.
Appendix A Proof of Theorem 1
The proof of Theorem 1 is organized as follows. We start with the required preliminaries in Subsection A.1, and then proceed to prove the regret statement of the theorem, given by Equation (4), in Subsection A.2. Finally, in Subsection A.3, we provide a proof for the mistake bound in the realizable case, as stated in Equation (6). Note that the cumulative hinge loss is an upper bound to the mistakes in the conservative case, so that the analysis for the regret bound can be further extended to give rise to a mistake bound, but we choose to perform a separate analysis instead to obtain improved constants.
A.1 Preliminaries for Proof
Suppose we have , , and as in Theorem 1. We define:
Define the matrix as
and construct as,
For all trials , all eigenvalues of are in .
Recall from (3) that
and then bounding the first term on the right hand side gives,
The argument for the second term is parallel. Therefore since it is shown that the trace of is bounded by 1 and that is positive definite, this implies that all eigenvalues of are in . ∎
For all trials ,
where is as constructed from Definition 6.