1.1 Matrix completion
Matrix completion is an area of great mathematical interest and has numerous applications, including recommender systems for e-commerce and online content (the setting of the famous Netflix Prize problem), as well as bioactivity prediction.
The recommender problem in the e-commerce setting is the following: given a database whose rows correspond to users and whose column entries indicate user preferences for certain products, fill in the missing entries so as to be able to recommend new products based on the preferences of other users. Typically these matrices are highly incomplete, since most users will only have experienced a small fraction of all available products aggarwal2016recommender . Similarly, in bioactivity prediction, compound–protein interaction (CPI) databases record bioactivity between potential drug compounds (rows, or users) and target proteins (columns, or products). However, obtaining experimental evidence of all possible interactions is prohibitively expensive, and therefore there are few known entries relative to the overall size of the database; see for example liu2015improving .
Low rank approaches to matrix completion have been the focus of a great deal of theoretical and algorithmic exploration ever since the seminal work in fazel2002matrix ; candes2009exact . In many cases, the problem is formulated as follows: given a database X ∈ ℝ^{m×n} with observed entries indexed by a set Ω, find a matrix of rank at most k that matches the observed entries up to a given tolerance, i.e. which solves

min_{Z : rank(Z) ≤ k} ‖P_Ω(Z − X)‖_F²,   (1)

where k is some small integer and P_Ω is the projection onto the space of matrices supported on the known entries, so that the error is evaluated only for the entries (i, j) ∈ Ω.
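To make the role of the projection concrete, the masked objective in (1) can be evaluated as follows. This is a minimal sketch in our own notation; `masked_error` is an illustrative helper, not part of any cited method.

```python
import numpy as np

def masked_error(X, mask, A):
    """Squared Frobenius error between X and a candidate completion A,
    evaluated only over the observed entries (mask is True on Omega)."""
    return float((((X - A) * mask) ** 2).sum())

# toy example: a 3x3 database with one unobserved entry
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
mask = np.ones_like(X, dtype=bool)
mask[2, 2] = False                      # entry (2, 2) is unknown
A = np.outer([1, 1, 0], [1, 1, 0])      # a rank-one candidate completion
err = masked_error(X, mask, A)          # the unobserved 1 at (2, 2) costs nothing
```

Note that the residual at the unobserved entry is ignored, which is exactly why a low-rank model fitted on Ω can be used to predict the missing values.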
A variety of algorithms have been proposed to solve this non-convex problem, see tanner2016low and references therein. Another popular approach is to solve a convex relaxation involving the nuclear norm candes2009exact . Such algorithms have been applied successfully in a wide range of applications, including recommender systems koren2009matrix .
1.2 Binary matrix completion
When applied to a binary matrix, the output of the matrix completion algorithms described above cannot be guaranteed to be binary, and a typical approach for matrix completion with binary data is to threshold the output of an algorithm for completion with real data vinayak2014graph ; ames2011nuclear . As highlighted in xu2014jointly , where the database follows the model assumptions and recovery is exact, factorisation reduces to assigning identical rows to the same cluster. However, this approach is not robust to violation of the assumptions and breaks down when recovery is not exact; this can occur, for example, when the sampling is non-random. This motivates the search for algorithms which explicitly seek binary solutions.
Problem (1) can also be formulated as one of approximating X as the product of two factors, namely

X ≈ U V, with U ∈ ℝ^{m×k} and V ∈ ℝ^{k×n}.   (2)
In this paper we are interested in approximating our database with binary factors, due to the greater interpretability of the output. To appreciate this, note that a binary factorisation X ≈ U V with U ∈ {0,1}^{m×k} and V ∈ {0,1}^{k×n} decomposes a binary matrix into k biclusters of rows and columns, often referred to in the itemset mining community as tiles, see for example geerts2004tiling . This decomposition provides an explicit characterization of the row/column clusters that best explain the database. Low rank decompositions designed for real matrices, on the other hand, tend to be SVD-like and orthogonal, with negative entries, and it is less clear how to interpret these. Low rank matrix completion with non-negativity constraints, for example in xu2012alternating , enforces non-negative factors, but does not address the errors induced by rounding non-integer values. Matrix completion algorithms were proposed in xu2015cur ; wang2017provably which obtain decompositions in which one factor is composed of rows of the matrix, and is consequently binary, although the other factor is generally non-integer.
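As a toy illustration of the tile interpretation (our own example, not taken from the cited works):

```python
import numpy as np

# Two tiles (biclusters): column c of U marks the rows of tile c,
# row c of V marks its columns. Row 1 belongs to both tiles.
U = np.array([[1, 0],
              [1, 1],
              [0, 1]])
V = np.array([[1, 1, 0, 0],
              [0, 1, 1, 1]])
X_sum = U @ V                    # ordinary product: overlapping entries are over-counted
X_or = np.minimum(U @ V, 1)      # Boolean (OR) product: overlaps are capped at 1
```

Entry (1, 1) is covered by both tiles, so the ordinary product gives 2 there while the OR-capped product stays binary; this distinction is what separates binary from Boolean factorisation.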
However, in the case where both factors are required to be binary, research to date has largely focused on binary matrix factorisation (BMF), in which the matrix is fully observed. BMF has found applications in itemset mining of transactional and textual data geerts2004tiling , and also in the analysis of gene expression data zhang2010binary . In these applications, entries are in general fully observed, though possibly with noise.
Various algorithms for BMF have been proposed. The problem can be formulated as an integer program, but the use of an integer programming solver is only tractable for very small problem sizes. An approach combining convex quadratic programming relaxations and binary thresholding was proposed in zhang2010binary . For the special case of rank-one approximation, linear programming relaxations were proposed in shen2009mining ; lu2011weighted ; zhang2010binary , the first two of which also proved that the linear programming solution yields a 2-approximation to the optimal objective. An approach to binary factorisation based on k-means clustering was considered in jiang2014clustering . A local search heuristic capable of improving solutions obtained by other methods was proposed in mirisaee2015improved . Of particular relevance to this paper is the PROXIMUS algorithm, proposed in koyuturk2006nonorthogonal , which identifies patterns within the data using recursive partitioning based on rank-one approximations.
Closely related to BMF is the Boolean matrix factorisation problem, in which addition is replaced by the logical OR operator, which allows the tiles to overlap without over-counting. A number of algorithms also exist for Boolean factorisation, including the iterative ASSO algorithm miettinen2008discrete and integer programming kovacs2018lowrank .
1.3 Our contribution and related work
The novelty of this paper is the use of a partitioning approach to solve the problem of BMF with missing data, which can be formulated as follows: given a binary database X with observed entries indexed by Ω, solve

min_{U ∈ {0,1}^{m×k}, V ∈ {0,1}^{k×n}} ‖P_Ω(X − U V)‖_F².   (3)
The contributions of this paper are as follows.
We show that our algorithm outperforms alternatives based on related heuristics and techniques for non-negative matrix completion and binary matrix completion, when tested on synthetic and real life data (Section 4).
The most closely related work we are aware of is the Spectral method proposed in xu2014jointly for bi-clustering databases with missing data. The authors use low rank completion to cluster similar rows and then redefine the column clusters based on the cluster membership, in a similar fashion to k-means. We show that our algorithm outperforms the Spectral method when solving Problem (3) for real world datasets.
The authors of yadava2012boolean extend the ASSO algorithm for Boolean matrix factorisation to deal with missing data. However, it is worth pointing out that their setup, as well as solving a different problem, is primarily intended for only small amounts of missing data. Methods that involve linear programs can be straightforwardly extended to the missing data case by evaluating the objective and enforcing constraints only for the observed entries (i, j) ∈ Ω; however, we are not aware of any attempts to analyse their predictive power for the recommender problem.
1.4 Relation to graph-theoretic problems
Viewing the database as the adjacency matrix of a graph, we can view (3) as a clustering problem; in particular, the problem is that of approximating the edge set of a partially observed bipartite graph as a union of bicliques. The factors U and V can be interpreted as indicating the rows and columns of these bicliques. The authors of vinayak2014graph solve the related problem for the diagonal block model, expressing the database as a sum of a low rank matrix plus a sparse component to account for noise. Our approach differs in that we allow column overlap of clusters. Bi-clustering approaches, in particular for clustering gene expression data, have been used to solve formulations similar to the BMF problem with different assumptions about the underlying data, equivalent to considering different constraints on U and V. However, these have focused primarily on cluster recovery; our focus is on exploring the predictive power of different algorithms for solving Problem (3).
2 The TBMC algorithm
We present an algorithm for low rank binary matrix completion, inspired by the partitioning approach of koyuturk2006nonorthogonal , which generates a binary factorisation from recursive rank-one partitions.
Our algorithm is inspired by the recursive partitioning approach of the PROXIMUS algorithm for binary matrix factorisation koyuturk2006nonorthogonal . For a given submatrix X_s, the algorithm first calculates a binary rank-one approximation X_s ≈ u vᵀ. The matrix is then partitioned based upon the rows included in the tile u vᵀ: if the i-th entry of u is positive, then row i is included in the submatrix X_1, else it is included in X_0. Both submatrices are added to the set of submatrices to be partitioned.
The vector v is used as the pattern vector for X_1. If the normalised Hamming distance from v to each of the rows in X_1 is less than some tolerance ε, then the tile is included in the decomposition and X_1 is not partitioned any further. In addition, if the rank-one approximation returns a constant pattern vector (all 0s or all 1s), the tile is added to the decomposition. This process is repeated on successive submatrices until convergence; the algorithm terminates when no partitions remain. An illustration of the splitting process can be seen in Figure 1.
2.2 Rank-one approximation with missing data
The partitioning method outlined above relies on rank-one approximations of the database. We outline two methods for this sub-problem.
2.2.1 Approximate rank-one using a linear program
Inspired by the work of shen2009mining , we propose a linear programming relaxation of the rank-one approximation problem with missing data.
The rank-one problem can be formulated as an integer linear program for the case where no data is missing mirisaee2015improved ; kovacs2018lowrank . The authors of shen2009mining show that a linear program relaxation of a related problem has no more than twice the error of the integer solution. In Section 3 we show that a similar result holds for the missing data case.
We can formulate the best binary rank-one approximation problem as follows:

min_{u ∈ {0,1}^m, v ∈ {0,1}^n} ‖P_Ω(X − u vᵀ)‖_F².   (4)

Since Σ_{(i,j) ∈ Ω} x_ij is fixed, (4) is equivalent to solving

max_{u ∈ {0,1}^m, v ∈ {0,1}^n} Σ_{(i,j) ∈ Ω} (2 x_ij − 1) u_i v_j.   (5)
We translate Problem (5) to a linear program as follows: introducing dummy variables w_ij, which serve as indicator variables for the products u_i v_j, we can rewrite (5) as

maximise  Σ_{(i,j) ∈ Ω} (2 x_ij − 1) w_ij
subject to  w_ij ≤ u_i,  w_ij ≤ v_j,  w_ij ≥ u_i + v_j − 1,  w_ij ≥ 0,
            u_i, v_j, w_ij ∈ {0, 1}.   (6)

If we relax the integer constraints then, at optimality, the dummy variables for the entries with x_ij = 1 can only meet their upper constraints, while those for the entries with x_ij = 0 can only meet their lower constraints, so for each observed entry the remaining pair of constraints may be dropped. Thus we consider the following formulation for approximating the solution to Problem (6):

maximise  Σ_{(i,j) ∈ Ω : x_ij = 1} w_ij − Σ_{(i,j) ∈ Ω : x_ij = 0} w_ij
subject to  w_ij ≤ u_i,  w_ij ≤ v_j  for (i, j) ∈ Ω with x_ij = 1,
            w_ij ≥ u_i + v_j − 1,  w_ij ≥ 0  for (i, j) ∈ Ω with x_ij = 0,
            0 ≤ u_i, v_j, w_ij ≤ 1.   (7)
The corresponding constraint matrix is totally unimodular (as it is a submatrix of the constraint matrix in shen2009mining ), and so solving (7) with a simplex method gives integral values for u and v. We can obtain the corresponding rank-one approximation as u vᵀ. Note that w_ij is only defined for (i, j) ∈ Ω; the remaining entries of the approximation can be calculated as u_i v_j, although at a non-vertex optimum these are no longer guaranteed to be integers.
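A sketch of how the relaxation could be assembled and solved numerically is given below. This is our own illustrative implementation using SciPy's `linprog` rather than the COIN-OR simplex code used in the experiments; the variable layout and the rounding at 0.5 are implementation choices, not prescribed by the method.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.sparse import coo_matrix

def rank_one_lp(X, mask):
    """LP relaxation of the binary rank-one approximation with missing data.

    X    : (m, n) 0/1 array (values at unobserved entries are ignored)
    mask : (m, n) Boolean array, True on the observed set Omega
    Returns binary vectors u (m,) and v (n,) defining the tile u v^T.
    """
    m, n = X.shape
    ones = list(zip(*np.nonzero(mask & (X == 1))))
    zeros = list(zip(*np.nonzero(mask & (X == 0))))
    # variable layout: [u (m), v (n), w for observed 1s, w for observed 0s]
    nvar = m + n + len(ones) + len(zeros)
    c = np.zeros(nvar)
    c[m + n : m + n + len(ones)] = -1.0   # maximise sum of w over observed 1s...
    c[m + n + len(ones) :] = 1.0          # ...minus sum of w over observed 0s
    rows, cols, vals, b = [], [], [], []
    r = 0
    for k, (i, j) in enumerate(ones):     # w <= u_i and w <= v_j
        rows += [r, r]; cols += [m + n + k, i]; vals += [1.0, -1.0]; b.append(0.0); r += 1
        rows += [r, r]; cols += [m + n + k, m + j]; vals += [1.0, -1.0]; b.append(0.0); r += 1
    off = m + n + len(ones)
    for k, (i, j) in enumerate(zeros):    # u_i + v_j - w <= 1 (w >= 0 from the bounds)
        rows += [r, r, r]; cols += [i, m + j, off + k]; vals += [1.0, 1.0, -1.0]
        b.append(1.0); r += 1
    A = coo_matrix((vals, (rows, cols)), shape=(r, nvar))
    res = linprog(c, A_ub=A, b_ub=np.array(b), bounds=[(0.0, 1.0)] * nvar, method="highs")
    # vertex solutions are integral when the constraint matrix is totally
    # unimodular; round at 0.5 as a safeguard
    u = (res.x[:m] > 0.5).astype(int)
    v = (res.x[m : m + n] > 0.5).astype(int)
    return u, v
```

For a small matrix containing a single planted tile, the LP recovers the tile exactly.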
2.2.2 Alternating minimisation
Alternating minimisation has been widely used for matrix completion, see for example jain2013low . Given an initialisation vector v, we want to find u to minimise the approximation error

‖X − u vᵀ‖_F² = Σ_{i,j} x_ij − Σ_{i,j} (2 x_ij − 1) u_i v_j ,

where we have used the fact that the entries of X, u and v are binary. Where there are no missing entries, we can use the fact that Σ_{i,j} x_ij is fixed to argue that the minimal error will be obtained by setting u_i = 0 exactly when the coefficient of u_i is non-positive. Thus the update for u is given by

u_i = 1 if Σ_j (2 x_ij − 1) v_j > 0,  and u_i = 0 otherwise.

Then v can be calculated in a similar way. Typically the process converges in only a few iterations.

For the case when there are missing entries, we consider instead the matrix S defined such that

S_ij = 2 x_ij − 1 for (i, j) ∈ Ω,  and S_ij = 0 otherwise;

then the updated binary u is given by

u_i = 1 if Σ_j S_ij v_j > 0,  and u_i = 0 otherwise.
The authors of koyuturk2006nonorthogonal outline a number of heuristics for initialising the iterative scheme. In particular, if not initialised correctly, this method can lead to the empty tile, which is always sub-optimal. We propose using alternating minimisation as an optional post-processing step for the TBMC algorithm stated in Section 2.3.
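The alternating updates can be sketched in a few lines. The notation, stopping rule and iteration cap below are our own illustrative choices.

```python
import numpy as np

def am_update(S, v):
    # include row i in the tile iff doing so strictly reduces the observed error
    return (S @ v > 0).astype(int)

def alternating_minimisation(X, mask, v, max_iter=50):
    """One-tile alternating minimisation with missing data.

    S has +1 for observed ones, -1 for observed zeros and 0 for missing
    entries, so the updates only count the observed entries. As noted in
    the text, a poor initialisation of v can collapse to the empty tile."""
    S = np.where(mask, 2 * X - 1, 0)
    u = am_update(S, v)
    for _ in range(max_iter):
        v_new = am_update(S.T, u)
        u_new = am_update(S, v_new)
        if np.array_equal(v_new, v) and np.array_equal(u_new, u):
            break
        u, v = u_new, v_new
    return u, v
```

Seeding v with a single column of a planted tile is typically enough for the iteration to grow the full tile within a couple of sweeps.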
2.3 Statement of algorithm
Starting with the full database, we calculate the rank-one approximation u vᵀ obtained by solving Problem (7) for the current partition, then split the rows into those that are included in the tile and those that are not, and repeat for both sides of the partition. As in koyuturk2006nonorthogonal , when the partitioning process generates rows that are all within a given radius ε of the pattern vector v, or when v is the vector of all 0s or all 1s, then the tile is added to the factorisation. The algorithm is stated in detail in Algorithm 1. We could also consider using the alternating minimisation outlined in Section 2.2.2 to improve the approximation found by solving (7). We refer to this approach as TBMC-AM.
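The overall recursion can be sketched as follows. This is an illustrative reading of Algorithm 1, not the authors' implementation: `simple_rank_one` is a hypothetical cheap stand-in for the LP step, and the acceptance test uses the normalised Hamming distance over observed entries as described above.

```python
import numpy as np

def simple_rank_one(Xs, Ms):
    # cheap stand-in for the LP step of Section 2.2.1: seed the pattern with
    # the densest observed row, then perform one alternating-minimisation sweep
    S = np.where(Ms, 2 * Xs - 1, 0)
    v = (Xs * Ms)[np.argmax((Xs * Ms).sum(axis=1))].astype(int)
    u = (S @ v > 0).astype(int)
    v = (S.T @ u > 0).astype(int)
    return u, v

def tbmc(X, mask, rank_one=simple_rank_one, eps=0.1):
    """Recursive-partitioning sketch of TBMC. rank_one may be any binary
    rank-one routine, e.g. the LP relaxation. Returns a list of
    (row_indices, pattern_vector) tiles."""
    tiles, stack = [], [np.arange(X.shape[0])]
    while stack:
        rows = stack.pop()
        Xs, Ms = X[rows], mask[rows]
        u, v = rank_one(Xs, Ms)
        in_tile, out = rows[u == 1], rows[u == 0]
        if len(in_tile) == 0 or len(out) == 0:
            # no split produced: accept the whole block as a single tile
            tiles.append((rows, v))
            continue
        # normalised Hamming distance of each included row to the pattern v,
        # computed over observed entries only
        R, O = Xs[u == 1], Ms[u == 1]
        d = (np.abs(R - v) * O).sum(axis=1) / np.maximum(O.sum(axis=1), 1)
        if (d <= eps).all():
            tiles.append((in_tile, v))   # all rows close to the pattern: keep the tile
        else:
            stack.append(in_tile)        # otherwise partition further
        stack.append(out)
    return tiles
```

On a block-diagonal matrix with two planted tiles, the recursion splits once and accepts each block as a tile.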
3 Approximation bounds
In Section 2.2, we proposed a linear programming method for solving the rank-one sub-problem of our method. By closely following the approach in shen2009mining for the case in which all the entries of X are known, we show that the approximation error of the tile found by solving Problem (7) is no more than twice the optimal error.
Denoting by (u*, v*, w*) the optimal solution for Problem (7) obtained using a simplex method, the approximation error satisfies

‖P_Ω(X − u* (v*)ᵀ)‖_F² ≤ 2 min_{u ∈ {0,1}^m, v ∈ {0,1}^n} ‖P_Ω(X − u vᵀ)‖_F².
The minimal error for a single tile approximation is given by

min_{u, v} ‖P_Ω(X − u vᵀ)‖_F² = Σ_{(i,j) ∈ Ω} x_ij − OPT(6) ≥ Σ_{(i,j) ∈ Ω} x_ij − OPT(7),

where the inequality is a result of the fact that the optimal value of Problem (6) is bounded above by the optimal value of Problem (7), since relaxing constraints does not decrease the value at optimality.
For (i, j) ∈ Ω we know that u*_i, v*_j and w*_ij are integers. Since the objective of (7) gives each w*_ij for an entry with x_ij = 1 a positive coefficient, while such a w*_ij is bounded above by min(u*_i, v*_j), we have that w*_ij = 1 if and only if u*_i = v*_j = 1. Thus we can split the summation terms on the right hand side of the inequality according to the values of u*_i and v*_j to obtain (14). Since X is binary, each summand is itself binary, which we can use to replace the sum in (14) corresponding to the entries where exactly one of u*_i and v*_j is 1, and so derive (16). Now, since w*_ij is equal to u*_i + v*_j − 1 when this quantity is positive and to 0 otherwise, we can rewrite (16) as the stated bound.
This gives us our bound. ∎
Note that in the case where the true best solution has zero error, the LP relaxation will also have zero error. This leads to the following corollary.
In this case the minimal error is zero, and so by Theorem 1 the LP relaxation will also have zero error. ∎
4 Numerical Results
4.1 Verifying the rank-one 2-approximation
We first perform experiments to verify numerically that the 2-approximation result given in (12) is not violated. We generate synthetic matrices and solve (7) using a simplex method from COIN-OR lougee2003common , implemented using the cylp package in Python. In fact, the LP relaxation finds the optimal rank-one approximation in the majority of cases.
We compare the LP approach to other methods for rank-one approximation, including two heuristics similar to those used in koyuturk2006nonorthogonal and NMF with missing values, based on the multiplicative scheme of lee1999learning . In particular we consider
‘average’: setting v_j = 1 for each column j in which more than half of the observed entries are positive, and u_i = 1 for each row i that has a positive entry in at least one of those columns;
‘partition’: selecting a column at random to seed u, taking u to indicate the rows with a positive entry in that column, and calculating v as the average of those rows;
‘NMF’: a non-negative rank-one factorisation is obtained using a multiplicative update scheme with missing data, and the factors are binarised using a threshold of 0.5. We use the approach outlined in zhang2010binary to normalize the factors such that the 0.5 threshold is appropriate.
‘Spectral’: we obtain a rank-one approximation by implementing the method outlined in xu2014jointly , which follows a k-means approach. Briefly, the factors are initially generated using a projection onto the leading singular vectors; viewing the factorisation as a clustering, the cluster footprints in V are updated using the clustering given by U, before reassigning rows to the cluster whose footprint most closely matches their own.
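The masked multiplicative scheme used for the NMF baseline can be sketched as follows. This is our reconstruction in the spirit of lee1999learning ; the normalisation shown is a simplification of the one in zhang2010binary , and the iteration count and initialisation are illustrative.

```python
import numpy as np

def nmf_rank_one_masked(X, M, iters=300, seed=0):
    """Rank-one NMF with missing data via multiplicative updates applied
    only to observed entries (M is the Boolean mask of known entries)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = rng.random(m) + 0.1
    h = rng.random(n) + 0.1
    Xm = X * M
    for _ in range(iters):
        wh = np.outer(w, h) * M
        h *= (w @ Xm) / np.maximum(w @ wh, 1e-12)
        wh = np.outer(w, h) * M
        w *= (Xm @ h) / np.maximum(wh @ h, 1e-12)
    # crude normalisation so that a 0.5 threshold is meaningful
    w, h = w / max(w.max(), 1e-12), h / max(h.max(), 1e-12)
    return (w > 0.5).astype(int), (h > 0.5).astype(int)
```

On an exactly rank-one binary matrix the scheme recovers the planted factors after thresholding.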
For TBMC, average and partition we also consider using an alternating minimisation step, outlined in Section 2.2.2, to improve the rank-one approximations. We refer to these approaches as TBMC-AM, average-AM and partition-AM.
We evaluate performance relative to the optimal solution, which for small databases we obtain by solving (6) directly as an integer program. We generate synthetic matrices with a single planted tile of fixed size ratio with respect to the database size. We simulate noisy data by randomly flipping a fraction of the entries, and model missing data by removing a proportion of the entries. We plot the ratio of the squared approximation error to the optimal squared error in Figure 2. We observe that while the other methods all violate the 2-approximation bound for certain randomly generated matrices, the LP does not.
4.2 Performance of TBMC for rank-k decomposition
We next demonstrate that our TBMC algorithm is a practical method for generating interpretable rules for inferring missing information in real data sets (recommender systems). Moreover, on two of the three data sets that we investigate, we find that TBMC outperforms other state-of-the-art methods at this task. We then consider the performance of rank-k binary matrix factorisation on the following datasets.
ALL-AML (GE): gene expression data golub1999molecular , containing information from human tumour samples of different leukemia types. This dataset has been widely used for cancer classification, and for the evaluation of BMF techniques in the case of no missing data brunet2004metagenes . We normalize the data by column.
RC: customer reviews for restaurants RC . Since ratings are given as star values, we map to binary data by converting the highest star ratings to one and the remaining star ratings to zero.
ML: MovieLens 100k MovieLens , which contains star ratings from many users across a large catalogue of films. We threshold to consider only the highest star ratings.
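The star-rating preprocessing for RC and ML can be sketched as below. The threshold argument is a placeholder (the exact cut-offs are dataset-specific), and treating 0 as a missing entry is an assumption of this sketch.

```python
import numpy as np

def binarise_ratings(R, thresh):
    """R: star-ratings matrix with 0 marking a missing entry (an assumed
    convention). Returns the binary database X and the observation mask."""
    mask = R > 0
    X = ((R >= thresh) & mask).astype(int)
    return X, mask

R = np.array([[5, 0, 2],
              [0, 4, 1]])
X, mask = binarise_ratings(R, thresh=4)
```

Only entries at or above the threshold become ones; entries below it become observed zeros, and zero entries stay outside the mask.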
We compare the performance of TBMC and TBMC-AM against NMF and the Spectral method (see Section 4.1). The rank parameter was optimised for NMF. The maximum rank was capped at five in the Spectral method, although the method was found to often terminate before five tiles were found. We also compare with a variant of the recursive partitioning approach of TBMC in which the LP-based rank-one approximation step is replaced by the average and partition heuristics (see Section 4.1), again with and without alternating minimisation.
We split the entries of each database into known training entries and unseen test entries. Note that MovieLens and RC have a low density of known entries. In all cases we record the percentage approximation error. We give approximation errors on both the test set and the training set. The first score tells us how well each method is able to predict missing entries; the second tells us how well the method is able to generate a tiling model for the known data.
From Table 1, we see that TBMC outperforms all other algorithms as a predictive method on all three data sets. The alternating minimisation (AM) step leads to improved results in some cases, but not always. The improvement over other methods is significant in the case of the gene expression (GE) and restaurant customer (RC) data. On the other hand, in the case of the MovieLens data, the improvement gained by using TBMC over the NMF and Spectral methods is marginal. For MovieLens, even replacing the LP rank-one approximation in TBMC with the simpler averaging heuristic is seen to give similar performance.
Approximation error on the training set allows us to compare how well the database tilings of the various methods are able to capture the known data. We see from Table 1 that TBMC outperforms all other methods on this task for the restaurant customer and gene expression datasets. For MovieLens, replacing the LP rank-one approximation in TBMC with the averaging heuristic is seen to give the best performance.
We speculate that the varying results for different datasets are due in large part to the degree to which each database can be modelled as a low-rank binary factorisation (tiling). It should be emphasised that the prediction algorithms considered here are restricted to those which generate interpretative binary factorisations. We are not claiming that TBMC competes favourably with all prediction algorithms (such as neural networks) if the interpretability requirement is removed. That said, our results show clearly that TBMC performs well compared to other algorithms for low-rank matrix completion with binary factors.
We conclude by making two further observations. Firstly, the fact that averaging heuristics can perform well highlights the importance of balancing higher accuracy against constraints on computational time. Secondly, the algorithm shows evidence of over-fitting the RC data, as the percentage error on the known values (training set) is far lower than that on the test set.
The RC database appears to be an especially good fit for approximation by binary factors (tiles). The clustering found by TBMC on the RC database is provided as an illustration in Figure 3. The rows and columns have been reordered for visual effect.
5 Conclusion and future directions
Binary matrix completion for recommender systems has numerous applications. Influenced by approaches to binary matrix factorisation in the itemset mining literature, we have proposed the TBMC algorithm for binary matrix completion and shown that it outperforms alternatives on both synthetic and real data sets for certain regimes. These results make the case for the consideration of heuristic methods where typical assumptions for low rank matrix completion are violated and exact recovery is not guaranteed.
We have given error guarantees for the rank-one sub-problem which is a main building block of TBMC. A more ambitious goal would be error guarantees for the full algorithm.
Our current error guarantees are worst-case in the sense that they hold for any matrix. Another interesting direction would be to prove tighter error bounds assuming some underlying model such as a planted tile model.
- (1) Charu C Aggarwal. Recommender systems. Springer, 2016.
- (2) Brendan PW Ames and Stephen A Vavasis. Nuclear norm minimization for the planted clique and biclique problems. Mathematical Programming, 129(1):69–89, 2011.
- (3) Jean-Philippe Brunet, Pablo Tamayo, Todd R Golub, and Jill P Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences, 101(12):4164–4169, 2004.
- (4) Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717, 2009.
- (5) Maryam Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.
- (6) Floris Geerts, Bart Goethals, and Taneli Mielikäinen. Tiling databases. In International Conference on Discovery Science, pages 278–289, 2004.
- (7) Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.
- (8) Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.
- (9) Peng Jiang, Jiming Peng, Michael Heath, and Rui Yang. A clustering approach to constrained binary matrix factorization. In Data Mining and Knowledge Discovery for Big Data, pages 281–303. Springer, 2014.
- (10) Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, pages 30–37, 2009.
- (11) Réka Kovács, Oktay Günlük, and Raphael Hauser. Low-rank Boolean matrix approximation by integer programming. In NIPS Workshop on Optimization for Machine Learning, Long Beach, CA, 2017.
- (12) Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan. Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Transactions on Mathematical Software, 32(1):33–69, 2006.
- (13) Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.
- (14) Hui Liu, Jianjiang Sun, Jihong Guan, Jie Zheng, and Shuigeng Zhou. Improving compound–protein interaction prediction by building up highly credible negative samples. Bioinformatics, 31(12):i221–i229, 2015.
- (15) Robin Lougee-Heimer. The common optimization interface for operations research: Promoting open-source software in the operations research community. IBM Journal of Research and Development, 47(1):57–66, 2003.
- (16) Haibing Lu, Jaideep Vaidya, Vijayalakshmi Atluri, Heechang Shin, and Lili Jiang. Weighted rank-one binary matrix factorization. In Proceedings of the 2011 SIAM International Conference on Data Mining, pages 283–294. SIAM, 2011.
- (17) F. Maxwell Harper and J. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS).
- (18) Rafael Medellín, Juan González Sern, and Blanca Vargas-Govea. Restaurant data with consumer ratings. https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings. Accessed: 2018-11-30.
- (19) Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. IEEE Transactions on Knowledge and Data Engineering, 20(10):1348–1362, 2008.
- (20) Seyed Hamid Mirisaee, Eric Gaussier, and Alexandre Termier. Improved local search for binary matrix factorization. In AAAI, pages 1198–1204, 2015.
- (21) Bao-Hong Shen, Shuiwang Ji, and Jieping Ye. Mining discrete patterns via binary matrix factorization. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 757–766, 2009.
- (22) Jared Tanner and Ke Wei. Low rank matrix completion by alternating steepest descent methods. Applied and Computational Harmonic Analysis, 40(2):417–429, 2016.
- (23) Ramya Korlakai Vinayak, Samet Oymak, and Babak Hassibi. Graph clustering with missing data: Convex algorithms and analysis. In Advances in Neural Information Processing Systems, pages 2996–3004, 2014.
- (24) Yining Wang and Aarti Singh. Provably correct algorithms for matrix column subset selection with selectively sampled data. The Journal of Machine Learning Research, 18(1):5699–5740, 2017.
- (25) Jiaming Xu, Rui Wu, Kai Zhu, Bruce Hajek, R Srikant, and Lei Ying. Jointly clustering rows and columns of binary matrices: Algorithms and trade-offs. In ACM SIGMETRICS Performance Evaluation Review, volume 42, pages 29–41. ACM, 2014.
- (26) Miao Xu, Rong Jin, and Zhi-Hua Zhou. Cur algorithm for partially observed matrices. In International Conference on Machine Learning, pages 1412–1421, 2015.
- (27) Yangyang Xu, Wotao Yin, Zaiwen Wen, and Yin Zhang. An alternating direction algorithm for matrix completion with nonnegative factors. Frontiers of Mathematics in China, 7(2):365–384, 2012.
- (28) Prashant Yadava. Boolean matrix factorization with missing values. Master’s thesis, Universität des Saarlandes, 2012.
- (29) Zhong-Yuan Zhang, Tao Li, Chris Ding, Xian-Wen Ren, and Xiang-Sun Zhang. Binary matrix factorization for analyzing gene expression data. Data Mining and Knowledge Discovery, 20(1):28, 2010.