A common problem in machine learning and regression analysis is to predict the value of an as of yet unobserved output variable of interest as a function of observed input variables called features. The functional dependence between the input and output variables has to be learned from a set of samples, or items, each of whose inputs and outputs are known. When is small in relation to , this task may be affected by overfitting [HTF13], but in applications it is typically observed that the data is inherently well approximated by a lower dimensional representation, and that reducing the dimensionality dramatically reduces the overfitting. A classical technique to achieve dimensionality reduction is linear factor analysis [Jol10, GV89, Mul09]: Given a data matrix whose rows correspond to items and columns to features, compute and such that the Frobenius norm of the approximation error is minimal for some fixed rank . The rank- approximation describes the data matrix using only implicit features: the rows of specify how the observed variables relate to the implicit features, while the rows of show how the observed variables of each item can be (approximately) expressed as a linear combination of the implicit features.
Many practical data sets contain a mixture of different data types. In this paper we concentrate on categorical variables. For example, in the data set of congressional votes discussed in the numerical section of this paper, items correspond to members of congress, and features to votes on different bills. The voting behavior of each member can be represented by two Boolean variables per bill, a first variable taking the value if the member voted “yes”, and a second variable that takes the value if they voted “no”. Note that in case of abstentions, both variables take the value , indicating an absence of both categorical features. Such an expansion of categorical variables into Boolean variables proportional to the number of different categories is both typical and necessary because of an asymmetry of treating s and s under Boolean arithmetic.
For Boolean data matrices it is natural to require that the factor matrices and are Boolean as well. This requirement introduces an intrinsic difficulty into the problem because real arithmetic is replaced by arithmetic over the Boolean semiring in which holds. Boolean matrix multiplication is defined as for some Boolean matrices
. Note that for Boolean vectors, it holds that . An optimal rank- Boolean matrix approximation for and is then given by and for which is minimal. The Boolean rank of is defined as the smallest for which the approximation error is zero [Kim82]. Note that the Boolean rank of a Boolean matrix may differ from its linear algebraic rank. In the following example, inspired by [Mie13], let be a data matrix of Boolean variables that indicate if worker has access to room . The Boolean factorization is of exact Boolean rank and reveals that there are two different roles, one requiring access to rooms and , and the other requiring access to rooms and , and that worker serves in both roles, whereas workers and serve only in one. In contrast, treating as a real matrix renders it of linear algebraic rank , and the best rank- approximation fails to reveal a clear interpretation.
Interpreting as the node-node incidence matrix of a bipartite graph , the problem of finding the Boolean rank of has an interpretation as a minimum edge covering of by bi-cliques, which is a well known NP-complete problem [Orl77]. Correspondingly finding the best Boolean rank- approximation of has an interpretation of minimizing the number of errors in approximate coverings of by bi-cliques.
approximation is a problem that is generally solved via heuristics that may[FHMP07, FHP16] or may not yield a Boolean factorization [dL06, SSU03, TT06, UHZB16]. The method developed in [BV07, BV10] can be used to find exact Boolean rank- decompositions, but not approximations. The method of [GGM04] produces Boolean rank- approximations but treat the errors in and asymmetrically. [MMG06] presents a powerful heuristic for the correct error term and returns genuinely Boolean factors. Building on methods of [VAW06], the authors of [LVA08] presented an integer programming model that is closely related to low-rank Boolean matrix approximation but relies on mining an initial set of patterns from which the approximating factors are composed. They also provide an integer programming model with an exponential number of variables and constraints for low-rank Boolean matrix approximation, but do not detail its solution. Our paper further contributes to this discussion by introducing an integer programming model that relies on only a polynomial number of variables and constraints that can be solved by CPLEX [CPL17] for problem sizes that are realistic in applications.
2 Problem Formulation
For a given Boolean matrix and integer parameter , we next describe how to construct two Boolean matrices and so as to minimize the approximation error , where and denote the -th row of and -th column of , and , and . In the IP formulation below we denote the McCormick envelope [Mcc76] of by . Note that when , then only contains the point , allowing us to express the nonlinear relationship in terms of linear constraints only. Optimal factors may now be computed by solving the following binary integer program,
where variables and denote the coefficients of and respectively, variables the coefficients of , the elements of , and the product of the variables and . Constraints (4) ensure that all variables take values in any feasible solution. Constraints (1) imply that for all and due to the objective function, it is easy to see that in any optimal solution, as desired. Furthermore, any integral solution satisfies for all due to constraints (3) and therefore constraints (2) imply that , as desired.
2.1 Improved Formulation
Note that (BP) uses constraints and variables. To the best of our knowledge, previous IP models for the Boolean rank- approximation problem from the literature all required an exponential number of constraints [LVA08]. We remark that the second set of constraints (2), , may be summed and replaced by constraints without changing the feasible set. Even though this formulation has fewer constraints, we found it to be less effective in computations because it caused greater branching in the code. However, the formulation (BP) can be significantly improved in other ways to make more efficacious use of the computational power of commercial solvers such as CPLEX [CPL17]. We next describe these ideas.
2.1.1 Relaxation of Integrality Constraints
First note that, due to the nature of McCormick envelopes, the variables are guaranteed to take integer values when . Consequently, the integrality constraint on may be relaxed. Further note that if all variables are integral, then for any fixed , (2) either implies that (if for all ) or that (if at least one for some ). Therefore, -variables may also be treated as continuous. Finally, since holds at all optimal solutions, the integrality of the variables can also be relaxed. Consequently, only the and variables need to be declared integral which leads to a mixed integer programming formulation with binary variables only, which is obtained by replacing constraints (4) by
2.1.2 Deleting Redundant Constraints
Note that for any and the variable takes a binary value in any feasible solution, and since the input data parameter is binary as well, one of the two constraints (1) is redundant. Specifically, we may replace (1) by
Using the same reasoning, one finds that half the constraints that define are redundant, so that (3) may be replaced by
Consequently, approximately half of the constraints in the original formulation can be deleted.
2.1.3 Preprocessing the Input Data
In practice, the matrix of input data may contain rows (or columns) of all zeros. Deleting these rows (or columns) leads to an equivalent problem whose solution and can easily be translated to a solution for the original problem by inserting a row of zeros to (respectively a column of zeros to ) in the corresponding place. In addition, may contain duplicate rows (or columns). In this case it is sufficient to keep only one copy of each, solve the problem on the reduced data, and to reinsert the relevant copies of rows of (respectively columns of ) in the optimal factors and found for the reduced data. In order for this process to yield the correct result, the objective function of the reduced problem must correspond to the approximation error of the original problem. Therefore, if row of is repeated times, and column is repeated times, then the variable that corresponds to the remaining copy of these rows and columns must be multiplied by in the objective function of the reduced problem.
3 Computational Experiments
In this section we report numerical tests of the model (BP) on artificial and real-world datasets. The model was solved by CPLEX [CPL17] in each instance on a 2015 MacBook Pro with a 3.1 GHz Intel Core i7 processor and 16 GB of memory. The experiments with artificial data reveal that the improved formulation leads to significant speed-up.
3.1 Artificial Datasets
Artificial datasets were generated by sampling matrices and with i.i.d. random coefficients for fixed parameters . The data matrix was computed by forming the Boolean product , which is a matrix with exact Boolean rank , and by randomly flipping of the coefficients. The sparsity of was controlled so that
with probability. The results of Figure 2 show how noise in the data affects the complexity of the problem. Each experiment was repeated times, and the plot shows averaged running times and error bars. We observe that the complexity of the problem grows rapidly with the level of noise, an effect that occurs partly, but not solely, because the preprocessing step is less powerful at reducing the dimension of noisy matrices. This is confirmed by Figure 1, which shows the effect of the improved formulation and the preprocessing steps. Each randomly generated input was used in the full model, the improved model, and the improved model with a preprocessing step, and the running times were averaged over repeats of the experiment.
3.2 Real-World Datasets
The SPECT Heart dataset [CK] describes cardiac Single Proton Emission Computed Tomography images of patients by binary feature patterns, providing us a Boolean data matrix of dimension . The Primary Tumor dataset [KC88] contains observations of categorical variables for patients. of the variables were non-Boolean and were converted to Boolean variables, resulting in an all Boolean representation of the data matrix in dimension . The 1984 United States Congressional Voting Records dataset [Sch87] includes votes for each of the U.S. House of Representatives Congressmen on the key votes identified by the CQA. The categorical variables taking values of “voted for”, “voted against” or “did not vote”, are converted into Boolean variables. The resulting Boolean data matrix is of dimension . Rank- approximations were computed for all three datasets with separately. Each run was limited to a budget of 2h, after which the incumbent solution was used as an approximation and the error reported in Figure 3. We observe that even a suboptimal rank- approximation correctly reconstructs at least of the data in each case. If the problems were solved to optimality, the approximation error would decrease in , but since the figure shows the approximation errors of suboptimal solutions the curves can be non-monotonic.
To the best of our knowledge, the MIP formulation of the optimal low-rank Boolean matrix approximation problem discussed in this paper is the first model that relies on only polynomially many variables and constraints and constitutes the first exact method that is viable for realistic problem sizes. Our preliminary computational experiments suggest that our technique is applicable to real world data sets that were hitherto only approachable via heuristics [MMG06].
- [BV07] Radim Belohlávek and Vilém Vychodil. Formal concepts as optimal factors in boolean factor analysis: Implications and experiments. In CLA, 2007.
- [BV10] Radim Belohlavek and Vilem Vychodil. Discovery of optimal factors in binary data via a novel method of matrix decomposition. Journal of Computer and System Sciences, 76(1):3 – 20, 2010.
- [CK] Krzysztof J. Cios and Lukasz A. Kurgan. Uci machine learning repository: Spect heart data.
- [CPL17] CPLEX Optimization, Inc., Incline Village, NV. Using the CPLEX Callable Library, V.12.6, 2017.
- [dL06] Jan de Leeuw. Principal component analysis of binary data by iterated singular value decomposition. Computational Statistics & Data Analysis, 50(1):21 – 39, 2006. 2nd Special issue on Matrix Computations and Statistics.
A. A. Frolov, D. Husek, I. P. Muraviev, and P. Y. Polyakov.
Boolean factor analysis by attractor neural network.IEEE Transactions on Neural Networks, 18(3):698–707, May 2007.
- [FHP16] A. A. Frolov, D. Húsek, and P. Y. Polyakov. Comparison of seven methods for boolean factor analysis and their evaluation by information gain. IEEE Transactions on Neural Networks and Learning Systems, 27(3):538–550, March 2016.
- [GGM04] Floris Geerts, Bart Goethals, and Taneli Mielikäinen. Tiling Databases, pages 278–289. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
- [GV89] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins U. Press, Baltimore, 1989.
- [HTF13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer New York, 2013.
- [Jol10] I.T. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer, 2010.
- [KC88] Igor Kononenko and Bojan Cestnik. Uci mach. learn. rep.: Primary tumor domain, 1988.
- [Kim82] K.H. Kim. Boolean matrix theory and applications. Monographs and textbooks in pure and applied mathematics. Dekker, 1982.
- [LVA08] Haibing Lu, Jaideep Vaidya, and Vijayalakshmi Atluri. Optimal boolean matrix decomposition: Application to role engineering. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE ’08, pages 297–306, Washington, DC, USA, 2008. IEEE Computer Society.
- [Mcc76] Garth P. Mccormick. Computability of global solutions to factorable nonconvex programs: Part i – convex underestimating problems. Math. Program., 10(1):147–175, December 1976.
- [Mie13] Pauli Miettinen. Boolean matrix factorization in data mining and elsewhere, 2013.
- [MMG06] Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. In Proc. of 10th Eur. Conf. on Principles and Practice of Knowledge Discovery in Databases, ECMLPKDD’06, pages 335–346, Berlin, 2006. Springer-Verlag.
- [Mul09] Stanley A Mulaik. Foundations of Factor Analysis, Second Edition. Chapman and Hall/CRC Statistics in the Social and Behavioral Sciences. Chapman and Hall/CRC, 2 edition, 2009.
- [Orl77] James Orlin. Contentment in graph theory: covering graphs with cliques. In Indagationes Mathematicae (Proceedings), volume 80, pages 406–424. North-Holland, 1977.
- [Sch87] Jeff Schlimmer. Uci machine learning repository: 1984 US Cong. Voting Records Database, 1987.
- [SSU03] Andrew I. Schein, Lawrence K. Saul, and Lyle H. Ungar. A generalized linear model for principal component analysis of binary data. page 546431, 2003.
- [TT06] F. Tang and H. Tao. Binary principal component analysis. In BMVC 2006 - Proceedings of the British Machine Vision Conference 2006, pages 377–386, 2006.
- [UHZB16] M. Udell, C. Horn, R. Zadeh, and S. Boyd. Generalized low rank models. Foundations and Trends in Machine Learning, 9(1):1–118, 2016.
- [VAW06] Jaideep Vaidya, Vijayalakshmi Atluri, and Janice Warner. Roleminer: Mining roles using subset enumeration. In Proceedings of the 13th ACM Conference on Computer and Communications Security, CCS ’06, pages 144–153, New York, NY, USA, 2006. ACM.