Binary Matrix Factorisation and Completion via Integer Programming

06/25/2021 ∙ by Reka A. Kovacs, et al. ∙ University of Oxford cornell university 0

Binary matrix factorisation is an essential tool for identifying discrete patterns in binary data. In this paper we consider the rank-k binary matrix factorisation problem (k-BMF) under Boolean arithmetic: we are given an n x m binary matrix X with possibly missing entries and need to find two binary matrices A and B of dimension n x k and k x m respectively, which minimise the distance between X and the Boolean product of A and B in the squared Frobenius distance. We present a compact and two exponential size integer programs (IPs) for k-BMF and show that the compact IP has a weak LP relaxation, while the exponential size IPs have a stronger equivalent LP relaxation. We introduce a new objective function, which differs from the traditional squared Frobenius objective in attributing a weight to zero entries of the input matrix that is proportional to the number of times the zero is erroneously covered in a rank-k factorisation. For one of the exponential size IPs we describe a computational approach based on column generation. Experimental results on synthetic and real word datasets suggest that our integer programming approach is competitive against available methods for k-BMF and provides accurate low-error factorisations.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction.

For a given binary matrix and a fixed positive integer , the rank- binary matrix factorisation problem (-BMF) is concerned with finding two matrices , such that the product of and is a binary matrix closest to in the squared Frobenius norm. One can define different variants of this problem depending on the underlying arithmetic used when computing the product of the matrices. In this paper we focus on solving -BMF under Boolean arithmetic where the product of the binary matrices and is computed by interpreting s as false and s as true, and using logical disjunction () in place of addition and logical conjunction () in place of multiplication. Observe that Boolean multiplication () coincides with standard multiplication on binary input, hence we adopt the notation in place of in the rest of the paper. We therefore compute the Boolean matrix product of and as:

Note that Boolean matrix multiplication can be equivalently written as using standard arithmetic summation. The problem then becomes computing matrices and whose Boolean product best approximates the input matrix .

Our motivation for this study comes from data science applications where rows of the matrix

correspond to data points and columns correspond to features. In these applications low-rank matrix approximation is an essential tool for dimensionality reduction which helps understand the data better by exposing hidden features. Many practical datasets contain categorical features which can be represented by a binary data matrix using unary encoding. For example, consider a data matrix below (inspired by [Miettinen:2008:DBP:1442800.1442809]), where rows correspond to patients and columns to symptoms, indicating patient presents symptom :

(1)

In this example matrix describes exactly using derived features where the rows of specify how the original features relate to the derived features, and the rows of give the derived features of each patient. In other words, factor matrix reveals that there are 2 underlying diseases that cause the observed symptoms: Disease is causing symptoms 1 and 2, and disease is causing symptoms 2 and 3. Matrix reveals that patient 1 has disease , patient 3 has and patient 2 has both.

We note that it is also possible to use classical methods such as singular value decomposition (SVD)

[SVD] or non-negative matrix factorisation (NMF) [NMF] to obtain low-rank approximations of but the resulting factor matrices or their product would typically not be binary unlike BMF [Miettinen:2006:PKDD]. To demonstrate this we next give the best rank- SVD and NMF approximations of the matrix in (1), respectively:

(2)

Note that neither of these rank-2 approximations provide a clear interpretation. The rank-2 NMF of suggests that symptom 2 presents with lower intensity in both and , an erroneous conclusion (caused by patient 2) that could not have been learned from data which is of “on/off” type.

We note that in addition to healthcare applications, BMF-derived features of data have also been shown to be interpretable in biclustering gene expression datasets [Zhang:2007], role based access control [Lu:2008:OBM:1546682.1547186, Lu:2014:9400730720140101] and market basket data clustering [Li:2005].

1.1 Complexity and related work.

The Boolean rank [Monson:1995, Kim:1982] of a binary matrix is defined to be the smallest integer for which there exist binary matrices and such that . In an equivalent definition, the Boolean rank of is the minimum value of for which it is possible to factor into a Boolean combination of rank- binary matrices

for . Occasionally, the Boolean rank is also referred to as the rectangle cover number, and rank- binary matrices are called rectangle matrices or simply rectangles [Conforti:2014:IP:2765770].

Interpreting as the node-node incidence matrix of a bipartite graph with vertices on the left and vertices on the right, the problem of computing the Boolean rank of is in one-to-one correspondence with finding a minimum edge covering of by complete bipartite subgraphs (bicliques)[Monson:1995]. Since the biclique cover problem is NP-hard [Orlin:1977, Theorem 8.1],[Garey:1979:CIG:578533, Problem GT18], and hard to approximate [Simon:1990OnAS, Chalermsook:2014], computing the Boolean rank is hard as well. Finding an optimal rank- binary factorisation of under Boolean arithmetic has a graphic interpretation of minimizing the number of errors in an approximate covering of by bicliques which are allowed to overlap. In the rank- case the Boolean arithmetic coincides with standard arithmetic and -BMF can be interpreted as computing a maximum weight biclique on the complete bipartite graph whose edges that are in have weight and others weight . The maximum edge biclique problem with edge weights in is NP-hard [Gillis:2018], hence even the computation of a rank- BMF is computationally challenging.

Due to the hardness results, the majority of methods developed for BMF rely on heuristics. The earliest heuristic for BMF, Proximus

[Koyuturk:2002:ATA:645806.670310, Koyuturk:2003:PFA:956750.956770], computes BMF under standard arithmetic using a recursive partitioning idea and computing -BMF at each step. Since Proximus, much research has focused on computing efficient and accurate methods for -BMF. [Shen:2009:MDP:1557019.1557103] proposes an integer program (IP) for -BMF and several relaxations of it, one of which leads to a -approximation, while [Shi:2014:6899417] provides a rounding based -approximation. In [Beckerleg:2020] an extension of the Proximus framework is explored which uses the formulations from [Shen:2009:MDP:1557019.1557103] to compute -BMF at each step. -BMF under Boolean arithmetic is explicitly introduced in [Miettinen:2006:PKDD, Miettinen:2008:DBP:1442800.1442809], along with a heuristic called ASSO, which is based on an association rule-mining approach. ASSO is further improved in [Barahona:2019] into an alternating iterative heuristics. Another approach based on an alternating style heuristic is explored in [Zhang:2007] to solve a non-linear unconstrained formulation of -BMF with penalty terms in the objective for non-binary entries.

In [Lu:2008:OBM:1546682.1547186, Lu:2014:9400730720140101] a series of integer programs for -BMF and exact BMF are introduced. These IPs have exponentially many variables and constraints and require an explicit enumeration of the

possible binary row vectors for factor matrix

. To tackle the exponential explosion of rows considered, a heuristic row generation using association rule mining and subset enumeration is developed. An exact linear IP for -BMF with polynomially many variables and constraints is presented in our previous work [Kovacs:2017]. This model uses McCormick envelopes [McCormick:1976] to linearize the quadratic terms coming from the matrix product. We note that both of these integer programs for -BMF, as well as any other element-wise models can be naturally applied in the context of rank- binary matrix completion by simply setting the objective coefficients corresponding to missing entries to .

1.2 Our contribution.

In this paper, we present a comprehensive study on integer programming methods for -BMF. We examine three integer programs in detail: our compact formulation introduced in [Kovacs:2017], the exponential formulation of [Lu:2008:OBM:1546682.1547186] and a new exponential formulation which we introduced in a preliminary version of this paper in [Kovacs:2020]. We prove several results about the strength of LP-relaxations of the three formulations and their relative comparison. In addition, we show that the new exponential formulation overcomes several limitations of earlier approaches. In particular, it does not suffer from permutation symmetry and it does not rely on heuristically guided pattern mining. Moreover, it has a stronger LP relaxation than that of [Kovacs:2017]. On the other hand, our new formulation has an exponential number of variables which we tackle using a column generation approach that effectively searches over this exponential space without explicit enumeration, unlike the complete enumeration used for the exponential size model of [Lu:2008:OBM:1546682.1547186]. In addition, we introduce a new objective function for -BMF under which the problem becomes computationally easier and we explore the relationship between this new objective function and the original squared Frobenius distance. Finally, we demonstrate that our proposed solution method is able to prove optimality for smaller datasets, while for larger datasets it provides solutions with better accuracy than the state-of-the-art heuristic methods. In addition, the entry-wise modelling of -BMF in our formulations naturally extends to handle matrices with missing entries and perform binary matrix completion, we illustrate this way of application experimentally.

The rest of this paper is organised as follows. In Section 2 we detail the three IP formulations for -BMF and prove several results about their LP-relaxations. In Section 3, we introduce a new objective function and explore its relation to the original squared Frobenius objective. In Section 4 we detail a framework based on the large scale optimisation technique of column generation for the solution of our exponential formulation and discuss heuristics for the arising pricing problems. Finally, in Section 5 we demonstrate the practical applicability of our approach on several artificial and real world datasets.

2 Formulations.

Given a binary matrix and a fixed positive integer we wish to find two binary matrices and so that is minimised, where is the product of and and denotes the Frobenius norm. Let denote the index set of nonzero entries of where . Both and being binary matrices, the squared Frobenius and the entry-wise norm coincide and we may expand the objective function to get a linear expression

(3)

For an incomplete binary matrix with missing entries, the above objective is slightly changed to where , to emphasise that , and the factorisation error is only measured over known entries. In the following sections we present three different integer programs for -BMF all with the above derived linear objective function.

2.1 Compact formulation.

We start with a formulation that uses a polynomial number of variables and constraints where we denote the McCormick envelope [McCormick:1976] of by

(4)

Note that if then only contains the point corresponding to the product of and

. The following Compact Integer linear Program (CIP) models the entries of matrices

directly via binary variables

, and respectively (for ) and uses McCormick envelopes to avoid the appearance of quadratic terms that would correspond to the constraints ,

(5)
s.t. (6)
(7)
(8)

Constraints (6) encode Boolean matrix multiplication, while a simple modification of the model in which constraints (6) are replaced by models -BMF under standard arithmetic. The McCormick envelopes in constraints (7) ensure that for , are binary variables taking the value . Due to the objective function, constraints (6) and the binary nature of , the binary constraints on variables may be relaxed to without altering optimal solutions of the formulation.

The LP relaxation of CIP (CLP) is obtained by replacing constraints (8) by . For , we have and the feasible region of CIP is the Boolean Quadric Polytope (BQP) over a bipartite graph [Padberg:1989:BQP:70486.70495]. The LP relaxation of BQP has half-integral vertices [Padberg:1989:BQP:70486.70495], which implies that CLP for has half-integral vertices as well. One can show that in this case, a simple rounding in which fractional values of CLP are rounded down to gives a -approximation to -BMF [Shi:2014:6899417]. This however, does not apply for . We next show that CLP for has an objective function value . Given a binary matrix , CLP has optimal objective value for . Moreover, for CLP has at least vertices with objective value .

Proof.

Proof. For each let such that and consider the point

For all and , setting implies that and holds for all , hence this point gives a feasible solution to CLP with objective value . For , we can only set for all , hence the above construction leads to a single unique point. For however, as the choice of ’s is arbitrary, there are many feasible points with objective value of this form. As each of these points can differ at only entries corresponding to entries for , , there are at most affinely independent points among them. Next we present affinely independent points of this form. Since the objective value is at these points, they must lie on a face of dimension at least and this face must have at least vertices of CLP with objective value . For each and , letting for all and provides different points of the above form. Each such point has exactly one entry along the indices which is zero. Hence the matrix whose columns correspond to these points has a square submatrix of the form corresponding to entries for , where is the all ones matrix of size and

is the identity matrix of size

. Since matrix is nonsingular, the points are linearly independent. In addition, letting for all gives an additional point for which for all , hence the corresponding part of this point is . Now subtracting from the columns of , we get the nonsingular matrix , hence the above constructed points are affinely independent. ∎

The above result suggests that unless the factorisation error is i.e. the input matrix is of Boolean rank less than or equal to , before improving the LP bound of CIP many fractional vertices need to be cut off. Furthermore, for , any feasible rank- factorisation and a permutation matrix provide another feasible solution with the same objective value. Hence, CIP is highly symmetric for . These properties of CIP make it unlikely to be solved to optimality for in a reasonable amount of time for a large matrix , though some symmetries may be broken by enforcing lexicographic ordering of rows of . For small matrices however, CIP constitutes the first approach to get optimal solutions to -BMF.

2.2 Exponential formulation I.

Any Boolean rank- matrix can be equivalently written as the Boolean combination of rank- binary matrices for some . This suggest to directly look for rank-1 binary matrices instead of introducing variables for all entries of factor matrices and . The second integer program we detail for -BMF relies on this approach by considering an implicit enumeration of rank- binary matrices. Let denote the set of all rank- binary matrices of dimension and let denote the subset of rank- matrices of which have the -th entry equal to ,

(9)
(10)

Introducing a binary variable for each rank- matrix in and variables corresponding to the known entries of the , we obtain the following Master Integer linear Program (MIP),

(11)
s.t. (12)
(13)
(14)
(15)

The objective, as before, measures the factorisation error in squared Frobenius norm, and subscript F in stands for Frobenius. Constraints (12) and (13) enforce Boolean matrix multiplication: takes value if there is at least one active rank-1 binary matrix that covers entry , otherwise it takes value . Notice, that due to the difference in sign of objective coefficients for variables with and it is enough to declare constraints (12) and (13) for indices and respectively. Constraint (14) ensures that at most rank-1 binary matrices are active and hence we get a rank- factorisation of . Observe that constraints (12) together with being binary imply that automatically takes binary values for , and due to the objective function it always takes the value at its upper bound, hence may be replaced by for all without altering the optimum. In contrast, for need to be explicitly declared binary as otherwise, if there are some active rank- matrices () which cover a zero of (, ) then variable corresponding to that zero takes the possibly fractional value . One can also consider a strong formulation of with exponentially many constraints, in which constraints (13) are replaced by for all and .

The LP relaxation of () is obtained by replacing the integrality constraints by . Unlike CLP, the optimal objective value of () is not always zero. By comparing the rank of the factorisation, to the isolation number of the input matrix we can deduce when will take non-zero objective value. We next give an extension of the definition of isolation number for binary matrices presented in [Monson:1995, Section 2.3]. Let be a binary matrix with possibly missing entries. A set is said to be an isolated set of ones if whenever are two distinct members of then , and or or both. The size of the largest cardinality isolated set of ones of is denoted by and is called the isolation number of . From the definition it follows that members of an isolated set of ones cannot be covered by a common rank-1 submatrix, and hence the isolation number provides a lower bound on the Boolean rank. The following result shows that must have non-zero objective value whenever , the rank of the factorisation, is chosen so that it is strictly smaller than the isolation number. Let have isolation number , then .

Proof.
Proof. Let be an isolated set of ones of of cardinality . We will establish a feasible solution to the dual of () with objective value implying the result. Let us apply a change of variables for for the ease of avoiding the constant term in the objective function of . Then the bound constraints of can be written as for , for and , as the objective function is minimising both and and we have the cardinality constrains on . Associating dual variables with constraints , with constraints (13) and with constraint (14), the Master Dual Program () of is
(16) s.t. (17) (18) (19) (20)
where .
Let for and let for and for all other . The bound constraints on and are satisfied then. It remains to choose such that we satisfy constraint (17) for all rank- binary matrices . Let be a submatrix of , so we have . Then by the definition of isolated sets of ones, can contain at most one element from and hence we have . This tells us that for any , constraint (17) is satisfied for all that is a submatrix of . Now let be a rank- binary matrix which covers at least one zero entry of . Then may contain more than one element from . However, if it contains more than one element from then it must also contain at least -many zeros as for any two distinct elements in we have or by the definition of isolated set of ones. Hence, for all such that , constraint (17) satisfies
(21)
Thus we can set to get the objective value , which provides a non-zero bound on for all . ∎
The following example shows that we cannot strengthen Proposition 2 by replacing the condition with the requirement that has to be strictly smaller than the Boolean rank of . Let , where is the matrix of all s and is the identity matrix. One can verify that the Boolean rank of is and its isolation number is . For , the optimal objective value of is which is attained by a fractional solution in which the following rank- binary matrices are active with weight .

2.3 Exponential formulation II.

For let be the vector denoting the binary encoding of and note that these vectors give a complete enumeration of all non-zero binary vectors of size . Let denote the -th entry of . In [Lu:2008:OBM:1546682.1547186], the authors present the following Exponential size Integer linear Program (EIP) formulation using a separate indicator variable for each one of these exponentially many binary vectors ,

(22)
s.t. (23)
(24)
(25)
(26)
(27)

The above formulation has an exponential number of variables and constraints but it is an integer linear program as are input parameters to the model. Let ELP be the LP relaxation of EIP. Observe that due to the objective function the bound constraints in ELP may be simplified to for all and for without changing the optimum. To solve EIP or ELP explicitly, one needs to enumerate all binary vectors , which is possible only up to a very limited size. To the best of our knowledge, no method is available that avoids explicit enumeration and can guarantee the optimal solution of EIP. Previous attempts at computing a rank- factorisation via EIP all relied on working with only a small heuristically chosen subset of vectors [Lu:2008:OBM:1546682.1547186, Lu:2014:9400730720140101]. However, if there was an efficient method to solve ELP, the following result shows it to be as strong as the LP relaxation of . The optimal objective values of ELP and are equal.

Proof.

Proof. Note that due to constraints (12) and (13) in and constraints (23) and (24) in ELP, it suffices to show that for any feasible solution of ELP one can build a feasible solution of for which , and vice-versa.

First consider a feasible solution (for ) to ELP and note that by constraint (26) we have for all and . We can therefore express each as a convex combination of binary vectors in scaled by ,

(28)

where denotes the binary encoding of . Note that we do not require ’s to add up to 1 as we exclude the zero vector. We can therefore rewrite the solution of ELP as follows

(29)

Now it is easy to see that and since holds in any feasible solution to ELP, we get , which shows that is feasible for .

The construction works backwards as well, as any feasible solution to can be written as
for some rank- binary matrices and corresponding variables . Now let and to satisfy . Then since we started from a feasible solution to , we have and hence is satisfied too. ∎

3 Working under a new objective

In the previous section, we presented formulations for -BMF which measured the factorisation error in the squared Frobenius norm, which coincides with the entry-wise norm as showed in Equation (3). In this section, we explore another objective function which introduces an asymmetry between how false negatives and false positives are treated. Whenever a entry is erroneously covered in a rank- factorisation, it may be covered by up to rank-1 binary matrices. Our new objective function attributes an error term to each entry which is proportional to the number of rank-1 matrices covering that entry. As previously, by denoting a rank- factorisation of , the new objective function is

(30)

Note that the constraints encoding Boolean matrix multiplication imply that . Therefore, denoting the original squared Frobenius norm objective function in Equation (3) by , for any and rank- factorisation of the following relationship holds between and , ,

(31)

We next show that this new objective function with can overestimate the original objective by a factor of . But first, we need a technical result which shows that whenever the input matrix contains repeated rows or columns we may assume that an optimal factorisation exists which has the same row-column repetition pattern. [Preprocessing] Let contain some duplicate rows and columns. Then there exists an optimal rank- binary matrix factorisation of under objective (or ) whose rows and columns corresponding to identical copies in are identical.

Proof.

Proof. Since the transpose of an optimal rank- factorisation is optimal for , it suffices to consider the rows of . Furthermore, it suffices to consider only one set of repeated rows of , so let be the index set of a set of identical rows of . We then need to show that there exists an optimal rank- factorisation whose rows indexed by are identical. Let be an optimal rank- factorisation of under objective . For all we must have

(32)

as otherwise replacing for each with row where is a row index for which the above sum is minimised leads to a smaller error factorisation. Then since (32) holds, replacing for each with row for any leads to an optimal solution of the desired property. Similarly, if is an optimal factorisation under objective , then for all the corresponding objective terms must equal and hence an optimal solution of the desired property exists. ∎

This result implies that whenever the input matrix contains repeated rows or columns we may solve the following problem on a smaller matrix instead. Let be the binary matrix obtained from by replacing each duplicate row and column by a single representative and let and be the counts of each unique row and column of in respectively. Let and denote the non-zero and zero entry index sets of respectively. By Lemma 3 an optimal rank- factorisation of under the updated objective function

(33)

(or ) leads to an optimal rank- factorisation of under the original objective function (or ). For each positive integer there exists a matrix for which the optimal rank- binary matrix factorisations under objectives and satisfy .

Proof.

Proof. The idea behind the proof is to consider a matrix of exact Boolean rank- in which all the rank- components (rectangles) overlap at a unique middle entry and then replace this entry with a to obtain . Now and are exactly at distance in the squared Frobenius norm and hence is a rank- factorisation of with objective value . On the other hand, since exactly rectangles cover the entry at which and differ, if is taken as a rank- factorisation of under objective it incurs an error of size . Figure 1 shows the idea how to build such a for . Each colour corresponds to a rank-1 component and white areas correspond to s.

(a)
(b)
</