We define a new clustering problem which encompasses a number of well-studied problems about low-rank approximation of binary matrices and clustering of binary vectors.
In order to obtain approximation algorithms for the low-rank approximation problems, we design approximation algorithms for a “constrained” version of binary clustering.
A $k$-ary relation is a set of binary $k$-tuples, that is, tuples with elements from $\{0,1\}$. A $k$-tuple $t$ satisfies a $k$-ary relation $R$, and we write $t \in R$, if $t$ is equal to one of the $k$-tuples from $R$.
Definition 1 (Vectors satisfying $\mathcal{R}$).
Let $\mathcal{R}=\langle R_1,\dots,R_d\rangle$ be a set of $k$-ary relations. We say that a set $C=\{c_1,\dots,c_k\}$ of binary $d$-dimensional vectors satisfies $\mathcal{R}$, and write $C \models \mathcal{R}$, if $(c_1[i],\dots,c_k[i]) \in R_i$ for all $i \in \{1,\dots,d\}$.
For example, for $k=2$, $d=2$, $R_1=\{(0,0),(1,1)\}$, and $R_2=\{(0,1)\}$, the set of vectors $c_1=(0,0)$ and $c_2=(0,1)$
satisfies $\mathcal{R}=\langle R_1,R_2\rangle$ because $(c_1[1],c_2[1])=(0,0)\in R_1$ and $(c_1[2],c_2[2])=(0,1)\in R_2$.
Let us recall that the Hamming distance between two vectors $x,y\in\{0,1\}^d$, where $x=(x_1,\dots,x_d)$ and $y=(y_1,\dots,y_d)$, is $\operatorname{dist}_H(x,y)=\sum_{i=1}^{d}|x_i-y_i|$ or, in words, the number of positions where $x$ and $y$ differ. For a set of vectors $C$ and a vector $x$, we define $\operatorname{dist}_H(x,C)$, the Hamming distance between $x$ and $C$, as the minimum Hamming distance between $x$ and a vector from $C$. Thus $\operatorname{dist}_H(x,C)=\min_{c\in C}\operatorname{dist}_H(x,c)$.
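These two distances are straightforward to compute; here is a minimal Python sketch (the function names are ours, not from any library):

```python
def dist_h(x, y):
    """Hamming distance: the number of positions where x and y differ."""
    assert len(x) == len(y)
    return sum(xi != yi for xi, yi in zip(x, y))

def dist_h_set(x, C):
    """Hamming distance from a vector x to a set C of vectors:
    the minimum over all c in C of dist_h(x, c)."""
    return min(dist_h(x, c) for c in C)
```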
Then we define the following problem.
Binary Constrained Clustering Input: A set $X\subseteq\{0,1\}^d$ of $n$ vectors, a positive integer $k$, and a set $\mathcal{R}=\langle R_1,\dots,R_d\rangle$ of $k$-ary relations. Task: Among all vector sets $C=\{c_1,\dots,c_k\}\subseteq\{0,1\}^d$ satisfying $\mathcal{R}$, find a set $C$ minimizing the sum $\sum_{x\in X}\operatorname{dist}_H(x,C)$.
First we prove the following theorem.
There is a deterministic algorithm which, given an instance of Binary Constrained Clustering and $\varepsilon>0$, runs in time polynomial in the input size for every fixed $k$ and $\varepsilon$, and outputs a $(1+\varepsilon)$-approximate solution.
Our main result is the following theorem.
There is an algorithm which, for a given instance of Binary Constrained Clustering and $\varepsilon>0$, in time $f(k,\varepsilon)\cdot nd$, for some computable function $f$, outputs a $(1+\varepsilon)$-approximate solution with constant probability.
In other words, the algorithm outputs a set $C$ of $k$ vectors satisfying $\mathcal{R}$ such that $\sum_{x\in X}\operatorname{dist}_H(x,C)\le(1+\varepsilon)\cdot\mathrm{OPT}$, where $\mathrm{OPT}$ is the value of an optimal solution.
1.1 Applications of the main theorem
Binary matrix factorization is the following problem. Given a binary $m\times n$ matrix $A$, that is, a matrix with entries from the domain $\{0,1\}$,
the task is to find a “simple” binary matrix $B$ which approximates $A$ subject to some specified constraints. One of the most widely studied error measures is the Frobenius norm, which for a matrix $A$ is defined as
$$\|A\|_F=\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^2}.$$
Here the sums are taken over $\mathbb{R}$. Then we want to find a matrix $B$ with certain properties such that $\|A-B\|_F^2$ is minimized.
In particular, two variants of the problem have been studied in the literature. In the first variant, one seeks a matrix $B$ of small GF(2)-rank. In the second variant, the matrix $B$ should be of small Boolean rank. Depending on the selection of the rank, we obtain two different optimization problems.
Low GF(2)-Rank Approximation.
Here the task is to approximate a given binary matrix $A$ by a matrix $B$ that has GF(2)-rank $r$.
Low GF(2)-Rank Approximation Input: An $m\times n$ matrix $A$ over GF(2) and a positive integer $r$. Task: Find a binary $m\times n$ matrix $B$ with GF(2)-rank at most $r$ such that $\|A-B\|_F^2$ is minimum.
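For intuition, the GF(2)-rank of a binary matrix can be computed by Gaussian elimination where row addition is bitwise XOR, and the objective value is simply the number of entries where the two matrices disagree. A sketch with our own helper names, rows given as 0/1 lists:

```python
def gf2_rank(rows):
    """GF(2)-rank of a binary matrix via Gaussian elimination;
    each row is packed into an integer so row addition is XOR."""
    masks = [int("".join(map(str, r)), 2) for r in rows]
    rank = 0
    while masks:
        pivot = max(masks)          # row with the highest leading bit
        masks.remove(pivot)
        if pivot == 0:
            break
        rank += 1
        hi = pivot.bit_length() - 1
        # eliminate the pivot's leading bit from all rows that share it
        masks = [m ^ pivot if (m >> hi) & 1 else m for m in masks]
    return rank

def frobenius_error(A, B):
    """||A - B||_F^2; for 0/1 matrices this is just the number of
    entries where A and B differ."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))
```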
Low Boolean-Rank Approximation.
Let $A$ be a binary $m\times n$ matrix. Now we consider the elements of $A$ to be Boolean variables. The Boolean rank of $A$ is the minimum $r$ such that $A=U\wedge V$ for a Boolean $m\times r$ matrix $U$ and a Boolean $r\times n$ matrix $V$, where the product is Boolean, that is, the logical $\wedge$ plays the role of multiplication and $\vee$ the role of sum. Here $0\wedge 0=0$, $0\wedge 1=0$, $1\wedge 1=1$, $0\vee 0=0$, $0\vee 1=1$, and $1\vee 1=1$. Thus the matrix product is over the Boolean semi-ring $(\{0,1\},\wedge,\vee)$. This can be equivalently expressed as the normal matrix product with addition defined as $1+1=1$. Binary matrices equipped with such algebra are called Boolean matrices.
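A sketch of the Boolean product (illustrative only); note how it differs from the GF(2) product, where $1+1=0$ instead of $1\vee 1=1$:

```python
def boolean_product(U, V):
    """Boolean matrix product over ({0,1}, AND, OR): entry (i, j) is
    the OR over t of U[i][t] AND V[t][j]."""
    r, n = len(V), len(V[0])
    return [[int(any(U[i][t] and V[t][j] for t in range(r)))
             for j in range(n)]
            for i in range(len(U))]
```

For example, with $U=\begin{pmatrix}1&0\\1&1\end{pmatrix}$ and $V=\begin{pmatrix}1&1&0\\0&1&1\end{pmatrix}$ the Boolean product has second row $(1,1,1)$, while over GF(2) it would be $(1,0,1)$, since the middle entry becomes $1+1=0$.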
Low Boolean-Rank Approximation Input: A Boolean $m\times n$ matrix $A$ and a positive integer $r$. Task: Find a Boolean $m\times n$ matrix $B$ of Boolean rank at most $r$ such that $\|A-B\|_F^2$ is minimum.
Low-rank matrix approximation problems can be also treated as special cases of Binary Constrained Clustering.
For any instance $I$ of Low GF(2)-Rank Approximation, one can construct in time linear in the input size (for fixed $r$) an instance $I'$ of Binary Constrained Clustering with the following property: given any $\alpha$-approximate solution of $I'$, an $\alpha$-approximate solution of $I$ can be constructed in linear time, and vice versa.
Observe that if the GF(2)-rank of $B$ is at most $r$, then $B$ has at most $2^r$ pairwise distinct columns, because each column is a linear combination of at most $r$ vectors of a basis of the column space of $B$. Also the task of Low GF(2)-Rank Approximation can equivalently be stated as follows: find $k=2^r$ vectors $c_1,\dots,c_k$ over GF(2) such that
$$\sum_{i=1}^{n}\min_{1\le j\le k}\operatorname{dist}_H(a_i,c_j)$$
is minimum, where $a_1,\dots,a_n$ are the columns of $A$. Respectively, to encode an instance of Low GF(2)-Rank Approximation as an instance of Binary Constrained Clustering, we construct the following relation $R$. Set $k=2^r$. Let $(x_1,\dots,x_k)$ be the $k$-tuple composed of all pairwise distinct vectors of $\{0,1\}^r$. Thus each element $x_j$ is a binary $r$-vector. We define $R=\{(\langle y,x_1\rangle,\dots,\langle y,x_k\rangle)\;:\;y\in\{0,1\}^r\}$, where $\langle\cdot,\cdot\rangle$ denotes the inner product over GF(2). Thus $R$ consists of $2^r$ $k$-tuples, and every $k$-tuple in $R$ is a row of the $2^r\times 2^r$ matrix $(\langle y,x_j\rangle)_{y\in\{0,1\}^r,\,1\le j\le k}$. Now we define $X$ to be the set of columns of $A$ and, for each $i\in\{1,\dots,m\}$, $R_i=R$. Note that since all $R_i$ are equal, we can construct and keep just one copy of $R$.
One can show that if $B$ is a solution to $I$, then the set of all linear combinations of a basis of the column vectors of $B$ is a solution to $I'$, and its cost is at most the cost of $B$. Similarly, if $C$ is a solution to $I'$, then a solution $B$ to $I$ is constructed from $C$ by taking the $i$-th column of $B$ to be equal to the vector in $C$ which is closest to the $i$-th column vector of $A$. Clearly, the cost of $B$ is at most the cost of $C$. It is easy to see that given $B$, one can construct $C$ in linear time, and vice versa. ∎
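The relation $R$ from the proof can be generated directly; the following sketch assumes the pairwise distinct vectors of $\{0,1\}^r$ are enumerated in a fixed (lexicographic) order:

```python
from itertools import product

def gf2_relation(r):
    """The k-ary relation (k = 2**r) encoding GF(2) linear combinations:
    every y in {0,1}^r contributes the k-tuple of inner products
    <y, x> over GF(2), one entry per vector x in {0,1}^r."""
    xs = list(product([0, 1], repeat=r))   # the k pairwise distinct vectors
    return {tuple(sum(yi * xi for yi, xi in zip(y, x)) % 2 for x in xs)
            for y in product([0, 1], repeat=r)}
```

Since the tuples of $R$ are exactly the GF(2)-linear functions on $r$ bits, the relation is closed under coordinatewise XOR and contains the all-zero tuple.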
Similarly for Low Boolean-Rank Approximation we have the following lemma.
For any instance $I$ of Low Boolean-Rank Approximation, one can construct in time linear in the input size (for fixed $r$) an instance $I'$ of Binary Constrained Clustering with the following property: given any $\alpha$-approximate solution of $I'$, an $\alpha$-approximate solution of $I$ can be constructed in linear time, and vice versa.
The proof essentially repeats the proof of Lemma 1. We are now working with the Boolean semi-ring, but still we can use exactly the same trick to reduce Low Boolean-Rank Approximation to Binary Constrained Clustering. The only difference is that GF(2) summation and product are replaced by $\vee$ and $\wedge$, respectively, in the definition of the relation $R$. Thus every $k$-tuple in $R$ is again a row of the corresponding $2^r\times 2^r$ matrix.
Hence, to design approximation schemes for Low Boolean-Rank Approximation and Low GF(2)-Rank Approximation, it suffices to give an approximation scheme for Binary Constrained Clustering. The main technical contribution of the paper is the proof of the following theorem.
For $\alpha>0$, we say that an algorithm is an $\alpha$-approximation algorithm for a low-rank approximation problem if for a matrix $A$ and an integer $r$ it outputs a matrix $B$ satisfying the required constraints such that $\|A-B\|_F^2\le\alpha\cdot\mathrm{OPT}$, where $\mathrm{OPT}$ is the minimum value of $\|A-B'\|_F^2$ over all feasible matrices $B'$. By Theorems 1 and 2 and Lemmata 1 and 2, we obtain the following.
There is a deterministic algorithm which, for a given instance of Low Boolean-Rank Approximation (Low GF(2)-Rank Approximation) and $\varepsilon>0$, in time polynomial in the input size for every fixed $r$ and $\varepsilon$ outputs a $(1+\varepsilon)$-approximate solution. There is an algorithm which, for a given instance of Low Boolean-Rank Approximation (Low GF(2)-Rank Approximation) and $\varepsilon>0$, in time $f(r,\varepsilon)\cdot nm$, for some computable function $f$, outputs a $(1+\varepsilon)$-approximate solution with constant probability.
Let us observe that our results also yield a randomized approximation scheme for the “dual” maximization versions of the low-rank matrix approximation problems. In these problems one wants to maximize the number of elements that are the same in $A$ and $B$ or, in other words, to maximize the value of $mn-\|A-B\|_F^2$. It is easy to see that for every binary $m\times n$ matrix $A$ there is a binary matrix $B$ with GF(2)-rank at most $r$ such that $\|A-B\|_F^2\le mn/2$. This observation implies that an approximation scheme for the minimization version also yields an approximation scheme for the maximization version.
1.2 Binary clustering and variants
The special case of Binary Constrained Clustering where no constraints are imposed on the centers of the clusters is Binary $k$-Means.
Binary $k$-Means Input: A set $X\subseteq\{0,1\}^d$ of $n$ vectors and a positive integer $k$. Task: Find a set $C\subseteq\{0,1\}^d$ of $k$ vectors minimizing the sum $\sum_{x\in X}\operatorname{dist}_H(x,C)$.
Equivalently, in Binary $k$-Means we seek to partition a set of binary vectors $X$ into $k$ clusters $X_1,\dots,X_k$ such that, after we assign to each cluster $X_i$ its mean $c_i$, which is a binary vector (not necessarily from $X$) closest to $X_i$, the sum $\sum_{i=1}^{k}\sum_{x\in X_i}\operatorname{dist}_H(x,c_i)$ is minimum.
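Note that once a cluster is fixed, its optimal binary mean is easy to compute: the coordinatewise majority bit minimizes the sum of Hamming distances, since each coordinate can be optimized independently. A sketch with illustrative names:

```python
def cluster_mean(Xi):
    """Optimal binary center of a cluster: the majority bit in each
    coordinate (ties can be broken arbitrarily without changing the cost)."""
    n = len(Xi)
    return tuple(int(2 * sum(x[j] for x in Xi) > n) for j in range(len(Xi[0])))

def cluster_cost(Xi, c):
    """Sum of Hamming distances from the cluster's vectors to the center c."""
    return sum(sum(xj != cj for xj, cj in zip(x, c)) for x in Xi)
```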
Of course, Binary Constrained Clustering generalizes Binary $k$-Means: for a given instance $I$ of Binary $k$-Means, by taking the sets $R_i$, $i\in\{1,\dots,d\}$, consisting of all possible $k$-tuples over $\{0,1\}$, we construct in linear time an instance $I'$ of Binary Constrained Clustering equivalent to $I$. Note that since all the sets $R_i$ are the same, it is sufficient to keep just one copy of the set for the instance $I'$. That is, any $\alpha$-approximation to one instance is also an $\alpha$-approximation to the other. Theorems 1 and 2 imply the following.
There is a deterministic algorithm which, for a given instance of Binary $k$-Means and $\varepsilon>0$, in time polynomial in the input size for every fixed $k$ and $\varepsilon$ outputs a $(1+\varepsilon)$-approximate solution. There is an algorithm which, for a given instance of Binary $k$-Means and $\varepsilon>0$, in time $f(k,\varepsilon)\cdot nd$, for some computable function $f$, outputs a $(1+\varepsilon)$-approximate solution with constant probability.
For example, the following generalization of binary clustering can be formulated as Binary Constrained Clustering. Here the centers of the clusters are linear subspaces of bounded dimension $r$. (For $r=0$ this is essentially Binary $k$-Means and for $k=1$ this is Low GF(2)-Rank Approximation.) More precisely, in Binary Projective Clustering we are given a set $X\subseteq\{0,1\}^d$ of $n$ vectors and positive integers $k$ and $r$. The task is to find a family $\mathcal{C}=\{C_1,\dots,C_k\}$ of $r$-dimensional linear subspaces over GF(2) minimizing the sum
$$\sum_{x\in X}\min_{1\le i\le k}\operatorname{dist}_H(x,C_i).$$
To see that Binary Projective Clustering is a special case of Binary Constrained Clustering, we observe that the condition that each $C_i$ is an $r$-dimensional subspace over GF(2) can be encoded (as in Lemma 1) by constraints. Similar arguments hold also for the variant of Binary Projective Clustering where instead of $r$-dimensional subspaces we use $r$-flats ($r$-dimensional affine subspaces).
In Correlative $k$-Bicluster Editing, we are given a bipartite graph and the task is to change the minimum number of adjacencies such that the resulting graph is the disjoint union of at most $k$ complete bipartite graphs. This is the special case of Binary Constrained Clustering where each relation consists of $k$-tuples, and each of the $k$-tuples contains exactly one element $1$ and all other elements $0$. Another problem which can be reduced to Binary Constrained Clustering is the following variant of the Biclustering problem. Here, for an $m\times n$ matrix $A$ and positive integers $k$ and $\ell$, we want to find a binary $m\times n$ matrix $B$ that has at most $k$ pairwise distinct rows and at most $\ell$ pairwise distinct columns, and such that $\|A-B\|_F^2$ is minimum.
1.3 Previous work
Low-rank binary matrix approximation
Low-rank matrix approximation is a fundamental and extremely well-studied problem. When the measure of the similarity between $A$ and $B$ is the Frobenius norm of the matrix $A-B$, the rank-$r$ approximation (for any $r$) of a matrix $A$
can be efficiently found via the singular value decomposition (SVD). This is an extremely well-studied problem and we refer to surveys and books [23, 28, 40] for an overview of this topic. However, SVD is not guaranteed to find an optimal solution when additional structural constraints on the low-rank approximation matrix (like being non-negative or binary) are imposed. In fact, most of the variants of low-rank approximation with additional constraints are NP-hard.
For a long time the predominant approaches for solving such low-rank approximation problems with NP-hard constraints were either heuristic methods based on convex relaxations or optimization methods. Recently, there has been considerable interest in the rigorous analysis of such problems [4, 11, 33, 36].
Low GF(2)-Rank Approximation
arises naturally in applications involving binary data sets and serves as an important tool in dimension reduction for high-dimensional data sets with binary attributes; see [12, 22, 19, 25, 35, 37, 42] for further references. Due to the numerous applications of low-rank binary matrix approximation, various heuristic algorithms for these problems can be found in the literature [21, 22, 15, 25, 37].
When it concerns a rigorous analysis of algorithms for Low GF(2)-Rank Approximation, the previous results include the following. Gillis and Vavasis and Dan et al. have shown that Low GF(2)-Rank Approximation is NP-complete for every $r\ge 1$. A subset of the authors studied parameterized algorithms for Low GF(2)-Rank Approximation.
The rank-one problem was formulated as an Integer Linear Program whose relaxation gives a 2-approximation. The authors also observed that the efficiency of their algorithm can be improved by reducing the linear program to the Max-Flow problem. Jiang et al. found a much simpler algorithm by observing that for the rank-one case, simply selecting the best column of the input matrix yields a 2-approximation. Bringmann et al. developed a 2-approximation algorithm for the rank-one case which runs in sublinear time. Thus even for the special case $r=1$ no polynomial time approximation scheme was known prior to our work.
For rank $r>1$, Dan et al. have shown that a constant-factor approximate solution can be formed from $r$ columns of the input matrix $A$. Hence, by trying all possible $r$-tuples of columns of $A$, such an approximation can be obtained in time $n^{\mathcal{O}(r)}$. Even the existence of a linear time constant-factor approximation algorithm for this problem was open.
Low Boolean-Rank Approximation
in the case $r=1$ coincides with Low GF(2)-Rank Approximation. Thus by the results of Gillis and Vavasis and Dan et al., Low Boolean-Rank Approximation is NP-complete already for $r=1$. While computing the GF(2)-rank (or rank over any other field) of a matrix can be performed in polynomial time, deciding whether the Boolean rank of a given matrix is at most $r$ is already an NP-complete problem. This follows from the well-known relation between the Boolean rank and covering the edges of a bipartite graph by bicliques. Thus for fixed $r$, the problem is solvable in time $2^{2^{\mathcal{O}(r)}}\cdot(mn)^{\mathcal{O}(1)}$ [17, 14] and, unless the Exponential Time Hypothesis (ETH) fails, it cannot be solved in time $2^{2^{o(r)}}\cdot(mn)^{\mathcal{O}(1)}$.
There is a large body of work on Low Boolean-Rank Approximation, especially in the data mining and knowledge discovery communities; see e.g. [7, 8, 12, 27, 29, 30, 38]. In data mining, matrix decompositions are often used to produce concise representations of data. Since much of the real data, such as word-document data, is binary or even Boolean in nature, Boolean low-rank approximation can provide a deeper insight into the semantics associated with the original matrix. In the literature the problem appears under different names, like the Discrete Basis Problem or the Minimal Noise Role Mining Problem [39, 27, 31].
Since for $r=1$ Low Boolean-Rank Approximation is equivalent to Low GF(2)-Rank Approximation, the 2-approximation algorithm for Low GF(2)-Rank Approximation in the case of $r=1$ is also a 2-approximation algorithm for Low Boolean-Rank Approximation. For rank $r>1$, Dan et al. described a procedure which produces an approximate solution to the problem with a factor depending on $r$.
Let us note that, independently, Ban et al. obtained a very similar algorithmic result for low-rank binary approximation. Moreover, they also obtained a running time lower bound under the Small Set Expansion Hypothesis and the Exponential Time Hypothesis. The technique and approach of Ban et al. for obtaining their algorithmic result are, at first glance, similar to ours.
Binary $k$-Means was introduced by Kleinberg, Papadimitriou, and Raghavan as one of the examples of segmentation problems. Ostrovsky and Rabani gave a randomized PTAS for Binary $k$-Means. In other words, they showed that for any $\varepsilon>0$ there is an algorithm finding a $(1+\varepsilon)$-approximate solution with constant probability. The running time of the algorithm of Ostrovsky and Rabani is $n^{f(k,\varepsilon)}$ for some function $f$. No Efficient Polynomial Time Approximation Scheme (EPTAS), i.e., one of running time $f(k,\varepsilon)\cdot n^{\mathcal{O}(1)}$, for this problem was known prior to our work.
For the dual maximization problem, where one wants to maximize the number of coinciding entries, a significantly faster approximation is known. Alon and Sudakov gave a randomized EPTAS. For fixed $\varepsilon$ and $k$, the running time of the approximation algorithm of Alon and Sudakov is linear in the input length.
Binary $k$-Means can be seen as a discrete variant of the well-known $k$-Means Clustering
problem. This problem has been studied thoroughly, particularly in the areas of computational geometry and machine learning. We refer to [1, 5, 26] for further references to the works on $k$-Means Clustering. In particular, the ideas from the algorithm for $k$-Means Clustering of Kumar et al. form the starting point of our algorithm for Binary Constrained Clustering.
1.4 Our approach.
Sampling lemma and deterministic algorithm
Our algorithms are based on a Sampling Lemma (Lemma 3). Suppose we have a relation $R\subseteq\{0,1\}^k$ and a weight tuple $w=(w_1,\dots,w_k)$, where $w_i\ge 0$ for all $i$. Then the Sampling Lemma says that for any $\varepsilon>0$ there is a constant $c$ such that, for any tuple $p=(p_1,\dots,p_k)$ of probabilities, $c$ random samples from the Bernoulli distribution with parameter $p_i$ for each $i$ give a good estimate of the minimum weighted distance of $p$ from the tuples in $R$. For more details we refer to Lemma 3.
Here we explain how our Sampling Lemma works to design a PTAS. Let $I$ be an instance of Binary Constrained Clustering and let $C^{\star}=\{c_1,\dots,c_k\}$ be an optimum solution to $I$. Let $\{X_1,\dots,X_k\}$ be a partition of $X$ such that $X_i$ consists of the vectors of $X$ closest to $c_i$. Informally, using the Sampling Lemma, we prove that there is a constant $c$ such that, given the sizes $|X_1|,\dots,|X_k|$ and $c$ vectors chosen uniformly at random with repetition from $X_i$ for all $i$, we can compute in linear time a $(1+\varepsilon)$-approximate solution to $I$ (see the proof of Lemma 4). This immediately implies a PTAS for the problem.
Linear time algorithm (Theorem 2)
The general idea of our algorithm for Binary Constrained Clustering is inspired by the algorithm of Kumar et al. Very informally, the algorithm of Kumar et al. for $k$-Means Clustering is based on repeated sampling and does the following. For any (optimum) solution there is a cluster whose size is at least a $1/k$ fraction of the number of input vectors. Then, when we sample a constant number of vectors from the input uniformly at random, with a good probability the sampled vectors will be from the largest cluster. Moreover, if we sample sufficiently many (but still a constant number of) vectors, then with a good probability they not only belong to the largest cluster, but taking the mean of the sample as the center of the whole cluster in the solution, we obtain a vector “close to optimum”. This procedure succeeds if the size of the largest cluster is a large fraction of the number of vectors we used to sample from. The main idea behind the algorithm of Kumar et al. is then to assign vectors at a small distance from the guessed center vectors to their clusters. Moreover, once some vectors are assigned to clusters, the next largest cluster will be a constant (depending on $k$ and $\varepsilon$) fraction of the size of the set of yet unassigned vectors. With the right choice of parameters, it is possible to show that with a good probability this procedure produces a good approximation to the optimum solution.
On a very superficial level we want to implement a similar procedure: iteratively sample, identify centers from samples, assign some of the unassigned vectors to the centers, then sample again, identify centers, etc. Unfortunately, it is not that simple. The main difficulty is that in Binary Constrained Clustering, even though we can guess vectors from the largest cluster, we cannot simply select a center vector for this cluster, because the centers of “future” clusters should satisfy the constraints from $\mathcal{R}$; the selection of one center could influence the “future” in a very bad way. Since we cannot select a good center, we cannot assign vectors to the cluster, and thus we cannot guarantee that sampling will find the next cluster. The whole procedure just falls apart!
Surprisingly, the sampling idea still works for Binary Constrained Clustering, but we have to be more careful. The main idea behind our approach is that if we sample all “big” clusters simultaneously and assign the centers to these clusters such that the assigned centers “partially” satisfy $\mathcal{R}$, then with a good probability this choice does not mess up the solution much. After sampling vectors from all big clusters, we are left with two jobs: (i) find centers for the clusters sampled simultaneously, which will form a subset of our final solution, and (ii) guarantee that these centers can be completed to a good solution satisfying the relations. Condition (i) is guaranteed by our new Sampling Lemma (Lemma 5). Towards maintaining condition (ii), we prove that even after finding “approximately close” centers for the big clusters, there exist centers for the small clusters which, together with the already found centers, form a good approximate solution (i.e., they obey the relations and their cost is small). As soon as we succeed in finding, with a good probability, a subset of “good” center vectors, we assign some of the remaining input vectors to the clusters around the centers. Then we can proceed iteratively.
Now we briefly explain how we obtain a linear running time. We have mentioned that in each iteration, after finding some center vectors, the remaining input vectors which are close to the already found centers can be safely assigned to the clusters of these centers. In fact, we show that if the number of such vectors (vectors close to the already found centers) is at most half of the remaining input vectors, then there exists at least one cluster (whose center is yet to be computed) which contains a constant fraction of the remaining set of vectors. Otherwise, at least half of the remaining vectors can be assigned to the already found centers. This leads to a recurrence of the form $T(n)\le T((1-\delta)n)+\mathcal{O}(nd)$, where $\delta>0$ is a constant depending on $k$ and $\varepsilon$, provided we can find approximate cluster centers from the samples of large clusters in linear time. This recurrence solves to $T(n)=f(k,\varepsilon)\cdot nd$ for some function $f$. We also need to make sure that we can compute cluster centers from the samples. In the case of designing a PTAS, we have already explained that we can compute approximate cluster centers using samples if we know the sizes of those clusters. In fact, we show that if the sizes of the large clusters are comparable and we know them approximately, then we can compute approximate cluster centers in linear time (see Lemma 5).
In Section 2, we give notation, definitions and some known results which we use throughout the paper. In Section 3 we give notation related to Binary Constrained Clustering. In Section 4, we prove the important Sampling Lemma, which we use to design both the PTAS and the linear time randomized approximation scheme for Binary Constrained Clustering. Then, in Section 5, we show how the Sampling Lemma can be used to obtain a (deterministic) PTAS for the problem. The subsequent sections build towards obtaining a linear time approximation scheme for the problem.
For a positive integer $n$, we use $[n]$ as a shorthand for the set $\{1,\dots,n\}$. For a set $U$ and a non-negative integer $i$, $2^U$ and $\binom{U}{i}$ denote the set of subsets of $U$ and the set of $i$-sized subsets of $U$, respectively. For a tuple $t=(t_1,\dots,t_d)$ and an index $i\in[d]$, we use $t[i]$ to denote the $i$-th entry of $t$, i.e., $t[i]=t_i$. We use $\log$ to denote the logarithm with base $2$.
In the course of our algorithm, we will be constructing a solution iteratively. When we find a set of vectors $C'=\{c_1,\dots,c_{k'}\}$, $k'\le k$, which will be a part of the solution, these vectors should satisfy the relations $\mathcal{R}$. Thus we have to guarantee that for some index set $K\subseteq[k]$ of size $k'$, the set of vectors $C'$ satisfies the part of $\mathcal{R}$ “projected” on $K$. More precisely,
Definition 2 (Projection of $R$ on $K$).
Let $R$ be a $k$-ary relation and let $K=\{i_1,\dots,i_{k'}\}$, $i_1<\dots<i_{k'}$, be a subset of indices, where $k'\le k$. We say that a $k'$-ary relation $R'$ is the projection of $R$ on $K$ if $R'$ is the set of $k'$-tuples from $\{0,1\}^{k'}$ such that $t'\in R'$ if and only if there exists $t\in R$ such that $t'[j]=t[i_j]$ for all $j\in[k']$. In other words, the tuples of $R'$ are obtained from the tuples of $R$ by leaving only the entries with coordinates from $K$. For a family $\mathcal{R}=\langle R_1,\dots,R_d\rangle$ of $k$-ary relations, the projection of $\mathcal{R}$ on $K$ is the family of the projections of $R_1,\dots,R_d$ on $K$.
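A projection can be computed tuple by tuple; a minimal sketch using 1-based coordinates as in the definition:

```python
def project(R, K):
    """Projection of a k-ary relation R (a set of 0/1 tuples) on an index
    set K: keep only the entries whose 1-based coordinates lie in K."""
    idx = sorted(K)
    return {tuple(t[i - 1] for i in idx) for t in R}
```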
Thus a set of vectors $C'=\{c_1,\dots,c_{k'}\}$ satisfies the projection of $\mathcal{R}$ on $K$ if and only if for every $i\in[d]$ there exists $t\in R_i$ such that for every $j\in[k']$, $c_j[i]=t[i_j]$. As soon as we fix a part $C'$ of the solution and an index set $K$ of size $k'$ such that $C'$ satisfies the projection of $\mathcal{R}$ on $K$, we can reduce the family of relations by deleting from each relation all $k$-tuples not compatible with $C'$ and $K$. More precisely, for every $i\in[d]$, we can leave only the $k$-tuples whose projections on $K$ are equal to $(c_1[i],\dots,c_{k'}[i])$. Let the reduced family of relations be $\mathcal{R}'$. Then in every solution extending $C'$, the set of new vectors should satisfy the projection of $\mathcal{R}'$ on $[k]\setminus K$. This brings us to the following definitions.
Definition 3 (Reducing relations).
Let $R$ be a $k$-ary relation, let $K=\{i_1,\dots,i_{k'}\}\subseteq[k]$ be a subset of indices, where $k'\le k$, and let $s$ be a $k'$-tuple. We say that a relation $R'$ is obtained from $R$ subject to $K$ and $s$ if
$$R'=\{t\in R\;:\;t[i_j]=s[j]\text{ for all }j\in[k']\}.$$
For a set of vectors $C'=\{c_1,\dots,c_{k'}\}$, a set $K\subseteq[k]$ of size $k'$, and a family of relations $\mathcal{R}=\langle R_1,\dots,R_d\rangle$, the family of relations obtained from $\mathcal{R}$ subject to $K$ and $C'$ is $\langle R'_1,\dots,R'_d\rangle$, where $R'_i$ is obtained from $R_i$ subject to $K$ and $(c_1[i],\dots,c_{k'}[i])$ for $i\in[d]$.
Definition 4 (Projection of a reduced relation).
For a relation $R$, a $k'$-sized subset $K$ of indices and a $k'$-tuple $s$, we consider the projection on $[k]\setminus K$ of the relation obtained from $R$ subject to $K$ and $s$.
For a family $\mathcal{R}=\langle R_1,\dots,R_d\rangle$ of relations, a set $C'=\{c_1,\dots,c_{k'}\}$ of vectors from $\{0,1\}^d$, and a $k'$-sized set of indices $K$, we use the corresponding family of such relations, where the $i$-th member is obtained from $R_i$ subject to $K$ and $s_i=(c_1[i],\dots,c_{k'}[i])$ and then projected on $[k]\setminus K$.
In other words, this relation consists of all $(k-k')$-tuples $t'$ such that “merging” of $s$ and $t'$ results in a $k$-tuple from $R$. In particular, every extension of $C'$ to a full solution can be generated by “merging” $C'$ with vectors satisfying this reduced and projected family.
We also use $\mathbf{0}$ and $\mathbf{1}$ to denote vectors with all entries equal to $0$ and $1$, respectively, where the dimension of the vectors will be clear from the context. For a vector $x$ and a set of vectors $C$, we use $\operatorname{dist}_H(x,C)$ to denote the minimum Hamming distance (the number of different coordinates) between $x$ and the vectors in $C$. For sets of vectors $X$ and $C$, we define
$$\operatorname{dist}_H(X,C)=\sum_{x\in X}\operatorname{dist}_H(x,C).$$
For a vector $c$ and an integer $r$, we use $B(c,r)$ to denote the open ball of radius $r$ centered at $c$, that is, the set of vectors in $\{0,1\}^d$ at Hamming distance less than $r$ from $c$.
In the analysis of our algorithm we will be using well-known tail inequalities such as Markov’s and Hoeffding’s inequalities.
Proposition 4 (Hoeffding’s inequality).
Let $X_1,\dots,X_n$ be independent random variables such that each $X_i$ is strictly bounded by the interval $[a_i,b_i]$. Let $X=\sum_{i=1}^{n}X_i$ and $\mu=\mathbb{E}[X]$. Then
$$\Pr\big[|X-\mu|\ge t\big]\le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}\right).$$
3 Notations related to Binary Constrained Clustering
Let $I=(X,k,\mathcal{R})$ be an instance of Binary Constrained Clustering and let $C=\{c_1,\dots,c_k\}$ be a solution to $I$, that is, a set of $k$ vectors satisfying $\mathcal{R}$. Then the cost of $C$ is $\sum_{x\in X}\operatorname{dist}_H(x,C)$. Given the set $C$, there is a natural way to partition the set of vectors $X$ into sets $X_1,\dots,X_k$ such that
$$\operatorname{dist}_H(x,C)=\operatorname{dist}_H(x,c_i)\quad\text{for every }x\in X_i.$$
Thus for each vector $x$ in $X_i$, the vector from $C$ closest to $x$ is $c_i$. We call such a partition a clustering of $X$ induced by $C$ and refer to the sets $X_1,\dots,X_k$ as the clusters corresponding to $C$.
We use $\mathrm{OPT}(I)$ to denote the cost of an optimal solution to $I$. That is,
$$\mathrm{OPT}(I)=\min\Big\{\sum_{x\in X}\operatorname{dist}_H(x,C)\;:\;C\subseteq\{0,1\}^d,\ |C|=k,\ C\models\mathcal{R}\Big\}.$$
Note that in the definition of a vector set $C$ satisfying relations $\mathcal{R}$, we require that the size of $C$ is $k$. We also need a relaxed notion for vector sets of size smaller than $k$ that satisfy a part of $\mathcal{R}$.
Definition 5 (Vectors respecting $\mathcal{R}$).
Let $C'=\{c_1,\dots,c_{k'}\}$ be a set of binary $d$-dimensional vectors, where $k'\le k$. We say that $C'$ respects $\mathcal{R}$ if there is an index set $K\subseteq[k]$ with $|K|=k'$ such that $C'$ satisfies the projection of $\mathcal{R}$ on $K$. In other words, $C'$ is a solution to the instance in which the relations are replaced by their projections on $K$.
Notice that given a set $C'$ of $k'\le k$ vectors which respects $\mathcal{R}$, one can extend it to a set $C$ of $k$ vectors in time linear in the size of $\mathcal{R}$ such that $C$ satisfies $\mathcal{R}$. Thus $C$ is a (maybe non-optimal) solution to $I$ such that $\sum_{x\in X}\operatorname{dist}_H(x,C)\le\sum_{x\in X}\operatorname{dist}_H(x,C')$. We will use this observation in several places and thus state it as a proposition.
Let $I=(X,k,\mathcal{R})$ be an instance of Binary Constrained Clustering and let $C'$ with $|C'|=k'$ for some $k'\le k$ be a set of vectors respecting $\mathcal{R}$. Then there is a linear time algorithm which finds a solution $C$ to $I$ such that $\sum_{x\in X}\operatorname{dist}_H(x,C)\le\sum_{x\in X}\operatorname{dist}_H(x,C')$.
Let be an instance of Binary Constrained Clustering. For , we define
An equivalent way of defining is
4 Sampling probability distributions
One of the main ingredients of our algorithms is a lemma about sampling from specific probability distributions. To state the lemma we use the following notation. For a real $p$ between $0$ and $1$, we will denote by $\mathcal{D}(p)$ the Bernoulli distribution which assigns probability $p$ to $1$ and probability $1-p$ to $0$. We will write $x\sim\mathcal{D}(p)$ to denote that $x$ is a random variable with distribution $\mathcal{D}(p)$.
Definition 7 (Weighted distance).
For two $k$-tuples $p=(p_1,\dots,p_k)$ and $q=(q_1,\dots,q_k)$ over the reals and a $k$-tuple $w=(w_1,\dots,w_k)$ with $w_i\ge 0$, the distance from $p$ to $q$ weighted by $w$ is defined as
$$\operatorname{dist}_w(p,q)=\sum_{i=1}^{k}w_i\cdot|p_i-q_i|.$$
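In code, the weighted distance (which we read as a weighted $\ell_1$ distance) and the closest tuple of a relation under it look as follows; a sketch with names of our choosing:

```python
def dist_w(p, q, w):
    """Distance from p to q weighted by w: sum over i of w_i * |p_i - q_i|."""
    return sum(wi * abs(pi - qi) for pi, qi, wi in zip(p, q, w))

def nearest_in_relation(p, R, w):
    """A tuple of R at minimum weighted distance from p, with that distance."""
    best = min(R, key=lambda r: dist_w(p, r, w))
    return best, dist_w(p, best, w)
```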
Informally, the Sampling Lemma proves the following. For an integer $k$, a relation $R\subseteq\{0,1\}^k$, a sequence of probability distributions $\mathcal{D}(p_1),\dots,\mathcal{D}(p_k)$ and $\varepsilon>0$, there is a constant $c$ (depending on $\varepsilon$) such that for every weight tuple $w$, a sample of $c$ random values from each distribution gives us a tuple which is a good estimate of the minimum weighted distance of $p=(p_1,\dots,p_k)$ from $R$.
Lemma 3 (Sampling Lemma).
There exists $c>0$ such that for every $\varepsilon>0$, positive integers $k$ and $m$, $k$-tuples $p=(p_1,\dots,p_k)$ with $0\le p_i\le 1$ and $w=(w_1,\dots,w_k)$ with $w_i\ge 0$, and relation $R\subseteq\{0,1\}^k$, the following is satisfied.
For every $i\in[k]$ and $j\in[m]$, let $x_i^{j}\sim\mathcal{D}(p_i)$, and let $x^{j}$ be the $k$-tuple of random variables $(x_1^{j},\dots,x_k^{j})$. Let $d^{*}$ be the minimum distance weighted by $w$ from $p$ to a $k$-tuple from $R$. Let $\tilde{r}$ be a $k$-tuple from $R$ within the minimum weighted by $w$ distance to the samples $x^{1},\dots,x^{m}$, and let $\tilde{d}=\operatorname{dist}_w(p,\tilde{r})$. Then
Let . Then . Let be the set of all tuples such that . Let . We will prove the following claim.
For every ,
Assuming Claim 1 we complete the proof of the lemma:
Hence, to complete the proof of the lemma, it remains to prove Claim 1.
Proof of Claim 1.
We will assume without loss of generality that . By renaming to and vice versa at the coordinates where , we may assume that . Thus . We may now rewrite the statement of the claim as:
Consider now the weight -tuple where if and if . We have that , and that