Low rank approximation is a common data analysis problem with various applications in computer science. The most general version of the problem, the $\ell_p$-low rank approximation problem, is defined in the following manner:
-low rank approximation: Given a matrix (with ) and an integer , find a rank- matrix such that is minimised.
The above definition is for any positive value of $p$. When $p = 0$, the objective is to minimise the number of mismatches between the input matrix and its rank-constrained approximation. The $\ell_p$-low rank approximation problem is known to be NP-hard for , while for $p = 2$ the problem can be solved using SVD (Singular Value Decomposition). Even though efficient approximation algorithms for these problems are known, their approximation factors are large (polynomial in ). Recent work of Ban et al. [ban] addressed the open question of whether a PTAS is possible for these problems. They showed that for there is no constant factor approximation algorithm running in time for a constant under the small set expansion hypothesis and the exponential time hypothesis (ETH). This shows that an exponential dependence on is necessary. On the upper bound side, they give a -approximation algorithm with running time for the cases when .
The case $p = 0$ has also been studied. There exists a bicriteria approximation algorithm for the $\ell_0$-low rank approximation problem [bkw17]. The problem can alternatively be stated as: given an matrix , find an matrix and a matrix such that is minimised. There is interest in a specific class of instances of the $\ell_0$-low rank approximation problem where the matrices in the above formulation are binary matrices. In fact, we can generalise even further by making the notion of in the above definition more flexible in the following manner: If , then is the inner product of the row of and the column of . We can consider various algebraic structures for this inner product. The two most commonly explored are: (i) the field GF(2), where the inner product is the sum of the coordinate-wise products modulo 2, and (ii) the Boolean semiring, where the inner product is the OR of the coordinate-wise ANDs. Various previous works consider these specific versions. We can generalise the problem (using the formulation in terms of and ) so that the above versions become special cases. This was done by Ban et al. [ban], who called the resulting problem the generalised binary -rank- problem, defined below.
Generalised binary -rank-: Given a matrix with , an integer , and an inner product function , find matrices and that minimises , where is computed using the inner product function. That is is the inner product of the row of with the column of .
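To make the two inner products concrete, the following minimal Python sketch computes the product of two binary matrices under each of them and counts the mismatches against a target matrix. The function names (`inner_gf2`, `inner_boolean`, `product`, `l0_cost`) and the example matrices are illustrative assumptions, not notation from Ban et al. [ban]; the inner products are assumed to be the usual ones, namely the sum of coordinate-wise products modulo 2 and the OR of coordinate-wise ANDs.

```python
def inner_gf2(x, y):
    """Inner product over GF(2): sum of coordinate-wise products, modulo 2."""
    return sum(a & b for a, b in zip(x, y)) % 2


def inner_boolean(x, y):
    """Inner product over the Boolean semiring: OR of coordinate-wise ANDs."""
    return int(any(a & b for a, b in zip(x, y)))


def product(U, V, inner):
    """Matrix product of U and V, with each entry computed by `inner`."""
    columns = list(zip(*V))
    return [[inner(row, col) for col in columns] for row in U]


def l0_cost(A, B):
    """Number of positions at which the equally sized matrices A and B differ."""
    return sum(a != b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


# Hypothetical instance: a 3 x 3 target matrix and factors with inner dimension 2.
A = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
U = [[1, 1], [1, 0], [0, 1]]
V = [[1, 1, 0], [0, 1, 1]]

cost_gf2 = l0_cost(A, product(U, V, inner_gf2))
cost_boolean = l0_cost(A, product(U, V, inner_boolean))
print(cost_gf2, cost_boolean)
```

The same pair of factors is an exact factorisation under one inner product and incurs one mismatch under the other, which is why the inner product function is part of the problem input.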
Ban et al. [ban] showed that there is no approximation algorithm for the generalised binary -rank- problem running in time for a constant . The work of Ban et al. [ban] and Fomin et al. [fomin18] addressed one of the main open questions for the generalised binary rank- problem: whether a PTAS for constant is possible. They give such a PTAS using a very similar set of ideas (even though the results were obtained independently). We extend the previous work of Ban et al. and Fomin et al. to the streaming setting by using the connection of this problem to the constrained binary -means problem. This connection was given and used by both Ban et al. [ban] and Fomin et al. [fomin18]. We will now discuss the constrained binary -means problem and its connection to the generalised binary rank- problem. We start with the binary -means problem, which is an interesting problem in its own right.
Binary -means: Given a set of points and a positive integer , find a set of centers such that the -means cost function is minimised.
Note that and are restricted to be from the set , the elements of which can alternatively be interpreted as -bit strings. So, the squared Euclidean distance between any two points that is used in the -means cost function can alternatively be written as , where denotes the Hamming distance between strings and (the Hamming distance between two binary strings of equal length is the number of bits on which they differ). Note that is a metric, which means, in particular, that the distance function over the domain satisfies the triangle inequality: the distance between any two strings is at most the sum of their distances to any third string.
The binary -means problem has been examined in the past by Kleinberg, Papadimitriou, and Raghavan [kpr04], Ostrovsky and Rabani [or02], and Alon and Sudakov [as99] (Alon and Sudakov [as99] consider the dual maximisation problem that involves maximising instead of minimising ). Ostrovsky and Rabani [or02] gave a -approximation algorithm with running time for some function . Fomin et al. [fomin18] gave an EPTAS (under the assumption that is a constant) with running time for some function . In fact, Fomin et al. [fomin18] gave an EPTAS for a more general version of the binary -means problem that they call the constrained binary -means problem, and the EPTAS for the binary -means problem trivially follows from this. We discuss this problem next.
We will work with the definition of the constrained binary -means problem given by Fomin et al. [fomin18]. For this, we first need to define the concept of a set of centers satisfying a set of -ary relations. Given a set of , -ary binary relations (i.e., for every ), a set of centers is said to satisfy iff for every . Here, is thought of as a -dimensional vector and denotes the coordinate of this vector. We can now define the constrained binary -means problem.
Constrained binary -means: Given a set of points , a positive integer , and a set of -ary relations , find a set of centers satisfying such that the cost function is minimised.
Note that the distance between point and center is given as as opposed to by Fomin et al. [fomin18]. However, the formulations are equivalent since the distances are the same when are binary vectors. It is important to distinguish the definition of the constrained binary -means problem given above from the constrained -means problem, which is a known problem in the -means clustering literature.
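The definition above can be illustrated with a small Python sketch that checks whether a center set satisfies a family of relations and evaluates its cost. It assumes, following the description above, that there is one relation per coordinate, represented as a set of allowed tuples, and that a center set satisfies the family when the tuple of the centers' values at each coordinate belongs to that coordinate's relation. The names `satisfies` and `cost`, the example points and the example relation are hypothetical.

```python
def hamming(x, y):
    """Number of coordinates on which the binary vectors x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def squared_euclidean(x, y):
    """Squared Euclidean distance; equals the Hamming distance on binary vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies(centers, relations):
    """True iff, at every coordinate j, the tuple of center values is in relations[j]."""
    return all(
        tuple(center[j] for center in centers) in relation
        for j, relation in enumerate(relations)
    )


def cost(points, centers):
    """Sum over points of the distance to the nearest center."""
    return sum(min(hamming(x, c) for c in centers) for x in points)


# Hypothetical instance with two centers in three dimensions. Every coordinate uses
# the same relation: the first center's bit may not exceed the second center's bit.
points = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
relations = [{(0, 0), (0, 1), (1, 1)}] * 3
centers = [(0, 0, 0), (1, 1, 1)]

print(satisfies(centers, relations), cost(points, centers))
```

Swapping the two centers leaves the cost unchanged but violates the relations, which shows that the constraints act on the ordered tuple of centers rather than on the clusters.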
Comparison with constrained -means problem
Ding and Xu [dx15] gave a unified framework for constrained versions of the -means problem. As per this framework, a constrained version of the -means problem is defined by an input set , a positive integer , and a set of constraints on the clusters. The goal is to output a clustering of the dataset satisfying the cluster constraints such that the -means cost function with respect to centers defined by the centroids of the clusters is minimised. That is, find a clustering satisfying the cluster constraints such that the following cost function is minimised: , where is the centroid of the points in . This way of defining constrained -means has the advantage that a number of known constrained versions of the -means/median problem fit into the framework. For instance, consider the -gather clustering problem where the constraint on the clusters is that every cluster should contain at least points. See [dx15] for a more elaborate list of problems that fit into the unified framework of Ding and Xu. This kind of generalisation raises questions about the representation and conciseness of the cluster constraints. This is an important consideration while defining a unified framework since the number of possible clusterings of the data can be very large. This problem is resolved by defining a partition algorithm as an alternative to stating all valid clusterings for a particular problem. A partition algorithm, when given a center set , outputs a valid clustering such that is minimised (for the unconstrained or standard -means problem, the partition algorithm is simply the Voronoi partitioning algorithm). Interestingly, for many of the constrained versions of the -means problem, such as the -gather/capacity problem, there is a partition algorithm available. One of the contributions of Ding and Xu [dx15] was to give such partition algorithms for a number of constrained clustering problems.
The other main contribution of Ding and Xu [dx15] was to give a polynomial time approximation scheme (PTAS) for the constrained -means problem. This means that one gets a PTAS for any constrained version of the -means problem as long as there is an efficient partition algorithm for that version of the -means problem. Subsequently, Bhattacharya et al. [bjk] gave a faster PTAS while Goyal et al. [gjk] gave a streaming PTAS for the problem, both using the -sampling technique. At a high-level, the constrained binary -means problem that we defined earlier, seems to be yet another constrained version of the -means problem. So, the relevant question in the context of the current discussion is:
Does the constrained binary -means problem fit into the unified framework of Ding and Xu [dx15]?
If the answer to the above question were yes, then a streaming PTAS for the constrained binary -means problem would trivially follow from the recent work of Goyal et al. [gjk]. Unfortunately, this is not true. Note that the framework of Ding and Xu [dx15] defines the constraints on the clusters while the definition of constrained binary -means problem defines constraints on the centers. However, we note that the -sampling based techniques of [bjk, gjk] can be extended to this setting.
Goyal et al. [gjk] used a constant factor approximation algorithm for the standard -means problem as a subroutine to obtain a PTAS for the constrained -means problem within the framework of Ding and Xu [dx15]. Since their algorithm is based on a simple sampling idea, their algorithm can be converted to a constant pass streaming algorithm using the reservoir sampling technique. In this work, we try to extend the same ideas to the constrained binary -means problem. However, to make the sampling ideas work, we will need additional results. First, we will need a streaming algorithm that gives a constant factor approximate solution for the binary -means problem (i.e., the unconstrained problem). Second, we will need a result that says that it is possible to obtain good constrained centers of any target clusters if we have uniform samples from each of the clusters. Fortunately, both the results are already known. We discuss these next. Let us start with the streaming constant factor approximation algorithms for the binary -means problem.
Streaming constant approximation for binary -means
The binary -means problem is basically the unconstrained version of the constrained binary -means problem. The following result follows from the work of Braverman et al. [brav11] on the standard discrete version of the -means problem over arbitrary metric spaces.
There is a constant factor streaming algorithm for the binary -means problem that runs in one pass over the data while storing points in memory with overall running time .
Let us now discuss the second issue about obtaining good centers using uniform samples from every cluster.
Good centers from uniform samples
We will now discuss the possibility of obtaining good centers for the constrained binary -means problem using uniform samples from each of the target clusters. Note that for the standard -means problem, this is possible since the centroid of a uniformly sampled set of points from any cluster gives a good center for that cluster. A good center here means that the cost with respect to this center is within a -factor of the cost with respect to the optimal center of the cluster. This follows from a result of Inaba et al. [inaba]. In the context of the constrained binary -means problem, just having a uniform sample from each of the clusters may not be sufficient to obtain good centers for the clusters in the -approximation sense. However, Fomin et al. [fomin18] and Ban et al. [ban] showed that if one has an estimate of the size of the optimal clusters in addition to a certain minimum number of uniform samples from them, then one can obtain good constrained centers for the clusters in the -approximation sense. The following theorem states the result of Fomin et al. [fomin18] formally. A similar sampling result was given and used by Ban et al. [ban]. We use the formulation of Fomin et al.
Theorem 1.2 (Follows from Fomin et al. [fomin18])
For a given instance of the problem, let denote an arbitrary partition of the points in , and let be a center set satisfying that minimises . Let and . Let be such that for every , . Let denote a multiset of the points such that for every , consists of points from sampled independently and uniformly with replacement. Then there is a simple algorithm given below that outputs a center satisfying such that .
The running time of the algorithm is .
Note that the statement of the above theorem deviates from the statement of the corresponding lemma of Fomin et al. [fomin18]. The lemma of Fomin et al. [fomin18] is not stated for an arbitrary partition of the given dataset and a center set satisfying that optimises . It is stated for the optimal partitioning . Note that for the optimal partitioning and its corresponding optimal center set , we have . So, the final bound on the expectation given by Fomin et al. [fomin18] is in terms of . The proof of the above theorem is essentially the same as theirs, since the proof of Fomin et al. does not use the optimality of the partition and in fact holds for any partition, as stated in the theorem above.
The above theorem tells us that as long as we can guess the size of the target clusters and obtain uniform samples from every cluster, we should be able to obtain good centers for these clusters. For the size of the clusters, since we only need an estimate within a multiplicative factor of , we can employ a brute-force strategy of trying out all the possibilities. This brute-force strategy contributes a multiplicative factor of in the running time of the PTAS. (Fomin et al. [fomin18] managed to remove this factor from the running time of their PTAS using a peeling strategy that handles optimal clusters in a sequence of iterations. However, using this peeling strategy to design a streaming PTAS that works in a few passes becomes difficult.) Obtaining uniform samples from each of the target clusters is a trickier issue, and it is not immediately clear how to do this. If all the clusters are roughly of equal size, then one can uniformly sample points from with replacement and then try out all possible -sized (multi) subsets; in that case, we can argue that one of the (multi) subsets so attempted will be -sized uniform samples from . However, if the optimal clusters are of very different sizes, then uniform sampling from will clearly not work since the chance of sampling from a very small cluster may be very small. $D^2$-sampling has turned out to be a very useful tool in such cases where uniform sampling does not work. Given a set of centers , the idea is to sample points with probability proportional to the squared distance of the points from the closest center in . This boosts the probability of sampling from small clusters that do not have a representative in the center set . This idea suggests an iterative way of picking good centers in rounds where one argues that either there is a good chance of picking good centers from uncovered clusters in subsequent rounds or that the current set of centers gives good -means cost. This idea, however, is not likely to lead to a streaming algorithm with few passes, especially in the current context where uniform samples from all clusters are needed simultaneously to meet the constraints. So, we use the idea of Bhattacharya et al. [bjk] that was subsequently used by Goyal et al. [gjk] to obtain streaming algorithms for various constrained versions of the -means problem. The main idea is to start with a constant factor approximate solution for the binary -means problem and then consider the data points constructed (though not explicitly algorithmically) using points from and . The point sets have two advantages over the optimal partition – (i) good centers for are also good for , and (ii) it is possible to simultaneously obtain uniform samples from each of . Along with the result of Fomin et al. [fomin18] (Theorem 1.2), this should be sufficient to obtain good centers for the constrained binary -means problem. The PTAS based on the above ideas is extremely simple and can be stated as a short pseudocode given below.
We prove the following result with respect to the algorithm given above:
Let . For any constrained binary -means instance , let denote an -approximate solution for the binary -means problem instance . The algorithm , returns a center set satisfying such that:
The running time of the algorithm is .
We prove the above theorem in Section 2. Let us now see how this algorithm gives a 3-pass streaming PTAS for the problem. The first pass of the algorithm is used to find a constant factor approximate solution to the binary -means problem corresponding to the given instance . From the discussion earlier, we know that there is a one pass algorithm that returns a constant factor approximate solution to the binary -means problem and that uses space. In the second pass, we execute lines (1-7) of the algorithm . This can indeed be executed in a single pass. Note that line (2) is for probability amplification and all iterations can be executed independently. The main step is line (3) where we need to $D^2$-sample points w.r.t. center-set independently from . This can be done using reservoir sampling in a single pass. (Reservoir sampling: Let denote the squared distance of points to the nearest center in the center set . Then one $D^2$-sample can be obtained while making a pass over the data in the following manner: store the first point and, on seeing the point for , replace the stored element with probability and continue with the remaining probability.) Lines (5-7) accumulate -center sets in corresponding to all possible subsets and all possible choices for . The final step of line (8) involves picking the -center set from with the least cost. We need one more pass over the data to perform line (8). This makes a total of 3 passes. The space requirement for the pass is . We summarise our result formally as the theorem below, the proof of which follows from the discussion above.
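The single-pass sampling step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pseudocode: it assumes the replacement probability on seeing a point is that point's weight divided by the total weight seen so far (the standard weighted reservoir rule), it uses the Hamming distance to the nearest center as the weight, and it skips points of weight zero. The function name `d2_sample_stream` and its parameters are hypothetical.

```python
import random


def hamming(x, y):
    """Number of coordinates on which the binary vectors x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def d2_sample_stream(stream, centers, num_samples, rng):
    """Draw `num_samples` independent samples in one pass over `stream`.

    Each sample is a point chosen with probability proportional to its
    distance to the nearest center, using one reservoir slot per sample.
    """
    slots = [None] * num_samples
    total_weight = 0
    for point in stream:
        weight = min(hamming(point, c) for c in centers)
        if weight == 0:
            continue  # a point lying on a center is never sampled
        total_weight += weight
        for i in range(num_samples):
            # Replace the stored point with probability weight / total_weight.
            if rng.random() < weight / total_weight:
                slots[i] = point
    return slots


rng = random.Random(0)
centers = [(0, 0, 0, 0)]
stream = [(0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 1, 0)]  # weights 0, 1 and 3
samples = d2_sample_stream(stream, centers, 20000, rng)
far_fraction = samples.count((1, 1, 1, 0)) / len(samples)
print(round(far_fraction, 2))
```

Only the slots and the running total are kept in memory, so the space used grows with the number of samples rather than with the length of the stream. A point that arrives at position i survives to the end with probability equal to its weight divided by the final total, because the successive "not replaced" probabilities telescope.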
Theorem 1.4 (Main result for constrained binary -means)
Let . There is a 3-pass streaming algorithm that outputs a -approximate solution for any instance of the constrained binary -means problem. The space and per-item processing time of our algorithm is .
Note that as per the formulation of the constrained binary -means problem, the output is supposed to be a set of centers. The above 3-pass algorithm outputs such a center set . However, if the objective is to output the clustering of the data points with respect to , then one more pass over the data will be required and the resulting algorithm will be a 4-pass algorithm. This is relevant for the -rank- approximation problem that we discuss next. We obtain a result for the generalised binary -rank- problem that is similar to the above result, using a simple reduction to the constrained binary -means problem. This reduction is used by both Fomin et al. [fomin18] and Ban et al. [ban]. We restate the result of Fomin et al. [fomin18] for clarity.
Lemma 1 (Lemma 1 and 2 of [fomin18])
For any instance of the generalised binary -rank- approximation problem, one can construct in time an instance of constrained binary -means problem with the following property: Given any -approximate solution of , an -approximate solution of can be constructed in time .
The dataset corresponding to the matrix in the above reduction is essentially the set of rows of the matrix , and and ’s are pairwise distinct vectors in . The above reduction and Theorem 1.4 give the following main result for the generalised binary -rank- approximation problem. Note that since we need to output a matrix , we will need the clustering of the rows of , and as per the previous discussion this will require one more pass than in Theorem 1.4.
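The extra pass mentioned above can be pictured with the following Python sketch. It assumes, as an illustration only, that the output matrix is obtained by replacing every row of the input matrix with its nearest center, so that the number of mismatches equals the clustering cost; the exact construction used in the reduction of Fomin et al. [fomin18] is not reproduced here. The function names and the example data are hypothetical.

```python
def hamming(x, y):
    """Number of coordinates on which the binary vectors x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def rows_to_nearest_centers(rows, centers):
    """One pass over the rows: replace each row by its nearest center."""
    return [min(centers, key=lambda c: hamming(row, c)) for row in rows]


def mismatches(A, B):
    """Number of entries on which the equally sized matrices A and B differ."""
    return sum(hamming(row_a, row_b) for row_a, row_b in zip(A, B))


# Hypothetical input matrix (as a list of rows) and a hypothetical center set.
A = [(1, 0, 1), (1, 0, 0), (0, 1, 1)]
centers = [(1, 0, 1), (0, 1, 1)]

B = rows_to_nearest_centers(A, centers)
print(B, mismatches(A, B))
```

Each row is processed independently, which is why this step fits in a single row-wise pass once the center set is known.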
Theorem 1.5 (Main result for generalised binary -rank- approximation)
Let . There is a 4-pass streaming algorithm that makes row-wise passes over the input matrix and outputs a -approximate solution for any instance of the generalised binary -rank- problem. The space and per-item processing time of our algorithm is .
A lot of work has been done for the -means problem and the low rank approximation problems. The following related work subsection will help see our work in the right perspective.
1.1 Related work
The binary -means problem is a special case of the discrete variant of the classical -means problem where the clustering problem is defined over the metric . The problem was introduced and studied by Kleinberg, Papadimitriou, and Raghavan [kpr98, kpr04] under the name of segmentation problems. They showed that the problem is NP-hard for and gave approximation algorithms for the dual maximisation problem where the goal is to maximise . Ostrovsky and Rabani [or02] gave a randomised PTAS with running time for some function . More recently, Fomin et al. [fomin18] gave an efficient PTAS with running time for some function . In fact, Fomin et al. gave such an efficient PTAS for a much more general version called the generalised constrained binary -means problem, which we discussed earlier. Ban et al. [ban] independently obtained similar results (the result of Ban et al. [ban] does not explicitly discuss the binary -means problem).
The binary versions of the -low rank approximation problem are relevant in a number of contexts (e.g., [gutch10, painsky, dan, bv10, sbm03, singliar]). Even though the generalised -rank- approximation problem was named more recently by Ban et al. [ban], special cases of the problem have been studied in the past. For instance, it is known that for the special case where the field for the inner product is , the problem is NP-hard for every [gv15, dan]. Various constant factor approximation algorithms have been given for cases where is a fixed constant [sjy09, jphy14, bkw17]. There also exists a -approximation algorithm running in time [dan]. Ban et al. [ban] showed a hardness-of-approximation result conditioned on the Exponential Time Hypothesis (ETH) showing that there is no approximation algorithm (beyond a fixed constant) for the generalised binary -rank- approximation problem running in time for a constant . They support their lower bound with a PTAS that runs in time . A similar PTAS (using similar ideas) was given independently by Fomin et al. [fomin18].
2 Analysis of PTAS (Proof of Theorem 1.3)
In this section, we prove our main result related to the algorithm GoodCenters. We state the pseudocode for GoodCenters and the statement of the theorem for ease of reading.
Following is the restatement of the main result with respect to the algorithm above.
Let . For any constrained binary -means instance , let denote an -approximate solution for the binary -means problem . The algorithm , returns a center set satisfying such that:
The running time of the algorithm is .
We will need a few definitions for our analysis with respect to the given instance . Let denote the optimal center set satisfying and let denote the corresponding clustering induced by . That is, for every . For every , let and as before let . We will also need terms related to the optimal solution to the corresponding binary -means instance (that is, the corresponding unconstrained instance). Let denote the optimal binary -means solution, let denote the corresponding clustering induced by and let . We know the following about some of the quantities defined above and the center set given as input to the algorithm:
The first inequality follows from the fact that denotes the optimal solution to the constrained version as opposed to , which denotes the optimal solution to the unconstrained version. The second inequality follows from the fact that is an -approximate solution to the binary -means instance . The outer iteration (repeat times in line (2)) is for probability amplification. We will show that the probability of finding a good center set in one iteration is , and the theorem will follow from simple probability calculations. Let us now focus on a single iteration of the algorithm. We will assume that we know such that . Note that we will try all possibilities such that the inequalities hold. It will be easier to analyse assuming that we know the correct values. We will show that with probability at least , there are disjoint (multi) subsets of (see line (5)) such that,
Since we try out all possible subsets in line (5), we will obtain the desired result. We argue in the following manner: consider the multi-set . We can interpret as a union of multi-sets , where . Also, since consists of independently sampled points, we can interpret as a union of multi-sets where is the bunch of points sampled. For all , let where denotes the set of points in the multi-set (with repetition) that belong to . We will show that there are subsets for every such that eqn. (3) holds. We state this formally as the next lemma, which we will prove in the remainder of this section.
Let multi-sets be as defined above. Then
We divide the cluster indices into two groups based on the value of and then do a case analysis. Let
We will show that the good ’s as in Lemma 2 are such that for and for . The analysis shares similarities with the analysis of the $D^2$-sampling based algorithm of Bhattacharya et al. [bjk] and Goyal et al. [gjk] for the standard -means problem. However, there are significant deviations and the arguments have to be adapted to the current setting. The main difference is that, in the context of the standard -means problem, a uniform sample from a cluster was sufficient to obtain a good center for that cluster, so one could argue the approximation guarantee cluster-wise. In the current context, one needs to argue simultaneously with respect to all clusters. We therefore give the detailed proofs here.
Let us consider the index set first. Let be any index in the set . We will show that there is a set consisting only of elements in the set such that a uniform sample from (along with similar samples from other ’s) will give a good center-set. For any point , let denote the center in the set that is closest to . That is, . Let us define the multi-set as
That is, we take the nearest centers of elements in with appropriate multiplicities to construct . The intuition behind constructing the set is that since the cost of with respect to center-set is very small, the points in the set are close to the centers in and hence the centers in can act as a “proxy” for the points in the set . Obtaining a uniform sample from is much simpler since we consider the set that has an appropriate number of copies from the set .
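The proxy construction described above can be pictured with a short Python sketch. It is only an illustration of the idea of replacing each point of a cluster by its nearest center while keeping multiplicities; the cluster, the center set and the name `proxy_multiset` are hypothetical, and ties between equally near centers are broken in favour of the center listed first, which is an arbitrary choice made here.

```python
from collections import Counter


def hamming(x, y):
    """Number of coordinates on which the binary vectors x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def proxy_multiset(cluster, centers):
    """Replace every point of the cluster by its nearest center (with repetition)."""
    return [min(centers, key=lambda c: hamming(x, c)) for x in cluster]


# Hypothetical cluster whose points all lie close to the given centers.
cluster = [(0, 0, 1), (0, 1, 0), (1, 1, 1)]
centers = [(0, 0, 0), (1, 1, 0)]

proxies = proxy_multiset(cluster, centers)
multiplicities = Counter(proxies)
print(multiplicities)
```

A uniform sample from the proxy multiset picks each center with probability proportional to the number of cluster points it stands in for, so it can be simulated from the centers and these counts alone.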
Now constructing a similar set for an index , is a bit more involved. Since the cost of for any is not small (as opposed to indices in ), the points from the set alone cannot act as proxy for the points in the set . On the other hand, if we sample using -sampling w.r.t. set , then all the points in that are far from centers in will have a good chance of being sampled. However, the same is not true for points in that are close to centers in . So, what we need to do is to consider a partition of the points in into near points and far-away points. The far-away points have a good chance of being sampled in line (3) and the centers in can act as proxy for near points. We define the set for more formally now. The closeness of point in to points in is quantified using radius that is defined by the equation:
Let be points in that are within distance from a center in and denote the remaining points. That is,
Using these, we define the multi-set as:
Note that . Let and . Having defined the sets corresponding to in eqn. (4) and (5), we will now try to show that a good center set for will also be good for . Let be a center set such that satisfies and minimises the cost . The next few lemmas will be useful in the analysis.
For any , .
Let . We do a case analysis:
In this case, consider any point . From the triangle inequality, we have
This gives .
In this case, we have from triangle inequality:
This completes the proof of the lemma.∎
The proof follows from the following inequalities: