A Note on Clustering Aggregation

07/24/2018
by   Jiehua Chen, et al.
Ben-Gurion University of the Negev

We consider the clustering aggregation problem in which we are given a set of clusterings and want to find an aggregated clustering which minimizes the sum of mismatches to the input clusterings. In the binary case (each clustering is a bipartition) this problem was known to be NP-hard under Turing reduction. We strengthen this result by providing a polynomial-time many-one reduction. Our result also implies that no $2^{o(n)} \cdot |I|^{O(1)}$-time algorithm exists for any clustering instance $I$ with $n$ elements, unless the Exponential Time Hypothesis fails. On the positive side, we show that the problem is fixed-parameter tractable with respect to the number of input clusterings.


1 Clustering Aggregation for Binary Strings

The problem can be formalized as follows. We aim to find a binary string that minimizes the Mirkin distance to a given collection of input binary strings, where all strings have the same length. The Mirkin distance [11] between two strings counts the number of mismatches over all pairs of bits: a pair of positions is a mismatch if the two bits are equal in one string but different in the other. The Mirkin distance of a string to a sequence of strings is the sum of the Mirkin distances between this string and each string in the sequence.

The Mirkin distance has an alternative definition that uses Hamming distances.
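The two views of the Mirkin distance can be compared directly in code. The following is a minimal sketch, assuming that the Hamming-based form is the identity mirkin(x, y) = d · (n − d), where d is the Hamming distance between x and y and n is their common length. The identity holds because a pair of positions is a mismatch exactly when the two strings differ at precisely one of the two positions. All function names are illustrative and not taken from the paper.

```python
from itertools import combinations


def hamming(x: str, y: str) -> int:
    """Number of positions at which x and y differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))


def mirkin_pairs(x: str, y: str) -> int:
    """Pairwise definition: count position pairs {i, j} whose two bits
    are equal in one string but different in the other."""
    assert len(x) == len(y)
    return sum(
        (x[i] == x[j]) != (y[i] == y[j])
        for i, j in combinations(range(len(x)), 2)
    )


def mirkin_hamming(x: str, y: str) -> int:
    """Hamming-based form (assumed identity): d * (n - d)."""
    d = hamming(x, y)
    return d * (len(x) - d)


def mirkin_to_sequence(x: str, strings) -> int:
    """Sum of the Mirkin distances from x to every string in the sequence."""
    return sum(mirkin_hamming(x, s) for s in strings)
```

Since d · (n − d) is unchanged when d is replaced by n − d, a string and its complement have the same Mirkin distance to every other string.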

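Note that, by the above formulation, the Mirkin distance objective function is not convex. The formal statement of the problem is as follows:

Mirkin Distance Minimization
Input: A set of binary strings of equal length and an integer bound.
Question: Is there a binary string of the same length whose Mirkin distance to the input strings is at most the given bound?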

Notations.

For two binary strings  and , let denote the concatenation of and and let denote the complement of string . By we mean the value of the bit of string  and we write as shorthand of . Given two integers  with , we use the notation to denote the substring .

Our contributions.

Our main result in this paper is a tight running time bound on the Mirkin Distance Minimization problem. Specifically, we show that the problem cannot be solved in $2^{o(n)} \cdot |I|^{O(1)}$ time, where $n$ is the length of the input strings and $I$ is the instance, unless the Exponential Time Hypothesis (ETH) fails. While the upper bound in this result is not very difficult, the lower bound uses an intricate construction, which shows that the trivial brute-force algorithm for the problem cannot be substantially improved. In the second part of the paper, we show that the problem is fixed-parameter tractable with respect to the number of input strings, via an integer linear programming (ILP) approach.
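For reference, the trivial brute-force algorithm mentioned above can be sketched as follows. This is an illustrative implementation, not code from the paper: it enumerates all binary strings of the input length and keeps one with the smallest total Mirkin distance, using the Hamming-based form d · (n − d) of the distance. The helper names are hypothetical.

```python
from itertools import product


def mirkin(x: str, y: str) -> int:
    """Mirkin distance via the Hamming distance d: d * (n - d)."""
    d = sum(a != b for a, b in zip(x, y))
    return d * (len(x) - d)


def brute_force(strings):
    """Try all 2^n candidate strings and return (best_string, best_cost)."""
    n = len(strings[0])
    best = None
    for bits in product("01", repeat=n):
        candidate = "".join(bits)
        cost = sum(mirkin(candidate, s) for s in strings)
        if best is None or cost < best[1]:
            best = (candidate, cost)
    return best
```

The running time is exponential in the string length and polynomial in the rest of the input, which is the kind of bound that the lower bound of this note shows cannot be substantially improved under ETH.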

Related work.

Mirkin Distance Minimization is a special variant of the NP-hard Clustering Aggregation [6] problem (a.k.a. Consensus Clustering [4] or Cluster Ensembles [12]) from machine learning and bioinformatics. The problem has as input a set of partitions of a common ground set of elements, and we search for a target partition that minimizes the sum of the Mirkin distances to all input partitions. Herein, a partition of a set is given by an equivalence relation on it (i.e., a binary relation that is reflexive, symmetric, and transitive). Thus, each partition can be represented by the equivalence classes of the corresponding equivalence relation. The Mirkin distance between two partitions is defined as the number of pairs of elements which are equivalent in one partition but non-equivalent in the other, or the other way round.

A partition with at most two equivalence classes can also be expressed as a binary string. Thus, it is straightforward to see that our problem is equivalent to Clustering Aggregation for Binary Strings, i.e., both the input and the output partitions are binary strings. Mirkin Distance Minimization has further applications in voting theory and is also studied under the name of Binary Relation Aggregation [1, 13, 14], which is related to a concept in voting known as the median relation [1]. Dörnfelder et al. [3] showed that Mirkin Distance Minimization is NP-hard under Turing reduction. We will show in this note that the problem is NP-hard by providing a many-one reduction, which also implies that the trivial brute-force algorithm for the problem cannot be substantially improved.
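The correspondence between bipartitions and binary strings can be illustrated with a short sketch. The encoding below (first class mapped to 0, everything else to 1) and all names are assumptions made for illustration; the paper does not fix a particular encoding here.

```python
from itertools import combinations


def partition_to_string(elements, classes):
    """Encode a partition with at most two classes as a binary string.

    Elements of the first class get '0', all other elements get '1'.
    """
    if len(classes) > 2:
        raise ValueError("only partitions with at most two classes are binary")
    first = classes[0] if classes else set()
    return "".join("0" if e in first else "1" for e in elements)


def mirkin_partitions(elements, p, q):
    """Pairs of elements that are together in one partition but apart in the other."""
    def labels(classes):
        return {e: idx for idx, cls in enumerate(classes) for e in cls}

    lp, lq = labels(p), labels(q)
    return sum(
        (lp[a] == lp[b]) != (lq[a] == lq[b])
        for a, b in combinations(elements, 2)
    )


def mirkin_strings(x, y):
    """Mirkin distance of two binary strings via the Hamming distance d."""
    d = sum(a != b for a, b in zip(x, y))
    return d * (len(x) - d)
```

Swapping the two classes of a bipartition yields the complement string, which encodes the same partition and has Mirkin distance zero to the original string.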

Very recently, we [2] considered a related problem, $p$-Norm Hamming Centroid, which searches for a centroid string which minimizes the $p$-norm of its Hamming distances to the input strings, for a fixed value of $p$. When the objective is to maximize instead of minimize the distances and when , the Mirkin Distance Minimization problem can be reduced to this maximization variant.

2 NP-hardness for Sum of Mirkin Distances

We show that Mirkin Distance Minimization is indeed NP-hard by utilizing a gadget that Dörnfelder et al. [3] used to enforce that, for each two bits, when restricted to only these two bits, exactly half of the input strings have the same value (00 or 11) and the other half have different values (01 or 10). Algorithm 1 computes such a gadget. Note that each output string has length . This type of gadget alone, however, is not enough to devise a many-one hardness reduction. It can be used to encode truth values of variables in a reduction from 3SAT, but an essential difficulty that remains is to find gadgets that encode clause satisfaction.

Build():
    if  then
        return
    else
        return
Algorithm 1: Algorithm for constructing a sequence of length- binary strings such that, for each two bits, half of the strings have the same value and the other half do not.
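A runnable sketch of a gadget construction with the stated property is given below. It rests on assumptions rather than on the exact pseudocode: the base case is taken to be the two strings 00 and 01, and the recursive step maps each string s to the two strings s∘s and s∘s̄ (concatenation with itself and with its complement), which matches the form of the strings used in the proof of Proposition 1. The parameter name `level` and the order of the output strings are hypothetical.

```python
def complement(s: str) -> str:
    """Flip every bit of a binary string."""
    return s.translate(str.maketrans("01", "10"))


def build(level: int):
    """Return 2^level binary strings of length 2^level such that, for every
    two distinct positions, exactly half of the strings agree on them."""
    if level < 1:
        raise ValueError("level must be at least 1")
    if level == 1:
        # Assumed base case: one string agrees on its two bits, one does not.
        return ["00", "01"]
    result = []
    for s in build(level - 1):
        result.append(s + s)              # keeps agreement across the halves
        result.append(s + complement(s))  # flips agreement across the halves
    return result
```

For two positions in the same half, the property carries over from the previous level. For one position in each half, exactly one of the two strings derived from each s agrees on the pair.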

We show that the strings constructed by Algorithm 1 fulfill our requirement.

Proposition 1.

Let be the sequence of strings constructed by Algorithm 1. Then, for each two distinct bits  the following two statements hold.

  1. There are strings from : , such that , .

  2. There are strings from : , such that , .

Proof.

We show the statement via induction on . For , Algorithm 1 returns . Our two statements follow immediately. Assume that sequence  satisfies the proposition. We show that also satisfies the proposition. By Algorithm 1, we have .

Consider two arbitrary bits . Obviously, by our induction assumption, the two statements hold if or . Thus, we assume that and (the other case when and is symmetric). By construction, consists of the strings and , . To show the two statements, it suffices if we can show that “” This is equivalent to “ if and only if ” which is obvious. ∎

We reduce from an NP-hard variant of the 3-SAT problem, called Not-All-Equal 3-SAT (NAE-3SAT) [5], which given a set of size-three clauses asks whether there is a satisfying truth assignment such that each clause has at least one true literal and at least one false literal.
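The not-all-equal condition is easy to state in code. The following sketch checks an assignment and, for tiny instances, searches for one by brute force. The encoding of literals as non-zero integers (positive for a variable, negative for its negation) is an assumption made for illustration and does not come from the paper.

```python
from itertools import product


def nae_satisfied(clauses, assignment) -> bool:
    """True if every clause has at least one true and at least one false literal.

    clauses: iterable of tuples of non-zero ints (assumed encoding).
    assignment: dict mapping variable index (1-based) to bool.
    """
    for clause in clauses:
        values = [assignment[abs(lit)] == (lit > 0) for lit in clause]
        if all(values) or not any(values):
            return False
    return True


def find_nae_assignment(num_vars: int, clauses):
    """Exhaustive search over all assignments; only meant for small instances."""
    for bits in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        if nae_satisfied(clauses, assignment):
            return assignment
    return None
```

Note that negating every variable of a not-all-equal satisfying assignment again yields such an assignment, which mirrors the fact that a binary string and its complement have the same Mirkin distances.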

Theorem 1.

Mirkin Distance Minimization is NP-hard.

Proof.

As mentioned, we reduce from the NP-hard NAE-3SAT problem [5]. Let be an instance of NAE-3SAT, where denotes the set of variables and denotes a set of clauses of size three each. Without loss of generality, assume that for some . We construct two groups of binary strings where each string is of length . Variables will be encoded by pairs of two consecutive bits in the string, one at an odd position and one at an even position. We use the gadget constructed via Algorithm 1 to enforce that these two bits will always have the same value so that will correspond to setting the variable to true while will correspond to setting the variable to false.

To this end, given two binary strings  and , and an integer  with , by we mean inserting the string  into at the position . For instance, . In particular, and .

Group 1.

Let . Then, for each integer  (representing the index of a specific variable) we introduce strings as follows. For each string , construct two strings with the forms  and . Note that each of these newly constructed strings has length . Let denote the sequence that contains all these newly introduced strings.

Group 2.

For each clause  let be the three literals contained in . We define three strings  as follows.

Let . For instance, for clause , the three corresponding strings are

Let . The instance  consists of the following strings: For each , add copies of to . For each , add to . This completes the construction, which can clearly be done in polynomial time. (Note that takes time.)

We claim that the instance  has a satisfying truth assignment such that each clause has a true literal and a false literal if and only if there is a binary string  that has a Mirkin distance of at most  to the strings from .

Before we show the correctness of the construction, we present two observations which will help us to determine the solution string for .

Claim 1.

Let be an arbitrary binary string of length . For each integer , the following holds. If , then . If , then .

Proof.

By the construction of (Proposition 1), we have the following.

  • For each pair  we have

    1. strings  from such that , and

    2. strings  from such that .

    This means that the Mirkin distance from to regarding the pair  is always .

  • For each bit , strings from have a in column  and strings from have a in column . Thus, the Mirkin distance from to regarding the pair  (resp. ) is also .

  • The Mirkin distance from to regarding the pair  is if ; otherwise it is zero.

In total, we have

Define by .

Claim 2.

Let be an arbitrary clause. Then for each , we have that , and the equality is attained if and only if the string , interpreted as a truth assignment to the variables , satisfies with at least one true literal and at least one false literal.

Proof.

Assume, without loss of generality, that the literals in correspond to the first, the second, and the third variable (each in either a positive or a negative form). For each string  with , by the definition of the Hamming distance, . By the definition of regarding the positions from to , we have that .

Assume that satisfies with the literal being true and the literal being false, and . Let . We distinguish two cases. If is true under , then while . If is false under , then while .

Using the alternative definition of the Mirkin distance, we have that

and that . Altogether, we have .

Assume that under either all literals from are true or all literals from are false. For the first case, for each , we have , implying . For the other case, for each , we have , implying . Altogether, we have . ∎

Now we are ready to show the equivalence between and , i.e. admits a truth assignment such that each clause in has a true literal and a false literal if and only if there is a string  whose Mirkin distance to the strings in is at most .

For the “only if” direction, assume that is a satisfying assignment for such that each clause  has at least one true literal and at least one false literal. Claim 2 indicates that has Mirkin distance to each triple in that corresponds to the clause . The second statement in Claim 1 indicates that has Mirkin distance to all strings in that correspond to the variable . Altogether, the Mirkin distance between and all strings in is .

For the “if” direction, assume that is a string whose Mirkin distance to all strings in is at most . We claim that has the form  with for all . Suppose, towards a contradiction, that is not of the desired form, and let be an integer such that . Then, by the first statement in Claim 1, the Mirkin distance of to the first group of strings in will be at least which exceeds our distance bound  since —a contradiction.

Thus, has the form  with for all . We show that is a satisfying assignment for such that each clause has at least one true literal and at least one false literal. By the above reasoning, the Mirkin distance of to the second group of strings can be at most . Since there are  triples in the second group, one for each clause, the average Mirkin distance of to each triple is . By Claim 2 the Mirkin distance of to each triple in the second group is indeed , meaning that under  each clause has at least one true literal and one false literal. ∎

As a corollary, we obtain a running time lower bound for our problem.

Corollary 1.

Unless the Exponential Time Hypothesis fails, no $2^{o(n)} \cdot |I|^{O(1)}$-time algorithm exists for any instance $I$ of Mirkin Distance Minimization, where $n$ is the length of the input strings.

Proof.

To show the statement, note that the length of the strings that we constructed in the proof of Theorem 1 is exactly , where is the number of variables in the NAE-3SAT instance. Thus, if we can show that, assuming the Exponential Time Hypothesis, NAE-3SAT does not admit a -time algorithm, where is an NAE-3SAT instance with variables, then our result follows.

Since we are not aware of any reference that explicitly states such a running time lower bound for NAE-3SAT, we prove this by providing a simple reduction from 3SAT. 3SAT is known not to admit any sub-exponential time algorithm unless the Exponential Time Hypothesis fails [8]. Let be an instance of 3SAT, where denotes the set of variables and denotes a set of clauses of size three each. We construct an instance  of NAE-3SAT as follows. The variable set  of consists of all variables from , and  new variables , , and , i.e. . For each clause  of let to unify the notation. For each clause , we introduce to the following two clauses  and with

This completes the construction which can be carried out in linear time. We claim that admits a satisfying truth assignment  if and only if admits a satisfying truth assignment  such that each clause in has at least one true literal and at least one false literal.

For the “only if” direction, assume that is a satisfying truth assignment for . It is straightforward to verify that the following truth assignment is a satisfying truth assignment for such that each clause has at least one true literal and one false literal.

For the “if” direction, assume that is a satisfying truth assignment for such that each clause in has at least one true literal and one false literal. We claim that the following truth assignment  is a satisfying assignment for .

Suppose, for the sake of contradiction, that there is a clause  which is not satisfied by . Let , and be the three literals in . Since is not satisfied by , it follows that . We distinguish two cases and show in each case a contradiction.

  • is a positive literal, implying that . Since is satisfied (contains either a true or a false literal), it follows that . However, since is satisfied, it follows that —a contradiction.

  • is a negative literal, say , implying that and . Again, since is satisfied, it follows that . However, since is satisfied, it follows that —a contradiction.

We have shown the correctness of our construction. Now, observe that our constructed instance  has in total variables. Hence, a -time algorithm for NAE-3SAT would imply a -time algorithm for 3SAT, which is unlikely unless the Exponential Time Hypothesis fails [8]. In summary, this proves our running time lower bound statement for the Mirkin Distance Minimization problem. ∎
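To make the flavor of such a reduction concrete, the sketch below implements a standard textbook reduction from 3SAT to NAE-3SAT. It is not necessarily the construction used in the proof above. It adds one fresh variable per clause plus one global variable `z`, and replaces each clause (a, b, c) by the two clauses (a, b, w) and (¬w, c, z). The integer encoding of literals and all names are assumptions made for illustration. Like the construction in the proof, it produces a number of variables that is linear in the number of variables plus clauses of the 3SAT instance.

```python
from itertools import product


def evaluate(literal: int, assignment) -> bool:
    """Truth value of a literal (non-zero int, negative means negated)."""
    return assignment[abs(literal)] == (literal > 0)


def is_satisfiable(num_vars: int, clauses, nae: bool = False) -> bool:
    """Brute-force check; with nae=True every clause also needs a false literal."""
    for bits in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        ok = True
        for clause in clauses:
            values = [evaluate(lit, assignment) for lit in clause]
            if not any(values) or (nae and all(values)):
                ok = False
                break
        if ok:
            return True
    return False


def three_sat_to_nae(num_vars: int, clauses):
    """Standard reduction: returns (new_num_vars, nae_clauses)."""
    z = num_vars + len(clauses) + 1  # global fresh variable
    nae_clauses = []
    for j, (a, b, c) in enumerate(clauses, start=1):
        w = num_vars + j  # fresh variable for clause j
        nae_clauses.append((a, b, w))
        nae_clauses.append((-w, c, z))
    return z, nae_clauses
```

Correctness follows the same pattern as the argument in the proof: a satisfying assignment is extended by setting `z` to false and choosing each `w` according to whether a or b is true, and conversely a not-all-equal assignment can be assumed to set `z` to false by complementing all values.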

3 An Integer Linear Program (ILP) Formulation

In this section, we show that minimizing the Mirkin distance is fixed-parameter tractable with respect to the number  of input strings. To achieve this, we formulate our problem as an integer linear program with the number  of variables upper-bounded by , each corresponding to a pair of column types (to be defined shortly), and with a polynomial number of constraints. By the results of Lenstra [10] and Kannan [9], we immediately have that our problem is solvable in time , where denotes the length of the binary encoding of the input strings. We note that integer programming approaches similar to ours are applicable to many string problems whenever the columns of the input can be grouped together in order to be represented by a constant number of variables [7, 2]. The resulting mathematical programming formulation is not linear at first. We need additional tricks: we reformulate it so that squares of binary variables can be safely omitted, and we introduce some extra variables to avoid multiplications of binary variables.

Before presenting the formulation, we observe a useful property of an optimal solution that allows us to introduce only binary variables, one for each column type. Herein, given a non-empty sequence  of length- strings, we say that two columns  have the same type if for each it holds that . The type of a column is its equivalence class in the same-type relation. Thus, each type is represented by a binary vector with one entry per input string.

Lemma 1.

Let be a sequence of strings, each of length , and let be a solution with minimum Mirkin distance to . If two distinct columns  and with have the same type, then it holds that .

Proof.

Towards a contradiction, suppose that . We will show that making these two columns have the same bit, either zero or one, will result in a better solution. Let (resp. ) be a string that we obtain from by replacing with (resp. ) the bits at positions  and . Formally, we have and , and for each , we have . Given two strings  and , we define a function  that computes the Mirkin distance from to subtracted by the Mirkin distance between to :

To obtain a contradiction, we show that is not an optimal solution by showing that

because this implies that

For each input string , let denote the Hamming distance between and , restricted to the columns that are neither nor . We show that .

By our reasoning before, this implies that is not an optimal solution, a contradiction. ∎
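One way to carry out the calculation behind this proof is sketched below. The notation is assumed for illustration: $t$ is the supposed optimal solution with different bits in columns $i$ and $j$, $t^{0}$ and $t^{1}$ are the two modified strings, $n$ is the string length, $h$ denotes the Hamming distance, $d_s$ is the restricted Hamming distance defined above, and the Mirkin distance is used in its Hamming form $h\,(n-h)$.

```latex
% Assumed notation; columns i and j have the same type, so s[i] = s[j].
% Since t[i] \neq t[j], exactly one of the two columns of t disagrees with s,
% while t^0 and t^1 disagree with s on none or on both of them.
\begin{align*}
h(t, s) &= d_s + 1, \qquad
\{\, h(t^{0}, s),\; h(t^{1}, s) \,\} = \{\, d_s,\; d_s + 2 \,\}, \\
\operatorname{mirkin}(t^{0}, s) + \operatorname{mirkin}(t^{1}, s)
  &= d_s\,(n - d_s) + (d_s + 2)(n - d_s - 2) \\
  &= 2\,(d_s + 1)(n - d_s - 1) - 2 \\
  &= 2 \operatorname{mirkin}(t, s) - 2 .
\end{align*}
```

Summing over all strings of a non-empty input sequence, the total distances of $t^{0}$ and $t^{1}$ add up to strictly less than twice the total distance of $t$, so at least one of the two modified strings is strictly better than $t$.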

By Lemma 1, for each two distinct types of columns, we only need to store whether the output string will have the same value at columns that correspond to these two types. Let denote the number of different (column) types in . Then, . Enumerate the column types as . Below we identify a column type with its index for easier notation. Using this, we can encode the set  succinctly by introducing a constant  for each column type  that denotes the number of columns with type . Analogously, given an optimal solution string , by Lemma 1 we can also encode this string  via a binary vector , where for each column type  we use to indicate whether the columns that correspond to the type have zeros or ones. Note that this encodes all essential information in a solution, since the actual order of the columns is not important.
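The succinct encoding can be illustrated with a small sketch that groups columns by type and searches only over strings that are constant on each type. By Lemma 1 this restricted search finds an optimal solution, which the comparison with an exhaustive search illustrates on small inputs. The function names and the example instance used in the check are hypothetical and not taken from the paper.

```python
from itertools import product


def mirkin(x: str, y: str) -> int:
    """Mirkin distance via the Hamming distance d: d * (n - d)."""
    d = sum(a != b for a, b in zip(x, y))
    return d * (len(x) - d)


def total_cost(t: str, strings) -> int:
    return sum(mirkin(t, s) for s in strings)


def column_types(strings):
    """Type of every column (the tuple of its bits over all rows), in column order."""
    n = len(strings[0])
    return [tuple(s[j] for s in strings) for j in range(n)]


def best_type_constant(strings):
    """Search only strings that use one bit per column type."""
    cols = column_types(strings)
    types = sorted(set(cols))
    best = None
    for bits in product("01", repeat=len(types)):
        value = dict(zip(types, bits))
        t = "".join(value[c] for c in cols)
        cost = total_cost(t, strings)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


def best_overall(strings):
    """Exhaustive search over all strings, for comparison on small inputs."""
    n = len(strings[0])
    candidates = ("".join(bits) for bits in product("01", repeat=n))
    return min(((t, total_cost(t, strings)) for t in candidates), key=lambda p: p[1])
```

The restricted search enumerates one bit per column type instead of one bit per column, so its size depends only on the number of distinct types, which is bounded in terms of the number of input strings.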

Example 1.

For an illustration, let . The set  has two different column types, represented by , call it type , and , call it type . There are three columns of type  and one column of type . An optimal solution  with minimum Mirkin distance four for can be encoded by two binary variables  and .

Integer Linear Program Formulation.

Using the binary variables  that represent a solution  that has the same values in the columns of the same type, we can reformulate the Hamming distance between the two strings  and as follows. For the sake of simplicity, we let if the column type of column  has one in the row and if it has zero in the row.

Then the Mirkin distance between and can be formulated as follows, where denotes the number of ones in string  and , i.e.  if and if .
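A possible shape of the resulting program is sketched below. All symbols are assumptions introduced for illustration: $k$ input strings $s_1, \dots, s_k$ of length $n$, $T$ column types, $c_j$ the number of columns of type $j$, $b_{i,j}$ the bit of type $j$ in row $i$, $x_j$ the solution bit on columns of type $j$, and $y_{j,j'}$ an extra binary variable standing for the product $x_j x_{j'}$. The linearization constraints are the standard ones for products of binary variables and may differ in detail from the formulation in the paper.

```latex
\begin{align*}
\sigma_{i,j} &= 1 - 2\, b_{i,j} \in \{+1, -1\}, \qquad
o_i = \sum_{j=1}^{T} c_j\, b_{i,j} \quad (\text{number of ones in } s_i), \\
h_i &= \sum_{j=1}^{T} c_j \bigl( b_{i,j}\,(1 - x_j) + (1 - b_{i,j})\, x_j \bigr)
     = o_i + \sum_{j=1}^{T} \sigma_{i,j}\, c_j\, x_j, \\
\text{minimize} \quad & \sum_{i=1}^{k} \bigl( n\, h_i - h_i^{2} \bigr), \quad \text{where} \\
h_i^{2} &= o_i^{2} + 2\, o_i \sum_{j} \sigma_{i,j}\, c_j\, x_j
        + \sum_{j} c_j^{2}\, x_j
        + 2 \sum_{j < j'} \sigma_{i,j}\, \sigma_{i,j'}\, c_j\, c_{j'}\, y_{j,j'}, \\
& y_{j,j'} \le x_j, \qquad y_{j,j'} \le x_{j'}, \qquad y_{j,j'} \ge x_j + x_{j'} - 1,
  \qquad x_j,\; y_{j,j'} \in \{0, 1\}.
\end{align*}
```

The term $c_j^{2} x_j$ uses $x_j^{2} = x_j$ for binary variables, and each $y_{j,j'}$ corresponds to a pair of column types, so the number of variables depends only on the number of types.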