1 Clustering Aggregation for Binary Strings
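The problem can be formalized as follows. We aim to find a length-$m$ binary string $s^*$ that minimizes the Mirkin distance to $n$ input binary strings $s_1, \dots, s_n$ of length $m$. The Mirkin distance [11] between two strings $s$ and $s'$ of length $m$ counts the pairs of positions on which the two strings mismatch, that is, pairs whose two bits are equal in one string and different in the other. Formally, writing $s[i]$ for the $i$-th bit of $s$, $\mathrm{Mirkin}(s, s') = |\{(i, j) : 1 \le i < j \le m,\ (s[i] = s[j]) \not\Leftrightarrow (s'[i] = s'[j])\}|$. The Mirkin distance of a string $s$ to a sequence $S = (s_1, \dots, s_n)$ of strings is the sum of the Mirkin distances between $s$ and each string in the sequence: $\mathrm{Mirkin}(s, S) = \sum_{i=1}^{n} \mathrm{Mirkin}(s, s_i)$.
The Mirkin distance has an alternative definition that uses Hamming distances. Writing $\mathrm{hd}(s, s')$ for the Hamming distance between $s$ and $s'$, we have $\mathrm{Mirkin}(s, s') = \mathrm{hd}(s, s') \cdot (m - \mathrm{hd}(s, s'))$.
Note that by the above formulation, the Mirkin distance objective function is not convex. The formal statement of the problem is as follows:
Mirkin Distance Minimization
Input: A set $S$ of binary strings $s_1, \dots, s_n$, each of length $m$, and an integer $k$.
Question: Is there a string $s^*$ of length $m$ such that $\mathrm{Mirkin}(s^*, S) \le k$?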
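To make the two views of the distance concrete, here is a minimal Python sketch. The function names are illustrative and not taken from the paper. The Hamming-based form relies on the observation that a pair of positions is counted exactly when the two strings differ at one position of the pair and agree at the other, so two length-m strings at Hamming distance d have Mirkin distance d(m - d).

```python
from itertools import combinations
from typing import Sequence


def hamming(s: str, t: str) -> int:
    """Number of positions at which s and t differ."""
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))


def mirkin_pairs(s: str, t: str) -> int:
    """Mirkin distance by definition: count position pairs whose two bits
    are equal in one string but different in the other."""
    assert len(s) == len(t)
    return sum(
        (s[i] == s[j]) != (t[i] == t[j])
        for i, j in combinations(range(len(s)), 2)
    )


def mirkin(s: str, t: str) -> int:
    """The same distance computed from the Hamming distance d as d * (m - d)."""
    d = hamming(s, t)
    return d * (len(s) - d)


def mirkin_to_sequence(s: str, strings: Sequence[str]) -> int:
    """Sum of the Mirkin distances from s to every string in the sequence."""
    return sum(mirkin(s, t) for t in strings)
```

The quadratic-time pair count is only there as a cross-check; the Hamming-based form is the one a practical implementation would use.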
Notations.
For two binary strings and , let denote the concatenation of and and let denote the complement of string . By we mean the value of the bit of string and we write as shorthand of . Given two integers with , we use the notation to denote the substring .
Our contributions.
Our main result in this paper is a tight running time bound on the Mirkin Distance Minimization problem. Specifically, we show that the problem cannot be solved in time unless the Exponential Time Hypothesis (ETH) fails. While the upper bound in this result is not very difficult, the lower bound uses an intricate construction, which shows that the trivial brute-force algorithm for the problem cannot be substantially improved. In the second part of the paper, we show that the problem is fixed-parameter tractable with respect to the number of input strings, via an integer linear programming (ILP) approach.
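For reference, the trivial brute-force algorithm mentioned above can be sketched as follows. This is an illustrative Python version with hypothetical function names, using the Hamming-based form d(m - d) of the Mirkin distance. It tries all 2^m candidate strings, so it is only usable for very short inputs.

```python
from itertools import product


def mirkin(s, t):
    """Mirkin distance of two equal-length binary strings via d * (m - d)."""
    d = sum(a != b for a, b in zip(s, t))
    return d * (len(s) - d)


def brute_force(strings):
    """Try every binary string of length m and return (best distance, best string).

    Ties are broken in favour of the lexicographically first candidate.
    """
    m = len(strings[0])
    best = None
    for bits in product("01", repeat=m):
        candidate = "".join(bits)
        total = sum(mirkin(candidate, t) for t in strings)
        if best is None or total < best[0]:
            best = (total, candidate)
    return best
```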
Related work.
Mirkin Distance Minimization is a special variant of the NP-hard Clustering Aggregation [6] problem (also known as Consensus Clustering [4] or Cluster Ensembles [12]) from machine learning and bioinformatics. The problem has as input a set of partitions on a set of elements and we search for a target partition that minimizes the Mirkin distances to all partitions. Herein, a partition on the set is an equivalence relation (i.e. a binary relation that is reflexive, symmetric, and transitive). Thus, each partition can be represented by the equivalence classes of the corresponding equivalence relation. The Mirkin distance between two partitions is defined as the number of pairs of elements which are equivalent in one partition but nonequivalent in the other, or the other way round. A partition with at most two equivalence classes can also be expressed as a binary string. Thus, it is straightforward to see that our problem is equivalent to Clustering Aggregation for Binary Strings, i.e. both the input and the output partitions are binary strings. Mirkin Distance Minimization has further applications in voting theory and is also studied under the name of Binary Relation Aggregation [1, 13, 14], which is related to a concept in voting known as the median relation [1]. Dörnfelder et al. [3] showed that Mirkin Distance Minimization is NP-hard under Turing reductions. We will show in this note that the problem is NP-hard by providing a many-one reduction, which also implies that the trivial brute-force algorithm for the problem cannot be substantially improved.
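The correspondence between partitions with at most two classes and binary strings can be checked on small cases. In the following sketch (helper names are illustrative), a binary string is read as the partition that groups the positions by their bit value, and the pair-counting distance on partitions is compared with the pair-counting distance on strings.

```python
from itertools import combinations


def partition_of(s):
    """Partition of positions 0..m-1 into at most two classes, one per bit value."""
    zeros = frozenset(i for i, b in enumerate(s) if b == "0")
    ones = frozenset(i for i, b in enumerate(s) if b == "1")
    return [c for c in (zeros, ones) if c]


def together(partition, i, j):
    """True if elements i and j lie in the same class of the partition."""
    return any(i in c and j in c for c in partition)


def mirkin_partitions(p, q, m):
    """Pairs of elements that are equivalent in exactly one of the two partitions."""
    return sum(
        together(p, i, j) != together(q, i, j)
        for i, j in combinations(range(m), 2)
    )


def mirkin_strings(s, t):
    """Pairs of positions whose bits are equal in exactly one of the two strings."""
    return sum(
        (s[i] == s[j]) != (t[i] == t[j])
        for i, j in combinations(range(len(s)), 2)
    )
```

Note that a string and its complement describe the same partition, which is why their Mirkin distance is zero.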
Very recently, we [2] considered a related problem, Norm Hamming Centroid, which searches for a centroid string which minimizes the norm of its Hamming distances to the input strings, for each fixed . When the objective is to maximize instead of minimize the distances and when , the Mirkin Distance Minimization problem can be reduced to this maximization variant.
2 NP-hardness for Sum of Mirkin Distances
We show that Mirkin Distance Minimization is indeed NP-hard by utilizing a gadget that Dörnfelder et al. [3] used to enforce that for each two bits, when restricted to only these two bits, exactly half of the input strings have the same value ($00$ or $11$) and the other half have different values ($01$ or $10$). Algorithm 1 computes such a gadget. Note that each output string has length . However, this type of gadget alone is not enough to devise a many-one hardness reduction. The gadget can be used to encode truth values of variables in a reduction from 3SAT, but an essential difficulty that remains is to find gadgets that encode clause satisfaction.
We show that the strings constructed by Algorithm 1 fulfill our requirement.
Proposition 1.
Let be the sequence of strings constructed by Algorithm 1. Then, for each two distinct bits the following two statements hold.

There are strings from : , such that , .

There are strings from : , such that , .
Proof.
We show the statement via induction on . For , Algorithm 1 returns . Our two statements follow immediately. Assume that sequence satisfies the proposition. We show that also satisfies the proposition. By Algorithm 1, we have .
Consider two arbitrary bits . Obviously, by our induction assumption, the two statements hold if or . Thus, we assume that and (the other case when and is symmetric). By construction, consists of the strings and , . To show the two statements, it suffices if we can show that “” This is equivalent to “ if and only if ” which is obvious. ∎
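The balance property of Proposition 1 can be illustrated with a small doubling construction. Algorithm 1 itself is not reproduced in this text, so the sketch below is an assumption: it starts from the hypothetical base case ("00", "01") and, matching the induction step described in the proof, replaces every string by its concatenation with itself and with its complement. The function names are illustrative.

```python
def complement(s):
    """Flip every bit of a binary string."""
    return "".join("1" if b == "0" else "0" for b in s)


def balanced_gadget(rounds):
    """Hypothetical doubling construction (not necessarily identical to Algorithm 1)."""
    strings = ["00", "01"]  # assumed base case
    for _ in range(rounds):
        # Each string s yields s followed by s, and s followed by its complement.
        strings = [t for s in strings for t in (s + s, s + complement(s))]
    return strings


def is_balanced(strings):
    """Check that, for every pair of columns, exactly half of the strings
    carry equal bits in the two columns."""
    m = len(strings[0])
    half = len(strings) // 2
    return all(
        sum(s[i] == s[j] for s in strings) == half
        for i in range(m)
        for j in range(i + 1, m)
    )
```

For a pair of columns inside one half, the counts simply double. For a pair that straddles the two halves, exactly one of the two strings derived from a given string has equal bits, which is the argument used at the end of the proof.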
We reduce from an NP-hard variant of the 3SAT problem, called Not-All-Equal 3SAT (NAE3SAT) [5], which, given a set of size-three clauses, asks whether there is a truth assignment such that each clause has at least one true literal and at least one false literal.
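The not-all-equal condition is easy to state in code. The following sketch assumes a DIMACS-style encoding chosen here for illustration: a literal is a nonzero integer, where v stands for variable v and -v for its negation, and an assignment is a dictionary from variable indices to Booleans.

```python
def nae_satisfied(clauses, assignment):
    """Return True if every clause has at least one true and at least one false literal.

    clauses: iterable of tuples of nonzero ints (assumed encoding, see above).
    assignment: dict mapping each variable index to True or False.
    """
    for clause in clauses:
        values = [assignment[abs(lit)] == (lit > 0) for lit in clause]
        if all(values) or not any(values):
            return False
    return True
```

A direct consequence of the definition is that flipping every variable of a valid assignment yields another valid assignment.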
Theorem 1.
Mirkin Distance Minimization is NP-hard.
Proof.
As mentioned, we reduce from the NP-hard NAE3SAT problem [5]. Let be an instance of NAE3SAT, where denotes the set of variables and denotes a set of clauses of size three each. Without loss of generality, assume that for some . We construct two groups of binary strings where each string is of length . Variables will be encoded by pairs of two consecutive bits in the string, one at an odd position and one at an even position. We use the gadget constructed via Algorithm 1 to enforce that these two bits will always have the same value so that will correspond to setting the variable to true while will correspond to setting the variable to false.
To this end, given two binary strings and , and an integer with , by we mean inserting the string into at the position . For instance, . In particular, and .
 Group 1.

Let . Then, for each integer (representing the index of a specific variable) we introduce strings as follows. For each string , construct two strings with the forms and . Note that each of these newly constructed strings has length . Let denote the sequence that contains all these newly introduced strings.
 Group 2.

For each clause let be the three literals contained in . We define three strings as follows.
Let . For instance, for clause , the three corresponding strings are
Let . The instance consists of the following strings: For each , add copies of to . For each , add to . This completes the construction, which can clearly be done in polynomial time. (Note that takes time.)
We claim that the instance has a satisfying truth assignment such that each clause has a true literal and a false literal if and only if there is a binary string that has a Mirkin distance of at most to the strings from .
Before we show the correctness of the construction, we present two observations which will help us to determine the solution string for .
Claim 1.
Let be an arbitrary binary string of length . For each integer , the following holds. If , then . If , then .
Proof.
By the construction of (Proposition 1), we have the following.

For each pair we have

strings from such that , and

strings from such that .
This means that the Mirkin distance from to regarding the pair is always .


For each bit , strings from have a in column and strings from have a in column . Thus, the Mirkin distance from to regarding the pair (resp. ) is also .

The Mirkin distance from to regarding the pair is if ; otherwise it is zero.
In total, we have
∎
Define by .
Claim 2.
Let be an arbitrary clause. Then for each , we have that , and the equality is attained if and only if the string , interpreted as a truth assignment to the variables , satisfies with at least one true literal and at least one false literal.
Proof.
Assume, without loss of generality, that the literals in correspond to the first, the second, and the third variable (each in either a positive or a negative form). For each string with , by the definition of the Hamming distance, . By the definition of regarding the positions from to , we have that .
Assume that satisfies with the literal being true and the literal being false, and . Let . We distinguish two cases. If is true under , then while . If is false under , then while .
Using the alternative definition of the Mirkin distance, we have that
and that . Altogether, we have .
Assume that under either all literals from are true or all literals from are false. For the first case, for each , we have , implying . For the other case, for each , we have , implying . Altogether, we have . ∎
Now we are ready to show the equivalence between and , i.e. admits a truth assignment such that each clause in has a true literal and a false literal if and only if there is a string whose Mirkin distance to the strings in is at most .
For the “only if” direction, assume that is a satisfying assignment for such that each clause has at least one true literal and at least one false literal. Claim 2 indicates that has Mirkin distance to each triple in that corresponds to the clause . The second statement in Claim 1 indicates that has Mirkin distance to all strings in that correspond to the variable . Altogether, the Mirkin distance between and all strings in is .
For the “if” direction, assume that is a string whose Mirkin distance to all strings in is at most . We claim that has the form with for all . Suppose, towards a contradiction, that is not of the desired form, and let be an integer such that . Then, by the first statement in Claim 1, the Mirkin distance of to the first group of strings in will be at least which exceeds our distance bound since —a contradiction.
Thus, has the form with for all . We show that is a satisfying assignment for such that each clause has at least one true literal and at least one false literal. By the above reasoning, the Mirkin distance of to the second group of strings can be at most . Since there are triples in the second group, one for each clause, the average Mirkin distance of to each triple is . By Claim 2 the Mirkin distance of to each triple in the second group is indeed , meaning that under each clause has at least one true literal and one false literal. ∎
As a corollary, we obtain a running time lower bound for our problem.
Corollary 1.
Unless the Exponential Time Hypothesis fails, no time algorithm exists for any instance of Mirkin Distance Minimization where is the length of the input strings.
Proof.
To show the statement, note that the length of the strings that we constructed in the proof of Theorem 1 is exactly , where is the number of variables in the NAE3SAT instance. Thus, if we can show that, assuming the Exponential Time Hypothesis, NAE3SAT does not admit a time algorithm, where is an NAE3SAT instance with variables, then our result follows.
Since we are not aware of any reference that explicitly states such a running time lower bound for NAE3SAT, we prove this by providing a simple reduction from 3SAT. 3SAT is known not to admit any subexponential time algorithm unless the Exponential Time Hypothesis fails [8]. Let be an instance of 3SAT, where denotes the set of variables and denotes a set of clauses of size three each. We construct an instance of NAE3SAT as follows. The variable set of consists of all variables from , and new variables , , and , i.e. . For each clause of let to unify the notation. For each clause , we introduce to the following two clauses and with
This completes the construction which can be carried out in linear time. We claim that admits a satisfying truth assignment if and only if admits a satisfying truth assignment such that each clause in has at least one true literal and at least one false literal.
For the “only if” direction, assume that is a satisfying truth assignment for . It is straightforward to verify that the following truth assignment is a satisfying truth assignment for such that each clause has at least one true literal and one false literal.
For the “if” direction, assume that is a satisfying truth assignment for such that each clause in has at least one true literal and one false literal. We claim that the following truth assignment is a satisfying assignment for .
Suppose, for the sake of contradiction, that there is a clause which is not satisfied by . Let , and be the three literals in . Since is not satisfied by , it follows that . We distinguish two cases and show in each case a contradiction.

is a positive literal, implying that . Since is satisfied (contains either a true or a false literal), it follows that . However, since is satisfied, it follows that —a contradiction.

is a negative literal, say , implying that and . Again, since is satisfied, it follows that . However, since is satisfied, it follows that —a contradiction.
We have shown the correctness of our construction. Now, observe that our constructed instance has in total variables. Hence, a time algorithm for NAE3SAT would imply a time algorithm for 3SAT, which is unlikely unless the Exponential Time Hypothesis fails [8]. In summary, this proves our running time lower bound statement for the Mirkin Distance Minimization problem. ∎
3 An Integer Linear Program (ILP) Formulation
In this section, we show that minimizing the Mirkin distance is fixed-parameter tractable with respect to the number of input strings. To achieve this, we formulate our problem as an integer linear program with the number of variables upper-bounded by , each corresponding to a pair of column types (to be defined shortly), and with a polynomial number of constraints. By the results of Lenstra [10] and Kannan [9], we immediately have that our problem is solvable in time , where denotes the length of the binary encoding of the input strings. We note that integer programming approaches similar to ours are applicable to many string problems whenever the columns of the input can be grouped together in order to be represented by a constant number of variables [7, 2]. The resulting mathematical programming formulation is not linear at first. We need two additional tricks: we reformulate it so that we can safely omit the squares of binary variables, and we introduce some extra variables to avoid multiplications of binary variables.
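Both tricks are standard for 0/1 programs. The square of a binary variable can be dropped because x · x = x whenever x is 0 or 1, and a product of two binary variables x and y can be replaced by a fresh binary variable z subject to z ≤ x, z ≤ y, and z ≥ x + y - 1. The sketch below checks both facts exhaustively. Whether the formulation in this section uses exactly these constraints is an assumption, and the names are illustrative.

```python
from itertools import product


def feasible_z(x, y):
    """All binary values z allowed by the linear constraints that stand in for z = x * y."""
    return [z for z in (0, 1) if z <= x and z <= y and z >= x + y - 1]


for x, y in product((0, 1), repeat=2):
    # Squares of binary variables are redundant.
    assert x * x == x
    # The three linear constraints leave exactly one feasible z, namely the product.
    assert feasible_z(x, y) == [x * y]
```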
Before presenting the formulation, we observe a useful property of an optimal solution that allows us to introduce only binary variables, one for each column type. Herein, given a nonempty sequence of length strings, we say that two columns have the same type if for each it holds that . The type of column is its equivalence class in the same-type relation. Thus, each type is represented by a vector in .
Lemma 1.
Let be a sequence of strings, each of length , and let be a solution with minimum Mirkin distance to . If two distinct columns and with have the same type, then it holds that .
Proof.
Towards a contradiction, suppose that . We will show that making these two columns have the same bit, either zero or one, will result in a better solution. Let (resp. ) be a string that we obtain from by replacing with (resp. ) the bits at positions and . Formally, we have and , and for each , we have . Given two strings and , we define a function that computes the difference between the Mirkin distance from to and the Mirkin distance from to :
To obtain a contradiction, we show that is not an optimal solution by showing that
because this implies that
For each input string , let denote the Hamming distance between and , restricted to the columns that are neither nor . We show that .
By our reasoning before, this implies that is not an optimal solution, a contradiction. ∎
By Lemma 1, for each two distinct types of columns, we only need to store whether the output string will have the same value at columns that correspond to these two types. Let denote the number of different (column) types in . Then, . Enumerate the column types as . Below we identify a column type with its index for easier notation. Using this, we can encode the set succinctly by introducing a constant for each column type that denotes the number of columns with type . Analogously, given an optimal solution string , by Lemma 1 we can also encode this string via a binary vector , where for each column type we use to indicate whether the columns that correspond to the type have zeros or ones. Note that this encodes all essential information in a solution, since the actual order of the columns is not important.
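The type-based encoding can be tried out on a small instance. The sketch below, with illustrative function names, groups the columns of the input by type, and then compares the optimum over all strings with the optimum over strings that are constant on each column type, which is the restriction that Lemma 1 justifies. It uses the Hamming-based form d(m - d) of the Mirkin distance.

```python
from collections import Counter
from itertools import product


def mirkin(s, t):
    """Mirkin distance of two equal-length binary strings via d * (m - d)."""
    d = sum(a != b for a, b in zip(s, t))
    return d * (len(s) - d)


def column_types(strings):
    """Map each column type (one bit per input string) to its number of columns."""
    return Counter(zip(*strings))


def best_over_all_strings(strings):
    """Minimum total Mirkin distance over all 2^m candidate strings."""
    m = len(strings[0])
    return min(
        sum(mirkin("".join(c), t) for t in strings)
        for c in product("01", repeat=m)
    )


def best_over_type_vectors(strings):
    """Minimum over candidates that assign one bit to each column type."""
    columns = list(zip(*strings))
    types = sorted(set(columns))
    best = None
    for bits in product("01", repeat=len(types)):
        value = dict(zip(types, bits))
        candidate = "".join(value[col] for col in columns)
        total = sum(mirkin(candidate, t) for t in strings)
        best = total if best is None else min(best, total)
    return best
```

For the three input strings 0001, 0001, and 0111 there are three column types with one, two, and one columns respectively, and both searches return the value 4.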
Example 1.
For an illustration, let . The set has two different column types, represented by , call it type , and , call it type . There are three columns of type and one column of type . An optimal solution with minimum Mirkin distance four for can be encoded by two binary variables and .
Integer Linear Program Formulation.
Using the binary variables that represent a solution that has the same values in the columns of the same type, we can reformulate the Hamming distance between the two strings and as follows. For the sake of simplicity, we let if the column type of column has one in the row and if it has zero in the row.
Then the Mirkin distance between and can be formulated as follows, where denotes the number of ones in string and , i.e. if and if .
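As a sanity check of this reformulation, the following sketch computes the Hamming distances from the type-level encoding alone and then derives the Mirkin distance as d(m - d). The names are illustrative: `x` is a hypothetical dictionary that maps each column type to the bit chosen for all columns of that type, and a column type is stored as a tuple holding one bit per input string.

```python
from collections import Counter


def hamming_by_types(strings, x):
    """Hamming distance from the solution encoded by x to each input string."""
    counts = Counter(zip(*strings))
    dists = []
    for i in range(len(strings)):
        # A type t contributes all of its counts[t] columns whenever the bit
        # chosen for t differs from the bit that row i has in columns of type t.
        dists.append(sum(c for t, c in counts.items() if x[t] != t[i]))
    return dists


def mirkin_by_types(strings, x):
    """Total Mirkin distance of the encoded solution to all input strings."""
    m = len(strings[0])
    return sum(d * (m - d) for d in hamming_by_types(strings, x))
```

Only the number of columns per type and the chosen bit per type enter the computation, which is what allows the program to use a number of variables that depends on the number of input strings but not on their length.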