# Approximating minimum representations of key Horn functions

Horn functions form a subclass of Boolean functions and appear in many different areas of computer science and mathematics as a general tool to describe implications and dependencies. Finding minimum sized representations for such functions with respect to most commonly used measures is a computationally hard problem that remains hard even for the important subclass of key Horn functions. In this paper we provide logarithmic factor approximation algorithms for key Horn functions with respect to all measures studied in the literature for which the problem is known to be hard.

## Authors


## 1 Introduction

A Boolean function of $n$ variables is a mapping from $\{0,1\}^n$ to $\{0,1\}$. Boolean functions naturally appear in many areas of mathematics and computer science and constitute a principal concept in complexity theory. In this paper we study an important problem connected to Boolean functions, the so-called Boolean minimization problem, which aims at finding a shortest possible representation of a given Boolean function. The formal statement of the Boolean minimization problem (BM) of course depends on (i) how the input function is represented, (ii) how it is represented on the output, and (iii) how the output size is measured.

One of the most common representations of Boolean functions is the conjunctive normal form (CNF), a conjunction of clauses, each of which is an elementary disjunction of literals. There are two usual ways to measure the size of a CNF: the number of clauses and the total number of literals (the sum of clause lengths). It is easy to see that BM is NP-hard if both the input and the output are CNFs (for both of the above measures of the output size). This is an easy consequence of the fact that BM contains the CNF satisfiability problem (SAT) as a special case (an unsatisfiable formula can be trivially recognized from its shortest CNF representation). In fact, BM was shown to be in this case probably harder than SAT: while SAT is NP-complete (i.e. $\Sigma_1^p$-complete), BM is $\Sigma_2^p$-complete (see also the review paper for related results). It was also shown that BM is $\Sigma_2^p$-complete when considering Boolean functions represented by general formulas of constant depth as both the input and output for BM.

Horn functions form a subclass of Boolean functions which plays a fundamental role in constructive logic and computational logic. They are important in automated theorem proving and relational databases. An important feature of Horn functions is that SAT is solvable for this class in linear time. A CNF is Horn if every clause in it contains at most one positive literal, and it is pure Horn (or definite Horn in some literature) if every clause in it contains exactly one positive literal. A Boolean function is (pure) Horn, if it admits a (pure) Horn CNF representation. Pure Horn functions represent a very interesting concept which was studied in many areas of computer science and mathematics under several different names. The same concept appears as directed hypergraphs in graph theory and combinatorics, as implicational systems in artificial intelligence and database theory, and as lattices and closure systems in algebra and concept lattice analysis.

Consider a pure Horn CNF on variables , where stands for the negation of , etc. The equivalent directed hypergraph is with vertex set and directed hyperarcs . This latter can be expressed more concisely using adjacency lists (a generalization of adjacency lists for ordinary digraphs) in which all hyperarcs with the same body (also called source) are grouped together , or can be represented as an implicational (closure) system on variables defined by rules .

Interestingly, in each of these areas a problem similar to BM, i.e. the problem of finding a shortest equivalent representation of the input data (CNF, directed hypergraph, set of rules), was studied. For example, such a representation can be used to reduce the size of knowledge bases in expert systems, thus improving the performance of the system. The above examples show that the "natural" way to measure the size of the representation depends on the area. Six different measures and corresponding concepts of minimality were considered in [2, 12]: (B) number of bodies, (BA) body area, (TA) total area, (C) number of clauses, (BC) number of bodies and clauses, and (L) number of literals. For precise definitions, see Section 2. With a slight abuse of notation we shall use (B), (BA), (TA), (C), (BC) and (L) to denote both the measures and the corresponding minimization problems.

The only one of these six minimization problems for which a polynomial time procedure exists to derive a minimum representation is (B). The first such algorithm appeared in database theory literature . Different algorithms for the same task were then independently discovered in hypergraph theory , and in the theory of closure systems .

For the remaining five measures it is NP-hard to find the shortest representation. There is an extensive literature on intractability results in various contexts for these minimization problems [2, 18, 22]. It was shown that (C) and (L) stay NP-hard even when the inputs are limited to cubic (bodies of size at most two) pure Horn CNFs, and the same result extends to the remaining three measures. Note that if all bodies are of size one then the above problems become equivalent to the transitive reduction of directed graphs, which is tractable. It should be noted that there exist many other tractable subclasses, such as acyclic and quasi-acyclic pure Horn CNFs, and CQ Horn CNFs. There are also a few heuristic minimization algorithms for pure Horn CNFs.

It was shown that (C) and (L) are not only hard to solve exactly but even hard to approximate. More precisely,  shows that these problems are inapproximable within a factor assuming , where denotes the number of variables. In addition,  shows that they are inapproximable within a factor assuming even when the input is restricted to 3-CNFs with clauses, for some small . It is not difficult to see that the same proof extends to (BC) and (TA) as well. On the positive side, (C), (BC), (BA), and (TA) admit -approximations and (L) has an -approximation . To the best of our knowledge, no better approximations are known even for pure Horn 3-CNFs.

Given a relational database, a key is a set of attributes with the property that a value assignment to this set uniquely implies the values of all other attributes. The concept of a key is essential for standard database operations [23, 25]. Analogously, we say that a pure Horn function is key Horn if any of its bodies implies all other variables. A special case of key Horn functions, when all bodies are of size two, was considered under the name hydra functions in  where a -approximation algorithm was presented for (C). The problem is NP-hard for (C) as shown in , which implies NP-hardness also for (BC), (TA), and (L), while (B) and (BA) are trivial in this case.

In this paper we consider the minimization problems for key Horn functions. Any irredundant representation of a key Horn function has the same set of bodies, implying that problems (B) and (BA) are in P. We show that a simple algorithm gives a 2-approximation for (TA) and a -approximation for (C), (BC), and (L), where is the size of a largest body. Our paper contains two main results. The first one improves the -approximation bound for (C) and (BC) to in the case of key Horn functions. The second result improves the -approximation bound for (L) to . Table 1 summarizes the state of the art of Horn minimization and the results presented in this paper for key Horn functions.

The structure of our paper is as follows: Section 2 introduces the necessary definitions and notation, Section 3 provides lower bounds for the measures we introduced, Section 4 contains our results about approximation algorithms, while Section 5 discusses the relation to the problem of finding a minimum weight strongly connected subgraph.

## 2 Preliminaries

Let $V$ denote a set of variables. Members of $V$ are called positive while their negations are called negative literals. Throughout the paper, the number of variables is denoted by $n$. A Boolean function is a mapping $f\colon \{0,1\}^V \to \{0,1\}$. The characteristic vector of a set $Z \subseteq V$ is denoted by $\chi_Z$, that is, $\chi_Z(v) = 1$ if $v \in Z$ and $0$ otherwise. We say that a set $Z \subseteq V$ is a true set of $f$ if $f(\chi_Z) = 1$, and a false set otherwise.

For a subset and we write to denote the pure Horn clause . Here and are called the body and head of the clause, respectively. That is, a pure Horn CNF can be associated with a directed hypergraph where every clause is considered to be a directed hyperarc oriented from to . The set of bodies appearing in a CNF representation is denoted by . We will also use the notation to denote . By grouping the clauses with the same body, a pure Horn CNF can be represented as . The latter representation is in a one-to-one correspondence with the adjacency list representation of the corresponding directed hypergraph.

For any pure Horn function the family of its true sets is closed under taking intersection and contains . This implies that for any non-empty set there exists a unique minimal true set containing . This set is called the closure of and is denoted by . If is a pure Horn CNF representation of , then the closure can be computed in polynomial time by the following forward chaining procedure. Set . In a general step, if is a true set then we set . Otherwise, let denote the set of all variables for which there exists a clause of with and , and set . The result does not depend on the particular choice of the representation , but only on the underlying function , that is, .
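The forward chaining procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: clauses are encoded as `(body, head)` pairs, the function name `closure` is ours, and heads are added one at a time rather than in synchronized rounds, which yields the same final set.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical encoding: a pure Horn clause B -> v is the pair (B, v).
Clause = Tuple[FrozenSet[str], str]


def closure(clauses: List[Clause], start: Iterable[str]) -> Set[str]:
    """Return the forward chaining closure of `start` under `clauses`."""
    closed = set(start)
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            # A clause fires when its whole body is already derived.
            if head not in closed and body <= closed:
                closed.add(head)
                changed = True
    return closed


# Illustrative pure Horn CNF: a->b, b->a, ac->d, ac->e.
example = [
    (frozenset("a"), "b"),
    (frozenset("b"), "a"),
    (frozenset("ac"), "d"),
    (frozenset("ac"), "e"),
]
closure_ac = closure(example, "ac")
closure_b = closure(example, "b")
```

Each pass over the clause list either adds a variable or stops, so the loop runs at most $n$ times and the sketch is polynomial, in line with the claim in the text.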

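A pure Horn function is key Horn if it has a CNF representation of the form $\bigwedge_{B \in \mathcal{B}} B \to (V \setminus B)$ for some family $\mathcal{B} \subseteq 2^V$. We shall refer to this function as $\Psi_{\mathcal{B}}$. Note that the same set of functions is defined if we restrict $\mathcal{B}$ to be Sperner, that is, for any distinct $B, B' \in \mathcal{B}$ we have $B \not\subseteq B'$ and $B' \not\subseteq B$.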

Assume now that is a pure Horn CNF of the form where for . Note that the number of clauses in the CNF is . The size of the formula can be measured in different ways:

• (B) number of bodies: ,

• (BA) body area: ,

• (TA) total area: ,

• (C) number of clauses (i.e., hyperarcs): ,

• (BC) number of bodies and clauses: ,

• (L) number of literals: .

These measures come up naturally in connection with directed hypergraphs, implicational systems, and CNF representations. The Horn minimization problem is to find a representation that is equivalent to a given Horn formula and has minimum size with respect to where denotes one of the aforementioned functions.
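As a concrete reading of the six measures, the following sketch computes them for a CNF given in grouped (adjacency list) form. The encoding as a dictionary from bodies to head sets and the formulas in the comments are our assumptions, derived from the names of the measures; they are not copied from the paper's displayed definitions.

```python
from typing import Dict, FrozenSet


def measures(cnf: Dict[FrozenSet[str], FrozenSet[str]]) -> Dict[str, int]:
    """Size of a grouped pure Horn CNF {body: heads} under the six measures."""
    bodies = len(cnf)                                  # number of distinct bodies
    body_area = sum(len(b) for b in cnf)               # sum of body sizes
    clauses = sum(len(h) for h in cnf.values())        # one clause per (body, head)
    literals = sum((len(b) + 1) * len(h) for b, h in cnf.items())
    return {
        "B": bodies,
        "BA": body_area,
        "TA": body_area + clauses,   # each body and each head list counted once
        "C": clauses,
        "BC": bodies + clauses,
        "L": literals,               # a clause B -> v has |B| + 1 literals
    }


# Illustrative CNF: a->b, b->a, ac->{d, e}.
sizes = measures({
    frozenset("a"): frozenset("b"),
    frozenset("b"): frozenset("a"),
    frozenset("ac"): frozenset("de"),
})
```

In this example the three bodies carry four clauses, which gives ten literals in total.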

## 3 Lower bounds

The present section provides some simple reductions of the problem and lower bounds for the size of an optimal solution.

For a family , we denote by the family of minimal elements of . Recall that denotes the function defined by

$$\Psi_{\mathcal{B}} = \bigwedge_{B \in \mathcal{B}} B \to (V \setminus B). \tag{1}$$

###### Lemma 1

For any measure () and for any , there exists a -minimum representation of that uses exactly the bodies in .

###### Proof

Take a -minimum representation for which is as small as possible. First we show . Assume that . As is a false set of , there must be a clause in that is falsified by , implying that . Therefore there exists a such that . If we substitute every clause of by , then we get another representation of since is a clause of . Meanwhile, the size of the representation does not increase while decreases, contradicting the choice of .

Next we prove . If there exists a , then is a true set of while it is a false set of , contradicting the fact that is a representation of .∎

Lemma 1 has two implications. It suffices to consider Sperner hypergraphs defining key Horn functions as an input, and more importantly, it is enough to consider CNFs using bodies from the input Sperner hypergraph when searching for minimum representations. For non-key Horn functions, this is not the case.
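The first implication amounts to replacing the input family by its inclusion-wise minimal members. A minimal sketch of that preprocessing step, with the hypothetical helper name `minimal_elements`, could look as follows.

```python
from typing import FrozenSet, Iterable, Set


def minimal_elements(family: Iterable[Iterable[str]]) -> Set[FrozenSet[str]]:
    """Keep only the inclusion-wise minimal sets; the result is a Sperner family."""
    sets = {frozenset(s) for s in family}
    # `t < s` tests for a proper subset, so duplicates do not eliminate each other.
    return {s for s in sets if not any(t < s for t in sets)}


sperner = minimal_elements(["ab", "a", "bc", "abc"])
```

The quadratic scan is enough for illustration; nothing in the text requires a faster method.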

From now on we assume that is a Sperner family. We also assume that

$$\bigcup_{B \in \mathcal{B}} B = V \quad \text{and} \quad \bigcap_{B \in \mathcal{B}} B = \emptyset.$$

Indeed, if a variable is not covered by the bodies, then there must be a clause with head and body in in any minimum representation of , and actually one such clause suffices. Furthermore, if , then we can reduce the problem by deleting it. None of these reductions affects the approximability of the problem.

Recall that the size of the ground set is denoted by , while . The size of an optimal solution with respect to measure function is denoted by . Using these notations Lemma 1 has the following easy corollary:

###### Corollary 1

We have and . Therefore the minimization problems (B) and (BA) are solvable in polynomial time.∎

For the remaining measures we prove the following simple lower bound.

###### Lemma 2

for all measures , and for . Furthermore, , where is the size of a smallest body in .

###### Proof

By definition, is a lower bound for all the other measures, implying .

To see the second part, observe that is a lower bound for the three other measures. Therefore it suffices to prove . By the assumption that for every there exists a not containing , we can conclude by the fact that the closure and by the way the forward chaining procedure works that every CNF representation of must contain at least one clause with as its head. This implies .

To see the last part note that every variable is the head of at least one clause, the body of which is of at least size . Furthermore, since every body appears at least once and all clauses are of size at least , the claim follows. ∎

For a pair of sets, let denote the minimum -size of a CNF for which and , that is,

$$\mathrm{price}_*(S,T) = \min_{\Phi} \{\, |\Phi|_* \mid \mathcal{B}_\Phi \subseteq \mathcal{B},\ T \subseteq F_\Phi(S) \,\}. \tag{2}$$

The following lemma plays a key role in our approximability proofs.

###### Lemma 3

Let be a partition of and let for . Then

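$$\mathrm{OPT}_*(\mathcal{B}) \ge \sum_{i=1}^{q} \min \{\, \mathrm{price}_*(B_i, B) \mid B \in \mathcal{B} \setminus \mathcal{B}_i \,\} \tag{3}$$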

for all six measures .

###### Proof

Take a minimum representation with respect to which uses bodies only from . Such a representation exists by Lemma 1. We claim that the contribution of the clauses with bodies in to the total size of is at least for each . This would prove the lemma as the ’s form a partition of .

To see the claim, take an index and let be the first body (more precisely, one of the first bodies) not contained in that is reached by the forward chaining procedure from with respect to . Every clause that is used to reach from has its body in and their contribution to the size of the representation is lower bounded by , thus concluding the proof. ∎

## 4 Approximability results for (TA), (C), (BC), and (L)

Given a Sperner family , we can associate with it a complete directed graph by defining and . We refer to as the body graph of .

For any subset , define

$$\Phi_{E'} = \bigwedge_{(B,B') \in E'} B \to (B' \setminus B). \tag{4}$$

Note that if forms a strongly connected spanning subgraph of , then is a representation of . Let us add that not all representations arise this way, in particular, minimum representations might have significantly smaller size.

###### Lemma 4

If is a Hamiltonian cycle in , then defined in (4) provides a -approximation for all measures, where is an upper bound on the sizes of bodies in .

###### Proof

By Lemma 1, there exists a minimum representation of such that . Since is at most for all arcs , the statement follows. ∎

In fact, for (B) and (BA) (4) gives an optimal representation for any strongly connected spanning . Furthermore, if is a Hamiltonian cycle, we get a -approximation for (TA) based on the fact that the total area of any representation is lower bounded by .

###### Theorem 4.1

If is a Hamiltonian cycle in , then defined in (4) provides a -approximation for (TA).

###### Proof

The size of is . ∎
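The Hamiltonian cycle construction of (4) is easy to try out. The sketch below builds the cycle in the order in which the bodies are listed, checks by forward chaining that every body reaches all variables, and compares the total area with twice the body area. The function names and the example family are illustrative assumptions.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Body = FrozenSet[str]
Grouped = List[Tuple[Body, FrozenSet[str]]]  # (body, set of heads)


def cycle_representation(bodies: List[Body]) -> Grouped:
    """Clauses B_i -> (B_{i+1} minus B_i) along the cycle B_0, B_1, ..., B_0."""
    m = len(bodies)
    return [(bodies[i], bodies[(i + 1) % m] - bodies[i]) for i in range(m)]


def closure(cnf: Grouped, start: Iterable[str]) -> Set[str]:
    """Forward chaining closure of `start` under the grouped CNF."""
    closed = set(start)
    changed = True
    while changed:
        changed = False
        for body, heads in cnf:
            if body <= closed and not heads <= closed:
                closed |= heads
                changed = True
    return closed


def total_area(cnf: Grouped) -> int:
    return sum(len(body) + len(heads) for body, heads in cnf)


bodies = [frozenset("ab"), frozenset("bc"), frozenset("cd")]  # a Sperner family
phi = cycle_representation(bodies)
universe = set().union(*bodies)
body_area = sum(len(b) for b in bodies)
```

Since each head set is contained in the next body on the cycle, the head sizes sum to at most the body area, which is where the factor 2 for (TA) comes from.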

The observation that a strongly connected subgraph of the body graph corresponds to a representation of , as in (4), suggests the reduction of our problem to the problem of finding a minimum weight strongly connected spanning subgraph in a directed graph with arc-weight for . The optimum solution to this problem (MWSCS) is an upper bound for the minimum -size of a representation of . As there are efficient constant-factor approximations for MWSCS , this approach may look promising. There are however two difficulties: for measure (L), no polynomial time algorithm is known for computing ; even when it is efficiently computable (for measures (C) and (BC)), the upper bound obtained in this way may be off by a factor of from the optimum (see Section 5 for a construction).

In what follows, we overcome these difficulties. For (C), instead of a strongly connected spanning subgraph, we compute a minimum weight spanning in-arborescence and extend that to a representation of . The same approach works for (BC) as well. For (L), the situation is more complicated. First, we develop an efficient approximation algorithm for . Next, we compute a minimum weight spanning in-arborescence where its root is pre-specified. Finally, we extend the corresponding CNF to a representation of . We show that the cost of the arborescences built is at most a multiple of the optimum by a logarithmic factor, which in turn ensures the improved approximation factor.

### 4.1 Clause and body-clause minimum representations

In this section we consider (C) and (BC) and show that the simple algorithm described in Procedure 1 provides the stated approximation factor. We note that a minimum weight spanning in-arborescence of a directed graph can be found in polynomial time, see [10, 15].

[Procedure 1: algorithm listing not recoverable from the source.]

First we observe that is easy to compute.

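###### Lemma 5

$\mathrm{price}_{\mathrm{H}}(B,B') = |B' \setminus B|$ for $B, B' \in \mathcal{B}$, where the subscript H refers to the clause (hyperarc) measure (C).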

###### Proof

Take a pure Horn CNF attaining the minimum in (2). As every variable in is reached by the forward chaining procedure from with respect to , each such variable must be a head of at least one clause in . That is, contains at least clauses. On the other hand, uses exactly clauses, hence as stated. ∎

###### Lemma 6

Let denote a minimum -weight spanning in-arborescence in . Then

$$|\Phi_T|_{\mathrm{H}} \le \lceil \log k \rceil \, \mathrm{OPT}_{\mathrm{H}}(\mathcal{B}) + \max\{0,\, m-k\},$$

where is an upper bound on the sizes of bodies in .

###### Proof

We construct a subgraph of such that (i) it is a spanning in-arborescence, and (ii) . We start with the digraph on node set that has no arcs. In a general step of the algorithm, will denote the graph constructed so far. We maintain the property that is a branching, that is, a collection of node-disjoint in-arborescences spanning all nodes. In an iteration, for each such in-arborescence we choose an arc of minimum weight with respect to that goes from the root of the in-arborescence to some other component. We add these arcs to , and for each directed cycle created, we delete one of its arcs. This results in a graph with at most half the number of weakly connected components that has, all being in-arborescences. We repeat this until the number of components becomes at most . To reach this, we need at most iterations. Finally, we choose one of the roots of the components and add an arc from all the other roots to this one, obtaining a spanning in-arborescence .

It remains to show that also satisfies (ii). In the final stage, we add at most arcs to , which corresponds to at most clauses in . Now we bound the rest of . In iteration , components of define a partition . Let us denote by the body corresponding to the root of the arborescence with node-set . Let us consider the arcs chosen to be added in the th iteration. Now we obtain

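$$|\Phi_{T_{i+1} \setminus T_i}|_{\mathrm{H}} \le \sum_{j=1}^{q} \mathrm{price}_{\mathrm{H}}(B_j, B'_j) = \sum_{j=1}^{q} \min_{B \in \mathcal{B} \setminus \mathcal{B}_j} \mathrm{price}_{\mathrm{H}}(B_j, B) \le \mathrm{OPT}_{\mathrm{H}}(\mathcal{B}).$$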

The first inequality follows from the construction of . The equality follows from the criterion to choose the arcs to be added. The last inequality follows from Lemma 3. Since we have at most iterations, the lemma follows. ∎
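The iterative construction used in this proof can be sketched as follows. This is an illustration of the proof's argument, not Procedure 1 itself: arcs are stored as one out-pointer per node, the arc weight is taken to be the size of the target body minus the source body (the clause count of the corresponding implication), and the stopping threshold `target` is a free parameter of the sketch.

```python
from typing import FrozenSet, List, Optional


def find_root(out: List[Optional[int]], i: int) -> int:
    """Follow out-arcs until a node without an out-arc (a root) is reached."""
    while out[i] is not None:
        i = out[i]
    return i


def build_in_arborescence(bodies: List[FrozenSet[str]], target: int = 1) -> List[Optional[int]]:
    """Merge components round by round; out[i] = j encodes the arc (B_i, B_j)."""
    m = len(bodies)
    out: List[Optional[int]] = [None] * m
    roots = list(range(m))
    while len(roots) > max(1, target):
        comp = [find_root(out, i) for i in range(m)]
        # Each root picks a cheapest arc leaving its own component.
        choice = {
            r: min((j for j in range(m) if comp[j] != r),
                   key=lambda j, r=r: len(bodies[j] - bodies[r]))
            for r in roots
        }
        for r, j in choice.items():
            out[r] = j
        # Every new cycle passes through a former root; cut one arc per cycle.
        for r in roots:
            seen, i = set(), r
            while i is not None and i not in seen:
                seen.add(i)
                i = out[i]
            if i == r:
                out[r] = None
        roots = [i for i in range(m) if out[i] is None]
    # Final stage: hang the remaining roots below one of them.
    for r in roots[1:]:
        out[r] = roots[0]
    return out


bodies = [frozenset(s) for s in ("ab", "bc", "cd", "de", "ea")]
arcs = build_in_arborescence(bodies)
```

Each round merges every component with at least one other, so the number of components at least halves, which is the source of the logarithmic number of iterations in the lemma.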

###### Theorem 4.2

For key Horn functions, there exists a polynomial time -approximation algorithm for (C) and (BC), where is an upper bound on the sizes of bodies in .

###### Proof

We first show that provided by Procedure 1 is a -approximation for (C) and (BC). Note that is a subformula of defined by (1) since all bodies in are from . Furthermore, by our construction, for all . This implies that the output represents . Using Lemma 6 and the fact that we added clauses to in Step 2, we obtain

$$|\Phi|_{\mathrm{H}} \le \lceil \log k \rceil \, \mathrm{OPT}_{\mathrm{H}}(\mathcal{B}) + \max\{0,\, m-k\} + n.$$

By Lemma 2, this gives a -approximation, while setting gives a -approximation. By Lemma 1, . Since , the same approximation ratios as above follow for (BC) as well.

Finally, Lemma 4 provides a different CNF that is a -approximation for (C) and (BC). ∎

### 4.2 Literal minimum representations

In this section we consider (L). The first difficulty that we have to overcome is that, unlike in the case of (C) and (BC), no polynomial time algorithm is known for computing . To circumvent this, we give an -approximation algorithm for for any pair of sets . Note that if does not contain a body then , hence we assume that this is not the case.

We first analyze the structure of a pure Horn CNF attaining the minimum in (2) for (L). Starting the forward chaining procedure from with respect to , let denote the set of variables reached within the first steps. That is, . We choose in such a way that is as small as possible. Let be a smallest body in for and set .

###### Proposition 1

for .

###### Proof

Suppose to the contrary that for some . By the definition of forward chaining, every variable is reached through a clause where . Now substitute each such clause by . As , the size of the CNF does not increase. However, the number of steps in the forward chaining procedure decreases by at least one, contradicting the choice of . Finally, would contradict the minimality of . ∎

Proposition 1 immediately implies that .

###### Proposition 2

for .

###### Proof

Let be the smallest index that violates the condition. Take an arbitrary variable . Then is reached in the th step of the forward chaining procedure from a body of size at least . If we substitute this clause by , the resulting CNF still satisfies but has smaller size by , contradicting the minimality of . ∎

By Proposition 2, . Define

$$\Phi^{(1)} := \bigwedge_{i=0}^{t-1} B_i \to \Bigl( B_{i+1} \setminus \bigl( S \cup \bigcup_{j=1}^{i} B_j \bigr) \Bigr).$$

Observe that has a simple structure which is based on a linear order of bodies .

###### Proposition 3

$|\Phi^{(1)}|_{\mathrm{L}} = |\Phi|_{\mathrm{L}}$.

###### Proof

Take an arbitrary variable for some . By the observation above, . This means that has at least one clause entering , say , for which and so . However, has exactly one clause entering , namely . This implies that , and equality holds by the minimality of . ∎

The proposition implies that also realizes . We know no efficient algorithms to compute , thus, using the next two propositions, we define a CNF that approximates well and can be computed efficiently.

Let and for let denote the smallest index for which . Let be the largest value for which exists and set . Now define

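$$\Phi^{(2)} := \bigwedge_{j=0}^{r-1} B_{i_j} \to \Bigl( B_{i_{j+1}} \setminus \bigl( S \cup \bigcup_{\ell=1}^{j} B_{i_\ell} \bigr) \Bigr).$$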

It is easy to see that .

###### Proposition 4

$|\Phi^{(2)}|_{\mathrm{L}} \le 2\,|\Phi^{(1)}|_{\mathrm{L}}$.

###### Proof

Take an arbitrary variable for some . Then both and contain a single clause entering . Namely, is reached from in and from in . By the definition of the sequence , we get , concluding the proof. ∎

Although gives a -approximation for , it is not clear how we could find such a representation. Define

$$\Phi^{(3)} := \bigwedge_{j=0}^{r-1} B_{i_j} \to \bigl( B_{i_{j+1}} \setminus ( S \cup B_{i_j} ) \bigr).$$

The only difference between and is that we add unnecessary clauses to the representation. However, the next claim shows that the size of the formula cannot increase a lot.

###### Proposition 5

$|\Phi^{(3)}|_{\mathrm{L}} \le \frac{27}{17}\,|\Phi^{(2)}|_{\mathrm{L}}$.

###### Proof

Take an arbitrary variable that appears as the head of a clause in the representation . Let be the smallest index for which . Then contains a single clause entering , namely . On the other hand, the set contains all the clauses of that enter . By the definition of the sequence , we get . We get at most this many extra literals in on top of the literals in . As for , the statement follows. ∎

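By Propositions 3, 4 and 5,

$$|\Phi^{(3)}|_{\mathrm{L}} \le \frac{27}{17}\,|\Phi^{(2)}|_{\mathrm{L}} \le \frac{54}{17}\,|\Phi^{(1)}|_{\mathrm{L}} = \frac{54}{17}\,|\Phi|_{\mathrm{L}}. \tag{5}$$
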
###### Lemma 7

There exists an efficient algorithm to construct a CNF such that , , and .

###### Proof

We consider an extension of the body graph by adding to . We also define arc-weights by setting for . Let be a smallest body contained in (as defined before Proposition 1). Compute a shortest path from to and define

$$\Lambda(S,S') = \bigwedge_{(B,B') \in P} B \to \bigl( B' \setminus (S \cup B) \bigr). \tag{6}$$

Note that, by definition, is the weight of the shortest path , while is the length of one of the paths from to . By (5), . That is, provides a -approximation for as required, finishing the proof of the lemma. ∎
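The shortest path step can be illustrated with a small sketch. Several details are assumptions made for the example rather than statements from the paper: the target is taken to be one of the bodies, the weight of an arc is the literal count of the clause group it contributes in (6), and Dijkstra's algorithm is our choice of shortest path method. The names `shortest_body_path` and `arc_weight` are hypothetical.

```python
import heapq
from typing import FrozenSet, List, Optional, Tuple


def shortest_body_path(bodies: List[FrozenSet[str]], S: FrozenSet[str],
                       target: int) -> Optional[Tuple[int, List[int]]]:
    """Cheapest path of body indices from a smallest body inside S to `target`."""
    inside = [i for i, b in enumerate(bodies) if b <= S]
    if not inside:
        return None  # S contains no body, so nothing can be derived from it
    start = min(inside, key=lambda i: len(bodies[i]))

    def arc_weight(i: int, j: int) -> int:
        # Assumed weight: literals of the clauses B_i -> v for v in B_j minus (S u B_i).
        return (len(bodies[i]) + 1) * len(bodies[j] - (S | bodies[i]))

    m = len(bodies)
    dist = [float("inf")] * m
    prev: List[Optional[int]] = [None] * m
    dist[start] = 0
    heap = [(0, start)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue  # stale heap entry
        if i == target:
            break
        for j in range(m):
            if j != i and d + arc_weight(i, j) < dist[j]:
                dist[j] = d + arc_weight(i, j)
                prev[j] = i
                heapq.heappush(heap, (dist[j], j))
    path, node = [], target
    while node is not None:
        path.append(node)
        node = prev[node]
    return int(dist[target]), path[::-1]


bodies = [frozenset("abc"), frozenset("d"), frozenset("efg")]
result = shortest_body_path(bodies, frozenset("abc"), target=2)
```

In the example, the detour through the small body `{d}` costs 4 + 6 = 10 literals, whereas the direct arc would cost 12, so short intermediate bodies can pay off under measure (L).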

We prove that the algorithm described in Procedure 2 provides the stated approximation factor for (L). We note that a minimum weight spanning in-arborescence of a directed graph rooted at a fixed node can be found in polynomial time, see [10, 15]. Let be a smallest body in and denote . We define the weight of an arc in the body graph to be .

[Procedure 2: algorithm listing not recoverable from the source.]

Choose a smallest body in and let . Set for .

###### Lemma 8

Let denote a minimum -weight spanning in-arborescence in such that is rooted at . Then

where is the size of a largest body in .

###### Proof

We construct a subgraph of such that (i) it is a spanning in-arborescence, and (ii) . We start with the directed graph on node set that has no arcs. In a general step of the algorithm, will denote the graph constructed so far. We maintain the property that is a branching, that is, a collection of node-disjoint in-arborescences spanning all nodes. In an iteration, for each such in-arborescence we choose an arc of minimum weight with respect to that goes from the root of the in-arborescence to some other component. We add these arcs to , and for each directed cycle created, we delete one of its arcs. This results in a graph with at most half the number of weakly connected components that has, all being in-arborescences. We repeat this until the number of components becomes at most . To reach this, we need at most iterations. Finally, we add an arc from all the other roots to and delete all the arcs leaving , obtaining a spanning in-arborescence rooted at .

It remains to show that also satisfies (ii). In the final stage, we add at most arcs to whose total weight is upper bounded by , where the last inequality follows by Lemma 2. Now we bound the rest of . In iteration , components of define a partition . Let us denote by the body corresponding to the root of the arborescence with node-set . Let us consider the arcs chosen to be added in the th iteration. Now we obtain

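$$\Bigl| \bigwedge_{(B,B') \in T_{i+1} \setminus T_i} \Lambda(B,B') \Bigr|_{\mathrm{L}} = \sum_{j=1}^{q} w(B_j, B'_j) = \sum_{j=1}^{q} \min_{B \in \mathcal{B} \setminus \mathcal{B}_j} w(B_j, B) \le \frac{54}{17} \sum_{j=1}^{q} \min_{B \in \mathcal{B} \setminus \mathcal{B}_j} \mathrm{price}_{\mathrm{L}}(B_j, B) \le \frac{54}{17}\, \mathrm{OPT}_{\mathrm{L}}(\mathcal{B}),$$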

where the first and second inequalities follow by Lemmas 7 and 3, respectively. Since we have at most iterations, the lemma follows. ∎

###### Theorem 4.3

For key Horn functions, there exists a polynomial time -approximation algorithm for (L), where is the size of a largest body in .

###### Proof

We first show that provided by Procedure 2 is a -approximation for (L). Note that is a subformula of defined by (1) since all bodies in are from . Furthermore, by our construction, for all . This implies that the output represents . By Lemma 2, we add at most literals to in Step 4. This, together with Lemma 8, implies the theorem.∎

## 5 Clause minimization and minimum weight strongly connected subgraphs

Given a strongly connected graph and non-negative weights , we denote by the problem of finding a minimum weight subset of the arcs such that is also strongly connected. We denote by the weight of such a minimum weight arc subset. is an NP-hard problem, for which polynomial time approximation algorithms are known. For the case of uniform weights a -approximation was given by Khuller et al. For general weights a simple -approximation is due to Frederickson and JáJá. Note that in the case of general weights, we can assume that is a complete directed graph.

As it was observed already in the beginning of Section 4, there is a natural relation of the above problem to the minimization of a key Horn function. Let us consider a Sperner hypergraph and the corresponding Horn function

$$h_{\mathcal{B}} = \bigwedge_{B \in \mathcal{B}} B \to (V \setminus B). \tag{7}$$

The body graph of was a complete directed graph where . Define a weight function on the arcs of this graph by setting for all , , where is defined in (2). Then any solution of problem defines a representation of :

$$\Phi(F) = \bigwedge_{(B,B') \in E(G_{\mathcal{B}})} \Phi_*(B,B'), \tag{8}$$

where is a formula for which , and . It is immediate to see that holds. Thus, it is natural to expect that a polynomial time approximation of problem provides also a good approximation for . This however turns out to be false for the case of .

Let us recall first some basic facts on finite projective spaces from the book . The finite projective space of dimension over a finite field