# Percolation and Phase Transition in SAT

Erdös and Rényi proved in 1960 that a drastic change occurs in a large random graph when the number of edges is half the number of nodes: a giant connected component emerges. This was the origin of percolation theory, where statistical mechanics and mean field techniques are applied to study the behavior of graphs and other structures when we remove edges randomly. In the 1990s, the study of random SAT instances started. It was proved that in large random 2-SAT instances a drastic change also occurs when the number of clauses is equal to the number of variables: the formula almost surely changes from satisfiable to unsatisfiable. The same effect, but at distinct clause/variable ratios, was detected in k-SAT and other random models. In this paper we study the relation between both phenomena, and establish a condition that allows us to easily find the phase transition threshold in several models of 2-SAT formulas. In particular, we prove the existence of a phase transition threshold in scale-free random 2-SAT formulas.

## 1 Introduction

Over the last 20 years, SAT solvers have experienced a great improvement in their efficiency when solving practical SAT problems. This is the result of techniques like conflict-driven clause learning (CDCL), restarting and clause deletion policies. The success of SAT solvers is surprising if we take into account that SAT is an NP-complete problem and that, in fact, a large percentage of formulas need exponential-size resolution proofs to be shown unsatisfiable. This has led some researchers to study which features of real-world or industrial SAT instances make them easy in practice. In parallel, most theoretical work on SAT has focused on uniformly random instances. Nevertheless, we now know that most industrial instances share some properties that are not present in most uniformly random SAT formulas. It is also well known that solvers that perform well on industrial instances do not perform well on random instances, and vice versa. Therefore, a new theoretical paradigm that describes the distribution of industrial instances is needed. Not surprisingly, generating random instances that are more similar to real-world instances is described as one of the ten grand challenges in satisfiability Selman et al. (1997); Selman (2000); Kautz and Selman (2003, 2007).

Over the last 10 years, the analysis of the industrial SAT instances used in SAT solver competitions has given us a clear picture of the structure of real-world instances. Williams et al. (2003) proved that industrial instances contain a small number of variables (called the backdoor of the formula) that, when instantiated, make the formula easy to solve. Ansótegui et al. (2008) showed that industrial instances have a smaller tree-like resolution space complexity (also called hardness Beyersdorff and Kullmann (2014)) than randomly generated instances with the same number of variables. Ansótegui et al. (2009a) proved that most industrial instances, when represented as a graph, have a scale-free structure. This kind of structure has also been observed in other real-world networks like the World Wide Web, the Internet, some social networks like paper co-authorship or citation networks, protein interaction networks, etc. Ansótegui et al. (2012, 2016) show that these graph representations of industrial instances exhibit a very high modularity. Modularity has been shown to be correlated with the runtime of CDCL SAT solvers Newsham et al. (2014), and has been used to improve the performance of some solvers Ansótegui et al. (2015); Sonobe et al. (2014); Martins et al. (2013). It is also known that these graph representations are self-similar Ansótegui et al. (2014) and that eigenvector centrality is correlated with the significance of variables Katsirelos and Simon (2012).

Defining a model that captures all these properties observed in industrial instances is a hard task. Here, we focus on the scale-free structure. We will define a model and propose a random generator for scale-free SAT formulas, extending our work presented in CCIA’07 Ansótegui et al. (2007), CCIA’08 Ansótegui et al. (2008), IJCAI’09 Ansótegui et al. (2009b) and CP’09 Ansótegui et al. (2009a). This model is parametric in the size $k$ of clauses and an exponent $\beta$. Formulas are sets of independently sampled clauses of size $k$, with possible repetitions. Clauses are sets of independently sampled variables, without repetitions, where variable $i$ is chosen with probability proportional to $i^{-\beta}$, and negated with probability $1/2$. In this paper, we also study the SAT-UNSAT phase transition phenomenon in this new model using percolation techniques from statistical mechanics. We prove that random scale-free formulas over variables, exponent and clauses of size are unsatisfiable with high probability (see Theorem 5). This means that, for big enough values of $\beta$, the number of clauses needed to make a formula unsatisfiable is sub-linear in the number of variables, contrary to the standard random SAT model. We also prove that scale-free random 2-SAT formulas with exponent and a ratio of clause/variables are also unsatisfiable with high probability (see Theorem 4). This last result, together with a coincident lower bound found by Friedrich et al. (2017b), allows us to conclude that scale-free random 2-SAT formulas show a SAT-UNSAT phase transition threshold.

During the revision of this article, many new results related to the phase transition on scale-free random formulas have been found. Friedrich et al. (2017a) generalize the notion of scale-free random k-SAT formulas and prove that there exists an asymptotic satisfiability threshold (in the sense of Friedgut (1998)) for , when the number of clauses is linear in the number of variables. Friedrich and Rothenberger (2018) find sufficient conditions for the sharpness of this threshold, generalizing Friedgut (1998)’s result for uniform random formulas. Friedrich and Rothenberger (2019) generalize the notion of scale-free formula to the notion of non-uniform random formula, only assuming that variable is selected with probability (where ) and determine the position of the threshold for . Cooper et al. (2007) and Omelchenko and Bulatov (2019) analyze the configuration model for 2-SAT where, instead of fixing the probability of every variable, they fix the degree of every variable. If these degrees follow a power-law distribution, the location of the satisfiability threshold (for ) is the same as in our model.

This article proceeds as follows. In Section 2 we review some methods to generate scale-free random graphs. One of these methods is the basis of the definition of scale-free random formulas, introduced in Section 3. In Section 4, we summarize some properties of industrial or real-world SAT instances, described in detail in our work presented at CP’09 Ansótegui et al. (2009a). We prove the existence of a SAT-UNSAT phase transition phenomenon in scale-free random 2-SAT instances in Section 5. This is done using percolation techniques. In Section 6, we prove that when the parameter that regulates the scale-free structure of formulas exceeds a certain value, the SAT-UNSAT phase transition phenomenon vanishes, and most formulas become unsatisfiable due to small cores of unsatisfiable clauses.

## 2 Generation of Scale-Free Graphs

Generating scale-free formulas has an obvious relationship with the generation of scale-free graphs. In this section we review some graph generation methods developed by researchers on complex networks.

A scale-free graph is a graph where node degrees follow a power-law distribution , at least asymptotically, where exponent is around . Preferential attachment Barabási and Albert (1999) has been proposed as the natural process that makes scale-free networks so prevalent in nature. This process can be used to generate scale-free graphs as follows. Given two numbers and , we start at time with a clique of size where all nodes have degree (in the limit when tends to infinity, the starting graph is not relevant). Then, at every time , we add a new node (with index ), connected to distinct and older nodes , such that the probability that a node gets a connection to this new node is proportional to the degree of at time . This process generates a scale-free graph with asymptotic exponent , average node degree and minimum degree , for all nodes. We can also prove that the expected degree of node is , for small values of , where .
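The preferential attachment process described above can be sketched in a few lines. This is only an illustrative sketch: the function name `preferential_attachment` and the parameters `n` (final number of nodes) and `m` (edges added per new node) are hypothetical names, and distinct targets are obtained by simple rejection, which is an assumption rather than a detail stated in the text.

```python
import random

def preferential_attachment(n, m, seed=None):
    """Grow a graph on nodes 0..n-1; each new node links to m distinct older nodes."""
    rng = random.Random(seed)
    # Initial clique on m + 1 nodes, so every starting node has degree m.
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list picks a node with probability proportional to its degree.
    endpoints = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # redraw until m distinct older nodes are chosen
            targets.add(rng.choice(endpoints))
        for old in targets:
            edges.append((old, new))
            endpoints.extend((old, new))
    return edges
```

Keeping one list entry per edge endpoint avoids recomputing degrees at every step, at the cost of memory linear in the number of edges.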

In order to explain the origin of scale-free networks where , several models have been proposed Dorogovtsev and Mendes (2003). One of these models is based on the aging of nodes Dorogovtsev and Mendes (2000). This means that the probability of a node (created at instant ) to get a new edge at instant is proportional to the product of its degree and , where is the age of the node. This model generates scale-free graphs when . When , the exponents of the power-laws and are and , respectively. Therefore, the value of may be used to tune the values of and .

In the previous methods, growth in the number of nodes is essential. There are other methods, usually called static, where the number of nodes is fixed from the beginning and during the process we only add edges.

The simplest method, assuming uniform probability for all graphs with a scale-free degree distribution in the degree of nodes, is the configuration method, that can be implemented as follows.

Given a desired number of nodes $n$ and exponent $\gamma$, for every node $i$, generate a degree $k_i$ following the probability $P(k)=k^{-\gamma}/\zeta(\gamma)$, independently of $i$. Here, $\zeta$ is the Riemann zeta function. Then, generate a graph with these node degrees, ensuring that all of them are generated with the same probability. This can be done, for instance, with an unfold-fold process. In the unfolding, we replicate node $i$, with degree $k_i$, into $k_i$ new nodes with degree $1$. Then, we randomly generate a graph where all nodes have degree equal to one, ensuring that all 1-regular graphs with $\sum_i k_i$ nodes are generated with the same probability. Then, in the folding, we merge the nodes that came from the replication of $i$ into the same node. When there is an edge between two nodes and these two nodes are merged, a self-loop is created. Similarly, when two edges end up joining the same pair of merged nodes, a duplicated edge is created. Therefore, we reject the resulting graph if it contains self-loops or multiple edges between the same pair of nodes. Alternatively, we can also apply the Erdös-Rényi generation method to the unfolded set of nodes, with average node degree equal to one. In this latter case, we would ensure that after folding, node $i$ has a degree close to $k_i$, since in the Erdös-Rényi model, node degrees follow a binomial distribution (a Poisson distribution in the infinite limit, with mean equal to the average degree, which is one in our case).
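A minimal sketch of the unfold-fold process with rejection, assuming the degree of every node is already given as a list. The name `configuration_graph` and the `max_tries` cutoff are hypothetical choices made for illustration; sampling the degrees themselves is left out.

```python
import random

def configuration_graph(degrees, seed=None, max_tries=1000):
    """Return a sorted edge list of a simple graph with the given degrees, or None."""
    rng = random.Random(seed)
    # Unfolding: node i becomes degrees[i] copies of degree one.
    stubs = [i for i, d in enumerate(degrees) for _ in range(d)]
    if len(stubs) % 2:
        raise ValueError("the sum of degrees must be even")
    for _ in range(max_tries):
        # A uniform shuffle paired off consecutively is a uniform 1-regular graph.
        rng.shuffle(stubs)
        edges = set()
        simple = True
        for a, b in zip(stubs[0::2], stubs[1::2]):
            edge = (min(a, b), max(a, b))
            # Folding: a == b is a self-loop, a repeated pair is a multiple edge.
            if a == b or edge in edges:
                simple = False
                break
            edges.add(edge)
        if simple:
            return sorted(edges)
    return None  # every attempt was rejected
```

For heavy-tailed degree sequences most attempts are rejected, which is the inefficiency the text attributes to this method.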

The previous method has two problems. First, the resulting graph (after the unfold-fold process) will have average node degree equal to

$$ E[k]=\sum_{k=1}^{\infty}k\,P(k)=\sum_{k=1}^{\infty}k\,\frac{k^{-\gamma}}{\zeta(\gamma)}=\frac{\zeta(\gamma-1)}{\zeta(\gamma)} $$

If we want to obtain a graph with a distinct average degree, we have to modify the probability for small values of , and ensure that follows a power-law distribution only asymptotically for big values of . In other words, we only require to follow a heavy-tail distribution. Second, a great fraction of generated graphs will contain self-loops or multiple-edges after folding. This means that a great fraction of graphs will be rejected, which makes the method inefficient. However, the model can be useful to translate some properties of the Erdös-Rényi model to scale-free graphs via the unfolding-folding process and the configuration model Bender and Canfield (1978); Bollobás (2001).

The unfolding-folding procedure was described by Aiello et al. (2000). They, instead of assigning a random degree to each node, describe a model where, given two parameters and ,222In the original paper, authors use the name instead of . we choose a random graph (with uniform probability, and allowing self-cycles) among all graphs satisfying that the number of nodes with degree is . When , the average node degree in this model is also .

Alternatively, instead of fixing the degree of every node, we can fix the expected degree of every node . In order to construct a graph where nodes have this expected degree , we only need to generate edge with probability . If we want to generate a scale-free graph where , for sparse graphs, it suffices to fix  Chung and Lu (2002); Goh et al. (2001) (see also Theorem 1).
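As an illustration of the fixed-expected-degree idea, here is a sketch in the style of the Chung and Lu model. The edge probability $w_i w_j / \sum_k w_k$ used below is the usual form of that model and is an assumption here, since the exact expression is not legible in this text; the function name `expected_degree_graph` is also hypothetical.

```python
import random

def expected_degree_graph(weights, seed=None):
    """Generate each edge {i, j} independently with probability w_i * w_j / sum(w)."""
    rng = random.Random(seed)
    total = sum(weights)
    n = len(weights)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # Assumed Chung-Lu style probability, capped at 1 for very heavy nodes.
            p = min(1.0, weights[i] * weights[j] / total)
            if rng.random() < p:
                edges.append((i, j))
    return edges
```

This direct version inspects every pair of nodes, so it takes quadratic time; it is meant to show the model, not to be an efficient generator.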

Our scale-free formula generation method is based on this static scale-free graph generation model with fixed expected node degrees. Basically, nodes are replaced by variables. Then, instead of edges, we generate hyper-edges. Negating every variable connected by a hyper-edge with probability $1/2$, we get clauses.

## 3 Scale-Free Random Formulas

In this section we describe the scale-free random SAT formulas model.

We consider -SAT formulas over variables, denoted by . A formula is a conjunction of possibly repeated clauses, represented as a multiset. Clauses are disjunctions of literals, noted , where every literal may be a variable or its negation . We identify with . We restrict clauses to not contain repeated occurrences of variables. This avoids simplifiable formulas like and tautologies like . In general, we represent every variable by its index, and negation as a minus, writing instead of , and instead of . In other words, a variable is a number in , and a literal a number in distinct from zero. We use the notation to denote either or . The number of occurrences of literal in a formula is denoted by , and denotes the number of occurrences of variable . The size of a formula is .

In the following, we will use the notation to indicate that random variable follows the probability distribution . The notation indicates that . (Footnote 3: Notice that is equivalent to . When , then is equivalent to .)

###### Definition 1 (Scale-free Random Formula)

In the scale-free model, given , and , to construct a random formula, we generate clauses independently at random from the set of clauses, sampling every valid clause with probability

$$ P(l_1\vee\dots\vee l_k)\sim\prod_{i=1}^{k}P(l_i) $$

where every literal is sampled with probability

$$ P(x)=P(\neg x)\sim x^{-\beta} $$

In practice, we generate a variable $x$ with probability proportional to $x^{-\beta}$, negate it with probability $1/2$, repeat the process $k$ times, and reject clauses containing repeated variables.

Therefore, the probability of a clause satisfies the inequality

$$ P(l_1\vee\dots\vee l_k)\ge\frac{k!\,\prod_{i=1}^{k}|l_i|^{-\beta}}{\left(2\sum_{i=1}^{n}i^{-\beta}\right)^{k}} $$

### 3.1 Some Properties of the Model

In the case of the graph generator, we reject self-loops and repeated edges between two nodes. As a result, the distribution of degrees follows a power law only asymptotically and for sparse graphs. In our case, we reject clauses with repeated variables. This is what invalidates the reverse direction in the previous inequality. It also means that the number of variable occurrences in formulas follows a power-law distribution only asymptotically (see Theorem 1). In the following we will discuss when the approximation for is tight (Lemma 1).

Notice that are the generalized harmonic numbers. When tends to infinity and , using the Euler-Maclaurin formula, they can be approximated as

$$ \sum_{i=1}^{n}i^{-\beta}=\zeta(\beta)+\frac{1}{1-\beta}\,n^{1-\beta}+\frac{1}{2}\,n^{-\beta}+O(n^{-\beta-1}) \qquad (1) $$

where $\zeta$ is the Riemann zeta function. When $\beta=1$, we have

$$ \sum_{i=1}^{n}i^{-1}=\gamma+\log n+O(n^{-1}) \qquad (2) $$

where $\gamma$ is the Euler constant.

This means that, when $n$ tends to infinity, the probability of sampling variable is , when , and , when . The fact that the probability of sampling a variable does not vanish, when the number of variables tends to infinity and , may be troublesome. In particular, the probability of generating clauses with duplicated variables does not vanish, even for constant clause sizes. Similarly, to avoid duplicated variables, we also have to impose an upper bound .

###### Lemma 1

When , the sizes of clauses are and tends to infinity, the probability of generating a clause with a duplicated variable tends to zero.

In these conditions, the probability of a random variable and the probability of a random clause in a formula are

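$$ P(x=x_i)\approx\frac{i^{-\beta}}{\sum_{j=1}^{n}j^{-\beta}} $$

$$ P(C=l_1\vee\dots\vee l_k)\approx\frac{k!\,\prod_{i=1}^{k}|l_i|^{-\beta}}{\left(2\sum_{i=1}^{n}i^{-\beta}\right)^{k}} $$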

Proof: We will use a result known as the surname problem Mase (1992), which generalizes the birthday paradox. Let $X_1,\dots,X_k$ be independent random variables with an identical discrete distribution $P(X=i)=p_i$, for $i=1,\dots,n$. Let $R_k$ be the coincidence probability that at least two of them have the same value. Let $r_k=1-R_k$ be the non-coincidence probability. Then, $r_k$ may be computed using the recurrence $r_0=1$ and

$$ r_k=\sum_{j=1}^{k}(-1)^{j-1}\frac{(k-1)!}{(k-j)!}\,P_j\,r_{k-j} $$

where $P_j=\sum_{i}p_i^{\,j}$. The coincidence probability can be computed as $R_0=0$ and

$$ R_k=R_{k-1}+\sum_{j=2}^{k}(-1)^{j}\frac{(k-1)!}{(k-j)!}\,P_j\,(1-R_{k-j}) $$
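The non-coincidence recurrence can be evaluated directly, which gives a quick way to check numerically how often a clause would be rejected. The sketch below assumes $r_0=1$ and $P_j=\sum_i p_i^{\,j}$, which is the reading under which the recurrence reproduces the classical birthday problem; the function names are hypothetical.

```python
from math import factorial

def non_coincidence(probs, k):
    """Probability that k independent draws from `probs` are all distinct."""
    # power_sums[j] = sum_i p_i^j (assumed meaning of P_j)
    power_sums = [sum(p ** j for p in probs) for j in range(k + 1)]
    r = [1.0]  # r_0 = 1 (assumed base case)
    for m in range(1, k + 1):
        total = 0.0
        for j in range(1, m + 1):
            coeff = factorial(m - 1) // factorial(m - j)
            total += (-1) ** (j - 1) * coeff * power_sums[j] * r[m - j]
        r.append(total)
    return r[k]

def scale_free_probs(n, beta):
    """Normalized weights i^(-beta) for variables 1..n."""
    weights = [i ** -beta for i in range(1, n + 1)]
    z = sum(weights)
    return [w / z for w in weights]
```

For instance, `1 - non_coincidence(scale_free_probs(n, beta), k)` is the probability that a sampled clause of size `k` contains a repeated variable.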

In our case, we face the problem of choosing independent variables, and we want to compute the probability of getting a duplicated variable, hence a rejected clause. When , we have:

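$$ P_k=\frac{\sum_{i=1}^{n}i^{-\beta k}}{\left(\sum_{i=1}^{n}i^{-\beta}\right)^{k}}=\frac{\frac{1}{1-\beta k}\,n^{1-\beta k}+\zeta(\beta k)+O(n^{-\beta k})}{\left(\frac{1}{1-\beta}\,n^{1-\beta}+O(1)\right)^{k}}=\frac{(1-\beta)^{k}}{1-\beta k}\,n^{1-k}+\zeta(\beta k)\,(1-\beta)^{k}\,n^{-(1-\beta)k}+O(n^{-k}) $$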

Depending on whether is greater or smaller than , the first or the second term of will dominate.

Since and , we have

$$ R_k\le\sum_{i=2}^{k}\sum_{j=2}^{i}i^{\,j-1}P_j\le k\max_{j=2,\dots,k}k^{\,j-1}P_j $$

In our case, assuming , and replacing the value of , we get

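$$ R_k\le O(n^{\alpha})\max_{j=2,\dots,k}O\!\left(n^{\alpha(j-1)}\left(n^{1-j}+n^{-(1-\beta)j}\right)\right)=\max_{j=2,\dots,k}O\!\left(n^{1-(1-\alpha)j}+n^{-(1-\beta-\alpha)j}\right) $$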

Assuming , the maximum is obtained for . In this situation . Therefore, it suffices to assume that to ensure that .

###### Lemma 2

In a scale-free random formula over variables and clauses of size , generated with exponent , the expected number of occurrences of variable is

$$ E[K_i]\approx C\,k\,(1-\beta)\left(\frac{i}{n}\right)^{-\beta} $$

Proof: By Lemma 1 and equation (1), since we have

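$$ E[K_i]=P(i)\,|F|\approx\frac{i^{-\beta}}{\zeta(\beta)+\frac{1}{1-\beta}\,n^{1-\beta}+O(n^{-\beta})}\;C\,k\,n\approx C\,k\,(1-\beta)\left(\frac{i}{n}\right)^{-\beta} $$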

The following theorem ensures that the formulas we get are scale-free, in the sense that the number of occurrences of variables follow a power-law distribution , for big enough values of .

###### Theorem 1

In scale-free random formulas over variables, with clauses of size , and generated with exponent , when tends to being and constants, the probability that a variable has occurrences, where or , follows a power-law distribution , where .

Proof: In the limit when , by Lemma 1, is the probability of sampling a variable , for some constant that depends on . Let be the number of occurrences of variable in a randomly generated formula . We have . Chernoff’s or Hoeffding’s bounds ensure that, under certain conditions that we will consider later, is approximately . Hence, with high probability.

Now we want to approximate the probability that a variable occurs at least times. Given a value , let be the index of the variable satisfying . Under these conditions, all variables with index smaller than will have more than occurrences, and those with indices between and have fewer than occurrences. Therefore, , for the particular defined above. From and we obtain

$$ F(K)=\frac{i}{n}=\frac{1}{n}\left(\frac{K}{|F|\,C}\right)^{-1/\beta} $$

Then, the probability is

$$ P(K)=-\frac{\partial}{\partial K}F(K)=\frac{(|F|\,C)^{1/\beta}}{\beta\,n}\,K^{-1/\beta-1} $$

Hence we obtain a discrete power-law distribution with exponent .

The problem is that is a good approximation of only when is small. For instance, when , we have and . In this situation, when being and constants, the number of occurrences of the variable follows a Poisson distribution with constant variance. This means that, even in the limit , we cannot assume that implies , when . In the following we will find an upper bound for the index of the variable (a lower bound for the value of ) ensuring that is a good approximation of , when . We will use both Hoeffding’s and Chernoff’s bounds.

In what follows, let $C$ be the constant such that $C\,n$ is the size of the formula.

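Hoeffding’s bound states that, if $X$ is the sum of $n$ identical and independent Bernoulli variables, then

$$ P\left(|X-E[X]|\ge\epsilon n\right)\le 2e^{-2\epsilon^{2}n} $$

Taking $\epsilon=\sqrt{\log n/n}$, we obtain

$$ P\left(|X-E[X]|\ge\sqrt{n\log n}\right)\le\frac{2}{n^{2}} $$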

Given a value of $K$, let’s fix two variables $i$ and $j$ such that

$$ E[K_i]=K \qquad E[K_j]=K-\sqrt{n\log n} $$

We have , and for all variables with bigger indexes

$$ \sum_{r\ge j}P(K_r\ge K)=o(1) $$

We have already argued that . Using , we have a strict bound

$$ P(k\ge K)\le j/n+o(1) $$

By Lemma 2, we get

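$$ \begin{aligned} K&=E[K_i]\approx C(1-\beta)\,(i/n)^{-\beta}\\ K-\sqrt{n\log n}&=E[K_j]\approx C(1-\beta)\,(j/n)^{-\beta}=C(1-\beta)\,(j/i)^{-\beta}\,(i/n)^{-\beta}\approx K\,(j/i)^{-\beta} \end{aligned} $$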

Therefore

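$$ j\approx\frac{i}{\left(1-\frac{\sqrt{n\log n}}{K}\right)^{1/\beta}} \qquad \frac{i}{n}\approx\left(\frac{C(1-\beta)}{K}\right)^{1/\beta} $$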

Replacing the expressions for and , we get

$$ P(k\ge K)\le j/n+o(1)\approx\left(\frac{C(1-\beta)}{K\left(1-\frac{\sqrt{n\log n}}{K}\right)}\right)^{1/\beta} $$

If then

$$ P(k\ge K)\le\left(\frac{K}{C(1-\beta)}\right)^{-1/\beta}+o(1) $$

Similarly, we can prove the same lower bound , using now the variable such that .

Alternatively, we can use Chernoff’s bound

$$ P\left(|X-E[X]|\ge\delta\,E[X]\right)\le 2e^{-\delta^{2}E[X]/3} $$

where $X$ is the sum of independent random variables in the range . In order to ensure that the ’s are sorted, we require that, in the limit , we have . We take the value of $\delta$ that satisfies

$$ \delta\,E[K_i]=\frac{E[K_i]-E[K_{i+1}]}{2} $$

By Lemma 2 and the Taylor expansion this value of , when , is

$$ \delta\approx\frac{1}{2}-\frac{1}{2}\left(\frac{i+1}{i}\right)^{-\beta}\approx\frac{\beta}{2i} $$

And, for this value of , we impose

$$ 2e^{-\delta^{2}E[K_i]/3}\approx 2\exp\left(-\frac{(\beta/2i)^{2}\,C(1-\beta)\,(i/n)^{-\beta}}{3}\right)=O(n^{-1}) $$

From this, we get the minimum value of for which .

$$ i=O\left(n^{\beta/(2+\beta)}\big/\log^{1/(2+\beta)}n\right) $$

The value of corresponding to this variable gives us a value from which on we can expect to observe the power-law distribution in .

$$ K=\Omega\left(\left(n^{2}\log n\right)^{\beta/(2+\beta)}\right) $$

### 3.2 Implementation of the Generator

The generation method is formalized in Algorithm 1.

The function sampleVariable(β, n) may be implemented in two ways.

We can compute a vector such that at the beginning of the algorithm. Then, every time we call sampleVariable, we compute a random number uniformly distributed in , using a dichotomic search, look for the smallest such that , and return such .
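The first implementation can be sketched as follows: a vector of cumulative weights is built once, and each call draws a uniform number and locates it with a binary search. This is a sketch under assumptions, not the paper's Algorithm 1: the names `make_sampler` and `scale_free_formula`, and the parameter `m` for the number of clauses, are hypothetical.

```python
import bisect
import random
from itertools import accumulate

def make_sampler(n, beta, rng):
    """Return a function that samples a variable in 1..n with P(i) ~ i^(-beta)."""
    cumulative = list(accumulate(i ** -beta for i in range(1, n + 1)))
    total = cumulative[-1]

    def sample_variable():
        y = rng.random() * total
        # First position whose cumulative weight exceeds y; the clamp guards
        # against rounding at the upper end. Variables are 1-based.
        return min(bisect.bisect_right(cumulative, y), n - 1) + 1

    return sample_variable

def scale_free_formula(n, m, k, beta, seed=None):
    """Generate m clauses of size k as lists of signed variable indices."""
    rng = random.Random(seed)
    sample_variable = make_sampler(n, beta, rng)
    formula = []
    while len(formula) < m:
        variables = [sample_variable() for _ in range(k)]
        if len(set(variables)) < k:
            continue  # reject clauses containing a repeated variable
        # Negate each variable independently with probability 1/2.
        formula.append([v if rng.random() < 0.5 else -v for v in variables])
    return formula
```

Each sample costs a logarithmic-time search, and the cumulative vector takes space linear in the number of variables.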

Alternatively, if $n$ is big we can use the following approximate algorithm. If we want to generate numbers with probability density , we can integrate , find the inverse function, and compute , where is a uniformly random number in . Our probability function is discrete. However, when , and both and , we can approximate it as

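$$ P(x\le X)=\frac{\sum_{i=1}^{X}i^{-\beta}}{\sum_{i=1}^{n}i^{-\beta}}\approx\frac{\zeta(\beta)+\frac{1}{1-\beta}\,X^{1-\beta}}{\zeta(\beta)+\frac{1}{1-\beta}\,n^{1-\beta}} $$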

Therefore, computing the inverse, sampleVariable may be computed as

$$ X=\left\lfloor\left(\left(n^{1-\beta}+(1-\beta)\,\zeta(\beta)\right)Y-(1-\beta)\,\zeta(\beta)\right)^{1/(1-\beta)}\right\rfloor+1 $$

where is a uniform random variable in . This way, avoiding the use of the vector and the dichotomic search, we save a factor in the time-complexity and a factor in the space-complexity of the generator.
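The inverse-transform formula can be sketched as below, assuming $0<\beta<1$. The standard library has no zeta function, so `zeta_estimate` (a hypothetical helper) obtains $\zeta(\beta)$ by rearranging equation (1) with a fixed number of terms; the final clamp to the range 1..n is a defensive assumption against rounding.

```python
import math
import random

def zeta_estimate(beta, terms=10000):
    """Estimate zeta(beta) for 0 < beta < 1 by solving equation (1) for zeta."""
    partial = sum(i ** -beta for i in range(1, terms + 1))
    return partial - terms ** (1 - beta) / (1 - beta) - 0.5 * terms ** -beta

def make_approx_sampler(n, beta, rng):
    """Return a sampler using the closed-form inverse instead of a lookup vector."""
    c = (1 - beta) * zeta_estimate(beta)  # negative, since zeta(beta) < 0 on (0, 1)
    scale = n ** (1 - beta) + c

    def sample_variable():
        y = rng.random()
        x = math.floor((scale * y - c) ** (1 / (1 - beta))) + 1
        return min(max(x, 1), n)  # defensive clamp to valid variable indices

    return sample_variable
```

Since it rests on a continuous approximation of a discrete distribution, this sampler is least accurate for the smallest indices; it trades that accuracy for constant time per sample and constant memory.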

## 4 Industrial SAT Instances

In the previous section we have defined scale-free random SAT instances. We want this model to generate formulas as close as possible to industrial ones. Therefore, we want to compute the value of that best fits industrial instances. For this purpose we have studied the 100 benchmarks (all industrial) used in the SAT Race 2008. All together, they contain variables, with a total of occurrences. Therefore, the average number of occurrences per variable is . If we used the classical (uniform) random model to generate instances with this average number of occurrences, most of the variables would have a number of occurrences very close to . However, in the analyzed industrial instances, close to of the variables have less than this number of occurrences, and more than have or less occurrences. The big value of the average is produced by a small fraction of the variables that have a huge number of occurrences. This indicates that the number of occurrences could be better modeled with a power-law distribution. This was already suggested by Boufkhad et al. (2005a).

In order to check if those industrial instances (all together) are scale-free SAT formulas, and estimate the value of , we compute the number of occurrences of each variable of each industrial instance. Then, we rename the indexes of such variables such that , for . Now, before comparing with , we renormalize both functions such that both are defined in and its integral in this range is . Hence, we define for the empirical , the empirical function as

$$ \phi_{ind}(x)\stackrel{\text{def}}{=}\frac{n}{\sum_{j=1}^{n}K_j}\,K_{\lfloor nx\rfloor} $$

and, for the theoretical function , the theoretical function as

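$$ \phi(x;\beta,n)=\frac{n}{\sum_{j=1}^{n}j^{-\beta}}\,\lfloor nx\rfloor^{-\beta}\approx\frac{n}{\zeta(\beta)+\frac{1}{1-\beta}\,n^{1-\beta}}\,(nx)^{-\beta}=\frac{1-\beta}{\frac{(1-\beta)\,\zeta(\beta)}{n^{1-\beta}}+1}\,x^{-\beta} $$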

When $n\to\infty$, we get

$$ \phi(x;\beta)=\lim_{n\to\infty}\phi(x;\beta,n)=(1-\beta)\,x^{-\beta} $$
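Because $\phi(x;\beta)$ is a pure power of $x$, its exponent can be read off as a slope in double-logarithmic coordinates. The sketch below computes the empirical function from a list of occurrence counts and fits that slope by ordinary least squares. Both function names are hypothetical, and least squares is only one simple fitting choice, not necessarily the procedure used for the figures.

```python
import math

def phi_ind(occurrences):
    """Points (x, phi_ind(x)) at x = r/n, with counts sorted in decreasing order."""
    counts = sorted(occurrences, reverse=True)
    n = len(counts)
    total = sum(counts)
    # Variables with zero occurrences are skipped: their logarithm is undefined.
    return [((r + 1) / n, n * c / total) for r, c in enumerate(counts) if c > 0]

def estimate_beta(points):
    """Minus the least-squares slope of log(phi) against log(x)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    covariance = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    variance = sum((a - mx) ** 2 for a in xs)
    return -covariance / variance
```

On real instances the tail of rarely occurring variables deviates from a straight line, so the fit would normally be restricted to the range where the power law is expected to hold.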

In Figure 1 we represent both functions with normal axes, and with double-logarithmic axes. Notice that with double-logarithmic axes, the slope of $\phi_{ind}$ allows us to estimate the value of $\beta$.

Theorem 1 allows us to ensure that the distribution of frequencies on the number of occurrences of variables follows a power-law distribution, with exponent .

Finally, we have generated a scale-free random 3-SAT formula with variables, clauses and . In Figure 2, we show the frequencies of occurrences of variables of this formula and compare them with those obtained for the SAT Race 2008, and the line with slope .

## 5 Phase Transition in Scale-Free Random 2-SAT Formulas

Chvátal and Reed (1992) proved that a random formula with clauses of size over variables, is satisfiable with probability , when , and unsatisfiable with probability , when , where represents a quantity tending to zero as tends to infinity.

As we will see in this section, a similar result for scale-free random 2-SAT formulas can be obtained using percolation and mean field techniques.

Percolation theory describes the behavior of connected components in a graph when we remove edges randomly. Erdös and Rényi (1959) are considered the initiators of this theory. In this seminal paper on graph theory they proposed a random graph model where all graphs with nodes and edges are selected with the same probability. Gilbert (1959) proposed a similar model where is also the number of nodes, and every possible edge is selected with probability . For not very sparse graphs (when ), both models have basically the same properties taking . Erdös and Rényi (1960) also studied the connectivity of these graphs and proved that

• when , i.e. , a random graph almost surely has no connected component larger than ,

• when , i.e. a largest component of size almost surely emerges, and

• when , i.e. , the graph almost surely contains a unique giant component with a fraction of the nodes and no other component contains more than nodes.
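The emergence of the giant component can be observed with a small simulation. The sketch below draws a graph with `n` nodes and `m` distinct edges chosen uniformly at random and tracks components with a union-find structure; the function name and the concrete sizes in the usage note are hypothetical choices.

```python
import random

def largest_component(n, m, seed=None):
    """Size of the largest connected component of a uniform random graph G(n, m)."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    added = set()
    while len(added) < m:  # requires m <= n * (n - 1) / 2
        u, v = rng.sample(range(n), 2)
        edge = (min(u, v), max(u, v))
        if edge in added:
            continue  # keep the m edges distinct
        added.add(edge)
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru  # union by size
            size[ru] += size[rv]
    return max(size[find(u)] for u in range(n))
```

Comparing, say, `largest_component(2000, 500)` with `largest_component(2000, 4000)` shows the two regimes on either side of the point where the number of edges is half the number of nodes.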

Phase transition is a phenomenon that has been observed and studied in many AI problems. Many problems have an order parameter that separates a region of solvable and unsolvable problems, and it has been observed that hard problems occur at critical values of this parameter. Mitchell et al. (1992) found this phenomenon in 3-SAT when the ratio between number of clauses and variables is . Gent and Walsh (1994) observed the same phenomenon with clauses of mixed length.

There is a close relationship between SAT problems and graphs. Both percolation on graphs and phase transition in SAT (or other AI problems) are critical phenomena, and both can be studied using mean field techniques from statistical mechanics. Percolation theory has been used in, and has inspired, work on random SAT and satisfiability thresholds, e.g. in Achlioptas et al. (2001) to determine the satisfiability threshold of 1-in-k SAT and NAE 3-SAT formulas. Some results on graphs have been previously extended to 2-SAT. For instance, Sinclair and Vilenchik (2013) adapted Achlioptas processes for graphs to formulas. Bollobás et al. (2001) investigated the scaling window of the 2-SAT phase transition, finding the critical exponent of the order parameter and proving that the transition is continuous, adapting results of Bollobás (1984) for Erdös-Rényi graphs. The relationship between percolation in random graphs and phase transition in random 2-SAT formulas is suggested in many other works. For instance, Monasson et al. (1999), when studying the phase transition in $(2+p)$-SAT (a mixture of clauses of size 2 and clauses of size 3), already mention that “It is likely that the 2SAT transition results from percolation of these loops…”. Cooper et al. (2007) use the emergence of a giant component in a graph to prove the existence of a phase transition in random 2-SAT formulas with prescribed degrees, using the configuration model. They find, for this model, the same criterion as Friedrich et al. (2017b) and us in Theorem 2.

Given a random 2-SAT formula with clauses over variables, we can construct an Erdös-Rényi graph where the literals are nodes, and the clauses are edges. At the percolation point of this graph a giant component emerges. Just at the same point the 2-SAT phase transition threshold is located. However, despite the coincidence in the point, the relation between both facts is not direct: a giant component in the graph is not the same as a giant (hence, unsatisfiable) loop of implications in the SAT formula. The connection between two edges and in the graph is given by a common node (literal) . Whereas, in the SAT formula, the resolution between and is through a variable that is affirmed in one clause and negated in the other. In this section, we elaborate on the relation of giant components in graphs and unsatisfiability proofs in 2-SAT formulas.

### 5.1 A Criterion for Phase Transition in 2-SAT

Unsatisfiability proofs of 2-SAT formulas are characterized by bicycles. Let be a 2-SAT formula. Any sequence of literals satisfying , for any , is called an implication sequence. We say that implies , if there exists an implication sequence of the form . Any implication sequence of the form is called a cycle. A bicycle is a cycle such that there exists a variable satisfying .

A 2-SAT formula is unsatisfiable if, and only if, it contains a bicycle Aspvall et al. (1979); Chvátal and Reed (1992).
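In practice this characterization is usually checked on the implication graph: a formula is unsatisfiable exactly when some variable and its negation imply each other, that is, when they fall in the same strongly connected component. The sketch below uses Kosaraju's algorithm, which is one standard choice and an assumption here; the function name and the signed-integer encoding of literals (following the convention of Section 3) are illustrative.

```python
def two_sat_satisfiable(n, clauses):
    """clauses: pairs of nonzero ints in -n..n. Returns True iff satisfiable."""
    def node(lit):
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    graph = [[] for _ in range(2 * n)]
    reverse = [[] for _ in range(2 * n)]
    for a, b in clauses:
        # (a or b) gives the implications (not a -> b) and (not b -> a).
        for src, dst in ((node(-a), node(b)), (node(-b), node(a))):
            graph[src].append(dst)
            reverse[dst].append(src)

    # Pass 1: iterative depth-first search recording finishing order.
    order, seen = [], [False] * (2 * n)
    for start in range(2 * n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, 0)]
        while stack:
            u, i = stack.pop()
            if i < len(graph[u]):
                stack.append((u, i + 1))
                v = graph[u][i]
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, 0))
            else:
                order.append(u)

    # Pass 2: label components on the reversed graph, latest finisher first.
    comp = [-1] * (2 * n)
    label = 0
    for start in reversed(order):
        if comp[start] != -1:
            continue
        comp[start] = label
        stack = [start]
        while stack:
            u = stack.pop()
            for v in reverse[u]:
                if comp[v] == -1:
                    comp[v] = label
                    stack.append(v)
        label += 1

    # Unsatisfiable iff some literal shares a component with its negation.
    return all(comp[2 * v] != comp[2 * v + 1] for v in range(n))
```

The check runs in time linear in the number of variables plus clauses, which makes it cheap to locate the threshold experimentally on generated formulas.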

We will also consider random graphs with nodes and edges (we will deal with distinct models of random graphs where every graph has a distinct probability of being chosen), and connected components, defined as subsets of nodes such that any pair of them is connected by a path inside the component. A random graph of size is said to contain a giant connected component if almost surely (meaning that, in the model of random graph, as , the probability tends to one) it contains a connected component with a positive fraction of the nodes. Given a model of random graphs, we say that is the percolation threshold if any random graph with nodes and more than edges almost surely contains a giant component. In a random graph, the degree of a node , noted , is a random variable. The random variable represents the degree of a random node chosen with uniform probability. (In some of the models of random graphs that we will consider, not all degrees of nodes follow the same probability distribution. Therefore, we will distinguish between and .)

As noted above, we can represent any 2-SAT formula as a graph where nodes are literals, and clauses $l_1 \vee l_2$ are edges between literals $l_1$ and $l_2$. In classical 2-SAT random formulas, since literals are chosen independently with uniform probability, the generated graph will be an Erdös-Rényi graph over the $2n$ literals. However, a connected component in the graph is not necessarily an unsatisfiability proof of the formula.

First, in a random SAT formula we may have repeated clauses, which means that from $m$ clauses we will obtain fewer than $m$ edges. However, for a linear number of clauses, when $n \to \infty$, almost all of the clauses (and hence edges) are distinct. In the classical case, in the limit $n \to \infty$, with a linear number of clauses $m = O(n)$ and a quadratic number of possible clauses, the probability of any clause is $O(n^{-2})$, and the probability of being repeated is $O(n^{-1})$. Therefore, the fraction of repeated clauses is negligible. For scale-free 2-CNF formulas, in Theorem 5, we will see that, if $\beta < 1/2$, then clauses have probability $o(n^{-1})$. Precisely, the most probable 2-CNF clause is $x_1 \vee x_2$. This means that, after generating $O(n)$ clauses, the probability that a newly generated clause has already been generated previously is bounded by

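$$P(x_1 \vee x_2)\, O(n) \approx 2!\, \frac{1^{-\beta}\, 2^{-\beta}}{\left(2 \sum_{i=1}^{n} i^{-\beta}\right)^{2}}\, O(n) = \frac{2^{-1-\beta}}{\left(\zeta(\beta) + \frac{n^{1-\beta}}{1-\beta} + O(n^{-\beta})\right)^{2}}\, O(n) = O(n^{2\beta-1})$$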

This probability bounds the fraction of repeated clauses, which is negligible when $\beta < 1/2$.

Second, graph connected components and cycles are not the same structure. Therefore, the existence of a giant connected component and the existence of a giant cycle are independent facts.

Molloy and Reed (1995) and Cohen et al. (2000) have studied the existence of giant components in random graphs with heterogeneous and fixed node degrees. Molloy and Reed (1995) prove that the critical point is at

$$Q(\lambda) = \sum_{i>0} i(i-2)\lambda_i = 0$$

where $\lambda_i$ is the fraction of nodes with degree $i$. Cohen et al. (2000) independently prove (in a much more informal way) that the critical point is characterized by

$$\frac{E[k^2]}{E[k]} = 2$$

where $k$ is the degree of a random node, and $E$ denotes expectation. It is easy to see that both criteria are exactly the same. Interestingly, the criterion depends not only on the expected degree of nodes, but also on the expected squared degree, hence on the variability of node degrees. This variability plays an important role in the location of the percolation threshold. For instance, in the Erdös-Rényi model, the percolation threshold is located at $m = n/2$, hence the expected degree of nodes is $1$. However, the expected degree of nodes belonging to the same connected component of size $t$ is at least $2(t-1)/t$ (recall that minimally connected components are trees, where the number of edges is equal to the number of nodes minus one). This discrepancy is only possible if the variability in node degrees is high. This also explains why, in regular random formulas, where we impose that variables occur exactly the same number of times (instead of the same average number of times), we get distinct phase transition thresholds.
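As a short check that the two criteria coincide, the derivation below assumes only that the fraction of nodes with degree $i$ equals $P(k = i)$ and that $E[k] > 0$. Its last line additionally assumes the standard approximation (not stated in the text above) that degrees in an Erdös-Rényi graph with $n$ nodes and $m$ edges are Poisson distributed with mean $c = 2m/n$; it recovers the threshold at which the number of edges is half the number of nodes.

```latex
\begin{align*}
Q(\lambda) &= \sum_{i>0} i(i-2)\lambda_i
            = \sum_{i>0} i^2 \lambda_i - 2 \sum_{i>0} i \lambda_i
            = E[k^2] - 2E[k], \\
Q(\lambda) = 0 &\iff \frac{E[k^2]}{E[k]} = 2, \\
k \sim \mathrm{Poisson}(c) &\implies \frac{E[k^2]}{E[k]} = \frac{c + c^2}{c} = 1 + c,
  \quad\text{so}\quad 1 + c = 2 \iff c = 1 \iff m = \frac{n}{2}.
\end{align*}
```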

Cohen et al. (2000) start by assuming that loops of connected nodes may be neglected. In this situation, the percolation transition takes place when a node $i$, connected to a node $j$ in the connected component, is also connected on average to at least one other node, i.e. when $E[k_i \mid i \leftrightarrow j] = 2$.

Molloy and Reed (1995) give a more detailed proof, which we summarize here. Given the list of fixed degrees of every node, they describe a random algorithm that constructs (exposes) all graphs compatible with these degrees with the same probability, exposing connected components one by one:

Let be the degree of node on the partially exposed graph. Initially, set , for every node. Then, until , for all nodes, repeat the following actions. If, for some node , we have , then (case A) select it; otherwise, (case B) choose freely a node such that . Then, in both cases, choose another node with probability . Expose the edge , and increase and . Notice that every time we execute case B, we start the exposition of a new connected component of the graph.

Let be the random variable representing the number of open vertexes in partially exposed nodes, i.e. , after the th edge has been exposed. Notice that we execute case B when we have , and we get . When we execute case A, there are two situations: (case A1) if node is a partially exposed node (i.e. ), then , and (case A2) if node has never been exposed (i.e. ), then .

Suppose that cases B and A1 do not happen very often. Then, the expected change in $X_r$ is

$$E[X_r - X_{r-1}] \approx \frac{\sum_j k_j (k_j - 2)}{\sum_j k_j} = \frac{Q(\lambda)}{E[k]}$$

and, since , a standard result of random walk theory ensures that if then, after steps, is almost surely of order ; and if , then returns to zero fairly quickly. In the first case, we generate a giant connected component of size , and in the second case, no component is larger than . In order to prove that executions of case A1 do not hurt, Molloy and Reed prove that the probability of choosing a partially exposed node (a node with ) is negligible unless we have already exposed a fraction of the nodes in the current connected component.
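The dichotomy described above can be observed empirically. The sketch below is an illustration under stated assumptions, not Molloy and Reed's exposure procedure: it pairs edge stubs uniformly at random (the configuration model, which may create self-loops and multi-edges) instead of exposing components one by one, and it tracks components with a union-find structure. The function names `molloy_reed_q` and `largest_component_fraction` are hypothetical.

```python
import random

def molloy_reed_q(degrees):
    """Q = sum_i i(i-2) * lambda_i, computed as the mean of k(k-2) over nodes."""
    return sum(k * (k - 2) for k in degrees) / len(degrees)

def largest_component_fraction(degrees, seed=0):
    """Pair stubs uniformly at random and return the largest component's share of nodes."""
    rng = random.Random(seed)
    # One stub per unit of degree; a random perfect matching of stubs gives the edges.
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)

    parent = list(range(len(degrees)))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Consecutive stubs in the shuffled list form an edge; a leftover stub is dropped.
    for i in range(0, len(stubs) - 1, 2):
        a, b = find(stubs[i]), find(stubs[i + 1])
        if a != b:
            parent[a] = b

    sizes = {}
    for v in range(len(degrees)):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / len(degrees)
```

With every degree equal to 3 the quantity Q is positive and a single component should cover most nodes, whereas with every degree equal to 1 it is negative and no component can exceed two nodes.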

Theorems 2 and  3 establish a similar criterion for the existence of a giant set of implied literals from a given one. This almost surely implies unsatisfiability of the formula. The proof of the theorems resemble Molloy and Reed’s and Cohen et al.’s proofs. In Theorem 2 we fix the number of occurrences of every literal, whereas in Theorem 3 we fix the number of occurrences of variables. Compared with the definition of in Molloy and Reed’s, we observe that in Theorem 3, the is replaced by a . In Theorem 2, we combine the number of literals with the number of their negated , and the constant is a instead of a . Notice that the condition in this case is equal to the condition found by Cooper et al. (2007) for the configuration method and prescribed literal degrees.

###### Theorem 2

Let be a 2-CNF formula generated in a random model with variables , where every literal (resp ) is selected with probability (resp , and literals in clauses are not correlated (i.e. ). Assume that and . Let be the expected number of occurrences of literal . Then, if , then almost surely is unsatisfiable.

Proof: The proof resembles Molloy and Reed’s proof for the percolation threshold on graphs. That proof is quite long, and ours does not differ much from it. Therefore, we will only sketch it.

In our case, we do not deal with connected components. In fact, we do not expose the random formula with our algorithm. We assume that we already have the formula, and we describe in Algorithm 2 how to enumerate the set of literals implied by a given initial literal.

The Boolean variable denotes if the literal has been reached from the initial literal and the counter denotes the number of clauses containing that we have already removed from the formula. Therefore, is the number of clauses containing that still remains in . When implies and , for some variable , we say that implies a contradiction. In this case, also implies . The algorithm returns the set of literals implied by or a contradiction (in this second case, we abort, since we already have that is what we want to check). Notice also that implies .
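Since Algorithm 2 itself is not reproduced here, the following is only a hedged sketch of the idea just described, not the paper's algorithm: a breadth-first traversal of implications from an initial literal that returns the set of implied literals, or `None` as soon as a literal and its negation have both been reached. It does not maintain the per-literal counters mentioned above. Literals are assumed to be nonzero integers with `-x` denoting negation, and the name `implied_literals` is hypothetical.

```python
from collections import defaultdict, deque

def implied_literals(clauses, start):
    """Return the literals implied by `start`, or None if a contradiction is implied."""
    succ = defaultdict(list)
    for a, b in clauses:
        # Clause (a or b) yields the implications -a -> b and -b -> a.
        succ[-a].append(b)
        succ[-b].append(a)

    reached = {start}
    queue = deque([start])
    while queue:
        lit = queue.popleft()
        for nxt in succ[lit]:
            if nxt in reached:
                continue
            if -nxt in reached:
                # Both a literal and its negation are implied: abort.
                return None
            reached.add(nxt)
            queue.append(nxt)
    return reached
```

For instance, with clauses `[(-1, 2), (-2, 3)]` the literal `1` implies `{1, 2, 3}`, while with `[(-1, 2), (-1, -2)]` it implies both `2` and `-2`, so the function returns `None`.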

Notice that this algorithm is quite similar to Molloy and Reed’s algorithm for exposing connected components of a random graph. Counter has a similar meaning and we only require the Boolean variable to denote the condition of open vertex (expressed as in Molloy and Reed’s algorithm). The algorithm is deterministic, if you consider the formula given. However, for a random formula, the algorithm perform exactly the same steps and can be seen as a random algorithm. Similarly, we can define the random variable after iteration . At every iteration, this variable satisfies:

(case A)

, if and ,

(case B)

, if , and

(case C)

, if and .

Notice that line 2 decreases in , line 2 decreases in , when , and line 2 increases in