Mining Maximal Induced Bicliques using Odd Cycle Transversals

10/26/2018
by   Kyle Kloster, et al.

Many common graph data mining tasks take the form of identifying dense subgraphs (e.g., clustering and clique-finding). In biological applications, the natural model for these dense substructures is often a complete bipartite graph (biclique), and the problem requires enumerating all maximal bicliques (instead of just identifying the largest or densest). The best known algorithm in general graphs is due to Dias et al., and runs in time O(M |V|^4), where M is the number of maximal induced bicliques (MIBs) in the graph. When the graph being searched is itself bipartite, Zhang et al. give a faster algorithm where the time per MIB depends on the number of edges in the graph. In this work, we present a new algorithm for enumerating MIBs in general graphs, whose run time depends on how "close to bipartite" the input is. Specifically, the runtime is parameterized by the size k of an odd cycle transversal (OCT), a vertex set whose deletion results in a bipartite graph. Our algorithm runs in time O(M |V| |E| k^2 3^(k/3)), which is an improvement on Dias et al. whenever k <= 3 log_3(|V|). We implement our algorithm alongside a variant of Dias et al.'s in open-source C++ code, and experimentally verify that the OCT-based approach is faster in practice on graphs with a wide variety of sizes, densities, and OCT decompositions.

1 Introduction

Bicliques (complete bipartite graphs) naturally arise in many data mining applications, including detecting cyber communities [19], data compression [2], epidemiology [25], artificial intelligence [32], and gene co-expression analysis [16, 17]. In many settings, the bicliques of interest are maximal (not contained in any larger biclique) and/or induced (each side of the bipartition is independent in the host graph), and there is a large body of literature [4, 7, 8, 21, 24, 25, 28, 34] giving algorithms for enumerating all such subgraphs. Many of these approaches make strong structural assumptions on the host graph; the case when the host graph is bipartite has been particularly well-studied, and the iMBEA algorithm of Zhang et al. has been empirically established to be state-of-the-art [34]. In general graphs, the only known non-trivial algorithm for enumerating maximal induced bicliques (MIBs) is that of Dias et al. [7].

We consider the problem of efficiently enumerating all MIBs in general graphs. In particular, we design an algorithm OCT-MIB that extends ideas from iMBEA to work on non-bipartite graphs by using an odd cycle transversal (OCT set): a set of nodes $O$ such that $G[V \setminus O]$ is bipartite. We prove that our algorithm has runtime $O(M n m k^2 3^{k/3})$ where $k = |O|$, $M$ is the number of MIBs in the graph, and $n$ and $m$ denote the number of vertices and edges in the graph, respectively. This is asymptotically faster than the approach of Dias et al. whenever the OCT set has size $k \le 3 \log_3 n$. Since all graphs have OCT sets (although they can be of size linear in $n$, as in cliques), our algorithm can be run in the general case; its correctness does not require minimality or optimality of the OCT set.

We also present several additional algorithms which may be of independent interest. The first is LexMIB, a modification of the algorithm of Dias et al [7], which addresses a flaw in the original method and enumerates all MIBs in time . The second is for a variant of biclique enumeration where we are only interested in bicliques with one part inside a specified independent set of the host graph. Specifically, given a graph with partitioned into with an independent set, we wish to enumerate all maximal crossing bicliques (MCBs) with , . We give an algorithm MCB which enumerates all maximal crossing bicliques in time per MCB.

Further, we implement both OCT-MIB and LexMIB in open source C++ code, and evaluate their performance on a suite of synthetic graphs with known OCT decompositions. Our experiments show that OCT-MIB is generally at least an order of magnitude faster than LexMIB, and verify that in practice, our parameterized approach yields the best performance even when the distance to bipartite (OCT set size) is , far exceeding the constant values typically required by such algorithms.

We begin with preliminaries and a brief discussion of related work, then describe our main algorithm OCT-MIB in Section 3. We include formal proofs of correctness and complexity for OCT-MIB in Section 4. Descriptions of LexMIB and MCB and their associated formal results are in Appendices A and C, respectively. Finally, we present our experimental evaluation in Section 5.

2 Preliminaries

2.1 Related work

The complexity of finding bicliques is well-studied, beginning with the results of Garey and Johnson [9] which establish that in bipartite graphs, finding the largest balanced biclique is NP-hard but the largest biclique (number of vertices) can be found in polynomial time. Finding the biclique with the largest number of edges was shown to be NP-complete in general graphs [33], but the case of bipartite graphs remained open for many years. Several variants (including the weighted version) were proven NP-complete in [6], and in 2000, Peeters finally resolved the problem, proving the edge maximization variant is NP-complete in bipartite graphs [27]. Particularly relevant to the mining setting, Kuznetsov showed that enumerating maximal bicliques in a bipartite graph is #P-complete [20], the NP-completeness analogue for counting problems [31].

For the problem of enumerating maximal induced bicliques, the best known algorithm in general graphs is due to Dias et al. [7]; in the non-induced setting, other approaches include a consensus algorithm [4], an efficient algorithm for small arboricity [8], and a general framework for enumerating maximal cliques and bicliques [10]. We note that as described, the method in [7] may fail to enumerate all MIBs; we describe a graph eliciting this behaviour, along with a modified algorithm (LexMIB) with proof of correctness in Appendix A. We note that our correction increases the runtime of the approach from to per MIB.

There has also been significant work on enumerating MIBs in bipartite graphs. We note that since all bicliques in a bipartite graph are necessarily induced, non-induced solvers for general graphs (such as [4]) can be applied, and have been quite competitive. The best known approach, however, is an algorithm due to Zhang et al. [34] that directly exploits the bipartite structure [footnote 1]. Other approaches in the bipartite setting include frequent closed itemset mining [21] and transformations to the maximal clique problem [24]; faster algorithms are known when a lower bound on the size of bicliques to be enumerated is assumed [25, 28].

Footnote 1: Due to a typo in their runtime (see Appendix B), the worst-case complexity of this algorithm is not an improvement on the time per MIB of several other approaches. However, in practice, the experimental results in [34] support this being faster than [4].

In this work, we extend techniques for bipartite graphs to the general setting using odd cycle transversals, a form of “near-bipartiteness” which arises naturally in many applications [12, 26, 29]. Although finding a minimum size OCT set is NP-hard, the problem of deciding if an OCT set with size $k$ exists is fixed parameter tractable (FPT), with algorithms in [22] and [15] running in times and , respectively. Other algorithms for OCT include an $O(\sqrt{\log n})$-approximation [1], a randomized polynomial kernelization algorithm based on matroids [18], and a subexponential algorithm for planar graphs [23]. Since our algorithm only requires a valid OCT set (not a minimal or optimal one), any of these approaches or one of several high-performing heuristics may be used to pre-process the data. Recent implementations [11] of a heuristic ensemble alongside algorithms from [3, 14] alleviate concerns about finding an OCT decomposition creating a barrier to usability for our algorithm.

2.2 Notation and Terminology

Let $G = (V, E)$ be a graph. We denote $n = |V|$ and $m = |E|$, and use $N(v)$ to represent the neighborhood of a node $v \in V$. An independent set $I$ in $G$ is a maximal independent set (MIS) if $I$ is not contained in any other independent set of $G$. We use $I(S)$ to denote all nodes which are independent from a set $S \subseteq V$ and $C(S)$ to denote all nodes which are completely connected to a set $S \subseteq V$.

A biclique $A \times B$ in a graph $G = (V, E)$ consists of disjoint sets $A, B \subseteq V$ such that every vertex of $A$ is connected to every vertex of $B$. We say a biclique $A \times B$ is induced if both $A$ and $B$ are independent sets in $G$. A biclique is maximal in $G$ if no biclique in $G$ properly contains it. Given a fixed independent set $X \subseteq V$, we define a crossing biclique with respect to $X$ to be an induced biclique $A \times B$ such that $A \subseteq X$ and $B \subseteq V \setminus X$.
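The definitions above can be checked mechanically on small graphs. The following Python sketch is an illustration only (it is not the C++ implementation evaluated in this paper); it assumes the graph is stored as a dictionary `adj` mapping each node to the set of its neighbors, and it requires both sides of a biclique to be non-empty. The function names are hypothetical.

```python
from itertools import combinations


def is_independent(adj, nodes):
    """Return True if no two nodes in `nodes` are adjacent."""
    return all(v not in adj[u] for u, v in combinations(nodes, 2))


def is_induced_biclique(adj, a, b):
    """Return True if (a, b) is an induced biclique.

    Assumption of this sketch: both sides must be non-empty.
    """
    a, b = set(a), set(b)
    if not a or not b or a & b:
        return False
    if not (is_independent(adj, a) and is_independent(adj, b)):
        return False
    # Every vertex of a must be adjacent to every vertex of b.
    return all(b <= adj[u] for u in a)


def is_maximal_induced_biclique(adj, a, b):
    """Return True if no single outside vertex can be added to either side.

    Because both sides of any larger induced biclique are independent,
    testing single-vertex extensions is enough to decide maximality.
    """
    a, b = set(a), set(b)
    if not is_induced_biclique(adj, a, b):
        return False
    for v in set(adj) - a - b:
        if is_induced_biclique(adj, a | {v}, b):
            return False
        if is_induced_biclique(adj, a, b | {v}):
            return False
    return True
```

For example, on the path 0-1-2 the pair ({1}, {0, 2}) is a maximal induced biclique, while ({1}, {0}) is induced but not maximal.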

If is bipartite, we write , where the vertices are partitioned as and we refer to the two partitions as the left and right “sides” of the graph. For a biclique in , by convention we list the “left” set first, i.e. and .

If has OCT set , we denote the corresponding OCT decomposition of by , where the induced subgraph is bipartite, and called the bipartite part. We write and for and , respectively. We let . Given an arbitrary vertex set we abbreviate and .
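Since the algorithm only needs a valid OCT set, a candidate set can be verified by deleting it and 2-coloring what remains. The sketch below does this with breadth-first search; the adjacency-dictionary representation and the function name are assumptions made for illustration, and the returned sides are one valid bipartition rather than a canonical one.

```python
from collections import deque


def bipartition_after_removal(adj, oct_set):
    """Return (left, right) if deleting oct_set leaves a bipartite graph.

    Returns None if an odd cycle survives, i.e. oct_set is not a valid
    odd cycle transversal.
    """
    removed = set(oct_set)
    color = {}
    for start in adj:
        if start in removed or start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in removed:
                    continue
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    # Two adjacent nodes share a color: an odd cycle remains.
                    return None
    left = {v for v, c in color.items() if c == 0}
    right = {v for v, c in color.items() if c == 1}
    return left, right
```

On a triangle, removing any single vertex leaves one edge, so every single vertex is a valid OCT set, while the empty set is not.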

We present algorithms for enumerating maximal induced bicliques in two settings. The first setting, Maximal Induced Biclique Enumeration, is our primary focus.

Maximal Induced Biclique Enumeration
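To make the enumeration problem concrete, the following brute-force Python sketch lists every maximal induced biclique of a tiny graph by checking all pairs of disjoint independent sets. It is exponential in the number of vertices and is meant only as a reference oracle for testing on toy inputs, not as any of the algorithms described in this paper. The adjacency-dictionary input and the function name are assumptions.

```python
from itertools import combinations


def all_mibs_bruteforce(adj):
    """Return all maximal induced bicliques as frozensets of two sides."""
    nodes = sorted(adj)

    def independent(group):
        return all(v not in adj[u] for u, v in combinations(group, 2))

    # All non-empty independent sets of the graph.
    ind_sets = [
        frozenset(group)
        for size in range(1, len(nodes) + 1)
        for group in combinations(nodes, size)
        if independent(group)
    ]

    # All induced bicliques, stored without regard to side order.
    bicliques = set()
    for a in ind_sets:
        for b in ind_sets:
            if a.isdisjoint(b) and all(b <= adj[u] for u in a):
                bicliques.add(frozenset((a, b)))

    def contained(p, q):
        a, b = tuple(p)
        c, d = tuple(q)
        return (a <= c and b <= d) or (a <= d and b <= c)

    return {
        p for p in bicliques
        if not any(p != q and contained(p, q) for q in bicliques)
    }
```

On a 4-cycle this returns the single biclique with sides {0, 2} and {1, 3}; on a triangle it returns the three edges.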

The second setting, Maximal Crossing Biclique Enumeration arises as a subproblem in our approach to Maximal Induced Biclique Enumeration.

Maximal Crossing Biclique Enumeration

3 Algorithms

OCT-MIB enumerates all MIBs in a graph by using an OCT decomposition to drive a divide and conquer approach. Removing an OCT set from enables use of efficient methods for enumerating MIBs in the bipartite setting, such as Zhang et al.’s iMBEA algorithm. Then a given MIB, , in the bipartite graph can be checked for maximality in by attempting to add vertices from to .

Each MIB not found in necessarily contains at least one vertex , allowing us to enumerate them by iterating over each vertex and identifying all MIBs that contain . This process requires careful bookkeeping that we organize using the observation that in all MIBs containing a given vertex , one side of the biclique must be an independent set completely connected to . Hence, we proceed with constructing “seed” bicliques from independent sets in . We then “grow” these into maximal bicliques by adding vertices from an MIS in containing .

In Section 3.1 we provide more detailed algorithm outlines for both OCT-MIB and MCB, highlighting key phases. We also provide high-level pseudo-code for OCT-MIB and describe blueprints, a shared data structure. The remainder of the section is dedicated to a complete description of OCT-MIB (Section 3.2). A complete description of MCB, along with proofs of its correctness and runtime, can be found in Appendix C.

3.1 Algorithm outline & data structure

In both MCB and OCT-MIB we take an independent set and build bicliques such that . We let . During the initialization phase we find bicliques of the form where contains exactly one node from . During this phase, we enumerate MISs in a subgraph using MIS [30]; in OCT-MIB, we also utilize MCB.
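The initialization phase relies on enumerating maximal independent sets. As a simple stand-in for the MIS enumeration routine of [30], the sketch below uses a Bron–Kerbosch-style recursion without pivoting, applied to independent sets instead of cliques. This is a swapped-in technique chosen for brevity, and the function name and adjacency-dictionary input are assumptions. By the classical Moon–Moser bound, a graph on k vertices has at most 3^(k/3) maximal independent sets, which is consistent with the 3^(k/3) factor in the runtime stated in the abstract.

```python
def maximal_independent_sets(adj, nodes):
    """Enumerate the maximal independent sets of the subgraph induced on `nodes`."""
    results = []

    def extend(current, candidates, excluded):
        if not candidates and not excluded:
            # Nothing can be added, and no skipped node could have been.
            results.append(frozenset(current))
            return
        for v in sorted(candidates):
            # Keep only nodes that are non-adjacent to v (and not v itself).
            extend(
                current | {v},
                candidates - adj[v] - {v},
                excluded - adj[v] - {v},
            )
            candidates = candidates - {v}
            excluded = excluded | {v}

    extend(set(), set(nodes), set())
    return results
```

For instance, if the OCT set induces a perfect matching on four nodes, the routine returns four maximal independent sets (one endpoint from each matched pair).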

We then “grow” the bicliques found during initialization in the expansion phase by repeatedly adding a node to and removing nodes from and to ensure is still an induced biclique. We refer to this process as expanding with . In MCB we let , and in OCT-MIB we let each MIS in be once. While MCB only contains initialization and expansion phases, OCT-MIB also contains the bipartite phase alluded to earlier, where the MIBs which are completely contained in are found. (Line 2 of Algorithm 1; this can be completed using e.g. iMBEA).
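The core of an expansion step can be sketched in a few lines: add the new node to one side, drop its neighbors from that side, and drop its non-neighbors from the opposite side. The Python sketch below shows only this set manipulation; it omits the blueprint bookkeeping and the future-maximality checks, and the function and parameter names are hypothetical.

```python
def expand(adj, same_side, other_side, v):
    """Add v to `same_side` and shrink both sides to keep an induced biclique.

    Nodes on v's side that are adjacent to v are removed (the side must stay
    independent); nodes on the opposite side that are not adjacent to v are
    removed (the sides must stay completely connected).
    """
    new_same = {u for u in same_side if u not in adj[v]} | {v}
    new_other = {u for u in other_side if u in adj[v]}
    return new_same, new_other
```

On the path 0-1-2-3, expanding the biclique ({1}, {0, 2}) with node 3 on the side of node 1 yields ({1, 3}, {2}). The result can have an empty opposite side, which corresponds to the invalid case that the algorithm discards.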

We employ an ordering on the vertices of to limit the number of redundant expansions we make. We say is near-maximal if is maximal with respect to and there is some set of nodes such that is a maximal biclique. We allow so maximal bicliques are also near-maximal. We refer to near-maximal bicliques where all of the missing nodes occur later in as future-maximal, and we only expand on future-maximal bicliques (discarding all others). We group bicliques which contain the same set of -nodes in bags. During the expansion phase we iterate over the bags, expanding their set of bicliques with all such that for all occurring in some biclique in the bag. For each -node we expand with, we create a new bag to hold the new future-maximal bicliques. The general approach of OCT-MIB is outlined in Algorithm 1. The pseudo-code for MCB is similar but does not include a bipartite phase (lines 2 to 6) and only uses MIS in initialization. We now describe a key data structure used in both OCT-MIB and MCB.

Input: Graph , partitioning of into , where and are independent sets
Output: , all maximal induced bicliques of
1:
2: maximal bicliques in
3: for  do
4:     if  is maximal in  then
5:         add to
6:
7: for  do
8:     to hold bags
9:     Fix an order of
10:     for  do
11:         unique initial bicliques via MCB & MIS
12:         for  do
13:             if not-future-max(, then
14:                 remove from
15:             if maximal(, then
16:                 add to (and keep in )
17:         if  is not empty then
18:             add bag to
19:     while  is not empty do
20:         Pick and let
21:         for :  do
22:
23:             for biclique  do
24:                 expand with
25:                 if not-future-max(, then
26:                     continue
27:
28:                 if maximal(, then
29:                     add to (and keep in )
30:             if  is not empty then
31:                 add to
Algorithm 1 OCT-MIB pseudo-code. Key phases: Bipartite (Lines 2–5), Initialization (Lines 10–18), and Expansion (Lines 19–31).

Definition. A blueprint is an octuple of sets and a specified node . The sets satisfy

  • contains the -nodes in the biclique

  • }

  • .

The sets with

  • ,

and the set . Finally,

By design, is an induced biclique and is represented by the blueprint. We refer to blueprints and the bicliques they represent interchangeably in the description and discussion of the algorithms.

To guide the reader, we now describe the algorithmic roles of other sets in the blueprint. Within , is the set of nodes that are still candidates to be expanded with, is the set of nodes that have been considered to be in a biclique with the current elsewhere. We use and to check near-maximality, and to check global maximality in OCT-MIB. The vertex is used to prevent expansions which produce non-future-maximal bicliques.

3.2 OCT-MIB Algorithm

We begin by explaining checks for various properties of a blueprint, including not-future-max(, ) and maximal(). If the node we are expanding with is ordered later than , then we are able to detect that the expansion will not yield a new blueprint. We say that a biclique is invalid if is empty. A blueprint is not future-maximal if at least one of the following conditions is met; (i) , (ii) , or (iii) . Note conditions (ii) and (iii) imply is not near-maximal. Finally a blueprint is maximal . We note that each of these checks can be done in time.

Bipartite phase. Zhang et al. developed an algorithm to find all maximal induced bicliques in a bipartite graph in time where is the number of maximal bicliques [34]. We run this algorithm on , and for each biclique found we check if an OCT node can be added to either side, which can be done in time. If an OCT node cannot be added then we have found a MIB.
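The maximality check in the bipartite phase can be sketched as follows: an OCT node can join a side exactly when it is adjacent to every node on the opposite side and to no node on its own side. The Python code below is an illustrative sketch under the assumption that the graph is an adjacency dictionary; the function names are hypothetical.

```python
def addable_oct_nodes(adj, left, right, oct_set):
    """Return the OCT nodes that could be added to one side of (left, right)."""
    left, right = set(left), set(right)
    addable = []
    for o in sorted(oct_set):
        nbrs = adj[o]
        # o joins the left side: adjacent to all of right, to none of left.
        fits_left = right <= nbrs and not (left & nbrs)
        # o joins the right side: adjacent to all of left, to none of right.
        fits_right = left <= nbrs and not (right & nbrs)
        if fits_left or fits_right:
            addable.append(o)
    return addable


def is_mib_in_full_graph(adj, left, right, oct_set):
    """A biclique that is maximal in the bipartite part is a MIB of the whole
    graph exactly when no OCT node can be added to either side."""
    return not addable_oct_nodes(adj, left, right, oct_set)
```

For example, if the bipartite part is the single edge 1-2 and OCT node 0 is adjacent to both endpoints (a triangle), the biclique ({1}, {2}) remains maximal; if node 0 is adjacent only to node 2, it can join the side of node 1.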

Initialization phase. Recall that OCT-MIB iterates over MISs in ; let be the current MIS. For each we create a new bag which will hold blueprints. Let , , and . Now we find candidate blueprints in three rounds, adding the future-maximal ones to .

In the first round, for each MIS , we check if any node from or is in . If no node from either set is completely connected to , we create a candidate blueprint where , , , , and , , , and are defined as above. Otherwise we continue on to processing the next MIS in and do not create a candidate blueprint for .

In the second round, we run MCB on the subgraph induced on with . Then for each returned MCB , we find the set of nodes . We create a candidate blueprint with , , , , and , , , and defined as above.

In the third round, we run MCB on the subgraph induced on with as the designated independent set. For each returned MCB , we check if . If so, we create a candidate blueprint where , , , , and , , , and are defined as above. Otherwise, we continue processing the next MCB (without creating a candidate blueprint for ).

We check each candidate blueprint for future-maximality, adding it to if true, and discarding it otherwise. If the blueprint is maximal we add to the set of maximal bicliques and let . If the blueprint is not maximal because an OCT node can be added to it we also let ; if it is not maximal because of an -node later in the ordering we let be the first such node from . The blueprint remains in in either case. As long as is non-empty we add it to , the set of bags to be processed in the expansion phase.

Expansion phase. We now process bags from until it is empty. We refer to removing a bag from and expanding on all of the blueprints in with a vertex as branching on with . In OCT-MIB , , , and are the same in all blueprints in a given bag and we branch on a bag with all of the nodes in in the order which matches the order of ; .

Let be the node we are currently branching with and be a bag we create to hold new blueprints formed by expanding with . Let be the blueprint currently being expanded with. If we can detect that the expansion will not yield a new blueprint we terminate this expansion. Otherwise we create a new blueprint as follows.

Let , , , , , , , and . If is invalid or it is not future-maximal, we terminate the expansion.

We must consider the case where the tuples from different blueprints are identical after expansion. To handle this we maintain a hashtable for the current being branched with, and hash each tuple and terminate if we find a conflict. This hashtable can be discarded once we finish branching with the current node.
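A minimal sketch of this de-duplication step is shown below. It assumes, for illustration only, that the hashed tuple consists of the two sides of the expanded biclique; the exact tuple used by OCT-MIB is defined by the blueprint. A Python `set` of frozensets plays the role of the hashtable, and a fresh one would be created for each node being branched with.

```python
def dedupe_expansions(expansions):
    """Keep the first occurrence of each (same_side, other_side) pair.

    `expansions` is an iterable of pairs of node collections produced while
    branching with a single node.
    """
    seen = set()
    unique = []
    for same_side, other_side in expansions:
        key = (frozenset(same_side), frozenset(other_side))
        if key in seen:
            # An identical tuple was already produced: terminate this expansion.
            continue
        seen.add(key)
        unique.append((set(same_side), set(other_side)))
    return unique
```

Using frozensets as keys makes the comparison independent of the order in which nodes were added to each side.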

If the expansion has not been terminated then is future-maximal so we add to . If the blueprint is maximal we add to the set of maximal bicliques and let . If the blueprint is not maximal solely because of nodes from we also let , but if it is not maximal because of a node from we let be the first such node from . The blueprint remains in in either case. Once again we continue the expansion phase until all bags have been branched upon.

4 Theoretical guarantees

Figure 1: Runtime ratios of OCT-MIB to LexMIB with varying edge densities (Left) and coefficient of variance (Right). Each curve corresponds to a fixed and ; for settings where OCT-MIB completed but LexMIB timed out on some instances, we display both upper (dashed, using 3600s for LexMIB) and lower (solid, using infinity for LexMIB) bounds on the ratio. For edge density, each point is averaged over 10 randomly generated instances (5 with OCT-OCT density equal to and 5 with it fixed to , so as to ensure the results are not an artifact of the OCT-OCT density; when there are only 5 instances); note that both algorithms timed out on all instances with = when edge density was greater than . For cv, points are an average over 5 instances.

We now establish the correctness and asymptotic complexity of OCT-MIB.

Theorem 4.1.

OCT-MIB finds all maximal induced bicliques.

Proof.

Suppose that there is a MIB that OCT-MIB does not find. If there are no nodes from in then the biclique would be found in the bipartite phase, thus we may assume there is a node from in the biclique.

Without loss of generality we may assume that contains OCT-nodes and that if contains nodes from or then contains nodes from the other. We let and . Note that must be contained in some MIS in . Let be the first node in with respect to ; we show that when we initialize with , and for some blueprint in . is either empty or contains nodes from one of , and we show that a candidate blueprint as described above is created in both cases.

By definition, must be contained within an MIS in . If is empty and a candidate blueprint with is not created in the first round, then there must be nodes in or which are in . If a node from is in then is a crossing biclique in the instance of MCB called in the second round. Thus a maximal crossing biclique which contains is returned and a candidate blueprint with is created. Otherwise a node from is in and is a crossing biclique in the instance of MCB called in the third round, leading to the creation of a candidate blueprint with .

If , assume without loss of generality that . Then there is a crossing biclique in the instance of MCB called in the second round, and a candidate blueprint with and is created.

If a candidate blueprint with and is pruned away because of an node, then is not a maximal biclique. So there exists a blueprint in where and . Expanding so that would yield the biclique as . Thus as long as we expand for each node in we will find the biclique .

Assume that an expansion with a node in is not made and let be the first such node. The blueprint would not have been discarded because of being empty or a node from or being in , as this would imply is not a MIB. If a node from from or made not near-maximal when expanding with the node prior to in , , consider a blueprint formed by adding nodes from to and to such that forms a future-maximal biclique. This blueprint must exist at the bag created by expanding with and it contains in and in . We now let that blueprint be .

Thus must be greater than for blueprint at the bag with . Because of how we constructed , it is not in . Therefore is not in but because is an independent set, it is independent from . Furthermore is completely connected to because it is completely connected to and . Thus would not be a maximal biclique and we obtain a contradiction. We can apply this argument inductively to show the complete correctness of the algorithm.

Theorem 4.2.

Given a graph with OCT decomposition , OCT-MIB runs in time, where is the number of MIBs and is the number of maximal independent sets in .

Proof.

We note that has no isolates, therefore and are . We first compute the complexity of finding a maximal biclique when iterating over a single MIS in OCT. In the initialization phase, finding the MISs in round one takes time per MIS. Finding the MCBs in rounds two and three takes per MCB. This is because the arboricity of in each call to MCB is . Therefore the time it takes to initialize a blueprint is not  [5]. Because of how we have utilized , each node in is expanded with times per MIB. Each expansion can be done in time, and thus the total time spent expanding is per maximal biclique. Since we only expand on blueprints which are future-maximal, every expansion is accounted for.

There may be an additional initializations of a blueprint. Thus the total complexity of finding a single maximal biclique when iterating over is . A biclique can be found once per MIS in OCT, and thus the total complexity of OCT-MIB is , where is the number of maximal bicliques and is the number of MIS’s in OCT.

5 Empirical Evaluation

Figure 2: Runtime ratios of OCT-MIB to LexMIB with varying OCT sizes (Left) and expected OCT- edge densities (Right). Each data point represents the average over 5 random instances with and . For settings where OCT-MIB completed but LexMIB timed out on some instances, we display both upper (dashed, using 3600s for LexMIB) and lower (solid, using infinity for LexMIB) bounds on the ratio. In the OCT- edge density experiment, the expected edge density is within and between and .

In this section we evaluate the performance of OCT-MIB on a suite of synthetic graphs with a variety of sizes, densities, degree distributions, and OCT decomposition structures, to see how various aspects of the graph impact the runtime. We benchmark against LexMIB, our modified version of the approach described in [7] (see Appendix A).

(Density, balance)   Size   OCT size   Avg. number of MIBs
(5%, 1:10)           1000   -          3717.8
                     1000   20         16631.0
(5%, 1:1)            1000   -          50424.4
                     600    10         962305.8
(10%, 1:1)           600    -          86239.0
                     300    5          185760.4
Table 1: Comparison of the average number of MIBs in bipartite and near-bipartite graphs with equivalent expected edge density and balance .

5.1 Experimental setup and data

We implemented OCT-MIB and LexMIB [footnote 2] in C++; we note that no prior implementation of the algorithm of Dias et al. was available. All code is open source under a BSD 3-clause license and publicly available at [13]. Hardware specifications are in Appendix D.

Footnote 2: The experiments described in this manuscript were run using an early implementation of LexMIB which output bicliques according to a fixed ordering different from that described in [7]. The implementation available at [13] removes this inconsistency.

Data.

Our datasets are generated using a modified version of the random graph generator of Zhang et al. [34] that augments the random bipartite graph to have OCT sets of known size. The generator allows a user to specify , , and , the expected edge densities between and , and , and within , and control the coefficient of variance (cv := ) of the expected number of neighbors in the larger partition over the smaller partition and in over the OCT set. For additional details on the generator and parameter settings see Appendix D.
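To illustrate the kind of instance described above, the following Python sketch plants an OCT set of known size on top of a random bipartite graph. It is not the modified generator of Zhang et al. used in the experiments: all parameter names are hypothetical, edges are drawn independently with fixed probabilities, and the coefficient of variance control is omitted. The planted set is a valid OCT set but is not guaranteed to be a minimum one.

```python
import random


def random_near_bipartite(n_left, n_right, n_oct, p_bip, p_oct_bip,
                          p_oct_oct, seed=0):
    """Return (adj, left, right, oct_set) for a random graph with a planted OCT set."""
    rng = random.Random(seed)
    left = list(range(n_left))
    right = list(range(n_left, n_left + n_right))
    oct_nodes = list(range(n_left + n_right, n_left + n_right + n_oct))
    adj = {v: set() for v in left + right + oct_nodes}

    def maybe_add(u, v, p):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)

    # Edges between the two sides of the bipartite part.
    for u in left:
        for v in right:
            maybe_add(u, v, p_bip)
    # Edges between the OCT set and the bipartite part.
    for o in oct_nodes:
        for v in left + right:
            maybe_add(o, v, p_oct_bip)
    # Edges inside the OCT set.
    for i, o in enumerate(oct_nodes):
        for other in oct_nodes[i + 1:]:
            maybe_add(o, other, p_oct_oct)
    return adj, set(left), set(right), set(oct_nodes)
```

No edges are ever added inside the left or right side, so deleting the planted set always leaves a bipartite graph.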

Unless otherwise specified, the following default parameters are used: expected edge density , , and 1:10; additionally, the edge density between and is the same as that between and . For all experiments where it was appropriate we tested two OCT sizes, and . Our default timeout was one hour (3600s).

We note that many of our graphs are significantly smaller than those used in [34], due to an explosion in the number of MIBs in even slightly non-bipartite instances (see Table 1). Our graph corpus was designed to include instances with approximately the same number of MIBs as those in [34] for all experiments evaluating variation in the bipartite subgraph.

5.2 Comparison of OCT-MIB and LexMIB

We first measured the impact of the heterogeneity of the degree distribution by varying the cv between and from to in steps of (cv between and is still 0.5). As seen in the right panel in Figure 1, the ratio of the runtime of OCT-MIB to that of LexMIB generally decreases as cv increases when , but is at least consistent, and possibly increases when .

In order to evaluate the effect of edge density, we restricted our attention to graphs with 1:1 balance (as in [34]), where and uniform expected edge density ranges from to in steps of . Internal density within was either set to match edge density or fixed to be . As seen at left in Figure 1, the ratio of the runtime of OCT-MIB to that of LexMIB decreases as the expected density increases to , after which it is likely to continue to decrease but can not be determined due to LexMIB timing out.

Finally, as was done in [34], we tested the effect of the size and balance of the bipartite subgraph on runtime. This was particularly challenging due to the significant increase in the number of MIBs at lower balance ratios (see Table 1). Results and discussion are in Appendix D (Figure 5); the effect of making the bipartite subgraph more imbalanced was similar for both algorithms.

5.3 Varying OCT structures

This set of experiments explores how OCT-MIB performs on graphs with widely varying OCT sizes and densities.

We tested by varying it from to in steps of size (left panel of Figure 2), noting that the runtime of OCT-MIB tends to increase relative to that of LexMIB as increases. We examined the impact of the density between and by varying it from to in steps of size (right panel of Figure 2), observing that the runtime of OCT-MIB slightly increases relative to that of LexMIB as the expected density increases. In the latter we let the cv between and be 0.

We also looked at graphs where the structure of was optimal for OCT-MIB (independent) and adversarial (perfect matching) with respect to the number of MISs in . For the best-case we fixed and the balance at 1:10, while setting , and edge density to and . For this experiment, we increased the timeout to 7200 seconds. Table 2 shows the comparatively modest growth in runtime as the independent OCT set is increased from 5 to 25.

5 312.7 1345.8
10 597.8 2828.1
25 1552.1 6683.9
Table 2: Runtimes for OCT-MIB on graphs with and independent sets . Each entry represents the average runtime (seconds) over 5 random graphs.

For graphs where the OCT set was a perfect matching, we let be or , , and set the balance to 1:1. As seen in Table 3, OCT-MIB still outperforms LexMIB by nearly an order of magnitude.

Finally, we evaluated both algorithms on graphs with large values of (where OCT-MIB’s computational complexity is worse than that of LexMIB) and confirmed that in this scenario, LexMIB is the preferable algorithm in practice and theory. Specifically, on graphs where , OCT-MIB was faster than LexMIB when , but at , it was already an order of magnitude slower. The effect was even more pronounced for a corpus where : LexMIB had average runtimes of and when was and , respectively, whereas OCT-MIB timed out (s) on 9/10 instances (finishing a single instance in s).

Algorithm
LexMIB (74.0, -) (3420, 4)
OCT-MIB (19.8, -) (634, -)
Table 3: Runtimes for OCT-MIB and LexMIB on adversarially created graphs. Each entry represents the average runtime (seconds) on completed instances and the number of timeouts (). For each , use 5 random graphs with , = 1:1, , and an OCT set which is a perfect matching.

6 Conclusions

We present a new algorithm OCT-MIB for enumerating maximal induced bicliques in general graphs, with runtime parameterized by the size of an odd cycle transversal. Additionally, we describe a flaw in the algorithm of Dias et al. [7], and give a corrected variant LexMIB.

It is particularly noteworthy that OCT-MIB has the best-known complexity for enumerating MIBs even when the parameter is logarithmic in the instance size – far from the constant regime often necessary for FPT approaches to be efficient.

We implement and benchmark both algorithms on a corpus of synthetic graphs, establishing that in practice, OCT-MIB is typically an order of magnitude faster than LexMIB, even when is . We also confirm that OCT-MIB finishes on graphs with over 1,000,000 MIBs in minutes, and enumerates all MIBs in sparse graphs () where and is in less than two hours.

Our experiments also demonstrate that size may not be the most important feature of an OCT set in determining OCT-MIB’s runtime in practice. Although this paper focused on the algorithm’s performance in the setting where an OCT decomposition with given was known, it would be interesting to further analyze how the edge structure within the OCT set impacts the runtime (for example, we know that the number of maximal independent sets in plays a key role in OCT-MIB’s execution), and whether OCT sets found by different algorithms/heuristics are more or less advantageous for biclique enumeration.
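The influence of edge structure can be seen already on tiny examples. The brute-force sketch below counts maximal independent sets for three hypothetical OCT sets of the same size: an edgeless set, a perfect matching, and a union of disjoint triangles, which attains the classical 3^(k/3) upper bound of Moon and Moser. The helper names are illustrative, and the enumeration is exponential, so it is only meant for a handful of vertices.

```python
from itertools import combinations


def maximal_independent_sets(adj):
    """Brute-force enumeration over all vertex subsets."""
    vertices = sorted(adj)
    result = []
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            independent = all(adj[u].isdisjoint(chosen) for u in chosen)
            # Maximal: every vertex outside has a neighbour inside.
            maximal = all(adj[v] & chosen for v in vertices if v not in chosen)
            if independent and maximal:
                result.append(chosen)
    return result


def edgeless(k):
    return {v: set() for v in range(k)}


def perfect_matching(k):
    adj = edgeless(k)
    for v in range(0, k, 2):
        adj[v].add(v + 1)
        adj[v + 1].add(v)
    return adj


def disjoint_triangles(k):
    adj = edgeless(k)
    for base in range(0, k, 3):
        triangle = {base, base + 1, base + 2}
        for v in triangle:
            adj[v] |= triangle - {v}
    return adj


counts = {
    name: len(maximal_independent_sets(build(6)))
    for name, build in [("edgeless", edgeless),
                        ("matching", perfect_matching),
                        ("triangles", disjoint_triangles)]
}
```

All three sets have six vertices, yet the number of maximal independent sets differs, which is the kind of structural effect the paragraph above refers to.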

Finally, we note that the current implementation of OCT-MIB could be improved by replacing the MIS-enumeration algorithm with that of [30] and the bipartite phase with the implementation of iMBEA used in [34].

References

  • [1] A. Agarwal, M. Charikar, K. Makarychev, and Y. Makarychev, $O(\sqrt{\log n})$ approximation algorithms for min UnCut, min 2CNF deletion, and directed cut problems, STOC, 2005, pp. 573–581.
  • [2] P. Agarwal, N. Alon, B. Aronov, and S. Suri, Can visibility graphs be represented compactly?, Discrete & Computational Geometry, 12 (1994), pp. 347–365.
  • [3] T. Akiba and Y. Iwata, Branch-and-reduce exponential/fpt algorithms in practice: A case study of vertex cover, Theoretical Computer Science, 609 (2016), pp. 211–225.
  • [4] G. Alexe, S. Alexe, Y. Crama, S. Foldes, P. Hammer, and B. Simeone, Consensus algorithms for the generation of all maximal bicliques, Discrete Applied Mathematics, 145 (2004), pp. 11–21.
  • [5] N. Chiba and T. Nishizeki, Arboricity and subgraph listing algorithms, SIAM J. on Computing, 14 (1985), pp. 210–223.
  • [6] M. Dawande, P. Keskinocak, J. Swaminathan, and S. Tayur, On bipartite and multipartite clique problems, J. of Algorithms, 41 (2001), pp. 388–403.
  • [7] V. Dias, C. De Figueiredo, and J. Szwarcfiter, Generating bicliques of a graph in lexicographic order, Theoretical Computer Science, 337 (2005), pp. 240–248.
  • [8] D. Eppstein, Arboricity and bipartite subgraph listing algorithms, Inf. Process. Lett., 51 (1994), pp. 207–211.
  • [9] M. Garey and D. Johnson, Computers and intractability: a guide to NP-completeness, 1979.
  • [10] A. Gély, L. Nourine, and B. Sadi, Enumeration aspects of maximal cliques and bicliques, Discrete Applied Mathematics, 157 (2009), pp. 1447–1459.
  • [11] T. Goodrich, E. Horton, and B. Sullivan, Practical Graph Bipartization with Applications in Near-Term Quantum Computing, arXiv preprint arXiv:1805.01041, 2018.
  • [12] N. Gülpinar, G. Gutin, G. Mitra, and A. Zverovitch, Extracting pure network submatrices in linear programs using signed graphs, Discrete Applied Mathematics, 137 (2004), pp. 359–372.
  • [13] E. Horton, K. Kloster, B. D. Sullivan, A. van der Poel, MI-bicliques, https://github.com/TheoryInPractice/MI-bicliques, October 2018.
  • [14] F. Hüffner, Algorithm engineering for optimal graph bipartization, International Workshop on Experimental and Efficient Algorithms, 2005, pp. 240–252.
  • [15] Y. Iwata, K. Oka, and Y. Yoshida, Linear-time FPT algorithms via network flow, SODA, 2014, pp. 1749–1761.
  • [16] M. Kaytoue-Uberall, S. Duplessis, and A. Napoli, Using formal concept analysis for the extraction of groups of co-expressed genes, Modelling, Computation and Optimization in Information Systems and Management Sciences, 2008, pp. 439–449.
  • [17] M. Kaytoue, S. Kuznetsov, A. Napoli, and S. Duplessis, Mining gene expression data with pattern structures in formal concept analysis, Information Sciences, 181 (2011), pp. 1989–2011.
  • [18] S. Kratsch and M. Wahlström, Compression via matroids: a randomized polynomial kernel for odd cycle transversal, ACM Transactions on Algorithms (TALG), 10 (2014), pp. 20:1–20:15.
  • [19] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Trawling the Web for emerging cyber-communities, Computer Networks, 31 (1999), pp. 1481–1493.
  • [20] S. Kuznetsov, On computing the size of a lattice and related decision problems, Order, 18 (2001), pp. 313–321.
  • [21] J. Li, G. Liu, H. Li, and L. Wong, Maximal biclique subgraphs and closed pattern pairs of the adjacency matrix: A one-to-one correspondence and mining algorithms, IEEE Trans. Knowl. Data Eng., 19 (2007), pp. 1625–1637.
  • [22] D. Lokshtanov, S. Saurabh, and S. Sikdar, Simpler parameterized algorithm for OCT, International Workshop on Combinatorial Algorithms, 2009, pp. 380–384.
  • [23] D. Lokshtanov, S. Saurabh, and M. Wahlström, Subexponential parameterized odd cycle transversal on planar graphs, LIPIcs-Leibniz International Proceedings in Informatics, 18 (2012).
  • [24] K. Makino and T. Uno, New algorithms for enumerating all maximal cliques, Scandinavian Workshop on Algorithm Theory, 2004, pp. 260–272.
  • [25] R. Mushlin, A. Kershenbaum, S. Gallagher, and T. Rebbeck, A graph-theoretical approach for pattern discovery in epidemiological research, IBM Systems J., 46 (2007), pp. 135–149.
  • [26] A. Panconesi and M. Sozio, Fast hare: A fast heuristic for single individual SNP haplotype reconstruction, International workshop on algorithms in bioinformatics, 2004, pp. 266–277.
  • [27] R. Peeters, The maximum edge biclique problem is NP-complete, Discrete Applied Mathematics, 131 (2003), pp. 651–654.
  • [28] M. Sanderson, A. Driskell, R. Ree, O. Eulenstein, and S. Langley, Obtaining maximal concatenated phylogenetic data sets from large sequence databases, Molecular biology and evolution, 20 (2003), pp. 1036–1042.
  • [29] J. Schrock, A. McCaskey, K. Hamilton, T. Humble, and N. Imam, Recall Performance for Content-Addressable Memory Using Adiabatic Quantum Optimization, Entropy, 19 (2017).
  • [30] S. Tsukiyama, M. Ide, H. Ariyoshi, and I. Shirakawa, A new algorithm for generating all the maximal independent sets, SIAM J. on Computing, 6 (1977), pp. 505–517.
  • [31] L. Valiant, The complexity of enumeration and reliability problems, SIAM J. on Computing, 8 (1979), pp. 410–421.
  • [32] R. Wille, Restructuring lattice theory: an approach based on hierarchies of concepts, Ordered sets, 1982, pp. 445–470.
  • [33] M. Yannakakis, Node-and edge-deletion NP-complete problems, STOC, 1978, pp. 253–264.
  • [34] Y. Zhang, C. A. Phillips, G. L. Rogers, E. J. Baker, E. J. Chesler, and M. A. Langston, On finding bicliques in bipartite graphs: a novel algorithm and its application to the integration of diverse biological data types, BMC Bioinformatics, 15 (2014).

Appendix A Discussion of Dias et al.

Here we identify graphs containing induced maximal bicliques which would not be discovered by the algorithm of Dias et al. as originally stated in [7], which we refer to as Dias. We then describe a modified approach, which we prove is guaranteed to output all induced maximal bicliques in lexicographic order (footnote 3) in time , where is the number of MIBs in the graph.

Footnote 3: We believe Dias could also be modified by removing the if conditions on lines 11 and 13 to output all MIBs in non-lexicographic order, with runtime , but did not focus on this alternate strategy.

For the reader’s convenience we have transcribed the pseudo-code of Dias from [7] in Algorithm 2. Line 12 relies on a subroutine described in the original paper, which finds the lexicographically least biclique containing a given set of nodes in time. For consistency with [7], we let be the neighbors of node and the non-neighbors throughout this appendix.

Input: Graph , order on V
Output: List of all bicliques of in lexicographic order
1: Find the least biclique of
2:
3: insert in the queue
4: while  do
5:     find the least biclique of
6:     remove from and output it
7:     for each vertex  do
8:         ;
9:         if  or  then
10:             ;
11:             if there exists no such that extends to a biclique of  then
12:                 find the least biclique of containing , if any
13:                 if  and  then
14:                     Include in
15:         Swap contents of and , repeat lines 9 to 14 (once per for loop)
Algorithm 2 Dias pseudo-code [7].
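For readers who want to experiment with the objects manipulated by Algorithm 2, the sketch below checks whether a pair of vertex sets forms an induced biclique and greedily extends a seed biclique to a maximal one by scanning vertices in increasing order. This greedy scan is a stand-in chosen for illustration; it is not claimed to be the least-biclique subroutine of [7], and all names are our own.

```python
def is_induced_biclique(adj, left, right):
    """Both sides non-empty, disjoint and independent; all cross pairs adjacent."""
    left, right = set(left), set(right)
    if not left or not right or left & right:
        return False
    independent = all(adj[u].isdisjoint(side)
                      for side in (left, right) for u in side)
    complete = all(right <= adj[u] for u in left)
    return independent and complete


def greedy_extension(adj, left, right):
    """Extend a seed biclique by adding every vertex that keeps it a biclique.

    Assumes the seed (left, right) is already an induced biclique.
    """
    left, right = set(left), set(right)
    for v in sorted(adj):
        if v in left or v in right:
            continue
        if is_induced_biclique(adj, left | {v}, right):
            left.add(v)
        elif is_induced_biclique(adj, left, right | {v}):
            right.add(v)
    return left, right


# A 4-cycle 0-1-2-3-0: the single edge (0, 1) extends to the whole cycle.
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
extended = greedy_extension(cycle4, {0}, {1})
```

Because a vertex that cannot join a biclique can never join a superset of it, one pass over the vertices suffices for the result to be maximal.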

We point out that needs to be able to recall all MIBs which have been stored in it, which can easily be accomplished by augmenting the data structure without impacting the complexity. Furthermore, the pseudo-code in Algorithm 2 was corrected in line 11 to exclude from the range of values.
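One way such an augmentation could look is a min-heap paired with a hash set that records every biclique ever inserted, so that a biclique is not re-inserted after it has been output. The sketch below is an assumption-laden illustration: the class name is hypothetical, and bicliques are ordered by their sorted vertex tuples, which may differ from the exact lexicographic order used in [7].

```python
import heapq


class RecallingQueue:
    """Min-priority queue of bicliques that remembers all past insertions."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def insert(self, left, right):
        """Insert a biclique; return False if it was ever inserted before."""
        # The vertex set of an induced biclique determines its two sides,
        # so the sorted vertex tuple is used as a canonical key (assumption).
        key = tuple(sorted(set(left) | set(right)))
        if key in self._seen:
            return False
        self._seen.add(key)
        heapq.heappush(self._heap, key)
        return True

    def pop_least(self):
        """Remove and return the least biclique currently queued."""
        return heapq.heappop(self._heap)

    def __bool__(self):
        return bool(self._heap)


queue = RecallingQueue()
queue.insert({1}, {0, 2})
queue.insert({0}, {1})
first = queue.pop_least()
```

The membership test stays available after a biclique has been popped, which is the "recall" property the paragraph above asks for.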

In the proof of correctness in [7] they show that for any MIB there exists a node and another MIB which does not contain , but does contain both and some . They examine the maximal such for each , and consider the iteration where and are defined as such (lines 6 and 7). Without loss of generality, assume that should be added to (as defined in line 8). If only contains non-neighbors of and contains non-neighbors of which are not in , then no biclique will be found in line 14. Thus the algorithm as written will not produce all MIBs, as shown in Figure 3.

Figure 3: An example of a graph where Dias would not find all MIBs. The MIB would not be found. In the notation of the algorithm and should yield . When , no biclique is found as have no common neighbors. When the biclique is found. Even if we removed the if-condition on line 11, when we would add the bicliques and to . When (line 6), no new bicliques are added to and then the next least biclique output in line 5 would not be .

A.1 Modified Dias

We now describe LexMIB and show that it finds all MIBs in lexicographic order. The first issue in Dias arises when in line 8, is empty and does not contain any neighbors of . In this case, we fail to satisfy the if-condition in line 9. The problem is resolved by appending an additional or-condition in line 9: “or ”.

The other problematic case is when is empty and contains non-neighbors of which are not in the lexicographically next MIB (e.g. node for in Figure 3). In this case, in line 13, so we add a new else-if clause to the conditional: “else if and then for all find the least biclique containing ”.

We now argue that LexMIB will find any biclique missed by Dias in the above manner. Let . Let be the least node in which shares an edge with . Consider the iteration of the above process where . Clearly all of is contained in . Furthermore due to the maximality of , the original argument from [7] can be used to show that finding the least biclique containing returns . In Figure 3 where biclique is missed, when and , and , which returns this previously missing biclique .

This augmentation increases the delay time of the algorithm to be since finding the least biclique may be called times within the for loop at line 7, which itself has iterations. Note that this addition does not alter their argument for the bicliques being output in lexicographic order.

Appendix B Extremal case for Zhang et al.

The runtime (footnote 4) stated for the algorithm iMBEA in [34] contains a typo. The corrected runtime is . In their analysis they bound the number of nodes in their search tree and derive their complexity by bounding the time spent on each one by . They show that the number of intermediate search tree nodes is at most , which is , and that the number of leaves is at most which is said to be . However, by the geometric series, the number of leaves can only be bounded by , giving an bound on the number of search tree nodes. We now provide an example of a graph where the number of leaves is , implying that the number of leaves cannot be in the general case.

Footnote 4: where is the number of MIBs found.

Graphs which are a perfect matching plus an apex to one node from each edge in the matching provide an example of a graph family where this extremal behavior manifests. Note that these graphs are bipartite with nodes in the smaller partition and in the larger. See Figure 4 for the instance where and . There are MIBs in such a graph. Assume the nodes in the smaller partition are labeled . For , iMBEA attempts to expand the biclique containing and its two neighbors with all nodes in , each of which creates a leaf in the search tree. Thus there are leaves created and the number of leaves is .
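The graph family is easy to construct and inspect on small instances. The sketch below builds a matching with an apex and counts its maximal induced bicliques by brute force. The vertex names, the choice of which matching endpoint the apex attaches to, and the exhaustive counting routine are all illustrative assumptions; this is not iMBEA and says nothing about its search tree.

```python
from itertools import product


def matching_plus_apex(n):
    """n matching edges (a_i, b_i) plus an apex adjacent to every a_i."""
    adj = {"apex": set()}
    for i in range(n):
        a, b = f"a{i}", f"b{i}"
        adj[a] = {b, "apex"}
        adj[b] = {a}
        adj["apex"].add(a)
    return adj


def is_induced_biclique(adj, left, right):
    if not left or not right:
        return False
    independent = all(adj[u].isdisjoint(side)
                      for side in (left, right) for u in side)
    return independent and all(right <= adj[u] for u in left)


def count_maximal_induced_bicliques(adj):
    """Exhaustive count; exponential, so only for very small graphs."""
    vertices = sorted(adj)
    found = set()
    # Label each vertex 0 (unused), 1 (left side) or 2 (right side).
    for labels in product((0, 1, 2), repeat=len(vertices)):
        left = frozenset(v for v, s in zip(vertices, labels) if s == 1)
        right = frozenset(v for v, s in zip(vertices, labels) if s == 2)
        if not is_induced_biclique(adj, left, right):
            continue
        outside = [v for v in vertices if v not in left and v not in right]
        if any(is_induced_biclique(adj, left | {v}, right)
               or is_induced_biclique(adj, left, right | {v})
               for v in outside):
            continue  # some vertex can still be added, so not maximal
        found.add(frozenset((left, right)))  # ignore the order of the sides
    return len(found)


graph = matching_plus_apex(3)
num_mibs = count_maximal_induced_bicliques(graph)
```

Since the graph is a tree, every biclique in it is a star, which keeps the number of MIBs small relative to the number of candidate expansions described above.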

Figure 4: An example of a graph in which iMBEA would run in time .

Appendix C MCB Algorithm

Figure 5: Runtimes of OCT-MIB and LexMIB under varied bipartite balance conditions. For (Left) and (Right), each curve represents the runtime in seconds of an algorithm on graphs with a given OCT size and varied balance. See the Extended Results: Size/Balance paragraph for details on timeouts.

In this section we present the details of our algorithm MCB for finding all maximal crossing bicliques, along with proofs of correctness and runtime complexity.

MCB Description

Given an instance , MCB enumerates all maximal crossing bicliques. MCB makes use of the same checks as OCT-MIB as described in Section 3.2.

Initialization phase. Recall that in MCB. To begin this phase, we fix an order of , . For each we create a new bag which will hold blueprints. We let , , . We then find all maximal independent sets in in time per MIS [30]. For each MIS we create a blueprint where , , and , , and are defined as above. In MCB , , and are not used in any blueprints.

For each blueprint, if it is not future-maximal, we discard it; otherwise, we add it to . If the blueprint is maximal we add to the set of maximal crossing bicliques and let . If it is not maximal we let be the first node from which is completely connected to . The blueprint remains in in either case. As long as is non-empty we add it to , the set of bags to be processed in the expansion phase.

Expansion phase. Once we have initialized with each , we process bags from until it is empty. We refer to branching in the same manner as in Section 3.2. Note that and are the same in all of the blueprints at bag . We will branch on with all nodes in , which we order to be consistent with the order of . Let be the node we are currently branching on with and be a bag we create to hold new blueprints formed by expanding with . When branching with , we iterate over the blueprints in and expand on them one at a time. Whether an expansion is terminated or completed, we continue with expanding the next blueprint in .

Let be the blueprint currently being expanded on. If we determine the expansion will not yield a future-maximal blueprint, we terminate. Otherwise we will create a new blueprint with values