Faster Biclique Mining in Near-Bipartite Graphs

03/04/2019
by   Blair D. Sullivan, et al.
NC State University

Identifying dense bipartite subgraphs is a common graph data mining task. Many applications focus on the enumeration of all maximal bicliques (MBs), though sometimes the stricter variant of maximal induced bicliques (MIBs) is of interest. Recent work of Kloster et al. introduced a MIB-enumeration approach designed for "near-bipartite" graphs, where the runtime is parameterized by the size k of an odd cycle transversal (OCT), a vertex set whose deletion results in a bipartite graph. Their algorithm was shown to outperform the previously best known algorithm even when k was logarithmic in |V|. In this paper, we introduce two new algorithms optimized for near-bipartite graphs - one which enumerates MIBs in time O(M_I |V||E| k), and another based on the approach of Alexe et al. which enumerates MBs in time O(M_B |V||E| k), where M_I and M_B denote the number of MIBs and MBs in the graph, respectively. We implement all of our algorithms in open-source C++ code and experimentally verify that the OCT-based approaches are faster in practice than the previously existing algorithms on graphs with a wide variety of sizes, densities, and OCT decompositions.

1 Introduction

Bicliques (complete bipartite graphs) naturally arise in many data mining applications, including detecting cyber communities [18], data compression [1], epidemiology [23], artificial intelligence [30], and gene co-expression analysis [15, 16]. In many settings, the bicliques of interest are maximal (not contained in any larger biclique) and/or induced (each side of the bipartition is independent in the host graph), and there is a large body of literature giving algorithms for enumerating all such subgraphs [3, 5, 6, 20, 22, 23, 26, 32]. Many of these approaches make strong structural assumptions on the host graph; the case when the host graph is bipartite has been particularly well-studied, and the iMBEA algorithm of Zhang et al. has been empirically established to be state-of-the-art [32].

Until recently, the only known non-trivial algorithm for enumerating maximal induced bicliques (MIBs) in general graphs was that of Dias et al., which did so in lexicographic order [5]. In [17], Kloster et al. presented a new algorithm for enumerating MIBs in general graphs, OCT-MIB, which extended ideas from iMBEA to work on non-bipartite graphs by using an odd cycle transversal (OCT set): a set of nodes $O \subseteq V$ such that $G[V \setminus O]$ is bipartite. This yielded an algorithm with runtime $O(M_I n m n_O^2 3^{n_O/3})$ where $n_O = |O|$, $M_I$ is the number of MIBs in $G$, and $n$ and $m$ denote $|V|$ and $|E|$, respectively. The $3^{n_O/3}$ term arises from OCT-MIB’s dependence on the number of maximal independent sets (MISs) in $O$.

In this paper, we give new algorithms for enumerating both MIBs and maximal, not necessarily induced bicliques (MBs) in general graphs. We first present OCT-MIB-II, which again leverages odd cycle transversals to enumerate MIBs in time $O(M_I n m n_O)$. In contrast to OCT-MIB, the worst-case runtime of OCT-MIB-II does not depend on the number of MISs in $O$, so its worst-case bound improves on that of OCT-MIB, increasingly so as $n_O$ grows. We also give a second algorithm for MIB-enumeration, Enum-MIB, which has runtime $O(M_I n m)$. Enum-MIB is essentially a modified version of the algorithm of Dias et al. [5], which achieves a faster runtime by dropping the lexicographic output requirement.

In the setting considering non-induced bicliques, the state-of-the-art approach is MICA of Alexe et al. [3]. MICA employs a consensus mechanism to iteratively find maximal bicliques by combining them together, resulting in an $O(M_B n^3)$ algorithm, where $M_B$ is the number of MBs. We introduce a new algorithm, OCT-MICA, which leverages odd cycle transversals and runs in $O(M_B n m n_O)$ time.

Since all graphs have OCT sets (although they can have size $\Theta(n)$, as in cliques), OCT-MIB, OCT-MIB-II, and OCT-MICA can all be run in the general case; their correctness does not require minimality or optimality of the OCT set. Further, we implement OCT-MIB-II, Enum-MIB and OCT-MICA in open source C++ code, and evaluate their performance on a suite of synthetic graphs with known OCT decompositions. Our experiments show that OCT-MICA and OCT-MIB-II are the dominant algorithms for their respective problems in many settings. Their efficiency allows us to run on larger graphs than in [17].

We begin with preliminaries and a brief discussion of related work in Section 2, then describe each of our three new algorithms and provide proofs of their correctness and runtimes in Section 3. We highlight several implementation details in Section 4, before presenting our experimental evaluation in Section 5.

2 Preliminaries

2.1 Related work

The complexity of finding bicliques is well-studied, beginning with the results of Garey and Johnson [7] which establish that in bipartite graphs, finding the largest balanced biclique is NP-hard but the largest biclique can be found in polynomial time. Particularly relevant to the mining setting, Kuznetsov showed that enumerating MBs in a bipartite graph is #P-complete [19]. Finding the biclique with the largest number of edges was shown to be NP-complete in general graphs [31], but the case of bipartite graphs remained open for many years. Several variants (including the weighted version) were proven NP-complete in [4] and in 2000, Peeters finally resolved the problem, proving the edge maximization variant is NP-complete in bipartite graphs [25].

For the problem of enumerating MIBs, the best known algorithm in general graphs is due to Dias et al. [5]; in the non-induced setting, approaches include a consensus algorithm MICA [3], an efficient algorithm for small arboricity [6], and a general framework for enumerating maximal cliques and bicliques [8], with MICA the most efficient among them, running in $O(M_B n^3)$. We note that, as described, the method in [5] may fail to enumerate all MIBs; a modified, correct version was given in [17].

There has also been significant work on enumerating MIBs in bipartite graphs. We note that since all bicliques in a bipartite graph are necessarily induced, non-induced solvers for general graphs (such as MICA) can be applied, and have been quite competitive. The best known algorithm, however, is due to Zhang et al. [32] and directly exploits the bipartite structure. Other approaches in bipartite graphs include frequent closed itemset mining [20] and transformations to the maximal clique problem [22]; faster algorithms are known when a lower bound on the size of bicliques to be enumerated is assumed [23, 26].

Kloster et al. [17] extended techniques for bipartite graphs to the general setting using odd cycle transversals, a form of “near-bipartiteness” which arises naturally in many applications [10, 24, 27]. This work resulted in OCT-MIB, an algorithm for enumerating MIBs in a general graph, parameterized by the size of a given OCT set. Although finding a minimum size OCT set is NP-hard, the problem of deciding if an OCT set of size $k$ exists is fixed parameter tractable (FPT), with algorithms given in [21] and [14]. We note that non-optimal OCT sets only affect the runtime (not correctness) of our algorithms, allowing us to use heuristic solutions. Recent implementations [9] of a heuristic ensemble alongside algorithms from [2, 12] alleviate concerns that finding an OCT decomposition could be a barrier to usability.
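Since correctness only requires that the supplied set really is an OCT set, a cheap sanity check on a heuristic solution can be useful. The following is a minimal sketch, not taken from the paper's code: it 2-colors the graph with the candidate set deleted and reports whether an odd cycle survives. The `Graph` alias and the name `is_oct_set` are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Adjacency lists over vertices 0..n-1 (assumed representation).
using Graph = std::vector<std::vector<std::size_t>>;

// Returns true if deleting the vertices flagged in `in_oct` leaves a
// bipartite graph, i.e. if the flagged vertices form an OCT set.
bool is_oct_set(const Graph& g, const std::vector<bool>& in_oct) {
    assert(in_oct.size() == g.size());
    const std::size_t n = g.size();
    std::vector<int> color(n, -1);  // -1 = unvisited, otherwise side 0 or 1
    for (std::size_t s = 0; s < n; ++s) {
        if (in_oct[s] || color[s] != -1) continue;
        color[s] = 0;
        std::queue<std::size_t> q;
        q.push(s);
        while (!q.empty()) {
            const std::size_t u = q.front();
            q.pop();
            for (std::size_t w : g[u]) {
                if (in_oct[w]) continue;      // w is deleted
                if (color[w] == -1) {
                    color[w] = 1 - color[u];  // opposite side of u
                    q.push(w);
                } else if (color[w] == color[u]) {
                    return false;             // an odd cycle survives
                }
            }
        }
    }
    return true;
}
```

The check runs in time linear in the size of the graph, so it is negligible next to the enumeration itself.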

2.2 Notation and terminology

Let $G = (V, E)$ be a graph; we set $n = |V|$ and $m = |E|$. We define $N(v)$ to be the neighborhood of $v \in V$ and write $\bar{N}(v)$ for $v$’s non-neighbors. An independent set $T \subseteq V$ is a maximal independent set (MIS) if $T$ is not contained in any other independent set of $G$. Unless otherwise noted, we assume without loss of generality that $G$ is connected.

A biclique $X \times Y$ in a graph $G = (V, E)$ consists of non-empty disjoint sets $X, Y \subset V$ such that every vertex of $X$ is neighbors with every vertex of $Y$. We say a biclique $X \times Y$ is induced if both $X$ and $Y$ are independent sets in $G$. A maximal biclique (MB) in $G$ is a biclique not properly contained in any other; a maximal induced biclique (MIB) is analogous among induced bicliques. We use $M_B$ and $M_I$ to denote the number of MBs and MIBs in $G$, respectively. If $O$ is an OCT set in $G$, we denote the corresponding OCT decomposition of $G$ by $G[L, R, O]$, where the induced subgraph $G[L \cup R]$ is bipartite. We write $n_B$ and $n_O$ for $|L \cup R|$ and $|O|$, respectively.
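To make the two definitions concrete, the sketch below checks whether a pair of vertex sets forms a biclique, and whether it is additionally induced. It is only an illustration of the definitions; the type aliases and function names are assumptions, and the quadratic checks are written for clarity rather than speed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using VertexSet = std::vector<std::size_t>;
using Graph     = std::vector<VertexSet>;  // sorted adjacency lists, vertices 0..n-1

bool adjacent(const Graph& g, std::size_t u, std::size_t v) {
    assert(u < g.size() && v < g.size());
    return std::binary_search(g[u].begin(), g[u].end(), v);
}

// No two vertices of s are adjacent.
bool is_independent(const Graph& g, const VertexSet& s) {
    for (std::size_t i = 0; i < s.size(); ++i)
        for (std::size_t j = i + 1; j < s.size(); ++j)
            if (adjacent(g, s[i], s[j])) return false;
    return true;
}

// Non-empty, disjoint sides with every cross pair adjacent.
bool is_biclique(const Graph& g, const VertexSet& x, const VertexSet& y) {
    if (x.empty() || y.empty()) return false;
    for (std::size_t u : x)
        for (std::size_t v : y)
            if (u == v || !adjacent(g, u, v)) return false;
    return true;
}

// A biclique whose two sides are both independent sets.
bool is_induced_biclique(const Graph& g, const VertexSet& x, const VertexSet& y) {
    return is_biclique(g, x, y) && is_independent(g, x) && is_independent(g, y);
}
```

In a triangle on vertices 0, 1, 2, the pair ({0}, {1, 2}) is a biclique but not an induced one, because 1 and 2 are adjacent.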

3 Algorithms

In this section we provide three novel algorithms, two of which solve Maximal Induced Biclique Enumeration (Enum-MIB and OCT-MIB-II) and the third of which solves Maximal Biclique Enumeration (OCT-MICA). Both Enum-MIB and OCT-MIB-II follow the same general framework, which we now describe.

3.1 MIB Algorithm Framework

The MIB-enumeration algorithms both use two subroutines, MakeIndMaximal and AddTo. MakeIndMaximal takes in , where is an induced biclique and , and either returns a MIB where , , , or returns . If it returns and then there is another MIB which contains and . AddTo takes in where is an induced biclique and , and returns the induced biclique where is added to , is removed from , and is removed from if ; otherwise, is returned. Both MakeIndMaximal and AddTo operate in time. We defer algorithmic details and proofs of the complexity and correctness for these routines to the Appendix.

The MIB-enumeration framework (shown in Algorithm 1) begins by finding a seed set of MIBs . At a high level, it operates by attempting to add vertices from the designated set to previously found MIBs to make them maximal. We utilize a dictionary to track which MIBs have already been found and a queue to store bicliques which have not yet been explored. We now prove two technical lemmas used to show the correctness of this framework.

1:Input: ,
2: FindSeedSet set of initial MIBs
3:Add each to and
4:while  is not empty do
5:     pop
6:     for  do
7:         
8:         
9:         if  is not in  then
10:              Add to and          
11:         
12:         
13:         if  is not in  then
14:              Add to and               
15:return
Algorithm 1 MIB-enumeration algorithm framework
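The control flow of the framework (seed, pop, try each designated vertex, keep anything new) can be sketched as a worklist loop. This is a hedged skeleton rather than the authors' implementation: the callback `expand` is a hypothetical stand-in for the AddTo and MakeIndMaximal calls on both sides of a biclique, and `canonical`, `enumerate`, and the type aliases are likewise assumed names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <set>
#include <utility>
#include <vector>

using VertexSet = std::vector<std::size_t>;
using Biclique  = std::pair<VertexSet, VertexSet>;  // (X, Y)

// Canonical form, so that (X, Y) and (Y, X) are stored once.
Biclique canonical(Biclique b) {
    std::sort(b.first.begin(), b.first.end());
    std::sort(b.second.begin(), b.second.end());
    if (b.second < b.first) std::swap(b.first, b.second);
    return b;
}

// Hypothetical hook: the MIBs obtained from b by trying to add vertex v
// to either side (possibly none).
using Expand = std::function<std::vector<Biclique>(const Biclique&, std::size_t)>;

std::set<Biclique> enumerate(const std::vector<Biclique>& seeds,
                             const VertexSet& designated,
                             const Expand& expand) {
    assert(expand && "expand callback must be set");
    std::set<Biclique> found;   // plays the role of the dictionary
    std::queue<Biclique> todo;  // bicliques not yet explored
    for (const Biclique& s : seeds) {
        Biclique c = canonical(s);
        if (found.insert(c).second) todo.push(c);
    }
    while (!todo.empty()) {
        const Biclique b = todo.front();
        todo.pop();
        for (std::size_t v : designated) {
            for (const Biclique& nb : expand(b, v)) {
                Biclique c = canonical(nb);
                if (found.insert(c).second) todo.push(c);  // only new ones
            }
        }
    }
    return found;
}
```

Each biclique enters the queue at most once, which is the property the runtime analysis of the framework relies on.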
Lemma 1

Let be a MIB in graph which contains a non-empty subset of , another MIB in . Running AddTo with parameters and returns a biclique which contains , , and if .

Proof

By construction, must be independent from and completely connected to . Thus, none of will be removed from and all of will remain in , as required. Therefore, as long as , the desired biclique is returned.

Lemma 2

In Algorithm 1, if there exists a MIB in such that , and , for each MIB in , then all MIBs in are included in .

Proof

Assume not. Let be a MIB in which is not in with maximum. Let be the MIB in such that , and and let . Without loss of generality assume and .

Consider the iteration of Algorithm 1 when and (lines 5-6). By Lemma 1, one of the calls to AddTo returns an induced biclique which contains , , and . Both sides of are non-empty (since and ). If we obtain a contradiction, as MakeIndMaximal would return , resulting in its addition to . Otherwise, either MakeIndMaximal returns or a biclique which is added to . Since both sides of are nonempty, if MakeIndMaximal returns , there exists a MIB in containing and . Let be such a MIB; since it has more vertices in than , it must be in , and we set . In either case, , . We can repeat this argument for the new , noting that will include vertices on both sides. Thus, the argument still holds without any assumption on the non-empty side of the intersection and will strictly decrease; when it reaches 0, , a contradiction.

Note that as MakeIndMaximal only returns MIBs, this framework will only include MIBs in . Together with Lemma 2, this yields the following corollary.

Corollary 1

If for every MIB there is a MIB such that , and , then upon completion of Algorithm 1, will contain exactly the MIBs in .

Recall that AddTo and MakeIndMaximal each run in time. Combining this with the fact that each MIB in is popped at most once from we have:

Corollary 2

The time complexity of this framework is , where is the time needed by FindSeedSet to compute .

3.2 Enum-MIB

We now present Enum-MIB, which follows the MIB-enumeration framework. To form , for each vertex we run MakeIndMaximal where and add it to . We also let . To show the correctness of this approach, we note that and any MIB contains the empty set. Thus all that remains to show is that for each MIB there is a MIB in with which it has a non-empty intersection. As every is in some MIB in , this condition is met. Thus, via Corollary 1, Enum-MIB will find all MIBs. There may be duplicates in which can be removed in time per duplicate. As MakeIndMaximal runs in time, by Corollary 2, the time complexity of Enum-MIB is . We note that Enum-MIB is essentially a simplified version of the LexMIB algorithm from [17] which does not guarantee lexicographic order on output.

3.3 OCT-MIB-II

Next we describe OCT-MIB-II, an algorithm for enumerating all MIBs in a graph $G$ with a given OCT decomposition $G[L, R, O]$. OCT-MIB-II also makes use of the MIB-enumeration framework described in Section 3.1. In the calls to MakeIndMaximal we let the designated vertex set be $O$. To form the seed set, we begin by running iMBEA [32] to find the set of MIBs in $G[L \cup R]$. For each of these we run MakeIndMaximal. This creates a set of MIBs in $G$.

Then for each node $v \in O$, we find the set of MISs in the subgraph induced by the non-OCT neighbors of $v$. This can be done in $O(nm)$ time per MIS using the algorithm of Tsukiyama et al. [28]. For each MIS found, we run MakeIndMaximal on the induced biclique formed by $v$ and the MIS. The MIBs produced by this process form a multiset: a MIB may appear up to $n_O$ times (once per $v \in O$, each stemming from an MIS among the non-OCT neighbors of $v$), but the duplicates can be removed. The seed set is the union of the deduplicated MIBs and the MIBs obtained from iMBEA. Thus, FindSeedSet runs in $O(n m n_O)$ per unique MIB found, and by Corollary 2, the total time complexity of OCT-MIB-II is $O(M_I n m n_O)$.

To show the correctness of OCT-MIB-II, we must show that for every MIB in $G$, the seed set includes a MIB which contains all of its non-OCT nodes, and which contains a node of the MIB if the MIB is completely contained in $O$. If an entire MIB is contained in $O$, then any seed MIB containing one of its nodes suffices. If a MIB has non-OCT nodes on both sides, then there must be a MIB in the seed set which contains these non-OCT nodes because there is a MIB in $G[L \cup R]$ containing them. If a MIB has all of its non-OCT nodes on one side, then there is an OCT node $v$ which is neighbors with all of the non-OCT nodes, which thus must be contained in an MIS among the non-OCT neighbors of $v$. Thus, by Corollary 1, we find all of the MIBs in $G$.

3.4 OCT-MICA

OCT-MICA is an algorithm for enumerating the maximal bicliques (MBs) in a general graph $G$ with a given OCT decomposition $G[L, R, O]$. We adapt the approach of MICA [3], which relies on a seed set of bicliques which “cover” the graph. Specifically, we restrict MICA’s coverage requirement for the seed set to only the OCT set and leverage iMBEA [32] to enumerate the MBs entirely within $G[L \cup R]$. This reduces the runtime from $O(M_B n^3)$ to $O(M_B n m n_O)$.

OCT-MICA begins by running iMBEA (line 2 in Algorithm 2) to get , the MBs in , in time , where is the number of edges in and . Using MakeMaximal, we convert elements of to be maximal with respect to (lines 3-4). MakeMaximal runs in time and its algorithmic details are deferred to the Appendix. OCT-MICA then initializes its seed set of size consisting of bicliques from the stars of the OCT set (lines 6-8), and adds these to the working set of all identified MBs (line 9). Similar to MICA, the remainder of the algorithm builds new bicliques by combining (via Consensus, see Appendix) pairs of elements from the seed set and previously identified MBs (lines 11-20), until no new bicliques are generated. This runs in time .

1:procedure Enumerate()
2:      Implementation of iMBEA,
3:     for  do
4:          Extend in place,      
5:     
6:     for  in  do Initialize Bicliques from stars,
7:         
8:         .add()      
9:     .
10:     sort()
11:     
12:     while  do
13:         
14:         for  in  do
15:              for  in  do
16:                  for  in Consensus(do
17:                       
18:                       if  not in  then
19:                           
20:                           .InsertInSortedOrder()                                                                      
21:     return
Algorithm 2 OCT-MICA
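The two building blocks used in the loop above can be sketched on sorted vertex vectors. This is a hedged illustration, not the paper's appendix routines: `make_maximal` extends a non-induced biclique by taking common neighborhoods, and `consensus` forms the four standard consensus candidates of two bicliques (intersect one side from each, unite the opposite sides). All names and aliases are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

using VertexSet = std::vector<std::size_t>;         // sorted, no duplicates
using Graph     = std::vector<VertexSet>;           // sorted adjacency lists
using Biclique  = std::pair<VertexSet, VertexSet>;  // (X, Y)

VertexSet intersect(const VertexSet& a, const VertexSet& b) {
    VertexSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

VertexSet unite(const VertexSet& a, const VertexSet& b) {
    VertexSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}

// Vertices adjacent to every member of s.
VertexSet common_neighbors(const Graph& g, const VertexSet& s) {
    assert(!s.empty());
    VertexSet out = g[s.front()];
    for (std::size_t i = 1; i < s.size(); ++i) out = intersect(out, g[s[i]]);
    return out;
}

// Extend a (not necessarily induced) biclique to a maximal one.
Biclique make_maximal(const Graph& g, const Biclique& b) {
    VertexSet y = common_neighbors(g, b.first);  // largest Y for this X
    VertexSet x = common_neighbors(g, y);        // largest X for that Y
    return Biclique(x, y);
}

// Consensus candidates of a and b; empty intersections are dropped.
std::vector<Biclique> consensus(const Biclique& a, const Biclique& b) {
    const VertexSet* sa[2] = {&a.first, &a.second};
    const VertexSet* sb[2] = {&b.first, &b.second};
    std::vector<Biclique> out;
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j) {
            VertexSet common = intersect(*sa[i], *sb[j]);
            if (common.empty()) continue;
            out.push_back(Biclique(common, unite(*sa[1 - i], *sb[1 - j])));
        }
    }
    return out;
}
```

Each candidate is a valid biclique because every vertex in the intersection is adjacent to both of the united sides, but a candidate need not be maximal, which is why each one is passed through the maximality step before being looked up.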
Lemma 3

OCT-MICA returns exactly , the set of maximal bicliques in .

Proof

Running iMBEA and MakeMaximal ensures all maximal bicliques from were found and added to . Thus, we restrict our attention to maximal bicliques with at least one node from , and proceed similarly to the proof of Theorem 3 in [3]. We say that a biclique absorbs a biclique if and or and .

We show that every biclique in is absorbed by some biclique in by induction on , the number of OCT vertices in . In the base case (), is contained in and is absorbed by a biclique in . We now consider ; without loss of generality, assume contains some OCT vertex . Then is absorbed by some biclique , where is formed from the star centered on . Further, has fewer vertices from OCT than , so by induction it is absorbed by some biclique , where . Now is a consensus of and , and will be absorbed by the corresponding consensus of and , guaranteeing absorption by a biclique in .

Lemma 4

The runtime of OCT-MICA after iMBEA is .

Proof

We begin by noting that , so is .

Finding the bicliques in requires time for iMBEA (line 2); making them maximal (lines 3-4) is . The bicliques generated by the OCT stars (lines 6-8) can be found in . Sorting the initial set (line 10) incurs an additional . Since is , the total runtime for our initialization (lines 2-10) is .

The consensus-building stage of OCT-MICA contains nested loops over (line 14) and (line 15), which execute at most and times, respectively. The Consensus operation (line 16) executes in , and produces a constant number of candidate bicliques to check. Each execution of the inner loop incurs a cost of for MakeMaximal (line 17) and to insert the new MB in sorted order (lines 18-20). We note that the runtime of Consensus is dominated by the cost of the loop. Thus, the total runtime of consensus-building is , or .

This analysis leads to an overall runtime of , as desired. We note that for , OCT-MICA’s runtime degenerates to the of MICA. Additionally, the stronger results for incremental polynomial time described for MICA in [3] still apply; the proofs are similar and are omitted for space. For bipartite graphs (), OCT-MICA is effectively iMBEA, which was empirically shown to be more efficient than MICA on bipartite graphs [32].

4 Implementation

In this section we describe several relevant implementation details and design decisions.

4.1 Algorithm Framework

We always (re-label and) store vertices as $0, \ldots, n-1$ and maintain internal dictionaries as needed to recover original labels – e.g. when taking subgraphs. This allows us to leverage native data types and structures; vertices are stored as size_t.

For efficiency in subroutines, we utilize two representations of $G$. One representation is as adjacency lists, stored as sorted vectors (to speed up union and intersection relative to dictionaries or unsorted vectors). This representation is essential to the performance of Consensus in MICA/OCT-MICA and of MakeIndMaximal and AddTo in OCT-MIB or OCT-MIB-II. We also store the graph as a dictionary of dictionaries, which is more amenable to taking subgraphs (as when finding MISs in OCT-MIB and OCT-MIB-II). Deleting a node requires $O(\Delta)$ time as compared to $O(\Delta^2)$, where $\Delta$ is the maximum degree, in the adjacency list representation.
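As an illustration of why the hashed representation suits subgraph extraction, the sketch below deletes a vertex with one expected constant-time erase per neighbor. It uses `std::unordered_map` of `std::unordered_set` as a stand-in for the dictionary of dictionaries; the actual containers in the authors' code may differ, and the names here are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>

// Stand-in for the dictionary-of-dictionaries representation.
using DictGraph =
    std::unordered_map<std::size_t, std::unordered_set<std::size_t>>;

void add_edge(DictGraph& g, std::size_t u, std::size_t v) {
    assert(u != v);  // simple graphs only
    g[u].insert(v);
    g[v].insert(u);
}

// Remove v and its incident edges: one expected O(1) erase per neighbor.
void delete_vertex(DictGraph& g, std::size_t v) {
    auto it = g.find(v);
    if (it == g.end()) return;
    for (std::size_t u : it->second) {
        auto nb = g.find(u);
        if (nb != g.end()) nb->second.erase(v);
    }
    g.erase(it);
}
```

With sorted vectors, the same deletion has to locate and shift entries in each neighbor's list, which is where the extra factor of the degree comes from.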

4.2 MICA

The public implementation of MICA used in [32] is available at [13]. However, this implementation is only suitable for bipartite graphs as it makes certain efficiency improvements in storage, etc. which assume bipartite input. As such, we implemented MICA from scratch in the same framework as OCT-MIB and OCT-MICA, etc., using the data structures discussed above. This is incompatible with the technique described in [3] for storing only one side of each biclique (since in the non-induced case, maximality completely determines the other side). We note this could improve efficiency of both MICA and OCT-MICA in a future version of our software, and should not significantly affect their relative performance as analyzed in this work.

5 Experiments

5.1 Data and experimental setup

Figure 1: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where , , and . The expected edge density between and was varied; all other densities were .

We implemented OCT-MIB-II, Enum-MIB, MICA, and OCT-MICA in C++, and used the implementation of OCT-MIB from [17]. All code is open source under a BSD 3-clause license and publicly available as part of MI-Bicliques at [11].

Data

For convenience, throughout this section, we assume and let . Our synthetic data was generated using a modified version of the random graph generator of Zhang et al. [32] that augments random bipartite graphs to have OCT sets of known size. The generator allows a user to specify the sizes of $L$, $R$, and $O$, the expected edge densities between $L$ and $R$, between $O$ and $L \cup R$, and within $O$, and the coefficient of variation (the standard deviation divided by the mean) of the expected number of neighbors in $R$ over $L$ and in $L \cup R$ over $O$. The generator is seeded for replicability. We use the naïve OCT decomposition returned by the generator for our algorithm evaluation, but the techniques mentioned in Section 2 could also be used to find alternative OCT sets. Unless otherwise specified, the following default parameters are used: expected edge density , , and ; additionally, the edge density between and is the same as that between and .

To add the edges between $L$ and $R$, the edge density and coefficient of variation are used to assign vertex degrees to $L$, and then neighbors are selected from $R$ uniformly at random; this was implemented in the generator of [32]. Edges are added between $O$ and $L \cup R$ via the same process, only with the corresponding edge density and coefficient of variation. Finally, we add edges within $O$ with an Erdős-Rényi process based on expected density (no coefficient of variation is used here).
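The last step, adding edges within the OCT set, can be sketched as a seeded Erdős-Rényi draw. This is a minimal illustration and not the generator's actual code; it assumes, hypothetically, that the OCT vertices occupy a contiguous index range, and the function name and parameters are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Edge = std::pair<std::size_t, std::size_t>;

// Independently include each pair among the vertices
// first, ..., first + count - 1 with probability `density`.
std::vector<Edge> erdos_renyi_edges(std::size_t first, std::size_t count,
                                    double density, unsigned seed) {
    assert(density >= 0.0 && density <= 1.0);
    std::mt19937 rng(seed);                     // seeded for replicability
    std::bernoulli_distribution coin(density);  // one flip per vertex pair
    std::vector<Edge> edges;
    for (std::size_t i = 0; i < count; ++i)
        for (std::size_t j = i + 1; j < count; ++j)
            if (coin(rng)) edges.emplace_back(first + i, first + j);
    return edges;
}
```

Calling the function twice with the same seed in the same build yields the same edge list, which mirrors the seeding described above.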

In most experiments we limit to be , and use a timeout of one hour (3600s). Unless otherwise noted we run each parameter setting with five seeds and plot the average over these instances, using the time-out value as the runtime for instances that don’t finish. If not all instances used for a plot point finished, we annotate it with the number of instances that did not time out.
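The aggregation rule described here (timed-out runs count as the timeout value, and the number of finished runs is reported alongside the mean) can be written down directly. The struct and function names below are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Run {
    bool finished;   // false if the run hit the timeout
    double seconds;  // wall-clock time; ignored when !finished
};

struct PlotPoint {
    double mean_seconds;   // average with timeouts counted at the limit
    std::size_t finished;  // number of runs that completed
};

PlotPoint summarize(const std::vector<Run>& runs, double timeout) {
    assert(timeout > 0.0);
    PlotPoint p{0.0, 0};
    if (runs.empty()) return p;
    double total = 0.0;
    for (const Run& r : runs) {
        if (r.finished && r.seconds < timeout) {
            total += r.seconds;
            ++p.finished;
        } else {
            total += timeout;  // unfinished runs are charged the timeout
        }
    }
    p.mean_seconds = total / static_cast<double>(runs.size());
    return p;
}
```

For example, runs of 10s and 20s plus one timeout at 3600s average to 1210s with two finished instances, so such a point would be annotated with "2".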

We began by running our algorithms on the same corpus of graphs as in [17] (see  5.2). As the new algorithms finished considerably faster than those in [17], we were able to scale up both and to create new sets of experiments, discussed in 5.3. We also ran our algorithms on computational biology graphs from [29], which have been shown to be near-bipartite; these results are in 5.4.

Hardware

All experiments were run on identical hardware; each server had four Intel Xeon E5-2623 v3 CPUs (3.00GHz) and 64GB DDR4 memory. The servers ran Fedora 27 with Linux kernel 4.16.7-200.fc27.x86_64. The C/C++ codes were compiled using gcc/g++ 7.3.1 with optimization flag -O3.

5.2 Initial Benchmarking

We begin by evaluating our algorithms on the corpus of graphs used in [17]. This dataset was designed to independently test the effect of each generator parameter (including the expected densities in various regions of the graph and the coefficient of variation values) on the algorithms’ runtime. We observe that OCT-MIB-II and OCT-MICA are generally the best algorithms for their respective problems, and include comprehensive plots of all experiments in the Appendix.

For Maximal Induced Biclique Enumeration, we observe that in general, OCT-MIB-II outperforms OCT-MIB and Enum-MIB. This is the case when the varying parameter is the density within $O$, the coefficient of variation between $L$ and $R$, the size of the OCT set $n_O$, and the ratio between and , amongst other settings. In these “near-bipartite” synthetic graphs, Enum-MIB unsurprisingly is slowest on most instances. When and , Enum-MIB outperforms OCT-MIB when the density within $O$ increases above 0.05. This is likely due to the adverse effect of the number of MISs in the OCT set on OCT-MIB. The most interesting observation occurs when varying the edge density between and (left panel of Figure 1). In the case, OCT-MIB-II is the fastest algorithm until the density exceeds 0.11, when OCT-MIB becomes faster. We believe this is likely due to OCT-MIB efficiently pruning away attempted expansions which are guaranteed to fail, while the number of MISs in $O$ does not increase. This behavior is also seen in the case where , though the magnitude of the difference is not as extreme.

In the non-induced setting of Maximal Biclique Enumeration, OCT-MICA consistently outperforms MICA on this corpus, typically by at least an order of magnitude. The more interesting takeaway is that both MB-enumerating algorithms run considerably faster than their MIB-enumerating counterparts (e.g. right panel of Figure 1), mostly because the number of MIBs is often one to two orders of magnitude larger than the number of MBs in these instances.

Figure 2: Runtimes of the OCT-based MIB-enumeration algorithms on graphs where and varies. In the left panel, () and the expected edge density is . In the right panel, () and the expected edge density (excluding within ) is ; the marker-type denotes the expected edge density within (see legend). For these larger instances we used 3 seeds and a 7200s timeout.

5.3 Larger Graphs

Given the much faster runtimes achieved in Section 5.2, we created a new corpus of larger synthetic graphs. For Maximal Induced Biclique Enumeration, we scaled up to 10,000 and varied in two settings, increasing the timeout to 7200 seconds. When the expected density was 0.03 and , OCT-MIB-II outperformed OCT-MIB for all values of by at least an order of magnitude and finished on all instances, whereas OCT-MIB timed out on all instances with (left panel of Figure 2). However, when the expected density was 0.01 and , OCT-MIB was faster (right panel of Figure 2). We speculated that this was due to the sparsity of $O$, allowing for a speed-up due to the efficient pruning of OCT-MIB similar to what was seen in Section 5.2. To test this theory, we increased the expected edge density within $O$ to 0.05 while leaving the other parameters the same (right panel of Figure 2), and observed that once , OCT-MIB-II outperforms OCT-MIB, which supports our hypothesis.

For Maximal Biclique Enumeration, we also designed a new experiment where and was scaled up to 1000 (left panel of Figure 3). OCT-MICA finished on all instances, whereas MICA finished on none when was 1000. We also tested how large we could scale the expected density between and (right panel of Figure 3). When , OCT-MICA finished on all instances with density at most 0.4, while MICA finished on two of five when density is 0.4. Neither algorithm finished in less than the timeout of an hour when the density was 0.5 or greater, exhausting the hardware’s memory in many cases. Thus OCT-MICA is able to scale to graphs with considerably larger OCT sets and higher density than both MICA and the MIB-enumerating algorithms.

We additionally created graphs with , which was not done in [17], and ran the algorithms for both MIBs and MBs (Figure 4). These graphs had values up to 4000 and for each value of , we used three values of ; and . The results were most interesting for the MIB-enumerating algorithms (Figure 4 top). OCT-MIB performed the worst of the three algorithms when , but outperformed Enum-MIB in the other settings. This verifies the analysis from [17] on the range in which OCT-MIB is most effective. In general, OCT-MIB-II once again was the fastest algorithm and did best when was smaller. The impact of on OCT-MIB-II and Enum-MIB appeared comparable. In the MB-enumeration case, OCT-MICA consistently outperforms MICA, and there is a distinguishable difference in the runtime based on the value of (Figure 4 bottom). The value of has far less effect on MICA, which does not finish on any graphs with .

Figure 3: Runtimes of the MB-enumerating algorithms on graphs with larger and expected edge density. In the left panel, (), the expected edge density is , and varied. In the right panel, (), , and the expected edge density varied.
(a)
(b)
Figure 4: Runtimes of the MIB-enumerating (top) and MB-enumerating (bottom) algorithms on graphs where and all expected edge densities are . is varied (x-axis) and the marker-type denotes the value of (see legend). The time-out value is set to 7200s for the MIB-enumerating algorithms and 3600s for the MB-enumerating algorithms.

5.4 Computational Biology Data

Finally, we tested performance on real-world data using the graphs from [29], which come from computational biology. These graphs have previously been shown to have small OCT sets [12], and we used the implementation from [9] of Hüffner’s iterative compression algorithm [12] to find the OCT decompositions. Computing the OCT decomposition for each graph took less than ten seconds, and often less than one second. As can be seen in Table 1, OCT-MIB-II performs the best of the MIB-enumerating algorithms and OCT-MICA is faster than MICA. Full results are in the Appendix.

Graph   n     m      n_O   M_I     OCT-MIB-II   OCT-MIB   Enum-MIB   M_B     OCT-MICA   MICA
aa-24   258   1108   21    3890    2.108        9.140     14.167     1334    0.237      2.477
aa-30   39    71     4     56      0.002        0.007     0.006      36      0.002      0.007
aa-41   296   1620   40    11705   16.519       82.439    50.205     20375   9.059      47.789
aa-50   113   468    18    1272    0.322        0.778     1.098      1074    0.132      0.612
j-20    241   640    1     274     0.013        0.065     0.484      228     0.009      0.188
j-24    142   387    4     150     0.013        0.027     0.089      104     0.007      0.025
Table 1: A sampling of the runtimes of the biclique-enumeration algorithms on the Wernicke-Hüffner computational biology data [29].

6 Conclusion

We present a suite of new algorithms for enumerating maximal (induced) bicliques in general graphs, two of which are parameterized by the size of an odd cycle transversal. It is particularly noteworthy that the parameterized algorithms empirically outperform the general approaches even when their asymptotic worst-case complexities are worse. This highlights a weakness of standard complexity analysis, as many aspects of an algorithm get “swept under the rug”.

It is also interesting that even though Maximal Induced Biclique Enumeration and Maximal Biclique Enumeration are closely related problems, the MB-enumerating algorithms are often an order of magnitude faster than their MIB-enumerating counterparts. The reason for this can likely be attributed to two causes: the number of MBs is significantly less than the number of MIBs in sparse graphs, and that the stricter structure of MIBs requires more work to ensure. For , there is exactly one MB of the form in , but there can be many MIBs with this structure.

We implement and benchmark all of the algorithms on a corpus of synthetic and real-world computational biology graphs, and establish that parameterized approaches are often at least an order of magnitude faster than the general approaches. This remains true even when . It would be interesting to experimentally evaluate as increases, at what point the standard methods outperform those optimized for near-bipartite graphs. Finally, we note as in  [17], the current implementations of the algorithms could be improved by replacing the MIS-enumeration algorithm with that of [28], and the M(I)B-enumeration on bipartite graphs with the implementation used in [32].

References

  • [1] P. Agarwal, N. Alon, B. Aronov, and S. Suri, Can visibility graphs be represented compactly?, Discrete & Computational Geometry, 12 (1994), pp. 347–365.
  • [2] T. Akiba and Y. Iwata, Branch-and-reduce exponential/fpt algorithms in practice: A case study of vertex cover, Theoretical Computer Science, 609 (2016), pp. 211–225.
  • [3] G. Alexe, S. Alexe, Y. Crama, S. Foldes, P. Hammer, and B. Simeone, Consensus algorithms for the generation of all maximal bicliques, Discrete Applied Mathematics, 145 (2004), pp. 11–21.
  • [4] M. Dawande, P. Keskinocak, J. Swaminathan, and S. Tayur, On bipartite and multipartite clique problems, J. of Algorithms, 41 (2001), pp. 388–403.
  • [5] V. Dias, C. De Figueiredo, and J. Szwarcfiter, Generating bicliques of a graph in lexicographic order, Theoretical Computer Science, 337 (2005), pp. 240–248.
  • [6] D. Eppstein, Arboricity and bipartite subgraph listing algorithms, Inf. Process. Lett., 51 (1994), pp. 207–211.
  • [7] M. Garey and D. Johnson, Computers and intractability: a guide to NP-completeness, 1979.
  • [8] A. Gély, L. Nourine, and B. Sadi, Enumeration aspects of maximal cliques and bicliques, Discrete Applied Mathematics, 157 (2009), pp. 1447–1459.
  • [9] T. Goodrich, E. Horton, and B. Sullivan, Practical Graph Bipartization with Applications in Near-Term Quantum Computing, arXiv preprint arXiv:1805.01041, 2018.
  • [10] N. Gülpinar, G. Gutin, G. Mitra, and A. Zverovitch, Extracting pure network submatrices in linear programs using signed graphs, Discrete Applied Mathematics, 137 (2004), pp. 359–372.
  • [11] E. Horton, K. Kloster, B. D. Sullivan, A. van der Poel, MI-bicliques, https://github.com/TheoryInPractice/MI-bicliques, October 2018.
  • [12] F. Hüffner, Algorithm engineering for optimal graph bipartization, International Workshop on Experimental and Efficient Algorithms, 2005, pp. 240–252.
  • [13] W. Chang, Maximal Biclique Enumeration, http://genome.cs.iastate.edu/supertree/download/biclique/README.html, December 2004.
  • [14] Y. Iwata, K. Oka, and Y. Yoshida, Linear-time FPT algorithms via network flow, SODA, 2014, pp. 1749–1761.
  • [15] M. Kaytoue-Uberall, S. Duplessis, and A. Napoli, Using formal concept analysis for the extraction of groups of co-expressed genes, Modelling, Computation and Optimization in Information Systems and Management Sciences, 2008, pp. 439–449.
  • [16] M. Kaytoue, S. Kuznetsov, A. Napoli, and S. Duplessis, Mining gene expression data with pattern structures in formal concept analysis, Information Sciences, 181 (2011), pp. 1989–2011.
  • [17] K. Kloster, B. Sullivan, and A. van der Poel, Mining Maximal Induced Bicliques using Odd Cycle Transversals, Proceedings of the 2019 SIAM International Conference on Data Mining, 2019 (to appear).
  • [18] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Trawling the Web for emerging cyber-communities, Computer Networks, 31 (1999), pp. 1481–1493.
  • [19] S. Kuznetsov, On computing the size of a lattice and related decision problems, Order, 18 (2001), pp. 313–321.
  • [20] J. Li, G. Liu, H. Li, and L. Wong, Maximal biclique subgraphs and closed pattern pairs of the adjacency matrix: A one-to-one correspondence and mining algorithms, IEEE Trans. Knowl. Data Eng., 19 (2007), pp. 1625–1637.
  • [21] D. Lokshtanov, S. Saurabh, and S. Sikdar, Simpler parameterized algorithm for OCT, International Workshop on Combinatorial Algorithms, 2009, pp. 380–384.
  • [22] K. Makino and T. Uno, New algorithms for enumerating all maximal cliques, Scandinavian Workshop on Algorithm Theory, 2004, pp. 260–272.
  • [23] R. Mushlin, A. Kershenbaum, S. Gallagher, and T. Rebbeck, A graph-theoretical approach for pattern discovery in epidemiological research, IBM Systems J., 46 (2007), pp. 135–149.
  • [24] A. Panconesi and M. Sozio, Fast hare: A fast heuristic for single individual SNP haplotype reconstruction, International workshop on algorithms in bioinformatics, 2004, pp. 266–277.
  • [25] R. Peeters, The maximum edge biclique problem is NP-complete, Discrete Applied Mathematics, 131 (2003), pp. 651–654.
  • [26] M. Sanderson, A. Driskell, R. Ree, O. Eulenstein, and S. Langley, Obtaining maximal concatenated phylogenetic data sets from large sequence databases, Molecular Biology and Evolution, 20 (2003), pp. 1036–1042.
  • [27] J. Schrook, A. McCaskey, K. Hamilton, T. Humble, and N. Imam, Recall Performance for Content-Addressable Memory Using Adiabatic Quantum Optimization, Entropy, 19 (2017).
  • [28] S. Tsukiyama, M. Ide, H. Ariyoshi, and I. Shirakawa, A new algorithm for generating all the maximal independent sets, SIAM J. on Computing, 6 (1977), pp. 505–517.
  • [29] S. Wernicke, On the algorithmic tractability of single nucleotide polymorphism (SNP) analysis and related problems, 2014.
  • [30] R. Wille, Restructuring lattice theory: an approach based on hierarchies of concepts, Ordered sets, 1982, pp. 445–470.
  • [31] M. Yannakakis, Node-and edge-deletion NP-complete problems, STOC, 1978, pp. 253–264.
  • [32] Y. Zhang, C. A. Phillips, G. L. Rogers, E. J. Baker, E. J. Chesler, and M. A. Langston, On finding bicliques in bipartite graphs: a novel algorithm and its application to the integration of diverse biological data types, BMC Bioinformatics, 15 (2014).

Appendix A MIB-Enumeration Framework Subroutines

We now provide algorithmic details and proofs of the complexity and correctness of MakeIndMaximal and AddTo.

A.1 MakeIndMaximal

Recall that MakeIndMaximal takes in , where is an induced biclique and , and either returns a MIB where , , , or returns . If it returns and then there is another MIB which contains and . We give pseudo-code of MakeIndMaximal in Algorithm 3.

1:Input: , ,
2:Let
3:if  then
4:     return
5:for  do
6:     if  then
7:         
8:               
9:for  do
10:     if  then
11:               
12:
13:for  do
14:     if  then
15:         return      
16:for  do
17:     if  then
18:         return      
19:return
Algorithm 3 MakeIndMaximal
Lemma 5

MakeIndMaximal returns a MIB where , , , or returns .

Proof

Referring to the pseudo-code in Algorithm 3, it is clear that , as no vertices are ever removed from the input biclique . Furthermore, the only vertices added to are from , so and is the only biclique returned by MakeIndMaximal. Note that neither side of is empty and the only vertices added are independent from the side of the biclique which they are added to, so if we do not return the object returned is an induced biclique. If no node from outside of can be added to , then we will not return and thus is maximal.

Lemma 6

If MakeIndMaximal returns and then there is another MIB in which contains and .

Proof

Note that at line 12. As MakeIndMaximal returns there must be a vertex which can be added to . Let be a MIB containing and , thus suffices to prove the lemma.

Lemma 7

MakeIndMaximal runs in time.

Proof

Note that because is connected, . Setting and can be done in time. In each for loop, we can scan all of the edges incident to each in the iterated-over set and keep count of how many nodes from have been seen (checking for inclusion can be done in time with an initialization step). Thus, each edge is scanned at most once per for loop.
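The counting idea in the proof above can be illustrated with a small maximality test for an induced biclique. This is only a sketch under stated assumptions, not a transcription of Algorithm 3: the adjacency-dictionary graph representation and the function name `is_maximal_induced_biclique` are hypothetical, and the test uses the standard condition that a vertex can join one side exactly when it is adjacent to every vertex of the opposite side and to no vertex of its own side.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # assumed representation: vertex -> neighbor set


def count_neighbors_in(graph: Graph, side: Set[Hashable]) -> Dict[Hashable, int]:
    """For every vertex w, count how many vertices of `side` are adjacent to w.

    Each edge incident to `side` is scanned once from its endpoint in `side`.
    """
    counts: Dict[Hashable, int] = {}
    for u in side:
        for w in graph[u]:
            counts[w] = counts.get(w, 0) + 1
    return counts


def is_maximal_induced_biclique(graph: Graph, x: Set[Hashable], y: Set[Hashable]) -> bool:
    """Return True if no outside vertex can be added to the induced biclique x × y."""
    seen_from_x = count_neighbors_in(graph, x)
    seen_from_y = count_neighbors_in(graph, y)
    for v in graph:
        if v in x or v in y:
            continue
        cx = seen_from_x.get(v, 0)  # neighbors of v inside x
        cy = seen_from_y.get(v, 0)  # neighbors of v inside y
        if cx == 0 and cy == len(y):
            return False  # v could be added to x
        if cy == 0 and cx == len(x):
            return False  # v could be added to y
    return True
```

Both counting passes touch each edge a constant number of times, which is the property the runtime argument relies on; the sketch assumes both sides of the input biclique are non-empty.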

A.2 AddTo

Recall that AddTo takes in where is an induced biclique and , and returns the induced biclique where is added to , is removed from , and is removed from if and otherwise. We give pseudo-code of AddTo in Algorithm 4.

1:Input: , ,
2:
3:
4:if  then
5:     return
6:return
Algorithm 4 AddTo
Lemma 8

AddTo returns the induced biclique where is added to , is removed from , and is removed from if , and otherwise.

Proof

Referring to the pseudo-code in Algorithm 4, it is clear that is added to and is removed from . Additionally ’s non-neighbors are effectively removed from by intersecting it with . If then and is returned. Otherwise since it includes and thus is a biclique. must be an induced biclique as , , and is an induced biclique and by definition.

Lemma 9

AddTo runs in time.

Proof

Note that because is connected, . AddTo can be completed by scanning all of ’s incident edges in tandem with an preprocessing step to allow for constant-time look-ups when checking for inclusion in a set.
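The behaviour described in Lemma 8 can be sketched as follows. This is a minimal illustration, assuming an adjacency-dictionary graph and a biclique stored as a pair of vertex sets; the names `add_to`, `Graph`, and `Biclique` are hypothetical, and `None` stands in for the empty return value.

```python
from typing import Dict, Hashable, Optional, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]  # assumed representation: vertex -> neighbor set
Biclique = Tuple[Set[Hashable], Set[Hashable]]


def add_to(graph: Graph, biclique: Biclique, v: Hashable) -> Optional[Biclique]:
    """Add v to the first side of an induced biclique, repairing both sides.

    Neighbors of v are dropped from the first side so that it stays independent,
    and the second side is intersected with the neighborhood of v so that every
    remaining vertex there is adjacent to v.
    """
    x, y = biclique
    neighbors = graph[v]
    new_x = (x - neighbors) | {v}
    new_y = y & neighbors
    if not new_y:
        return None  # nothing on the other side is adjacent to v
    return new_x, new_y
```

For example, with a biclique {a, b} × {c, d} and a vertex v adjacent to b and c only, the sketch returns {a, v} × {c}.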

Appendix B MB-Enumeration Framework Subroutines

We give a detailed description of the MakeMaximal and Consensus subroutines used in OCT-MICA, along with arguments of their correctness and complexity.

B.1 MakeMaximal

Extending a biclique to be maximal is different in the non-induced case from the induced case, since MBs are completely characterized by one side of the biclique.

1:Input: ,
2:
3:
4:return
Algorithm 5 MakeMaximal
Lemma 10

MakeMaximal runs in time.

Proof

In order to form , we can scan the edges incident to each and keep count of how many nodes from have been seen (checking for inclusion can be done in time with an initialization step). The same can be done for , where instead we scan the edges incident to each . Thus, each edge is scanned at most twice in MakeMaximal.
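The two scanning passes described in this proof can be sketched as below. This is an illustrative reading of the procedure rather than the authors' implementation: the adjacency-dictionary representation and the names `common_neighbors` and `make_maximal` are assumptions, and `None` is returned when the given side has no common neighbor.

```python
from typing import Dict, Hashable, Optional, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]  # assumed representation: vertex -> neighbor set


def common_neighbors(graph: Graph, side: Set[Hashable]) -> Set[Hashable]:
    """Vertices adjacent to every vertex of `side`, found by counting edge scans."""
    counts: Dict[Hashable, int] = {}
    for u in side:
        for w in graph[u]:
            counts[w] = counts.get(w, 0) + 1
    return {w for w, c in counts.items() if c == len(side)}


def make_maximal(graph: Graph, x: Set[Hashable]) -> Optional[Tuple[Set[Hashable], Set[Hashable]]]:
    """Extend one side x to a maximal (not necessarily induced) biclique."""
    y = common_neighbors(graph, x)         # first pass: everything adjacent to all of x
    if not y:
        return None
    x_closed = common_neighbors(graph, y)  # second pass: everything adjacent to all of y
    return x_closed, y
```

Because the second side is the full common neighborhood of the first and vice versa, no vertex can be added to either side, which reflects the remark that an MB is characterized by one of its sides.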

B.2 Consensus

The MICA section of OCT-MICA relies heavily on the Consensus operation introduced in [3] for finding new candidate bicliques. For each pair of bicliques, there are four candidate bicliques which form the consensus of the pair. Note that any of the four candidates may be empty, in which case it is discarded. Consensus runs in time using standard techniques for set union and intersection.

1:Input: , ,
2:
3:
4:
5:
6:
7:for  do
8:     if  then
9:         .add()      
10:return
Algorithm 6 Consensus
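The four consensus candidates can be sketched as follows, using the usual formulation of the consensus operation: intersect one side of each biclique and take the union of the two opposite sides. This is a minimal sketch, not the authors' code; the function name `consensus` and the pair-of-sets representation are assumptions, and unlike the pseudo-code above it does not take the graph as an argument because only set operations are needed.

```python
from typing import Hashable, List, Set, Tuple

Biclique = Tuple[Set[Hashable], Set[Hashable]]


def consensus(b1: Biclique, b2: Biclique) -> List[Biclique]:
    """Return the non-empty consensus candidates of two bicliques.

    Each candidate intersects one side of b1 with one side of b2 and
    unions the two opposite sides; a candidate with an empty side is dropped.
    """
    (x1, y1), (x2, y2) = b1, b2
    candidates = [
        (x1 & x2, y1 | y2),
        (x1 | x2, y1 & y2),
        (x1 & y2, y1 | x2),
        (x1 | y2, y1 & x2),
    ]
    return [(a, b) for a, b in candidates if a and b]
```

Every vertex in an intersection is adjacent to all vertices on both opposite sides, so each surviving candidate is a (not necessarily maximal) biclique that can then be extended.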

Appendix C Additional Enumeration Experiments

Here we include figures corresponding to additional experimental results from our initial benchmarking and from the computational biology data of [29], described in Sections 5.2 and 5.4, respectively.

Figure 5: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and . The ratio was varied.
Figure 6: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and . The ratio was varied.
Figure 7: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and . The ratio was varied.
Figure 8: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and . The ratio was varied.
Figure 9: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and . The coefficient of variation between and was varied.
Figure 10: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and . The coefficient of variation between and was varied.
Figure 11: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and . The coefficient of variation between and was varied.
Figure 12: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and . The coefficient of variation between and was varied.
Figure 13: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and was varied.
Figure 14: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and . The expected edge density between and was varied.
Figure 15: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and . The expected edge density within was varied.
Figure 16: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where and . The expected edge density within was varied.
Figure 17: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where , and . The expected edge density in the graph was varied except for the expected edge density within which was fixed to .
Figure 18: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where , and . The expected edge density in the graph was varied, including the expected edge density within .
Figure 19: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where , and . The expected edge density in the graph was varied except for the expected edge density within which was fixed to .
Figure 20: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where , and . The expected edge density in the graph was varied, including the expected edge density within .
Figure 21: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where , and . The expected edge density in the graph was varied except for the expected edge density within which was fixed to .
Figure 22: Runtimes of the MIB-enumerating (left) and MB-enumerating (right) algorithms on graphs where , and . The expected edge density in the graph was varied, including the expected edge density within .
OCT-MIB-II OCT-MIB Enum-MIB OCT-MICA MICA
aa-10 69 191 6 178 0.008 0.023 0.057 98 0.007 0.031
aa-11 102 307 11 424 0.055 0.115 0.259 206 0.018 0.120
aa-13 129 383 12 523 0.083 0.239 0.470 269 0.028 0.166
aa-14 125 525 19 1460 0.366 0.902 1.254 605 0.090 0.485
aa-15 66 179 7 206 0.010 0.019 0.053 113 0.011 0.030
aa-16 13 15 0 15 0.000 0.000 0.000 8 0.000 0.000
aa-17 151 633 25 2252 1.023 2.132 3.457 1250 0.242 1.137
aa-18 87 381 14 660 0.100 0.173 0.389 823 0.090 0.351
aa-19 191 645 19 1262 0.449 1.569 2.385 519 0.069 0.450
aa-20 224 766 19 1607 0.705 2.431 3.809 949 0.154 1.061
aa-21 28 90 9 116 0.006 0.013 0.008 213 0.019 0.030
aa-22 167 641 16 1520 0.423 1.387 2.629 560 0.074 0.638
aa-23 139 508 18 1766 0.435 0.788 1.651 1530 0.210 1.000
aa-24 258 1108 21 3890 2.108 9.140 14.167 1334 0.237 2.477
aa-25 14 15 1 10 0.000 0.001 0.001 10 0.000 0.000
aa-26 92 284 13 583 0.084 0.186 0.309 370 0.030 0.128
aa-27 118 331 11 458 0.054 0.270 0.343 229 0.015 0.114
aa-28 167 854 27 2606 1.464 2.201 4.162 2814 0.755 3.250
aa-29 276 1058 21 3122 1.909 8.418 10.707 1924 0.382 3.344
aa-30 39 71 4 56 0.002 0.007 0.006 36 0.002 0.007
aa-31 30 51 2 37 0.002 0.002 0.002 22 0.001 0.002
aa-32 143 750 30 4167 2.286 7.694 5.290 3154 0.684 2.635
aa-33 193 493 4 578 0.046 0.204 0.993 218 0.012 0.218
aa-34 133 451 13 705 0.132 0.316 0.756 275 0.031 0.226
aa-35 82 269 10 459 0.037 0.108 0.178 215 0.019 0.081
aa-36 111 316 7 248 0.015 0.076 0.155 143 0.011 0.078
aa-37 72 170 5 135 0.005 0.018 0.054 82 0.005 0.022
aa-38 171 862 26 4270 2.428 5.223 7.586 4964 1.136 5.179
aa-39 144 692 23 2153 0.872 1.574 3.034 1177 0.237 1.009
aa-40 136 620 22 2727 1.022 2.086 2.973 1911 0.301 1.324
aa-41 296 1620 40 11705 16.519 82.439 50.205 20375 9.059 47.789
aa-42 236 1110 30 6967 5.646 45.560 21.244 8952 2.428 13.479
aa-43 63 308 18 905 0.137 0.294 0.311 875 0.116 0.302
aa-44 59 163 10 211 0.014 0.024 0.051 158 0.008 0.037
aa-45 80 386 20 1768 0.336 0.775 0.859 1716 0.244 0.796
aa-46 161 529 13 719 0.157 0.438 0.922 374 0.036 0.257
aa-47 62 229 14 572 0.057 0.082 0.138 451 0.051 0.127
aa-48 89 343 17 896 0.144 0.338 0.497 519 0.060 0.230
aa-49 26 62 5 50 0.004