Boolean matrix multiplication (BMM) is one of the core problems in discrete algorithms, with numerous applications including triangle detection in graphs, context-free grammar parsing, and transitive closure [6, 7, 10]. Boolean matrix multiplication can be naturally interpreted as a path problem in graphs. Given a layered graph with three layers and edges between layers and and between and , compute the bipartite graph between and in which and are joined if and only if they have a common neighbor. If we identify the bipartite graph between and with its boolean adjacency matrix and the graph between and with its boolean adjacency matrix then the desired graph between and is just the boolean product .
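To make the graph interpretation concrete, the following minimal Python sketch computes the boolean product directly from the common-neighbor rule. The names `first`, `second` and `boolean_product` are illustrative assumptions rather than notation from this paper, and the matrices are assumed to be plain lists of 0/1 rows.

```python
def boolean_product(first, second):
    """Boolean product of two 0/1 matrices given as lists of rows.

    `first` is the adjacency matrix between the first and middle layers,
    `second` the adjacency matrix between the middle and last layers
    (hypothetical names chosen for this sketch).
    """
    rows, inner = len(first), len(second)
    cols = len(second[0]) if second else 0
    product = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # i and j are joined iff some middle vertex k is adjacent to both.
            product[i][j] = int(
                any(first[i][k] and second[k][j] for k in range(inner))
            )
    return product
```

This is the straightforward cubic procedure; it serves only as a reference point for the row-union view of the problem.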
Boolean matrix multiplication is the combinatorial counterpart of integer matrix multiplication. Both involve the computation of output values, each of which can be computed in a straightforward way in time yielding a algorithm for both problems. One of the celebrated classical results in algorithms is Strassen’s discovery  that by ordinary matrix multiplication has truly subcubic algorithms, i.e. algorithms that run in time for some , which compute the entries by computing and combining carefully chosen (and highly non-obvious) polynomial functions of the matrix entries. Subsequent improvements [5, 15, 8] have reduced the value of .
One of the fascinating aspects of BMM is that, despite its intrinsic combinatorial nature, the asymptotically fastest algorithm known is obtained by treating the boolean entries as integers and applying fast integer matrix multiplication. The intermediate calculations done for this algorithm seemingly have little to do with the combinatorial structure of the underlying bipartite graphs. There has been considerable interest in developing “combinatorial” algorithms for BMM, that is, algorithms where the intermediate computations all have a natural combinatorial interpretation in terms of the original problem. Such interest is motivated both by intellectual curiosity, and by the fact that the fast integer multiplication algorithms are impractical because the constant factor hidden in is so large.
The straightforward algorithm has a straightforward combinatorial interpretation: for each pair of vertices check each vertex of to see whether it is adjacent to both and . The so-called Four Russians Algorithm by Arlazarov, Dinic, Kronrod and Faradzhev  solves BMM in operations, and was the first combinatorial algorithm for BMM with complexity . Over the past 10 years, there has been a sequence of combinatorial algorithms [3, 4, 17] developed for BMM, all having complexities of the form for increasingly large constants . The best and most recent of these, due to Yu , has complexity (where the notation suppresses factors). (It should be noted that the algorithm presented in each of these recent papers is for the problem of determining whether a given graph has a triangle; it was shown in  that a (combinatorial) algorithm for triangle finding with complexity can be used as a subroutine to give a (combinatorial) algorithm for BMM with a similar complexity.)
While each of these combinatorial algorithms uses interesting and non-trivial ideas, each one saves only a polylogarithmic factor as compared to the straightforward algorithm, in contrast with the algebraic algorithms which save a power of . The motivating question for the investigations in this paper is: Is there a truly subcubic combinatorial algorithm for BMM? We suspect that the answer is no.
In order to consider this question precisely, one needs to first make precise the notion of a combinatorial algorithm. This itself is challenging. To formalize the notion of a combinatorial algorithm requires some computation model which specifies what the algorithm states are, what operations can be performed, and what the cost of those operations is. If one examines each of these algorithms one sees that the common feature is that the intermediate information stored by the algorithm is of one of the following three types (1): for some pair of subsets with and , the submatrix (bipartite subgraph) induced by on has some specified monotone property (such as, every vertex in has a neighbor in ), (2) for some pair of subsets with and , the bipartite subgraph induced by on has some specific monotone property, or (3) for some pair of subsets with and , the bipartite subgraph induced by on has some specific monotone property.
If one accepts the above characterization of the possible information stored by the algorithm, we are still left with the problem of specifying the elementary steps that the algorithm is permitted to make to generate new pieces of information, and what the computational cost is. The goal in doing this is that the allowed operations and cost function should be such that they accurately reflect the cost of operations in an algorithm. In particular, we would like that our model is powerful enough to be able to simulate all of the known combinatorial algorithms with running time no larger than their actual running time, but not so powerful that it allows for fast (e.g. quadratic time) algorithms that are not implementable on a real computer. We still don’t have a satisfactory model with these properties.
This paper takes a step in this direction. We develop a model which captures some of what a combinatorial algorithm might do. In particular our model is capable of efficiently simulating the Four Russians algorithm, but is sufficiently more general. We then prove a superquadratic lower bound in the model: Any algorithm for BMM in this model requires time at least .
Unfortunately, our model is not strong enough to simulate the more recent combinatorial approaches. Our hope is that our approach provides a starting point for a more comprehensive analysis of the limitation of combinatorial algorithms.
One of the key features of our lower bound is the identification of a family of “hard instances” for BMM. In particular, we use tripartite graphs on roughly vertices that have an almost quadratic number of pairs of vertices from the first and the last layers connected by a single (unique) path via the middle layer. These graphs are derived from -graphs of Ruzsa and Szemerédi , which are dense bipartite graphs on vertices that can be decomposed into a linear number of disjoint induced matchings. More recently, Alon, Moitra and Sudakov  provide a strengthening of Ruzsa and Szemerédi’s construction, although they lose in the parameters that are most relevant for us.
1.1 Combinatorial models
The first combinatorial model for BMM was given by Angluin . For the product of , the model allows taking the bit-wise OR (union) of rows of the matrix to compute the individual rows of the resulting matrix . The cost in this model is the number of unions taken. By a counting argument, Angluin  shows that there are matrices and such that the number of unions taken must be . This matches the number of unions taken by the Four Russians Algorithm, and in that sense the Four Russians Algorithm is optimal.
If the cost of taking each row union were counted as , the total cost would become . The Four Russians Algorithm improves this time to by leveraging “word-level parallelism” to compute each row union in time .
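The table-lookup idea behind the Four Russians Algorithm can be sketched as follows. This is a hedged illustration rather than the formulation analyzed in this paper: rows of the second matrix are assumed to be stored as Python integers used as bit masks (so a single `|` stands in for a word-parallel row union), and `four_russians_product`, `block` and the returned union counter are hypothetical names introduced for the sketch.

```python
def four_russians_product(first, second_rows, block):
    """Row-union product using per-block lookup tables.

    first       : list of 0/1 lists (one list per row of the first matrix)
    second_rows : list of ints; bit j of second_rows[k] is entry (k, j)
    block       : number of rows of the second matrix grouped per table
    Returns (product rows as ints, number of row unions performed).
    """
    inner = len(second_rows)
    result = [0] * len(first)
    unions = 0
    for start in range(0, inner, block):
        group = second_rows[start:start + block]
        # table[mask] is the union of the group rows selected by mask.
        table = [0] * (1 << len(group))
        for mask in range(1, len(table)):
            low = mask & -mask  # lowest set bit of mask
            table[mask] = table[mask ^ low] | group[low.bit_length() - 1]
            unions += 1
        for i, row in enumerate(first):
            # Encode which rows of this group the i-th row selects.
            mask = 0
            for offset in range(len(group)):
                if row[start + offset]:
                    mask |= 1 << offset
            result[i] |= table[mask]
            unions += 1
    return result, unions
```

With `block` chosen around the logarithm of the dimension, each table stays small while every row of the first matrix needs only one lookup per block, which is where the savings in the number of unions come from.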
A possible approach to speed-up the Four Russians Algorithm would be to lower the cost of each union operation even further. The above analysis ignores the fact that we might be taking the union of rows with identical content multiple times. For example if and are random matrices (as in the lower bound of Angluin) then each row of the resulting product is an all-one row. Such rows will appear after taking a union of merely rows from . An entirely naive algorithm would take unions of an all-one row with possible rows of after only a few unions. Hence, there would be only different unions to take for the total cost of . We could quickly detect repetitions of unions by maintaining a short fingerprint for each row evaluated.
Our first model takes repetitions into account. Similarly to Angluin, we focus on the number of unions taken by the algorithm but we charge for each union differently. The natural cost of a union of rows with values counts the cost as the minimum of the number of ones in and . This is the cost we count as one could use sparse set representation for and . In addition to that, if unions of the same rows (vectors) are taken multiple times we charge for them only once; that is, we charge the first one the proper cost and each additional union a unit cost. As we have argued, on random matrices and , BMM will cost in this model. Our first lower bound shows that even in this model, there are matrices for which the cost of BMM is almost cubic.
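The charging scheme described above can be sketched as a small cost meter. This is an illustration under stated assumptions, not the formal cost definition given later in the paper: row values are assumed to be Python integers used as bit masks, a repetition is detected by remembering the unordered pair of operand values, and `UnionCostMeter` is a hypothetical name.

```python
class UnionCostMeter:
    """Tracks the cost of row unions when repeated unions are nearly free."""

    def __init__(self):
        self.seen = set()  # unordered operand pairs already charged in full
        self.cost = 0

    def union(self, u, v):
        key = frozenset((u, v))
        if key in self.seen:
            # A repeated union of the same vectors costs one unit.
            self.cost += 1
        else:
            # First occurrence: charge the smaller number of ones,
            # as a sparse set representation would allow.
            self.seen.add(key)
            self.cost += min(bin(u).count("1"), bin(v).count("1"))
        return u | v
```

For example, taking the union of `0b1110` and `0b0011` twice is charged 2 for the first union and 1 for the repetition.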
Theorem 1 (Informal statement)
In the row-union model with removed repetitions the cost of Boolean matrix multiplication is .
The next natural operation one might allow the algorithm is to divide rows into pieces. This is indeed what the Four Russians Algorithm and many other algorithms do. In the Four Russians Algorithm, this corresponds to the “word-level parallelism”. Hence we might allow the algorithm to break rows into pieces, take unions of the pieces, and concatenate the pieces back. In our more general model we set the cost of the partition and concatenation to be a unit cost, and we only allow splitting a piece into contiguous parts. More complex partitions can be simulated by performing many two-sided partitions and paying proportionally to the complexity of the partition. The cost of a union operation is again proportional to the smaller number of ones in the pieces, while repeated unions are charged a unit cost. In this model one can implement the Four Russians Algorithm for the cost , matching its usual cost. In the model without partitions the cost of the Four Russians Algorithm is .
In this model we are able to prove a super-quadratic lower bound under the restriction that all partitions happen first, then unions take place, and then concatenations.
Theorem 2 (Informal statement)
In the row-union model with partitioning and removed repetitions the cost of Boolean matrix multiplication is .
Perhaps, the characteristic property of “combinatorial” algorithms is that from the run of such an algorithm one can extract a combinatorial proof (witness) for the resulting product. This is how we interpret our models. For given and we construct a witness circuit that mimics the work of the algorithm. The circuit operates on rows of to derive the rows of the resulting matrix . The values flowing through the circuit are bit-vectors representing the values of rows together with information on which union of which submatrix of the row represents. The gates can partition the vectors in pieces, concatenate them and take their union. For our lower bound we require that unions take place only after all partitions and before all concatenations. This seems to be a reasonable restriction since we do not have to emulate the run of an algorithm step by step but rather see what it eventually produces. Also allowing to mix partitions, unions and concatenations in arbitrary order could perhaps lead to only quadratic cost on all matrices. We are not able to argue otherwise.
The proper modelling of combinatorial algorithms is a significant issue here: one wants a model that is strong enough to capture known algorithms (and other conceivable algorithms) but not so strong that it admits unrealistic quadratic algorithms. We do not know how to do this yet, and the present paper is intended as a first step.
1.2 Our techniques
Central to our lower bounds are graphs derived from -graphs of Ruzsa and Szemerédi . Our graphs are tripartite with vertices split into parts , where and . The key property of these graphs is that there are almost quadratically many pairs that are connected via a single (unique) vertex from . In terms of the corresponding matrices and this means that in order to evaluate a particular row of their product we must take a union of very specific rows in . The number of rows in the union must be almost linear. Since is dense this might lead to an almost cubic cost for the whole algorithm provided different vertices in are connected to different vertices in so we take different unions.
This is not a priori the case for the -derived graph but we can easily achieve it by removing edges between and at random, each independently with probability 1/2. The neighborhoods of different vertices in will be very different then. We call such a graph diverse (see a later section for a precise definition). It turns out that for our lower bound we need a slightly stronger property, not only that we take unions of different rows of but also that the results of these unions are different. We call this stronger property unhelpfulness.
Using unhelpfulness of graphs we are able to derive the almost cubic lower bound on the simpler model. Unhelpfulness is a much more subtle property than diversity, and we crucially depend on the properties of our graphs to derive it.
Next we tackle the issue of lower bounds for the partition model. This turns out to be a substantially harder problem, and most of the proof is in the appendix. One needs unhelpfulness on different pieces of rows (restrictions to columns of ), that is, making sure that the result of a union of some pieces does not appear (too often) as the result of a union of other pieces. This is impossible to achieve in full generality. Roughly speaking what we can achieve is that different parts of any witness circuit cannot produce the same results of unions.
The key lemma that formalizes it (Lemma 11) shows that the results of unions obtained for a particular interval of columns in can be used at most times on average in the rest of the circuit. This is a property of the graph; we say that the graph admits only limited reuse. This key lemma is technically complicated and challenging to prove (albeit elementary). Putting all the pieces together also turns out to be quite technical.
2 Notation and preliminaries
For any integer , . For a vertex in a graph and a subset of vertices of , are the neighbors of in , and . (To emphasize which graph we mean we may write .) A subinterval of is any set , for some . By we understand and by we mean . For a subinterval of and a vector , denotes the set . For a vector , . For a binary vector , denotes the number of ones in .
We will denote matrices by calligraphic letters . All matrices we consider are binary matrices. For integers , is the -th row of and is the -th entry of . Let be an matrix and be an matrix, for some integers . We associate matrices with a tripartite graph . The vertex set of is where , and . The edges of are for each such that , and for each such that . In this paper we only consider graphs of this form. Sometimes we may abuse notation and index matrix by vertices of and , and similarly by vertices from and . For a set of indices , is the bit-wise OR of rows of given by .
Circuit. A circuit is a directed acyclic graph where each node (gate) has in-degree either zero, one or two. The degree of a gate is its in-degree, the fan-out is its out-degree. Degree one gates are called unary and degree two gates are binary. Degree zero gates are called input gates. For each binary gate , and are its two predecessor gates. A computation of a circuit proceeds by passing values along edges, where each gate processes its incoming values to decide on the value passed along the outgoing edges. The input gates have some predetermined values. The output of the circuit is the output value of some designated vertex or vertices.
Witness. Let and be matrices of dimension and , resp., with its associated graph . A witness for the matrix product is a circuit consisting of input gates, unary partition gates, binary union gates and binary concatenation gates. The values passed along the edges are triples , where identifies a set of rows of the matrix , the subinterval identifies a set of columns of , and is the restriction of to the columns of . Each input gate outputs for some assigned . A partition gate with an assigned subinterval on input outputs undefined if and outputs otherwise, where is such that for each , . A union gate on inputs and from its children outputs undefined if , and outputs otherwise. A concatenation gate, on inputs and where , is undefined if or or and outputs otherwise, where is obtained by concatenating with the last bits of .
It is straightforward that whether a gate is undefined depends solely on the structure of the circuit but not on the actual values of or . We will say that the circuit is structured if union gates do not send values into partition gates, and concatenation gates do not send values into partition and union gates. Such a circuit first breaks rows of into parts, computes union of compatible parts and then assembles resulting rows using concatenation.
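The three gate types of a structured witness can be sketched as follows. This is only an illustration: a value is modelled as a triple (set of row indices, inclusive column interval, tuple of bits), `None` stands for an undefined output, and the exact conditions under which a gate is undefined are assumptions made for this sketch (a partition must select a subinterval of its input, a union needs equal intervals, a concatenation needs equal row sets and touching or overlapping intervals). All function names are hypothetical.

```python
def input_gate(row_index, row):
    """Value for a single full row of the second matrix."""
    return (frozenset([row_index]), (0, len(row) - 1), tuple(row))


def partition_gate(value, sub):
    """Restrict a value to the subinterval `sub` (assumed rule)."""
    if value is None:
        return None
    rows, (lo, hi), bits = value
    s_lo, s_hi = sub
    if s_lo < lo or s_hi > hi or s_lo > s_hi:
        return None  # assigned subinterval is not inside the input interval
    return (rows, sub, bits[s_lo - lo:s_hi - lo + 1])


def union_gate(left, right):
    """Coordinate-wise OR of two values on the same interval (assumed rule)."""
    if left is None or right is None or left[1] != right[1]:
        return None
    bits = tuple(a | b for a, b in zip(left[2], right[2]))
    return (left[0] | right[0], left[1], bits)


def concatenation_gate(left, right):
    """Glue two values for the same row set into one longer interval."""
    if left is None or right is None:
        return None
    rows_l, (lo_l, hi_l), bits_l = left
    rows_r, (lo_r, hi_r), bits_r = right
    # Assumed rule: same rows, left starts first, no gap, right extends left.
    if rows_l != rows_r or lo_l > lo_r or lo_r > hi_l + 1 or hi_r <= hi_l:
        return None
    extra = hi_r - hi_l  # number of new columns contributed by the right part
    return (rows_l, (lo_l, hi_r), bits_l + bits_r[-extra:])
```

A structured computation in this sketch partitions input rows first, then applies union gates to pieces with matching intervals, and only then concatenates the results.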
We say that a witness is a correct witness for if is structured, no gate has undefined output, and for each , there is a gate in with output for .
Cost. The cost of the witness is defined as follows. For each union gate with inputs and and an output we define its row-class to be . If is a set of union gates from , is the row-class of some gate in . The cost of a row-class is . The cost of is . The cost of witness is the number of gates in plus the cost of the set of all union gates in .
We can make the following simple observation.
If is a correct witness for , then for each , there exists a collection of subintervals such that and for each , there is a union gate in which outputs .
Union and resultant circuit. One can look at the witness circuit from two separate angles which are captured in the next definitions. A union circuit over a universe is a circuit with gates of degree zero and two where each gate is associated with a subset of so that for each binary gate , . For integer , a resultant circuit is a circuit with gates of degree zero and two where each gate is associated with a vector from so that for each binary gate , , where is a coordinate-wise Or.
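The two views of a circuit can be checked mechanically. The sketch below assumes that the condition on a binary gate of a union circuit is that its set equals the union of the sets of its two predecessors, and that the condition for a resultant circuit is the analogous coordinate-wise OR; the dictionary-based encoding and the function names are hypothetical choices for this illustration.

```python
def is_union_circuit(gate_sets, predecessors):
    """Check that every binary gate holds the union of its predecessors' sets.

    gate_sets    : dict mapping gate name -> set of universe elements
    predecessors : dict mapping binary gate name -> (left, right) gate names
    """
    return all(
        set(gate_sets[g]) == set(gate_sets[l]) | set(gate_sets[r])
        for g, (l, r) in predecessors.items()
    )


def is_resultant_circuit(gate_vectors, predecessors):
    """Same check for bit vectors, here stored as ints used as bit masks."""
    return all(
        gate_vectors[g] == gate_vectors[l] | gate_vectors[r]
        for g, (l, r) in predecessors.items()
    )
```

Input gates carry no constraint in either check; only gates listed in `predecessors` are verified.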
For a vertex and a subinterval of , a union witness for is a union circuit over with a single output gate where and for each input gate of , for some connected to .
Induced union witness. Let be a correct witness for . Pick and a subinterval . Let there be a union gate in with output . An induced union witness for is a union circuit over whose underlying graph consists of copies of the union gates that are predecessors of , and a new input gate for each input or partition gate that feeds into one of the union gates. They are connected in the same way as in . For each gate in the induced witness we let whenever its corresponding gate in outputs for some and . From the correctness of it follows that each such and the resulting circuit is a correct union witness for .
We will use a special type of graph for constructing matrices which are hard for our combinatorial model of Boolean matrix multiplication. For integers , an -graph is a graph whose edges can be partitioned into pairwise disjoint induced matchings of size . Somewhat counter-intuitively, as shown by Ruzsa and Szemerédi , there are dense graphs on vertices that are -graphs for and close to .
Theorem 4 (Ruzsa and Szemerédi )
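The notion of an induced matching can be illustrated with a short check. The sketch assumes an undirected graph stored as a set of two-element frozensets; `is_induced_matching` and `is_partition_into_induced_matchings` are hypothetical helper names, and nothing here reproduces the Ruzsa and Szemerédi construction itself.

```python
def is_induced_matching(edges, matching):
    """True iff `matching` is a matching of the graph and the subgraph
    induced on its endpoints contains no other edge."""
    endpoints = [v for edge in matching for v in edge]
    if len(set(endpoints)) != len(endpoints):
        return False  # two matching edges share a vertex
    matched = {frozenset(edge) for edge in matching}
    if not matched <= edges:
        return False  # some matching edge is not in the graph
    vertices = set(endpoints)
    # Induced: every graph edge inside the endpoint set is a matching edge.
    return all(edge in matched for edge in edges if edge <= vertices)


def is_partition_into_induced_matchings(edges, matchings, size):
    """True iff the matchings are edge-disjoint, cover all edges,
    each has exactly `size` edges, and each is induced."""
    covered = [frozenset(e) for m in matchings for e in m]
    if len(covered) != len(set(covered)) or set(covered) != set(edges):
        return False
    return all(
        len(m) == size and is_induced_matching(edges, m) for m in matchings
    )
```

On the path with edges 1-2, 2-3, 3-4, the pair of edges 1-2 and 3-4 is a matching but not an induced one, because the edge 2-3 joins two of its endpoints.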
For all large enough integers , for there is a -graph .
A more recent work of Alon, Moitra and Sudakov  provides a construction of -graphs on vertices with and . The graphs of Ruzsa and Szemerédi are sufficient for us.
Let be the graph from the previous theorem and let be the disjoint induced matchings of size . We define a tripartite graph as follows: has vertices , and . For each such that there are edges and in . The following immediate lemma states one of the key properties of .
If in then there is a unique path between and in .
For the rest of the paper, we will fix the graphs . Additionally, we will also use a graph which is obtained from by removing each edge between and independently at random with probability . (Technically, is a random variable.) When is clear from the context we will drop the subscript .
Fix some large enough . Let be the adjacency matrix between and in and be the adjacency matrix between and in . The adjacency matrix between and in will be denoted by . ( is also a random variable.) The adjacency matrix between and in is .
We say that is unique for if there is exactly one such that and are edges in . The previous lemma implies that on average has many unique vertices in , namely . For , let denote the set of vertices from that are unique for in . E.g., are all vertices unique for . Let denote the set of vertices from that are connected to and some vertex in . Notice, . Since and depend on edges in graph , to emphasise which graph we have in mind we may subscript them by : and .
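The uniqueness condition can be made concrete with a short sketch that lists all pairs of outer-layer vertices joined by exactly one path through the middle layer. The matrices are assumed to be lists of 0/1 rows, and `unique_pairs`, `first` and `second` are hypothetical names, not notation from this paper.

```python
def unique_pairs(first, second):
    """Pairs (i, j) with exactly one middle vertex adjacent to both.

    first  : 0/1 adjacency matrix between the first and middle layers
    second : 0/1 adjacency matrix between the middle and last layers
    """
    inner = len(second)
    cols = len(second[0]) if second else 0
    pairs = set()
    for i, row in enumerate(first):
        for j in range(cols):
            # Count the middle vertices that connect i to j.
            common = sum(1 for k in range(inner) if row[k] and second[k][j])
            if common == 1:
                pairs.add((i, j))
    return pairs
```

Pairs with two or more connecting middle vertices are excluded, since for them no single row of the second matrix is forced into the union.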
For the randomized graph we will denote by the set of vertices from that are unique for in and that are connected to via also in . (Thus, vertices from that are not unique for in but became unique for in are not included in .) Let denotes
2.4 Diverse and unhelpful graphs
In this section we define two properties of that capture the notion that one needs to compute many different unions of rows of to calculate . The simpler condition stipulates that neighborhoods of different vertices from are quite different. The second condition stipulates that not only the neighborhoods of vertices from are different but also the unions of rows from that correspond to these neighborhoods are different.
Let and and be as in the previous section. For integers , we say is -diverse if for every set of size at least , no vertices in are all connected to all the vertices of .
Let be integers. The probability that is -diverse is at least .
Proof. Let and . is not -diverse if for some set of size , and some -tuple of distinct vertices , each vertex is connected to all vertices from in . The probability that all vertices of a given -tuple are connected to all vertices in in is at most . (The probability is zero if some is not connected to some vertex from in .) Hence, the probability that there is some set of size , and some -tuple of distinct vertices where each vertex is connected to all vertices from in is bounded by:
where the second inequality follows from .
For , and a subinterval , we say that is helpful for on if there exists a set such that and . In other words, the condition means that and agree on coordinates in that correspond to vertices unique for in . This is a necessary precondition for which allows one to focus only on the hard-core formed by the unique vertices. In particular, if for some in , , then satisfies . (See the proof below.)
For integers , we say is -unhelpful on if for every set of size at least , there are at most vertices in for which is helpful on .
Let be integers. Let and be a subinterval of . The probability that is -unhelpful on is at least .
Proof. Take any set of size and arbitrary vertices for . Consider and some . Since edges between and are always the same in , is always the same in . If is helpful on for then there exists such that and . It turns out that given , the possible is uniquely determined by . Whenever has one in a position that corresponds to a unique vertex of in , must have one there as well so the corresponding must be in . Conversely, whenever has zero in a position that corresponds to a unique vertex of in , must have zero there as well so the corresponding is not in . The probability that is .
Hence, the probability over choice of that is helpful for on is at most . For different ’s this probability is independent as it only depends on edges between and . Thus the probability that is helpful for is at most .
There are at most choices for the set of size and . Hence, the probability that is not -unhelpful on is at most:
where the third inequality follows from .
3 Union circuits
Our goal is to prove the following theorem:
There is a constant such that for all large enough there are matrices and such that any correct witness for consisting of only union gates has cost at least .
Here by consisting of only union gates we mean consisting of union gates and input gates. Our almost cubic lower bound on the cost of union witnesses is an easy corollary to the following lemma.
Let be a large enough integer and be the graph from Section 2.3, and be its corresponding matrices. Let be a correct witness for consisting of only union gates. Let have at least ones. Let each row of have at least ones. If is -unhelpful on for some integers then any correct witness for consisting of only union gates has cost at least .
Proof. Let be a correct witness for consisting of only union gates. For each gate of with output , for some , define . Consider . Let be a gate of such that (which equals ). Take a maximal set of gates from , descendants of , such that for each , and either or , and furthermore for , .
Notice, if then . This is because for any sets , . (Say, , then there is 1 in which corresponds to a vertex unique for . Thus, whereas .)
We claim that since is maximal, . We prove the claim. Assume otherwise there is nothing to prove. Take any and consider a path of gates in such that . Since , and , there is some with and . By maximality of there is some gate such that . Hence, is in or of size . Thus
Hence, and the claim follows.
For a given , gates in have different row-classes. Since is -unhelpful on , the same row-class can appear in only for at most different ’s. (Say, there were vertices in and gates of the same row-class. For each , and . The smallest would be helpful for contradicting the unhelpfulness of .) Since
witness contains gates of at least different row-classes. Since each contains at least ones, the total cost of is as claimed.
Proof of Theorem 8. Let be the graph from Section 2.3, and be its corresponding matrices. Let . By Lemma 7, the graph is -unhelpful on with probability at least , and by Chernoff bound, contains at least ones with probability at least . So with probability at least , has ones while is -unhelpful on . By the previous lemma, any witness for is of cost . For large enough , this is at least , and the theorem follows.
4 Circuits with partitions
In this section, our goal is to prove the lower bound on the cost of a witness for matrix product when the witness is allowed to partition the columns of . Namely:
For all large enough there are matrices and such that any correct witness for has cost at least .
We provide a brief overview of the proof first. The proof builds on ideas seen already in the previous part but also requires several additional ideas. Consider a correct witness for . We partition its union gates based on their corresponding subinterval of . If there are many vertices in that use many different subintervals (roughly in total) the lower bound follows by counting the total number of gates in the circuit using diversity of (Lemma 14). If there are many vertices in which use only few subintervals (less than roughly each) then these subintervals must be large on average (about ) and contain lots of vertices from unique for their respective vertices from .
In this case we divide the circuit (its union gates) based on their subinterval, and we calculate the contribution of each part separately. To do that we have to limit the amount of reuse of a given row-class within each part, and also among distinct parts. Within each part we limit the amount of reuse using a similar technique to Lemma 9 based on unhelpfulness of the graph (Lemma 13). However, for distinct parts we need a different tool which we call limited reuse. Limited reuse is somewhat different than unhelpfulness in the type of guarantee we get. It is a weaker guarantee as we are not able to limit the reuse of a row-class for each single gate but only the total reuse of row-classes of all the gates in a particular part. On average the reuse is again roughly .
However, the number of gates in a particular part of the circuit might be considerably larger than the number of gates we are able to charge for work in that part. In general, we are only able to charge gates that already made some non-trivial progress in the computation (as otherwise the gates could be reused heavily.) We overcome this obstacle by balancing the size of the part against the number of chargeable gates in that part.
If the total number of gates in the part is at least -times larger than the total number of chargeable gates, we charge the part for its size. Otherwise we charge it for work. Each chargeable gate contributes about units of work or more; however, this can be reused almost -times elsewhere. Either way, approximately of work must be done in total. Now we present the actual proof.
In order to prove the theorem we need a few more definitions. Let and and be as in Section 2.3. All witness circuits in this section are with respect to (i.e., ). Let and be some constants that we will fix later.
The following definition aims to separate contribution from different rows within a particular subcircuit. A witness circuit may benefit from taking a union of the same row of multiple times to obtain a particular union. This could help various gates to attain the same row-class. In order to analyze the cost of the witness we want to effectively prune the circuit so that contribution from each row of is counted at most once. The following definition captures this pruning.
Let be a union circuit over with a single vertex of out-degree zero (output gate). The trimming of is a map that associates to each gate of a subset such that and for each non-input gate , . For each circuit , we fix a canonical trimming that is obtained from by the following process: For each , find the left-most path from to an input gate such that , and remove from of every gate that is not on this path.
Given the trimming of a union circuit we will focus our attention only on gates that contribute substantially to the cost of the computation. We call such gates chargeable in the next definition. For a vertex and a subinterval , let be a union witness for with its trimming. We say a gate in is -chargeable if and and are both different from . -Chargeable descendants of are -chargeable gates in where . Observe that the number of -chargeable descendants of a gate is at most .
From a correct witness for , we extract some induced union circuit for and some resultant circuit . We say that a gate from is compatible with a gate from if .
We want to argue that chargeable gates corresponding to gates of a given correct witness have many different row-classes. Hence, we want to bound the number of gates whose results are compatible with each other. This is akin to the notion of helpfulness. In the case of helpfulness we were able to limit the repetition of the same row-class for individual gates operating on the same subinterval of columns of . In addition to that we need to limit the occurrence of the same row-class for gates that operate on distinct subintervals. As opposed to the simpler case of helpfulness, we will need to focus on the global count of row-classes that can be reused elsewhere from gates operating on the same subinterval. The next definition encapsulates the desired property of .
For and subintervals of , we say that and are independent if either or . A resultant circuit over is consistent with , if there exists a subinterval of size , such that for each input gate of , for some . We say that admits only limited reuse if for any resultant circuit of size at most which is consistent with and any correct witness circuit for