1 Introduction
A typical visualization of a binary data matrix is a hierarchicallyclustered heatmap, with dendrograms in which the higherlevel clusters are recursively comprised of smaller clusters, the hierarchy being computed with an agglomeration strategy involving a distance function defined pairwise between samples (or features) to be clustered. In favorable cases a cluster may appear at some level of the hierachy which is especially characteristic of an important underlying state or measure, i.e. an outcome. For example, likelihood of favorable response to some medical treatment.
But it is often difficult to decide whether a cluster found this way, or any other way, could just as easily have occurred by random chance. This is obviously a primary concern in the unsupervised context, where outcomes which might guide cluster assessment are not present. It is also a concern in the supervised context, due to the possibility of overfitting or multiplehypothesis false discovery.
As an example, in recent work of the authorspsgpaper, a subtype of several types of cancers (including lung and uterus cancers) was identified which exhibited a molecular signature defined by about 10 genes, the PSGs. Network analysis methods implicated the gene subset, but initially confidence concerning its actual significance was low. Pearson correlation analysis was inconclusive due to the presence of outliers. The rarity of the subtype displaying the full signature added to this uncertainty. Ultimately KaplanMeier analysis did show that the PSG+ phenotype confers a poor prognosis, confirming the biological significance of this subtype, but we still lacked an objective basis for any claim of statistical significance of the signature/subtype itself. The exact test we introduce in this article turns out to provide such a basis, as described in Figure
1.2 Theory
2.1 Setup
Let be a binary matrix of shape . We call the columns features and the rows samples. Given a element subset of the feature set, let denote the restriction of to the columns , and let denote the corresponding column sums. Let denote the set of samples which have all of the features . That is,
A pair obtained as above may be called a maximal bicluster or a formal concept. We shall use the term signature, emphasizing the feature set , and call the set of samples displaying signature . In appendix A we explain how to identify, in practice, many examples of for which is nonempty and relatively large. Of course, any other signature discovery method may be used instead.
We propose to assess the significance of a given signature finding in terms of the size , under the intuition that simultaneous display of multiple features by a large set of samples indicates a nontrivial relation between the features. We call this size the incidence or intersection statistic, and denote it .
2.2 Binary matrix configurations
We are concerned with binary matrices . If has columns, it will be convenient to do some calculations in the ring of formal power series . This is because of the correspondence between:

Multiplicityfree monomials in , i.e. elements of the form for some

Subsets (i.e. )

Possible rows of
The correspondence is
Denote by the set defined by any of these 3 equivalent descriptions. (Here stands for ”features”.)
Symmetrically, if has rows, we consider the ring , and the 3 sets in correspondence:

Multiplicityfree monomials in , i.e. elements of the form for some

Subsets (i.e. )

Possible columns of
Denote this set by . (Here stands for ”samples”).
In these terms, the set of all is naturally identified with and with by regarding as an tuple of rows or, respectively, as a tuple of columns.
We will also call the matrices configurations, writing
In counting configurations satisfying certain conditions, we will appeal to the notation introduced above for corresponding elements in lieu of explicit notation for the bijection functions.
2.3 Incidence statistic, its PMF, and CDF
Define integers , for integers and with , by the generating function:
The following counting theorem is the underlying fact needed to prove a formula for the probability mass function (PMF) of the incidence statistic.
Theorem 1.

is the number of configurations in which the mutual intersection of the is empty, that is , and such that for each .

Proof.
(1) By expansion, consists of the sum of all the monomials in . So is the sum of all the monomials except . Before collecting terms with the same monomial part, the terms of are labelled by ordered tuples of elements of . That is, by certain elements of . Thus the notation we have introduced for elements of may be brought to bear. In particular, the monomial part of a given term is
It follows that the coefficient of is the number of configurations, in which no equals the whole set (due to the missing element ), such that for all . The condition that no be equal to the whole set is equivalent to the mutual intersection of being empty.
The proof above is clarified somewhat by the observation that can be expressed as a specialization of the power series in ,
namely where .
Theorem 2.
Fix integers , , , and . Consider the configurations in which:

for each .

The cardinality of the intersection of the is exactly , that is .
The number of such configurations is given by the formula:
Proof.
The indicated set of configurations is partitioned equally into sets, according to which element sample subset is the mutual intersection, denoted . By construction the reduced configuration matrix, not involving the elements of , must consist of features with sample sets of sizes and with no intersection. Thus the size of each part of the partition is . The number of configurations is therefore
The result follows from the formula for given in Theorem 1.2. ∎
The null assumption we make for our test is the one that is made implicitly in a standard permutation test, namely the uniform distribution on the subset of
defined by , given . Note that this entails that we do not assume is comprised of independent and identically distributed (iid) samples. Also, despite the fact that appears to be samples from a set of binary discrete variables, it is definitely notsamples of Bernoulli variables; for example, the variance of the number of positives is 0 for each feature, rather than
for some positivity rate .Under this assumption the incidence statistic
is an integervalued random variable. The following corollary provides a formula for its PMF.
Corollary 3.
Consider samples observed with binary features of respective frequencies . The probability of observing exactly samples positive for all features is:
By summing over several values of in Corollary 3
, one can compute a value of the cumulative distribution function (CDF) of
. This is (one minus) the value for the proposed “exact test for coincidence”.The next theorem provides an alternative, more closedform calculation of the CDF, with significantly decreased computational complexity compared with direct summation of PMF values, namely rather than .
The proof of this theorem depends on two basic lemmas.
Lemma 4.
Proof.
∎
Lemma 5.
Proof.
By induction. Base case :
Now assume the formula holds (for all ) for a fixed .
∎
Theorem 6.
where
Proof.
First note that if , so the sum stops at . We apply the formula for :
∎
2.4 CDF generating function and incomplete beta function
The generating function for the values of CDF, that is with and fixed and the variable, is nearly expressible as the regularized incomplete beta function
with certain arguments, establishing a strong analogy to the binomial distribution. The number of configurations with up to
incidence statistic is given by the generating function:The last equation above is a ”formal” application of the expression for the CDF of a binomial distribution with trials, that is,
except that instead of the usual real parameter of such a distribution, must be permitted to be equal to the power series which tabulates information across all of the different values of the parameters .
The total number of configurations is given by the generating function , so the generating function for CDF is the ratio:
Here the double division symbol means the coefficientwise ratio of the multidimensional series represented by the respective generating functions. Thus, despite the analogy with the binomial distribution, the generating function for CDF is not literally equal to .
3 Software implementation
3.1 Python package
A Python package coincidencetest is released on PyPI. It contains a selfcontained module, with no dependencies beyond the standard library, that calculates the value for the test.
3.2 Commandline tool
A commandline tool is distributed with coincidencetest that bundles together a basic, lightweight signature discovery algorithm as well as test evaluation on an input binary matrix file. This may be run in a noninteractive context on a remote server or as part of a pipeline.
3.3 Web application
A simple GUI performs signature discovery and evaluation in realtime after user upload of a binary matrix file. A screenshot is shown in Figure 4.
3.4 Testing
The Python package contains a test suite which verifies the value formulas (i.e. the PMF and CDF) against bruteforce enumerations for several small values of the parameters, furnishing rigorous computational evidence for the main theorems in addition to the proofs.
4 Related work
The test turns out to specialize to the Fisher exact testfisher1922 in the case of 2 features, . The incidence statistic and the frequencies of each feature provide the same information as a
integer contingency table, and the formula for the probability value agrees with ours in this case.
The Fisher exact test has been generalized to larger,
contingency tableshighercontingency. Whether such tables are regarded as pertaining to 2 categorical variables with
andcategories respectively, or as pertaining to pairs of binary variables, one from a list of
variables and one from a list of variables, contingency table methods are secondorder in that they only involve interactions between pairs of variables. Much work on exact inference generally has focused on contingency tables, with multidimensional generalizations appearing in the literature up to order 3 (e.g. tablesagresti1992survey).By contrast our test is inherently higherorder, depending, albeit in a simple way, on the mutual interaction of all
variables. As for other higherorder methods, an investigation of the joint distribution of Bernoulli variables under certain constraints has been publishedkolev2006joint, and this may yield a test with comparable domain of applicability as our test. However, as indicated in section
2.3, the Bernoulli context involves a different null assumption.In Goodgood1976application a very similar generating function to our is identified as a tabulation of the number of contingency tables (not binary matrix configurations) with fixed column and row sums. The function is (c.f. page 1166 item 5.6 and page 1182, “”). This connection may help to explain the appearance of the beta function in the generating function for the CDF of the incidence statistic.
Appendix A Formal Concept Analysis bicluster identification
Formal Concept Analysis (FCA)ganter2005introtext studies a binary data matrix, called a formal context, in terms of a lattice of certain patterns found in the matrix. The patterns are known as (formal) concepts. Such a concept consists of a bicluster , defined as a set of features and a set of samples for which the submatrix along consists of all 1s, which is maximal in two senses: (1) cannot be enlarged without reducing , and (2) cannot be enlarged without reducing .
The containment relations of the sets (respectively ) confer a partial ordering or lattice structure on the set of all concepts, which turns out to be complete. The maximality condition amounts to a closure condition on the sets (respectively ), and the whole apparatus can be formulated as a Galois correspondence between two closure systems on the full feature set and full sample set.
A straightforward recursive algorithm can be used to enumerate all concepts in a given contextganter2016conceptual. This algorithm applies to any finite closure system, and it works by computing the closure of the union of any pair of previouslyfound closed sets.
In practice, however, data sets of intermediate size or larger furnish too many concepts for a complete enumeration to provide a useful direction of attention to important subsamples or signatures. The present work is partly motivated by this problem, as it can be used to filter signatures by significance.