An exact test for significance of clusters in binary data

09/28/2021
by   James Mathews, et al.
0

Unsupervised clustering of feature matrix data is an indispensible technique for exploratory data analysis and quality control of experimental data. However, clusters are difficult to assess for statistical significance in an objective way. We prove a formula for the distribution of the size of the set of samples, out of a population of fixed size, which display a given signature, conditional on the marginals (frequencies) of each individual feature comprising the signature. The resulting "exact test for coincidence" is widely applicable to objective assessment of clusters in any binary data. We also present a software package implementing the test, a suite of computational verifications of the main theorems, and a supplemental tool for cluster discovery using Formal Concept Analysis.

READ FULL TEXT VIEW PDF
01/09/2014

Efficient unimodality test in clustering by signature testing

This paper provides a new unimodality test with application in hierarchi...
10/05/2016

Non-Parametric Cluster Significance Testing with Reference to a Unimodal Null Distribution

Cluster analysis is an unsupervised learning strategy that can be employ...
05/02/2019

Selection of the Number of Clusters in Functional Data Analysis

Identifying the number K of clusters in a dataset is one of the most dif...
10/07/2019

Gaussian Mixture Clustering Using Relative Tests of Fit

We consider clustering based on significance tests for Gaussian Mixture ...
01/31/2010

Classifying the typefaces of the Gutenberg 42-line bible

We have measured the dissimilarities among several printed characters of...
07/01/2019

Permutation inference with a finite number of heterogeneous clusters

I introduce a simple permutation procedure to test conventional (non-sha...
08/28/2020

The UU-test for Statistical Modeling of Unimodal Data

Deciding on the unimodality of a dataset is an important problem in data...

1 Introduction

A typical visualization of a binary data matrix is a hierarchically-clustered heatmap, with dendrograms in which the higher-level clusters are recursively comprised of smaller clusters, the hierarchy being computed with an agglomeration strategy involving a distance function defined pairwise between samples (or features) to be clustered. In favorable cases a cluster may appear at some level of the hierachy which is especially characteristic of an important underlying state or measure, i.e. an outcome. For example, likelihood of favorable response to some medical treatment.

But it is often difficult to decide whether a cluster found this way, or any other way, could just as easily have occurred by random chance. This is obviously a primary concern in the unsupervised context, where outcomes which might guide cluster assessment are not present. It is also a concern in the supervised context, due to the possibility of overfitting or multiple-hypothesis false discovery.

As an example, in recent work of the authorspsgpaper, a subtype of several types of cancers (including lung and uterus cancers) was identified which exhibited a molecular signature defined by about 10 genes, the PSGs. Network analysis methods implicated the gene subset, but initially confidence concerning its actual significance was low. Pearson correlation analysis was inconclusive due to the presence of outliers. The rarity of the subtype displaying the full signature added to this uncertainty. Ultimately Kaplan-Meier analysis did show that the PSG+ phenotype confers a poor prognosis, confirming the biological significance of this subtype, but we still lacked an objective basis for any claim of statistical significance of the signature/subtype itself. The exact test we introduce in this article turns out to provide such a basis, as described in Figure

1.

2 Theory

2.1 Setup

Let be a binary matrix of shape . We call the columns features and the rows samples. Given a -element subset of the feature set, let denote the restriction of to the columns , and let denote the corresponding column sums. Let denote the set of samples which have all of the features . That is,

A pair obtained as above may be called a maximal bicluster or a formal concept. We shall use the term signature, emphasizing the feature set , and call the set of samples displaying signature . In appendix A we explain how to identify, in practice, many examples of for which is non-empty and relatively large. Of course, any other signature discovery method may be used instead.

We propose to assess the significance of a given signature finding in terms of the size , under the intuition that simultaneous display of multiple features by a large set of samples indicates a non-trivial relation between the features. We call this size the incidence or intersection statistic, and denote it .

2.2 Binary matrix configurations

We are concerned with binary matrices . If has columns, it will be convenient to do some calculations in the ring of formal power series . This is because of the correspondence between:

  1. Multiplicity-free monomials in , i.e. elements of the form for some

  2. Subsets (i.e. )

  3. Possible rows of

The correspondence is

Denote by the set defined by any of these 3 equivalent descriptions. (Here stands for ”features”.)

Symmetrically, if has rows, we consider the ring , and the 3 sets in correspondence:

  1. Multiplicity-free monomials in , i.e. elements of the form for some

  2. Subsets (i.e. )

  3. Possible columns of

Denote this set by . (Here stands for ”samples”).

In these terms, the set of all is naturally identified with and with by regarding as an -tuple of rows or, respectively, as a -tuple of columns.

We will also call the matrices configurations, writing

In counting configurations satisfying certain conditions, we will appeal to the notation introduced above for corresponding elements in lieu of explicit notation for the bijection functions.

2.3 Incidence statistic, its PMF, and CDF

Define integers , for integers and with , by the generating function:

The following counting theorem is the underlying fact needed to prove a formula for the probability mass function (PMF) of the incidence statistic.

Theorem 1.
  1. is the number of configurations in which the mutual intersection of the is empty, that is , and such that for each .

Proof.

(1) By expansion, consists of the sum of all the monomials in . So is the sum of all the monomials except . Before collecting terms with the same monomial part, the terms of are labelled by ordered -tuples of elements of . That is, by certain elements of . Thus the notation we have introduced for elements of may be brought to bear. In particular, the monomial part of a given term is

It follows that the coefficient of is the number of configurations, in which no equals the whole set (due to the missing element ), such that for all . The condition that no be equal to the whole set is equivalent to the mutual intersection of being empty.

(2) We apply the binomial theorem times:

The proof above is clarified somewhat by the observation that can be expressed as a specialization of the power series in ,

namely where .

Theorem 2.

Fix integers , , , and . Consider the configurations in which:

  1. for each .

  2. The cardinality of the intersection of the is exactly , that is .

The number of such configurations is given by the formula:

Proof.

The indicated set of configurations is partitioned equally into sets, according to which -element sample subset is the mutual intersection, denoted . By construction the reduced configuration matrix, not involving the elements of , must consist of features with sample sets of sizes and with no intersection. Thus the size of each part of the partition is . The number of configurations is therefore

The result follows from the formula for given in Theorem 1.2. ∎

The null assumption we make for our test is the one that is made implicitly in a standard permutation test, namely the uniform distribution on the subset of

defined by , given . Note that this entails that we do not assume is comprised of independent and identically distributed (iid) samples. Also, despite the fact that appears to be samples from a set of binary discrete variables, it is definitely not

samples of Bernoulli variables; for example, the variance of the number of positives is 0 for each feature, rather than

for some positivity rate .

Under this assumption the incidence statistic

is an integer-valued random variable. The following corollary provides a formula for its PMF.

Corollary 3.

Consider samples observed with binary features of respective frequencies . The probability of observing exactly samples positive for all features is:

By summing over several values of in Corollary 3

, one can compute a value of the cumulative distribution function (CDF) of

. This is (one minus) the -value for the proposed “exact test for coincidence”.

The next theorem provides an alternative, more closed-form calculation of the CDF, with significantly decreased computational complexity compared with direct summation of PMF values, namely rather than .

The proof of this theorem depends on two basic lemmas.

Lemma 4.
Proof.

Lemma 5.
Proof.

By induction. Base case :

Now assume the formula holds (for all ) for a fixed .

Theorem 6.

where

Proof.

First note that if , so the sum stops at . We apply the formula for :

Plots of the PMF/CDFs for some values of the parameters are shown in Figure 2. The behavior of the test in an example case is illustrated in Figure 3.

2.4 CDF generating function and incomplete beta function

The generating function for the values of CDF, that is with and fixed and the variable, is nearly expressible as the regularized incomplete beta function

with certain arguments, establishing a strong analogy to the binomial distribution. The number of configurations with up to

incidence statistic is given by the generating function:

The last equation above is a ”formal” application of the expression for the CDF of a binomial distribution with trials, that is,

except that instead of the usual real parameter of such a distribution, must be permitted to be equal to the power series which tabulates information across all of the different values of the parameters .

The total number of configurations is given by the generating function , so the generating function for CDF is the ratio:

Here the double division symbol means the coefficient-wise ratio of the multi-dimensional series represented by the respective generating functions. Thus, despite the analogy with the binomial distribution, the generating function for CDF is not literally equal to .

3 Software implementation

3.1 Python package

A Python package coincidencetest is released on PyPI. It contains a self-contained module, with no dependencies beyond the standard library, that calculates the -value for the test.

3.2 Command-line tool

A command-line tool is distributed with coincidencetest that bundles together a basic, lightweight signature discovery algorithm as well as test evaluation on an input binary matrix file. This may be run in a non-interactive context on a remote server or as part of a pipeline.

3.3 Web application

A simple GUI performs signature discovery and evaluation in real-time after user upload of a binary matrix file. A screenshot is shown in Figure 4.

3.4 Testing

The Python package contains a test suite which verifies the -value formulas (i.e. the PMF and CDF) against brute-force enumerations for several small values of the parameters, furnishing rigorous computational evidence for the main theorems in addition to the proofs.

4 Related work

The test turns out to specialize to the Fisher exact testfisher1922 in the case of 2 features, . The incidence statistic and the frequencies of each feature provide the same information as a

integer contingency table, and the formula for the probability value agrees with ours in this case.

The Fisher exact test has been generalized to larger,

contingency tableshighercontingency. Whether such tables are regarded as pertaining to 2 categorical variables with

and

categories respectively, or as pertaining to pairs of binary variables, one from a list of

variables and one from a list of variables, contingency table methods are second-order in that they only involve interactions between pairs of variables. Much work on exact inference generally has focused on contingency tables, with multi-dimensional generalizations appearing in the literature up to order 3 (e.g. tablesagresti1992survey).

By contrast our test is inherently higher-order, depending, albeit in a simple way, on the mutual interaction of all

variables. As for other higher-order methods, an investigation of the joint distribution of Bernoulli variables under certain constraints has been publishedkolev2006joint, and this may yield a test with comparable domain of applicability as our test. However, as indicated in section

2.3, the Bernoulli context involves a different null assumption.

In Goodgood1976application a very similar generating function to our is identified as a tabulation of the number of contingency tables (not binary matrix configurations) with fixed column and row sums. The function is (c.f. page 1166 item 5.6 and page 1182, “”). This connection may help to explain the appearance of the beta function in the generating function for the CDF of the incidence statistic.

Appendix A Formal Concept Analysis bicluster identification

Formal Concept Analysis (FCA)ganter2005introtext studies a binary data matrix, called a formal context, in terms of a lattice of certain patterns found in the matrix. The patterns are known as (formal) concepts. Such a concept consists of a bicluster , defined as a set of features and a set of samples for which the submatrix along consists of all 1s, which is maximal in two senses: (1) cannot be enlarged without reducing , and (2) cannot be enlarged without reducing .

The containment relations of the sets (respectively ) confer a partial ordering or lattice structure on the set of all concepts, which turns out to be complete. The maximality condition amounts to a closure condition on the sets (respectively ), and the whole apparatus can be formulated as a Galois correspondence between two closure systems on the full feature set and full sample set.

A straightforward recursive algorithm can be used to enumerate all concepts in a given contextganter2016conceptual. This algorithm applies to any finite closure system, and it works by computing the closure of the union of any pair of previously-found closed sets.

In practice, however, data sets of intermediate size or larger furnish too many concepts for a complete enumeration to provide a useful direction of attention to important subsamples or signatures. The present work is partly motivated by this problem, as it can be used to filter signatures by significance.

Appendix B Figures

Figure 1: (Above) The dichotomized expression of several PSG genes on 510 lung tumor samples from the TCGA-LUAD project. (Below) The same expression matrix, with the 19 samples that are positive for all features grouped together on the left. The number of positives for each feature are respectively 101, 105, 106, 73, 69, 104. The exact test for coincidence yields , suggesting that the PSG+ phenotype is highly statistically significant. The loci of the PSG genes are very near to each other, so this is not too surprising; it is likely that gene amplification events near this locus were the cause of the observation.
Figure 2: (Blue) The probability mass functions for the incidence statistic at several values of the set sizes and the ambient set size . (Red) The cumulative distribution functions.
Figure 3: Illustration of an application of the exact test for coincidence.
Figure 4: A screenshot of the in-browser GUI.
Figure 5: A portion of the lattice of feature subsets in cell data extracted from 29-channel multiplexed mass cytometry of breast tumor Tissue Micro Arrays (TMA)jackson2020single. The original data are available on Zenodohartland_w_jackson_2019_3518284. For signature discovery, a random subsample was taken from the cell table. Green indicates lower -value, and pink indicates higher -value. The node size and vertical placement are proportional to the frequency of the sample set displaying the given signature. Only signatures with frequency between 60 and 400 (out of 2280) are shown.