Group Testing with Runlength Constraints for Topological Molecular Storage

01/14/2020 ∙ by Abhishek Agarwal, et al. ∙ University of Illinois at Urbana-Champaign ∙ Imperial College London

Motivated by applications in topological DNA-based data storage, we introduce and study a novel setting of Non-Adaptive Group Testing (NAGT) with runlength constraints on the columns of the test matrix, in the sense that any two 1's must be separated by a run of at least d 0's. We describe and analyze a probabilistic construction of a runlength-constrained scheme in the zero-error and vanishing error settings, and show that the number of tests required by this construction is optimal up to logarithmic factors in the runlength constraint d and the number of defectives k in both cases. Surprisingly, our results show that runlength-constrained NAGT is not more demanding than unconstrained NAGT when d=O(k), and that for almost all choices of d and k it is not more demanding than NAGT with a column Hamming weight constraint only. Towards obtaining runlength-constrained Quantitative NAGT (QNAGT) schemes with good parameters, we also provide lower bounds for this setting and a nearly optimal probabilistic construction of a QNAGT scheme with a column Hamming weight constraint.


I Introduction

Group testing is a pooling scheme first introduced by Dorfman [1] for the purpose of testing individuals for diseases. Since its inception, the problem and its subsequent solutions have found a number of applications in bioinformatics (see [2, 3] and references therein), information and coding theory [4, 5], and many other disciplines, as outlined in [6, 7].

In classical Non-Adaptive Group Testing (NAGT), one is concerned with the following question: Given a collection of n objects of which k are “defective,” devise a testing strategy that uses the smallest possible number of tests to identify the defectives. A test is allowed to involve an arbitrary number of objects from the pool and returns a positive answer if at least one of the objects involved is defective. The tests are usually summarized in what is referred to as a test matrix M – a binary matrix in which the rows correspond to the tests while the columns correspond to the test objects. In other words, we have M_{i,j} = 1 if and only if item j participates in the i-th test. The set of n items is usually described by a sparse vector with at most k non-zero entries corresponding to the defectives. Research in this area was kickstarted by the seminal early works of Kautz and Singleton [5] and D’yachkov and Rykov [8]. Currently, we know explicit NAGT schemes requiring O(k^2 log n) tests in the zero-error setting [9] (while the probabilistic method shows the existence of such schemes with O(k^2 log(n/k)) tests), and randomized schemes requiring O(k log n) tests in the average-case (i.e., vanishing error) setting with simple decoding algorithms (e.g., see [10, 11]). In both cases, the number of tests is optimal up to an O(log k) factor [8].

In a different context, group testing was recently shown to increase the storage density of topological DNA-based data storage [12]. In such a system, nanoscopic holes are punched into the sugar-phosphate backbone of one strand of a double-stranded DNA molecule. A “hole” indicates the value 1 while the absence of a hole indicates the value 0. Multiple copies of the same native DNA strands, referred to as registers, are punched to bear different user signatures. These are mixed together according to a group testing scheme within one pool and subsequently stored in a single microwell. The mixing process allows for using only one microwell per pool rather than using multiple microwells for individual registers thereby reducing the implementation cost.

One constraint that arises in the above-described group testing scheme is a runlength constraint for zeros between pairs of 1s on the same DNA strand, as depicted in Figure 1. This constraint is tied to the quality of the readout, since nicks must be placed at a sufficiently large distance from each other [12]. We propose to address this problem by introducing a new runlength-limited group testing paradigm.

Fig. 1: DNA Punchcards for molecular data storage. The rows index the potential nicking sites, while the columns index the DNA strands used in the mixture. Two 1s in a column that delineate a run of 0s correspond to a DNA fragment, whose length has to be sufficiently large. Note that while Register 1 obeys the runlength constraint, Registers 5 and 8 include two consecutive 1s, which may lead to readout errors.

Runlength limited group testing represents a simplification of the actual mixture identification problem as the readout process also provides information about the distance between two 1s in a register as well as the number of DNA fragments of each type [12]. The counts may be exploited through runlength-constrained Quantitative NAGT (QNAGT) [13, 14].

I-A Problem Setting and Basic Definitions

The Hamming weight of a vector x is denoted by wt(x). A vector x is said to be k-sparse if wt(x) ≤ k. We formulate the problem by assuming that there are n items (registers), out of which k are defective (i.e., included in the pool) and represented by a k-sparse vector x ∈ {0,1}^n. The set of m non-adaptive tests to be performed is represented by a (possibly random) m × n binary test matrix M, with M_{i,j} = 1 if and only if item j participates in the i-th test, and M_{i,j} = 0 otherwise. In NAGT, the test outcomes for the input x are obtained through a logical OR operation: the i-th outcome is the OR of the entries M_{i,j} over the defective items j. In contrast, the test outcomes in QNAGT equal Mx, computed over the integers. In the testing scheme, we allow for all-0 rows in M, which correspond to noninformative tests but simplify our analysis.
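To make the two outcome models concrete, the following minimal sketch computes both for a toy test matrix. The names `M`, `x`, `nagt_outcomes`, and `qnagt_outcomes` are illustrative assumptions and do not come from the paper.

```python
from typing import List

Matrix = List[List[int]]  # rows are tests, columns are items


def nagt_outcomes(M: Matrix, x: List[int]) -> List[int]:
    """Boolean (OR) outcomes: test i is positive iff it contains a defective."""
    return [int(any(mij and xj for mij, xj in zip(row, x))) for row in M]


def qnagt_outcomes(M: Matrix, x: List[int]) -> List[int]:
    """Quantitative outcomes: test i reports how many defectives it contains."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]


# Toy example (assumed for illustration): 4 tests, 4 items, items 0 and 1 defective.
M = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],  # an all-0, noninformative test
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]
x = [1, 1, 0, 0]

print(nagt_outcomes(M, x))   # [1, 0, 1, 1]
print(qnagt_outcomes(M, x))  # [2, 0, 1, 2]
```

The quantitative outcomes carry strictly more information than the OR outcomes, which can be recovered by thresholding each count at 1.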

We say that the matrix M is d-runlength constrained if for every column j there is a 0-run of length at least d between any two 1’s in the j-th column of M. Moreover, we say that M is w-constrained if every column of M has Hamming weight at most w. Observe that every d-runlength constrained matrix with m rows is also w-constrained for w = ⌈m/(d+1)⌉. For simplicity, throughout this work we assume that d+1 divides m. If this is not the case, we simply add up to d all-0 tests to the scheme. As a result, our upper and lower bounds may change by at most an additive term of d.
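The runlength condition on columns is easy to check mechanically. The sketch below is an illustration under assumed names (`column_satisfies_runlength`, `is_runlength_constrained`, `max_weight`), with d denoting the minimum number of 0s between two 1s and m the number of rows. It also computes the largest column weight that the constraint permits, namely the ceiling of m/(d+1).

```python
from typing import List, Sequence


def column_satisfies_runlength(column: Sequence[int], d: int) -> bool:
    """True iff any two 1s in the column are separated by at least d 0s."""
    last_one = None
    for i, bit in enumerate(column):
        if bit:
            if last_one is not None and i - last_one - 1 < d:
                return False
            last_one = i
    return True


def is_runlength_constrained(M: List[List[int]], d: int) -> bool:
    """Check the runlength condition on every column of M (rows are tests)."""
    return all(column_satisfies_runlength(col, d) for col in zip(*M))


def max_weight(m: int, d: int) -> int:
    """Largest Hamming weight of a length-m column obeying the constraint."""
    return -(-m // (d + 1))  # ceil(m / (d + 1))


column = [1, 0, 0, 1, 0, 0, 1]
print(column_satisfies_runlength(column, 2))  # True
print(column_satisfies_runlength(column, 3))  # False
print(max_weight(len(column), 2))             # 3
```

The example column attains the maximum weight for m = 7 and d = 2, which shows that the induced weight bound cannot be improved in general.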

Next, we discuss two error regimes of practical interest.

I-A1 The zero-error setting

We say that the test matrix M represents a zero-error NAGT scheme if distinct k-sparse vectors always yield distinct test outcomes. Our goal is to design d-runlength zero-error NAGT schemes with m as small as possible, given n, k, and d. To this end, we find the following definition useful.

Definition 1 (k-disjunct matrices)

A binary matrix M is said to be k-disjunct if the support of the bit-wise union of any collection of up to k columns of M does not contain the support of any other column of M.

In particular, every k-disjunct matrix corresponds to a zero-error NAGT scheme, and every zero-error NAGT scheme is (k−1)-disjunct. Moreover, k-disjunct matrices have an efficient deterministic decoding procedure that recovers every k-sparse vector x from its test outcomes. Consequently, our goal is to design d-runlength k-disjunct matrices with small m.
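As an illustration of Definition 1, here is a brute-force disjunctness check together with the standard "cover" decoder, which declares an item defective exactly when every test containing it is positive. This is a sketch for small matrices only; the function names and the example matrix (whose columns are all 2-subsets of 4 tests) are assumptions made for illustration, and the decoder is the textbook one, not necessarily the procedure the authors have in mind.

```python
from itertools import combinations
from typing import List, Set


def support(M: List[List[int]], j: int) -> Set[int]:
    """Set of tests (rows) in which item (column) j participates."""
    return {i for i, row in enumerate(M) if row[j]}


def is_disjunct(M: List[List[int]], k: int) -> bool:
    """Brute-force check of k-disjunctness; exponential in k."""
    n = len(M[0])
    supports = [support(M, j) for j in range(n)]
    for size in range(1, k + 1):
        for subset in combinations(range(n), size):
            union = set().union(*(supports[j] for j in subset))
            for j in range(n):
                if j not in subset and supports[j] <= union:
                    return False
    return True


def cover_decode(M: List[List[int]], y: List[int]) -> List[int]:
    """Return the items all of whose tests are positive in the OR outcomes y."""
    positive = {i for i, yi in enumerate(y) if yi}
    return [j for j in range(len(M[0])) if support(M, j) <= positive]


# Columns are all 2-subsets of 4 tests: 1-disjunct, but not 2-disjunct.
pairs = list(combinations(range(4), 2))
M = [[int(i in pair) for pair in pairs] for i in range(4)]

print(is_disjunct(M, 1))              # True
print(is_disjunct(M, 2))              # False
print(cover_decode(M, [1, 1, 0, 0]))  # [0]
```

For a k-disjunct matrix and at most k defectives, the cover decoder returns exactly the defective set, since no non-defective column is covered by the union of the defective columns.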

In zero-error d-runlength (resp. w-constrained) QNAGT, the goal is to design a d-runlength (resp. w-constrained) binary matrix M such that Mx ≠ Mx' for all distinct k-sparse vectors x and x', with m as small as possible.

I-A2 Average-case setting

Here, one only aims to ensure that the average decoding error probability (over the randomness of the test matrix and uniform sampling of a set of k defectives) vanishes with n. More precisely, an average-case NAGT scheme is described by a random binary matrix M along with a deterministic decoding procedure such that

(1)

Our goal is to design d-runlength average-case NAGT schemes with small m, i.e., average-case NAGT schemes with a random test matrix M such that every valid fixing of M is d-runlength constrained.

The definition of an average-case d-runlength (or w-constrained) QNAGT scheme is analogous to the definition above, with the OR test outcomes replaced by the quantitative outcomes Mx in (1).

I-B Related Work

To the best of our knowledge this is the first line of work to consider runlength-constrained NAGT. Some recent work focused on other constrained versions of NAGT sparse NAGT [15, 16, 17, 18] which relates to our formulation. In sparse NAGT, columns and rows of the test matrix are required to satisfy certain Hamming weight constraints. Clearly, our runlength constraint on the columns of the test matrix also induces a weight constraint: Indeed, a -runlength constraint on the columns of a matrix induces a weight constraint on the columns as well. The weight constraints imposed by runlengths and those studied in the works mentioned above are, however, qualitatively different. In the latter, the weight constraint depends on and only, while in our case the weight constraint is a linear fraction of the number of tests (for fixed and ). Therefore, our results are incomparable with those of sparse NAGT.

Furthermore, starting with the work of Söderberg and Shapiro [13], several works have studied the minimum number of tests required for unconstrained QNAGT as a function of and . In particular, Lindström [14] provided an elegant explicit, asymptotically optimal QNAGT scheme for the case . More recently, the optimal number of tests for linear in was determined in [19, 20], and for sublinear in in [21, 22]. The latter work also described efficient constructions of nearly optimal QNAGT schemes (we note that [21] allows non-binary test matrices, while all other works mentioned deal with binary test matrices only). Our results on QNAGT can be seen as a natural extension of the problem studied above to a setting with a column runlength constraint or weight constraint.

Finally, for the motivating application, the work on semiquantitative group testing that generalizes Lindström QNAGT [3] is also of interest as it allows for handling test-dependent noise in the measurements.

I-C Our Contributions

We briefly summarize our main results below:

  • Nearly-optimal runlength-constrained NAGT: We present a probabilistic construction of a -runlength NAGT scheme using tests in the zero-error setting and tests in the average-case setting. Moreover, we derive lower bounds that show the number of tests above are optimal up to and factors in the zero-error and average-case settings, respectively.

  • Nearly optimal weight-constrained QNAGT: As a significant step towards designing good runlength-constrained QNAGT schemes, we analyze a probabilistic construction of weight-constrained QNAGT, and derive complementary lower bounds that show that our construction is order-optimal for a large range of parameters. Note that these lower bounds also hold for runlength-constrained QNAGT.

Two interesting consequences of our results for NAGT are that (i) when d = O(k), runlength-constrained NAGT is not more restrictive than unconstrained NAGT, and (ii) for essentially all d and k, runlength-constrained NAGT is not more restrictive than weight-constrained NAGT, which is a significantly weaker constraint.

I-D Notation

Random variables are denoted by uppercase letters such as , , and , while sets are denoted by uppercase calligraphic letters such as and . The set is denoted by . The -th row of is denoted by and its -th column by . The support of a vector is denoted by ; stands for the base-2 logarithm, while stands for the binary entropy function. The Rényi entropy of order of , also known as the collision entropy, is denoted by .

II Probabilistic Construction of Runlength-Constrained Schemes

We start our discussion with a simple NAGT construction that satisfies an arbitrary runlength constraint d. Let M be a given test matrix with m rows. We construct a test matrix M' with m + (m−1)d rows from M by introducing d all-0 rows between every two consecutive rows of M. Clearly, M' is a valid NAGT scheme which satisfies the given runlength constraint. We can instantiate this construction with the best explicit [9] and probabilistic [5] constructions of k-disjunct matrices. In the explicit setting, we obtain an NAGT scheme with runlength constraint d using O(d k^2 log n) tests. In the probabilistic setting, we obtain an NAGT scheme with runlength constraint d using O(d k^2 log(n/k)) tests. As will be clear from Theorems 2 and 3, our scheme based on a probabilistic construction offers significant reductions in the number of tests compared to this simplistic scheme.
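The padding construction described above can be sketched in a few lines. The function name `pad_with_zero_rows` is an assumption for illustration; the sketch inserts d all-0 rows between consecutive rows, so an input with m rows yields m + (m−1)d rows.

```python
from typing import List


def pad_with_zero_rows(M: List[List[int]], d: int) -> List[List[int]]:
    """Insert d all-0 rows between every two consecutive rows of M."""
    n = len(M[0])
    padded: List[List[int]] = []
    for idx, row in enumerate(M):
        if idx > 0:
            padded.extend([0] * n for _ in range(d))
        padded.append(list(row))
    return padded


M = [
    [1, 1],
    [1, 0],
    [1, 1],
]
P = pad_with_zero_rows(M, 2)
print(len(P))               # 7
print([row[0] for row in P])  # [1, 0, 0, 1, 0, 0, 1]
```

Because the inserted tests contain no items, the informative outcomes are unchanged, while any two 1s in a column are now separated by at least d 0s. The price is a blow-up of roughly a factor d+1 in the number of tests.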

One may also ask whether the standard probabilistic construction, in which each entry of the test matrix is i.i.d. Bernoulli with some probability p (which yields nearly optimal unconstrained NAGT schemes with high probability for p = Θ(1/k)), also leads to a runlength-constrained NAGT scheme. However, unless p is very small (in which case the scheme’s parameters are far from optimal), the resulting test matrix will have several pairs of consecutive 1s with high probability, and hence will not satisfy any d-runlength constraint for d ≥ 1. As a result, we must consider new probabilistic constructions to obtain good parameters.
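The failure of the i.i.d. Bernoulli design can be quantified exactly for a single column with a short dynamic program. The sketch below is an illustration under assumed names (`prob_no_adjacent_ones`, with m the column length and p the Bernoulli parameter); it computes the probability that a column contains no two consecutive 1s, which is necessary for any runlength constraint with d ≥ 1.

```python
def prob_no_adjacent_ones(m: int, p: float) -> float:
    """Probability that m i.i.d. Bernoulli(p) bits contain no two adjacent 1s."""
    if m == 0:
        return 1.0
    # ends_zero / ends_one: probability of a valid prefix ending in 0 / in 1.
    ends_zero, ends_one = 1.0 - p, p
    for _ in range(m - 1):
        ends_zero, ends_one = (ends_zero + ends_one) * (1.0 - p), ends_zero * p
    return ends_zero + ends_one


def prob_matrix_ok(m: int, n: int, p: float) -> float:
    """Probability that all n independent columns avoid adjacent 1s."""
    return prob_no_adjacent_ones(m, p) ** n


print(prob_no_adjacent_ones(2, 0.5))  # 0.75
print(prob_no_adjacent_ones(3, 0.5))  # 0.625
```

Since the columns are independent, the probability that the whole matrix avoids adjacent 1s is the single-column probability raised to the power n, which decays quickly as the number of items grows unless p is tiny.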

II-A The Zero-Error Setting

The algorithm below describes our scheme to construct a matrix that, with high probability, is k-disjunct with a column weight bound w.

Input: Runlength constraint d, weight constraint w.

Output: k-disjunct matrix.

Each column is constructed identically and independently via the following procedure:

  1. Set the list . Set .

  2. Pick an index uniformly at random from the list .

  3. ; .

  4. Let be the set of indices symmetrically and cyclically “surrounding” in .

    For example, if and in the current iteration , , then as 6 and 7 precede 13, while 14 and 1 succeed 13 in cyclic order.

  5. For all , .

  6. Update .

  7. Iterate starting from step 2) as long as .

  8. For all , .
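The following sketch conveys the flavor of the column-sampling procedure: repeatedly pick an index uniformly at random from a list of still-available indices, set it to 1, and remove its neighborhood from the list. It is a simplified variant and not the authors' exact algorithm. In particular, it assumes a non-cyclic exclusion window of d positions on each side in the original index set, whereas the paper removes indices symmetrically and cyclically from the current list. All names (`sample_column`, `sample_matrix`, `m`, `n`, `d`, `w`) are assumptions for illustration.

```python
import random
from typing import List


def sample_column(m: int, d: int, w: int, rng: random.Random) -> List[int]:
    """Sample a length-m column of weight at most w with runlength constraint d."""
    available = list(range(m))
    column = [0] * m
    count = 0
    while count < w and available:
        i = rng.choice(available)  # uniform over the remaining indices
        column[i] = 1
        count += 1
        # Simplifying assumption: drop every index within distance d of i.
        available = [t for t in available if abs(t - i) > d]
    return column


def sample_matrix(m: int, n: int, d: int, w: int, seed: int = 0) -> List[List[int]]:
    """Sample n independent columns and return the m x n matrix (rows are tests)."""
    rng = random.Random(seed)
    columns = [sample_column(m, d, w, rng) for _ in range(n)]
    return [[columns[j][i] for j in range(n)] for i in range(m)]


M = sample_matrix(m=30, n=5, d=2, w=4, seed=1)
for row in M:
    print("".join(str(bit) for bit in row))
```

By construction every sampled column has weight at most w and any two of its 1s are separated by at least d 0s. Whether the resulting matrix is k-disjunct depends on the choice of m and w, which is what Theorem 2 addresses for the authors' procedure.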

   

Theorem 2

Let and . returns a -disjunct matrix that satisfies a runlength constraint with probability at least .

Proof:

Given a matrix that is an output of the algorithm with the parameters as set in Theorem 2, we show that the probability of the matrix not being k-disjunct is at most . Let denote an arbitrary column of and let , where the weight of is . Let denote columns of that differ from . To avoid notational clutter, we write instead of . We wish to show that with high probability over the randomness of the algorithm.

The probability that an index is covered in the support of satisfies

(2)

which is an immediate consequence of the i.i.d. assumption on the columns and the union bound. Also, by the chain rule of probability we have

(3)

We start by deriving an upper bound for the first term in (3). Note that the probability of the event is the probability that one of the following events occurs: the index is picked at the first step; or an index outside the symmetric -neighborhood of the index is picked at the first step and the index is picked at the second step; or indices outside the -neighborhood of the index are selected at the first two steps and the index is picked at the third step; and so on. Thus,

Therefore,

(4)

To find an upper bound for the second term in (3) we proceed as follows.

(5)

where . Furthermore, since indices are at least apart,

Let denote the event that index was the index to be set to in . Then,

where equals

Analogously, Hence,

(6)

Through analysis analogous to equations (5) and (6), we conclude that

(7)

In summary, we have

As a result, the probability that a fixed column of the test matrix is contained in the bitwise union of any other columns of is upper bounded by

Using the union bound over all columns of we see that the probability that is not -disjunct is upper bounded by

(8)

Choosing the parameters as in the statement of Theorem 2 proves that this probability is bounded by . ∎

In Section III, we show that this construction is optimal up to logarithmic factors in and .

II-B Average-Case Setting

We now analyze the performance of our probabilistic construction with respect to average-case NAGT. We will sample the test matrix via the procedure, and will focus on decoding [23, 24]. In particular, we can achieve an error probability . We shall see in Section III that the number of tests required by under decoding is order-optimal for all and .

Theorem 3

Let , and be some positive constant. If the defectives are chosen uniformly at random, then returns a matrix that decodes the defectives correctly with probability at least .

Proof:

Following a similar analysis as that used to prove Theorem 2, we find that

where is decoding.

Setting and , we obtain the desired upper bound on the failure probability of from (1). ∎

III Lower Bounds for Runlength-Constrained Schemes

We now derive lower bounds on the number of tests required for the runlength-constrained zero-error and average-case NAGT schemes. Our lower bounds show that the number of tests of the probabilistic construction presented in Section II is tight up to a logarithmic factor in both regimes. The lower bounds we prove below only use the fact that a -runlength constraint induces a Hamming weight constraint on the columns of the test matrix. Therefore, they hold for all -constrained NAGT schemes. Surprisingly, this shows that runlength-constrained NAGT schemes are not worse than weight-constrained NAGT schemes.

III-A Zero-Error Setting

We begin by noting that, since every zero-error NAGT test matrix for defectives is also -disjunct, in the zero-error setting it suffices to prove lower bounds on the number of rows of runlength-constrained disjunct matrices.

Before we proceed with the proof of the main lower bound, we need the following definition and lemma.

Definition 4 (Private set)

Given a matrix and , a set is said to be -private if for all and for every there is an such that .

Lemma 5

Every -runlength -disjunct matrix must satisfy

Proof:

Fix a -runlength -disjunct matrix , and let . If , then we immediately conclude that . To complete the proof, we show that implies that . Indeed, if , then every column of has weight at most . In turn, since is -disjunct, this means that for every there is an -private set of size . Since for all , it follows that . ∎

We are now ready to prove the main lower bound. In order to do this, we modify the technique based on private sets used to prove the well-known lower bound for unconstrained NAGT schemes [25, Section 19].

Theorem 6

Suppose that . Then, every -runlength -disjunct matrix must satisfy

Proof:

Fix a -runlength -disjunct matrix , and let . We may assume that (otherwise by Lemma 5). In particular, this means that .

We begin by showing that every has an -private set of size . To establish a contradiction, suppose that there is a that does not satisfy this condition. Partition into subsets each of size at most . This is possible because . By assumption, no is -private for . This means that for every (since ), but there is a such that for all . Since partition , it follows that cover . This contradicts the fact that is -disjunct, because .

Since every has an -private set of size at most , and each such set is private for at most one index , it follows that

The standard entropy upper bound on the volume of the Hamming ball then implies that

(9)

where the last inequality follows from the fact that for and the assumption that . Rearranging (9) leads to the desired lower bound on . ∎

Combining the lower bound from Theorem 6 with the lower bound for general NAGT schemes allows us to conclude that the probabilistic construction from Section II is optimal for all regimes of and up to an factor.

III-B Average-Case Setting

In order to show our probabilistic construction is also nearly optimal (up to an factor) in the average-case setting, we employ a simple information-theoretic argument that stems directly from the fact that the vector of test outcomes must have relatively small Hamming weight (and hence it has low entropy).

Theorem 7

Suppose that . Then, every average-case -runlength NAGT scheme must have

Proof:

Suppose M is an average-case d-runlength NAGT scheme. Let X be uniformly distributed over the set of k-sparse vectors, and denote the test outcomes of M on input X by Y.

Using the fact that is an average-case NAGT scheme coupled with Fano’s inequality, it follows that, for some ,

(10)

where denotes the volume of the -dimensional radius- Hamming ball. It remains to upper bound appropriately. Fix . Then, by the runlength constraint, we have that every column of has weight at most . Therefore, since is -sparse, it follows that . As a result, we conclude that

(11)

for every possible fixing . Combining (10), (11), and the inequality valid for all leads to the desired lower bound on . ∎
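The low-entropy argument relies on the fact that a vector of m test outcomes with small Hamming weight can take only a limited number of values. The sketch below numerically checks the standard entropy bound on the volume of a Hamming ball, log2 Vol(m, r) ≤ m·h(r/m) for r ≤ m/2, where h is the binary entropy function. The parameter values are assumptions chosen for illustration, with r = k·m/(d+1) playing the role of the maximum outcome weight when each of k defective columns has weight at most m/(d+1).

```python
from math import comb, log2


def binary_entropy(p: float) -> float:
    """Binary entropy function h(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def log2_ball_volume(m: int, r: int) -> float:
    """log2 of the number of binary vectors of length m and weight at most r."""
    return log2(sum(comb(m, i) for i in range(r + 1)))


# Illustrative parameters (assumed): m tests, runlength d, k defectives.
m, d, k = 120, 9, 3
r = k * m // (d + 1)  # each column has weight at most m / (d + 1)

print(r)                                   # 36
print(log2_ball_volume(m, r))              # bits needed to index the outcomes
print(m * binary_entropy(r / m))           # entropy upper bound on that quantity
```

Since the outcomes must carry enough information to distinguish among all candidate defective sets, any upper bound on the number of attainable outcome vectors translates into a lower bound on m.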

IV Constrained Quantitative Group Testing

Recall that in QNAGT our goal is to design a (potentially random) matrix M such that Mx ≠ Mx' for all pairs of distinct k-sparse vectors x and x', using as few rows as possible. In this section, we study lower bounds for, and constructions of, such QNAGT schemes with runlength or column weight constraints.
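For small instances, the zero-error QNAGT requirement can be verified by exhaustive search: compute the quantitative outcomes of every defective set of size at most k and check that no two coincide. This is a brute-force sketch with assumed names (`is_zero_error_qnagt`, `M`), intended only to make the definition concrete.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def is_zero_error_qnagt(M: List[List[int]], k: int) -> bool:
    """True iff all defective sets of size at most k give distinct outcomes."""
    n = len(M[0])
    seen: Dict[Tuple[int, ...], Tuple[int, ...]] = {}
    for size in range(k + 1):
        for defectives in combinations(range(n), size):
            outcome = tuple(sum(row[j] for j in defectives) for row in M)
            if outcome in seen:
                return False  # two distinct sparse inputs collide
            seen[outcome] = defectives
    return True


M = [
    [1, 0, 1],
    [0, 1, 1],
]
print(is_zero_error_qnagt(M, 1))  # True
print(is_zero_error_qnagt(M, 2))  # False
```

In the example, the sets {0, 1} and {2} both produce the outcome (1, 1), so the matrix separates single defectives but not pairs.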

IV-A Lower Bounds

In this section, we present our lower bounds on the number of tests of runlength-constrained QNAGT schemes. Similarly to Section III, the lower bounds we prove below also hold for all QNAGT schemes with a column weight constraint.

The first lower bound follows from a non-trivial modification of the information-theoretic argument used by Lindström [14] to derive lower bounds for general QNAGT schemes. Before we proceed, we require the following lemmas.

Lemma 8 ([26])

If are independent -valued random variables, it holds that

Lemma 9 ([27])

Let be an integer-valued random variable with variance . Then, we have

Theorem 10

For every constant , large enough, and for a large enough constant , every -runlength QNAGT scheme must have

Proof:

Suppose is a -runlength QNAGT scheme and fix a constant . Consider obtained by sampling each independently according to , and denote the -tuple of test outcomes on input by . Observe that each is distributed according to , where . Moreover, the -runlength constraint enforces that

(12)

where .

A standard application of the Chernoff bound shows that

provided . Let denote the indicator random variable of the event . Then, for large enough (and hence ), we have

(13)

where the last inequality holds for large enough and because , , and . We proceed to show that (12) enforces the inequality

(14)

Coupled with (13), this immediately implies that

which yields the desired lower bound on .

It remains to show that (14) holds. Note that the inequality holds when for all , since for every constant we have

for large enough by Lemma 9. We claim that this is the maximizing assignment of the ’s. This follows easily from the fact that

(15)

for all , which in turn corresponds to Lemma 8 with , , and , where the are i.i.d. according to . To prove the claim above using (15), fix some arbitrary assignment of the ’s satisfying (12). Without loss of generality, we may assume that (12) is satisfied with equality. Then, either for all , or there are such that and . In the latter case, consider the alternative assignment (satisfying (12) with equality) such that , , and for . If and , then (15) immediately implies that

Repeating this argument until for all leads to the claim. ∎

Using the fact that the vector of test outcomes from a -runlength QNAGT scheme satisfies and for every , a reasoning similar to that used to prove Theorem 7 leads to the following result.

Theorem 11

Suppose that . Then, every -runlength QNAGT scheme must have

The lower bounds from Theorems 10 and 11 complement each other, and can also be seen to hold (with slightly smaller leading constants) in the average-case setting. Indeed, the lower bound from Theorem 11 improves upon the one from Theorem 10 whenever the runlength constraint is significantly larger than .

IV-B Towards a Tight Probabilistic Construction

The lower bound from Theorem 10 holds even if we only require that the QNAGT scheme satisfy a column-weight constraint . Phrased in terms of , every average-case -constrained QNAGT scheme must have

(16)

Below, we construct an average-case w-constrained QNAGT scheme consisting of tests, with for a large range of and . According to (16), this is optimal up to a constant factor. Before we present our construction, we present two lemmas.

Lemma 12

Let and be i.i.d. according to . If and are such that when , then

for large enough.

Proof:

First, note that

Hence, it suffices to show the desired inequality for , which is achieved at or . This follows from a direct application of Stirling’s approximation of the factorial to estimate the relevant probabilities, which holds whenever the conditions in the lemma statement are satisfied. ∎

Lemma 13

Let be i.i.d. according to some integer-valued distribution for . Then, it holds that

for every .

Proof:

Note that the collision probability of a given distribution is the squared -norm of its probability mass function (pmf) seen as a real-valued sequence, and that the pmf of is the discrete convolution of the pmf’s of and . The desired inequality then follows directly by applying Young’s inequality for convolution. ∎
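The mechanism in this proof can be checked numerically. The collision probability of a distribution is the sum of squares of its pmf, and the pmf of a sum of independent variables is the convolution of the individual pmfs. Young's inequality with exponents (1, 2, 2) then gives that adding an independent summand never increases the collision probability. The sketch below illustrates only this consequence, not the exact statement of Lemma 13; the function names and the Bernoulli parameter are assumptions.

```python
from typing import List


def convolve(p: List[float], q: List[float]) -> List[float]:
    """pmf of X + Y for independent X ~ p and Y ~ q on {0, 1, 2, ...}."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def collision_probability(pmf: List[float]) -> float:
    """Probability that two independent draws from pmf coincide."""
    return sum(v * v for v in pmf)


bernoulli = [0.8, 0.2]  # assumed Bernoulli(0.2) summand
pmf = bernoulli
values = [collision_probability(pmf)]
for _ in range(5):
    pmf = convolve(pmf, bernoulli)  # add one more independent summand
    values.append(collision_probability(pmf))

print(values)  # a non-increasing sequence
```

Equivalently, the collision entropy of a sum of independent integer-valued variables is at least the collision entropy of any single summand.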

We are now ready to describe and analyze our candidate construction of a -constrained QNAGT scheme.

Theorem 14

Given arbitrary constants and , for large enough , , and such that and , there exists an average-case -constrained -QNAGT scheme with

Proof:

Consider the following process for sampling a binary matrix : Each entry is i.i.d. according to , with . A straightforward application of the Chernoff bound coupled with the choice of shows that holds with probability at most for every . Therefore, a union bound over all columns implies that satisfies the column weight constraint with probability . As a result, in order to show the existence of an average-case -constrained -QNAGT scheme, it now suffices to prove that for every vector of weight we have

(17)

where the probability is taken over the randomness of .

We proceed to show (17). Fix an arbitrary of weight , and let denote the collision probability of a distribution. Then, a union bound over all of weight yields

(18)

The inequality in (18) is obtained by noting that if and only if

Moreover, the two sums above are i.i.d., and are i.i.d. according to for every . We now divide the right-hand side of (18) into two parts which are upper bounded by in different ways. First, note that for large enough we have

where the first inequality follows from Lemma 13, and the second inequality holds for large because . Therefore, for it holds that

It is now enough to show that

for all . Using Lemma 12 along with standard upper bounds on binomial coefficients, for large enough we have

The third inequality follows from the fact that . The last equality follows from the constraints on and and the choice of , since then for large enough and we have

for some constant . The second inequality follows from the fact that . The last inequality follows by the definition of . This concludes the proof. ∎

As mentioned before, Theorem 14 shows the lower bounds derived in Section IV-A are tight for -constrained QNAGT schemes for a sizeable regime of and . This complements results for the unconstrained case presented in [19, 21]. It remains an interesting open problem to verify whether our lower bounds are also tight in the runlength-constrained scenario.

Acknowledgment

The work was supported by the NSF Grant 1618366, the SemiSynBio NSF+SRC program under grant number 1807526 and the DARPA Molecular Informatics program.

References

  • [1] R. Dorfman, “The detection of defective members of large populations,” The Annals of Mathematical Statistics, vol. 14, no. 4, pp. 436–440, 1943.
  • [2] P. Damaschke, Threshold Group Testing.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 707–718.
  • [3] A. Emad and O. Milenkovic, “Semiquantitative group testing,” IEEE Transactions on Information Theory, vol. 60, no. 8, pp. 4614–4636, 2014.
  • [4] J. Wolf, “Born again group testing: Multiaccess communications,” IEEE Transactions on Information Theory, vol. 31, no. 2, pp. 185–191, 1985.
  • [5] W. Kautz and R. Singleton, “Nonrandom binary superimposed codes,” IEEE Transactions on Information Theory, vol. 10, no. 4, pp. 363–377, October 1964.
  • [6] D. Du and F. Hwang, Combinatorial group testing and its applications.   World Scientific, 2000, vol. 12.
  • [7] M. Aldridge, O. Johnson, and J. Scarlett, “Group testing: An information theory perspective,” Foundations and Trends® in Communications and Information Theory, vol. 15, no. 3-4, pp. 196–392, 2019.
  • [8] A. G. D’yachkov and V. V. Rykov, “Bounds on the length of disjunctive codes,” Problemy Peredachi Informatsii, vol. 18, no. 3, pp. 7–13, 1982.
  • [9] E. Porat and A. Rothschild, “Explicit non-adaptive combinatorial group testing schemes,” in ICALP 2008, pp. 748–759.
  • [10] M. Aldridge, L. Baldassini, and O. Johnson, “Group testing algorithms: Bounds and simulations,” IEEE Transactions on Information Theory, vol. 60, no. 6, pp. 3671–3687, June 2014.
  • [11] O. Johnson, M. Aldridge, and J. Scarlett, “Performance of group testing algorithms with near-constant tests per item,” IEEE Transactions on Information Theory, vol. 65, no. 2, pp. 707–723, Feb 2019.
  • [12] S. K. Tabatabaei, B. Wang, N. B. M. Athreya, B. Enghiad, A. G. Hernandez, J.-P. Leburton, D. Soloveichik, H. Zhao, and O. Milenkovic, “DNA punch cards: Encoding data on native DNA sequences via topologi