Optimal Space-Depth Trade-Off of CNOT Circuits in Quantum Logic Synthesis

Due to the decoherence of state-of-the-art physical implementations of quantum computers, it is essential to parallelize quantum circuits to reduce their depth. Two decades ago, Moore et al. demonstrated that additional qubits (or ancillae) could be used to design "shallow" parallel circuits for quantum operators. They proved that any n-qubit CNOT circuit could be parallelized to O(log n) depth, with O(n^2) ancillae. However, near-term quantum technologies can only support a limited number of qubits, making the space-depth trade-off a fundamental research subject for quantum-circuit synthesis. In this work, we establish an asymptotically optimal space-depth trade-off for the design of CNOT circuits. We prove that for any m≥0, any n-qubit CNOT circuit can be parallelized to O(max{log n, n^2/((n+m) log(n+m))}) depth, with O(m) ancillae. We show that this bound is tight by a counting argument, and further show that even when arbitrary two-qubit quantum gates are allowed to approximate CNOT circuits, the depth lower bound still meets our construction, illustrating the robustness of our result. Our work improves upon two previous results, one by Moore et al. for O(log n)-depth quantum synthesis, and one by Patel et al. for m = 0: for the former, we reduce the number of ancillae needed by a factor of log^2 n by showing that m=O(n^2/log^2 n) additional qubits suffice to build CNOT circuits of O(log n) depth and O(n^2/log n) size, which is asymptotically optimal; for the latter, we reduce the depth by a factor of n to the asymptotically optimal bound O(n/log n). Our results can be directly extended to stabilizer circuits using an earlier result by Aaronson et al. In addition, we provide relevant hardness evidence for synthesis optimization of CNOT circuits in terms of both size and depth.


1 Introduction

One of the most important tasks in quantum computing is quantum-circuit synthesis. Given any $n$-qubit unitary operator, synthesis algorithms aim to implement it as a sequence of low-level gates, and optimize the circuit size and depth [4, 13]. During the last decade, quantum synthesis algorithms have been developed to achieve asymptotically optimal size [35, 38, 22, 34]. To reduce the circuit depth, synthesis algorithms commonly use ancillae. For example, with sufficient ancillae, the Quantum Fourier Transform can be approximated by an $O(\log n)$-depth circuit [9] and stabilizer circuits can be parallelized to $O(\log n)$ depth [27]. However, near-term quantum devices only have a small number of qubits [30], which may seriously limit the number of ancillae. This practical concern gives rise to the following fundamental space-depth trade-off problem in quantum circuit synthesis:

Can we characterize the relationship between the number of ancillae and the possible optimal depth?

Because the controlled NOT gate (CNOT) and single-qubit operations form a universal set for quantum computing [4, 13], CNOT-circuit optimization has been widely studied. For circuit size, Patel, Markov, and Hayes [29] proved that each $n$-qubit CNOT circuit can be synthesized with $O(n^2/\log n)$ CNOT gates and that this bound is asymptotically tight. When topological constraints (i.e., limited two-qubit connectivity among the addressable qubits) are taken into consideration, synthesis algorithms have been designed by Kissinger and de Griend [21] and by Nash, Gheorghiu, and Mosca [28] to build circuits of size $O(n^2)$. For circuit depth, Moore and Nilsson [27] proved that given $O(n^2)$ ancillae, any $n$-qubit CNOT circuit can be parallelized into $O(\log n)$ depth. In addition, Aaronson and Gottesman [1] established a strong connection between CNOT circuits and stabilizer circuits. They proved that stabilizer circuits have a canonical form of $11$ blocks, and each block consists of only one type of gate from the gate set {Hadamard, Phase, CNOT}. Since each block of Phase or Hadamard gates has depth 1 and size at most $n$, optimization of CNOT circuits can be generalized to stabilizer circuits.

Our main contribution: In this paper, we establish an asymptotically optimal space-depth tradeoff in CNOT-circuit synthesis, as stated in the following theorem.

Theorem 1.

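For any integer $m \ge 0$, any $n$-qubit CNOT circuit can be parallelized to

$$O\left(\max\left\{\log n,\ \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$$

depth with $O(m)$ ancillae. Moreover, there is an $\tilde{O}(n^{\omega})$-time synthesis algorithm for achieving this (here $\omega$ is the matrix multiplication exponent [17]).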

Theorem 1 can be readily extended to stabilizer circuits thanks to the result of Aaronson and Gottesman [1], which states that for any stabilizer circuit, there exists an equivalent circuit that applies a block of Hadamard gates only (H), then a block of CNOT gates only (C), then a block of Phase gates only (P), and so on in the block sequence H-C-P-C-P-C-H-P-C-P-C. Since Hadamard and Phase gates are single-qubit gates, and consecutive ones on the same qubit can be merged, each such block has depth at most one and size at most $n$. Therefore, it suffices to optimize the CNOT blocks. Besides, Theorem 1 can be extended to CNOT+$X$ circuits, as all $X$ gates can be moved to the end of the circuit [21]. We summarize these consequences in the following corollary.

Corollary 1.

For any integer $m \ge 0$, any $n$-qubit stabilizer circuit can be parallelized to

$$O\left(\max\left\{\log n,\ \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$$

depth with $O(m)$ ancillae. The same statement also holds for CNOT+$X$ circuits.

Our result in Theorem 1 improves upon two previous results concerning parallel CNOT-circuit synthesis, respectively, with sufficient ancillae or without any ancillae. By parallelizing any CNOT circuit into an $O(\log n)$-depth, $O(n^2/\log n)$-size equivalent circuit with $O(n^2/\log^2 n)$ ancillae, we reduce the number of ancillae needed by Moore and Nilsson [27] by a factor of $\log^2 n$. By achieving the asymptotically optimal depth bound of $O(n/\log n)$ in parallel CNOT-circuit synthesis without any ancillae, i.e., $m = 0$, we reduce the depth implied in the work of Patel, Markov, and Hayes [29] by a factor of $n$.
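To get a feel for this trade-off, the following minimal sketch evaluates the depth expression from the abstract with all hidden constants dropped. The function name `depth_bound` and the use of base-2 logarithms are illustrative assumptions, not part of the paper.

```python
import math


def depth_bound(n: int, m: int) -> float:
    """Evaluate max{log n, n^2 / ((n + m) log(n + m))}, ignoring constants.

    Assumes n >= 2 so that the logarithms are positive.
    """
    return max(math.log2(n), n * n / ((n + m) * math.log2(n + m)))


n = 1024
# No ancillae: the second term dominates and behaves like n / log n.
print(depth_bound(n, 0))
# About n^2 / log^2 n ancillae: the bound collapses to log n.
print(depth_bound(n, n * n // int(math.log2(n)) ** 2))
```

Plotting `depth_bound(n, m)` against `m` for a fixed `n` shows the depth falling smoothly from the ancilla-free regime to the logarithmic floor.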

These improvements are significant theoretically, because we also prove, by a counting argument, that our space-depth trade-off is asymptotically tight. The tightness of Theorem 1 is proved in a more general setting by the following theorem: even if arbitrary two-qubit quantum gates are allowed, rather than only CNOT gates, to approximately implement the given CNOT circuit, our construction still meets the lower bound. Roughly speaking, an $\epsilon$-approximate circuit outputs a quantum state that is close in norm to the CNOT circuit’s output.

Theorem 2.

For fraction of -qubit CNOT circuit, any -approximate -qubit -ancilla quantum circuit has depth , where is a constant.

Besides the depth, for , our construction has size . It’s easy to generalize technique in [3, 8] to show that such -qubit -ancillae circuit must have size . Thus our construction also meets the asymptotically optimal size.

Mathematically, our synthesis method for Theorem 1 is based on carefully designed Gaussian eliminations. As observed by Patel et al. [29], any $n$-qubit CNOT circuit can be represented by an invertible matrix $M$ over $\mathbb{F}_2$, and the synthesis of a CNOT circuit is equivalent to transforming $M$ to the identity by Gaussian eliminations. As our aim in this paper is to reduce the circuit depth, we use parallel Gaussian eliminations instead. We minimize the number of parallel Gaussian eliminations by the following two techniques:

  • For the case without any ancillae, we first establish that if the structure of the matrix is near random, then it is amenable to effective parallel Gaussian elimination. We then use a popular idea from oblivious routing [20] to ensure the existence of a close-to-random structure. This randomization step is then derandomized by a standard approach with somewhat tricky conditional expectations.

  • For the case with non-zero ancillae, we adopt the idea from the Method of Four Russians. Recall that in [29], Patel et al. use Four Russians to eliminate columns, namely elements, by Gaussian eliminations. In this work, we deal with columns, namely elements, by parallel Gaussian eliminations. This is done by preparing the additive basis of Boolean vectors [23] and properly balancing the trade-off between resource and cost.

Both our ancilla-based and ancilla-free synthesis algorithms for CNOT circuits rely on the 1-factorization of almost regular bipartite graphs [10, 2] (a direct application of Hall’s marriage theorem) to get an ideal ordering. We show that both algorithms run in time $\tilde{O}(n^{\omega})$, where $\omega$ is the matrix multiplication exponent.

Our results have direct implications for matrix decomposition over finite fields. More precisely, we show that an arbitrary invertible $n \times n$ matrix over $\mathbb{F}_2$ can be decomposed into $O(n/\log n)$ parallel row-elimination matrices. Our technique can be easily generalized to the finite field $\mathbb{F}_q$ for constant $q$.

Theorem 3.

For any invertible $n \times n$ matrix $M$ over $\mathbb{F}_q$, where $q$ is a constant, $M$ can be transformed to identity by $O(n/\log n)$ parallel Gaussian eliminations.

Our construction indicates that there might be a parallel Gaussian elimination algorithm which solves linear equations over by parallel row-elimination matrices with parallel time. Some related work can be seen in [15, 33, 32, 11]. We leave this as an open problem for future research.

A related fundamental problem is to construct an equivalent circuit with optimal depth for any given CNOT circuit. Note that our parallel CNOT-circuit synthesis algorithm is optimal in the asymptotic sense. Specifically, given any invertible matrix $M$ and a pair of integers $(m, d)$, determine whether there exists an $n$-qubit $m$-ancilla CNOT circuit for $M$ with depth at most $d$. This decision problem is similar to the Minimum Circuit Size Problem (MCSP), the famous problem which is unlikely to be proven in P or NP-complete by natural proofs [19, 31]. In this paper, we provide what we consider to be relevant hardness evidence by proving hardness results for optimizing CNOT circuits in slightly different scenarios. In the first scenario, one aims to optimize the depth of a CNOT circuit under certain topological constraints [12, 14] with ancillae. In the second scenario, one aims to optimize a sub-circuit of a CNOT circuit with ancillae. We briefly summarize the inapproximability result as Theorem 4; the formal statement is in Section 5.2.

Theorem 4 (Informal).

It is NP-hard to approximate the solution of the following problems within any constant factor:

  • Global Constrained Minimization: Given an $n$-qubit CNOT circuit, an integer $m$, and topological constraints on the qubits, output an equivalent $n$-qubit $m$-ancilla CNOT circuit of minimum size or depth.

  • Local Size Minimization: Given an $n$-qubit CNOT circuit, an integer $m$, and a specific part of the circuit, output an equivalent $n$-qubit $m$-ancilla CNOT circuit which optimizes the size of the specified sub-circuit.

At a high level, Global Constrained Minimization () aims to find the optimal size or depth under certain topological constraints , where CNOT gate with control and target is legal iff . This restriction is common in existing quantum devices [12, 14] and CNOT circuit optimization on such devices has been discussed [28, 21]. The Local Size Minimization () aims to optimize the size of a selected part of the circuit while leaving other parts unchanged. Hardness for general quantum circuit optimization over topological constraints can be seen in [18, 7].

Organization of the paper. In Section 2, we review notations and basic definitions used in this paper. In Section 3, we first present our parallel CNOT-circuit synthesis algorithm without using any ancillae. In addition, we prove that any tree-based CNOT circuit can be parallelized to $O(\log n)$ depth without any ancillae. In Section 4, we present our ancilla-based synthesis algorithm and complete the proof of Theorem 1. In Section 5, we give the lower bound and related hardness results. Finally, in Section 6, we summarize the paper and present some open problems.

2 Preliminary

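Basic Notations: We use $\tilde{O}(\cdot)$ to hide the polylogarithmic terms, $[n]$ to denote $\{1, 2, \ldots, n\}$, $\mathbb{C}$ to denote the complex domain, $\mathbb{F}_q$ to denote the field with $q$ elements, $\oplus$ to denote addition under $\mathbb{F}_2$, $GL(n, \mathbb{F}_q)$ to denote the set of $n \times n$ invertible matrices with entries from $\mathbb{F}_q$, superscript $\top$ to denote the transpose of a matrix or vector, $I_n$ to denote the $n \times n$ identity matrix (its subscript is omitted if the context is clear), $E_{i,j}$ to denote the all-zero matrix except that the $(i,j)$ entry equals $1$, and $M(i,j)$, $M(i,\cdot)$, $M(\cdot,j)$ to denote, respectively, the $(i,j)$ entry, the $i$-th row, and the $j$-th column in matrix $M$.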

CNOT Gate and Circuits: A CNOT gate maps a Boolean tuple $(x, y)$ to $(x, x \oplus y)$. Because it is an invertible linear map over $\mathbb{F}_2^2$, any $n$-qubit CNOT circuit can be viewed as an invertible linear map over $\mathbb{F}_2^n$, represented as an invertible $n \times n$ matrix over $\mathbb{F}_2$.

Ancilla-Based CNOT Circuits: An -qubit -ancilla CNOT circuit has ancillae with initial assignment , and satisfies the following key property: After the evaluation of the circuit, all ancillae are restored, regardless of the input of the qubits. An -qubit -ancilla CNOT circuit implements an invertible matrix if for any and input , the output of the circuit is . In other words, the matrix representation for is for some and invertible . Since do not interfere the output when input is , we abbreviate them as . In particular, represents some matrix in . In the remainder of this paper, we say such is an -qubit -ancilla CNOT circuit for .

Equivalent Ancilla-Based CNOT Circuits: We say an -qubit -ancilla CNOT circuit is equivalent to an -qubit -ancilla CNOT circuit (denoted by ) if for any , the output of the first qubits are the same for and .

Row-Elimination Matrices: Mathematically, a CNOT gate with control qubit $i$ and target qubit $j$ can be represented as a row elimination from $i$ to $j$ (i.e., adding row-$i$ to row-$j$). Thus, a CNOT circuit can be viewed as the product of a sequence of row-elimination matrices.
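The correspondence between CNOT gates and row operations can be sketched in a few lines. This is only an illustration under our own conventions: gates are `(control, target)` pairs with 0-based indices, matrices are lists of 0/1 rows, and the helper names are hypothetical.

```python
from itertools import product


def circuit_matrix(n, gates):
    """Matrix over GF(2) of a CNOT circuit whose gates are applied in order."""
    m = [[int(i == j) for j in range(n)] for i in range(n)]
    for control, target in gates:
        # A CNOT adds row `control` to row `target` (XOR is addition mod 2).
        m[target] = [a ^ b for a, b in zip(m[target], m[control])]
    return m


def run_circuit(bits, gates):
    """Simulate the circuit directly on a classical bit string."""
    bits = list(bits)
    for control, target in gates:
        bits[target] ^= bits[control]
    return bits


def apply_matrix(m, bits):
    """Matrix-vector product over GF(2)."""
    return [sum(a & b for a, b in zip(row, bits)) % 2 for row in m]


gates = [(0, 1), (1, 2)]
m = circuit_matrix(3, gates)
# The matrix and the gate-by-gate simulation agree on every input.
assert all(apply_matrix(m, x) == run_circuit(x, gates)
           for x in product([0, 1], repeat=3))
```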

Definition 1 (Row-Elimination Matrix).

We say matrix is a row-elimination matrix if or there exists such that

Note that for any row-elimination matrix , we have . We use to denote A row-elimination matrix represents exactly one single step in the process of Gaussian elimination. For any matrix, left-multiplied by represents adding row- to the row-.
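Since every row elimination is one CNOT gate, plain Gauss-Jordan elimination already yields a (sequential, not depth-optimized) synthesis procedure. The sketch below is our own illustration of that textbook idea, not the parallel algorithm of this paper; `synthesize` and `circuit_matrix` are hypothetical names, and the input is assumed to be invertible over GF(2).

```python
def circuit_matrix(n, gates):
    """Matrix over GF(2) of a CNOT circuit given as (control, target) pairs."""
    m = [[int(i == j) for j in range(n)] for i in range(n)]
    for c, t in gates:
        m[t] = [a ^ b for a, b in zip(m[t], m[c])]
    return m


def synthesize(matrix):
    """Return CNOT gates, in application order, whose circuit realizes `matrix`."""
    m = [row[:] for row in matrix]
    n = len(m)
    ops = []

    def add(c, t):
        # One row elimination: add row c to row t, and record it as a CNOT.
        m[t] = [a ^ b for a, b in zip(m[t], m[c])]
        ops.append((c, t))

    for col in range(n):
        if m[col][col] == 0:
            # Invertibility guarantees a row below with a 1 in this column.
            pivot = next(r for r in range(col + 1, n) if m[r][col])
            add(pivot, col)
        for r in range(n):
            if r != col and m[r][col]:
                add(col, r)
    # ops reduce `matrix` to the identity; each op is self-inverse, so the
    # reversed list, applied as a circuit, builds `matrix` from the identity.
    return ops[::-1]


target = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
assert circuit_matrix(3, synthesize(target)) == target
```

This uses $O(n^2)$ gates in the worst case and makes no attempt to run gates in parallel.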

Parallel Row-Elimination Matrices: A basic concept in parallel CNOT-circuit synthesis is parallel row elimination (or equivalently parallel Gaussian elimination).

Definition 2 (Parallel Row-Elimination Matrix).

We say matrix is a parallel row-elimination matrix if or there exists such that ’s are different indices and

A parallel row-elimination matrix represents several independent steps in Gaussian elimination. Since all are distinct, there is no need to name a particular order. When is clear in the context (like in Section 3 and Section 4), we use to denote a parallel row-elimination matrix, to denote a sequence of parallel row-elimination matrices.
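To make the notion of a parallel step concrete, the following sketch groups a gate list into layers in which all touched qubits are distinct, so each layer corresponds to one parallel row elimination. It uses simple as-soon-as-possible scheduling, which is our own illustrative choice and does not achieve the depth bounds proved in this paper; `layer_gates` is a hypothetical name.

```python
def layer_gates(n, gates):
    """Greedily pack (control, target) gates into layers of disjoint qubits."""
    next_free = [0] * n  # first layer in which each qubit is still unused
    layers = []
    for c, t in gates:
        d = max(next_free[c], next_free[t])
        if d == len(layers):
            layers.append([])
        layers[d].append((c, t))
        # Gates sharing a qubit keep their original relative order.
        next_free[c] = next_free[t] = d + 1
    return layers


layers = layer_gates(4, [(0, 1), (2, 3), (1, 2)])
assert layers == [[(0, 1), (2, 3)], [(1, 2)]]
```

The number of layers returned is the depth of the circuit as written; the synthesis problem studied here asks for an equivalent circuit that minimizes it.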

Quantum Approximation: In this paper, without loss of generality, we only consider quantum circuits consisting of single qubit and two-qubit gates. In Section 5, we will use the following definitions of quantum circuit approximation.

Definition 3 (-close).

For any , two vectors are said to be -close iff .

Definition 4 (-approximate).

Given -qubit quantum circuit and -qubit -ancilla quantum circuit , we say -approximates if for any ,

  • maps to for some ,

  • and are -close.

3 Parallelizing CNOT-Circuit Synthesis Without Ancillae

We will divide the proof of Theorem 1 into two parts. In this section, we prove the first part which covers the case when . We will address the rest case in Section 4. In fact, it is sufficient to prove the following Theorem 5 here, as it implies an -time algorithm to parallelize any -qubit CNOT circuit to depth with ancillae.

Theorem 5 (Ancilla-Free Parallel CNOT Synthesis).

There is an $\tilde{O}(n^{\omega})$-time algorithm to parallelize any $n$-qubit CNOT circuit to $O(n/\log n)$ depth without ancillae.

Thanks to the connection between CNOT circuits and invertible Boolean matrices, we can reformulate Theorem 5 as the following:

Lemma 1 (Theorem 5 Reformulated).

There is an -time algorithm such that given any , it outputs parallel row-elimination matrices where such that

Proof.

By Bunch and Hopcroft [6], we can factorize, in time , , where is a permutation matrix, and are, respectively, lower and upper triangular matrices. Besides, it follows a result from Moore and Nilsson [27] that any permutation matrix can be decomposed into six parallel row-elimination matrices. Lemma 1 then follows from the claim below, as we can handle lower triangular matrices similarly. ∎

Claim 1.

Lemma 1 holds for any upper triangular .

Proof.

Our algorithm applies a divide-and-conquer scheme. Like in standard analyses for divide-and-conquer methods, we assume that is sufficiently large (the details will become clear later in the proof). For simplicity, we first consider a randomized algorithm. We will then derandomize it using Lemma 2. The synthesis process is shown in Figure 1, which has five main steps.

Step 1 (Recursion): Denote as , where is of size . After simultaneous recursive parallel row elimination on , the upper right part is , which can be computed in advance, independently of the recursion, in time $O(n^{\omega})$ [17, 36].

Step 2 (Find Random Layby): Divide as and each is in . Our key observation here is: If is “close to random” (to be formally defined in Lemma 2), then it can be eliminated efficiently. To ensure the needed degree of randomness in our matrix structures, we use a classical idea from oblivious routing [20]: We generate a random of same size for each as its layby and define . Note that, although they are correlated, both and by themselves are “random” matrices with entries from .
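The layby idea is the familiar masking trick: a fixed value XORed with a uniformly random value is itself uniformly random. A minimal sketch, assuming each block entry is modelled as a `k`-bit integer and using hypothetical names (`split_with_layby`, `layby`, `diff`):

```python
import random


def split_with_layby(block, k, rng):
    """Pick a random layby and return (layby, diff) with block = layby XOR diff."""
    layby = [[rng.randrange(1 << k) for _ in row] for row in block]
    diff = [[a ^ r for a, r in zip(row, lrow)]
            for row, lrow in zip(block, layby)]
    return layby, diff


rng = random.Random(0)
block = [[3, 5], [0, 7]]
layby, diff = split_with_layby(block, 3, rng)
# The original entries are recovered by XORing the two parts.
assert all(block[i][j] == layby[i][j] ^ diff[i][j]
           for i in range(2) for j in range(2))
# For a fixed entry a, r -> a XOR r is a bijection on k-bit values,
# so diff is uniformly distributed whenever layby is.
assert sorted(5 ^ r for r in range(8)) == list(range(8))
```

The paper goes further and replaces the random choice by a deterministic one (Lemma 2); this sketch only illustrates the randomized starting point.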

Step 3 (Generate Row-Traversal Sequence): We say matrix sequence is row-traversal if

  • ,

  • for any , sequence visits all vectors in .

Let and we apply Lemma 3 to obtain a row-traversal sequence with .

Step 4 (First Traverse): In this step, we will add ’s to ’s and then get ’s.

View the bottom right as identities of same size and name them as .

Let . For each time stamp , all ’s simultaneously go from to using original Gaussian elimination algorithm, then

  • for all , find “large” set such that any satisfies

    • (recall that entries of are from ),

    • was not selected in previous (i.e., in previous time stamps and previous repetitions),

    • for any other , holds;

  • for all and , add row- of to row- of as one parallel row elimination;

  • repeat the two steps above until all .

The detailed explanation of how to construct will be justified later.

Step 5 (Second Traverse): Now in the upper right part, all ’s have reached the pre-decided layby ’s. In this step, we do another round of traversal similar to Step 4; the only difference is that we use when constructing . Thus, we add ’s to ’s as in Step 4, and the upper right square finally becomes zero.

Figure 1: Main algorithm for in-place parallel Gaussian elimination.

Now we explain the construction of in Step 4. For fixed and , although is found repeatedly in Step 4 for better description, it is actually implemented in a single shot. We justify this as well as its efficiency, where the random plays an essential role.

When is random, any vector in appears about times in every row and column of with high probability. Then we enumerate all such that and view them as the edges of a bipartite graph. Thus any valid is a matching in this graph, and the iterated construction is equivalent to a matching decomposition. Since any vertex has degree about , the bipartite graph can be factorized into about matchings in linear time (hiding polylogarithmic terms) [10, 2].

Hence in Step 4, it will use about parallel row-elimination matrices for every time stamp. Similar analysis holds for Step 5, and we will derandomize the choice of in Lemma 2.
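The matching-decomposition step can be illustrated with a much simpler greedy procedure. The sketch below is not the near-linear-time 1-factorization of [10, 2]: it is an assumption-laden stand-in that may use up to $2\Delta - 1$ matchings for maximum degree $\Delta$, whereas an optimal decomposition of a bipartite graph needs only $\Delta$. The name `greedy_matchings` is hypothetical.

```python
def greedy_matchings(edges):
    """Partition bipartite edges (u, v) into matchings, first fit."""
    matchings = []   # list of edge lists
    used_left = []   # per matching: left endpoints already covered
    used_right = []  # per matching: right endpoints already covered
    for u, v in edges:
        for k in range(len(matchings)):
            if u not in used_left[k] and v not in used_right[k]:
                break
        else:
            # No existing matching can take this edge: open a new one.
            k = len(matchings)
            matchings.append([])
            used_left.append(set())
            used_right.append(set())
        matchings[k].append((u, v))
        used_left[k].add(u)
        used_right[k].add(v)
    return matchings


edges = [(u, v) for u in range(3) for v in range(3)]
parts = greedy_matchings(edges)
assert sorted(e for part in parts for e in part) == sorted(edges)
```

Each returned matching pairs every row with at most one partner, which is exactly the condition needed for the corresponding row additions to form one parallel row elimination.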

Thus the maximum number of parallel row-elimination matrices, if denoted as , can be obtained by the following recursion
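As a sanity check on why a divide-and-conquer scheme of this kind can stay within $O(n/\log n)$ layers, suppose (this is an assumption for illustration; the symbols $D(n)$ and $c$ are ours) that the two diagonal blocks are handled simultaneously and that clearing the off-diagonal block costs at most $c\,n/\log n$ layers. Unrolling such a recursion gives:

```latex
\begin{align*}
D(n) &\le D(n/2) + \frac{c\,n}{\log n}
      \le \sum_{i=0}^{\log n - 1} \frac{c\,n/2^{i}}{\log\!\left(n/2^{i}\right)} \\
     &\le \underbrace{\sum_{i \le \frac{1}{2}\log n} \frac{2c}{\log n}\cdot\frac{n}{2^{i}}}_{\le\, 4c\,n/\log n}
      \;+\; \underbrace{\sum_{i > \frac{1}{2}\log n} c\,\sqrt{n}}_{\le\, c\,\sqrt{n}\,\log n}
      \;=\; O\!\left(\frac{n}{\log n}\right).
\end{align*}
```

The first group uses $\log(n/2^{i}) \ge \tfrac{1}{2}\log n$, and the second uses $n/2^{i} < \sqrt{n}$ together with $\log(n/2^{i}) \ge 1$.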

Using Lemma 2 and Lemma 3, the running time, if denoted as , can be obtained by the following recursion

Now we give two essential components in the proof of Claim 1. Lemma 2 addresses the crucial property that must have to make the matching decomposition and parallel row elimination efficient.

Note that the proof of existence in Lemma 2 can be obtained easily by direct application of Chernoff’s bound, but in that case it would be hard to derandomize in time .

Lemma 2.

There is an -time algorithm such that for any sufficiently large, given matrix with entries from , it outputs of same format satisfying for any , it appears at most times in any row or column of .

Proof.

Pick entries of bit by bit uniformly at random. Let and set with foresight.

In the following, we prove by induction on the number of determined bits that no appears as prefix more than times in any row or column of .

Assume first bits are determined, now we randomly pick the -th bit from . For any , define four 0/1 bad event indicators:

  • () iff appears as prefix more than times in (),

  • () iff appears as prefix more than times in ().

Then the expectation of the number of bad events is

(Chernoff’s bound)

where and denotes the -round Bernoulli trial with probability .

Thus, there exists an assignment of the -th bit such that no appears as prefix more than in any row or column of as claimed.

At last, the desired property follows from

We use to denote the relation that two vectors share the same first bits. Let the undetermined bit in entries of be .

Now we derandomize the choice of the -th bit of by the method of conditional expectation for some fixed . Let the first bits of be respectively, and

  • ,

  • ,

where .

Let and the -th bit of be . Suppose we pick as the -th bit of , and define , then the expectation of bad events decreases by

where

Then we choose that decreases most, which must be non-negative.

To boost the selection, we pre-process and truncate to the highest significant bits. Then even if the best increases the expectation, the fluctuation is and accumulates as an insensitive term. ∎

Lemma 3 presents a simple way to construct the row-traversal sequence. Though its length can be further improved, the asymptotic order is already tight and sufficient for our purpose. Thus we do not particularly pursue the optimal parameter in it.

Lemma 3.

There is an -time algorithm to generate a row-traversal sequence of length .

Proof.

Let and compute the rank of matrix over . Observe that for any ,

since for any ,

Also, any row of traverses all vectors in as goes through . Thus the output of Algorithm 1 gives the desired sequence.

foreach  do
       if  then
             Let be an arbitrary index that Change to make invertible if  then
                   Construct that Output
             end if
            
       end if
      Output
end foreach
Output
Algorithm 1 Construct row-traversal sequence

Note that the technique to prove Lemma 1 can be extended to general , which is stated as Theorem 3.

Proof of Theorem 3.

The proof is almost identical except that all is replaced with . Another difference is that the length of row-traversal sequence will be in Lemma 3. ∎

By the lemmas and theorem above, we have parallelized any $n$-qubit CNOT circuit to $O(n/\log n)$ depth without ancillae. A fundamental problem in parallel CNOT-circuit synthesis, when no ancillae are given, is to characterize the impact of circuits’ topological structures on the size-depth trade-off. Unlike in the asymptotic space-depth trade-off, where CNOT circuits are essentially compressed into an invertible matrix over $\mathbb{F}_2$, here the circuit details are part of the input to synthesis algorithms.

While this problem remains an on-going research subject, in the following we use a basic family of CNOT circuits to illustrate that the topological details of CNOT circuits can be effectively used. This family of the CNOT circuits has tree structures: Given a proper binary tree with leaves, in which each leaf has a unique label from and each internal node has a label from , we can define an -qubit CNOT circuit (with variables ) as the following.

We use postorder-traversal to define the CNOT circuit by first defining for each node in , its qubit index (and the gate it describes if is internal):

  • For a leaf , is its label in ;

  • for an internal node with label and children , then , and ;

  • for an internal node with label and children , then and .

Suppose the postorder-traversal projection of the internal nodes of is . Then,

An example for CNOT trees can be seen in Figure 2.

[Circuit diagram on qubits $x_1, \ldots, x_5$ with gates $M_{v_1}, \ldots, M_{v_4}$.]
Figure 2: An example of CNOT trees, where the right tag above an internal node is its qubit index.

The following theorem gives an equivalent $O(\log n)$-depth CNOT circuit for any $n$-qubit CNOT circuit defined by such a tree.

Theorem 6 (Parallel Synthesis of CNOT Trees).

For any proper binary tree $T$ with $n$ leaves, the $n$-qubit CNOT circuit defined by $T$ can be parallelized to $O(\log n)$ depth without ancillae.

Theorem 6 can be obtained by applying Miller and Reif’s parallel-tree-contraction technique [24, 26, 25]. See Appendix C for the proof. Theorem 6 can be generalized to the following corollary.

Corollary 2.

If an -qubit CNOT circuit can be expressed as the product of CNOT trees, then it can be parallelized into a CNOT circuit with depth without ancillae.

4 Parallelizing CNOT circuits with ancillae

In this section, we prove Theorem 1 for the part, i.e., . For any , the bound in Theorem 1 is always , thus it suffices to consider . We restate Theorem 1 in this case as follows.

Theorem 7.

There is an -time algorithm to parallelize any -qubit CNOT circuit into depth with ancillae, where .

We use a standard technique in reversible computation to simplify the problem. Given arbitrary , Theorem 7 aims to construct a CNOT circuit for with ancillae. We first construct two -qubit -ancilla CNOT circuits for respectively, i.e., for any , ,

Starting with and applying , where takes the second -bits as control and the first -bits as target, we get

Then, we permute the first and second qubits, which can be done in depth by [27], to get the final circuit.
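The compute, uncompute, and swap pattern described above can be checked classically on bit vectors. In this sketch the registers are plain Python lists, `M` is an assumed name for the target matrix, and the helpers (`matvec`, `inverse`, `apply_with_ancillae`) are hypothetical; it verifies the algebra only and says nothing about depth.

```python
from itertools import product


def matvec(m, x):
    """Matrix-vector product over GF(2)."""
    return [sum(a & b for a, b in zip(row, x)) % 2 for row in m]


def inverse(m):
    """Invert a matrix over GF(2) by Gauss-Jordan elimination."""
    n = len(m)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [a ^ b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def apply_with_ancillae(m, x):
    """Map (x, 0) to (Mx, 0) using a second register of n ancillae."""
    data, anc = list(x), [0] * len(m)
    # Data controls, ancillae are targets: (x, 0) -> (x, Mx).
    anc = [a ^ b for a, b in zip(anc, matvec(m, data))]
    # Ancillae control, data is target: (x, Mx) -> (x + M^{-1}Mx, Mx) = (0, Mx).
    data = [a ^ b for a, b in zip(data, matvec(inverse(m), anc))]
    # Exchange the two registers: (0, Mx) -> (Mx, 0).
    return anc, data


M = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
for x in product([0, 1], repeat=3):
    out, anc = apply_with_ancillae(M, x)
    assert out == matvec(M, list(x)) and anc == [0, 0, 0]
```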

Based on the observation above as well as the equivalence between CNOT circuits and invertible Boolean matrix, to prove Theorem 7, it suffices to construct circuit for as Lemma 4 states.

By Lemma 4, the time complexity to construct is . On the other hand, it needs time to compute by [36, 17], thus the overall time cost is for Theorem 7.

Lemma 4.

There is an -time algorithm such that given , it outputs parallel row-elimination matrices where such that

We delay the detailed proof of Lemma 4 to the end of the section and prove several key lemmas first. The key point here is, we can construct columns of using parallel Gaussian eliminations with the help of rows in the last. Then we simply construct columns of as group of sequentially.

We begin our proof with the base case as Lemma 6, which calls Lemma 5 as a sub-procedure to construct a slim sparse matrix.

Lemma 5.

There is an -time algorithm that given which has at most one’s, it outputs where such that

Proof.

For simplicity, we write for the vector, whose entries are except for the -th. Let , then .

Now we describe how to make copies of ’s on rows by parallel row eliminations. Let the rows be . We add the first row (original ), to the -th row; then double the number of by adding the first and -th to -th rows simultaneously; and keep doubling till the number reaches .

Since , we make copies of independently for all on the last rows with parallel row-elimination matrices.

Then, we construct on the middle rows . For , add ’s to the -th row if , which needs at most row additions. Since there are sufficient copies of ’s, all are considered simultaneously.

At last, we restore the last rows by reversing the copy process. ∎
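The copy-by-doubling step in this proof can be sketched as follows. Row 0 is assumed to hold the vector to be copied and all other rows start at zero; the function name and the 0-based indexing are our own illustrative choices. Each round uses distinct source and target rows, so it is a single parallel row elimination.

```python
def copy_by_doubling(vector, copies):
    """Fill `copies` rows with `vector`, doubling the count each round."""
    rows = [list(vector)] + [[0] * len(vector) for _ in range(copies - 1)]
    rounds = []
    have = 1  # rows 0 .. have-1 already hold a copy
    while have < copies:
        # Every existing copy feeds one fresh row, capped by what is still needed.
        step = [(src, have + src) for src in range(min(have, copies - have))]
        for src, dst in step:
            rows[dst] = [a ^ b for a, b in zip(rows[dst], rows[src])]
        rounds.append(step)
        have += len(step)
    return rows, rounds


rows, rounds = copy_by_doubling([1, 0, 1], 5)
assert all(row == [1, 0, 1] for row in rows)
assert len(rounds) == 3  # ceil(log2(5)) parallel steps
```

Running the recorded rounds in reverse order clears the copies again, which mirrors the restoring step at the end of the proof.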

In the following lemma, we use parallel row-elimination matrices to construct any given columns, which corresponds to the case .

Lemma 6.

There is an -time algorithm such that given , it outputs parallel row-elimination matrices where such that

Proof.

Rows of can be seen as a set of Boolean vectors of length . In the algorithm, we first synthesize an additive base for these vectors; then add them together to obtain . The main process is depicted in Figure 3.