1 Introduction
One of the most important tasks in quantum computing is quantum-circuit synthesis. Given an $n$-qubit unitary operator, synthesis algorithms aim to implement it as a sequence of low-level gates while optimizing the circuit size and depth [4, 13]. During the last decade, quantum synthesis algorithms have been developed to achieve asymptotically optimal size [35, 38, 22, 34]. To reduce the circuit depth, synthesis algorithms commonly use ancillae. For example, with sufficient ancillae, the Quantum Fourier Transform can be approximated by an $O(\log n)$-depth circuit [9], and stabilizer circuits can be parallelized to $O(\log n)$ depth [27]. However, near-term quantum devices only have a small number of qubits [30], which may seriously limit the number of available ancillae. This practical concern gives rise to the following fundamental space-depth tradeoff problem in quantum-circuit synthesis: Can we characterize the relationship between the number of ancillae and the optimal achievable depth?
Because the controlled-NOT gate (CNOT) together with single-qubit operations forms a universal set for quantum computing [4, 13], CNOT-circuit optimization has been widely studied. For circuit size, Patel, Markov, and Hayes [29] proved that every $n$-qubit CNOT circuit can be synthesized with $O(n^2/\log n)$ CNOT gates, and this bound is asymptotically tight. When topological constraints — i.e., there is limited two-qubit connectivity among the addressable qubits — are taken into consideration, synthesis algorithms have been designed by Kissinger and de Griend [21] and by Nash, Gheorghiu, and Mosca [28] to build circuits of size $O(n^2)$. For circuit depth, Moore and Nilsson [27] proved that given $O(n^2)$ ancillae, any $n$-qubit CNOT circuit can be parallelized into $O(\log n)$ depth. In addition, Aaronson and Gottesman [1] established a strong connection between CNOT circuits and stabilizer circuits. They proved that stabilizer circuits have a canonical form of 11 blocks, each consisting of only one type of gate from the set $\{\text{CNOT}, \text{Hadamard}, \text{Phase}\}$. Since each block of Phase or Hadamard gates has depth 1 and size at most $n$, optimization of CNOT circuits generalizes to stabilizer circuits.
Our main contribution: In this paper, we establish an asymptotically optimal spacedepth tradeoff in CNOTcircuit synthesis, as stated in the following theorem.
Theorem 1.
For any integer $m \ge 0$, any $n$-qubit CNOT circuit can be parallelized to $O\left(\max\left\{\log n,\; \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$ depth with $m$ ancillae. Moreover, there is an $O(n^\omega)$-time synthesis algorithm for achieving this (here $\omega$ is the matrix multiplication exponent [17]).
Theorem 1 can be readily extended to stabilizer circuits thanks to the wonderful result of Aaronson and Gottesman [1], stating that for any stabilizer circuit there exists an equivalent circuit that applies a block of Hadamard gates only (H), then a block of CNOT gates only (C), then a block of Phase gates only (P), and so on in the block sequence H-C-P-C-P-C-H-P-C-P-C. Since Hadamard and Phase gates are single-qubit gates and can be merged, each such block takes depth at most one and size at most $n$. Therefore, it suffices to optimize the CNOT blocks. Besides, Theorem 1 can be extended to CNOT+ circuits, as all of the additional gates can be moved to the end of the circuit [21]. We summarize these consequences in the following corollary.
Corollary 1.
For any integer $m \ge 0$, any $n$-qubit stabilizer circuit can be parallelized to $O\left(\max\left\{\log n,\; \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$ depth with $m$ ancillae. The same statement also holds for CNOT+ circuits.
Our result in Theorem 1 improves upon two previous results concerning parallel CNOT-circuit synthesis, with sufficient ancillae and without any ancillae, respectively. By parallelizing any CNOT circuit into an $O(\log n)$-depth equivalent circuit with $O(n^2/\log^2 n)$ ancillae, we reduce the number of ancillae needed by Moore and Nilsson [27] by a factor of $\log^2 n$. By achieving the asymptotically optimal depth bound of $O(n/\log n)$ in parallel CNOT-circuit synthesis without any ancillae, i.e., $m = 0$, we reduce the depth implied by the work of Patel, Markov, and Hayes [29] by a factor of $n$.
These improvements are theoretically significant, because we also prove — by a counting argument — that our space-depth tradeoff is asymptotically tight. The tightness of Theorem 1 is proved in a more general setting by the following theorem: even if arbitrary two-qubit quantum gates are allowed, rather than only CNOT gates, to approximately implement the given CNOT circuit, the construction still meets the lower bound. Roughly speaking, an approximate circuit outputs a quantum state close to the CNOT circuit's output in the $\ell_2$ norm.
Theorem 2.
For a $1 - o(1)$ fraction of $n$-qubit CNOT circuits, any $\epsilon$-approximating $m$-ancilla quantum circuit has depth $\Omega\left(\max\left\{\log n,\; \frac{n^2}{(n+m)\log(n+m)}\right\}\right)$, where $\epsilon$ is a constant.
Besides the depth, our construction has size $O(n^2/\log n)$. It is easy to generalize the technique in [3, 8] to show that such an $n$-qubit $m$-ancilla circuit must have size $\Omega(n^2/\log n)$. Thus our construction also meets the asymptotically optimal size.
Mathematically, our synthesis method for Theorem 1 is based on carefully designed Gaussian eliminations. As observed by Patel et al. [29], any $n$-qubit CNOT circuit can be represented by an invertible matrix $M \in \mathrm{GL}(n, \mathbb{F}_2)$, and synthesizing a CNOT circuit is equivalent to transforming $M$ to the identity by Gaussian eliminations. As the aim of this paper is to reduce the circuit depth, we use parallel Gaussian eliminations instead. We minimize the number of parallel Gaussian eliminations with the following two techniques:
For the case without any ancillae, we first establish that if the structure of the matrix is close to random, then it is amenable to effective parallel Gaussian elimination. We then use a popular idea from oblivious routing [20] to ensure the existence of a close-to-random structure. This randomization step is then derandomized by a standard, though somewhat delicate, conditional-expectation argument.

For the case with nonzero ancillae, we adopt the idea of the Method of Four Russians. Recall that in [29], Patel et al. use the Four Russians method to eliminate a block of columns at a time by Gaussian eliminations. In this work, we eliminate a much wider block of columns at a time by parallel Gaussian eliminations. This is done by preparing an additive basis for the Boolean vectors [23] and properly balancing the tradeoff between resource and cost.
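Patel et al.'s use of the Four Russians idea can be sketched as follows. This is an illustrative Python implementation over $\mathbb{F}_2$, not the paper's code; the section width `b` and the NumPy representation are our own choices. The point is that cancelling duplicate section patterns first leaves at most $2^b$ distinct nonzero patterns for the elimination step, which is how [29] bounds the total number of row additions for $b \approx \tfrac{1}{2}\log n$.

```python
import numpy as np

def pmh_lower_eliminate(M, b):
    """Eliminate below the diagonal over F2 in the style of Patel, Markov,
    and Hayes: process b columns (a "section") at a time, first cancelling
    duplicate b-bit section patterns with one row addition each, then doing
    ordinary elimination below the pivots.
    Returns (upper triangular matrix, number of row additions used)."""
    M = M.copy()
    n = M.shape[0]
    adds = 0
    for c0 in range(0, n, b):
        c1 = min(c0 + b, n)
        # 1) De-duplicate: a repeated nonzero section pattern is zeroed
        #    with a single addition of its first occurrence.
        first = {}
        for r in range(c0, n):
            pat = tuple(M[r, c0:c1])
            if any(pat):
                if pat in first:
                    M[r] ^= M[first[pat]]
                    adds += 1
                else:
                    first[pat] = r
        # 2) Standard elimination below the diagonal of the section.
        for c in range(c0, c1):
            p = next(r for r in range(c, n) if M[r, c])  # pivot exists: M invertible
            if p != c:
                M[[c, p]] = M[[p, c]]
            for r in range(c + 1, n):
                if M[r, c]:
                    M[r] ^= M[c]
                    adds += 1
    return M, adds
```

The output is upper triangular with a unit diagonal; a second, symmetric pass (or a transpose argument, as in [29]) finishes the reduction to the identity.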
Both our ancilla-based and ancilla-free synthesis algorithms for CNOT circuits rely on the 1-factorization of almost-regular bipartite graphs [10, 2] — a direct application of Hall's marriage theorem — to obtain an ideal ordering. We show that both algorithms run in time $O(n^\omega)$, where $\omega$ is the matrix multiplication exponent.
Our results have direct implications for matrix decomposition over finite fields. More precisely, we show that an arbitrary invertible matrix over $\mathbb{F}_2$ can be decomposed into $O(n/\log n)$ parallel row-elimination matrices. Our technique generalizes easily to the finite field $\mathbb{F}_q$ for constant $q$.
Theorem 3.
For any $M \in \mathrm{GL}(n, \mathbb{F}_q)$, where $q$ is a constant, $M$ can be transformed to the identity by $O(n/\log n)$ parallel Gaussian eliminations.
Our construction suggests that there might be a parallel Gaussian elimination algorithm which solves linear equations over $\mathbb{F}_2$ using $O(n/\log n)$ parallel row-elimination matrices, i.e., in $O(n/\log n)$ parallel time. Some related work can be found in [15, 33, 32, 11]. We leave this as an open problem for future research.
A related fundamental problem is to construct an equivalent circuit of exactly optimal depth for any given CNOT circuit; note that our parallel CNOT-circuit synthesis algorithm is optimal only in the asymptotic sense. Specifically, given any matrix $M$ and a pair of integers $(m, d)$, determine whether there exists an $n$-qubit $m$-ancilla CNOT circuit for $M$ with depth at most $d$. This decision problem is similar to the Minimum Circuit Size Problem (MCSP) — the famous problem which is unlikely to be proven in $\mathsf{P}$ or $\mathsf{NP}$-complete by natural proofs [19, 31]. In this paper, we provide what we consider to be relevant hardness evidence by proving hardness results for optimizing CNOT circuits in slightly different scenarios. In the first scenario, one aims to optimize the depth of a CNOT circuit under certain topological constraints [12, 14] with ancillae. In the second scenario, one aims to optimize a subcircuit of a CNOT circuit with ancillae. We briefly summarize the inapproximability results as Theorem 4; the formal statement is in Section 5.2.
Theorem 4 (Informal).
It is hard to approximate the solution of the following problems within any constant factor:

Global Constrained Minimization: Given an $n$-qubit CNOT circuit, an integer $m$, and topological constraints on the qubits, output an equivalent $n$-qubit $m$-ancilla CNOT circuit of minimum size or depth.

Local Size Minimization: Given an $n$-qubit CNOT circuit, an integer $m$, and a specified part of the circuit, output an equivalent $n$-qubit $m$-ancilla CNOT circuit which minimizes the size of the specified subcircuit.
At a high level, Global Constrained Minimization aims to find the optimal size or depth under certain topological constraints, where a CNOT gate with control $i$ and target $j$ is legal iff the pair $(i, j)$ is allowed by the constraints. This restriction is common in existing quantum devices [12, 14], and CNOT-circuit optimization on such devices has been discussed [28, 21]. Local Size Minimization aims to optimize the size of a selected part of the circuit while leaving the other parts unchanged. Hardness results for general quantum-circuit optimization under topological constraints can be found in [18, 7].
Organization of the paper. In Section 2, we review notation and basic definitions used in this paper. In Section 3, we present our parallel CNOT-circuit synthesis algorithm that uses no ancillae; in addition, we prove that any tree-based CNOT circuit can be parallelized to $O(\log n)$ depth without any ancilla. In Section 4, we present our ancilla-based synthesis algorithm and complete the proof of Theorem 1. In Section 5, we give the lower bound and the related hardness results. Finally, in Section 6, we summarize the paper and present some open problems.
2 Preliminaries
Basic Notations: We use $\tilde{O}(\cdot)$ to hide polylogarithmic terms, $[n]$ to denote $\{1, 2, \ldots, n\}$, $\mathbb{C}$ to denote the complex numbers, $\mathbb{F}_q$ to denote the field with $q$ elements, $\oplus$ to denote addition over $\mathbb{F}_2$, $\mathrm{GL}(n, \mathbb{F}_2)$ to denote the set of $n \times n$ invertible matrices with entries from $\mathbb{F}_2$, the superscript $T$ to denote the transpose of a matrix or vector, $I$ to denote the identity matrix (its subscript is omitted if the dimension is clear from context), $E_{i,j}$ to denote the all-zero matrix except that the $(i, j)$ entry equals $1$, and $M_{i,j}$, $M_{i,:}$, $M_{:,j}$ to denote, respectively, the $(i, j)$ entry, the $i$th row, and the $j$th column of a matrix $M$.

CNOT Gate and Circuits: A CNOT gate maps the Boolean pair $(x, y)$ to $(x, x \oplus y)$. Because it is an invertible linear map over $\mathbb{F}_2$, any $n$-qubit CNOT circuit can be viewed as an invertible linear map over $\mathbb{F}_2^n$, represented as an invertible matrix in $\mathrm{GL}(n, \mathbb{F}_2)$.
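The correspondence between CNOT circuits and invertible linear maps can be checked concretely. The following Python sketch is our own illustration (not the paper's code): it builds the $\mathbb{F}_2$ matrix of a gate list and confirms it agrees with direct bitwise simulation.

```python
import numpy as np

def cnot_matrix(n, control, target):
    """F2 matrix of one CNOT gate: identity plus a 1 in row `target`,
    column `control` (the gate maps x_target to x_target XOR x_control)."""
    E = np.eye(n, dtype=np.uint8)
    E[target, control] = 1
    return E

def circuit_matrix(n, gates):
    """Product of the gate matrices over F2; gates are applied left to right."""
    M = np.eye(n, dtype=np.uint8)
    for c, t in gates:
        M = cnot_matrix(n, c, t) @ M % 2
    return M

def run_circuit(bits, gates):
    """Direct classical simulation of the CNOT circuit on a bit list."""
    bits = list(bits)
    for c, t in gates:
        bits[t] ^= bits[c]
    return bits
```

For any gate list, multiplying the composed matrix by an input vector (mod 2) reproduces the simulated output, which is exactly the linear-map view used throughout the paper.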
Ancilla-Based CNOT Circuits: An $n$-qubit $m$-ancilla CNOT circuit has $m$ ancillae with initial assignment $0^m$, and satisfies the following key property: after the evaluation of the circuit, all ancillae are restored to $0^m$, regardless of the input on the $n$ computational qubits. An $n$-qubit $m$-ancilla CNOT circuit implements an invertible matrix $M$ if for any input $x \in \mathbb{F}_2^n$, the output of the circuit on $(x, 0^m)$ is $(Mx, 0^m)$. In other words, the matrix representation of the whole circuit is $\begin{pmatrix} M & B \\ 0 & D \end{pmatrix}$ for some $B$ and invertible $D$. Since $B$ and $D$ do not interfere with the output when the ancillae are initialized to $0^m$, we abbreviate them as $*$. In the remainder of this paper, we say such a circuit is an $n$-qubit $m$-ancilla CNOT circuit for $M$.
Equivalent Ancilla-Based CNOT Circuits: We say an $n$-qubit $m_1$-ancilla CNOT circuit $C_1$ is equivalent to an $n$-qubit $m_2$-ancilla CNOT circuit $C_2$ if for any input $x \in \mathbb{F}_2^n$, the outputs on the first $n$ qubits are the same for $C_1$ and $C_2$.
Row-Elimination Matrices: Mathematically, a CNOT gate with control qubit $i$ and target qubit $j$ can be represented as a row elimination from $i$ to $j$ (i.e., adding the $i$th row to the $j$th row). Thus, a CNOT circuit can be viewed as the product of a sequence of row-elimination matrices.
Definition 1 (Row-Elimination Matrix).
We say a matrix $R$ is a row-elimination matrix if $R = I$ or there exist indices $i \neq j$ such that $R = I + E_{j,i}$.
Note that for any row-elimination matrix $R$, we have $R^2 = I$ (hence $R^{-1} = R$). We use $R_{i \to j}$ to denote $I + E_{j,i}$. A row-elimination matrix represents exactly one single step in the process of Gaussian elimination: for any matrix, left-multiplication by $R_{i \to j}$ adds the $i$th row to the $j$th row.
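A small NumPy sketch (our own illustration) of a row-elimination matrix and its involution property:

```python
import numpy as np

def row_elim(n, i, j):
    """Row-elimination matrix over F2: identity except entry (j, i) = 1.
    Left-multiplying by it adds row i to row j (mod 2)."""
    R = np.eye(n, dtype=np.uint8)
    R[j, i] = 1
    return R

R = row_elim(4, 0, 2)
# Involution: a CNOT applied twice is the identity, so R is its own inverse.
assert np.array_equal(R @ R % 2, np.eye(4, dtype=np.uint8))

M = np.array([[1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=np.uint8)
M2 = R @ M % 2
assert np.array_equal(M2[2], (M[2] + M[0]) % 2)  # row 0 added to row 2
assert np.array_equal(M2[1], M[1])               # other rows untouched
```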
Parallel RowElimination Matrices: A basic concept in parallel CNOTcircuit synthesis is parallel row elimination (or equivalently parallel Gaussian elimination).
Definition 2 (Parallel Row-Elimination Matrix).
We say a matrix $R$ is a parallel row-elimination matrix if $R = I$ or there exist pairs $(i_1, j_1), \ldots, (i_t, j_t)$ such that $i_1, j_1, \ldots, i_t, j_t$ are distinct indices and $R = I + \sum_{k=1}^{t} E_{j_k, i_k}$.
A parallel row-elimination matrix represents several independent steps of Gaussian elimination. Since all the indices are distinct, there is no need to fix a particular order. When the dimension is clear from context (as in Section 3 and Section 4), we use a single symbol to denote a parallel row-elimination matrix, and a subscripted sequence to denote a sequence of parallel row-elimination matrices.
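A sketch of this definition (ours, in NumPy over $\mathbb{F}_2$): because the index pairs are disjoint, the individual eliminations commute, and their product in any order equals the single parallel matrix, which is why one such matrix corresponds to one CNOT layer.

```python
import numpy as np

def parallel_row_elim(n, pairs):
    """Parallel row-elimination matrix over F2: identity plus a 1 at (j, i)
    for every pair (i, j); all indices across the pairs must be distinct,
    so the corresponding CNOTs touch disjoint qubits and form one layer."""
    idx = [k for pair in pairs for k in pair]
    assert len(idx) == len(set(idx)), "pairs must use distinct indices"
    R = np.eye(n, dtype=np.uint8)
    for i, j in pairs:
        R[j, i] = 1
    return R

# Disjointness makes the order irrelevant: the product of the individual
# eliminations equals the single parallel matrix.
pairs = [(0, 1), (2, 3), (4, 5)]
P = parallel_row_elim(6, pairs)
prod = np.eye(6, dtype=np.uint8)
for i, j in pairs:
    prod = parallel_row_elim(6, [(i, j)]) @ prod % 2
assert np.array_equal(P, prod)
```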
Quantum Approximation: In this paper, without loss of generality, we only consider quantum circuits consisting of single-qubit and two-qubit gates. In Section 5, we will use the following definitions of quantum-circuit approximation.
Definition 3 ($\epsilon$-close).
For any $\epsilon > 0$, two vectors $u$ and $v$ are said to be $\epsilon$-close iff $\|u - v\|_2 \le \epsilon$.
Definition 4 ($\epsilon$-approximate).
Given an $n$-qubit quantum circuit $C$ and an $n$-qubit $m$-ancilla quantum circuit $C'$, we say $C'$ $\epsilon$-approximates $C$ if for any state $|\phi\rangle$,

$C'$ maps $|\phi\rangle|0^m\rangle$ to $|\psi\rangle|0^m\rangle$ for some state $|\psi\rangle$,

and $|\psi\rangle$ and $C|\phi\rangle$ are $\epsilon$-close.
3 Parallelizing CNOT-Circuit Synthesis Without Ancillae
We divide the proof of Theorem 1 into two parts. In this section, we prove the first part, which covers the ancilla-free case; we address the remaining case in Section 4. In fact, it suffices to prove the following Theorem 5 here, as it implies an $O(n^\omega)$-time algorithm to parallelize any $n$-qubit CNOT circuit to $O(n/\log n)$ depth without ancillae.
Theorem 5 (AncillaFree Parallel CNOT Synthesis).
There is an $O(n^\omega)$-time algorithm to parallelize any $n$-qubit CNOT circuit to $O(n/\log n)$ depth without ancillae.
Thanks to the connection between CNOT circuits and invertible Boolean matrices, we can reformulate Theorem 5 as follows:
Lemma 1 (Theorem 5 Reformulated).
There is an $O(n^\omega)$-time algorithm such that, given any $M \in \mathrm{GL}(n, \mathbb{F}_2)$, it outputs $k = O(n/\log n)$ parallel row-elimination matrices $R_1, \ldots, R_k$ such that $R_k \cdots R_1 M = I$.
Proof.
By Bunch and Hopcroft [6], we can factorize, in time $O(n^\omega)$, $M = PLU$, where $P$ is a permutation matrix and $L$ and $U$ are, respectively, lower and upper triangular matrices. Besides, it follows from a result of Moore and Nilsson [27] that any permutation matrix can be decomposed into six parallel row-elimination matrices. Lemma 1 then follows from the claim below, as lower triangular matrices can be handled similarly. ∎
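For concreteness, a PLU factorization over $\mathbb{F}_2$ can be sketched as follows. This is an illustrative cubic-time version in Python (the Bunch-Hopcroft algorithm cited above achieves $O(n^\omega)$ time; nothing here is the paper's code).

```python
import numpy as np

def plu_f2(M):
    """Factor an invertible 0/1 matrix as M = P @ L @ U over F2, with P a
    permutation matrix, L unit lower triangular, U upper triangular.
    Plain partial-pivoting elimination (illustrative, not O(n^omega))."""
    A = M.copy() % 2
    n = A.shape[0]
    perm = list(range(n))
    L = np.eye(n, dtype=np.uint8)
    for c in range(n):
        p = next(r for r in range(c, n) if A[r, c])  # pivot exists: M invertible
        if p != c:
            A[[c, p]] = A[[p, c]]
            L[[c, p], :c] = L[[p, c], :c]            # swap the settled part of L
            perm[c], perm[p] = perm[p], perm[c]
        for r in range(c + 1, n):
            if A[r, c]:
                L[r, c] = 1
                A[r] ^= A[c]
    P = np.zeros((n, n), dtype=np.uint8)
    for col, row in enumerate(perm):
        P[row, col] = 1
    return P, L, A  # A is now the upper triangular factor U
```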
Claim 1.
Lemma 1 holds for any upper triangular $M \in \mathrm{GL}(n, \mathbb{F}_2)$.
Proof.
Our algorithm applies a divide-and-conquer scheme. As in standard analyses of divide-and-conquer methods, we assume that $n$ is sufficiently large (the details will become clear later in the proof). For simplicity, we first describe a randomized algorithm; we then derandomize it using Lemma 2. The synthesis process is shown in Figure 1 and has five main steps.
Step 1 (Recursion): Denote as , where is of size . After simultaneous recursive parallel row elimination on , the upper-right part is , which can be computed in advance, independently from the recursion, in time [17, 36].
Step 2 (Find Random Lay-By): Divide as , where each is in . Our key observation here is: if is "close to random" (to be formally defined in Lemma 2), then it can be eliminated efficiently. To ensure the needed degree of randomness in our matrix structures, we use a classical idea from oblivious routing [20]: we generate a random matrix of the same size for each as its lay-by and define . Note that, although they are correlated, both and by themselves are "random" matrices with entries from .
Step 3 (Generate Row-Traversal Sequence): We say a matrix sequence is row-traversal if

,

for any , sequence visits all vectors in .
Let , and apply Lemma 3 to obtain a row-traversal sequence with .
Step 4 (First Traverse): In this step, we will add ’s to ’s and then get ’s.
View the bottom-right as identity matrices of the same size and name them .
Let . For each time stamp , all 's simultaneously go from to using the original Gaussian elimination algorithm; then

for all , find “large” set such that any satisfies

(recall that entries of are from ),

was not selected in previous (i.e., in previous time stamps and previous repetitions),

for any other , holds;


for all and , add row of to row of as one parallel row elimination;

repeat the two steps above until all .
The details of how to construct these sets will be justified later.
Step 5 (Second Traverse): Now in the upper-right part, all 's have reached the pre-decided lay-bys 's. In this step, we do another round of traversal similar to Step 4; the only difference is that we use when constructing . Thus, we add 's to 's as in Step 4, and the upper-right square finally becomes zero.
Now we explain the construction of in Step 4. For fixed and , although is described as being found repeatedly in Step 4 for ease of exposition, it is actually implemented in a single shot. We justify this, as well as its efficiency, below; the random plays an essential role.
When is random, any vector in appears about times in every row and column of with high probability. We then enumerate all such that and view them as the edges of a bipartite graph. Thus any valid is a matching in this graph, and the iterated construction is equivalent to a matching decomposition. Since every vertex has degree about , the bipartite graph can be factorized into about matchings in linear time (hiding polylogarithmic terms) [10, 2]. Hence Step 4 uses about parallel row-elimination matrices per time stamp. A similar analysis holds for Step 5, and we derandomize the choice of in Lemma 2.
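The matching-decomposition step can be illustrated with a small sketch of our own (using Kuhn's augmenting-path algorithm for clarity, rather than the near-linear-time 1-factorization algorithms of [10, 2]): repeatedly peel off a maximum matching. For a $d$-regular bipartite graph, Hall's theorem guarantees that each round removes a perfect matching, so exactly $d$ rounds suffice.

```python
def max_matching(edges, n_left, n_right):
    """Maximum bipartite matching by augmenting paths (Kuhn's algorithm).
    `edges` is a list of (left, right) pairs; returns matched edge indices."""
    adj = [[] for _ in range(n_left)]
    for k, (u, v) in enumerate(edges):
        adj[u].append((v, k))
    match_r = [None] * n_right  # index of the matched edge at each right vertex

    def augment(u, seen):
        for v, k in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_r[v] is None or augment(edges[match_r[v]][0], seen):
                match_r[v] = k
                return True
        return False

    for u in range(n_left):
        augment(u, set())
    return [k for k in match_r if k is not None]

def matching_decomposition(edges, n_left, n_right):
    """Peel off maximum matchings until no edge remains.  For a d-regular
    bipartite graph each round removes a perfect matching, giving d rounds."""
    remaining = list(range(len(edges)))
    rounds = []
    while remaining:
        sub = [edges[k] for k in remaining]
        picked = max_matching(sub, n_left, n_right)
        rounds.append([remaining[k] for k in picked])
        chosen = set(picked)
        remaining = [remaining[k] for k in range(len(remaining)) if k not in chosen]
    return rounds
```

In the synthesis above, each extracted matching corresponds to one parallel row-elimination matrix.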
Thus the maximum number of parallel row-elimination matrices, denoted , satisfies the following recursion
Using Lemma 2 and Lemma 3, the running time, denoted , satisfies the following recursion
∎
Now we give two essential components in the proof of Claim 1. Lemma 2 addresses the crucial property that must have to make the matching decomposition and parallel row elimination efficient.
Note that the proof of existence in Lemma 2 can be obtained easily by direct application of Chernoff’s bound, but in that case it would be hard to derandomize in time .
Lemma 2.
There is an time algorithm such that for any sufficiently large , given a matrix with entries from , it outputs a matrix of the same format satisfying: for any , it appears at most times in any row or column of .
Proof.
Pick entries of bit by bit uniformly at random. Let and set with foresight.
In the following, we prove by induction on the number of determined bits that no appears as prefix more than times in any row or column of .
Assume the first bits are determined; now we randomly pick the th bit from . For any , define four 0/1 bad-event indicators:

() iff appears as prefix more than times in (),

() iff appears as prefix more than times in ().
Then the expectation of the number of bad events is
(Chernoff’s bound) 
where and denotes the round Bernoulli trial with probability .
Thus, there exists an assignment of the th bit such that no appears as prefix more than in any row or column of as claimed.
At last, the desired property follows from
We use to denote the relation that two vectors share the same first bits. Let the undetermined bit in entries of be .
Now we derandomize the choice of the th bit of by the method of conditional expectation for some fixed . Let the first bits of be respectively, and

,

,
where .
Let and the th bit of be . Suppose we pick as the th bit of , and define , then the expectation of bad events decreases by
where
Then we choose the value that decreases the expectation most; this decrease must be nonnegative.
To speed up the selection, we preprocess and truncate to the most significant bits. Then even if the best choice increases the expectation, the fluctuation is small and accumulates into an insignificant term. ∎
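The method of conditional expectations itself can be illustrated on a smaller classical example, our own (MAX-SAT, rather than the prefix-counting estimator used in the proof above): fix the variables one at a time, always choosing the branch whose exact conditional expectation of satisfied clauses is at least the current one. Since the two branches average to the current expectation, the maximum never drops below it.

```python
from fractions import Fraction

def conditional_expectation_maxsat(n_vars, clauses):
    """Derandomize the uniformly random assignment for MAX-SAT by the
    method of conditional expectations.  A clause is a list of literals;
    literal +i (-i) means variable i is true (false).  Returns an
    assignment satisfying at least the expected number of clauses."""
    def expected(assign):
        # Exact conditional expectation of satisfied clauses, with the
        # variables in `assign` fixed and the rest uniformly random.
        total = Fraction(0)
        for clause in clauses:
            p_unsat = Fraction(1)
            for lit in clause:
                var, want = abs(lit), lit > 0
                if var in assign:
                    if assign[var] == want:  # clause already satisfied
                        p_unsat = Fraction(0)
                        break
                else:
                    p_unsat /= 2  # an unset literal fails with probability 1/2
            total += 1 - p_unsat
        return total

    assign = {}
    base = expected(assign)
    for var in range(1, n_vars + 1):
        # The two branches average to the current expectation, so picking
        # the larger one never lets the conditional expectation decrease.
        e_true = expected({**assign, var: True})
        e_false = expected({**assign, var: False})
        assign[var] = e_true >= e_false
    return assign, base
```

The same greedy bit-fixing scheme, with the bad-event estimator in place of the clause count, underlies the derandomization in Lemma 2.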
Lemma 3 presents a simple way to construct the row-traversal sequence. Though its length can be further improved, the asymptotic order is already tight and sufficient for our purpose, so we do not pursue the optimal parameters.
Lemma 3.
There is an time algorithm to generate a row-traversal sequence of length .
Proof.
Let and compute the rank of matrix over . Observe that for any ,
since for any ,
Also, any row of traverses all vectors in as goes through . Thus the output of Algorithm 1 gives the desired sequence.
∎
Proof of Theorem 3.
The proof is almost identical except that $\mathbb{F}_2$ is replaced with $\mathbb{F}_q$. Another difference is that the length of the row-traversal sequence in Lemma 3 changes accordingly. ∎
By the lemmas and theorems above, we have parallelized any $n$-qubit CNOT circuit to $O(n/\log n)$ depth without ancillae. A fundamental problem in parallel CNOT-circuit synthesis, when no ancillae are given, is to characterize the impact of circuits' topological structures on the size-depth tradeoff. Unlike in the asymptotic space-depth tradeoff, where CNOT circuits are essentially compressed into an invertible matrix, here the circuit details are part of the input to the synthesis algorithm.
While this problem remains an ongoing research subject, in the following we use a basic family of CNOT circuits to illustrate that the topological details of CNOT circuits can be used effectively. This family of CNOT circuits has tree structures: given a proper binary tree with leaves, in which each leaf has a unique label from and each internal node has a label from , we can define an $n$-qubit CNOT circuit (with variables ) as follows.
We use a post-order traversal to define the CNOT circuit by first defining, for each node in the tree, its qubit index (and the gate it describes if the node is internal):

For a leaf , is its label in ;

for an internal node with label and children , then , and ;

for an internal node with label and children , then and .
Suppose the post-order-traversal projection of the internal nodes of is . Then,
An example of CNOT trees can be seen in Figure 2.
The following theorem gives an equivalent $O(\log n)$-depth CNOT circuit for any CNOT-tree circuit.
Theorem 6 (Parallel Synthesis of CNOT Trees).
For any proper binary tree with $n$ leaves, the corresponding $n$-qubit CNOT circuit can be parallelized to $O(\log n)$ depth without ancillae.
Theorem 6 can be obtained by applying Miller and Reif's parallel tree-contraction technique [24, 26, 25]. See Appendix C for the proof. Theorem 6 generalizes to the following corollary.
Corollary 2.
If an qubit CNOT circuit can be expressed as the product of CNOT trees, then it can be parallelized into a CNOT circuit with depth without ancillae.
4 Parallelizing CNOT Circuits with Ancillae
In this section, we prove Theorem 1 for the remaining part. For larger , the bound in Theorem 1 is always , so it suffices to consider the stated range. We restate Theorem 1 for this case as follows.
Theorem 7.
There is an $O(n^\omega)$-time algorithm to parallelize any $n$-qubit CNOT circuit into $O\!\left(\frac{n^2}{(n+m)\log(n+m)}\right)$ depth with $m$ ancillae.
We use a standard technique in reversible computation to simplify the problem. Given arbitrary , Theorem 7 aims to construct a CNOT circuit for with ancillae. We first construct two qubit ancilla CNOT circuits for respectively, i.e., for any , ,
Starting with and applying , where takes the second bits as control and the first bits as target, we get
Then we permute the first and second registers of qubits, which can be done in constant depth by [27], to get the final circuit.
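The reversible-computation trick above can be sketched numerically (our own illustration; `add_block` models a CNOT block whose control register is left unchanged and whose target register receives a matrix-vector product over $\mathbb{F}_2$):

```python
import numpy as np

def add_block(A, controls, targets):
    """One CNOT block: targets += A @ controls over F2 (controls unchanged)."""
    return controls, (targets + A @ controls) % 2

def compute_via_ancilla_register(M, Minv, x):
    """Reversible-computation trick: with a zero register y,
    (x, 0) -> (x, Mx) -> (x + Minv(Mx), Mx) = (0, Mx), then swap registers."""
    y = np.zeros_like(x)
    x, y = add_block(M, x, y)      # y = M x
    y, x = add_block(Minv, y, x)   # x = x + Minv M x = 0  (mod 2)
    return y, x                    # registers swapped: (M x, 0)
```

Both `add_block` calls only ever add one register into the other, which is exactly what a block of CNOT gates can implement; the final register swap is the constant-depth permutation mentioned above.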
Based on the observation above, as well as the equivalence between CNOT circuits and invertible Boolean matrices, to prove Theorem 7 it suffices to construct a circuit for the block matrix as stated in Lemma 4.
By Lemma 4, the time complexity to construct is . On the other hand, it takes time to compute the required inverse by [36, 17]; thus the overall time cost for Theorem 7 is .
Lemma 4.
There is an time algorithm such that, given , it outputs parallel row-elimination matrices such that
We defer the detailed proof of Lemma 4 to the end of the section and first prove several key lemmas. The key point is that we can construct columns of the target using parallel Gaussian eliminations with the help of the last rows. Then we simply construct the columns of group by group, sequentially.
We begin our proof with the base case, Lemma 6, which calls Lemma 5 as a subprocedure to construct a slim sparse matrix.
Lemma 5.
There is an time algorithm that, given a matrix which has at most ones, outputs where such that
Proof.
For simplicity, we write for the vector whose entries are except for the th. Let ; then .
Now we describe how to make copies of on rows by parallel row eliminations. Let the rows be . We add the first row (the original ) to the th row; then double the number of copies by adding the first and th rows to further rows simultaneously; and keep doubling till the number reaches .
Since , we can make copies independently for all on the last rows with parallel row-elimination matrices.
Then, we construct on the middle rows . For , add the corresponding copies to the th row as needed, which takes at most row additions. Since there are sufficiently many copies, all rows are handled simultaneously.
At last, we restore the last rows by reversing the copy process. ∎
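The copy-doubling process in the proof can be sketched as follows (an illustration, not the paper's code): each round is one parallel row elimination because its sources and targets are disjoint, and the number of copies doubles per round, so only logarithmically many rounds are needed.

```python
import numpy as np

def fanout_copies(rows, src, targets):
    """Copy row `src` into every row of `targets` (assumed all-zero) by row
    additions.  Each round doubles the number of available copies, so only
    ceil(log2(len(targets) + 1)) parallel row-elimination layers are used."""
    have = [src]          # rows currently holding a copy
    todo = list(targets)
    rounds = []
    while todo:
        batch = []
        for h in have[:len(todo)]:  # every existing copy feeds one new target
            t = todo.pop(0)
            batch.append((h, t))
        for h, t in batch:          # disjoint sources/targets: one parallel layer
            rows[t] ^= rows[h]
        have.extend(t for _, t in batch)
        rounds.append(batch)
    return rounds
```

Reversing the recorded rounds restores the copy rows to zero, exactly as the last step of the proof requires.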
In the following lemma, we use parallel row-elimination matrices to construct any given columns, which corresponds to the case .
Lemma 6.
There is an time algorithm such that, given , it outputs parallel row-elimination matrices such that
Proof.
The rows of can be seen as a set of Boolean vectors of length . In the algorithm, we first synthesize an additive basis for these vectors, then add basis vectors together to obtain . The main process is depicted in Figure 3.