On the qubit routing problem

We introduce a new architecture-agnostic methodology for mapping abstract quantum circuits to realistic quantum computing devices with restricted qubit connectivity, as implemented by Cambridge Quantum Computing's tket compiler. We present empirical results showing the effectiveness of this method in terms of reducing two-qubit gate depth and two-qubit gate count, compared to other implementations.




I Introduction

There is a significant gap between the theoretical literature on quantum algorithms and the way that quantum computers are implemented. The simple and popular quantum circuit model presents the quantum computer as a finite number of qubits upon which quantum gates act; see Fig. 1 for an example. Typically gates act on one or two qubits at a time, and the circuit model allows multi-qubit gates to act on any qubits without restriction. However, in realistic hardware the qubits are typically laid out in a fixed two or three dimensional topology where gates may only be applied between neighbouring qubits. In order for a circuit to be executed on such hardware, it must be modified to ensure that whenever two qubits are required to interact, they are adjacent in memory. This is a serious departure from the von Neumann architecture of classical computers, where operations may involve data at distant locations in memory without penalty.

We refer to the task of modifying a circuit to conform to the memory layout of a specific quantum computer as the qubit routing problem. When non-adjacent qubits are required to interact we can insert additional SWAP gates to exchange a qubit with a neighbour, moving it closer to its desired partner. In general many, or even all, of the qubits may need to be swapped, making this problem non-trivial. Since quantum algorithms are usually designed without reference to the connectivity constraints of any particular hardware, a solution to the routing problem is required before a quantum circuit can be executed. Therefore qubit routing forms a necessary stage of any compiler for quantum software. Current quantum computers, the so-called NISQ devices (“noisy intermediate-scale quantum” devices; see preskill2018quantum for a survey), impose additional constraints. Their short coherence times and relatively low fidelity gates require that the circuit depth and the total number of gates are both as low as possible. As routing generally introduces extra gates into a circuit, increasing its size and depth, it is crucial that the circuit does not grow too much, or its performance will be compromised.

The general case of the routing problem, also called the qubit allocation problem, is known to be computationally intractable. The sub-problem of assigning logical qubits to physical ones is equivalent to sub-graph isomorphism siraichi2018qubit , while determining the optimal swaps between assignments is equivalent to token-swapping 10.1007/978-3-319-07890-8_31 which is at least NP-hard Bonnet2018 and possibly PSPACE-complete JERRUM1985265 . Siraichi et al. siraichi2018qubit propose an exact dynamic programming method (with complexity exponential in the number of qubits) and a heuristic method which approximates it well, at least on the small (5 qubit) circuits considered. Zulehner et al. Zulehner:2017aa propose an algorithm based on depth partitioning and A* search which is specialised to the architectures of IBM devices ibm_doc_tokyo . Other approaches take advantage of the restricted topology typically found in quantum memories, such as linear nearest neighbour hirata:2011:linear or hypercubic brierley2015efficient , and rely on classical sorting networks; see Appendix A for a discussion of this approach. Lower bound results for routing are presented by Herbert herbert2018depth .

In this paper we describe the solution to the routing problem implemented in tket, a platform-independent compiler developed by Cambridge Quantum Computing Ltd (tket is available as a python module from https://pypi.org/project/pytket/). The heuristic method in tket matches or beats the results of other circuit mapping systems in terms of depth and total gate count of the compiled circuit, and has much reduced run time, allowing larger circuits to be routed.

Aside from qubit routing, tket also provides translation from general circuits to any particular hardware-supported gate set, a variety of advanced circuit optimisation routines, and support for most of the major quantum software frameworks. These will be described in future papers. Compilation through tket guarantees hardware compatibility and minimises the depth and gate count of the final circuit across a range of hardware and software platforms.

In Section II we formalise the problem and present an example instance. In Section III we describe the method used by tket to solve the problem. In Section IV we describe some of the architectures on which we tested the algorithm and in Section V we present empirical results of tket’s performance, both in terms of scaling and in comparison to other compiler software. Full tables of results are provided in the Appendix.

II The routing problem


Figure 1: Example of a quantum circuit containing one and two-qubit gates acting on four qubits. This circuit has five time steps, each with gates acting on disjoint sets of qubits.

We represent a quantum computer as a graph where nodes are physical qubits and edges are the allowed 2-qubit interactions (we don’t consider architectures with multi-qubit interactions involving more than two qubits). Since the circuit model assumes we can realise a two-qubit gate between any pair of qubits, it is equivalent to the complete graph (Fig. 2a). Realistic qubit architectures are connectivity limited: for instance, in most superconducting platforms the qubit interaction graph must be planar; ion traps present more flexibility, but are still not fully connected. For now we will use the ring graph (Fig. 2b) as a simple example. Given such a restricted graph, our goal is to emulate the complete graph with minimal additional cost.


Figure 2: Nodes in the graph represent physical qubits and edges are the allowed interactions. (a) The circuit model assumes all-to-all communication between qubits, i.e. a complete graph and (b) a physically realistic one-dimensional nearest neighbour cyclic graph, the ring.

From this point of view, the routing problem can be stated as follows. Given (i) an arbitrary quantum circuit and (ii) a connected graph specifying the allowed qubit interactions, we must produce a new quantum circuit which is equivalent to the input circuit, but uses only those interactions permitted by the specification graph. Provided the device has at least as many qubits as the input circuit then a solution always exists; our objective is to minimise the size of the output circuit.
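The requirement that the output circuit uses only permitted interactions can be checked mechanically. The following is a minimal sketch, assuming a hypothetical circuit representation (a list of tuples of node indices, one tuple per gate) and an undirected architecture given as a list of edges; neither representation is taken from the compiler described here.

```python
def conforms(circuit, edges):
    """Return True if every 2-qubit gate acts on an edge of the architecture.

    circuit: list of tuples of node indices, one tuple per gate (hypothetical format).
    edges:   list of undirected edges (u, v) of the architecture graph.
    """
    allowed = {frozenset(e) for e in edges}
    for gate in circuit:
        # Single-qubit gates are always allowed; only 2-qubit gates are constrained.
        if len(gate) == 2 and frozenset(gate) not in allowed:
            return False
    return True


# A ring of four nodes: 0-1-2-3-0.
ring = [(i, (i + 1) % 4) for i in range(4)]
routed = [(0, 1), (2,), (1, 2), (3, 0)]   # every 2-qubit gate lies on a ring edge
unrouted = [(0, 1), (0, 2)]               # nodes 0 and 2 are not adjacent on the ring
```

Here `conforms(routed, ring)` holds while `conforms(unrouted, ring)` does not, so the second circuit would need routing before it could run on the ring.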

II.1 Example: routing on a ring

Let’s consider the problem of routing the circuit shown in Fig. 1 on the ring graph of Fig. 2(b). The first step is to divide the circuit into timesteps, also called slices. Loosely speaking, a timestep consists of a subcircuit where the gates act on disjoint sets of qubits and could in principle all be performed simultaneously (see Section III.1 for a precise definition). The single qubit gates have no bearing on the routing problem so can be ignored, and thus a timestep can be reduced to a set of qubit pairs that are required to interact via some 2-qubit gate.

Next, the logical qubits of the circuit must be mapped to the nodes of the graph. For our example a reasonable initial mapping is shown in Fig. 3. This has the advantage that the qubits which interact in the first timestep are adjacent in the graph, and the same for the second timestep.


Figure 3: An initial mapping of logical qubits to nodes. Highlighted nodes are labelled with the mapped qubit.

However at the third timestep our luck has run out: the CNOT gate required in this timestep is not possible in the current configuration, because its two qubits are not adjacent. We must add SWAP gates to exchange logical qubits to enable the desired two-qubit interactions. In the example there are two candidates: swapping nodes 1 and 3, or swapping nodes 2 and 3, yielding the configurations shown in Fig. 4. Looking ahead to the final slice (slice 4 has no 2-qubit gates so can be ignored) we see that a further pair of qubits will need to interact. In configuration (a) these qubits are distance 3 apart, and hence two additional swaps will be needed to bring them together. In configuration (b) however they are already adjacent. As we want to minimise the number of additional elements in our circuit we choose to swap nodes 2 and 3 to yield the final circuit shown in Fig. 5.

Figure 4: (a) Qubit mapping to nodes if nodes 1 and 3 swap positions. (b) Qubit mapping to nodes if nodes 2 and 3 swap positions.


Figure 5: Quantum circuit in Fig. 1 mapped to architecture graph of Fig. 2b.

While this was a tiny example we can see in microcosm all the key elements of the problem: the need to find a mapping of qubits to nodes; the notion of distance between qubits at the next timestep; and the need to compute the permutation of the nodes to enable the next timestep. It should be clear even from this small example that as the size of the circuit increases the number of candidate swaps increases dramatically. Further, if we have to swap several pairs of qubits at the same time, improving the situation for one pair may worsen the situation for another pair. There is a clear trade-off to be made in order to bring all the pairs together as soon as possible.

In the worst case $O(n^2)$ swaps suffice to get from any $n$-node configuration to any other 10.1007/978-3-319-07890-8_31 , although for sufficiently regular graphs much better is possible brierley2015efficient . A recent lower bound result states that the minimum number of swaps is $\Omega(\log n)$ in the worst case herbert2018depth , which is achieved by the cyclic butterfly network brierley2015efficient .

Our goal is to optimise the circuit globally so finding optimal mappings between timesteps is not sufficient. It is necessary to evaluate candidate mappings across multiple timesteps; this is the core of the routing algorithm.

II.2 SWAP synthesis and routing

In the preceding section we described the routing problem in terms of inserting SWAP gates into the circuit. However not all device technologies offer SWAP as a primitive operation. Superconducting devices, for example, typically offer a single 2-qubit interaction from which all other gates, including the SWAP, must be constructed. As a further complication, these interactions may be asymmetric. For example, in some IBM devices ibm_doc_tokyo , the 2-qubit interaction is a CNOT where one qubit is always the control and the other always the target. The graph representing the machine is therefore directed, as shown in Fig. 6, where the direction indicates the orientation of the gate.


Figure 6: Architecture with one-way connectivity constraint.

This complication is easily removed by the usual trick of inserting Hadamard gates, as in Fig. 7. Hence the SWAP gate can be implemented by three (unidirectional) CNOTs and four Hadamards, as in Fig. 8.


Figure 7: Inverting a CNOT gate for a directed graph.


Figure 8: Representation of a SWAP gate in terms of three consecutive CNOT gates and its inverted representation for a directed graph.
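The two identities behind Figs. 7 and 8 can be verified numerically. The sketch below uses plain nested lists and an unnormalised Hadamard matrix so that all arithmetic stays in integers: four Hadamards contribute a factor $(1/\sqrt{2})^4 = 1/4$, so each product is compared with 4 times the expected gate. The basis ordering (qubit 0 as the most significant bit) and all helper names are choices made for this illustration.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def kron(a, b):
    """Kronecker (tensor) product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def scale(m, c):
    """Multiply every entry of matrix m by the scalar c."""
    return [[c * x for x in row] for row in m]


# Unnormalised Hadamard: the true gate is this matrix divided by sqrt(2).
H = [[1, 1], [1, -1]]
HH = kron(H, H)  # equals 2 * (Hadamard on qubit 0 tensor Hadamard on qubit 1)

# Basis order |q0 q1> = |00>, |01>, |10>, |11>.
CNOT_01 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control 0, target 1
CNOT_10 = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]  # control 1, target 0
SWAP = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

# Fig. 7: conjugating a CNOT with Hadamards on both qubits reverses its direction.
inverted = matmul(matmul(HH, CNOT_01), HH)            # 4 * CNOT_10

# Fig. 8: three CNOTs in the same direction plus four Hadamards give a SWAP.
swap_directed = matmul(matmul(CNOT_01, inverted), CNOT_01)  # 4 * SWAP
```

Both `inverted == scale(CNOT_10, 4)` and `swap_directed == scale(SWAP, 4)` hold, confirming the constructions up to the factored-out normalisation.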

Consider running our routed quantum circuit on the directed architecture of Fig. 6. As this graph constrains the direction of interactions, the quantum circuit we produced is no longer valid. We account for this using the inversion in Fig. 7, producing the circuit shown in Fig. 9. Many simplifications are possible on the resulting circuit, but care must be taken to ensure that the simplified circuit is still conformant to the architecture digraph.


Figure 9: Quantum circuit in Fig. 1 routed for architecture graph in Fig. 6.

III The Routing Procedure

The routing algorithm implemented in tket guarantees compilation of any quantum circuit to any architecture, represented as a simple connected graph. It is therefore completely hardware agnostic. The algorithm proceeds in four stages: decomposing the input circuit into timesteps; determining an initial placement; routing across timesteps; and a final clean-up phase.

III.1 Slicing the circuit into timesteps

Before routing we partition the quantum circuit into timesteps. The circuit structure provides a natural partial ordering of the gates; thus a greedy algorithm starting from inputs can divide the input circuit into “horizontal” partitions of gates which can be executed simultaneously. We simply traverse the circuit adding the qubits involved in a 2-qubit gate to the current timestep. Since only multiqubit interactions (such as CNOT or CZ gates) constrain the problem, single qubit gates can be ignored (more accurately: while the single qubit gates can be ignored for the purposes of routing, they must be retained for circuit generation; for clarity we ignore them for now). If a gate requires a qubit already used in the previous timestep, a new timestep is created. This procedure is repeated until all gates are assigned to a timestep. A timestep thus consists of a set of disjoint pairs of (logical) qubits which represent gates scheduled for simultaneous execution.

Applying this method to the example from Fig. 1 would yield the following timesteps.

Note that this is not the same as the illustrative slicing shown in Fig. 1!
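The greedy slicing described in this section can be sketched as follows. This is one literal reading of the procedure, under stated assumptions: a circuit is a hypothetical list of tuples of qubit labels in execution order, and a new timestep is opened as soon as a 2-qubit gate needs a qubit that the timestep under construction already uses. The compiler's actual implementation may differ.

```python
def slice_circuit(gates):
    """Greedily partition a gate list into timesteps of disjoint qubit pairs.

    gates: list of tuples of qubit labels in execution order (hypothetical format).
    Returns a list of timesteps, each a list of (qubit, qubit) pairs.
    """
    timesteps = []
    current = []   # pairs collected for the timestep under construction
    used = set()   # qubits already involved in that timestep
    for gate in gates:
        if len(gate) != 2:
            continue  # single-qubit gates do not constrain routing
        a, b = gate
        if a in used or b in used:
            # Conflict: close the current timestep and start a new one.
            timesteps.append(current)
            current, used = [], set()
        current.append((a, b))
        used.update((a, b))
    if current:
        timesteps.append(current)
    return timesteps


example = [(0, 1), (2, 3), (1, 2), (0,), (0, 3)]
slices = slice_circuit(example)
```

For this made-up input the result is two timesteps, `[(0, 1), (2, 3)]` followed by `[(1, 2), (0, 3)]`; the single-qubit gate on qubit 0 is skipped.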

The density of a timestep is a measure of the number of simultaneous gates executed. For an $n$-qubit architecture with single and two qubit gates, the density is

$$d = \frac{2 \times (\text{number of 2-qubit gates in the timestep})}{n}$$

Note that $d = 1$ where every qubit is involved in a 2-qubit gate in this timestep; a timestep is sparse when its density is close to zero. In principle, the density could be constrained to make routing easier. In practice this seems to make little difference, and we use this quantity only for the analysis in Section V.1.
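As a small worked example of the density, the sketch below assumes that the density of a timestep is the fraction of the architecture's qubits that take part in a 2-qubit gate, which is consistent with the limiting cases described above. The function name and the timestep format (a list of qubit pairs) are hypothetical.

```python
def density(timestep, n):
    """Fraction of the n qubits involved in a 2-qubit gate in this timestep."""
    return 2 * len(timestep) / n


full = density([(0, 1), (2, 3)], 4)   # every qubit is paired
sparse = density([(0, 1)], 8)         # only 2 of 8 qubits are paired
```

With these inputs `full` is 1.0 and `sparse` is 0.25.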

III.2 Initial Mapping

For the routing algorithm to proceed we require an initial mapping of logical qubits (referred to as qubits) to physical qubits (referred to as nodes). In tket a simple but surprisingly effective procedure is used.

We iterate over the timesteps to construct a graph whose vertices are qubits. At timestep $t$ we add the edge $(q, q')$ to the graph if (i) this pair is present in the timestep and (ii) both qubits $q$ and $q'$ have degree less than 2 in the current graph. Each connected component of the resulting graph is necessarily either a line or a ring; the rings are broken by removing an arbitrarily chosen edge.

Disconnected qubits in this graph correspond either to qubits which never interact at all, or to those whose first interaction is with a qubit whose first two interactions are with others. These disconnected qubits are not included in the initial placement at all; they are added later in the routing procedure.

We then select a subgraph of the architecture with high average degree and low diameter to start from. If the architecture is Hamiltonian connected (all the common architectures are; see Section IV and Refs. wong1995hamilton ; hwang2000cycles ) then it is possible to map the qubit graph to the architecture as one long line starting from a high degree vertex within this subgraph, and greedily choosing the highest degree available neighbour. This ensures that most of the gates in the first two timesteps can be applied without any swaps; the only exceptions are those gates corresponding to the edges removed when breaking rings.

If the initial mapping cannot be completed as one long line, then the line is split and mapped as several line segments.
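The construction of the qubit interaction graph, and its decomposition into lines, can be sketched as below. This covers only the first half of the procedure (it does not place the lines on an architecture), and the data format, the function name and the choice of which ring edge to drop are assumptions made for illustration.

```python
def interaction_lines(timesteps):
    """Build the degree-bounded qubit graph and split it into lines.

    timesteps: list of timesteps, each a list of (qubit, qubit) pairs.
    Returns (lines, unplaced): lines is a list of qubit sequences, unplaced
    lists the qubits left disconnected in the graph.
    """
    adj = {}
    for step in timesteps:
        for a, b in step:
            adj.setdefault(a, set())
            adj.setdefault(b, set())
            # Add the edge only while both endpoints have degree below 2.
            if len(adj[a]) < 2 and len(adj[b]) < 2:
                adj[a].add(b)
                adj[b].add(a)

    visited = set()

    def walk(start):
        """Follow unvisited neighbours from start until none remain."""
        line = [start]
        visited.add(start)
        while True:
            unvisited = sorted(v for v in adj[line[-1]] if v not in visited)
            if not unvisited:
                return line
            line.append(unvisited[0])
            visited.add(unvisited[0])

    lines = []
    # Components that are lines: start from an endpoint (degree 1).
    for q in sorted(adj):
        if len(adj[q]) == 1 and q not in visited:
            lines.append(walk(q))
    # Remaining degree-2 vertices lie on rings; walking from any vertex
    # implicitly drops the edge between the last and first vertex visited.
    for q in sorted(adj):
        if len(adj[q]) == 2 and q not in visited:
            lines.append(walk(q))
    unplaced = sorted(q for q in adj if not adj[q])
    return lines, unplaced


ring_case = interaction_lines([[(0, 1), (2, 3)], [(1, 2)], [(3, 0)], [(0, 2), (4, 1)]])
line_case = interaction_lines([[(0, 1)], [(1, 2)], [(3, 4)]])
```

In `ring_case` qubits 0 to 3 form a ring that is opened into the line `[0, 1, 2, 3]`, while qubit 4 only ever interacts with a qubit of degree 2 and is left unplaced.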

III.3 Routing

The routing algorithm iteratively constructs a new circuit which conforms to the desired architecture, taking the sliced circuit and the current mapping of qubits to nodes as input.

The algorithm compares the current timestep of the input circuit to the current qubit mapping. If a gate in the current timestep requires a qubit which has not yet been mapped, it is allocated to the nearest available node to its partner. All gates which can be performed in the current mapping (all 1-qubit gates and those 2-qubit gates whose operands are adjacent) are immediately removed from the timestep and added to the output circuit. If this exhausts the current timestep, we advance to the next; otherwise SWAPs must be added.

We define a distance vector $d(s, m)$ which approximates the number of SWAPs needed to make timestep $s$ executable in the mapping $m$; these vectors are ordered pointwise. Let $s_0$ denote the current timestep, $s_1$ its successor, and so on, and write $\sigma \bullet m$ to indicate the action of swap $\sigma$ upon the mapping $m$. We compute a sequence of sets of candidate SWAPs as follows:

$$\Sigma_0 = \mathrm{swaps}(s_0), \qquad \Sigma_{t+1} = \operatorname*{argmin}_{\sigma \in \Sigma_t} d(s_t, \sigma \bullet m)$$

where $\mathrm{swaps}(s_0)$ denotes all the pertinent swaps available at the initial timestep. The sequence terminates either when $|\Sigma_t| = 1$ or after a predefined cutoff. The selected SWAP is added to the circuit and the mapping is updated accordingly. We now return to the start and continue until the entire input circuit has been consumed.

The pointwise ordering of the distance vectors employed by tket is strict in the sense that $d(s, \sigma \bullet m) < d(s, m)$ implies that for all pairs of qubits $(q, q')$ in $s$, the longest of the shortest paths between any two paired qubits in $\sigma \bullet m$ is not longer than the longest of the shortest paths in $m$. In other words, the diameter of the subgraph composed of all pairs of qubits in $s$ should decrease strictly under the action of swap $\sigma$ on the mapping $m$. In consequence, in some highly symmetric configurations, the algorithm sometimes gets stuck, failing to find any candidate swap. We employ two strategies to overcome this. The first is to attempt the process again with pairs of disjoint swaps instead of individual ones. If this also fails then we resort to brute force: a pair of maximally distant qubits in the current timestep are brought together using a sequence of swaps along their shortest connecting path. This guarantees at least one gate may be performed, and disrupts the symmetry of the configuration, hopefully allowing the algorithm to escape from the bad configuration.
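The look-ahead selection of a SWAP can be illustrated with the following sketch. It is not the compiler's implementation: the distance vector is assumed here to be the list of graph distances between paired qubits sorted in decreasing order, vectors are compared with Python's lexicographic list ordering as a stand-in for the ordering described above, and "pertinent" swaps are assumed to be the edges touching a node that holds a qubit of a not-yet-adjacent pair. The fallback strategies are omitted, and all names are hypothetical.

```python
from collections import deque


def distances_from(adj, source):
    """Breadth-first search distances from source in an undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def apply_swap(mapping, swap):
    """Return the qubit-to-node mapping after exchanging the two nodes of swap."""
    u, v = swap
    exchanged = {u: v, v: u}
    return {q: exchanged.get(node, node) for q, node in mapping.items()}


def distance_vector(step, mapping, dist):
    """Distances between paired qubits, longest first (assumed definition)."""
    return sorted((dist[mapping[a]][mapping[b]] for a, b in step), reverse=True)


def choose_swap(timesteps, mapping, adj, cutoff=3):
    """Pick one SWAP by filtering candidates over successive timesteps."""
    dist = {u: distances_from(adj, u) for u in adj}
    current = timesteps[0]
    # Nodes holding qubits of pairs that are not yet adjacent.
    active = {mapping[q] for a, b in current
              if dist[mapping[a]][mapping[b]] > 1 for q in (a, b)}
    candidates = sorted({tuple(sorted((u, v))) for u in active for v in adj[u]})
    for step in timesteps[:cutoff]:
        if len(candidates) <= 1:
            break
        scored = [(distance_vector(step, apply_swap(mapping, s), dist), s)
                  for s in candidates]
        best = min(vec for vec, _ in scored)
        candidates = [s for vec, s in scored if vec == best]
    return candidates[0] if candidates else None


ring6 = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
identity = {q: q for q in range(6)}          # qubit q sits on node q
chosen = choose_swap([[(0, 2)], [(2, 3)]], identity, ring6)
```

In this made-up instance two swaps, (0, 1) and (1, 2), both make the pair (0, 2) adjacent; looking ahead to the next timestep, only (0, 1) keeps qubits 2 and 3 adjacent, so it is selected.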


In practice there is no need to slice the circuit in advance, and in fact better results are achieved by computing the timesteps dynamically during routing. The “next slice” is recomputed immediately after each update of the mapping, avoiding any unnecessary sequentialisation.

III.4 SWAP synthesis and clean-up

If the target hardware does not support SWAP as a primitive operation, then after the circuit has been routed the SWAPs in the routed circuit must be replaced with hardware-appropriate gates, as per Section II.2. While we assume that the input circuit was already well-optimised before routing, it is usually possible to remove some of the additional gates which are inserted during this process in a final clean-up pass.

The essential criterion here is that any changes to the circuit must respect the existing routing. This can be guaranteed by using any set of rewrite rules between 1- and 2-qubit circuits. The routing procedure will not insert a SWAP immediately before a 2-qubit gate on the same two qubits, but it may do so afterwards, so the possibility to, for example, cancel consecutive CNOT gates exists. However such cancellation rules are the only “true” 2-qubit rewrites which can be applied. In addition, tket uses a small set of rewrites for fusing single qubit gates, and commuting single qubit gates past 2-qubit gates. The particular rewrite rules vary according to the gates supported by the hardware.

IV Graph Representation of Quantum Computers

We represent the architecture of a given quantum computer as a simple connected graph, directed or undirected. We now list some specific architecture graphs used in this work.

  1. The ring, Fig. 2(b). A one-dimensional cyclic graph where each node is connected to its two nearest neighbors.

  2. The cyclic butterfly, Fig. 10(a). A non-planar graph with $n = r 2^r$ nodes. Each node is denoted by a pair $(w, i)$ where $w$ is an $r$-bit sequence corresponding to one of the $2^r$ rows and $i$ represents the column. Two nodes $(w, i)$ and $(w', i')$ are connected if $i' = i + 1 \bmod r$ and if $w = w'$ or $w$ and $w'$ have only one bit difference at position $i$, hence the connectivity is equal to 4 for any node; see Ref. brierley2015efficient .

  3. The square grid, Fig. 10(b). A two-dimensional graph with a square shape where nodes are connected to their four neighbors except at the edges where there can be only two or three neighbors.

  4. The IBM Q 20 Tokyo, Fig. 10(c). The graph supporting the 20-qubit processor produced by IBM is a two-dimensional graph with 20 nodes; it has a rectangular shape with some extra connectivity, see Ref. ibm_doc_tokyo .

  5. The Rigetti 19Q-Acorn, Fig. 10(d). The graph supporting the quantum processor produced by Rigetti is a two-dimensional graph with 20 nodes, see Ref. pyquil_doc_acorn .

In Appendix A Table 3 we present the basic properties of these graphs such as their degree and diameter, and the depth overhead of classical sorting algorithms on these graphs.
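The three regular families in the list above are easy to generate programmatically. The sketch below builds them as adjacency dictionaries; the function names and node labelling are illustrative choices, and the cyclic butterfly follows the standard wrapped-butterfly definition, in which node $(w, i)$ is joined to $(w, i+1 \bmod r)$ and to the node in the next column whose row differs from $w$ in bit $i$.

```python
def ring(n):
    """Cyclic graph on n nodes, each joined to its two nearest neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def square_grid(side):
    """side x side grid; nodes are (x, y) pairs joined to horizontal/vertical neighbours."""
    adj = {(x, y): set() for x in range(side) for y in range(side)}
    for (x, y) in adj:
        for neighbour in ((x + 1, y), (x, y + 1)):
            if neighbour in adj:
                adj[(x, y)].add(neighbour)
                adj[neighbour].add((x, y))
    return adj


def cyclic_butterfly(r):
    """Wrapped butterfly with r * 2**r nodes labelled (row w, column i)."""
    adj = {(w, i): set() for w in range(2 ** r) for i in range(r)}
    for (w, i) in adj:
        nxt = (i + 1) % r
        # Straight edge (same row) and cross edge (row differing in bit i).
        for w2 in (w, w ^ (1 << i)):
            adj[(w, i)].add((w2, nxt))
            adj[(w2, nxt)].add((w, i))
    return adj


ring64 = ring(64)
grid64 = square_grid(8)
butterfly64 = cyclic_butterfly(4)
```

All three graphs have 64 nodes; every node of `butterfly64` has degree 4, while the grid has degree 2 at its corners, 3 along its edges and 4 in the interior.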


Figure 10: (a) a cyclic butterfly graph (the first column is represented twice to improve the readability of the connectivity), (b) a 2-dimensional square grid, (c) the IBM Q 20 Tokyo chip (Ref. ibm_doc_tokyo ) and (d) the Rigetti 19Q-Acorn chip (Ref. pyquil_doc_acorn ). The edges represent the allowed interactions between qubits.

V Results

The current generation of quantum computers, the NISQ devices preskill2018quantum , are characterised by small numbers of qubits and shallow circuit depths. In this setting constant factors are more important than asymptotic analysis, so we present two sets of empirical results on the performance of tket’s routing algorithm. In the first set of results we evaluate the scaling behaviour on synthetic inputs of increasing size. In the second we compare the performance of tket against competing compiler implementations on a set of realistic circuits. Note that while the algorithm is very efficient, we report on the quality of the results rather than the time or memory requirements.

V.1 Scaling

The routing algorithm described in Section III can handle circuits of arbitrary depth, and architectures corresponding to any connected graph. We now evaluate how increasing the circuit depth, and the size and connectivity of the architecture graph influence the depth of the routed circuit.

As described above, routing adds SWAP gates to the circuit, increasing both its total gate count and the depth of the circuit. Since the total gate count depends on the particular gate set supported by the architecture, we will consider only the increase in circuit depth here. Therefore a reasonable figure of merit is the depth ratio:

$$\frac{\text{number of output timesteps}}{\text{number of input timesteps}}$$

where timesteps are computed as described in Section III.1. We define the mean depth overhead as

For a fair comparison to classical sorting algorithms, we consider that a SWAP gate counts as only one additional gate rather than, for example, three when decomposed into CNOT gates, and hence will induce at most one additional time step.

V.1.1 Scaling with depth

To assess the performance of tket with respect to increasingly deep circuits we perform the following protocol for each of the selected architectures.

  • We randomly generate 1000 circuits of density $d = 1$ with a varying number of initial timesteps. Note that requiring $d = 1$ implies there are no single qubit gates in the circuit.

  • Use tket to route the circuit on the chosen architecture.

  • Compute the depth ratio for the routed circuit.

We tested using the following five architectures:

  • a ring of size $n = 64$;

  • a square grid of size $n = 64$;

  • a cyclic butterfly of size $n = 64$;

  • the IBM Q 20 Tokyo ($n = 20$);

  • the Rigetti 19Q-Acorn ($n = 20$; the Rigetti Acorn has 20 qubits, but due to a manufacturing defect only 19 are usable Otterbach:2017aa . This is not relevant to our tests).

The number of nodes for the ring, square grid and cyclic butterfly architectures is chosen for fair comparison and similarly for the IBM and the Rigetti ones. To eliminate sampling bias, a single set of 64-qubit circuits was generated for all the 64-node architectures, and similarly for the 20-node architectures.
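A generator for random test circuits of a given density might look like the sketch below. The paper does not specify its sampling procedure, so this is only one plausible choice: each timestep is an independent uniformly random set of disjoint qubit pairs, with the number of pairs fixed by the assumed relation between density and gate count (density = 2 × pairs / n). All names are hypothetical.

```python
import random


def random_timestep(n, density, rng):
    """One timestep on n qubits: disjoint random pairs covering density * n qubits."""
    pairs = int(density * n / 2)
    qubits = rng.sample(range(n), 2 * pairs)   # distinct qubits, random order
    return [(qubits[2 * i], qubits[2 * i + 1]) for i in range(pairs)]


def random_circuit(n, density, depth, seed=0):
    """A circuit of `depth` independent random timesteps, reproducible via seed."""
    rng = random.Random(seed)
    return [random_timestep(n, density, rng) for _ in range(depth)]


dense = random_circuit(64, 1.0, 5, seed=1)
```

With density 1 every timestep is a perfect matching of the 64 qubits, so each contains 32 gates and no qubit is left idle.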

Figure 11: Multiple timesteps measurement and architecture comparison. The mean and standard deviation of the depth ratio are represented. The left plot overlays results for the ring, square grid and cyclic butterfly for 64 nodes. The right plot overlays results for the IBM and Rigetti architectures with 20 nodes. Results generated with random initial (dense) circuits with density equal to unity.

Fig. 11 represents the mean and standard deviation of the depth ratio for the graphs. The ratio is approximately constant and the effect of circuit depth is dominated by the influence of the architecture’s connectivity. This ratio seems to converge for circuits of depth greater than 5 and we report in Table 2 the values obtained for the largest number of input timesteps. While the ratios obtained seem rather large, it is worth remembering that dense ($d = 1$) circuits are the worst case for routing.

V.1.2 Scaling with architecture size

To evaluate the scaling with respect to the size of the architecture we consider single-timestep random quantum circuits of varying density, which are routed on architectures of increasing size. Initial qubit mapping is disabled for these tests so that only the routing procedure is evaluated. While the initial mapping is an important part of the algorithm, in this case we are interested in the scaling, to which it only provides an initial offset.

  • For each architecture of size $n$ generate random circuits of depth one, for each density considered.

  • Generate a random initial mapping of qubits on the architecture.

  • Route the timestep using tket, using the given mapping.

  • Compute the depth overhead for the routed circuit.

The following architectures were evaluated:

  • Rings of increasing size;

  • Square grids of increasing size;

  • Cyclic butterflies of increasing size.

The results are shown in Fig. 12 and the best fit parameters are given in Table 1. The results for the ring and square grid are determined with a regression in log-log space and those for the cyclic butterfly in log-log(log) space (represented in the insets). In each case we see that the overhead appears to grow with the diameter of the graph, although with an exponent that varies (slightly) with the density.
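A regression in log-log space amounts to fitting a power law $y = a x^b$ by ordinary least squares on $(\log x, \log y)$. The sketch below shows the calculation on synthetic data generated from a known power law; the numbers are made up for illustration and are not the paper's measurements.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope and intercept of the best-fit line through (log x, log y).
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
             / sum((u - mean_x) ** 2 for u in lx))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


# Synthetic data only: an exact power law with prefactor 3 and exponent 1/2.
sizes = [16, 36, 64, 100, 144]
overheads = [3.0 * n ** 0.5 for n in sizes]
prefactor, exponent = fit_power_law(sizes, overheads)
```

On this noise-free input the fit recovers the prefactor 3 and the exponent 0.5 up to floating-point error. For the log-log(log) fit used for the butterfly, the same function could be applied with `math.log(x)` substituted for each `x`.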

Figure 12: Variation of depth overhead with architecture size for single timestep random circuits. Plots from left to right: the ring, square grid and cyclic butterfly architectures. The mean and standard deviation of the depth overhead versus number of nodes (or qubits) is represented. The inset plots represent the log-log linear fit for the ring and the square grid (resp. log-loglog fit for the butterfly) for the data set of a single density.

Cyclic Butterfly
Table 1: Scaling of the depth overhead with architecture size for single-timestep random circuits.

Graph | Depth overhead for single-timestep circuits | Ratio output - input timesteps
Square grid
Cyclic butterfly
Rigetti 19Q-Acorn
IBM Q 20 Tokyo
Table 2: Summary of our scaling results for dense circuits ($d = 1$)

V.2 Realistic Benchmarks

Random circuits have an essentially uniform structure, which circuits arising from quantum algorithms typically lack. In certain cases this can make random circuits easier to route, although in the preceding section we have largely avoided this by using circuits of high density. To give a more realistic test we have also evaluated tket’s performance on a standard set of 156 circuits which perform various algorithms. These range in size from 6 to 16 qubits, and 7 to more than half a million gates.

We ran tket on each circuit of the benchmark set, with the 16-qubit ibmqx5 Rueschlikon, which is a rectangular grid, as the target architecture. We then repeated the same test set using the 20-qubit IBM Tokyo as the target architecture. Since both these architectures have CNOT as their only 2-qubit operation, and since it has lower fidelity than the single qubit operations, we selected figures of merit based on minimising the CNOT count and depth of the output circuit. In this test we do perform SWAP synthesis, to get a more realistic evaluation of the output for these devices. Let $N(C)$ be the total number of CNOT gates in circuit $C$, and let $D(C)$ be the depth of the circuit counting only the CNOT gates. The two measures of interest are

$$\frac{N(C_{\mathrm{out}})}{N(C_{\mathrm{in}})} \quad \text{and} \quad \frac{D(C_{\mathrm{out}})}{D(C_{\mathrm{in}})}$$

where $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ are the input and output circuits respectively. The results are shown in Fig. 13. We can see that tket achieves approximately linear overhead across the entire test set. The mean of 2.64 and of for ibmqx5, and a mean of 1.73 and of 1.69 for IBM Tokyo.
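The two figures of merit can be computed directly from a gate list. The following is a minimal sketch, assuming a hypothetical circuit format of `(gate name, qubits)` tuples in execution order; the CNOT depth is obtained by scheduling each CNOT as early as its two qubits allow and ignoring every other gate.

```python
def cx_count(circuit):
    """Total number of CNOT gates in the circuit."""
    return sum(1 for name, _ in circuit if name == "CX")


def cx_depth(circuit):
    """Depth of the circuit counting only the CNOT gates."""
    level = {}  # last CNOT layer occupied by each qubit
    for name, qubits in circuit:
        if name != "CX":
            continue
        layer = max(level.get(q, 0) for q in qubits) + 1
        for q in qubits:
            level[q] = layer
    return max(level.values(), default=0)


# Made-up example on a line 0-1-2: the CX on (0, 2) is not allowed, so a SWAP
# on nodes (0, 1), written out as three CNOTs, is inserted before it.
c_in = [("CX", (0, 1)), ("CX", (0, 2))]
c_out = [("CX", (0, 1)),
         ("CX", (0, 1)), ("CX", (1, 0)), ("CX", (0, 1)),   # SWAP(0, 1)
         ("CX", (1, 2))]

count_ratio = cx_count(c_out) / cx_count(c_in)
depth_ratio = cx_depth(c_out) / cx_depth(c_in)
```

In this toy case both ratios equal 2.5. A clean-up pass would cancel the first two identical CNOTs, which illustrates why the clean-up of Section III.4 matters for these metrics.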

Figure 13: Performance of tket on realistic test examples. (left) Mean ratio of output to input CX depth as a function of circuit depth (averaged in bins). (right) Mean ratio of output to input CX count (averaged in bins).

We also compared the performance of tket to a selection of other freely available quantum compiler systems: IBM’s QISKit qiskit , Project Q project_q , and Rigetti Computing’s Quilc pyquil_doc_acorn (since Quilc emits CZ as its preferred 2-qubit gate, we computed its figures using CZ count and CZ depth instead). None of the other compilers was able to complete the test set in the time allotted, despite being given at least an hour of compute time per example on a powerful computer (see Appendix B for more details). For comparison, tket completed the entire benchmark set in 15 minutes on the same hardware. In addition, Project Q does not support routing for the IBM Tokyo architecture due to its unusual graph structure; it was therefore only tested on the ibmqx5 architecture. Consequently, comparison of all four compilers is only available for circuits of fewer than 2000 total gates. The comparative results are shown in Fig. 14. We can see that tket, Qiskit and Quilc exhibit approximately linear overhead, while Project Q appears somewhat worse than linear. A line of best fit calculated using the least squares method is shown for each compiler in Fig. 14. Quilc and tket exhibit very similar performance; the others show significantly higher overhead.

Figure 14: Comparison of performance between different compilers. Top row: routing on the ibmqx5 architecture. Bottom row: routing on the IBM Tokyo architecture. Left column: input CX count against output CX count. Right column: input CX depth against output CX depth. The benchmark is done against the test set available on http://iic.jku.at/eda/research/ibm_qx_mapping/ and the results are averaged in bins when the initial count or depth is equal.

Finally, we compared the results to the published data of Zulehner et al. Zulehner:2017aa who use the same benchmark set, but use total gate count and depth as the metric. Since Quilc does not generate the same gate set as the others, it was excluded from this comparison. The algorithm of Zulehner et al. Zulehner:2017aa achieves comparable performance to tket. The results are presented in Appendix B.

Where to get the test set

The test set we used for this work was published by IBM as part of the QISKit Developer Challenge (https://qx-awards.mybluemix.net/#qiskitDeveloperChallengeAward), a public competition to design a better routing algorithm. The competition was won by Zulehner et al. Zulehner:2017aa . The test circuits are available from http://iic.jku.at/eda/research/ibm_qx_mapping/.

VI Conclusion

As better NISQ machines with the potential to effectively run quantum algorithms become available, the need for software solutions that allow users to easily run quantum circuits on them becomes more apparent. The tket routing module is one such solution and provides hardware compatibility with minimal extra gate overhead. It is flexible, general and scalable. In this work we have outlined how the routing procedure works and the figures of merit we use to assess routing performance for different graphs.

Finally, we consider possible extensions of this work. Firstly, we note that reinforcement learning offers an alternative approach to the qubit routing problem Sherbert . Eventually we foresee implementing several approaches to routing in tket to best adapt to differing algorithms and architectures.

Secondly, when considering the routing problem, we made the implicit assumption that all gates were equal. In real devices, notably superconducting devices, each gate has its own fidelity and run time, and this has to be taken into account. Splitting a quantum circuit into time steps becomes more complex once different run times are introduced, and we also have to ensure that the overhead in the error rate encountered by each qubit is as small as possible. Additionally, in real-life experiments, it has been observed in tannu2018case and klimov2018fluctuations that even the properties of the qubits can fluctuate within a single day. This calls for a general protocol that can accommodate this constraint. Addressing these different constraints transforms the problem from a routing one into a scheduling one, which we plan to address with tket. Implementing these constraints and measuring the resulting performance will be the object of future work.

Acknowledgments: We thank Steven Herbert for many helpful conversations and encouragement.


Appendix A Dynamical routing versus sorting networks

The routing problem described in this work can be solved using classical sorting algorithms. One of these is the cyclic odd-even sort for the ring of Fig. 2b). Starting from an architecture with a fixed number of nodes, one sequentially compares all even-labelled and all odd-labelled edges in alternation. After a number of time steps fixed by the size of the architecture, the input will be sorted regardless of its initial ordering.

Figure 15: An example of a sorting network on 8 inputs: odd-even sort over a ring.
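To make the fixed comparison schedule concrete, the following is a minimal sketch of the standard odd-even transposition sort on a line of nodes. This is the textbook linear variant, which sorts n inputs in n rounds, and not the cyclic ring version of Fig. 15; the function name and return values are illustrative assumptions.

```python
def odd_even_transposition_sort(values):
    """Sort values on a line of nodes using alternating even and odd edge rounds.

    Returns (sorted list, comparisons made, swaps actually performed).
    """
    data = list(values)
    n = len(data)
    comparisons = 0
    swaps = 0
    for round_index in range(n):
        # Even rounds compare edges (0,1), (2,3), ...; odd rounds compare (1,2), (3,4), ...
        for i in range(round_index % 2, n - 1, 2):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swaps += 1
    return data, comparisons, swaps


reversed_input = [7, 6, 5, 4, 3, 2, 1, 0]
nearly_sorted = [0, 1, 2, 3, 4, 5, 7, 6]
print(odd_even_transposition_sort(reversed_input))
print(odd_even_transposition_sort(nearly_sorted))
```

The number of comparisons depends only on the number of inputs, while the number of swaps actually needed depends on the input. For an input that is already close to sorted, almost all comparisons are wasted.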

For the ring, square and cyclic butterfly graphs presented in Section IV, we summarize in Table 3 some details on the degree and diameter of these graphs and the depth overhead of classical sorting algorithms (precisely the quantity introduced in Section V).

The downside of classical sorting algorithms is that they are non-adaptive: they compute the same sequence of comparisons regardless of input. As circuits are usually sparse (see Section III.1), this leaves many unnecessary comparisons and treats quantum circuits as sequences of hard timesteps. Indeed, routing solutions derived from classical sorting algorithms tend to pack a quantum circuit into multiple timesteps and then insert SWAP gates in between timesteps. Solving the routing problem sequentially, timestep by timestep, produces a concatenation of locally optimal solutions which can be very far from the globally optimal one. A good solution should be dynamic, consider a SWAP gate's influence on multiple timesteps, and optimize the global problem rather than the local one. See Ref. zulehner2018compiling for an additional discussion on this matter. Additional details on sorting networks in quantum computing are available in Refs. brierley2 ; brierley2015efficient .

Graph Degree Diameter
Ring 2
Square grid 4
Cyclic butterfly ( 4
Table 3: Comparison of different networks with nodes.

Appendix B Detailed Benchmark Results

The table rows are the names of the benchmark QASM circuits, which are available from www.github.com/iic-jku/ibm_qx_mapping. Benchmark data for Zulehner et al. is collected from results presented in their paper Zulehner:2017aa – note they do not present data for the complete set of examples. An example Jupyter workbook which demonstrates the benchmarking procedure is found at https://github.com/CQCL/pytket/blob/master/examples/tket_benchmarking.ipynb.

All computations were run on a Google Cloud virtual machine with the following specification: machine type n1-standard-2 (2 vCPUs, 7.5GB Memory), Intel Broadwell, 16GB RAM and Standard Persistent Disk. Each example was run until it completed, the computation aborted, or 60 minutes of real time had passed, whichever came first. Note that Quilc aborts in much less than 60 minutes.
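The three possible outcomes of a benchmark run (completion, abort, or time-out) can be captured with a small wrapper. The sketch below is an assumption about how such a harness might look and is not the procedure used for the reported results; the function name, the outcome labels, and the stand-in command are hypothetical.

```python
import subprocess
import sys


def run_with_time_limit(command, limit_seconds):
    """Run command and report 'completed', 'aborted' (non-zero exit), or 'timeout'."""
    try:
        result = subprocess.run(
            command,
            timeout=limit_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return "timeout"
    return "completed" if result.returncode == 0 else "aborted"


# Stand-in for a compiler invocation; a real harness would call the compiler here.
outcome = run_with_time_limit([sys.executable, "-c", "pass"], limit_seconds=60 * 60)
print(outcome)
```

The limit is measured in wall-clock time, matching the "real time" criterion stated above.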

In the tables, the gate count of the circuit means all gates in Table 4 and CX count only in Tables 5 and 6. Likewise, the circuit depth means total depth in Table 4 and CX depth only in Tables 5 and 6. The bold values are the best performance on each row. The “tket comparison” column shows the ratio between tket’s performance and that of the best other compiler; values less than 1 indicate that tket performs better.
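The comparison column can be sketched as follows. The treatment of time-outs is an assumption made for illustration (they are represented as None and ignored when selecting the best other compiler); the function name is hypothetical.

```python
def comparison_ratio(tket_value, other_values):
    """Ratio of tket's figure to the best (smallest) figure among the other compilers.

    A value of None marks a time-out. Returns None when no ratio can be formed.
    """
    finished = [value for value in other_values if value is not None]
    if tket_value is None or not finished:
        return None
    best_other = min(finished)
    if best_other == 0:
        return None
    return tket_value / best_other


# Hypothetical CX counts: tket gives 90; the other compilers give 120, a time-out, and 100.
ratio = comparison_ratio(90, [120, None, 100])
print(ratio)
```

Lower counts and depths are better, so the best other compiler is the one with the smallest value, and a ratio below 1 means that tket produced the smaller circuit.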


The example circuit “ground_state_estimation” gives anomalously low values after routing. This is due to an error in the circuit, which allows the post-routing clean-up pass of tket to eliminate almost the entire circuit.

B.1 All gates comparison on ibmqx5

Figure 16: Routing comparison on ibmqx5, gate count and depth of the routed circuits when counting all gates. The upper charts are a zoomed in version of the initial segment of the lower charts. The results are averaged in bins when the initial count or depth is equal.
Column headers: Zulehner et al., CQC’s tket (table data from benchmarkQX5Allgate.csv).
Entries give the number of quantum gates (elementary operations) and the depth of the quantum circuits; – marks time-outs and * marks data not provided by Zulehner et al.
Table 4: All gates comparison on ibmqx5

B.2 CX only comparison on ibmqx5

Figure 17: Routing comparison on ibmqx5, CX count and CX depth when counting only CX gates. The charts are a zoomed-in version of the initial segment of the upper charts of Fig. 14.
Column headers: Project Q, Quilc 1.1.1, Pyquil 2.1.1 (table data from benchmarkQX5CX.csv).
Entries give the number of quantum gates (elementary operations) and the depth of the quantum circuits; – marks time-outs.
Table 5: CX gates only comparison on ibmqx5

B.3 CX only comparison on IBM Tokyo

Figure 18: Routing comparison on IBM Tokyo, CX count and CX depth when counting only CX gates. The charts are a zoomed-in version of the initial segment of the lower charts of Fig. 14.
Column headers: Quilc 1.1.1, Pyquil 2.1.1 (table data from benchmarkTokyoCX.csv).
Entries give the number of quantum gates (elementary operations) and the depth of the quantum circuits; – marks time-outs.
Table 6: CX gates only comparison on IBM Tokyo