I Introduction
Quantum Computing (QC) is a new computational paradigm that aims to solve classically intractable problems with considerably higher efficiency and speed. It has been shown to offer exponential or polynomial advantages in domains such as cryptography [shor1999polynomial], database search [grover1996fast], chemistry [kandala2017hardware, peruzzo2014variational, cao2019quantum], and machine learning [biamonte2017quantum, harrow2009quantum, farhi2014quantum, lloyd2013quantum, rebentrost2014quantum]. Over the past two decades, QC hardware has progressed rapidly thanks to breakthroughs in physical implementation technologies.
Despite these exciting advances, we expect to remain in the Noisy Intermediate-Scale Quantum (NISQ) stage for several years before entering the Fault-Tolerant era [gottesman2010introduction, bennett1996mixed]. In the NISQ era, quantum computers typically contain tens to hundreds of qubits, which is insufficient for quantum error correction. The qubits and quantum gates also suffer from high error rates. Reducing quantum errors is therefore urgently needed to close the gap between what quantum algorithms require and the QC capacity that hardware can provide.
A series of works on noise-adaptive quantum program compilation has been proposed to mitigate the impact of noise. Noise-adaptive qubit mapping [li2019tackling, murali2019noise, tannu2019not] aims to find the best mapping from logical qubits to physical qubits, minimizing gate errors and SWAP insertion overhead. Noise-adaptive instruction scheduling and crosstalk mitigation techniques [ding2020systematic, murali2020software] aim to reduce undesired inter-qubit interference and circuit depth. However, these techniques explore only a small design space, because they optimize the compilation process for a fixed input quantum circuit. Little research effort has gone into improving the noise resilience of QC through a co-design strategy for searching, training, and compiling quantum circuits.
This work fills this gap by proposing QuantumNAS, a noise-adaptive framework that co-searches quantum circuits and qubit mappings to find the most robust circuit and corresponding qubit mapping for a given task on the target quantum device, as shown in Figure 1. We study parameterized quantum circuits because they provide unique opportunities to alter the circuit structure while performing the same functionality.
First, we are strongly motivated by the significant impact of quantum noise on performance. Figure 2 shows the accuracy of MNIST 4-class image classification simulated with a noise-free simulator and measured on the real IBMQ-Yorktown quantum computer. Key observations: (1) More parameters increase model capacity and thus noise-free simulation accuracy. However, more parameters also mean more gates, which introduce more noise, and the accumulated noise quickly offsets the capacity benefit. As a result, the measured accuracy peaks at 45 parameters. (2) Worse still, quantum noise exacerbates performance variance. The variance of measured accuracy under the same #parameters is much higher than that of noise-free simulation, e.g., [25%, 59%] vs. [67%, 77%] with 45 parameters. Both observations call for a noise-adaptive search for the most robust circuit.
One major challenge for this noise-adaptive search is algorithmic scalability. Solving the two-level optimization problem (over quantum circuit and qubit mapping) via iterative circuit sampling, parameter training, and evaluation in a large design space is almost intractable. To address this, we propose to decouple training from search by introducing a novel SuperCircuit-based search approach (Figure 1). We first construct a SuperCircuit by stacking a sufficient number of layers of predefined parameterized gates to cover a large design space. We then train the SuperCircuit by sampling and updating parameter subsets (SubCircuits) of it. The performance of a SubCircuit with parameters inherited from the SuperCircuit provides a reliable estimate of the relative performance of the same SubCircuit trained from scratch. In this way, we pay the training cost only once but can evaluate all SubCircuits quickly and efficiently, which significantly reduces the search cost.
Furthermore, we perform an evolutionary co-search with noise information in the loop to find the most robust quantum circuit and qubit mapping jointly. In each iteration, the evolution engine samples a population of (SubCircuit, qubit mapping) pairs. The performance of each sampled SubCircuit is then evaluated by an estimator on one of two types of backends: a noise-aware simulator or real quantum hardware. The estimator takes the parameters inherited from the SuperCircuit and assigns them to the SubCircuit. With a noise-aware simulator backend, performance is evaluated by direct noisy classical simulation with a realistic device noise model. Alternatively, the simulator can be replaced with real quantum hardware; the requirements of such an evaluation are no harder to meet than those of any common variational quantum algorithm. After multiple evolutionary search iterations, we obtain a robust pair of circuit and qubit mapping and then train its parameters from scratch. SuperCircuit-based search is inspired by the supernet method in classical ML model training [pham2018efficient, guo2020single, cai2019once]. However, there are four major differences: (1) the SuperCircuit is more general than an ML model and can be applied to various parameterized quantum algorithms such as VQE; (2) we co-search the circuit together with its qubit mapping; (3) our search is aware of quantum noise, which improves robustness; (4) we propose novel front sampling and restricted sampling techniques specialized for quantum circuits.
Finally, on top of the searched circuit and qubit mapping, we further propose fine-grained pruning to remove redundant parameters and gates, followed by fine-tuning to recover performance. The result is a slimmed circuit with similar noise-free performance but fewer noise sources, which in turn improves the final measured performance.
Overall, QuantumNAS mitigates the impact of quantum noise and delays the accuracy peak, as shown in Figure 3. The contributions of QuantumNAS are fivefold: ➊ Noise-Adaptive Quantum Circuit & Qubit Mapping Co-Search to enable noise-resilient QC. ➋ SuperCircuit-based Efficient Search Flow: we propose a scalable quantum circuit search method based on a SuperCircuit, with front sampling and restricted sampling for efficient exploration and stable optimization in the huge design space. ➌ Iterative Quantum Pruning to remove redundant quantum gates in a fine-grained manner. ➍ Extensive Real QC Evaluations: we evaluate QuantumNAS with 12 QML and VQE benchmarks on 10 quantum computers and observe significant improvements over baselines. ➎ Open-Source QC Library: to facilitate future research in QML and variational quantum simulation, we release QuantumEngine, a PyTorch-based, GPU-accelerated library for fast training of parameterized quantum circuits (over 200× faster than PennyLane [bergholm2018pennylane]). It also supports push-the-button deployment of trained circuits on real quantum devices.
II Background and Motivation
II-A Quantum Basics
Qubits – Unlike a conventional bit, a quantum bit (qubit [nielsen2002quantum, ding2020quantum]) can be in a linear combination of the two basis states $|0\rangle$ and $|1\rangle$: $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ for $\alpha, \beta \in \mathbb{C}$ satisfying $|\alpha|^2 + |\beta|^2 = 1$. The ability to create a “superposition” of basis states allows an $n$-qubit system to represent a linear combination of $2^n$ basis states. In contrast, an $n$-bit classical register can store only one of the $2^n$ states.
Quantum Circuits – To perform computation on a quantum system, we manipulate the qubits’ state by applying a quantum circuit. A quantum circuit consists of a sequence of operations called quantum gates, which take one quantum state to another through unitary transformations, i.e., $|\psi'\rangle = U|\psi\rangle$, where $U$ is a unitary matrix. The results of a quantum circuit are obtained by qubit readout operations called measurements, which collapse a qubit state to either $|0\rangle$ or $|1\rangle$ probabilistically according to the amplitudes $\alpha$ and $\beta$ (with probabilities $|\alpha|^2$ and $|\beta|^2$, respectively).

Operational Noises – In real QC, errors occur due to imperfect control signals, unwanted interactions between qubits, or interference from the environment [krantz2019quantum, bruzewicz2019trapped]. As a result, qubits undergo decoherence errors (spontaneous loss of their stored information) over time, and quantum gates introduce operation errors (e.g., coherent/stochastic errors) into the system. These systems need to be characterized [magesan2012characterizing] and calibrated [ibm_2021] frequently to mitigate noise impacts. Noise-adaptive techniques in QC algorithms, circuits, and devices are therefore critical.
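As a minimal illustration of the state-vector picture (plain Python, not QuantumEngine code), the sketch below prepares $|0\rangle$, applies an RY rotation as a 2×2 unitary $U$, and reads out the Born-rule probabilities $|\alpha|^2$ and $|\beta|^2$. The helper names `ry`, `apply`, and `measure_probs` are hypothetical and chosen only for this example.

```python
import math


def ry(theta):
    """2x2 unitary of an RY(theta) rotation."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]


def apply(unitary, state):
    """Return U|psi> for a single-qubit state given as [alpha, beta]."""
    return [unitary[0][0] * state[0] + unitary[0][1] * state[1],
            unitary[1][0] * state[0] + unitary[1][1] * state[1]]


def measure_probs(state):
    """Born rule: probabilities of collapsing to |0> and |1>."""
    return [abs(amp) ** 2 for amp in state]


psi = apply(ry(math.pi / 2), [1.0, 0.0])  # RY(pi/2)|0> = (|0> + |1>)/sqrt(2)
p0, p1 = measure_probs(psi)
print(round(p0, 6), round(p1, 6))  # 0.5 0.5
```

Because $U$ is unitary, the two probabilities always sum to 1; repeated measurements of this state return 0 and 1 about equally often.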
II-B Variational Quantum Circuits
A variational circuit is a trainable quantum circuit whose quantum gates are parameterized (e.g., by the angles of quantum rotation gates). The parameterized quantum circuit is used to prepare a variational quantum state $|\psi(x, \theta)\rangle = U(x, \theta)|0\rangle$, where $x$ is the input data related to the computation and $\theta$ is a set of free variables for adaptive optimization. The variational method has shown great potential in applications such as quantum machine learning [wittek2014quantum, biamonte2017quantum, schuld2018supervised, benedetti2019parameterized], numerical analysis [lloyd2014quantum, lloyd2016quantum], quantum simulation [peruzzo2014variational, kandala2017hardware, kokail2019self], and optimization [moll2018quantum, farhi2014quantum].
Typically, a variational circuit is trained by first selecting a hand-designed circuit for the computational task and then finding an optimal set of parameters for it via a hybrid quantum-classical optimization procedure. The optimization is usually an iterative process that searches for the best candidate values of the parameters $\theta$. Whether a variational quantum algorithm succeeds depends on how well the circuit can be trained. For example, the “barren plateau” [mcclean2018barren] is a phenomenon in which the cost-function landscape is flat, making a variational circuit untrainable with any gradient-based optimization algorithm.
Quantum Neural Networks (QNNs) are a promising application of variational quantum circuits. From an information-geometry point of view, [abbas2021power] shows that carefully designed QNNs have higher effective dimensions and faster training convergence than classical NNs. This strongly motivates improving QNN robustness on real quantum machines.
Figure 4 shows the example circuits we use for QML (QNN) and VQE. For QML tasks such as image classification, we first encode the pixels using rotation gates and then use parameterized, trainable quantum gates to process the information. We measure the qubits in the Z basis to obtain classical values, then apply Softmax to those values to get the probability of each class. For VQE, the parameterized circuit is used for state preparation, and the measurement part is constructed according to the molecule. We prepare the state multiple times to measure different qubits in different bases, multiply the expectation values of the qubits, and compute a weighted sum. The final result is the expectation value of the ground-state energy of the molecule. The parameters can be trained with backpropagation, in which we compute the derivative of the loss function $L$ with respect to each parameter $\theta_i$ and update the parameters with a learning rate $\eta$: $\theta_i \leftarrow \theta_i - \eta \, \partial L / \partial \theta_i$.
III Noise-Adaptive QuantumNAS
III-A Overview
Figure 5 shows an overview of QuantumNAS, its time cost, and a simple example. First, a SuperCircuit is trained to provide fast estimates of the performance ranking of SubCircuits; front sampling and restricted sampling are proposed to improve the reliability of these estimates. Then a noise-adaptive evolutionary co-search is performed to find the best pair of circuit and qubit mapping, with a performance estimator providing fast and accurate feedback to the evolution engine. Redundant gates with small parameter magnitudes are then pruned from the searched circuit. Finally, the pruned circuit is compiled and deployed on real quantum devices.
III-B SuperCircuit Construction and Training
It is critical to encompass a large design space to include the most robust circuit. However, training all candidate circuits, evaluating their final performance, and selecting the best one is too costly. We thus propose SuperCircuit to evaluate each circuit in the design space (SubCircuit) without fully training it. Since we only need to find the best circuit, relative performance is sufficient, and can be estimated by the SuperCircuit.
Given prespecified basis gates and a design space, the SuperCircuit is defined as the circuit with the largest number of gates in the space; its parameters are trained by iteratively sampling and updating a subset of gates/parameters (a SubCircuit). The SuperCircuit contains multiple blocks, each with several layers of parameterized gates. A SubCircuit is a subset of the SuperCircuit that can have a different number of blocks and a different number of gates inside each block. Figure 6 shows one block of the U1+CU1 space, containing one U1 layer and one CU1 layer. The SuperCircuit contains all gates, while the SubCircuit contains only the gates drawn with solid lines. In each SuperCircuit training step, we sample a SubCircuit and compute gradients for, and update, only that subset of the SuperCircuit's parameters. Intuitively, training the SuperCircuit amounts to training all SubCircuits in the design space simultaneously.
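The weight-sharing idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the QuantumNAS implementation: the SuperCircuit is modeled as a dictionary holding one shared angle per (block, layer, gate position), a SubCircuit as a list of active gate positions per layer, and `toy_grad` is a hypothetical stand-in for the real loss gradient. Each step updates only the parameters the sampled SubCircuit uses.

```python
import random

N_BLOCKS, N_LAYERS, N_QUBITS = 4, 2, 4  # e.g. one U1 layer + one CU1 layer per block

# SuperCircuit: one shared angle per (block, layer, gate position).
super_params = {(b, l, q): 0.0
                for b in range(N_BLOCKS)
                for l in range(N_LAYERS)
                for q in range(N_QUBITS)}


def sample_subcircuit(rng):
    """Randomly pick #blocks, then a random subset of gate positions per layer."""
    n_blocks = rng.randint(1, N_BLOCKS)
    return [[sorted(rng.sample(range(N_QUBITS), rng.randint(1, N_QUBITS)))
             for _ in range(N_LAYERS)]
            for _ in range(n_blocks)]


def active_keys(subcircuit):
    """SuperCircuit parameters that this SubCircuit inherits."""
    return [(b, l, q)
            for b, block in enumerate(subcircuit)
            for l, gates in enumerate(block)
            for q in gates]


def toy_grad(theta):
    """Stand-in for a real loss gradient: d/dtheta of (theta - 1)^2."""
    return 2.0 * (theta - 1.0)


def train_step(subcircuit, grad_fn, lr=0.1):
    """Update only the sampled subset; the rest of the SuperCircuit is untouched."""
    for key in active_keys(subcircuit):
        super_params[key] -= lr * grad_fn(super_params[key])


rng = random.Random(0)
for _ in range(200):
    train_step(sample_subcircuit(rng), toy_grad)
```

Evaluating a candidate afterwards only requires looking up `active_keys(candidate)` in `super_params`, which is what "inheriting" parameters means here.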
The SuperCircuit enables low-cost evaluation of SubCircuits in the design space. Given a SubCircuit, it suffices to inherit its gate parameters from the SuperCircuit and evaluate it without training, which provides an accurate estimate of the SubCircuit's relative performance. Since the next stage is a derivative-free optimization such as evolutionary search, relative performance between SubCircuits is sufficient to find the best one. In addition, the SuperCircuit can be reused for new devices or when the noise changes, so we pay the noise-free SuperCircuit training cost only once but can use it for all devices. Let $N_{\text{dev}}$ be the number of quantum devices on which circuits are executed, $N_{\text{cir}}$ the number of circuits evaluated during search, and $N_{\text{train}}$/$N_{\text{eval}}$ the number of circuit-running iterations in training/evaluation. The number of circuit runs for naïve search is $N_{\text{dev}} \cdot N_{\text{cir}} \cdot (N_{\text{train}} + N_{\text{eval}})$, while that for SuperCircuit search is $N_{\text{train}} + N_{\text{dev}} \cdot N_{\text{cir}} \cdot N_{\text{eval}}$. The overall search cost is therefore significantly reduced in our setting.
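To make the cost comparison concrete, the sketch below counts circuit runs for the two strategies, assuming the naïve search trains and evaluates every candidate on every device, while the SuperCircuit is trained once and candidates are only evaluated: naïve $= N_{\text{dev}} N_{\text{cir}} (N_{\text{train}} + N_{\text{eval}})$ and SuperCircuit $= N_{\text{train}} + N_{\text{dev}} N_{\text{cir}} N_{\text{eval}}$. The input sizes are hypothetical and chosen only for illustration; they are not the paper's settings.

```python
def naive_runs(n_dev, n_cir, n_train, n_eval):
    """Every candidate is trained from scratch and evaluated on every device."""
    return n_dev * n_cir * (n_train + n_eval)


def supercircuit_runs(n_dev, n_cir, n_train, n_eval):
    """The SuperCircuit is trained once; candidates are only evaluated per device."""
    return n_train + n_dev * n_cir * n_eval


# Hypothetical sizes, chosen only to illustrate the formulas.
args = dict(n_dev=3, n_cir=1000, n_train=10_000, n_eval=10)
naive, shared = naive_runs(**args), supercircuit_runs(**args)
print(naive, shared, naive / shared)  # 30030000 40000 750.75
```

The saving grows with both the number of candidates and the ratio of training to evaluation iterations, because the expensive training term is no longer multiplied by $N_{\text{dev}} N_{\text{cir}}$.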
A critical challenge in sampling-based SuperCircuit training is large variance. Naïve random sampling often causes severe trainability issues, because drastic SubCircuit changes between steps lead to intractable sampling variance and thus unreliable relative performance estimates. To address this, we propose front sampling and restricted sampling.
Front Sampling. In front sampling, only SubCircuits made of the first few blocks, and of the first few gates within each layer, can be sampled. For instance, if the SubCircuit contains three blocks, blocks 0, 1, and 2 are sampled. Inside a block, if two gates are sampled in a layer, the gates on qubits 0 and 1 are sampled. Figure 6 shows several valid cases of front sampling. Front sampling improves SuperCircuit trainability because SubCircuits share the parameters of the front blocks and gates.
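A front sampler can be sketched as follows, assuming a SubCircuit is described by the qubit indices of its active gates in each layer of each kept block (a hypothetical encoding for this example). Sampling a count of blocks $k$ and a width $w$ per layer is enough, because the kept blocks are always $0, \dots, k-1$ and the kept gates are always those on qubits $0, \dots, w-1$.

```python
import random


def front_sample(rng, max_blocks, n_layers, n_qubits):
    """Front sampling: keep the first k blocks and, in each layer, the first w gates.

    Returns one list per kept block; each layer is the list of qubit indices
    whose gate is active, which is always 0..w-1, never an arbitrary subset.
    """
    n_blocks = rng.randint(1, max_blocks)
    return [[list(range(rng.randint(1, n_qubits))) for _ in range(n_layers)]
            for _ in range(n_blocks)]


rng = random.Random(1)
print(front_sample(rng, max_blocks=8, n_layers=2, n_qubits=4))
```

Compared with an arbitrary subset, this restriction makes gate 0 of block 0 part of every SubCircuit, so the most frequently shared parameters receive the most updates.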
Restricted Sampling is another essential technique we propose to boost training stability. We prevent the sampled SubCircuits from changing dramatically between two steps by constraining the maximum number of layers in which they differ. The training process is thus stabilized because the sampling variance is kept under control. In Figure 7, the upper path shows unrestricted sampling, where two consecutive SubCircuits differ by 5 layers; in the bottom path, restricted sampling limits the difference to 3 layers, yielding much better cross-step consistency.
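One simple way to enforce the restriction is to derive each new SubCircuit from the previous one by resampling at most `max_diff` layers. The sketch below assumes a hypothetical flat encoding (the #gates of every SuperCircuit layer, ignoring the block count) and uses 8 blocks × 2 layers with a limit of seven differing layers; it is an illustration, not the authors' sampler.

```python
import random


def restricted_sample(rng, prev_widths, n_qubits, max_diff):
    """Resample the widths of at most `max_diff` layers of the previous SubCircuit.

    `prev_widths` is a hypothetical flat encoding: the #gates in every layer of
    the SuperCircuit. All layers that are not picked are copied unchanged.
    """
    new_widths = list(prev_widths)
    n_change = rng.randint(0, min(max_diff, len(prev_widths)))
    for layer in rng.sample(range(len(prev_widths)), n_change):
        new_widths[layer] = rng.randint(1, n_qubits)
    return new_widths


def layer_diff(a, b):
    """Number of layers in which two SubCircuits differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
widths = [4] * 16  # 8 blocks x 2 layers, all at full width
for step in range(1000):
    new = restricted_sample(rng, widths, n_qubits=4, max_diff=7)
    assert layer_diff(widths, new) <= 7
    widths = new
```

A resampled layer may draw its old width again, so the actual difference can be smaller than `n_change`, but it can never exceed the limit.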
III-C Noise-Adaptive Evolutionary Co-Search
The SuperCircuit provides highly efficient relative performance estimates. We adopt a derivative-free optimization method to explore the joint search space of circuits and qubit mappings.
Evolutionary Search. We employ a genetic algorithm in which the gene vector encodes both the circuit and the qubit mapping. Each element of the circuit sub-gene represents the circuit width (#gates) in the corresponding layer, and one additional gene sets the circuit depth (#blocks). Front sampling is also applied here. The qubit mapping sub-gene encodes the mapping between logical and physical qubits. We concatenate the circuit and qubit mapping sub-genes to form the gene of the pair.
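The gene layout can be sketched as a flat integer list: one width per SuperCircuit layer, one depth gene, then one physical qubit per logical qubit. The sizes (8 blocks × 2 layers, a 4-qubit task on a 5-qubit device) and the helper names are assumptions for illustration. We assume the decoded `layout` list, indexed by logical qubit, is in the list form accepted as the compiler's initial layout.

```python
import random

N_BLOCKS, N_LAYERS = 8, 2     # SuperCircuit size (example values)
N_LOGICAL, N_PHYSICAL = 4, 5  # e.g. a 4-qubit task on a 5-qubit device


def random_gene(rng):
    """[#gates per layer ...] + [#blocks] + [physical qubit of each logical qubit]."""
    widths = [rng.randint(1, N_LOGICAL) for _ in range(N_BLOCKS * N_LAYERS)]
    depth = [rng.randint(1, N_BLOCKS)]
    mapping = rng.sample(range(N_PHYSICAL), N_LOGICAL)  # distinct physical qubits
    return widths + depth + mapping


def decode(gene):
    """Split a gene into the blocks actually used and the qubit layout."""
    n_w = N_BLOCKS * N_LAYERS
    widths, n_blocks, layout = gene[:n_w], gene[n_w], gene[n_w + 1:]
    # Front sampling: only the first n_blocks blocks are kept.
    blocks = [widths[b * N_LAYERS:(b + 1) * N_LAYERS] for b in range(n_blocks)]
    return blocks, layout


blocks, layout = decode(random_gene(random.Random(0)))
```

Widths of blocks beyond the depth gene are still carried in the gene; they are simply ignored when decoding, which keeps every gene the same length for mutation and crossover.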
The evolution engine searches for high-performance pairs by maintaining a population of pairs. In each iteration, it first evaluates all pairs by querying a performance estimator and selects the pairs with the highest performance (the lowest loss/eigenvalue for QML/VQE) as the parent population. Mutation and crossover are then conducted to generate the new population, as in Figure 8. Mutation randomly alters several genes with a predefined probability. Crossover first selects two parent samples from the parent population and then generates a new sample, each gene of which is randomly selected from one of the two parents. The new population is the union of the parent population, the mutations, and the crossovers. We then sort the new population, select the ones with the highest performance as parents, and enter the next iteration. The population of the very first iteration is randomly sampled, and the population size remains the same across iterations. For QML, we use the validation set loss as the indicator: the lower the validation loss, the higher the final measured accuracy.
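The loop described above can be sketched as a small genetic algorithm. The defaults mirror the search setup reported in the evaluation (40 iterations, population 40, 10 parents, 20 mutations with probability 0.4, 10 crossovers). Two simplifications are assumed: every gene position is an independent choice, and the repair that keeps a qubit-mapping sub-gene injective is omitted. `toy_loss` is a hypothetical stand-in for the noise-aware performance estimator.

```python
import random


def evolve(space, loss_fn, rng, iters=40, pop_size=40, n_parents=10,
           n_mutate=20, n_cross=10, p_mutate=0.4):
    """Genetic search; space[i] is the number of choices for gene position i."""
    def random_gene():
        return [rng.randrange(n) for n in space]

    def mutate(gene):
        # Each position is resampled with probability p_mutate.
        return [rng.randrange(n) if rng.random() < p_mutate else g
                for g, n in zip(gene, space)]

    def crossover(a, b):
        # Each position is copied from one of the two parents at random.
        return [rng.choice(pair) for pair in zip(a, b)]

    population = [random_gene() for _ in range(pop_size)]
    for _ in range(iters):
        # Lower loss is better (validation loss for QML, eigenvalue for VQE).
        parents = sorted(population, key=loss_fn)[:n_parents]
        mutations = [mutate(rng.choice(parents)) for _ in range(n_mutate)]
        crosses = [crossover(*rng.sample(parents, 2)) for _ in range(n_cross)]
        population = parents + mutations + crosses  # 10 + 20 + 10 = 40
    return min(population, key=loss_fn)


# Toy estimator: distance to a hidden target gene (stands in for the noisy loss).
target = [1, 2, 3, 0, 1, 2, 3, 0, 1, 2]


def toy_loss(gene):
    return sum(g != t for g, t in zip(gene, target))


best = evolve([4] * len(target), toy_loss, random.Random(0))
```

Because the parents are carried over unchanged, the best loss in the population never gets worse from one iteration to the next.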
Performance Estimator. Ideally, the performance of circuit-qubit mapping pairs would be evaluated directly on real quantum devices; however, this could be extremely slow due to limited resources and queuing. We therefore apply an estimator to provide fast relative performance estimates that account for noise. The estimator takes the pairs queried by the evolution engine as inputs. It inherits the gate parameters of the searched SubCircuit from the SuperCircuit and sets the searched qubit mapping as the compiler's initial mapping. There are two ways of estimation. The first is to use a simulator with a noise model from real devices. The noise models are derived from calibrations, such as randomized benchmarking, performed by the IBMQ team. They contain coherence (depolarizing), decoherence (thermal relaxation), and SPAM (readout) errors. The models are updated around twice a day and can be accessed directly with the Qiskit API. The second is to use a noise-free simulator and compute the overall success rate as the product of the success rates of all gates. The augmented loss is then the noise-free simulated loss divided by the calculated success rate: $L_{\text{aug}} = L / s$, where $s$ is the success rate and $L$ is the loss. The first method is more accurate but slower, while the second is less accurate but faster. Therefore, in QuantumNAS, small circuits (10 qubits) use the first method and large circuits use the second.
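The second (success-rate) estimator reduces to a few lines: multiply per-gate success rates $(1 - \text{error})$ into an overall success rate $s$, then report $L / s$. In the sketch below the per-gate error rates are hypothetical inputs; in practice they would come from device calibration data.

```python
import math


def circuit_success_rate(gate_errors):
    """Overall success rate: product of per-gate success rates (1 - error)."""
    return math.prod(1.0 - e for e in gate_errors)


def augmented_loss(noise_free_loss, gate_errors):
    """Noise-free simulated loss divided by the estimated success rate."""
    return noise_free_loss / circuit_success_rate(gate_errors)


# Hypothetical circuits: the deeper one is slightly better noise-free
# but uses 4x more CNOTs with a 1e-2 error rate each.
shallow = augmented_loss(0.50, [1e-2] * 10)  # ~0.553
deep = augmented_loss(0.45, [1e-2] * 40)     # ~0.673
print(shallow < deep)  # True
```

The penalty grows geometrically with the gate count, which is why this estimator favors shallower circuits even when their noise-free loss is slightly worse.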
The estimator makes two approximations. The first uses the performance of a SubCircuit with inherited parameters to estimate the performance of the same SubCircuit with parameters trained from scratch. The second uses simulation results, obtained with either the noise model or the success rate, to estimate performance on real devices. Since we only care about relative performance, this two-level approximation remains reliable enough for the search engine. Figure 9 shows the effectiveness of the first approximation on five tasks in two design spaces. For each point, the x-axis value is the performance (loss) with parameters inherited from the SuperCircuit, and the y-axis value is that with parameters trained from scratch. The average Spearman's correlation is 0.75, indicating a strong positive correlation and thus accurate relative performance. Figure 10 further shows the final estimated loss versus the real loss for MNIST-4 on IBMQ-Yorktown. Their correlation is 0.76, which indicates a significant positive relation. The estimated performance is therefore reliable enough to search for the best circuit-mapping pair.
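Spearman's correlation compares rankings rather than raw values, which is exactly what a relative-performance estimator needs. The sketch below computes it as the Pearson correlation of ranks, with ties assigned average ranks. The five loss pairs are made up for illustration and are not data from Figure 9. `scipy.stats.spearmanr` provides the same statistic if SciPy is available.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical losses: inherited-parameter vs. trained-from-scratch.
inherited = [0.9, 0.7, 1.1, 0.8, 1.0]
scratch = [0.6, 0.5, 0.9, 0.65, 0.7]
print(round(spearman(inherited, scratch), 3))  # 0.9
```

Here the two orderings differ only by one adjacent swap, so the score stays close to 1 even though the absolute losses differ considerably.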
III-D Iterative Quantum Pruning
We further propose to remove redundant quantum gates to reduce noise. The motivations are threefold. First, the suboptimality of the evolutionary search stage leaves room to further optimize the searched circuit by reducing the number of gates. Second, even for the same circuit, multiple parameter sets can achieve similar noise-free performance. Some sets contain more parameters with magnitudes close to zero, which can be safely removed through iterative pruning and fine-tuning. Third, some gates, such as U3, contain multiple parameters, and removing only some of them can also bring benefits. We find that the numbers of compiled gates of U3(), U3(), U3(), U3(), U3() and U3() are 5, 1, 4, 4, 1, 1, respectively. Therefore, having one or two zero parameters in a U3 gate can reduce the number of gates compiled to the basis gate set (CNOT, SX, RZ) by up to 80%.
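One way to see why zeroing angles pays off: with the standard U3 matrix (the convention of Qiskit's `U3Gate`), setting $\theta = 0$ gives $U3(0, \phi, \lambda) = \mathrm{diag}(1, e^{i(\phi + \lambda)})$. This is a single $RZ(\phi + \lambda)$ up to a global phase, so one basis gate suffices. The sketch checks this numerically. It only illustrates the matrix identity and does not reproduce the transpiler's exact gate counts.

```python
import cmath
import math


def u3(theta, phi, lam):
    """U3 matrix in the convention used by Qiskit's U3Gate."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -cmath.exp(1j * lam) * s],
            [cmath.exp(1j * phi) * s, cmath.exp(1j * (phi + lam)) * c]]


def rz(angle):
    """RZ(angle) = diag(e^{-i angle/2}, e^{i angle/2})."""
    return [[cmath.exp(-0.5j * angle), 0], [0, cmath.exp(0.5j * angle)]]


def equal_up_to_global_phase(a, b, tol=1e-9):
    """True if a == e^{ig} * b for some real global phase g."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    k = next(i for i, x in enumerate(flat_b) if abs(x) > tol)
    phase = flat_a[k] / flat_b[k]
    return (abs(abs(phase) - 1) < tol and
            all(abs(x - phase * y) < tol for x, y in zip(flat_a, flat_b)))


# theta = 0: U3(0, phi, lam) is diagonal, i.e. a single RZ(phi + lam) up to phase.
print(equal_up_to_global_phase(u3(0.0, 0.3, 0.5), rz(0.8)))  # True
# theta != 0: the gate also rotates away from the Z axis, so one RZ is not enough.
print(equal_up_to_global_phase(u3(0.4, 0.3, 0.5), rz(0.8)))  # False
```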
We therefore propose iterative pruning, shown in Figure 11, to remove circuit parameters in a fine-grained manner. Specifically, we first train the searched circuit from scratch to convergence. We then rank all normalized rotation angles, remove the fraction of angles closest to 0, and fine-tune the remaining parameters to recover accuracy. We iteratively increase the pruning ratio and fine-tune the circuit parameters until the desired ratio is reached. In practice, we adopt the polynomial pruning-ratio schedule of [zhu2017prune]: $r_t = r_f + (r_i - r_f)\left(1 - \frac{t - t_0}{T}\right)^3$ for $t_0 \le t \le t_0 + T$, where $r_t$ is the pruning ratio at training step $t$, $r_i$ and $r_f$ are the initial and final ratios, and pruning runs from step $t_0$ for $T$ steps. When selecting the final pruning ratio, we make sure that noise-free simulation performance does not degrade compared with the unpruned circuit. Because the pruned circuit has fewer gates after compilation, its measured accuracy can then increase by up to 9%.
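Below is a minimal sketch of the two ingredients of iterative pruning. The first is the cubic schedule from [zhu2017prune], $r_t = r_f + (r_i - r_f)(1 - (t - t_0)/T)^3$. The second is a magnitude mask that zeroes the fraction of angles closest to 0 after wrapping them into $[-\pi, \pi)$, so an angle near $2\pi$ also counts as "near zero". The fine-tuning between pruning steps is omitted, and the function names are hypothetical.

```python
import math


def pruning_ratio(step, r_init, r_final, start, end):
    """Cubic schedule of Zhu & Gupta (2017): r_init at `start`, r_final at `end`."""
    if step <= start:
        return r_init
    if step >= end:
        return r_final
    progress = (step - start) / (end - start)
    return r_final + (r_init - r_final) * (1.0 - progress) ** 3


def normalize_angle(theta):
    """Wrap an angle into [-pi, pi) so that values near 2*pi count as near 0."""
    return (theta + math.pi) % (2 * math.pi) - math.pi


def prune_mask(angles, ratio):
    """Keep-mask that drops the `ratio` fraction of angles closest to 0."""
    n_prune = int(round(ratio * len(angles)))
    order = sorted(range(len(angles)), key=lambda i: abs(normalize_angle(angles[i])))
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(angles))]


angles = [0.01, 3.0, 6.27, -1.0]
print(prune_mask(angles, 0.5))  # [False, True, False, True]
```

In a full loop, one would recompute `pruning_ratio` at every step, reapply `prune_mask` to the current angles, and keep fine-tuning only the unmasked parameters. For example, a run that starts at 0.05 and ends at half of the total steps matches the pruning setup in Section IV.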
III-E QuantumEngine
To accelerate parameterized quantum circuit training in this work, we build a PyTorch library named QuantumEngine. Its APIs mirror those of existing PyTorch operations, making quantum circuit construction as easy as building a standard neural network model. It supports all common quantum gates. The state vector and the unitary matrix of each gate are implemented with the native torch.Tensor data type, and simulation is performed with complex-valued, differentiable matrix multiplication operators such as torch.bmm.
Compared with existing training frameworks such as PennyLane [bergholm2018pennylane], it has several unique advantages: (1) It supports both dynamic and static computational graphs. Dynamic mode simulates each gate individually, so the state vector after each gate can be inspected for easy debugging. Static mode optimizes tensor network simulation by fusing the unitaries of multiple gates before applying them to the state vector, reducing the amount of computation. (2) It supports training in batch mode, which is important for QML tasks; PennyLane cannot support batched training because it processes a batch with a for loop. (3) All simulations can be accelerated with PyTorch's GPU support. (4) PyTorch's native automatic differentiation can be applied to train the gate parameters. Furthermore, QuantumEngine supports push-the-button conversion between a PyTorch quantum circuit and an IBM Qiskit QuantumCircuit, enabling a convenient end-to-end training-to-deployment flow. It contains many ready-to-use templates, e.g., random and strongly entangled layers. The parameter-shift rule is also supported for gradient computation.
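The parameter-shift rule computes exact gradients for Pauli-rotation gates from two extra circuit evaluations, $\partial f / \partial \theta = [f(\theta + \pi/2) - f(\theta - \pi/2)]/2$. It therefore works even on hardware where automatic differentiation is unavailable. The sketch applies it to a one-gate "circuit" whose output $\langle Z \rangle$ after $RY(\theta)|0\rangle$ equals $\cos\theta$. This is the generic textbook rule, not QuantumEngine's API, and `expval_z_after_ry` is a hypothetical stand-in for a real circuit run.

```python
import math


def expval_z_after_ry(theta):
    """<Z> of RY(theta)|0>, which is cos(theta); stands in for a circuit run."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return c * c - s * s


def parameter_shift_grad(f, theta, shift=math.pi / 2):
    """Exact gradient for a gate exp(-i*theta*P/2) with P a Pauli operator."""
    return (f(theta + shift) - f(theta - shift)) / 2


theta = 0.7
print(parameter_shift_grad(expval_z_after_ry, theta), -math.sin(theta))
```

Both printed values agree (up to floating-point rounding), since the analytic derivative of $\cos\theta$ is $-\sin\theta$.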
Figure 12 shows the training speed of QuantumEngine vs. PennyLane on 10-qubit parameterized quantum circuits containing 100 RX and 100 CRY gates. Since PennyLane processes a batch with a for loop, its training speed decreases linearly with the batch size. QuantumEngine supports tensorized batch processing on CPU/GPU, so its speed is not severely affected. The training speed is to 10 times faster than PennyLane.
IV Evaluation
IV-A Evaluation Methodology
Benchmarks. We conduct experiments on 6 QML and 6 VQE tasks. The QML tasks are classification tasks: MNIST [726791] 10-class, 4-class (0, 1, 2, 3), and 2-class (3, 6); Vowel [deterding1989speaker] 4-class (hid, hId, had, hOd); and Fashion [xiao2017fashion] 4-class (t-shirt/top, trouser, pullover, dress) and 2-class (dress, shirt). MNIST and Fashion use 95% of the images in the ‘train’ split as the training set and 5% as the validation set. Due to limited real QC resources, we randomly sample 300 images from the ‘test’ split as our test set and report their accuracy on real quantum devices. We find that 300 images already give accuracy comparable to the whole test set: on four circuits, the whole-test-set/300-sample accuracies are 0.505/0.497, 0.284/0.283, 0.564/0.547, and 0.272/0.287. The input images are 28 × 28. We center-crop them and downsample them with average pooling to 4 × 4 for 2- and 4-class classification and to 6 × 6 for MNIST-10. The Vowel-4 dataset (990 samples) is split into train:validation:test = 6:1:3, and we test on the whole test set. We perform principal component analysis (PCA) on the vowel features and take the 10 most significant dimensions.
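The image preprocessing can be sketched with plain lists: center-crop, then average-pool to 4 × 4 (16 values) or 6 × 6 (36 values), and flatten the result for the rotation-gate encoder. The 24 × 24 crop size below is an assumption made for this example (it divides evenly into both pooled sizes); the dummy image stands in for a real 28 × 28 MNIST or Fashion sample.

```python
def center_crop(img, size):
    """Take the central size x size window of a 2-D list image."""
    top, left = (len(img) - size) // 2, (len(img[0]) - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def avg_pool(img, out):
    """Average-pool a square image to out x out; len(img) must be divisible by out."""
    k = len(img) // out
    return [[sum(img[r * k + i][c * k + j] for i in range(k) for j in range(k)) / (k * k)
             for c in range(out)]
            for r in range(out)]


image = [[r * 28 + c for c in range(28)] for r in range(28)]  # dummy 28 x 28 image
cropped = center_crop(image, 24)      # assumed crop size
features_4x4 = avg_pool(cropped, 4)   # 16 values for 2-/4-class tasks
features_6x6 = avg_pool(cropped, 6)   # 36 values for MNIST-10
flat = [v for row in features_4x4 for v in row]  # fed to the rotation-gate encoder
```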
MNIST-2/4 and Fashion-2/4 use four logical qubits, and MNIST-10 uses 10. To embed the classical images and vowel features into the quantum domain, we flatten them and encode them with rotation gates. For 4 × 4 downsampled images, we use 4 qubits and 4 layers with 4 RY, 4 RZ, 4 RX, and 4 RY gates in each layer, respectively, for a total of 16 gates that encode the 16 classical values as rotation phases. For 6 × 6 images, we use 10 qubits and four layers with 10 RY, 10 RZ, 10 RX, and 6 RY gates in each layer, respectively. For the 10 vowel features, we use 4 qubits and 3 layers with 4 RY, 4 RZ, and 2 RX gates in each layer for encoding. For readout, we measure the expectation values in the Pauli-Z basis and obtain a value in [-1, 1] from each qubit. For 2-class tasks, we sum qubits 0 and 1, and qubits 2 and 3, respectively, to get two values, which are processed by Softmax to get probabilities. For 4- and 10-class tasks, we apply Softmax to the expectation values to obtain probabilities.
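The readout step maps per-qubit $\langle Z \rangle$ values to class probabilities. The sketch follows the rules just described: sum qubits (0, 1) and (2, 3) into two logits for 2-class tasks, and use one qubit per class otherwise, then apply Softmax. The expectation values in the usage lines are made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(z_expvals, n_classes):
    """Map per-qubit <Z> values in [-1, 1] to class probabilities."""
    if n_classes == 2:
        # 2-class: sum qubits (0, 1) and (2, 3) into two logits.
        logits = [z_expvals[0] + z_expvals[1], z_expvals[2] + z_expvals[3]]
    else:
        # 4- and 10-class: one qubit per class.
        logits = list(z_expvals)
    return softmax(logits)


probs2 = class_probs([0.5, 0.5, -0.5, -0.5], n_classes=2)  # logits [1.0, -1.0]
probs4 = class_probs([0.1, 0.9, -0.2, 0.3], n_classes=4)
print(max(range(4), key=probs4.__getitem__))  # 1
```

Because each logit is bounded by the qubit count, the Softmax outputs stay fairly soft; training therefore pushes the expectation values of the correct class toward +1.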
For VQE, the goal is to find the lowest eigenvalue (ground-state energy) of a target molecule by repeatedly measuring the expectation value of the molecule's Hamiltonian (as detailed in Section V). The molecules studied in this work are H₂, LiH, H₂O, CH₄, and BeH₂. We use the Bravyi-Kitaev mapping [bravyi2002fermionic] to transform a molecular Hamiltonian from its fermionic form to its qubit form. H₂, H₂O, LiH, CH₄-6Q, CH₄-10Q, and BeH₂ need 2, 6, 6, 6, 10, and 15 logical qubits, respectively. VQE circuits are searched and trained on classical machines and then deployed on real QC to obtain the eigenvalues.
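The energy estimate is a weighted sum $E = \sum_k c_k \langle P_k \rangle$ over Pauli terms, and each $\langle P_k \rangle$ is estimated from measurement counts. The sketch handles Z-only terms: the expectation is the parity-weighted average over measured bitstrings. X and Y terms would need basis-change gates before measurement (H for X; S† then H for Y) and a separate state preparation, as described above. The 2-qubit Hamiltonian and the counts are made up and do not correspond to any molecule in the paper. Here character $i$ of a bitstring is qubit $i$. Qiskit reports counts with qubit 0 as the rightmost bit, so its keys would need reversing first.

```python
def z_string_expval(counts, qubits):
    """<Z_q1 Z_q2 ...> from measurement counts; character i of a key is qubit i."""
    shots = sum(counts.values())
    total = 0
    for bits, n in counts.items():
        parity = sum(int(bits[q]) for q in qubits) % 2
        total += -n if parity else n
    return total / shots


def energy(terms, counts):
    """Weighted sum of Pauli-term expectations: sum_k c_k <P_k> (Z-only terms)."""
    return sum(coeff * z_string_expval(counts, qubits) for coeff, qubits in terms)


# Made-up 2-qubit Hamiltonian c0*I + c1*Z0 + c2*Z1 + c3*Z0Z1 (not a real molecule).
terms = [(-1.0, ()), (0.5, (0,)), (0.5, (1,)), (0.25, (0, 1))]
counts = {"00": 600, "11": 400}
print(energy(terms, counts))
```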
Quantum Devices and Compiler Configurations. We use IBMQ quantum computers via the Qiskit [ibmq] APIs. We study 10 devices, with 5 to 65 qubits and Quantum Volume from 8 to 128. We also use Qiskit for compilation. The optimization level is set to 2, except for the noise-adaptive and Sabre baselines in Figure 13 and Table III, which use level 3. The searched qubit mapping is passed to the compiler as its ‘initial_layout’. QML/VQE experiments run 8192/2048 shots.
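TABLE I: Statistics of QuantumNAS and baseline circuits.

| Method | Depth | #Gates (#1Q+#CNOT) | #Params | Acc. |
|---|---|---|---|---|
| Noise-Unaware | 237 | 365 (299+66) | 120 | 0.48 |
| Random | 45 | 100 (94+6) | 36 | 0.86 |
| Human | 64 | 135 (124+11) | 36 | 0.88 |
| QuantumNAS | 70 | 133 (123+10) | 36 | 0.89 |
| + Pruning | 59 | 116 (106+10) | 22 | 0.92 |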
Circuit Design Spaces. We select six circuit design spaces, four of them from previous QML work, and name them after their gates:


‘U3+CU3’ – One block has a U3 layer with one U3 gate on each qubit, and a CU3 layer with ring connections, e.g., CU3(0, 1), CU3(1, 2), CU3(2, 3), CU3(3, 0).

‘ZZ+RY’ [lloyd2020quantum] – One block contains one layer of ZZ gates, also with ring connections, and one RY layer.

‘RXYZ’ [mcclean2018barren] – One block has four layers: RX, RY, RZ, and CZ. There is one layer before the blocks.

‘ZX+XX’ [farhi2018classification] – according to their MNIST circuit design, one block has two layers: ZX and XX.

‘RXYZ+U1+CU3’ [henderson2020quanvolutional] – according to their random circuit basis gate set, we design SuperCircuit in which one block has 11 layers in the order of RX, S, CNOT, RY, T, SWAP, RZ, H, , U1 and CU3.

‘IBMQ Basis’ [ibm_2021] – we design a SuperCircuit with the basis gate set of IBMQ devices, in which one block has 6 layers in the order of RZ, X, RZ, SX, RZ, CNOT.
The SuperCircuits for spaces 1 to 4 contain 8 blocks; space 5 has 4 blocks; space 6 has 20 blocks and does not use front sampling. The design spaces contain numerous SubCircuits; e.g., RXYZ+U1+CU3 alone contains an extremely large number of them.
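As a sketch of what the largest circuit in a space looks like, the code below builds the ‘U3+CU3’ SuperCircuit as nested gate lists (blocks, then layers, then (gate name, qubits) tuples) and counts its trainable angles. The list-of-tuples representation is a hypothetical choice for illustration. With 4 qubits and 8 blocks, each block holds 4 U3 and 4 CU3 gates with three angles each.

```python
def u3_cu3_block(n_qubits):
    """One 'U3+CU3' block: a U3 on every qubit, then CU3 gates on a ring."""
    u3_layer = [("u3", (q,)) for q in range(n_qubits)]
    cu3_layer = [("cu3", (q, (q + 1) % n_qubits)) for q in range(n_qubits)]
    return [u3_layer, cu3_layer]


def build_supercircuit(n_qubits, n_blocks):
    """Largest circuit of the space: n_blocks identical full-width blocks."""
    return [u3_cu3_block(n_qubits) for _ in range(n_blocks)]


def n_params(circuit):
    """U3 and CU3 each carry three angles (theta, phi, lambda)."""
    return sum(3 for block in circuit for layer in block for _ in layer)


sc = build_supercircuit(n_qubits=4, n_blocks=8)
print(sc[0][1])      # ring: CU3(0,1), CU3(1,2), CU3(2,3), CU3(3,0)
print(n_params(sc))  # 192
```

Every SubCircuit of this space is obtained by keeping a prefix of these blocks and a subset of each layer's gates, so its parameter count is at most 192.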
Baselines. We compare with six baselines, plus a seventh for VQE: (1) Noise-unaware search: SubCircuits are searched with noise-free simulation, without any noise information. (2) Random generation: with the same gate set, we generate random circuits whose #parameters is constrained to equal that of the QuantumNAS-searched circuit for a fair comparison; we generate three different circuits and report the best. (3) Human design: again with the same #parameters. For the U3+CU3, RXYZ+U1+CU3, and IBMQ Basis spaces, the human design has full width in the first several blocks. For the ZZ+RY, RXYZ, and ZX+XX spaces, we stack multiple blocks as introduced in the original papers. (4) Human design + noise-adaptive mapping: the circuit has the same #parameters as QuantumNAS, and the qubit mapping is optimized with the state-of-the-art technique [murali2019noise]. (5) Human design + Sabre mapping: the circuit has the same #parameters as QuantumNAS, and the qubit mapping is optimized with Sabre [li2019tackling]. (6) Human design (1/2 #Param) + Sabre mapping: similar to (5) but with half the #parameters. (7) For VQE, we have an additional UCCSD [bartlett2007coupled] baseline. For UCCSD of CH₄-10Q and BeH₂, the original circuit cannot be run successfully on IBMQ machines because it has too many gates (10,000), so we take only the first 1,000 gates.
SuperCircuit and SubCircuit Training Setups. For all searched SubCircuits and baselines, we use the same training settings for fair comparison: the Adam optimizer with an initial learning rate of 5e-3 and weight decay of 1e-4, and a cosine learning rate scheduler. We train for 200 epochs with batch size 256 for QML tasks and for 1000 steps with batch size 1 for VQE tasks. For QML, the objective is to minimize the training loss, while VQE minimizes the eigenvalue.
SuperCircuit training uses the same settings as SubCircuit training, except for an added linear learning rate warmup from 0 to 5e-3 in the first 30 epochs for QML and 150 steps for VQE. Restricted sampling is applied throughout training, with the maximum number of differing layers set to seven. An additional technique is to progressively lower the lower bound on the #blocks of sampled SubCircuits to stabilize training. We use an Nvidia TITAN RTX 2080 GPU. The time cost is shown in Figure 5.
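The learning-rate schedule can be written as a small function of the epoch: a linear warmup from 0 to 5e-3 over the first 30 epochs, followed by cosine decay. The exact shape after warmup (cosine decay to 0 over the remaining epochs) is an assumption for this sketch. In PyTorch, a similar schedule could be composed from `LinearLR`, `CosineAnnealingLR`, and `SequentialLR` in `torch.optim.lr_scheduler`.

```python
import math


def lr_at(epoch, total_epochs=200, base_lr=5e-3, warmup_epochs=30):
    """Linear warmup from 0 to base_lr, then cosine decay to 0 (assumed shape)."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


super_schedule = [lr_at(e) for e in range(200)]                 # SuperCircuit (QML)
sub_schedule = [lr_at(e, warmup_epochs=0) for e in range(200)]  # SubCircuits
```

Setting `warmup_epochs=0` recovers the plain cosine schedule used for SubCircuits and baselines.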
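TABLE II: Performance of circuits searched for one device (columns) and run on another (rows).

| Run on \ Searched for | Yorktown | Belem | Santiago |
|---|---|---|---|
| Yorktown | 0.85 | 0.60 | 0.54 |
| Belem | 0.67 | 0.77 | 0.43 |
| Santiago | 0.82 | 0.81 | 0.85 |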
Noise-Adaptive Evolutionary Co-Search Setups. The evolutionary search is conducted with inherited gate parameters on the validation set of QML tasks. For both QML and VQE, the evolution engine runs 40 iterations with a population of 40, a parent population of 10, a mutation population of 20 with mutation probability 0.4, and a crossover population of 10. For the performance estimator, the noise model is obtained from IBM's calibration data, and the noisy simulator is the Qiskit QASM simulator. We also run 8192 shots on simulators.
Iterative Pruning Setups. The searched SubCircuit is first trained from scratch. For pruning, we set five final pruning ratios: 0.1, 0.2, 0.3, 0.4, and 0.5. The starting ratio is 0.05. Pruning starts at step 0 and ends at half of the total steps. We report the highest measured accuracy among the five ratios.
IV-B Experimental Results
Results on Four- and Two-Class Classification. Figure 13 shows the accuracy measured on IBMQ-Yorktown (5Q) for QuantumNAS and 6 baselines on 5 QML tasks in 6 design spaces. QuantumNAS achieves over 85% 4-class and 95% 2-class accuracy and consistently outperforms the baselines, except for Vowel-4 in the ZZ+RY space and MNIST-4 in the ZX+XX space. Statistics for Fashion-2 in the U3+CU3 space are listed in Table I. The noise-unaware search optimizes only noise-free accuracy, which results in a deep circuit (depth 237) with low measured accuracy. U3+CU3, RXYZ, RXYZ+U1+CU3, and IBMQ Basis are better spaces, as they always outperform the remaining two design spaces, and are thus considered more noise-resilient. In addition, pruning brings an average improvement of 2% for 4-class and 1% for 2-class tasks. When a searched circuit contains only a small number of parameters, such as 7 for Vowel-4 in ZZ+RY, removing any parameter hurts accuracy. For circuits with more parameters, such as 36 for MNIST-4 in U3+CU3, the pruning ratio can reach 50% while increasing accuracy by 4%. Although the IBMQ Basis space is larger than U3+CU3, its accuracy is sometimes lower. Hence, a larger design space does not necessarily bring better final performance, because it is harder to search.
Method                  Optimization Level 2                Optimization Level 3
                        York.  Bel.  Qui.  Ath.  Sant.      York.  Bel.  Qui.  Ath.  Sant.
Est. (noisy simulator)  0.85   0.77  0.84  0.82  0.77       0.69   0.63  0.82  0.84  0.86
Real QC                 0.66   0.80  0.76  0.77  0.73       0.70   0.54  0.72  0.84  0.85
(York., Bel., Qui., Ath., Sant.: IBMQ-Yorktown, Belem, Quito, Athens, Santiago.)
Results on Different Quantum Devices and Noise. Figure 14 shows QuantumNAS performance on various devices. For each task, the QuantumNAS SubCircuits for all devices are searched from the same SuperCircuit, but with a noise model tailored to each device. Even on the least noisy machine, IBMQ-Santiago, QuantumNAS delivers 5% higher accuracy on average. We also report the accuracy of QuantumNAS circuits tested 3 weeks after the search; it is slightly lower than when tested immediately but still much higher than the baselines. Therefore, QuantumNAS circuits remain noise-resilient even when a machine's noise characteristics change. Results on Athens are unavailable because the device has since been retired. Table II shows the performance of circuits searched on one device and run on another. The best performance is achieved when the search device and the execution device are the same, which shows the necessity of device-specific circuits.
Scalability. We further evaluate QuantumNAS with larger circuits on larger machines. Figure 15 shows circuits with 15, 16, 21, and 21 qubits searched in the U3+CU3 space for machines with 15, 16, 27, and 65 qubits, respectively. For the 21-qubit model, the SuperCircuit contains 1 block. QuantumNAS achieves over 5% higher accuracy. For even larger circuits, whose classical simulation is infeasible, the whole pipeline can be moved to quantum machines: SuperCircuit and SubCircuit training can use the parameter-shift rule, and the evolutionary search can evaluate SubCircuits directly on quantum machines. Table III compares using real QC devices with using the noisy simulator to evaluate SubCircuits during search, under Qiskit optimization levels 2 and 3. Because of queuing, we can only afford 20 search iterations, which take 3 days. The accuracy obtained with real-QC evaluation is similar to that obtained with the simulator. In addition, optimization level 3 does not consistently improve accuracy over level 2, because QuantumNAS has already found a good mapping that is hard to optimize further.
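The parameter-shift rule computes the exact gradient with respect to a rotation angle from two evaluations of the same circuit at shifted angles, which is why training can run on quantum hardware. For a gate generated by a Pauli operator, such as RY(θ) = exp(-iθY/2), the rule is df/dθ = [f(θ + π/2) - f(θ - π/2)] / 2. The NumPy sketch below checks it on a single qubit, where f(θ) = ⟨0|RY(θ)† Z RY(θ)|0⟩ = cos θ; on a real device each f would instead be estimated from measurement shots. The function names are illustrative.

```python
import numpy as np

Z = np.diag([1.0, -1.0])


def ry(theta: float) -> np.ndarray:
    """Single-qubit RY rotation matrix (real-valued)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])


def expval_z(theta: float) -> float:
    """<Z> after applying RY(theta) to |0>; analytically equal to cos(theta)."""
    psi = ry(theta) @ np.array([1.0, 0.0])
    return float(psi @ Z @ psi)


def parameter_shift_grad(f, theta: float, shift: float = np.pi / 2) -> float:
    """Gradient from two shifted evaluations; exact for Pauli-generated gates."""
    return (f(theta + shift) - f(theta - shift)) / 2


theta = 0.7
print(parameter_shift_grad(expval_z, theta), -np.sin(theta))
```

Unlike finite differences, the shift is large (π/2), so the result is exact rather than an approximation and is less sensitive to shot noise.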
Results on VQE Tasks. Figure 16 shows the VQE performance for H₂ in different spaces, measured on IBMQ-Yorktown. The theoretical optimal value is -1.85. The eigenvalues estimated by QuantumNAS are consistently lower than those of all baselines. The UCCSD ansatz baseline is far from the optimal value because it is not adapted to the hardware noise. Pruning removes 50% of the parameters in all five circuit design spaces and steadily lowers the eigenvalues, suggesting that VQE circuits have more redundancy than QML circuits and are therefore more amenable to pruning. Figure 17 further compares QuantumNAS and UCCSD on LiH (6Q), H₂O (6Q), CH₄ (4Q and 10Q), and BeH₂ (15Q) on machines with 7, 15, and 27 qubits. Besides achieving lower measured expectation values, QuantumNAS also lowers the noise-free trained values. For H₂O, the noise-free trained expectation value of UCCSD is -49.6, while that of QuantumNAS is -52.4, indicating that the QuantumNAS ansatz adapts to both the device and the molecule: it is a hybrid device-and-problem ansatz.
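Lower values are better in these VQE results because of the variational principle: the energy ⟨ψ(θ)|H|ψ(θ)⟩ of any trial state is an upper bound on the ground-state energy, the smallest eigenvalue of H. The NumPy sketch below illustrates this with a hypothetical two-qubit Hamiltonian H = Z⊗Z + 0.5 X⊗X (not one of the molecular Hamiltonians above) and a one-parameter ansatz cos(θ/2)|01⟩ + sin(θ/2)|10⟩, which RY(θ) on qubit 0 followed by a CNOT and an X on qubit 1 would prepare from |00⟩. For simplicity, the energy is minimized by a grid scan rather than by a gradient-based optimizer.

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
# Hypothetical toy Hamiltonian; basis order |00>, |01>, |10>, |11>.
H = np.kron(Z, Z) + 0.5 * np.kron(X, X)


def ansatz_state(theta: float) -> np.ndarray:
    """cos(theta/2)|01> + sin(theta/2)|10>, written as amplitudes directly."""
    psi = np.zeros(4)
    psi[1] = np.cos(theta / 2)
    psi[2] = np.sin(theta / 2)
    return psi


def energy(theta: float) -> float:
    """Expectation <psi|H|psi>; for this ansatz it equals -1 + 0.5*sin(theta)."""
    psi = ansatz_state(theta)
    return float(psi @ H @ psi)


thetas = np.linspace(-np.pi, np.pi, 721)
energies = [energy(t) for t in thetas]
best = min(energies)
exact = np.linalg.eigvalsh(H)[0]  # smallest eigenvalue = ground-state energy
print(best, exact)
```

Every trial energy lies at or above the exact ground energy, so a circuit that reaches a lower measured value is closer to the true ground state.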
IV-C Performance Analysis
Accuracy Improvement Breakdown. Figure 18 breaks down the accuracy improvements on five tasks and five design spaces. We compare the QuantumNAS co-search with three baselines: (1) a human-designed baseline with no circuit or qubit mapping search, (2) noise-adaptive mapping search only, and (3) noise-adaptive circuit search only. Searching only the circuit improves accuracy more than searching only the qubit mapping, because the circuit search space is much larger, which echoes our motivation. Co-designing both yields a further 9% accuracy gain on average. Figure 3 further shows accuracy versus the number of parameters: QuantumNAS mitigates the negative effect of gate errors and delays the accuracy peak, so circuits with more parameters remain effective.
Effect of Front and Restricted Sampling. Figure 19 shows the measured performance of SubCircuits on five tasks in the ZX+XX and RXYZ+U1+CU3 spaces. Because front and restricted sampling make the estimated relative performance of SubCircuits more reliable, the searched SubCircuit is closer to the optimal one and achieves 12% higher final accuracy on average.
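This section does not restate how the two sampling strategies work, so the sketch below encodes one plausible reading of them; it is not the authors' implementation. It assumes a SuperCircuit of `n_blocks` blocks with `n_layers` layers each. Front sampling keeps only a prefix (the first few blocks and, within each kept block, its first few layers), so a SubCircuit is fully described by a list of per-block layer counts. Restricted sampling, as assumed here, rejects a new sample whose layer counts differ from the previous step's SubCircuit by more than `max_diff` layers in total. All names and default values are hypothetical.

```python
import random
from typing import List, Optional


def front_sample(n_blocks: int, n_layers: int, rng: random.Random) -> List[int]:
    """Keep the first b blocks; entry i is how many front layers of block i are kept."""
    b = rng.randint(1, n_blocks)
    return [rng.randint(1, n_layers) for _ in range(b)]


def layer_distance(a: List[int], b: List[int]) -> int:
    """Total difference in per-block layer counts; a missing block counts as 0 layers."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return sum(abs(x - y) for x, y in zip(a, b))


def restricted_sample(prev: Optional[List[int]], n_blocks: int, n_layers: int,
                      rng: random.Random, max_diff: int = 4,
                      max_tries: int = 1000) -> List[int]:
    """Rejection-sample a front-sampled SubCircuit close to the previous one."""
    if prev is None:
        return front_sample(n_blocks, n_layers, rng)
    for _ in range(max_tries):
        cand = front_sample(n_blocks, n_layers, rng)
        if layer_distance(prev, cand) <= max_diff:
            return cand
    return list(prev)  # fallback: reuse the previous SubCircuit


rng = random.Random(0)
prev = None
history = []
for step in range(50):  # one SubCircuit per SuperCircuit training step
    prev = restricted_sample(prev, n_blocks=8, n_layers=4, rng=rng)
    history.append(prev)
print(history[:5])
```

Keeping consecutive SubCircuits similar means the shared SuperCircuit parameters are updated by closely related architectures from step to step, which is the intuition behind more reliable relative performance estimates.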
Effect of Qubit Topology, Error Rate, and Qubit Mapping on Performance and Design Choices. Figure 20 shows MNIST-4 and VQE performance on devices with various topologies and error rates. We make four observations: (1) Among Santiago, Rome, and Athens, which share the same topology, a lower error rate brings better performance. (2) Comparing Rome (‘–’) with Lima (‘T’), and Quito (‘T’) with Yorktown (‘+’), the ‘T’ topology performs better than the other two under similar error rates. (3) For qubit choices (mapping), the co-searched mapping consistently outperforms the naïve mapping. (4) For co-search design choices, the average number of iterations to convergence is 13.5, 14, and 9.2 for the ‘T’, ‘+’, and ‘–’ topologies, respectively. Therefore, co-search on ‘T’ and ‘+’ machines needs a relatively larger number of iterations, possibly because their connectivity is more complicated than that of the ‘–’ topology.
Search in Small Design Space. We construct a small U3+CU3 space that is not divided into multiple blocks, so every SubCircuit can be sampled arbitrarily without front sampling. Its circuit depth is around 40. Table IV compares it with the larger multi-block space. The small space never achieves higher accuracy and is often much worse: although small circuits accumulate less noise, they also have smaller learning capacity. QuantumNAS can find a better tradeoff between noise and capacity.
Device    Space  MNIST-4     Fashion-4   Vowel-4     MNIST-2     Fashion-2
                 D    Acc.   D    Acc.   D    Acc.   D    Acc.   D    Acc.
Santiago  Small  50   0.55   58   0.56   35   0.27   28   0.94   30   0.87
Santiago  Ours   73   0.77   107  0.85   116  0.47   191  0.95   74   0.91
Belem     Small  29   0.54   30   0.57   35   0.27   28   0.94   30   0.87
Belem     Ours   50   0.58   68   0.77   77   0.46   81   0.94   62   0.90
Yorktown  Small  29   0.60   30   0.56   39   0.27   51   0.91   30   0.89
Yorktown  Ours   71   0.71   82   0.85   119  0.40   83   0.93   70   0.89
(D: circuit depth; Acc.: measured accuracy.)
Random Search vs. Evolutionary Search. Multiple candidate algorithms are applicable to the search stage. Figure 21 compares evolutionary search with random search. The best performance found by random search quickly saturates, while evolutionary search finds SubCircuit and qubit-mapping pairs with lower loss, which deliver higher accuracy.
Effect of Pruning Ratios. Figure 22 shows the effect of different final pruning ratios for MNIST-2 in the ZZ+RY space and Fashion-2 in the U3+CU3 space. As the final ratio increases, accuracy peaks at a sweet spot where the benefit of fewer gate errors outweighs the cost of reduced circuit capacity.
V Related Work
V-A Quantum Machine Learning
Quantum machine learning (QML) [biamonte2017quantum, lloyd2020quantum, lloyd2016quantum, lloyd2014quantum, havlivcek2019supervised, liang2021can, wang2021exploration] explores the training and evaluation of ML models on quantum devices. QML models have been shown to offer potential speedups over their classical counterparts in various tasks, including metric learning [lloyd2020quantum], data analysis [lloyd2016quantum], and principal component analysis [lloyd2014quantum]. Modern QML models use variational quantum circuits with trainable parameters, known as quantum neural networks (QNNs). Various theoretical QNN formulations have been proposed, e.g., the quantum classifier [farhi2018classification], quantum convolution [henderson2020quanvolutional], and the quantum Boltzmann machine [amin2018quantum]. Most prior works are exploratory and rely on classical simulation of small quantum systems [farhi2018classification]. Several works also search for circuits [zhang2020differentiable, zhang2021neural, du2020quantum, lu2020markovian], but they neither perform noise-adaptive co-search of the circuit and qubit mapping nor evaluate extensively on real QC devices, as QuantumNAS does.
V-B Noise-Adaptive Quantum Compiling
A quantum compiler translates a quantum program written in a high-level programming language into hardware instructions. For NISQ systems [preskill2018quantum], this translation needs to be noise-adaptive, and many noise-adaptive quantum compilers have been proposed. For example, various gate errors can be suppressed by dynamical decoupling [hahn1950spin, viola1999dynamical, biercuk2009optimized, lidar2014review], composite pulses [merrill2014progress, low2014optimal, brown2004arbitrarily], randomized compiling [wallman2016noise], hidden inverses [zhang2021hidden], qubit mapping [murali2019noise, tannu2019not, li2019tackling], instruction scheduling [murali2020software, wu2020tilt], and frequency tuning [versluis2017scalable, helmer2009cavity, ding2020systematic]. The key to most of these techniques is finding opportunities for local error cancellation within a given quantum circuit. Instead, we search for the pair of quantum circuit and qubit mapping that is most resilient to noise. The flexibility to change the quantum circuit itself gives us more freedom to build robustness into quantum algorithms.
V-C Quantum Simulation
Beyond ML, variational circuits can also be used to explore challenging quantum many-body physics problems. The first implementation of variational circuits was the Variational Quantum Eigensolver (VQE) [peruzzo2014variational, mcclean2016theory, kandala2017hardware] for quantum simulation of physical systems. Prior work showed that finding a suitable ansatz and learning its parameters is challenging. Different classes of ansatz designs have been proposed: (1) A problem ansatz [peruzzo2014variational, o2016scalable] is adapted to a target problem; for example, the UCCSD ansatz [bartlett2007coupled] is derived from the structure of the quantum system using computational chemistry models. (2) A hardware ansatz [kandala2017hardware] is adapted to the properties of the computing hardware. Problem ansätze have been shown to be typically more resilient to barren plateaus than hardware ansätze [mcclean2018barren]. QuantumNAS aims to find a balanced and robust ansatz design via SuperCircuit-based search.
VI Conclusion
We propose QuantumNAS, a framework for noise-adaptive co-search of the most robust variational circuit and qubit mapping. We leverage SuperCircuit-based search to explore a large design space efficiently. Iterative pruning further removes redundant gates from the searched circuits. Extensive experiments on QML and VQE tasks demonstrate the higher robustness and performance of QuantumNAS-searched circuits over baseline designs. We also open-source our circuit training library as a convenient infrastructure for future variational quantum research.