Intel Quantum Simulator: A cloud-ready high-performance simulator of quantum circuits

by   Gian Giacomo Guerreschi, et al.

Classical simulation of quantum computers will continue to play an essential role in the progress of quantum information science, both for numerical studies of quantum algorithms and for modelings noise and errors. Here we introduce the latest release of the Intel Quantum Simulator (IQS), formerly known as qHiPSTER. The high-performance computing (HPC) capability of the software allows users to leverage the available hardware resources provided by supercomputers, as well as available public cloud computing infrastructure. To take advantage of the latter platform, together with the distributed simulation of each separate quantum state, IQS offers the possibility of simulating a pool of related circuits in parallel. We highlight the technical implementation of the distributed algorithm and details about the new pool functionality. We also include some basic benchmarks (up to 42 qubits) and performance results obtained using HPC infrastructure. Finally, we use IQS to emulate a scenario in which many quantum devices are running in parallel to implement the quantum approximate optimization algorithm, using particle swarm optimization as the classical subroutine. The results demonstrate that the hyperparameters of this classical optimization algorithm depends on the total number of quantum circuit simulations one has the bandwidth to perform. The Intel Quantum Simulator has been released open-source with permissive licensing and is designed to simulate a large number of qubits, to emulate multiple quantum devices running in parallel, and/or to study the effects of decoherence and other hardware errors on calculation results.



There are no comments yet.


page 3

page 13


Fast simulation of quantum algorithms using circuit optimization

Classical simulators play a major role in the development and benchmark ...

Establishing the Quantum Supremacy Frontier with a 281 Pflop/s Simulation

Noisy Intermediate-Scale Quantum (NISQ) computers aim to perform computa...

HybridQ: A Hybrid Simulator for Quantum Circuits

Developing state-of-the-art classical simulators of quantum circuits is ...

Cache Blocking Technique to Large Scale Quantum Computing Simulation on Supercomputers

Classical computers require large memory resources and computational pow...

Faster Schrödinger-style simulation of quantum circuits

Recent demonstrations of superconducting quantum computers by Google and...

Long-time simulations with high fidelity on quantum hardware

Moderate-size quantum computers are now publicly accessible over the clo...

A C++17 Thread Pool for High-Performance Scientific Computing

We present a modern C++17-compatible thread pool implementation, built f...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

In the past decade there has been steady progress toward building a viable quantum computer that can be used to solve problems that classical computers cannot. Because quantum hardware is still in its infancy, the simulation of quantum algorithms on classical computers will continue to be an important and useful endeavor. This is because many technologically and scientifically relevant questions are too difficult or impractical to be answered analytically. Though many useful conclusions can be demonstrated analytically, such as the proven speedup of Shor’s algorithm compared to classical factoring algorithms Shor1999 or Grover’s algorithm on unstructure database search Grover1997 , most algorithmic analysis does benefit from numerical experiments.

The first area where numerical simulation is useful is to evaluate the performance of parameters and hyper-parameters used in quantum algorithms. For instance, the simulation of physics and chemistry problems involved many choices regarding how to encode the problem to a set of qubits Sawaya2019a

, for which it is usually not obvious which approach will be most efficient without performing numerics. Further, most variational algorithms — whether the variational quantum eigensolver for finding Hamiltonian eigenvalues

Peruzzo2014 or variants of the quantum approximate optimization algorithm for solving combinatorial problems Farhi14_qaoa_orig

— involve a classical heuristic optimization routine that, for all intents and purposes, must be analyzed numerically.

The other primary reason for numerical simulation of quantum algorithms is to study the effects of errors. Despite enormous progress in reducing the effects of environmental noise and in perfecting the fidelities of gate operations, it appears certain that all near-term quantum devices will exhibit errors that cannot be corrected without sophisticated error correction schemes such as the surface code Kitaev2003 . This highlights the need for numerical simulations of quantum algorithms running on error-prone hardware Sagastizabal2019 ; Guerreschi2019 . Such simulations not only help draw conclusions about the robustness of particular algorithmic choices, but can also guide hardware design and gate compilation Tannu2019 ; Lao2019a , since different choices may lead to qualitatively different errors.

Quantum circuits are hard to simulate classically since the computational cost scales exponentially with the number of qubits. Notably, though there are classes of algorithms that scale more favorably for some set of quantum circuits — such as tensor network

Markov2008_tn ; Fried2018_qtorch ; McCaskey2016_tnqvm ; gray2018_quimb ; Villalonga2019

or path integral methods — these methods still scale exponentially in the general case. Several high performance quantum circuit simulators have been reported, including full state vector codes built for CPUs

Niwa2002 ; DeRaedt2007 ; Smelyanskiy2016 ; Haner2017 ; Khammassi2017 ; LaRose2018 ; Jones2019 ; DeRaedt2019_48qub and/or graphics processing units Gutierrez2010 ; Amariutei2011 ; Zhang2015_gpu ; Haner2017 ; Savran2018_shor_gpu , and those that use a mix of algorithm types Pednault2017 ; Chen2018_64qubit .

In this manuscript we present a new version of the Intel Quantum Simulator (IQS) and use it to emulate a hybrid quantum-classical algorithm. Also known as qHiPSTER or the Quantum High Performance Software Testing Environment, IQS is a massively parallel simulator of quantum algorithms expressed in the form of quantum circuits. Its original version was coded for High Performance Computing environments, with the goal of allowing large scale simulations. In 2016 it was used to simulate the full state of 40 qubits Smelyanskiy2016 .

The second version of Intel Quantum Simulator has just been released, open source, at the address While preserving the HPC core of the original implementation, it includes several new features that extend its application to cloud computing environments. In particular, Intel Quantum Simulator can divide the allocated computational resources into groups

, each dedicated to the simulation of a distinct quantum circuit. The communication between these group of processes is minimal, and each separate group uses the distributed implementation from the original release to store and manipulate its quantum state. We expect that at least two use cases will profit massively from this extension. First, when multiple circuits or variants of the same circuit have to be run in parallel (think for example of variational algorithms in conjunction with classical optimizers like the genetic algorithm or particle swarm optimizers). And second, when stochastic methods are used to include noise and decoherence in the simulation. One can easily envision situations as those just mentioned in which a pool of hundreds or thousands of states is required to accelerate simulation. Cloud computing platforms, with (tens of) thousands of nodes, are an ideal choice to run these workloads.

In addition to the new features just discussed, the new release includes robust unit testing to verify the proper installation and functioning of IQS and, for developers, to test the compatibility of novel features with the released code. Finally, we focused on lowering the user’s learning barrier with an automatic installation process and extended tutorials, and improving the simplicity of use by providing Python integration and a Docker container option. Our goal in releasing this version of the software is that IQS may be used as a standalone program or as a backend to other quantum computing frameworks like Xanadu’s Pennylane Pennylane , IBM’s Qiskit Qiskit , Rigetti’s Forest Forest , Google’s Cirq Cirq , Microsoft’s Azure Quantum Azure , ProjectQ Projectq , Zapata’s Orquestra Orquestra , Amazon’s Braket Braket , and others.

In this article, we begin by describing the basic usage of IQS and its software structure. In section III, we present benchmarks for large scale simulations of up to 42 qubits. Section IV describe two situations that take advantage of simulating a pool of circuits: the emulation of a variational protocol that uses many quantum processors in parallel and the simulation of circuits exposed to noise. Finally we draw conclusion and provide an outlook.

Ii Software description

Intel Quantum Simulator, both in its initial version and latest release, takes advantage of the full resources of an HPC system, due to the shared and distributed memory implementation. The first situation is when several processors, or a processor with multiple computing cores, have access to the same memory and the operations need to be performed without a specific sequential order. This opportunity for parallelism is best exploited with OpenMP. The second opportunity for parallelism arises when a relatively small amount of memory requires a lot of computation or, as in the case of storing quantum states, a large amount of memory cannot fit in a single machine or node. In this case, one needs to explicitly consider the communication pattern between the different processes and adopting the Message Passing Interface is a necessity.

In the new release of IQS we have set the MPI environment for allowing multiple quantum circuits to be simulated in parallel. We divide the computing processes into groups, each dedicated to storing a single quantum state and update it according to the action of a specific circuit. Each new state can still profit from the shared and the distributed implementation of the code (MPI+OpenMP), which has been previously implemented in the original version of the simulator. The use cases are illustrated in Fig. 1 where the graphics clarifies that a single state can be stored using all nodes, a subset of them, or even part of a single computing node. In this section we first introduce the basic methods that allow IQS to initialize, evolve and extract information from quantum states, then we discuss the distributed implementation of a single state and finally explain the parallel simulation of multiple circuits.

Figure 1: Depending on the nature of the research question being considered, there are several ways to run IQS. If one needs to simulate a single instance of a circuit for the highest possible qubit count, then one would distribute the quantum state across all available computing nodes or sockets (left panel). If one needs to consider a pool of states for either simulating many circuits in parallel or describe noise effects via stochastic simulations (or possibly both situations at the same time), then one may simulate multiple distributed (center panel) or local (right panel) quantum states in parallel.

Scope and basic use

IQS is designed to simulate the dynamics of multi-qubit states in quantum circuits. The state is assumed to be pure and its unitary dynamics a consequence of the application of one- and two-qubit operations. IQS provides methods to extract information from the state at any intermediate point or at the end of quantum circuits. There are three fundamental parts in the simulation of quantum circuits: state initialization, evolution, and measurement. We illustrate the basic methods of IQS with short snippets of code which may have been extracted from a longer C++ program. The same syntax applies to the Python interface of IQS111The Python interface of IQS is currently limited to single-process execution (i.e. no MPI)..

The central object on IQS is called backcolorQubitRegister and can be thought as the quantum state of the qubits composing the system of interest. In its declaration, we need to specify the number of qubits in order to allocate a sufficient amount of memory to describe their state. Then we can initialize the state to any computational basis state, uniquely identified by its index. Other methods allow for initializing the state randomly or, for example, as the balanced superposition of all computational states.

1  int num_qubits = 4;
2  QubitRegister<std::complex<double>> psi(num_qubits);
3  int index = 0;
4  psi.Initialize(num_qubits, ”base”, index);

The dynamics is generated by applying one- and two-qubit gates. IQS gives the option of defining custom gates, but also provides a large choice of the most common gates like single-qubit rotations or the Hadamard gate. The two qubit gates are in the form of controlled one-qubit gates, meaning that the desired operation is applied to the target qubit conditionally on the control qubit being in . The special case of the conditional Pauli X, also called CNOT gate, clarifies why there is no need of arbitrary two-or multi-qubit gates: any unitary evolution can be approximated to arbitrary precision by a sequence of one-qubit and CNOT gates. IQS is suitable for implementing multi-qubit operations222For example, the latest IQS release include methods specialized to the emulation of QAOA circuits which are reduced to one-qubit gates and a single global operation per QAOA step., but the definition of custom multi-qubit gates requires a very good understanding of its internal implementation (see next subsections).

5  // One qubit gates: Pauli X on qubit 0 and Hadamard on qubits 2 and 3
6  psi.ApplyPauliX(0);
7  psi.ApplyHadamard(2);
8  psi.ApplyHadamard(3);
9  // Two-qubit gate of conditional form: apply Pauli X on qubit 1 conditioned on qubit 2
10  int control_qubit = 2;
11  int target_qubit = 1;
12  psi.ApplyCPauliX(control_qubit,target_qubit);

Finally the qubits can be measured in the computational basis one at a time or in larger groups. Differently from actual realization of the quantum experiment in which the measurement output only one of the possible outcomes, with simulators one can compute the full statistics of outcomes without the need of re-running the experiment. For example, IQS provides methods to compute the probability of finding a certain qubit in state

or evaluate the expectation value of multi-qubit observables like products of Pauli matrices (not shown below).

15  // Get the probability of observing qubit 1 in state |1>
16  int measured_qubit = 1;
17  double probability = psi.GetProbability(measured_qubit);
18  // The expectation value of <Z1> can be computed from the above probability
19  double expectation = -1 * probability + 1 * (1-probability);

Distributed implementation

Here we describe how the quantum state is defined and stored inside the IQS objects. The current algorithm is based of the original implementation of the Intel Quantum Simulator Smelyanskiy2016 , so here we summarize it in order to provide a self-contained description of our simulator.

To fully describe an arbitrary state of qubits, one needs to store complex numbers corresponding to the probability amplitudes with respect to the computational basis. For as low as 30, simply storing all the amplitudes fills up  GB of memory ( bytes per double-precision number and a factor since the probability amplitudes are complex). To enable the fast simulation of circuits involving more than 30 qubits, one needs to divide the state between multiple processes, each with its own dedicated memory. IQS assumes that processes are used, each storing amplitudes and satisfying . If is not a power of two, IQS considers an effective number of nodes equal to with .

As we explain in the next paragraphs, all operations involving only the first qubits do not require communication between processes, while MPI communication is needed when performing operations on the last qubits. Therefore we refer to the qubits with index as “local” and those with index as “global”. However it is important to realize that even the partial state of a local qubit can be fully known only by accessing all amplitudes distributed among all processes.

Figure 2: Left panel: Quantum state of three qubits stored as a vecotor of complex amplidutes : . The state is distributed over 2 processes, each storing half of the amplitudes. Right panel: Illustration of the computation scheme to simulate one-qubit gates. Observe the qualitative change depending on the qubit involved in the operation: At a critical qubit number, communication between memory spaces (i.e. processes or MPI ranks) is required. In this 3-qubit case, communication between memory spaces is required for but not for .

It is informative to analyze how one-qubit gates are implemented in IQS. Any quantum state can be written as a vector with complex entries , so it is convenient to express the index in binary notation as with and such that . In this way it is straightforward to obtain both the process number and index of the local memory corresponding to the -th amplitude :


Consider the one-qubit gate acting on qubit and defined by the unitary matrix :


Its action on the quantum state can be written as


where refer to any bit value. The expression above means that the entries are updated in pairs independently of the values of the other amplitudes. From eq. (1) it is clear that when the connected pairs of entries are stored in the memory of the same process and can be updated without inter-process communication, as illustrated in Fig. 2.

The situation differs when . In this case the two entries belong to the memory of two distinct processes, specifically to those with index and respectively (here is such that ). Inter-process communication is therefore required and we adopt the same scheme as in the original IQS implementation Smelyanskiy2016 ; Trieu2009 . It is briefly summarized in Fig. 3 and its caption.

Figure 3: Communication scheme for implementing a one-qubit gate on qubit , with . Time flows from left to right. Step 1: Each MPI task sends half of its local memory to its communication partner, identified by an index difference of . Step 2: The computation is equally split between the two processes and involves only local memory. Step 3: At the end the updated information is sent back and the state is updated. This follows the original IQS implementation Smelyanskiy2016 that was first described in Trieu2009 .

In addition to one-qubit gates, IQS implements distributed two-qubit gates of the controlled form, meaning that the one-qubit gate is applied to target qubit conditionally on the control qubit being in state . The communication pattern depends on the control and target qubits being local or global and if or not.

Pool of multiple states

The most consequential change in the IQS implementation compared to its original release is the ability of dividing the processes into groups using the MPI function backcolorMPI_Comm_create_group. Each group can be used to store a quantum state, possibly in a distributed way if the group itself is composed by more than one process. Now, when a backcolorQubitRegister object is created, it actually initializes a state in each group of processes: we call “pool” the collection of such states. In addition, when a method of the form backcolorApplyGate is called, the gate is actually applied to each and every state of the pool.

We clarify this concept with a concrete example. Consider that we have 80 processes and want to work with 10 quantum states. One option is to group all processors together and declare 10 backcolorQubitRegister objects: here, each state is distributed over processes (since 80 is not a power of 2) and the circuit’s gates must be specified for each of the 10 states separately. The second option is to divide the processes in 10 groups of 8 processes each and create a single backcolorQubitRegister object: here, each state is distributed over processes and each gate is by default applied to every state.

There are two important observations: the first one is that defining a non-trivial pool of states may take advantage of the available processes in a more effective way. The second is that each state of the pool is naturally subjected to the same quantum circuit. The latter characteristic, if strictly enforced, would make the simulations redundant: we would simulate over and over the same identical evolution. However it is possible to differentiate the applied circuit for each of the state in the pool and the relevant commands are discussed in appendix D. Moreover, there are cases in which simulating closely related circuits is required and what seemed a limitation actually becomes a beneficial feature.

Here we discuss two of these situations and present the corresponding results in section IV. Code snippets are discussed in appendix D.

  • In Variational Quantum Algorithms (VQA) a quantum circuit composed of parametric gates is optimized to prepare states with desired properties, often related to having large overlap with the ground state of certain observables. During the optimization, the same circuit is simulated over and over with the only difference being the value of its parameters (think of them as the angle of one-qubit rotations). Within the pool functionality, it is easy to assign different parameter values to the circuit simulated by the distinct states in the pool. This approach greatly speed-up the overall simulations of VQA protocols based on several classes of optimizers, like genetic algorithm, swarm particle optimization, or gradient-based methods.

  • IQS is a simulator of unitary dynamics in which each state is pure. Nonetheless it is possible to use IQS to simulate the effect of noise and decoherence during the circuit by mean of introducing stochastic perturbations to the ideal circuit and averaging over the ensemble of “perturbed” circuits Smelyanskiy2016 . Formally, this approach is based on the unraveling of master equations into stochastic Schröedinger equations in the circuit-model formalism Bassi2008 and corresponds to the introduction of additional “noise gates” in the form of one-qubit rotations with stochastic rotation angles. IQS provides specialized methods to apply these noise gates that automatically varies their rotation angles over the pool’s states.

Iii Scaling Experiments

In Ref. Smelyanskiy2016 , Smelyanskiy and coauthors analyzed the simulator performance on the distributed system Stampede provided by the Texas Advanced Computing Center (TACC). They demonstrated the weak and the strong scalability of the code up to 40 qubits using

compute nodes. They also performed single and multi-node performance measurements of one- and two-qubit gates, and of a complete quantum circuit, namely the Quantum Fourier Transform. Since the core implementation of IQS did not change for the latest release, we consider those results still valid.

In the current release of IQS, we provided sample codes to run one-qubit gate benchmark on HPC systems for analyzing the strong and the weak scaling of the simulator. We have used such sample scripts to run the following experiments launched on the SuperMUC-NG333 HPC system hosted by the Leibniz Supercomputing Center of the Bavarian Academy of Science (LRZ). SuperMUC-NG consists of Intel® Xeon® Scalable Processor 8174 CPU cores and 719 terabytes of total main memory. Each single node is a two sockets system of cores each with GB of shared memory. In the next sections, we present strong and weak scaling results up to nodes, which corresponds to cpu cores and TB of total distributed memory.

Strong scaling

In the strong scaling analysis, we fixed the problem size (the number of qubits to run) and scaled up the computational resources. The total speedup is limited by the fraction of the serial part of the code that is not amenable to parallelization. In Fig. 4 we show the speedup of single-qubit operations for simulations of 32-qubit systems. We have implemented a random single qubit gate.

Figure 4: Strong scaling of 32-qubit simulations using from 32 to 512 MPI processes. Each MPI process is running on one socket of SuperMUC-NG, which means there are 2 MPI tasks per node. Each socket is fully populated by 24 OpenMP threads. Left panel: Time to execute a random one-qubit gate as a function of the qubit involved. Different colors correspond to different number of MPI processes. When the gate is executed on qubit (with being the number of processes and ), the communication between the MPI tasks is happening within the nodes (intra-node) and not between the nodes (inter-node). For the later qubits the communication is mostly inter-node. This reflects the peak behavior we observe in our measurement, which has been confirmed also by our MPI ping-pong test benchmark for messages of that size. Right panel: Time to execute a random one-qubit gate on the first () or last () qubit. The computation time difference is due to the additional communication required for the last qubit to be updated.

Weak scaling

In section II we described the internal representation of quantum states used by IQS and highlighted the fact that simulating an extra qubit implies doubling the allocated memory. This consideration can be included in the numerical analysis of the so-called weak scaling. The idea is that, while the number of processes increase, the simulation size also increases and in such a way that the memory amount and computing effort per process stays (ideally) constant.

Figure 5: Weak scaling study of the time needed to apply one-qubit gates depending on the involved qubit. Different colors correspond to different number of MPI processes and number of qubits. For any 2 qubits more in the simulation, we increase the computational resources by a factor of 4. For small qubit indices, we observe exactly the same computational time using larger numbers of qubits and resources. For larger qubit indices, the difference is mostly due to the communication overhead. The peak we observe at qubit index 30 is due to the intra-node communication between the 2 sockets of one node.

We launched simulations of systems from 32 to 42 qubits using from 4 to 4096 processes. The largest job used 2048 nodes of the SuperMUC-NG system. The expectation of a scale-invariant behavior is confirmed by our study and presented in Fig. 5.

Iv Particle Swarm / QAOA Simulation

As quantum hardware improves, it will be possible to run many small-scale quantum computers at the same time. Though there may not be entanglement across the devices, one may run the devices in a parallel fashion in order to more quickly solve variational quantum problems. Each quantum device would be calculating an objective function for a different set of parameters, with a classical optimization step using the results of all the devices.

QAOA with swarm particle optimization

We used IQS to perform this task of simulating many quantum computers running in parallel. The simulation demonstrates one example of the variety of simulation types that may be performed with the software. The variational quantum algorithm we chose is the quantum approximate optimization algorithm (QAOA) Farhi14_qaoa_orig for the Max-Cut problem on 3-regular graphs, an extensively studied problem in the quantum algorithms community Farhi14_qaoa_orig ; Farhi14_qaoaboundoccur ; Farhi16_qsuprqaoa ; Wecker16_training ; Lin16_qaoa ; Guerreschi17_qaoa_opt ; Guerreschi2019 . For the classical optimization procedure we use the particle swarm optimization (PSO) algorithm Kennedy1995 ; Pedersen2010 , where we implement each ‘particle’ as one virtual quantum device.

Figure 6: Results of varying particle count, while running PSO for QAOA/Max-Cut on 3-regular graphs of 18 vertices (18 qubits). Results are averaged from 300 different graphs and random initial conditions. The horizontal axis gives the number of function evaluations, i.e. the number of quantum circuit emulations and evaluations run on IQS. The vertical axis gives the approximation ration, equal to

divided by the exact MaxCut solution. The higher the particle count, the larger the number of function evaluations per PSO time step. Note that the appropriate choice of particle count depends on how many function evaluations one wishes to perform. Standard deviations (not shown in figure) calculated over the set of graphs are large compared to typical difference between means, often higher than 0.05.

The particular PSO implementation we used was taken from reference Pedersen2010 . For each particle we first set random initial positions drawn uniformly from , where these positions are each a set of parameter vectors for particles. Each unique position produces a unique output from the objective function , where is the Max-Cut Hamiltonian. Each particle is given an initial velocity also drawn uniformly from . The particles are propagated for one time step based on their velocities, after the velocities have been updated with the formula


where , , and are arbitrary constants; and

are random numbers drawn each step from uniform distribution [0,1];

is the best position that particle has discovered thus far; and is the best position discovered by the entire swarm. For this work, we set the parameters to , , and , as there is numerical evidence that these parameters perform well for some optimization landscapes and hyperparameters Pedersen2010 . The algorithm is embarrassingly parallel, as the only cooperation between quantum devices involves broadcasting each device’s in order to modify the velocities. The formula shows that after each time step, each new velocity is determined by three terms: a damping term, an acceleration term based on the best previous position of particle , and a second acceleration term based on the best position found so far by the entire swarm.

In the procedure described above, we make the implicit assumption that systematic error does not differ greatly between devices. One feature of variational quantum algorithms is that they are robust to any systematic constant errors inherent to a given device. This is because , i.e. the maximum of the objective function is independent of any systematic error. It will not in general be true that a collection of quantum devices will have nearly equal , but as hardware improves the difference in error between devices is likely to decrease.

Figure 7: Results of varying particle count, while running PSO for QAOA/Max-Cut on 3-regular graphs of 18 vertices (18 qubits). To compare directly between particle counts, results are shown for a fixed number of function evaluations, i.e. a fixed number of quantum circuit emulations run on IQS. Results are averaged from 300 different graphs and random initial conditions. Error bars show standard deviation of the distribution (not uncertainty in the mean). If one is limited in the total number of functional evations one has the bandwidth to perform, then the utility of adding additional particles might be non-monotonic. Notably, standard deviations show substantial overlap between different particle counts.

As the number of total function evaltuations increases, Figure 6 shows the improvement in the Max-Cut approximation ratio, equal to the quotient of and the exact MaxCut solution . denotes the quantum observable that counts the number of cuts for a given graph bipartition and we are therefore interested in finding its largest eigenvalue. Results for several particle counts between 4 and 64 are shown, with the horizontal axis giving the number of function evaluations. Note that the number of functions evaluations per PSO step is equal to the number of PSO particles, meaning that having more particles results in fewer swarm steps for the same number of total function evaluations.

The results show mean approximation values at each step, averaged over 300 random 3-regular graphs of 18 vertices (qubits), each with randomly selected initial conditions for the swarm. A QAOA depth Farhi14_qaoa_orig of was used for all circuits, leading to parameters and hence an 8-dimensional position vector. The standard deviations (from the 300 random graph instances) are omitted in Figure 6, though they are appreciable (often higher than 0.05) and usually larger than the difference between means. Note that we are referring to the standard deviation of the distribution of over many graphs instances, not the uncertainty in the mean. For two snapshots taken at 512, and 5000 total function evaluations, Figure 7 shows the mean as well as the standard deviations of the distribution.

Though the best strategy is to choose a number of particles with the best mean behavior for a given number of function evaluations, the large overlaps between the standard deviations suggest that a clearly superior particle count choice would appear only after many problem instances. The initial guess, on average, tends to favor higher particle counts, which is expected since this amounts to have more initial random guesses. Interestingly, this trend roughly reverses after just a few function evaluations, before being restored again.

Both Figures 6 and 7 show that, if one is limited by the total number of function evaluations, the optimal number of particles is not necessarily the largest number. This non-monotonic behavior is present because, though more particles allow for exploring more of the parameter space, this comes at the cost of needing to calculate more function evaluations per swarm step. However, as the number of total function evaluations increases, more particles appear to strictly produce better approximations to the solution. Figure 7 shows more clearly that the optimal particle count depends on how many function evaluations one is able to perform in the optimization. For instance, if one may only perform 500 evaluations, 10-14 particles are best, but the optimal number of particles increases as the number of allowed function evaltionas increases. Stated differently, the fewer function evaluations are available, the less utility is gained from adding more particles. This may provide a useful heuristic for determining the hyper-parameter of particle count in using PSO for QAOA, though more study is needed on a broader class of graph instances and graph theory problems.

Figure 8: Convergence study of an ensemble of stochastic simulations for simulating noise. The circuit corresponds to a QAOA instance of Max-Cut on 3-regular graphs ( qubits and 4 QAOA steps). We compiled the circuit for a device with all-to-all connectivity via a simple greedy scheduler. The data are taken for the decoherence timescale and , where is a typical gate duration. Upper panel: Convergence of to its noisy value by means of incoherent average over an increasing number of states in the ensemble. Each ensemble state is obtained by simulating the circuit with the addition of the noise gates according to the schedule. The circuit’s parameters have been optimized with the PSO method. The different lines show the same averaging procedure but with different streams of random numbers and suggest that convergence requires hundreds of states in the ensemble, for the parameters used here. For reference, the red dashed line indicates the expectation value for noiseless simulations. Lower panel: As above but with circuit’s parameters initialized randomly from a uniform distribution.

Convergence of noisy simulations

QAOA is one of the leading candidates to achieve quantum advantage on noisy near-term quantum devices. To evaluate its performance in realistic experiments, it is fundamental to include the effect of noise and decoherence into its protocol. There are two distinct effects of noise that combine in QAOA protocols: the first one is that the expected state

is not achieved in practice but a hopefully related mixed state is obtained. The estimate of

on this state may differ from that of the noiseless case. Second, the imprecise value of is transmitted to the classical optimization loop and affects the next choice of the circuit’s parameters. Figure 8 illustrates the utility of simulating multiple states in parallel to speed up noisy simulations. The QAOA/MaxCut simulations were performed on 3-regular graphs of 16 vertices, with a QAOA depth of . We used and , where is the gate time and and are respectively the time constants related to relaxation and dephasing. In this example, an optimized set of parameters reached a converged value more quickly than a randomly selected set of parameters.

V Conclusion and outlook

We have demonstrated the functionality of the Intel Quantum Simulator (IQS), a high-performance software package for simulating quantum algorithms on single work stations, supercomputers, or the cloud. Depending on the platform of choice and the problem at hand, IQS can take advantage of three operation modes: (1) all resources devoted to simulating the highest possible number of qubits, (2) processes divided into separate groups to simulate a pool of distinct circuits, or (3) using the pool of states as the stochastic ensemble needed to model noise and decoherence.

In this work we explored all three operation modes: first we launched 42-qubit simulations on the SuperMUC-NG supercomputer at LRZ and characterized the strong and weak scaling of the one-qubit gate execution. Then, in order to study the performance of many quantum devices operating in parallel, we used IQS to investigate the performance of particle swarm optimization (PSO) for the quantum approximate optimization algorithm (QAOA). Analyzing the results allowed us to estimate the optimal number of PSO particles for the class of problem instances studied. Finally, we performed a convergence study using hundreds of ensembles of stochastic circuits to describe the noise effects for systems of dimension . The relatively small overhead provides a remarkable advantage over methods based on density matrix simulations, which require quadratically more memory. We conclude by emphasizing that the two applications just discussed are particularly suitable to run on cloud platforms, where they can take full advantage of the tens of thousands of available nodes without being limited by the communication bandwidth or latency. Whether one simulates one state of many qubits or a pool of many smaller states, IQS is suitable as a standalone program or as a back-end to other quantum simulation software.

The authors thank Aastha Grover who helped in releasing the Intel Quantum Simulator open source at The authors acknowledge the Leibniz Supercomputing Center of the Bavarian Academy of Science (LRZ) for providing HPC resources and Luigi Iapichino for useful discussions about the results.

Appendix A Installation

The Intel Quantum Simulator builds as a static library and it can be used via an API, which corresponds to the quantum gate operations. The following packages are required to be present to the system, before installing the library:

  • CMake tools version 3.15+

  • MPICH3 for distributed communication

  • optional: Intel Math Kernel Libary (MKL) for distributed random number generation

  • optional: PyBind11 (installed via conda, not pip) required by the Python binding of Intel-QS

The code is hosted as open-source project to the public GitHub repository and it can be cloned via:

  git clone
  cd Intel-QS

The preferred installation for best performance takes advantage of the Intel Parallel Studio compilers and is documented in the GitHub page of the project IQS . Here we provide instructions how to use the standard GNU toolchain. The installation follows the out-of-source building and requires the creation of the directory build. This directory is used to collect all the files generated by the build process. The appropriate makefile is generated with CMake:

  mkdir build
  cd build
  CXX=g+ cmake -DIqsMPI=OFF -DIqsUTest=ON ..

By default, MKL is not required when GNU compilers are used. The command above install the single-node version of IQS, while to install the distributed version one must set the option backcolor-DIqsMPI=ON instead. In this case, it is required at least the version 3.1 of MPICH for the build to be successful.

The result of the building process is twofold: on one hand the static C++ library of IQS is created as backcolorbuild/lib/libintel_qs.a, and on the other hand the executables of the unit test and several examples are saved in the folder backcolorbuild/bin/.

Appendix B Python bindings

In the last few years, the scientific community has adopted Python as a central language for numerical tools. In the field of quantum computing, several of the most popular frameworks have a Python interface Pennylane ; Qiskit ; Forest ; Cirq ; Azure ; Projectq ; Braket . To facilitate the integration with those and other tools, we provide Python bindings of the IQS code for the single-node implementation. By default, whenever MPI is disabled, the building process create a Python library containing the classes and methods of IQS. The library can be found in backcolorbuild/lib/ or in an equivalent file. The binding code itself uses the Pybind11 library which needs to be installed via conda (not simply with pip) to include the relevant information in CMake. To disable the Python wrapper, even without MPI, set the CMake option selection to backcolor-DIqsPython=OFF.

Appendix C Docker file

The number of qubits that it is possible to simulate with the Intel-QS platform is constrained by the amount of memory available to hold the quantum state vector.

Cloud computing platforms make it possible for high performance computing applications to be run on small temporary clusters of compute nodes that provide more memory than is possible on a single user’s laptop or workstation. One simply allocates the compute nodes with the required memory sizes, configures them to talk to each other in a cluster, and then uses Kubernetes, SWARM, SLURM, BEEs, or some other container or cluster job orchestration package to allocate and use a Docker container or Singularity instance on each node.

In order to facilitate the use of Intel-QS in a Cloud computing multi-node configuration, we provide a Docker file that can be built to create an image that can be run on each compute node. This Docker build file downloads all of the required software packages including a host OS necessary to build and execute Intel-QS software platform.

To use the multi-node capability of Intel-QS in the cloud environment, it is necessary to compile the Dockerfile to run as a Singularity image. This is due to restrictions on using SSH from within a native Docker image. Compiling down to a Singularity instance works around this restriction.

The Docker container also provides a pre-built environment for compiling and building Intel-QS if researchers do not have access to the correct OS version and software tools required by the platform.

Appendix D Parallel simulations of a pool of states

The multi-state functionality is better described by providing a practical example. Here we want to simulate a 10-qubit system exposed to dissipation and decoherence, characterized by the and time respectively. The effect of noise is included by performing multiple simulations of the circuit with the addition of stochastic noise gates Bassi2008 ; Sawaya2016 .

The code below is a simplified version of the tutorial provided in the IQS repository (released under the Apache 2 license) and requires MPI. The circuit is trivial: for each qubit a rotation around the X axis is performed, by angles that were randomly chosen. One has control on the gate schedule and we assume that all gates are performed sequentially starting from qubit 0 until qubit 9. This is controlled by increasing the second argument of the noise gates corresponding to the duration of the simulated noise.

1#include ”qureg.hpp” // IQS header file (additional header files may need to be included)
3int main(int argc, char **argv)
5  // Create the MPI environment, passing the same argument to all the processes.
6  qhipster::mpi::Environment env(argc, argv);
7  // One pool state per process. For accurate noise effects, they should be hundreds.
8  int num_pool_states;
9  MPI_Comm_size(MPI_COMM_WORLD, &num_pool_states);
10  // Partition the MPI environment into groups of processes. One group per pool state.
11  env.UpdateStateComm(num_pool_states);
12  // Number of qubits, here 10.
13  int num_qubits = 10;
14  // Random number generator, provided by IQS.
15  qhipster::RandomNumberGenerator<double> rng;
16  rng.SetSeedStreamPtrs( 777777 );
17  // Choose the angles of the circuit, randomly in [0,pi[.
18  // They have the same value across all pool states.
19  std::vector<double> angles(num_qubits);
20  rng.UniformRandomNumbers(, angles.size(), 0., M_PI, ”pool”);
21  // Initialize the qubit register state to |00…0>
22  QubitRegister<std::complex<double>> psi(num_qubits);
23  psi.Initialize(”base”,0);
24  // Associate the random number generator to the qubit register.
25  // This is required by the stochastic noise gates.
26  psi.SetRngPtr(&rng);
27  // Set T_1 and T_2 timescale.
28  double T_1=30. , T_2=15. ;
29  psi_slow.SetNoiseTimescales(T_1, T_2);
30  //  Circuit simulation.
31  // Each gate is preceded and followed by noise gates according to the schedule.
32  for (int qubit=0; qubit<num_qubits; ++qubit)
33  {
34    double duration = double(1+qubit);
35    psi_slow.ApplyNoiseGate (qubit, duration);
36    psi_slow.ApplyRotationX (qubit, angles[qubit]);
37    duration = double(num_qubits-qubit);
38    psi_slow.ApplyNoiseGate (qubit, duration);
39  }
40  // Compute the probability of qubit 0 to be in |1>.
41  double probability = psi.GetProbability(0);
42  // Incoherent average across the pool to get the noisy expectation.
43  probability = env.IncoherentSumOverAllStatesOfPool<double> (probability);
44  probability /= double(num_pool_states);
45  return 0;

Notice that in line 15 we declare the (psudo-)random number generator included in the IQS software. It is advantageous for various scenarios that it can generate three kinds of random numbers:


different for each pool rank (not used in the code above)


common to all ranks of the same state (automatically used by the noise gates)


common to all ranks of the pool (used for the rotation angles of the circuit)


  • [1] Peter W. Shor. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Review, 41(2):303–332, 1999.
  • [2] Lov K. Grover. Quantum mechanics helps in searching for a needle in a haystack. Physical Review Letters, 79(2):325–328, 1997.
  • [3] Nicolas P. D. Sawaya, Tim Menke, Thi Ha Kyaw, Sonika Johri, Alán Aspuru-Guzik, and Gian Giacomo Guerreschi. Resource-efficient digital quantum simulation of d-level systems for photonic, vibrational, and spin-s Hamiltonians. arXiv:1909.12847, 2019.
  • [4] Alberto Peruzzo, Jarrod R. McClean, Peter J. Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Alán Aspuru-Guzik, and Jeremy L. O’Brien. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5(May):4213, jul 2014.
  • [5] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate optimization algorithm. arXiv:1411.4028, 2014.
  • [6] A. Yu. Kitaev. Fault-tolerant quantum computation by anyons. Annals of Physics, 303:2–30, 2002.
  • [7] R. Sagastizabal, X. Bonet-Monroig, M. Singh, M. Adriaan Rol, C. C. Bultink, Xiang Fu, C. H. Price, V. P. Ostroukh, N. Muthusubramanian, A. Bruno, M. Beekman, N. Haider, Thomas E. O’Brien, and Leo DiCarlo. Experimental error mitigation via symmetry verification in a variational quantum eigensolver. Physical Review A, 100(1):010302(R), 2019.
  • [8] Gian Giacomo Guerreschi and Anne Y. Matsuura. QAOA for Max-Cut requires hundreds of qubits for quantum speed-up. Scientific Reports, 9:6903, 2019.
  • [9] Swamit S Tannu and Moinuddin K Qureshi. Not All Qubits Are Created Equal A Case for Variability-Aware Policies for NISQ-Era Quantum Computers. ASPLOS19, pages 987–999, 2019.
  • [10] Lingling Lao, Daniel M. Manzano, Hans van Someren, Imran Ashraf, and Carmina G. Almudever. Mapping of quantum circuits onto NISQ superconducting processors. arXiv:1908.04226, 2019.
  • [11] Igor L. Markov and Yaoyun Shi. Simulating quantum computation by contracting tensor networks. SIAM Journal on Computing, 38(3):963–981, jan 2008.
  • [12] E. Schuyler Fried, Nicolas P. D. Sawaya, Yudong Cao, Ian D. Kivlichan, Jhonathan Romero, and Alán Aspuru-Guzik. qTorch: The quantum tensor contraction handler. PLOS ONE, 13(12):e0208510, dec 2018.
  • [13] Alexander J. McCaskey. Tensor Network Quantum Virtual Machine (TNQVM), 2016.
  • [14] Johnnie Gray. quimb: a python library for quantum information and many-body calculations. Journal of Open Source Software, 3(29):819, 2018.
  • [15] Benjamin Villalonga, Sergio Boixo, Bron Nelson, Christopher Henze, Eleanor Rieffel, Rupak Biswas, and Salvatore Mandrà. A flexible high-performance simulator for verifying and benchmarking quantum circuits implemented on real hardware. npj Quantum Information, 5(1):1–16, 2019.
  • [16] Jumpei Niwa, Keiji Matsumoto, and Hiroshi Imai. General-purpose parallel simulator for quantum computing. Physical Review A, 66(6), dec 2002.
  • [17] K. De Raedt, K. Michielsen, H. De Raedt, B. Trieu, G. Arnold, M. Richter, Th. Lippert, H. Watanabe, and N. Ito. Massively parallel quantum computer simulator. Computer Physics Communications, 176(2):121–136, jan 2007.
  • [18] Mikhail Smelyanskiy, Nicolas P. D. Sawaya, and Alán Aspuru-Guzik. qhipster: The quantum high performance software testing environment, 2016.
  • [19] Thomas Häner and Damian S. Steiger. 0.5 petabyte simulation of a 45-qubit quantum circuit. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC 17. ACM Press, 2017.
  • [20] N. Khammassi, I. Ashraf, X. Fu, C.G. Almudever, and K. Bertels. QX: A high-performance quantum computer simulation platform. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. IEEE, mar 2017.
  • [21] Ryan LaRose. Distributed memory techniques for classical simulation of quantum circuits, 2018.
  • [22] Tyson Jones, Anna Brown, Ian Bush, and Simon C. Benjamin. QuEST and high performance simulation of quantum computers. Scientific Reports, 9(1), jul 2019.
  • [23] Hans De Raedt, Fengping Jin, Dennis Willsch, Madita Willsch, Naoki Yoshioka, Nobuyasu Ito, Shengjun Yuan, and Kristel Michielsen. Massively parallel quantum computer simulator, eleven years later. Computer Physics Communications, 237:47–61, apr 2019.
  • [24] Eladio Gutiérrez, Sergio Romero, María A. Trenas, and Emilio L. Zapata. Quantum computer simulation using the CUDA programming model. Computer Physics Communications, 181(2):283–300, feb 2010.
  • [25] A. Amariutei and S. Caraiman. Parallel quantum computer simulation on the gpu. In 15th International Conference on System Theory, Control and Computing, pages 1–6, Oct 2011.
  • [26] Pei Zhang, Jiabin Yuan, and Xiangwen Lu. Quantum computer simulation on multi-GPU incorporating data locality. In Algorithms and Architectures for Parallel Processing, pages 241–256. Springer International Publishing, 2015.
  • [27] I. Savran, M. Demirci, and A.H. Yılmaz. Accelerating shor’s factorization algorithm on gpus. Canadian Journal of Physics, 96(7):759–761, 2018.
  • [28] Edwin Pednault, John A. Gunnels, Giacomo Nannicini, Lior Horesh, Thomas Magerlein, Edgar Solomonik, Erik W. Draeger, Eric T. Holland, and Robert Wisnieff. Breaking the 49-qubit barrier in the simulation of quantum circuits, 2017.
  • [29] Zhao-Yun Chen, Qi Zhou, Cheng Xue, Xia Yang, Guang-Can Guo, and Guo-Ping Guo. 64-qubit quantum circuit simulation. Science Bulletin, 63(15):964 – 971, 2018.
  • [30] Pennylane. Accessed: 2020-01-10.
  • [31] Qiskit. Accessed: 2020-01-10.
  • [32] Forest. Accessed: 2020-01-10.
  • [33] Cirq.
  • [34] Azure quantum. Accessed: 2020-01-10.
  • [35] Projectq. Accessed: 2020-01-10.
  • [36] Orquestra. Accessed: 2020-01-10.
  • [37] Braket. Accessed: 2020-01-10.
  • [38] Doan Binh Trieu. Large-scale simulations of error-prone quantum computation devices. PhD thesis, University of Wuppertal, 2009.
  • [39] Angelo Bassi and Dirk André Deckert. Noise gates for decoherent quantum circuits. Physical Review A, 77:032323, 2008.
  • [40] E. Farhi, J. Goldstone, and S. Gutmann. A quantum approximate optimization algorithm applied to a bounded occurrence constraint problem. arXiv:1412.6062, 2014.
  • [41] E. Farhi and A. W Harrow. Quantum supremacy through the quantum approximate optimization algorithm. arXiv:1602.07674, 2016.
  • [42] D. Wecker, M. B. Hastings, and M. Troyer. Training a quantum optimizer. Phys. Rev. A, 94(2):022309, 2016.
  • [43] C. Yen-Yu Lin and Y. Zhu. Performance of qaoa on typical instances of constraint satisfaction problems with bounded degree. arXiv:1601.01744, 2016.
  • [44] G. Giacomo Guerreschi and M. Smelyanskiy. Practical optimization for hybrid quantum-classical algorithms. arXiv:1701.01450, 2017.
  • [45] James Kennedy and Russell Eberhart. Particle Swarm Optimization.

    Proceedings of the IEEE International Conference on Neural Networks

    , pages 1942–1945, 1995.
  • [46] Magnus Erik Pedersen. Good parameters for particle swarm optimization. Technical Report HL1001, Hvass Laboratories, HL1001, 2010.
  • [47] Intel quantum simulator. Accessed: 2020-01-10.
  • [48] Nicolas P. D. Sawaya, Mikhail Smelyanskiy, Jarrod R. McClean, and Alán Aspuru-Guzik. Error sensitivity to environmental noise in quantum circuits for chemical state preparation. Journal of Chemical Theory and Computation, 12(7):3097–3108, 2016.