I Introduction
In the past decade there has been steady progress toward building a viable quantum computer that can be used to solve problems that classical computers cannot. Because quantum hardware is still in its infancy, the simulation of quantum algorithms on classical computers will continue to be an important and useful endeavor. This is because many technologically and scientifically relevant questions are too difficult or impractical to be answered analytically. Though many useful conclusions can be demonstrated analytically, such as the proven speedup of Shor’s algorithm compared to classical factoring algorithms Shor1999 or Grover’s algorithm on unstructure database search Grover1997 , most algorithmic analysis does benefit from numerical experiments.
The first area where numerical simulation is useful is to evaluate the performance of parameters and hyperparameters used in quantum algorithms. For instance, the simulation of physics and chemistry problems involved many choices regarding how to encode the problem to a set of qubits Sawaya2019a
, for which it is usually not obvious which approach will be most efficient without performing numerics. Further, most variational algorithms — whether the variational quantum eigensolver for finding Hamiltonian eigenvalues
Peruzzo2014 or variants of the quantum approximate optimization algorithm for solving combinatorial problems Farhi14_qaoa_orig— involve a classical heuristic optimization routine that, for all intents and purposes, must be analyzed numerically.
The other primary reason for numerical simulation of quantum algorithms is to study the effects of errors. Despite enormous progress in reducing the effects of environmental noise and in perfecting the fidelities of gate operations, it appears certain that all nearterm quantum devices will exhibit errors that cannot be corrected without sophisticated error correction schemes such as the surface code Kitaev2003 . This highlights the need for numerical simulations of quantum algorithms running on errorprone hardware Sagastizabal2019 ; Guerreschi2019 . Such simulations not only help draw conclusions about the robustness of particular algorithmic choices, but can also guide hardware design and gate compilation Tannu2019 ; Lao2019a , since different choices may lead to qualitatively different errors.
Quantum circuits are hard to simulate classically since the computational cost scales exponentially with the number of qubits. Notably, though there are classes of algorithms that scale more favorably for some set of quantum circuits — such as tensor network
Markov2008_tn ; Fried2018_qtorch ; McCaskey2016_tnqvm ; gray2018_quimb ; Villalonga2019or path integral methods — these methods still scale exponentially in the general case. Several high performance quantum circuit simulators have been reported, including full state vector codes built for CPUs
Niwa2002 ; DeRaedt2007 ; Smelyanskiy2016 ; Haner2017 ; Khammassi2017 ; LaRose2018 ; Jones2019 ; DeRaedt2019_48qub and/or graphics processing units Gutierrez2010 ; Amariutei2011 ; Zhang2015_gpu ; Haner2017 ; Savran2018_shor_gpu , and those that use a mix of algorithm types Pednault2017 ; Chen2018_64qubit .In this manuscript we present a new version of the Intel Quantum Simulator (IQS) and use it to emulate a hybrid quantumclassical algorithm. Also known as qHiPSTER or the Quantum High Performance Software Testing Environment, IQS is a massively parallel simulator of quantum algorithms expressed in the form of quantum circuits. Its original version was coded for High Performance Computing environments, with the goal of allowing large scale simulations. In 2016 it was used to simulate the full state of 40 qubits Smelyanskiy2016 .
The second version of Intel Quantum Simulator has just been released, open source, at the address https://github.com/intel/IntelQS. While preserving the HPC core of the original implementation, it includes several new features that extend its application to cloud computing environments. In particular, Intel Quantum Simulator can divide the allocated computational resources into groups
, each dedicated to the simulation of a distinct quantum circuit. The communication between these group of processes is minimal, and each separate group uses the distributed implementation from the original release to store and manipulate its quantum state. We expect that at least two use cases will profit massively from this extension. First, when multiple circuits or variants of the same circuit have to be run in parallel (think for example of variational algorithms in conjunction with classical optimizers like the genetic algorithm or particle swarm optimizers). And second, when stochastic methods are used to include noise and decoherence in the simulation. One can easily envision situations as those just mentioned in which a pool of hundreds or thousands of states is required to accelerate simulation. Cloud computing platforms, with (tens of) thousands of nodes, are an ideal choice to run these workloads.
In addition to the new features just discussed, the new release includes robust unit testing to verify the proper installation and functioning of IQS and, for developers, to test the compatibility of novel features with the released code. Finally, we focused on lowering the user’s learning barrier with an automatic installation process and extended tutorials, and improving the simplicity of use by providing Python integration and a Docker container option. Our goal in releasing this version of the software is that IQS may be used as a standalone program or as a backend to other quantum computing frameworks like Xanadu’s Pennylane Pennylane , IBM’s Qiskit Qiskit , Rigetti’s Forest Forest , Google’s Cirq Cirq , Microsoft’s Azure Quantum Azure , ProjectQ Projectq , Zapata’s Orquestra Orquestra , Amazon’s Braket Braket , and others.
In this article, we begin by describing the basic usage of IQS and its software structure. In section III, we present benchmarks for large scale simulations of up to 42 qubits. Section IV describe two situations that take advantage of simulating a pool of circuits: the emulation of a variational protocol that uses many quantum processors in parallel and the simulation of circuits exposed to noise. Finally we draw conclusion and provide an outlook.
Ii Software description
Intel Quantum Simulator, both in its initial version and latest release, takes advantage of the full resources of an HPC system, due to the shared and distributed memory implementation. The first situation is when several processors, or a processor with multiple computing cores, have access to the same memory and the operations need to be performed without a specific sequential order. This opportunity for parallelism is best exploited with OpenMP. The second opportunity for parallelism arises when a relatively small amount of memory requires a lot of computation or, as in the case of storing quantum states, a large amount of memory cannot fit in a single machine or node. In this case, one needs to explicitly consider the communication pattern between the different processes and adopting the Message Passing Interface is a necessity.
In the new release of IQS we have set the MPI environment for allowing multiple quantum circuits to be simulated in parallel. We divide the computing processes into groups, each dedicated to storing a single quantum state and update it according to the action of a specific circuit. Each new state can still profit from the shared and the distributed implementation of the code (MPI+OpenMP), which has been previously implemented in the original version of the simulator. The use cases are illustrated in Fig. 1 where the graphics clarifies that a single state can be stored using all nodes, a subset of them, or even part of a single computing node. In this section we first introduce the basic methods that allow IQS to initialize, evolve and extract information from quantum states, then we discuss the distributed implementation of a single state and finally explain the parallel simulation of multiple circuits.
Scope and basic use
IQS is designed to simulate the dynamics of multiqubit states in quantum circuits. The state is assumed to be pure and its unitary dynamics a consequence of the application of one and twoqubit operations. IQS provides methods to extract information from the state at any intermediate point or at the end of quantum circuits. There are three fundamental parts in the simulation of quantum circuits: state initialization, evolution, and measurement. We illustrate the basic methods of IQS with short snippets of code which may have been extracted from a longer C++ program. The same syntax applies to the Python interface of IQS^{1}^{1}1The Python interface of IQS is currently limited to singleprocess execution (i.e. no MPI)..
The central object on IQS is called backcolorQubitRegister and can be thought as the quantum state of the qubits composing the system of interest. In its declaration, we need to specify the number of qubits in order to allocate a sufficient amount of memory to describe their state. Then we can initialize the state to any computational basis state, uniquely identified by its index. Other methods allow for initializing the state randomly or, for example, as the balanced superposition of all computational states.
The dynamics is generated by applying one and twoqubit gates. IQS gives the option of defining custom gates, but also provides a large choice of the most common gates like singlequbit rotations or the Hadamard gate. The two qubit gates are in the form of controlled onequbit gates, meaning that the desired operation is applied to the target qubit conditionally on the control qubit being in . The special case of the conditional Pauli X, also called CNOT gate, clarifies why there is no need of arbitrary twoor multiqubit gates: any unitary evolution can be approximated to arbitrary precision by a sequence of onequbit and CNOT gates. IQS is suitable for implementing multiqubit operations^{2}^{2}2For example, the latest IQS release include methods specialized to the emulation of QAOA circuits which are reduced to onequbit gates and a single global operation per QAOA step., but the definition of custom multiqubit gates requires a very good understanding of its internal implementation (see next subsections).
Finally the qubits can be measured in the computational basis one at a time or in larger groups. Differently from actual realization of the quantum experiment in which the measurement output only one of the possible outcomes, with simulators one can compute the full statistics of outcomes without the need of rerunning the experiment. For example, IQS provides methods to compute the probability of finding a certain qubit in state
or evaluate the expectation value of multiqubit observables like products of Pauli matrices (not shown below).Distributed implementation
Here we describe how the quantum state is defined and stored inside the IQS objects. The current algorithm is based of the original implementation of the Intel Quantum Simulator Smelyanskiy2016 , so here we summarize it in order to provide a selfcontained description of our simulator.
To fully describe an arbitrary state of qubits, one needs to store complex numbers corresponding to the probability amplitudes with respect to the computational basis. For as low as 30, simply storing all the amplitudes fills up GB of memory ( bytes per doubleprecision number and a factor since the probability amplitudes are complex). To enable the fast simulation of circuits involving more than 30 qubits, one needs to divide the state between multiple processes, each with its own dedicated memory. IQS assumes that processes are used, each storing amplitudes and satisfying . If is not a power of two, IQS considers an effective number of nodes equal to with .
As we explain in the next paragraphs, all operations involving only the first qubits do not require communication between processes, while MPI communication is needed when performing operations on the last qubits. Therefore we refer to the qubits with index as “local” and those with index as “global”. However it is important to realize that even the partial state of a local qubit can be fully known only by accessing all amplitudes distributed among all processes.
It is informative to analyze how onequbit gates are implemented in IQS. Any quantum state can be written as a vector with complex entries , so it is convenient to express the index in binary notation as with and such that . In this way it is straightforward to obtain both the process number and index of the local memory corresponding to the th amplitude :
(1) 
Consider the onequbit gate acting on qubit and defined by the unitary matrix :
(2) 
Its action on the quantum state can be written as
(3) 
where refer to any bit value. The expression above means that the entries are updated in pairs independently of the values of the other amplitudes. From eq. (1) it is clear that when the connected pairs of entries are stored in the memory of the same process and can be updated without interprocess communication, as illustrated in Fig. 2.
The situation differs when . In this case the two entries belong to the memory of two distinct processes, specifically to those with index and respectively (here is such that ). Interprocess communication is therefore required and we adopt the same scheme as in the original IQS implementation Smelyanskiy2016 ; Trieu2009 . It is briefly summarized in Fig. 3 and its caption.
In addition to onequbit gates, IQS implements distributed twoqubit gates of the controlled form, meaning that the onequbit gate is applied to target qubit conditionally on the control qubit being in state . The communication pattern depends on the control and target qubits being local or global and if or not.
Pool of multiple states
The most consequential change in the IQS implementation compared to its original release is the ability of dividing the processes into groups using the MPI function backcolorMPI_Comm_create_group. Each group can be used to store a quantum state, possibly in a distributed way if the group itself is composed by more than one process. Now, when a backcolorQubitRegister object is created, it actually initializes a state in each group of processes: we call “pool” the collection of such states. In addition, when a method of the form backcolorApplyGate is called, the gate is actually applied to each and every state of the pool.
We clarify this concept with a concrete example. Consider that we have 80 processes and want to work with 10 quantum states. One option is to group all processors together and declare 10 backcolorQubitRegister objects: here, each state is distributed over processes (since 80 is not a power of 2) and the circuit’s gates must be specified for each of the 10 states separately. The second option is to divide the processes in 10 groups of 8 processes each and create a single backcolorQubitRegister object: here, each state is distributed over processes and each gate is by default applied to every state.
There are two important observations: the first one is that defining a nontrivial pool of states may take advantage of the available processes in a more effective way. The second is that each state of the pool is naturally subjected to the same quantum circuit. The latter characteristic, if strictly enforced, would make the simulations redundant: we would simulate over and over the same identical evolution. However it is possible to differentiate the applied circuit for each of the state in the pool and the relevant commands are discussed in appendix D. Moreover, there are cases in which simulating closely related circuits is required and what seemed a limitation actually becomes a beneficial feature.
Here we discuss two of these situations and present the corresponding results in section IV. Code snippets are discussed in appendix D.

In Variational Quantum Algorithms (VQA) a quantum circuit composed of parametric gates is optimized to prepare states with desired properties, often related to having large overlap with the ground state of certain observables. During the optimization, the same circuit is simulated over and over with the only difference being the value of its parameters (think of them as the angle of onequbit rotations). Within the pool functionality, it is easy to assign different parameter values to the circuit simulated by the distinct states in the pool. This approach greatly speedup the overall simulations of VQA protocols based on several classes of optimizers, like genetic algorithm, swarm particle optimization, or gradientbased methods.

IQS is a simulator of unitary dynamics in which each state is pure. Nonetheless it is possible to use IQS to simulate the effect of noise and decoherence during the circuit by mean of introducing stochastic perturbations to the ideal circuit and averaging over the ensemble of “perturbed” circuits Smelyanskiy2016 . Formally, this approach is based on the unraveling of master equations into stochastic Schröedinger equations in the circuitmodel formalism Bassi2008 and corresponds to the introduction of additional “noise gates” in the form of onequbit rotations with stochastic rotation angles. IQS provides specialized methods to apply these noise gates that automatically varies their rotation angles over the pool’s states.
Iii Scaling Experiments
In Ref. Smelyanskiy2016 , Smelyanskiy and coauthors analyzed the simulator performance on the distributed system Stampede provided by the Texas Advanced Computing Center (TACC). They demonstrated the weak and the strong scalability of the code up to 40 qubits using
compute nodes. They also performed single and multinode performance measurements of one and twoqubit gates, and of a complete quantum circuit, namely the Quantum Fourier Transform. Since the core implementation of IQS did not change for the latest release, we consider those results still valid.
In the current release of IQS, we provided sample codes to run onequbit gate benchmark on HPC systems for analyzing the strong and the weak scaling of the simulator. We have used such sample scripts to run the following experiments launched on the SuperMUCNG^{3}^{3}3https://doku.lrz.de/display/PUBLIC/Hardware+of+SuperMUCNG HPC system hosted by the Leibniz Supercomputing Center of the Bavarian Academy of Science (LRZ). SuperMUCNG consists of Intel^{®} Xeon^{®} Scalable Processor 8174 CPU cores and 719 terabytes of total main memory. Each single node is a two sockets system of cores each with GB of shared memory. In the next sections, we present strong and weak scaling results up to nodes, which corresponds to cpu cores and TB of total distributed memory.
Strong scaling
In the strong scaling analysis, we fixed the problem size (the number of qubits to run) and scaled up the computational resources. The total speedup is limited by the fraction of the serial part of the code that is not amenable to parallelization. In Fig. 4 we show the speedup of singlequbit operations for simulations of 32qubit systems. We have implemented a random single qubit gate.
Weak scaling
In section II we described the internal representation of quantum states used by IQS and highlighted the fact that simulating an extra qubit implies doubling the allocated memory. This consideration can be included in the numerical analysis of the socalled weak scaling. The idea is that, while the number of processes increase, the simulation size also increases and in such a way that the memory amount and computing effort per process stays (ideally) constant.
We launched simulations of systems from 32 to 42 qubits using from 4 to 4096 processes. The largest job used 2048 nodes of the SuperMUCNG system. The expectation of a scaleinvariant behavior is confirmed by our study and presented in Fig. 5.
Iv Particle Swarm / QAOA Simulation
As quantum hardware improves, it will be possible to run many smallscale quantum computers at the same time. Though there may not be entanglement across the devices, one may run the devices in a parallel fashion in order to more quickly solve variational quantum problems. Each quantum device would be calculating an objective function for a different set of parameters, with a classical optimization step using the results of all the devices.
QAOA with swarm particle optimization
We used IQS to perform this task of simulating many quantum computers running in parallel. The simulation demonstrates one example of the variety of simulation types that may be performed with the software. The variational quantum algorithm we chose is the quantum approximate optimization algorithm (QAOA) Farhi14_qaoa_orig for the MaxCut problem on 3regular graphs, an extensively studied problem in the quantum algorithms community Farhi14_qaoa_orig ; Farhi14_qaoaboundoccur ; Farhi16_qsuprqaoa ; Wecker16_training ; Lin16_qaoa ; Guerreschi17_qaoa_opt ; Guerreschi2019 . For the classical optimization procedure we use the particle swarm optimization (PSO) algorithm Kennedy1995 ; Pedersen2010 , where we implement each ‘particle’ as one virtual quantum device.
The particular PSO implementation we used was taken from reference Pedersen2010 . For each particle we first set random initial positions drawn uniformly from , where these positions are each a set of parameter vectors for particles. Each unique position produces a unique output from the objective function , where is the MaxCut Hamiltonian. Each particle is given an initial velocity also drawn uniformly from . The particles are propagated for one time step based on their velocities, after the velocities have been updated with the formula
(4) 
where , , and are arbitrary constants; and
are random numbers drawn each step from uniform distribution [0,1];
is the best position that particle has discovered thus far; and is the best position discovered by the entire swarm. For this work, we set the parameters to , , and , as there is numerical evidence that these parameters perform well for some optimization landscapes and hyperparameters Pedersen2010 . The algorithm is embarrassingly parallel, as the only cooperation between quantum devices involves broadcasting each device’s in order to modify the velocities. The formula shows that after each time step, each new velocity is determined by three terms: a damping term, an acceleration term based on the best previous position of particle , and a second acceleration term based on the best position found so far by the entire swarm.In the procedure described above, we make the implicit assumption that systematic error does not differ greatly between devices. One feature of variational quantum algorithms is that they are robust to any systematic constant errors inherent to a given device. This is because , i.e. the maximum of the objective function is independent of any systematic error. It will not in general be true that a collection of quantum devices will have nearly equal , but as hardware improves the difference in error between devices is likely to decrease.
As the number of total function evaltuations increases, Figure 6 shows the improvement in the MaxCut approximation ratio, equal to the quotient of and the exact MaxCut solution . denotes the quantum observable that counts the number of cuts for a given graph bipartition and we are therefore interested in finding its largest eigenvalue. Results for several particle counts between 4 and 64 are shown, with the horizontal axis giving the number of function evaluations. Note that the number of functions evaluations per PSO step is equal to the number of PSO particles, meaning that having more particles results in fewer swarm steps for the same number of total function evaluations.
The results show mean approximation values at each step, averaged over 300 random 3regular graphs of 18 vertices (qubits), each with randomly selected initial conditions for the swarm. A QAOA depth Farhi14_qaoa_orig of was used for all circuits, leading to parameters and hence an 8dimensional position vector. The standard deviations (from the 300 random graph instances) are omitted in Figure 6, though they are appreciable (often higher than 0.05) and usually larger than the difference between means. Note that we are referring to the standard deviation of the distribution of over many graphs instances, not the uncertainty in the mean. For two snapshots taken at 512, and 5000 total function evaluations, Figure 7 shows the mean as well as the standard deviations of the distribution.
Though the best strategy is to choose a number of particles with the best mean behavior for a given number of function evaluations, the large overlaps between the standard deviations suggest that a clearly superior particle count choice would appear only after many problem instances. The initial guess, on average, tends to favor higher particle counts, which is expected since this amounts to have more initial random guesses. Interestingly, this trend roughly reverses after just a few function evaluations, before being restored again.
Both Figures 6 and 7 show that, if one is limited by the total number of function evaluations, the optimal number of particles is not necessarily the largest number. This nonmonotonic behavior is present because, though more particles allow for exploring more of the parameter space, this comes at the cost of needing to calculate more function evaluations per swarm step. However, as the number of total function evaluations increases, more particles appear to strictly produce better approximations to the solution. Figure 7 shows more clearly that the optimal particle count depends on how many function evaluations one is able to perform in the optimization. For instance, if one may only perform 500 evaluations, 1014 particles are best, but the optimal number of particles increases as the number of allowed function evaltionas increases. Stated differently, the fewer function evaluations are available, the less utility is gained from adding more particles. This may provide a useful heuristic for determining the hyperparameter of particle count in using PSO for QAOA, though more study is needed on a broader class of graph instances and graph theory problems.
Convergence of noisy simulations
QAOA is one of the leading candidates to achieve quantum advantage on noisy nearterm quantum devices. To evaluate its performance in realistic experiments, it is fundamental to include the effect of noise and decoherence into its protocol. There are two distinct effects of noise that combine in QAOA protocols: the first one is that the expected state
is not achieved in practice but a hopefully related mixed state is obtained. The estimate of
on this state may differ from that of the noiseless case. Second, the imprecise value of is transmitted to the classical optimization loop and affects the next choice of the circuit’s parameters. Figure 8 illustrates the utility of simulating multiple states in parallel to speed up noisy simulations. The QAOA/MaxCut simulations were performed on 3regular graphs of 16 vertices, with a QAOA depth of . We used and , where is the gate time and and are respectively the time constants related to relaxation and dephasing. In this example, an optimized set of parameters reached a converged value more quickly than a randomly selected set of parameters.V Conclusion and outlook
We have demonstrated the functionality of the Intel Quantum Simulator (IQS), a highperformance software package for simulating quantum algorithms on single work stations, supercomputers, or the cloud. Depending on the platform of choice and the problem at hand, IQS can take advantage of three operation modes: (1) all resources devoted to simulating the highest possible number of qubits, (2) processes divided into separate groups to simulate a pool of distinct circuits, or (3) using the pool of states as the stochastic ensemble needed to model noise and decoherence.
In this work we explored all three operation modes: first we launched 42qubit simulations on the SuperMUCNG supercomputer at LRZ and characterized the strong and weak scaling of the onequbit gate execution. Then, in order to study the performance of many quantum devices operating in parallel, we used IQS to investigate the performance of particle swarm optimization (PSO) for the quantum approximate optimization algorithm (QAOA). Analyzing the results allowed us to estimate the optimal number of PSO particles for the class of problem instances studied. Finally, we performed a convergence study using hundreds of ensembles of stochastic circuits to describe the noise effects for systems of dimension . The relatively small overhead provides a remarkable advantage over methods based on density matrix simulations, which require quadratically more memory. We conclude by emphasizing that the two applications just discussed are particularly suitable to run on cloud platforms, where they can take full advantage of the tens of thousands of available nodes without being limited by the communication bandwidth or latency. Whether one simulates one state of many qubits or a pool of many smaller states, IQS is suitable as a standalone program or as a backend to other quantum simulation software.
Acknowledgements.
The authors thank Aastha Grover who helped in releasing the Intel Quantum Simulator open source at https://github.com/intel/IntelQS. The authors acknowledge the Leibniz Supercomputing Center of the Bavarian Academy of Science (LRZ) for providing HPC resources and Luigi Iapichino for useful discussions about the results.
Appendix A Installation
The Intel Quantum Simulator builds as a static library and it can be used via an API, which corresponds to the quantum gate operations. The following packages are required to be present to the system, before installing the library:

CMake tools version 3.15+

MPICH3 for distributed communication

optional: Intel Math Kernel Libary (MKL) for distributed random number generation

optional: PyBind11 (installed via conda, not pip) required by the Python binding of IntelQS
The code is hosted as opensource project to the public GitHub repository and it can be cloned via:
The preferred installation for best performance takes advantage of the Intel Parallel Studio compilers and is documented in the GitHub page of the project IQS . Here we provide instructions how to use the standard GNU toolchain. The installation follows the outofsource building and requires the creation of the directory build. This directory is used to collect all the files generated by the build process. The appropriate makefile is generated with CMake:
By default, MKL is not required when GNU compilers are used. The command above install the singlenode version of IQS, while to install the distributed version one must set the option backcolorDIqsMPI=ON instead. In this case, it is required at least the version 3.1 of MPICH for the build to be successful.
The result of the building process is twofold: on one hand the static C++ library of IQS is created as backcolorbuild/lib/libintel_qs.a, and on the other hand the executables of the unit test and several examples are saved in the folder backcolorbuild/bin/.
Appendix B Python bindings
In the last few years, the scientific community has adopted Python as a central language for numerical tools. In the field of quantum computing, several of the most popular frameworks have a Python interface Pennylane ; Qiskit ; Forest ; Cirq ; Azure ; Projectq ; Braket . To facilitate the integration with those and other tools, we provide Python bindings of the IQS code for the singlenode implementation. By default, whenever MPI is disabled, the building process create a Python library containing the classes and methods of IQS. The library can be found in backcolorbuild/lib/intelqs_py.cpython36mx86_64linuxgnu.so or in an equivalent file. The binding code itself uses the Pybind11 library which needs to be installed via conda (not simply with pip) to include the relevant information in CMake. To disable the Python wrapper, even without MPI, set the CMake option selection to backcolorDIqsPython=OFF.
Appendix C Docker file
The number of qubits that it is possible to simulate with the IntelQS platform is constrained by the amount of memory available to hold the quantum state vector.
Cloud computing platforms make it possible for high performance computing applications to be run on small temporary clusters of compute nodes that provide more memory than is possible on a single user’s laptop or workstation. One simply allocates the compute nodes with the required memory sizes, configures them to talk to each other in a cluster, and then uses Kubernetes, SWARM, SLURM, BEEs, or some other container or cluster job orchestration package to allocate and use a Docker container or Singularity instance on each node.
In order to facilitate the use of IntelQS in a Cloud computing multinode configuration, we provide a Docker file that can be built to create an image that can be run on each compute node. This Docker build file downloads all of the required software packages including a host OS necessary to build and execute IntelQS software platform.
To use the multinode capability of IntelQS in the cloud environment, it is necessary to compile the Dockerfile to run as a Singularity image. This is due to restrictions on using SSH from within a native Docker image. Compiling down to a Singularity instance works around this restriction.
The Docker container also provides a prebuilt environment for compiling and building IntelQS if researchers do not have access to the correct OS version and software tools required by the platform.
Appendix D Parallel simulations of a pool of states
The multistate functionality is better described by providing a practical example. Here we want to simulate a 10qubit system exposed to dissipation and decoherence, characterized by the and time respectively. The effect of noise is included by performing multiple simulations of the circuit with the addition of stochastic noise gates Bassi2008 ; Sawaya2016 .
The code below is a simplified version of the tutorial provided in the IQS repository (released under the Apache 2 license) and requires MPI. The circuit is trivial: for each qubit a rotation around the X axis is performed, by angles that were randomly chosen. One has control on the gate schedule and we assume that all gates are performed sequentially starting from qubit 0 until qubit 9. This is controlled by increasing the second argument of the noise gates corresponding to the duration of the simulated noise.
Notice that in line 15 we declare the (psudo)random number generator included in the IQS software. It is advantageous for various scenarios that it can generate three kinds of random numbers:
 local

different for each pool rank (not used in the code above)
 state

common to all ranks of the same state (automatically used by the noise gates)
 pool

common to all ranks of the pool (used for the rotation angles of the circuit)
References
 [1] Peter W. Shor. Polynomialtime algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Review, 41(2):303–332, 1999.
 [2] Lov K. Grover. Quantum mechanics helps in searching for a needle in a haystack. Physical Review Letters, 79(2):325–328, 1997.
 [3] Nicolas P. D. Sawaya, Tim Menke, Thi Ha Kyaw, Sonika Johri, Alán AspuruGuzik, and Gian Giacomo Guerreschi. Resourceefficient digital quantum simulation of dlevel systems for photonic, vibrational, and spins Hamiltonians. arXiv:1909.12847, 2019.
 [4] Alberto Peruzzo, Jarrod R. McClean, Peter J. Shadbolt, ManHong Yung, XiaoQi Zhou, Peter J. Love, Alán AspuruGuzik, and Jeremy L. O’Brien. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5(May):4213, jul 2014.
 [5] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate optimization algorithm. arXiv:1411.4028, 2014.
 [6] A. Yu. Kitaev. Faulttolerant quantum computation by anyons. Annals of Physics, 303:2–30, 2002.
 [7] R. Sagastizabal, X. BonetMonroig, M. Singh, M. Adriaan Rol, C. C. Bultink, Xiang Fu, C. H. Price, V. P. Ostroukh, N. Muthusubramanian, A. Bruno, M. Beekman, N. Haider, Thomas E. O’Brien, and Leo DiCarlo. Experimental error mitigation via symmetry verification in a variational quantum eigensolver. Physical Review A, 100(1):010302(R), 2019.
 [8] Gian Giacomo Guerreschi and Anne Y. Matsuura. QAOA for MaxCut requires hundreds of qubits for quantum speedup. Scientific Reports, 9:6903, 2019.
 [9] Swamit S Tannu and Moinuddin K Qureshi. Not All Qubits Are Created Equal A Case for VariabilityAware Policies for NISQEra Quantum Computers. ASPLOS19, pages 987–999, 2019.
 [10] Lingling Lao, Daniel M. Manzano, Hans van Someren, Imran Ashraf, and Carmina G. Almudever. Mapping of quantum circuits onto NISQ superconducting processors. arXiv:1908.04226, 2019.
 [11] Igor L. Markov and Yaoyun Shi. Simulating quantum computation by contracting tensor networks. SIAM Journal on Computing, 38(3):963–981, jan 2008.
 [12] E. Schuyler Fried, Nicolas P. D. Sawaya, Yudong Cao, Ian D. Kivlichan, Jhonathan Romero, and Alán AspuruGuzik. qTorch: The quantum tensor contraction handler. PLOS ONE, 13(12):e0208510, dec 2018.
 [13] Alexander J. McCaskey. Tensor Network Quantum Virtual Machine (TNQVM), 2016.
 [14] Johnnie Gray. quimb: a python library for quantum information and manybody calculations. Journal of Open Source Software, 3(29):819, 2018.
 [15] Benjamin Villalonga, Sergio Boixo, Bron Nelson, Christopher Henze, Eleanor Rieffel, Rupak Biswas, and Salvatore Mandrà. A flexible highperformance simulator for verifying and benchmarking quantum circuits implemented on real hardware. npj Quantum Information, 5(1):1–16, 2019.
 [16] Jumpei Niwa, Keiji Matsumoto, and Hiroshi Imai. Generalpurpose parallel simulator for quantum computing. Physical Review A, 66(6), dec 2002.
 [17] K. De Raedt, K. Michielsen, H. De Raedt, B. Trieu, G. Arnold, M. Richter, Th. Lippert, H. Watanabe, and N. Ito. Massively parallel quantum computer simulator. Computer Physics Communications, 176(2):121–136, jan 2007.
 [18] Mikhail Smelyanskiy, Nicolas P. D. Sawaya, and Alán AspuruGuzik. qhipster: The quantum high performance software testing environment, 2016.
 [19] Thomas Häner and Damian S. Steiger. 0.5 petabyte simulation of a 45qubit quantum circuit. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on  SC 17. ACM Press, 2017.
 [20] N. Khammassi, I. Ashraf, X. Fu, C.G. Almudever, and K. Bertels. QX: A highperformance quantum computer simulation platform. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. IEEE, mar 2017.
 [21] Ryan LaRose. Distributed memory techniques for classical simulation of quantum circuits, 2018.
 [22] Tyson Jones, Anna Brown, Ian Bush, and Simon C. Benjamin. QuEST and high performance simulation of quantum computers. Scientific Reports, 9(1), jul 2019.
 [23] Hans De Raedt, Fengping Jin, Dennis Willsch, Madita Willsch, Naoki Yoshioka, Nobuyasu Ito, Shengjun Yuan, and Kristel Michielsen. Massively parallel quantum computer simulator, eleven years later. Computer Physics Communications, 237:47–61, apr 2019.
 [24] Eladio Gutiérrez, Sergio Romero, María A. Trenas, and Emilio L. Zapata. Quantum computer simulation using the CUDA programming model. Computer Physics Communications, 181(2):283–300, feb 2010.
 [25] A. Amariutei and S. Caraiman. Parallel quantum computer simulation on the gpu. In 15th International Conference on System Theory, Control and Computing, pages 1–6, Oct 2011.
 [26] Pei Zhang, Jiabin Yuan, and Xiangwen Lu. Quantum computer simulation on multiGPU incorporating data locality. In Algorithms and Architectures for Parallel Processing, pages 241–256. Springer International Publishing, 2015.
 [27] I. Savran, M. Demirci, and A.H. Yılmaz. Accelerating shor’s factorization algorithm on gpus. Canadian Journal of Physics, 96(7):759–761, 2018.
 [28] Edwin Pednault, John A. Gunnels, Giacomo Nannicini, Lior Horesh, Thomas Magerlein, Edgar Solomonik, Erik W. Draeger, Eric T. Holland, and Robert Wisnieff. Breaking the 49qubit barrier in the simulation of quantum circuits, 2017.
 [29] ZhaoYun Chen, Qi Zhou, Cheng Xue, Xia Yang, GuangCan Guo, and GuoPing Guo. 64qubit quantum circuit simulation. Science Bulletin, 63(15):964 – 971, 2018.
 [30] Pennylane. https://pennylane.ai/. Accessed: 20200110.
 [31] Qiskit. https://qiskit.org/. Accessed: 20200110.
 [32] Forest. https://www.rigetti.com/forestapi. Accessed: 20200110.
 [33] Cirq. https://cirq.readthedocs.io/en/stable/.
 [34] Azure quantum. https://azure.microsoft.com/enus/services/quantum/. Accessed: 20200110.
 [35] Projectq. https://projectq.ch/. Accessed: 20200110.
 [36] Orquestra. https://www.zapatacomputing.com/orquestra. Accessed: 20200110.
 [37] Braket. https://aws.amazon.com/braket/. Accessed: 20200110.
 [38] Doan Binh Trieu. Largescale simulations of errorprone quantum computation devices. PhD thesis, University of Wuppertal, 2009.
 [39] Angelo Bassi and Dirk André Deckert. Noise gates for decoherent quantum circuits. Physical Review A, 77:032323, 2008.
 [40] E. Farhi, J. Goldstone, and S. Gutmann. A quantum approximate optimization algorithm applied to a bounded occurrence constraint problem. arXiv:1412.6062, 2014.
 [41] E. Farhi and A. W Harrow. Quantum supremacy through the quantum approximate optimization algorithm. arXiv:1602.07674, 2016.
 [42] D. Wecker, M. B. Hastings, and M. Troyer. Training a quantum optimizer. Phys. Rev. A, 94(2):022309, 2016.
 [43] C. YenYu Lin and Y. Zhu. Performance of qaoa on typical instances of constraint satisfaction problems with bounded degree. arXiv:1601.01744, 2016.
 [44] G. Giacomo Guerreschi and M. Smelyanskiy. Practical optimization for hybrid quantumclassical algorithms. arXiv:1701.01450, 2017.

[45]
James Kennedy and Russell Eberhart.
Particle Swarm Optimization.
Proceedings of the IEEE International Conference on Neural Networks
, pages 1942–1945, 1995.  [46] Magnus Erik Pedersen. Good parameters for particle swarm optimization. Technical Report HL1001, Hvass Laboratories, HL1001, 2010.
 [47] Intel quantum simulator. https://github.com/intel/IntelQS. Accessed: 20200110.
 [48] Nicolas P. D. Sawaya, Mikhail Smelyanskiy, Jarrod R. McClean, and Alán AspuruGuzik. Error sensitivity to environmental noise in quantum circuits for chemical state preparation. Journal of Chemical Theory and Computation, 12(7):3097–3108, 2016.
Comments
There are no comments yet.