Establishing the Quantum Supremacy Frontier with a 281 Pflop/s Simulation

Noisy Intermediate-Scale Quantum (NISQ) computers aim to perform computational tasks beyond the capabilities of the most powerful classical computers, thereby achieving "Quantum Supremacy", a major milestone in quantum computing. NISQ Supremacy requires comparison with a state-of-the-art classical simulator. We report HPC simulations of hard random quantum circuits (RQC), sustaining an average performance of 281 Pflop/s (true single precision) on Summit, currently the fastest supercomputer in the world. In addition, we propose a standard benchmark for NISQ computers based on qFlex, a tensor-network-based classical high-performance simulator of RQC, which are considered the leading proposal for Quantum Supremacy.



There are no comments yet.


page 3

page 5


K-Means Clustering on Noisy Intermediate Scale Quantum Computers

Real-time clustering of big performance data generated by the telecommun...

Closing the "Quantum Supremacy" Gap: Achieving Real-Time Simulation of a Random Quantum Circuit Using a New Sunway Supercomputer

We develop a high-performance tensor-based simulator for random quantum ...

mpiQulacs: A Distributed Quantum Computer Simulator for A64FX-based Cluster Systems

Quantum computer simulators running on classical computers are essential...

HybridQ: A Hybrid Simulator for Quantum Circuits

Developing state-of-the-art classical simulators of quantum circuits is ...

Intel Quantum Simulator: A cloud-ready high-performance simulator of quantum circuits

Classical simulation of quantum computers will continue to play an essen...

The Argument against Quantum Computers, the Quantum Laws of Nature, and Google's Supremacy Claims

My 2018 lecture at the ICA workshop in Singapore dealt with quantum comp...

Memristor-based cryogenic programmable DC sources for scalable in-situ quantum-dot control

Current quantum systems based on spin qubits are controlled by classical...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction and Motivations

As we approach the end of Moore’s law, there is growing urgency to develop alternative computational models. Examples are Beyond Von Neumann architectures and Beyond CMOS hardware. Since Richard Feynman envisioned the possibility of using quantum physics for computation [1, 2], many theoretical and technological advances have been made. Quantum computing is both Beyond Von Neumann and Beyond CMOS. It represents the most potentially transformative technology in computing.

Recent experiments have underscored the remarkable potential for quantum computing to offer new capabilities for computation. Experiments using superconducting electronic circuits have shown that control over quantum mechanics can be scaled to modest sizes with relatively good accuracy [3, 4, 5]. By exercising control over quantum physical systems, quantum computers are expected to accelerate the time to solution and/or accuracy of a variety of different applications [6, 7, 8, 1, 9, 10, 11, 12], while simultaneously reducing power consumption. For instance, important encryption protocols such as RSA, which is widely used in private and secure communications, can theoretically be broken using quantum computers [13]. Moreover, quantum simulation problems, including new material design and drug discovery, could be accelerated using quantum processors [1, 9, 10, 11, 12, 14].

However, building a universal, fault tolerant quantum computer is, at the moment, a long-term goal driven by strong theoretical

[15, 16, 17] and experimental [3, 4, 18]

evidence. Nonetheless, despite the lack of error correction mechanisms to run arbitrary quantum algorithms, Noisy Intermediate-Scale Quantum (NISQ) devices of about 50-100 qubits are expected to perform certain tasks which surpass the capabilities of classical high-performance computer systems

[19, 20, 17, 21, 16, 22, 23, 24, 25, 26, 27, 18, 28, 29, 30, 31], therefore achieving Quantum Supremacy. Examples of NISQ devices are the 50-qubit IBM QPU [32], and the 72-qubit Google Bristlecone QPU [33]. Both NISQ devices are “digital” universal quantum computers, namely they perform a universal set of discrete operations (“gates”) on qubits. Quantum algorithms are translated into quantum circuits and run on the NISQ device. The “depth” of quantum circuits is defined as the number of clock cycles, where each clock cycle is composed of gate operations executed on distinct qubits. Given the noisy nature of NISQ devices, only shallow or low depth quantum circuits can be run.

An extensive body of work in complexity theory in the context of ”Quantum Supremacy” and random circuit sampling (RCS) supports the conclusion that ideal quantum computers can break the Strong Church-Turing thesis by performing computations that cannot be polynomially reduced to any classical computational model [34, 35, 17, 16, 36, 24, 37, 38].

Here we present a state-of-the-art classical high-performance simulator of large random quantum circuits (RQCs), called qFlex, and propose a systematic benchmark for Noisy Intermediate-Scale Quantum (NISQ) devices [20]. qFlex makes use of efficient single-precision matrix-matrix multiplications and an optimized tensor transpose kernel on NVIDIA GPUs. It utilizes efficient asynchronous pipelined task-based execution of tensor contractions implemented in its portable computational backend (TAL-SH library) that can run on both multicore CPU and NVIDIA GPU architectures. Our GPU implementation of qFlex reaches a peak performance of 92% and delivers a sustained absolute performance of 68% in simulations spanning the entire Summit supercomputer (281 Pflop/s single precision without mixed-precision acceleration). Due to our communication-avoiding algorithm, this performance is stable with respect to the number of utilized nodes, thus also demonstrating excellent strong scaling.

I-a RCS protocol

While these demonstrations show progress in the development of quantum computing, it remains challenging to compare alternative quantum computing platforms and to evaluate advances with respect to the current dominant (classical) computing paradigm. The leading proposal to fulfill this critical need is quantum random circuit sampling (RCS) [36, 24, 38, 16], that is, to approximately sample bitstrings from the output distribution generated by a random quantum circuit. This is a ”hello world” program for quantum computers, because it is the task they are designed to do. This proposal also counts with the strongest theoretical support in complexity theory for an exponential separation between classical and quantum computation [36, 17]. Classically, it requires a simulation of the quantum computation, and the computational cost increases exponentially in the number of qubits for quantum circuits of sufficient depth. Storing the wave function from the output of a random quantum circuit with 40 qubits requires 8.8 TB at single precision (

bytes), and 50 qubits require 9 PB. Although new algorithms such as qFlex are designed to avoid these memory requirements, calculating the necessary large number of probabilities for the output bitstrings of quantum circuits of sufficient depth with 40 qubits or more can only be achieved with HPC resources.

The RCS protocol to benchmark NISQ computers consists of the following steps:

  1. Generate random circuits fitted to a given NISQ architecture,

  2. Execute random circuits, collecting bitstrings, time to solution, and energy consumption of the quantum computer,

  3. Calculate probabilities of collected bitstrings using a classical simulation (qFlex) to measure the NISQ performance or fidelity (see below),

  4. Perform a classical noisy simulation (qFlex) based on calculated (point 3) fidelity to collect equivalent classical computational metrics of relevance – total floating point operations/time-to-solution/power consumption.

circuit = RandomCircuit() size = 10**6 # Run on real hardware. samples = Sample(circuit, size) # Determine probabilities with simulation p = 1 for s in samples: # EXPENSIVE! p *= Probability(circuit, s) # Derive cross entropy. cross_entropy = -log(p) / size

Fig. 1: Cross-entropy benchmarking (XEB) pseudo-code.

The NISQ fidelity (item 3 above) of the RCS protocol can be estimated using qFlex to calculate the probabilities of the bitstrings obtained in the NISQ experiment. This is done by using the cross-entropy benchmarking (XEB), see Fig. 

1 and Ref. [16]. Then, the so-called heavy output generation (HOG) score is also calculated with the same data [36]. An alternative benchmark for NISQ devices called “Quantum Volume” [39] is also based on RCS with additional requirements in the model circuits (item 1 above).

I-B Motivation for RCS as a Standard Benchmark

The purpose of benchmarking quantum computers is twofold: to objectively rank NISQ computers as well as to understand quantum computational power in terms of equivalent “classical computing” capability. Therefore, to achieve the above objectives, a good benchmark for NISQ computers must have the following properties:

  • [label=-, leftmargin=]

  • A good predictor of NISQ algorithm performance. In the classical computing world, Linpack is used as a benchmark for HPC Top-500 and serves as a relatively good predictor of application performance. Similarly, RCS serves as a good “multi-qubit” benchmark because:

    • [label=, leftmargin=20pt]

    • The computation is well defined,

    • The computation requires careful control of multiple qubits,

    • It measures the fidelity of a NISQ computer using cross-entropy benchmarking (XEB) [16, 18]

    • Works for any set of universal gates.

  • Estimating the equivalent classical worst-case computation resources with commonly used metrics such as time-to-solution speedup, equivalent floating point operations, and equivalent power usage metrics. Typically, advancements in classical algorithms could invalidate a quantum benchmark result. A key feature of RCS is that there is a large body of theory in computational complexity against an efficient classical algorithm [34, 35, 17, 16, 36, 24, 37, 38]. Furthermore, qFlex for RCS requires computational resources proportional to the sampling fidelity, taking into account NISQ errors [29, 30]. Lastly, quantum random circuits produce quasi-maximal entanglement in the ideal case.

  • Architectural neutrality. Different architectures (e.g. ion traps vs superconducting qubits) have very different performance capabilities in terms of:

    • [label=, leftmargin=20pt]

    • Number of qubits

    • Types of gates and fidelity

    • Connectivity

    • Number of qubits per multi-qubit operation

    • Number of parallel executions

    RCS allows a fair comparison across architectures by estimating the classical computational power of each given sampling task.

  • Useful for calibration of NISQ devices. Critical issues such as cross-talk [3, 4, 18] only come up when scaling multi-qubit computations. XEB with RCS allows for calibration on different sets of qubits at different depths.

Other techniques for benchmarking NISQ devices, such as randomized benchmarking [40, 41, 42], cycle benchmarking [43] or preparing entangled GHZ states [5] are restricted to non-universal gate sets and easy to simulate quantum circuits. This is an advantage for extracting circuit fidelities of non-universal gate sets, but does not measure computational power.

I-C Energy Advantages of Quantum Computing

The largest casualty as we approach the end of Moore’s law has been the energy efficiency gains due to Dennard scaling, also known as Koomey’s law. Today’s HPC data centers are usually built within the constraints of available energy supplies, rather than the constraints in hardware costs. For example, the Summit HPC system at Oak Ridge National Laboratory has a total power capacity of 14 MW available to achieve a design specification of 200 Pflop/s double-precision performance. To scale such a system by 10x would require 140 MW of power, which would be prohibitively expensive. Quantum computers on the other hand have the opportunity to drastically reduce power consumption [44, 45].

To compare the energy consumption needed for a classical computation to the corresponding needs of a quantum computation, it is important to understand the sources of energy consumption for quantum computers. For superconducting quantum computers such as Google’s (Fig. 2), the main sources of energy consumption are:

  1. [label=-, leftmargin=]

  2. Dilution Refrigerator - because quantum computers operate at around the 15 mK range, a large dilution refrigerator that typically consumes 10 kW is necessary.

  3. Electronic racks - these typically consume around 5 kW of power. They consist of oscilloscopes, microwave electronics, Analog-Digital converters, and clocks.

Fig. 2: A side by side view of a dilution refrigerator and electronic rack for Google’s quantum computer.

Therefore a typical quantum computer will consume approximately 15 kW of energy for a single QPU of 72 qubits. Even as qubit systems scale up, this amount is unlikely to significantly grow. The main reasons are: dilution refrigerators are not expected to scale up; the power for the microwave electronics on a per-qubit basis will decrease as quantum chips scale in the number of qubits. The last component needed to perform a fair comparison between classical and quantum computing is the rate at which quantum computers can execute a circuit. In the case of superconducting qubits, the circuit execution rate is approximately 10 kHz to 100 kHz.

For classical computers, different architectures can result in different energy consumption rates. In this paper, we compared the consumption of NASA’s supercomputer Electra and ORNL’s supercomputer Summit. On ORNL’s Summit, each node consumes around 2.4 kW of energy and consists of 2 IBM Power9 CPUs and 6 NVIDIA V100 GPUs, with a total of 4608 nodes (1.8 kW per node is consumed by 6 NVIDIA Volta GPUs). On NASA’s Electra supercomputer, each of the 2304 40-core Skylake nodes consumes about 0.52 kW, and each of the 1152 28-core Broadwell nodes consumes about 0.38 kW.

Ii Current State-of-the-art

With the advent of NISQ devices, different classes of classical simulators have blossomed to tackle the demanding problem of simulating large RQCs. To date, classical simulators of RQCs can be classified into three main classes (plus a fourth category of “hybrid” algorithms):

  • [label=-, leftmargin=]

  • Direct evolution of the quantum state. This approach consists in the sequential application of quantum gates to the full representation of an qubit state. While this method does not have any limitation in terms of gates and topology of RQCs, the memory required to store the full quantum state quickly explodes as , and taking into account the non-negligible overhead given by the node-to-node communication to update the full quantum state. Thus, this approach becomes impractical for the simulation and benchmark of NISQ devices with qubits. Beyond 40 qubits it already requires fast internode connectivity [19, 21, 22, 25, 31].

  • Perturbation of stabilizer circuits. It is known that stabilizer circuits, i.e. quantum circuits made exclusively of Clifford gates, can be efficiently simulated and sampled on classical computers [46]. An intriguing idea consists in representing RQCs as a linear combination of different stabilizer circuits [47, 48, 49]. Therefore, the computational complexity to simulate RQCs becomes proportional to the number of stabilizer circuits required to represent them. Unfortunately, doubles every time a non-Clifford gate (for instance, a rotation) is used. Furthermore, the number of non-Clifford gates grows quickly with the depth of RQCs [16] and, in general, much faster than the number of qubits. Consequently, stabilizer circuits are not suitable to benchmark NISQ devices using RQCs.

  • Tensor network contractions. The main idea of this approach is based on the fact that any given quantum circuit can always be represented as a tensor network, where one-qubit gates are rank-2 tensors (tensors of 2 indexes with dimension 2 each), two-qubit gates are rank-4 tensors (tensors of 4 indexes with dimension 2 each), and in general -qubit gates are rank- tensors [50, 51] (this approach should not be confused with another application of tensor network theory to quantum circuit simulations where tensor networks are used for wave-function compression [52]). In general, the computational and memory cost for contracting such networks is (at least) exponential with the number of open indexes (corresponding to the input state and output state, respectively). Therefore, for large enough circuits, the network contraction is impractical. Nonetheless, once the input and output states are specified through rank-1 Kronecker projectors, the computational complexity drastically reduces. More precisely, the computational and memory costs are dominated by the largest tensor during the contraction [50]. The size of the largest contraction can be estimated by computing the treewidth of the underlying tensor network’s line graph which is, for most of the NISQ devices, proportional to the depth of the RQCs. This representation of quantum circuits gives rise to an efficient simulation technique to simulate RQCs for XEB. Among the most relevant tensor-based simulators, it is worth to mention the following:

    • [label=]

    • Undirected graphical model. An approach based on algorithms designed for undirected graphical models, closely related to tensor network contractions, was introduced in [53]. Later, the Alibaba group used the undirected graphical representation of RQCs to take advantage of underlying weaknesses of the design [28]. More precisely, by carefully projecting the state of a few nodes in the graph, the Alibaba group was able to reduce the computational complexity to simulate large RQCs. However, it is important to stress that [28] reports the computational cost to simulate a class of RQCs which is much easier to simulate than the class of RQCs reported in Ref. [16]. Indeed, Chen et al. fail to include the final layer of Hadamard gates in their RQCs and use more gates in detriment of non-diagonal gates at the beginning of the circuit. For these reasons, we estimate that such class is about easier to simulate than the RQCs available in [54], as discussed in Ref. [30]. The latter are the circuits simulated in this submission, as well as in [29, 30].

    • Quantum teleportation-inspired algorithms. Recently, Chen et al. have proposed a related approach for classical simulation of RQCs inspired by quantum teleportation [55]. Such approach allows to “swap” space and time to take advantage of low-depth quantum circuits (equivalent to a Wick rotation in the imaginary time direction). While the approach is interesting per se, the computational performance is lower than that one achieved by qFlex [30].

  • Hybrid algorithms. Several works have explored algorithms based on splitting grids of qubits in smaller sub-circuits, which are then independently simulated [26, 29]. As an example, every time a -gate crosses between sub-circuits, the number of independent circuits to simulate is duplicated. Therefore, the computational cost is exponential in the number of entangling gates in a cut. MFIB [29] also introduced a technique to “match” the target fidelity of the NISQ device, which actually reduces the classical computation cost by a factor . By matching the fidelity of a realistic quantum hardware (), MFIB was able to simulate and grids with depth by numerically computing amplitudes in respectively and core hours on Google cloud. However, MFIB becomes less efficient than qFlex for grids beyond qubits because of memory requirements. In principle, one could mitigate the memory requirements by either using distributed memory protocols like MPI, or by partitioning the RQCs in more sub-circuits. Nevertheless, this has been impractical so far. Moreover, the low arithmetic intensity intrinsic to this method (which relies on the direct evolution of subcircuits) makes it not scalable for flop-oriented heterogeneous architectures like Summit.

Ii-a qFlex

In 2018, NASA and Google implemented a circuit simulator – qFlex – to compute amplitudes of arbitrary bitstrings and ran it on NASA’s Electra and Pleiades supercomputers[30]. qFlex novel algorithmic design emphasizes communication avoiding and minimization of memory footprint, and it can also be reconfigured to optimize for local memory bandwidth. The computational resources required by qFlex for simulating RCS are proportional to the fidelity of the NISQ device [29, 30]. qFlex provides a fast, flexible and scalable approach to simulate RCS on CPUs (see Section III). In this work, qFlex has been redesigned and reimplemented in collaboration with the Oak Ridge National Laboratory in order to be able to utilize the GPU-accelerated Summit HPC architecture efficiently (see Section III). The computational results presented here, achieving more than 90% of peak performance efficiency and 68% sustained performance efficiency on the World’s most powerful classical computer, define the frontier for quantum computing to declare Quantum Supremacy. Moreover, given configurability of qFlex as a function of RAM and desired arithmetic intensity, it can be used to benchmark not only NISQ computers, but also classical high-end parallel computers.

Iii Innovation Realized

qFlex is a flexible RQC simulator based on an innovative tensor contraction algorithm for classically simulating quantum circuits that were beyond the reach for previous approaches [30]. More precisely, qFlex is by design optimized to simulate the generic (worst) case RQCs implemented on real hardware, which establish the baseline for the applications of near-term quantum computers. Moreover, qFlex is agnostic with respect to the randomness in the choice of single-qubit gates of the RQCs. Therefore, it presents no fluctuations in performance from one circuit to another in a given ensemble.

For the sake of simplicity, from now on we will focus on planar NISQ devices where qubits are positioned on a grid, with the only entangling gates being gates between two adjacent qubits. As RQCs, we use the prescription described in Refs. [16, 54], which has been shown to be hard to simulate.

Unlike other approaches, the first step of the qFlex algorithm consists of contracting the RQC tensor network in the “time” direction first. This step allows us to reduce the RQC to a regular 2D grid of tensors that are then contracted to produce a single quantum amplitude bitstring (see Fig. 3

). The regularity of the resulting tensors is highly advantageous for exploiting modern classical computer hardware architectures characterized by multi-level cache systems, large vector lengths and high level of parallelism, which requires high arithmetic intensity. The chosen order of tensor contractions is extremely important for minimization of the dimension of the largest tensor during the contraction.

In addition, qFlex expands on a technique introduced in [26, 27, 28, 29], which is based on systematic tensor slicing via fine-grained “cuts” that enable us to judiciously balance memory requirements with the number of concurrent computations. Furthermore, by computing an appropriate fraction of “paths”, it is possible to control the “fidelity” of the simulated RCS, and therefore to “mimic” the sampling of NISQ devices. Our simulator can produce a sample of bitstrings with a target fidelity at the same computational cost as to compute noiseless amplitudes. More importantly, the use of systematic tensor slicing in our tensor contraction algorithm results in complete communication avoiding between MPI processes, thus making our simulator embarrassingly scalable by design. The number of tensor “cuts” can be properly tuned to fit each independent computation into a single node. Therefore, all the computations required are run in parallel with no communication between compute nodes. This is an important design choice that allowed us to avoid the communication bottleneck that would otherwise highly degrade the performance of qFlex, in particular when GPUs are used.

Finally, compared to other approaches, qFlex is the most versatile and scalable simulator of large RCS. In our recent paper [30], qFlex has been used to benchmark the Google Bristlecone QPU of 72-qubits. To date, our Bristlecone simulations are the largest numerical computation in terms of sustained Pflop/s and the number of nodes utilized ever run on NASA Ames supercomputers, reaching a peak of Pflop/s (single precision), corresponding to an efficiency of of the Pleiades and Electra supercomputers. On Summit, we were able to achieve a sustained performance of 281 Pflop/s (single precision) over the entire supercomputer, which corresponds to an efficiency of 68%, as well as a peak performance of 92%, simulating circuits of 49 and 121 qubits on a square grid.

To achieve the performance reported here, we have combined several innovations. Some of them have appeared previously in the literature, as is explained in detail in their corresponding sections (see Ref. [30] for more details on the techniques and methods used here).

Iii-a Systematic tensor slicing technique: Communication avoiding and memory footprint reduction

Fig. 3: (Top Left): Example of gates (applied in the time direction) for a single qubit. (Top Right): RQCs representation as a 3D grid of tensors. Each cube is a rank-6 tensor and two adjacent cubes share a single bond of dimension 2. (Bottom Left): 2D network of tensors after contracting in “time” first. (Bottom Right): Example of cuts applied to RQCs. The tensor network is divided into tensors of at most rank 30; once tensors , , and are contracted with each other (with high-arithmetic-intensity) onto tensor , is contracted with the individual tensors on region , one at a time.. After tensor is formed, the dotted qubits are used for the fast sampling technique (see Section III-B).

For RCS with sufficient depth, the contraction of the corresponding 2D grid of tensors will inevitably lead to the creation of a temporary tensor whose size exceeds the total amount of memory available in a single node. For instance, for a RQC of depth , the memory footprint exceeds the terabytes of memory. To limit the memory footprint and allow the possibility to contract RQCs on single nodes without communication, we further expand and optimize the technique of using “cuts” [26, 27, 28, 29].

Given a tensor network with tensors and a set of indexes to contract , , we define a cut over index as the decomposition of the contraction into . This results in the contraction of up to many tensor networks of lower complexity, namely those covering all slices over index of the tensors involving that index. The resulting tensor networks can be contracted independently, which results in a computation that is embarrassingly parallelizable. It is possible to make more than one cut on a tensor network, in which case refers to a multi-index, as well as to adaptively choose the size of the slices. The contribution to the final sum of each of the contractions (each of the slices over the multi-index cut) is a “path”, and the final value of the contraction is the sum of all path contributions.

Fig. 3 depicts the cut we used for RQCs of size and depth . The chosen cut allows us to reduce the amount of memory to fit six contractions on each Summit node (one FPR each GPU). Moreover, as explained in the next sections, such cuts will greatly improve the sampling of amplitudes for a given RQC.

Iii-B Fast sampling technique

Sampling is a critical feature of random circuit sampling (RCS) and, therefore, it must be optimized to reduce the time-to-solution. To this end, we use a frugal rejection sampling which is known to have a low overhead for RCS [29]. In addition, as explained in Ref. [30], a faster sampling can be achieved by recycling most of the large contraction for a given amplitude calculation. More precisely, if different tensors and must be contracted to get the amplitude of bitstring , our fast sampling technique first contracts all tensors (for a given ) to get a tensor . Then, is recycled and contracted to for many different . Since the most computationally demanding part is the calculation of , many amplitudes can be computed with different (but the same ). Since RQCs are extremely chaotic, arbitrary close bitstrings are uncorrelated [16]. These amplitudes are used to perform the frugal rejection sampling of one single output bitstring in the output sample, and the computation is restarted with a new random to output the next bitstring. For deep enough RQCs and by carefully choosing the size of , one can reduce the sampling time by an order of magnitude, as shown in [30].

Iii-C Noisy simulation

Given the noisy nature of NISQ devices, simulating noisy RCS is of utmost importance. Intuitively, one would expect that noisier RCS would be easier to simulate. Indeed, as shown in Ref. [29], independent tensor contractions to compute a single amplitude can be seen as orthogonal “Feynman paths” (or simply “paths”). Since RQCs are very chaotic and Feynman paths typically contribute equally to the final amplitude, computing only a fraction of paths is equivalent to computing noisy amplitudes of fidelity . The computational cost for a typical target fidelity of for NISQ devices is therefore reduced by a factor of  [29, 30].

While computing only a fraction of paths allows for a speedup in the simulation of noisy devices, computing fewer amplitudes with higher fidelity is an alternative method to achieve a similar speedup, as we introduced in Ref. [30]. In particular, computing a fraction of perfect fidelity amplitudes from a target amplitude set lets us simulate RCS with fidelity . In general, the fraction of amplitudes computed times their fidelity equals the fidelity of the simulated sampling.

Iii-D Optimization of the tensor contraction order for optimal time-to-solution

The optimization of the number and position of the cuts and the subsequent tensor contraction ordering is fundamental for minimizing the total flop count and the time-to-solution in evaluating a given tensor network and to simulate sampling. Note that on modern flop-oriented computer architectures the minimal flop count does not necessarily guarantee the minimal time-to-solution, making the optimization problem even more complicated. Indeed, different tensor contraction orderings can result in performances that differ by orders of magnitude. In general, finding the optimal cuts and tensor contraction order is an NP-Hard problem (closely related to the tree-decomposition problem [50]). However, for planar or quasi-planar tensor network architectures produced by most of the NISQ devices, such tensor cuts can be found by carefully splitting the circuits into pieces that minimize the shared boundary interface; furthermore, we find that choosing a contraction ordering that prioritizes a few large, high arithmetic intensity contractions over many smaller, low arithmetic intensity contractions, often provides large speedups, and therefore better time-to-solution.

Iii-E Out-of-core asynchronous execution of tensor contractions on GPU

The novel tensor slicing technique used by the qFlex algorithm removes the necessity of the inter-process MPI communication and controls the memory footprint per node, thus paving the way to scalability. In order to achieve high utilization of the hardware on a heterogeneous Summit node, we implemented an out-of-core GPU tensor contraction algorithm in the TAL-SH library [56], which serves as a computational backend in qFlex. The TAL-SH library is a numerical tensor algebra library capable of executing basic tensor algebra operations, most importantly tensor contraction, on multicore CPU and NVIDIA GPU hardware. The key features of the TAL-SH library that allowed us to achieve high-performance on GPU-accelerated node architectures are: (a) fast heterogeneous memory management; (b) fully asynchronous execution of tensor operations on GPU; (c) fast tensor transpose algorithm; (d) out-of-core algorithm for executing large tensor contractions on GPU (for tensor contractions that do not fit into individual GPU memory).

The fast CPU/GPU memory management, and in general fast resource management, is a necessary prerequisite for achieving high level of asynchronism in executing computations on CPU and GPU. TAL-SH provides custom memory allocators for Host and Device memory. These memory allocators use pre-allocated memory buffers acquired during library initialization. Within such a buffer, each memory allocator implements a simplified version of the “buddy” memory allocator used in Linux. Since the memory allocation/deallocation occurs inside pre-allocated memory buffers, it is fast and it is free from highly unwanted side effects associated with regular CUDA malloc/free, for example serialization of asynchronous CUDA streams.

All basic tensor operations provided by the TAL-SH library can be executed asynchronously with respect to the CPU Host on any GPU device available on the node. The execution of a tensor operation consists of two phases: (a) scheduling (either successful or unsuccessful, to be re-tried later); (b) checking for completion, either testing for completion or waiting for completion (each scheduled tensor operation is a TAL-SH task with its associated TAL-SH task handle that is used for completion checking). All necessary resources are acquired during the scheduling phase, thus ensuring an uninterrupted progress of the tensor operation on a GPU accelerator via a CUDA stream. The tensor operation scheduling phase includes scheduling the necessary data transfers between different memory spaces, which are executed asynchronously as well. The completion checking step also includes consistency control for images of the same tensor in different memory spaces. All these are automated by TAL-SH.

The tensor contraction operation in TAL-SH is implemented by the general Transpose-Transpose-GEMM-Transpose (TTGT) algorithm, with an optimized GPU tensor transpose operation [57] (see also Ref. [58]). In the TTGT algorithm, the tensor contraction operation is converted to the matrix-matrix multiplication via permuting tensor indexes. The performance overhead associated with the tensor transpose steps can be as low as few percent, or even lower for highly arithmetically intensive tensor contractions, when using the optimized implementation. Also, all required Host-to-Device and Device-to-Host data transfers are asynchronous, and they overlap with kernels execution in other concurrent CUDA streams, thus minimizing the overhead associated with CPU-GPU data transfers. Additionally, TAL-SH provides an automated tensor image consistency checking mechanism, enabling tensor presence on multiple devices with transparent data consistency control (tensor image is a copy of the tensor on some device).

In this work, we used the regular complex FP32 CGEMM implementation provided by the cuBLAS library, without tensor core acceleration. The possibility of reducing the input precision to FP16 in order to use Volta’s tensor cores is strongly conditioned by the precision needed for simulating large quantum circuits. The average squared output amplitude of a quantum circuit of qubits is , which for (simulated in this work) is of the order of and for (also simulated here) is of the order of . We avoid the possibility of underflowing single precision by normalizing the input tensors, and renormalizing them back at the end of the computation.

Finally, an important extension to the basic TAL-SH functionality, which was the key to achieving high performance in this work, is the out-of-core algorithm for executing large tensor contractions on GPU, that is, for executing tensor contractions which do not fit into individual GPU memory. Although the Summit node has IBM AC922 Power9-Volta architecture with unified CPU-GPU coherent memory efficiently synchronized via NVLink, we did not rely on the unified memory abstraction provided by the CUDA framework for two reasons. First, although the hardware assisted memory page migration is very efficient on this architecture, we believe that the explicitly managed heterogeneous memory and data transfers deliver higher performance for tensor contractions (this is an indirect conclusion based on the performance of similar algorithms on Summit). Second, the TAL-SH library prioritizes performance portability among other things, that is, it is meant to deliver high performance on other accelerated HPC systems as well, many of which do not have hardware-assisted coherence mechanism between CPU and GPU.

The out-of-core tensor contraction algorithm implemented in TAL-SH is based on recursive decomposition of a tensor operation (tensor contraction in this case) into smaller tensor operations (tensor contractions) operating on slices of the original tensors, following the general philosophy presented in Ref. [59]. During each decomposition step, the largest tensor dimension associated with the largest matrix dimension in the TTGT algorithm is split in half (in TTGT algorithm multiple tensor dimensions are combined into matrix dimensions, thus matricizing the tensors). This ensures maximization of arithmetic intensity of individual derived tensor contractions, which is important for achieving high performance on modern flop-oriented computer architectures, like the NVIDIA Volta GPU architecture employed in this work (arithmetic intensity is the flop-to-byte ratio of a given operation). The recursive decomposition is repeated until all derived tensor operations fit within available GPU resources. Then the final list of derived tensor operations is executed by TAL-SH using a pipelined algorithm in which multiple tensor operations are progressed concurrently, thus overlapping CPU and GPU elementary execution steps. In general, TAL-SH allows tuning of the number of concurrently progressed tensor operations, but in this work we restricted it to 2. Each tensor operation has 5 stages of progress:

  1. Resource acquisition on the execution device (GPU);

  2. Loading input tensor slices on Host (multithreaded);

  3. Asynchronous execution on GPU: Fast additional resource acquisition/release, asynchronous data transfers, asynchronous tensor transposes, asynchronous GEMM;

  4. Accumulating the output tensor slice on Host (multithreaded);

  5. Resource release on the execution device (GPU).

The pipelined progressing algorithm is driven by a single CPU thread and is based on the “asynchronous yield” approach, that is, each active (concurrent) tensor operation proceeds through its consecutive synchronous stages uninterrupted (unless there is a shortage in available resources, in which case it is interrupted), but it “yields” to the next concurrent tensor operation once an asynchronous stage starts. The asynchronous GPU execution step, mediated by the CUDA runtime library, involves asynchronous Host-to-Device and Device-to-Host data transfers, asynchronous (optimized) tensor transpose kernels and asynchronous (default) CGEMM cuBLAS calls. Since each NVIDIA Volta GPU has multiple physical data transfer engines, all incoming and outgoing data transfers are overlapped with the CUDA kernels execution in different CUDA streams, thus almost completely removing CPU-GPU data transfer overhead. Overall, this algorithm, combined with fast tensor transpose and efficient cuBLAS GEMM implementation, results in highly efficient execution of large tensor contractions on GPU, as demonstrated by our results (up to 96% of Volta’s theoretical flop/s peak).

Iv How Performance Was Measured

We used an analytical approach to measuring the number of floating point operations performed by our algorithm. Our computational workload consists of a series of tensor contractions operating on dense tensors of rank 1 to 10. Thus, the total number of Flops executed by an MPI process is the sum of the flop counts of each individual tensor contraction executed by that MPI process. Each individual tensor contraction has a well defined flop count:


where , , are the volumes of the tensors participating in the tensor contraction (volume is the number of tensor elements in a tensor), and the factor of 8 shows up due to the use of the complex multiply-add operation (4 real multiplications plus 4 real additions). All tensors used in our simulations are of complex single precision (complex FP32), SP.

For the timing of the simulations, we do not include job launching, MPI_Init(), nor the initialization of the TAL-SH library on each MPI process, as well as the final time taken to write the results to file by the master MPI process (scheduler). In order to do so, the two timings are reported after two MPI_Barrier() synchronization calls.

Based on the analytical flop count calculation, we have computed multiple performance metrics in this work. The average SP flop/s count, , is computed by dividing the total flop count of our simulation by its execution time. The peak SP flop/s count, , is computed by dividing the total flop count of the largest tensor contraction by the sum of its execution times on each node, times the number of nodes. Additionally, we have computed the energy efficiency of the Summit hardware with respect to our simulation. The average flop/s/watt value, , is computed by dividing the total flop count of our simulation by the total energy consumed by Summit during the simulation. This energy consumption value is actually the upper bound because it includes all components of Summit, some of which, like disk storage, were not actively used by our simulation. The Summit HPC system at Oak Ridge National Laboratory has a total power capacity of 14 MW available to achieve a design specification of slightly more than 200 Pflop/s peak double-precision performance, and more than 400 Pflop/s single-precision performance. This energy powers 4608 GPU-enabled compute nodes. Each Summit node contains 2 Power9 CPU, 6 NVIDIA Volta V100 GPUs, 512 GB of DDR4 RAM, 16 GB of HBM2 on-chip GPU memory in each GPU, 1.6 TB of non-volatile NVMe storage, and 2 physical network interface cards, with a total power cap of approximately 2400W, of which up to 1800W are consumed by the 6 NVIDIA Volta GPUs.

V Performance Results

Fig. 4: Power consumption of Summit, at a given time, during our simulations. Note that most of the power is consumed by the GPUs. (Top) over 4600 nodes. (Bottom) over 4550 nodes.
PFlop/s Efficiency (%)
Circuit Size Nodes Used Runtime (h) Peak Sust. Peak Sust. Power (MW) Energy Cost (MWh) PFlop/s/MW
2300 4.84 191 142 92.0 68.5 - - -
4600 2.44 381 281 92.1 68.0 8.075 21.1 34.8
4550 0.278 368 261 89.8 63.7 7.3 2.32 35.8
TABLE I: Performance of the simulation of our three runs on Summit. For the circuit we computed million amplitudes with fidelity on each run, while for the circuit we computed 325000 amplitudes of fidelity , which equivalently allows us to sample with fidelity (see Section III-C). The performance is compared against the theoretical peak performance, where a maximum of single-precision 15 Tflop/s is considered for each NVIDIA Volta GPU, and then extrapolated to the number of nodes used (with 6 GPUs per node). The results show a stable sustained performance between the simulation of the circuit on 2300 (50% of Summit) and 4600 nodes (100% of Summit). The (average) power consumption reported was measured for the entire Summit, and is therefore not considered when running on half of the nodes.

We have simulated RCS on two large hard RQCs running on either half or the entire Summit supercomputer. Both circuits belong to the class of revised, worst-case circuits on planar graphs (no design weaknesses exploited) [29] and can be found in [54] (inst_7x7_41_0.txt and inst_11x11_25_0.txt). See Table I for a summary of our performance results.

V-a GPU results and peak performance

Most of the simulation time is spent in arithmetically intensive, out-of-core contractions of large tensors (with volumes as large as ) that do not necessarily fit in the GPU memory. On these contractions, we achieve a performance efficiency of over with respect to the theoretical single-precision peak of 15 Tflop/s for the NVIDIA Volta V100 GPUs, and this efficiency goes as high as 96%. For this reason, we compute our peak performance as the average performance of these largest tensor contractions times the number of GPUs corresponding to the number of nodes used in each simulation. We therefore achieve a peak performance of 92% (381 Pflop/s when running on 100% of Summit) in the simulation, and 90% (367 Pflop/s when running on 99% of Summit) in the simulation.

V-B Sustained performance

While we successfully slice (“cut”) the tensors in our circuits and contract them in an ordering that leaves most of the computation to large arithmetically intensive tensor contractions, it is inevitable to deal with a non-negligible number of less arithmetically intensive tensor contractions, which achieve a suboptimal performance. Averaged over the entire simulation time, we reach a sustained performance efficiency of about 68% (281 Pflop/s on 100% of Summit) for the simulation, and about 64% (261 Pflop/s on 99% of Summit) for the simulation.

V-C Scaling

Due to the communication-avoiding design of our algorithm, the impact of communication on scaling is negligible, and therefore the performance is stable as a function of the number of nodes used, demonstrating an excellent strong scaling (see Table I).

V-D Energy consumption

We report the energy consumed by Summit in our full scale simulations. We achieve a rate of 34.8 Pflop/s/MW for the simulation, and 35.8 Pflop/s/MW for the simulation. Note that the power consumed should be taken as an upper bound, since it takes into account the consumption of the entire machine, including components that were not used in the simulations. Furthermore, we report the energy consumed by the entire job, including job launch and initialization of the TAL-SH library, which again slightly lifts the upper bound; this might be part of the reason for obtaining a slightly larger flop per energy rate on the shorter simulation (, see Fig. 4), where the actual computing time is smaller in relative terms.

Runtime (hours) Energy Cost (MWh)
Circuit Size Target Fidelity (%) Electra Summit QPU Electra Summit QPU
0.5 59.0 2.44 0.028 96.8 21.1
TABLE II: Estimated runtimes and energy cost for the sampling of amplitudes with fidelity close to on NASA Electra, ORNL Summit, and a superconducting NISQ device (QPU). For the QPU, we assume a sampling rate of 10 kHz and a power consumption of 15kW (see Section I-C). Note that the large improvements on time-to-solution and energy consumption of the QPU as compared to state-of-the-art classical supercomputers do not extrapolate to other applications. However, the “hello world” nature of the sampling problem establishes the baseline for the applicability of quantum computers.

Vi Implications and Conclusions

As we explore technologies beyond Moore’s law, it is important to understand the potential advantages of quantum computing both in terms of time-to-solution and energy consumption. The research presented here also provides a comparison of different classical computing architectures, comparing Random Circuit Sampling implementations on both CPU and GPU based supercomputers (see Table II).

In conclusion, the implications of the proposed research are the following:

  • [label=-, leftmargin=]

  • Establishes minimum hardware requirements for quantum computers to exceed available classical computation. qFlex is able to calculate the required fidelity, number of qubits, and gate depth required for a quantum computer to exceed the computational ability of the most powerful available supercomputer for at least one well defined computational task: RCS.

  • Objectively compares quantum computers in terms of computational capability. Different architectures (e.g. ion traps vs superconducting qubits) vary in:

    1. [leftmargin=20pt]

    2. Number of qubits

    3. Types of gates

    4. Fidelity

    5. Connectivity

    6. Number of qubits per multi-qubit operation

    7. Number of parallel executions

    RCS allows comparisons across different qubit architectures estimating the equivalent amount of classical computation with commonly used metrics.

  • Establishes a multi-qubit benchmark for non-Clifford gates. Prior to XEB, there was no benchmarking proposal to measure multiqubit fidelity for universal (non-Clifford) quantum gate sets. qFlex enables XEB on a large number of qubits and large circuit depths by efficiently using classical computing resources. XEB allows quantum hardware vendors to calibrate their NISQ devices.

  • Objectively compares classical computers for simulating large quantum many-body systems. Using RCS as a benchmark for classical computers, one can compare how different classical computers perform simulation of large quantum many-body systems across different classical computational architectures. In this paper, we specifically compare NASA’s Electra supercomputer, which is primarily powered by Intel Skylake CPUs, with ORNL’s Summit supercomputer, which is primarily powered by NVIDIA Volta GPUs.

  • Objectively compares computer energy consumption requirements across a variety of quantum and classical architectures for one specific computational task. We estimated the energy cost to produce a sample of bistrings for 2 different circuits across classical CPU, GPU, and superconducting quantum computer architectures (see Table II). The classical computers would take 96.8 MWh and 21.1 MWh for NASA’s Electra and ORNL’s Summit, respectively. The factor of almost 5 of improvement of Summit over Electra is due to Summit’s GPU architecture. Comparing this to a superconducting quantum computer with a sampling rate of 10 kHz, a quantum computer would have a 5 orders of magnitude improvement (see Table II). This separation in energy consumption performance is much greater than the time-to-solution performance improvement of 2 orders of magnitude. Furthermore, we expect the separation in energy performance to continue to grow faster than the time-to-solution performance. We emphasize that RCS is a computational task particularly favorable to quantum computers and an advantage is presently not achievable in practice for most other computational problems.

  • Guides near-term quantum algorithm and hardware design. The simulation capabilities of qFlex can be harnessed for such tasks as verifying that a hardware implementation of an algorithm is behaving as expected, evaluating relative performance of quantum algorithms that could be run on near-term devices, and suggesting hardware architectures most suitable for experimenting with specific quantum algorithms in the near term. The specific capabilities of qFlex are particularly useful for evaluating quantum circuits large enough that the full quantum state cannot be produced but for which qFlex can still compute the probability of individual outputs of interest. In the case of RCS, it suffices to compute the amplitudes of random bistrings to perform rejection sampling in order to produce a sample of bistrings [29, 30]

    . In other cases, it might be possible to use classical heuristics to determine what probabilities to calculate.


We are grateful for support from NASA Ames Research Center, NASA Advanced Exploration systems (AES) program, NASA Earth Science Technology Office (ESTO), and NASA Transformative Aeronautic Concepts Program (TACP), and also for support from AFRL Information Directorate under grant F4HBKC4162G001. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. We would like to thank Jack Wells, Don Maxwell, and Jim Rogers for their help in making Summit simulations possible and for providing hardware utilization statistics. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. (