I Introduction and Motivations
As we approach the end of Moore’s law, there is growing urgency to develop alternative computational models, such as Beyond-Von-Neumann architectures and Beyond-CMOS hardware. Since Richard Feynman envisioned the possibility of using quantum physics for computation [1, 2], many theoretical and technological advances have been made. Quantum computing is both Beyond Von Neumann and Beyond CMOS, and it is potentially the most transformative technology in computing.
Recent experiments have underscored the remarkable potential for quantum
computing to offer new capabilities for computation. Experiments using
superconducting electronic circuits have shown that control over quantum
mechanics can be scaled to modest sizes with relatively good accuracy
[3, 4, 5].
By exercising control over quantum physical systems, quantum computers are
expected to improve the time to solution and/or the accuracy of a variety of
applications [6, 7, 8, 1, 9, 10, 11, 12], while simultaneously reducing power
consumption.
For instance, important encryption protocols such as RSA, which is widely used
in private and secure communications, can theoretically be broken using quantum
computers [13].
Moreover, quantum simulation problems, including new material design and drug
discovery, could be accelerated using quantum processors
[1, 9, 10, 11, 12, 14].
However, building a universal, fault-tolerant quantum computer is, at the moment, a long-term goal driven by strong theoretical
[15, 16, 17] and experimental [3, 4, 18] evidence. Nonetheless, despite the lack of error-correction mechanisms to run arbitrary quantum algorithms, Noisy Intermediate-Scale Quantum (NISQ) devices of about 50–100 qubits are expected to perform certain tasks that surpass the capabilities of classical high-performance computing systems
[19, 20, 17, 21, 16, 22, 23, 24, 25, 26, 27, 18, 28, 29, 30, 31], therefore achieving Quantum Supremacy. Examples of NISQ devices are the 50-qubit IBM QPU [32] and the 72-qubit Google Bristlecone QPU [33]. Both are “digital” universal quantum computers, namely they perform a universal set of discrete operations (“gates”) on qubits. Quantum algorithms are translated into quantum circuits and run on the NISQ device. The “depth” of a quantum circuit is defined as its number of clock cycles, where each clock cycle is composed of gate operations executed on distinct qubits. Given the noisy nature of NISQ devices, only shallow or low-depth quantum circuits can be run. An extensive body of work in complexity theory in the context of “Quantum Supremacy” and random circuit sampling (RCS) supports the conclusion that ideal quantum computers can break the Strong Church–Turing thesis by performing computations that cannot be polynomially reduced to any classical computational model [34, 35, 17, 16, 36, 24, 37, 38].
Here we present a state-of-the-art classical high-performance simulator of large random quantum circuits (RQCs), called qFlex, and propose a systematic benchmark for Noisy Intermediate-Scale Quantum (NISQ) devices [20]. qFlex makes use of efficient single-precision matrix-matrix multiplications and an optimized tensor transpose kernel on NVIDIA GPUs. It utilizes efficient asynchronous pipelined task-based execution of tensor contractions, implemented in its portable computational backend (the TALSH library), which can run on both multi-core CPU and NVIDIA GPU architectures. Our GPU implementation of qFlex reaches a peak efficiency of 92% and delivers a sustained efficiency of 68% (281 Pflop/s single precision, without mixed-precision acceleration) in simulations spanning the entire Summit supercomputer. Thanks to our communication-avoiding algorithm, this performance is stable with respect to the number of utilized nodes, thus also demonstrating excellent strong scaling.
I-A RCS Protocol
While these demonstrations show progress in the development of quantum computing, it remains challenging to compare alternative quantum computing platforms and to evaluate advances with respect to the current dominant (classical) computing paradigm. The leading proposal to fulfill this critical need is quantum random circuit sampling (RCS) [36, 24, 38, 16], that is, approximately sampling bitstrings from the output distribution generated by a random quantum circuit. This is a “hello world” program for quantum computers, because it is the task they are designed to do. This proposal also enjoys the strongest theoretical support in complexity theory for an exponential separation between classical and quantum computation [36, 17]. Classically, it requires a simulation of the quantum computation, whose cost increases exponentially in the number of qubits for quantum circuits of sufficient depth. Storing the wave function output by a random quantum circuit on 40 qubits requires 8.8 TB at single precision (2^40 amplitudes of 8 bytes each), and 50 qubits require 9 PB. Although new algorithms such as qFlex are designed to avoid these memory requirements, calculating the necessarily large number of probabilities for the output bitstrings of sufficiently deep quantum circuits on 40 or more qubits can only be achieved with HPC resources.
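The memory figures above follow from simple arithmetic on the 2^n amplitudes of an n-qubit state; a quick sanity check, assuming 8 bytes per single-precision complex amplitude:

```python
def state_vector_bytes(n_qubits: int, bytes_per_amplitude: int = 8) -> int:
    """Memory needed to store a full n-qubit state vector.

    8 bytes per amplitude = single-precision complex (2 x FP32).
    """
    return (2 ** n_qubits) * bytes_per_amplitude

print(state_vector_bytes(40) / 1e12)  # ~8.8 (TB)
print(state_vector_bytes(50) / 1e15)  # ~9.0 (PB)
```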
The RCS protocol to benchmark NISQ computers consists of the following steps:

1. Generate random circuits fitted to a given NISQ architecture;

2. Execute the random circuits, collecting bitstrings, time to solution, and energy consumption of the quantum computer;

3. Calculate the probabilities of the collected bitstrings using a classical simulation (qFlex) to measure the NISQ performance or fidelity (see below);

4. Perform a classical noisy simulation (qFlex), based on the fidelity calculated in step 3, to collect the equivalent classical computational metrics of relevance: total floating-point operations, time to solution, and power consumption.
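The fidelity measurement in step 3 uses cross-entropy benchmarking, described next. A toy sketch, assuming the linear XEB variant (the references also discuss a logarithmic cross-entropy) and a random stand-in for the circuit's output distribution:

```python
import numpy as np

def linear_xeb_fidelity(sampled_probs, n_qubits):
    """Linear cross-entropy benchmark: 2^n * <p(s_i)> - 1, where p(s_i)
    are the ideal probabilities (computed classically, e.g. by qFlex) of
    the bitstrings s_i actually produced by the device."""
    return (2 ** n_qubits) * np.mean(sampled_probs) - 1.0

rng = np.random.default_rng(0)
n = 16
d = 2 ** n
p = rng.dirichlet(np.ones(d))  # stand-in for a chaotic circuit's output distribution

ideal = rng.choice(d, size=50_000, p=p)  # a perfect device samples from p
noisy = rng.integers(d, size=50_000)     # a fully depolarized device is uniform

print(round(linear_xeb_fidelity(p[ideal], n), 2))  # close to 1
print(round(linear_xeb_fidelity(p[noisy], n), 2))  # close to 0
```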
The NISQ fidelity (step 3 above) of the RCS protocol can be estimated by using qFlex to calculate the probabilities of the bitstrings obtained in the NISQ experiment. This is done using cross-entropy benchmarking (XEB); see Fig. 1 and Ref. [16]. The so-called heavy output generation (HOG) score can also be calculated from the same data [36]. An alternative benchmark for NISQ devices called “Quantum Volume” [39] is also based on RCS, with additional requirements on the model circuits (step 1 above).

I-B Motivation for RCS as a Standard Benchmark
The purpose of benchmarking quantum computers is twofold: to objectively rank NISQ computers, and to understand quantum computational power in terms of equivalent “classical computing” capability. To achieve these objectives, a good benchmark for NISQ computers must have the following properties:


- A good predictor of NISQ algorithm performance. In the classical computing world, Linpack is used as the benchmark for the HPC Top500 list and serves as a relatively good predictor of application performance. Similarly, RCS serves as a good “multi-qubit” benchmark because:

  - The computation is well defined;
  - The computation requires careful control of multiple qubits;
  - It works for any set of universal gates.


- Estimation of the equivalent classical worst-case computational resources with commonly used metrics such as time-to-solution speedup, equivalent floating-point operations, and equivalent power usage. Typically, advancements in classical algorithms could invalidate a quantum benchmark result. A key feature of RCS is that there is a large body of computational-complexity theory against the existence of an efficient classical algorithm [34, 35, 17, 16, 36, 24, 37, 38]. Furthermore, qFlex for RCS requires computational resources proportional to the sampling fidelity, taking into account NISQ errors [29, 30]. Lastly, quantum random circuits produce quasi-maximal entanglement in the ideal case.

- Architectural neutrality. Different architectures (e.g., ion traps vs. superconducting qubits) have very different performance capabilities in terms of:


  - Number of qubits;
  - Types of gates and fidelity;
  - Connectivity;
  - Number of qubits per multi-qubit operation;
  - Number of parallel executions.

RCS allows a fair comparison across architectures by estimating the classical computational power required for each given sampling task.

Other techniques for benchmarking NISQ devices, such as randomized benchmarking [40, 41, 42], cycle benchmarking [43], or preparing entangled GHZ states [5], are restricted to non-universal gate sets and easy-to-simulate quantum circuits. This is an advantage for extracting circuit fidelities of non-universal gate sets, but it does not measure computational power.
I-C Energy Advantages of Quantum Computing
The largest casualty as we approach the end of Moore’s law has been the energy-efficiency gains due to Dennard scaling, also known as Koomey’s law. Today’s HPC data centers are usually built within the constraints of available energy supplies rather than hardware costs. For example, the Summit HPC system at Oak Ridge National Laboratory has a total power capacity of 14 MW available to achieve its design specification of 200 Pflop/s double-precision performance. Scaling such a system by 10x would require 140 MW of power, which would be prohibitively expensive. Quantum computers, on the other hand, offer the opportunity to drastically reduce power consumption [44, 45].
To compare the energy consumption needed for a classical computation to the corresponding needs of a quantum computation, it is important to understand the sources of energy consumption for quantum computers. For superconducting quantum computers such as Google’s (Fig. 2), the main sources of energy consumption are:


- Dilution refrigerator: because quantum computers operate at around 15 mK, a large dilution refrigerator, which typically consumes 10 kW, is necessary.

- Electronic racks: these typically consume around 5 kW of power. They consist of oscilloscopes, microwave electronics, analog-to-digital converters, and clocks.
Therefore, a typical quantum computer will consume approximately 15 kW of power for a single QPU of 72 qubits. Even as qubit systems scale up, this amount is unlikely to grow significantly. The main reasons are that dilution refrigerators are not expected to scale up, and that the power for the microwave electronics on a per-qubit basis will decrease as quantum chips scale in the number of qubits. The last component needed to perform a fair comparison between classical and quantum computing is the rate at which quantum computers can execute a circuit. In the case of superconducting qubits, the circuit execution rate is approximately 10 kHz to 100 kHz.
For classical computers, different architectures can result in different energy-consumption rates. In this paper, we compare the consumption of NASA’s supercomputer Electra and ORNL’s supercomputer Summit. On ORNL’s Summit, each node consumes around 2.4 kW of power and consists of 2 IBM Power9 CPUs and 6 NVIDIA V100 GPUs, with a total of 4608 nodes (1.8 kW per node is consumed by the 6 NVIDIA Volta GPUs). On NASA’s Electra supercomputer, each of the 2304 40-core Skylake nodes consumes about 0.52 kW, and each of the 1152 28-core Broadwell nodes consumes about 0.38 kW.
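A back-of-the-envelope consequence of the figures above (taking the conservative 10 kHz end of the execution-rate range):

```python
# Energy per sampled circuit for a superconducting QPU, from the numbers above.
qpu_power_w = 15_000   # dilution refrigerator (~10 kW) + electronics racks (~5 kW)
exec_rate_hz = 10_000  # lower end of the 10-100 kHz circuit execution rate

energy_per_circuit_j = qpu_power_w / exec_rate_hz
print(energy_per_circuit_j)  # 1.5 J per circuit execution

# For scale: a single Summit node alone draws 2.4 kW, i.e. 2400 J every second.
summit_node_power_w = 2_400
```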
II Current State-of-the-Art
With the advent of NISQ devices, different classes of classical simulators have blossomed to tackle the demanding problem of simulating large RQCs. To date, classical simulators of RQCs fall into three main classes (plus a fourth category of “hybrid” algorithms):


- Direct evolution of the quantum state. This approach consists of the sequential application of quantum gates to the full representation of an n-qubit state. While this method has no limitations in terms of gates and topology of RQCs, the memory required to store the full quantum state quickly explodes as 2^n, on top of the non-negligible overhead of the node-to-node communication required to update the full quantum state. Thus, this approach becomes impractical for the simulation and benchmarking of NISQ devices with more than about 40 qubits; beyond 40 qubits it already requires fast internode connectivity [19, 21, 22, 25, 31].

- Perturbation of stabilizer circuits. It is known that stabilizer circuits, i.e. quantum circuits made exclusively of Clifford gates, can be efficiently simulated and sampled on classical computers [46]. An intriguing idea consists of representing RQCs as a linear combination of different stabilizer circuits [47, 48, 49]. The computational complexity of simulating RQCs then becomes proportional to the number of stabilizer circuits required to represent them. Unfortunately, this number doubles every time a non-Clifford gate (for instance, a rotation) is used. Furthermore, the number of non-Clifford gates grows quickly with the depth of RQCs [16] and, in general, much faster than the number of qubits. Consequently, stabilizer circuits are not suitable for benchmarking NISQ devices using RQCs.

- Tensor network contractions. The main idea of this approach is based on the fact that any given quantum circuit can always be represented as a tensor network, where one-qubit gates are rank-2 tensors (tensors of 2 indexes with dimension 2 each), two-qubit gates are rank-4 tensors (tensors of 4 indexes with dimension 2 each), and in general k-qubit gates are rank-2k tensors [50, 51] (this approach should not be confused with another application of tensor network theory to quantum circuit simulations, where tensor networks are used for wave-function compression [52]). In general, the computational and memory cost of contracting such networks is (at least) exponential in the number of open indexes (corresponding to the input and output states, respectively). Therefore, for large enough circuits, the network contraction is impractical. Nonetheless, once the input and output states are specified through rank-1 Kronecker projectors, the computational complexity drastically reduces. More precisely, the computational and memory costs are dominated by the largest tensor formed during the contraction [50]. The size of the largest contraction can be estimated by computing the treewidth of the underlying tensor network’s line graph, which is, for most NISQ devices, proportional to the depth of the RQCs. This representation of quantum circuits gives rise to an efficient technique to simulate RQCs for XEB. Among the most relevant tensor-based simulators, it is worth mentioning the following:


- Undirected graphical model. An approach based on algorithms designed for undirected graphical models, closely related to tensor network contractions, was introduced in [53]. Later, the Alibaba group used the undirected graphical representation of RQCs to take advantage of underlying weaknesses in the circuit design [28]. More precisely, by carefully projecting the state of a few nodes in the graph, the Alibaba group was able to reduce the computational complexity of simulating large RQCs. However, it is important to stress that [28] reports the computational cost of simulating a class of RQCs that is much easier to simulate than the class of RQCs reported in Ref. [16]. Indeed, Chen et al. fail to include the final layer of Hadamard gates in their RQCs and use more diagonal gates in detriment of non-diagonal gates at the beginning of the circuit. For these reasons, we estimate that such a class is substantially easier to simulate than the RQCs available in [54], as discussed in Ref. [30]. The latter are the circuits simulated in this submission, as well as in [29, 30].

- Quantum teleportation-inspired algorithms. Recently, Chen et al. proposed a related approach to the classical simulation of RQCs inspired by quantum teleportation [55]. This approach allows one to “swap” space and time to take advantage of low-depth quantum circuits (equivalent to a Wick rotation in the imaginary-time direction). While the approach is interesting per se, its computational performance is lower than that achieved by qFlex [30].


- Hybrid algorithms. Several works have explored algorithms based on splitting grids of qubits into smaller subcircuits, which are then independently simulated [26, 29]. In these schemes, every time an entangling gate crosses between subcircuits, the number of independent circuits to simulate doubles. Therefore, the computational cost is exponential in the number of entangling gates in a cut. MFIB [29] also introduced a technique to “match” the target fidelity of the NISQ device, which reduces the classical computation cost by a factor equal to the fidelity. By matching the fidelity of realistic quantum hardware, MFIB was able to simulate large grids by numerically computing amplitudes on Google Cloud. However, MFIB becomes less efficient than qFlex for larger grids because of memory requirements. In principle, one could mitigate the memory requirements either by using distributed-memory protocols like MPI, or by partitioning the RQCs into more subcircuits. Nevertheless, this has been impractical so far. Moreover, the low arithmetic intensity intrinsic to this method (which relies on the direct evolution of subcircuits) makes it not scalable to flop-oriented heterogeneous architectures like Summit.
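To make the gates-as-tensors picture behind the contraction-based simulators above concrete, here is a minimal sketch (not qFlex itself): one-qubit gates enter as rank-2 tensors, two-qubit gates as rank-4 tensors, and rank-1 projectors for the input and output bitstrings close all open indexes, so the amplitude is a single tensor-network contraction.

```python
import numpy as np

# Rank-2 tensor for a one-qubit gate, rank-4 tensor for a two-qubit gate.
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
CZ = np.diag([1.0, 1.0, 1.0, -1.0]).reshape(2, 2, 2, 2)  # (out0, out1, in0, in1)

zero = np.array([1.0, 0.0])  # rank-1 projector for input bit |0>
one = np.array([0.0, 1.0])   # rank-1 projector for output bit <1|

# Two-qubit circuit (H on each qubit, then CZ); amplitude <11| CZ (H x H) |00>.
amp = np.einsum('a,b,ia,jb,klij,k,l->', zero, zero, H, H, CZ, one, one)
print(amp)  # -0.5
```

The state after the Hadamards assigns amplitude 1/2 to every bitstring, and CZ flips the sign of |11>, hence -0.5.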
II-A qFlex
In 2018, NASA and Google implemented a circuit simulator, qFlex, to compute amplitudes of arbitrary bitstrings, and ran it on NASA’s Electra and Pleiades supercomputers [30]. qFlex’s novel algorithmic design emphasizes communication avoidance and minimization of the memory footprint, and it can also be reconfigured to optimize for local memory bandwidth. The computational resources required by qFlex for simulating RCS are proportional to the fidelity of the NISQ device [29, 30]. qFlex provides a fast, flexible, and scalable approach to simulating RCS on CPUs (see Section III). In this work, qFlex has been redesigned and reimplemented in collaboration with Oak Ridge National Laboratory in order to efficiently utilize the GPU-accelerated Summit HPC architecture (see Section III). The computational results presented here, achieving more than 90% peak performance efficiency and 68% sustained performance efficiency on the world’s most powerful classical computer, define the frontier for quantum computing to declare Quantum Supremacy. Moreover, given the configurability of qFlex as a function of RAM and desired arithmetic intensity, it can be used to benchmark not only NISQ computers but also classical high-end parallel computers.
III Innovation Realized
qFlex is a flexible RQC simulator based on an innovative tensor contraction algorithm for classically simulating quantum circuits that were beyond the reach of previous approaches [30]. More precisely, qFlex is by design optimized to simulate the generic (worst-case) RQCs implemented on real hardware, which establish the baseline for the applications of near-term quantum computers. Moreover, qFlex is agnostic with respect to the randomness in the choice of single-qubit gates of the RQCs. Therefore, it presents no fluctuations in performance from one circuit to another in a given ensemble.
For the sake of simplicity, from now on we focus on planar NISQ devices where qubits are positioned on a grid, with entangling gates acting only between pairs of adjacent qubits. As RQCs, we use the prescription described in Refs. [16, 54], which has been shown to be hard to simulate.
Unlike other approaches, the first step of the qFlex algorithm consists of contracting the RQC tensor network in the “time” direction first. This step allows us to reduce the RQC to a regular 2D grid of tensors, which are then contracted to produce a single amplitude for a given output bitstring (see Fig. 3). The regularity of the resulting tensors is highly advantageous for exploiting modern classical hardware architectures, which are characterized by multi-level cache systems, large vector lengths, and a high level of parallelism, and therefore require high arithmetic intensity. The chosen order of tensor contractions is extremely important for minimizing the dimension of the largest tensor formed during the contraction.
In addition, qFlex expands on a technique introduced in [26, 27, 28, 29], based on systematic tensor slicing via fine-grained “cuts”, which enables us to judiciously balance memory requirements against the number of concurrent computations. Furthermore, by computing an appropriate fraction of “paths”, it is possible to control the “fidelity” of the simulated RCS, and therefore to “mimic” the sampling of NISQ devices. Our simulator can produce a sample of bitstrings with a target fidelity at the same computational cost as computing noiseless amplitudes. More importantly, the use of systematic tensor slicing in our tensor contraction algorithm results in complete communication avoidance between MPI processes, making our simulator embarrassingly scalable by design. The number of tensor “cuts” can be tuned so that each independent computation fits into a single node. Therefore, all the required computations run in parallel with no communication between compute nodes. This is an important design choice that allowed us to avoid the communication bottleneck that would otherwise severely degrade the performance of qFlex, in particular when GPUs are used.
Finally, compared to other approaches, qFlex is the most versatile and scalable simulator of large RCS. In our recent paper [30], qFlex was used to benchmark the 72-qubit Google Bristlecone QPU. To date, our Bristlecone simulations are the largest numerical computations, in terms of sustained Pflop/s and number of nodes utilized, ever run on NASA Ames supercomputers, reaching a substantial fraction of the peak single-precision performance of the Pleiades and Electra supercomputers. On Summit, we achieved a sustained performance of 281 Pflop/s (single precision) over the entire supercomputer, which corresponds to an efficiency of 68%, as well as a peak efficiency of 92%, simulating circuits of 49 and 121 qubits on a square grid.
To achieve the performance reported here, we have combined several innovations. Some of them have appeared previously in the literature, as explained in detail in the corresponding sections (see Ref. [30] for more details on the techniques and methods used here).
III-A Systematic tensor slicing technique: communication avoidance and memory footprint reduction
For RCS with sufficient depth, the contraction of the corresponding 2D grid of tensors inevitably leads to the creation of a temporary tensor whose size exceeds the total memory available on a single node; for deep enough RQCs, the memory footprint grows to many terabytes. To limit the memory footprint and make it possible to contract RQCs on single nodes without communication, we further expand and optimize the technique of using “cuts” [26, 27, 28, 29].
Given a tensor network and a set of indexes to contract, we define a cut over an index i as the decomposition of the contraction into an explicit outer sum over the values of i of contractions of the sliced network. This results in the contraction of up to dim(i) tensor networks of lower complexity, namely those covering all slices over index i of the tensors involving that index. The resulting tensor networks can be contracted independently, which makes the computation embarrassingly parallelizable. It is possible to make more than one cut on a tensor network, in which case i refers to a multi-index, and one can also adaptively choose the size of the slices. The contribution of each of these contractions (each slice over the multi-index cut) to the final sum is a “path”, and the final value of the contraction is the sum of all path contributions.
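A minimal illustration of a cut, with two tensors sharing one contracted index of dimension 16 (a toy stand-in for the 2D grid of tensors):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 16))
B = rng.standard_normal((16, 8))

# Full contraction over the shared index.
full = A @ B

# Cut that index into 4 slices: each slice yields an independent, cheaper
# contraction (a "path"), and the final value is the sum of all paths.
paths = [A[:, s] @ B[s, :] for s in np.split(np.arange(16), 4)]
sliced = np.sum(paths, axis=0)

print(np.allclose(full, sliced))  # True
```

Each path here needs only a quarter of the contracted dimension in memory, and the four paths could run on four different nodes with no communication, which is the point of the technique.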
Fig. 3 depicts the cut we used for the RQCs simulated in this work. The chosen cut allows us to reduce the amount of memory enough to fit six concurrent contractions on each Summit node (one per GPU). Moreover, as explained in the next sections, such cuts greatly improve the sampling of amplitudes for a given RQC.
III-B Fast sampling technique
Sampling is a critical feature of random circuit sampling (RCS) and, therefore, it must be optimized to reduce the time-to-solution. To this end, we use a frugal rejection sampling scheme, which is known to have a low overhead for RCS [29]. In addition, as explained in Ref. [30], faster sampling can be achieved by recycling most of the large contraction for a given amplitude calculation. More precisely, if tensors A and B must be contracted to obtain the amplitude of a bitstring s = (s_A, s_B), our fast sampling technique first contracts all tensors for a given s_A into a tensor C. Then, C is recycled and contracted with B for many different s_B. Since the most computationally demanding part is the calculation of C, many amplitudes can be computed for different s_B (but the same s_A). Since RQCs are extremely chaotic, arbitrarily close bitstrings are uncorrelated [16]. These amplitudes are used to perform the frugal rejection sampling of one single bitstring in the output sample, and the computation is restarted with a new random s_A to output the next bitstring. For deep enough RQCs, and by carefully choosing the size of C, one can reduce the sampling time by an order of magnitude, as shown in [30].
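A toy sketch of the recycling idea, with hypothetical stand-ins A_of_sa and B for the two parts of the contracted network (in qFlex these come from the actual circuit):

```python
import numpy as np

rng = np.random.default_rng(2)
bond = 64  # dimension of the index shared by the two parts of the network
n_b = 4    # output bits that are varied cheaply

# A_of_sa: the result C of the expensive contraction for one fixed s_A.
# B: the remaining (cheap) tensor, indexed by the candidate bitstrings s_B.
A_of_sa = rng.standard_normal(bond)
B = rng.standard_normal((2 ** n_b, bond))

# One expensive contraction, then 2**n_b cheap dot products: amplitudes of
# all bitstrings (s_A, s_B) that share the same s_A.
amps = B @ A_of_sa
print(amps.shape)  # (16,)
```

Frugal rejection sampling then accepts one bitstring from this batch, and a fresh random s_A restarts the computation for the next output bitstring.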
III-C Noisy simulation
Given the noisy nature of NISQ devices, simulating noisy RCS is of utmost importance. Intuitively, one would expect noisier RCS to be easier to simulate. Indeed, as shown in Ref. [29], the independent tensor contractions that make up a single amplitude can be seen as orthogonal “Feynman paths” (or simply “paths”). Since RQCs are very chaotic and Feynman paths typically contribute equally to the final amplitude, computing only a fraction f of the paths is equivalent to computing noisy amplitudes of fidelity f. The computational cost for a typical target fidelity f of a NISQ device is therefore reduced by a factor of 1/f [29, 30].
While computing only a fraction of the paths allows for a speedup in the simulation of noisy devices, computing fewer amplitudes with higher fidelity is an alternative way to achieve a similar speedup, as we introduced in Ref. [30]. In particular, computing a fraction f of perfect-fidelity amplitudes from a target set of amplitudes lets us simulate RCS with fidelity f. In general, the fraction of amplitudes computed times their fidelity equals the fidelity of the simulated sampling.
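The path-fidelity argument can be checked numerically: modeling path contributions as i.i.d. random complex numbers (the chaotic-RQC assumption), summing a fraction f of them yields amplitudes whose mean squared magnitude, and hence sampling fidelity, is a fraction f of the ideal one:

```python
import numpy as np

rng = np.random.default_rng(3)
n_paths, n_trials, f = 100, 20_000, 0.2
k = int(f * n_paths)

# Random complex path contributions for many independent amplitudes.
paths = rng.standard_normal((n_trials, n_paths)) \
    + 1j * rng.standard_normal((n_trials, n_paths))
full = paths.sum(axis=1)            # all paths: ideal amplitude
partial = paths[:, :k].sum(axis=1)  # only a fraction f of the paths

ratio = np.mean(np.abs(partial) ** 2) / np.mean(np.abs(full) ** 2)
print(round(ratio, 2))  # close to f = 0.2
```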
III-D Optimization of the tensor contraction order for optimal time-to-solution
The optimization of the number and position of the cuts, and of the subsequent tensor contraction ordering, is fundamental for minimizing the total flop count and the time-to-solution when evaluating a given tensor network and simulating sampling. Note that on modern flop-oriented computer architectures the minimal flop count does not necessarily guarantee the minimal time-to-solution, making the optimization problem even more complicated. Indeed, different tensor contraction orderings can result in performances that differ by orders of magnitude. In general, finding the optimal cuts and tensor contraction order is an NP-hard problem (closely related to the tree-decomposition problem [50]). However, for the planar or quasi-planar tensor networks produced by most NISQ devices, good cuts can be found by carefully splitting the circuits into pieces that minimize the shared boundary interface. Furthermore, we find that choosing a contraction ordering that prioritizes a few large, high-arithmetic-intensity contractions over many smaller, low-arithmetic-intensity contractions often provides large speedups, and therefore better time-to-solution.
III-E Out-of-core asynchronous execution of tensor contractions on GPU
The novel tensor slicing technique used by the qFlex algorithm removes the need for inter-process MPI communication and controls the memory footprint per node, thus paving the way to scalability. In order to achieve high utilization of the hardware on a heterogeneous Summit node, we implemented an out-of-core GPU tensor contraction algorithm in the TALSH library [56], which serves as the computational backend of qFlex. The TALSH library is a numerical tensor algebra library capable of executing basic tensor algebra operations, most importantly tensor contraction, on multi-core CPU and NVIDIA GPU hardware. The key features of the TALSH library that allowed us to achieve high performance on GPU-accelerated node architectures are: (a) fast heterogeneous memory management; (b) fully asynchronous execution of tensor operations on GPU; (c) a fast tensor transpose algorithm; (d) an out-of-core algorithm for executing large tensor contractions on GPU (for tensor contractions that do not fit into an individual GPU’s memory).
Fast CPU/GPU memory management, and in general fast resource management, is a necessary prerequisite for achieving a high level of asynchronism in executing computations on CPU and GPU. TALSH provides custom memory allocators for Host and Device memory. These memory allocators use pre-allocated memory buffers acquired during library initialization. Within such a buffer, each memory allocator implements a simplified version of the “buddy” memory allocator used in Linux. Since memory allocation/deallocation occurs inside pre-allocated memory buffers, it is fast and free of the highly unwanted side effects associated with regular CUDA malloc/free, for example the serialization of asynchronous CUDA streams.
All basic tensor operations provided by the TALSH library can be executed asynchronously with respect to the CPU Host on any GPU device available on the node. The execution of a tensor operation consists of two phases: (a) scheduling (either successful or unsuccessful, to be retried later); and (b) checking for completion, by either testing or waiting for completion (each scheduled tensor operation is a TALSH task with an associated TALSH task handle used for completion checking). All necessary resources are acquired during the scheduling phase, thus ensuring uninterrupted progress of the tensor operation on a GPU accelerator via a CUDA stream. The scheduling phase includes scheduling the necessary data transfers between different memory spaces, which are executed asynchronously as well. The completion checking step also includes consistency control for images of the same tensor in different memory spaces. All of this is automated by TALSH.
The tensor contraction operation in TALSH is implemented via the general Transpose-Transpose-GEMM-Transpose (TTGT) algorithm, with an optimized GPU tensor transpose operation [57] (see also Ref. [58]). In the TTGT algorithm, a tensor contraction is converted into a matrix-matrix multiplication by permuting tensor indexes. With the optimized implementation, the performance overhead associated with the tensor transpose steps can be as low as a few percent, or even lower for highly arithmetically intensive tensor contractions. Also, all required Host-to-Device and Device-to-Host data transfers are asynchronous, and they overlap with kernel execution in other concurrent CUDA streams, thus minimizing the overhead associated with CPU-GPU data transfers. Additionally, TALSH provides an automated tensor image consistency checking mechanism, enabling tensor presence on multiple devices with transparent data consistency control (a tensor image is a copy of the tensor on some device).
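A minimal NumPy sketch of the TTGT scheme for one concrete contraction (the TALSH implementation is far more general):

```python
import numpy as np

def ttgt_contract(A, B):
    """C[a,b,e,f] = sum_{c,d} A[a,c,b,d] * B[d,e,c,f] via
    Transpose-Transpose-GEMM-Transpose: permute so the contracted indexes
    are adjacent, matricize both tensors, multiply, and unpack the result."""
    a, c, b, d = A.shape
    _, e, _, f = B.shape
    At = A.transpose(0, 2, 1, 3).reshape(a * b, c * d)  # rows (a,b), cols (c,d)
    Bt = B.transpose(2, 0, 1, 3).reshape(c * d, e * f)  # rows (c,d), cols (e,f)
    return (At @ Bt).reshape(a, b, e, f)                # GEMM, then unpack

rng = np.random.default_rng(4)
A = rng.standard_normal((2, 3, 4, 5))
B = rng.standard_normal((5, 6, 3, 7))
print(np.allclose(ttgt_contract(A, B), np.einsum('acbd,decf->abef', A, B)))  # True
```

The transposes cost only data movement, while the GEMM carries all the flops, which is why the transpose overhead shrinks as arithmetic intensity grows.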
In this work, we used the regular complex FP32 CGEMM implementation provided by the cuBLAS library, without tensor-core acceleration. The possibility of reducing the input precision to FP16 in order to use Volta’s tensor cores is strongly constrained by the precision needed for simulating large quantum circuits. The average squared output amplitude of a quantum circuit of n qubits is 2^-n, which for n = 49 (simulated in this work) is of the order of 10^-15, and for n = 121 (also simulated here) is of the order of 10^-37. We avoid underflowing single precision by normalizing the input tensors and renormalizing them back at the end of the computation.
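The normalization trick can be demonstrated with a toy FP32 product whose unscaled intermediates underflow (the numbers here are chosen for illustration only):

```python
import numpy as np

# Four tiny factors whose product, 2**-160, lies below even FP32's subnormal
# range (~1.4e-45), so the naive single-precision product flushes to zero.
x = np.full(4, 2.0 ** -40, dtype=np.float32)
print(np.prod(x))  # 0.0

# Remedy sketched in the text: normalize inputs up front, contract in FP32,
# and undo the normalization in double precision at the end.
scale = np.float32(2.0 ** 30)
scaled_result = np.prod(scale * x)                # 2**-40: safely representable
result = float(scaled_result) * 2.0 ** (-30 * 4)  # renormalize in FP64
print(result == 2.0 ** -160)  # True
```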
Finally, an important extension to the basic TALSH functionality, which was key to achieving high performance in this work, is the out-of-core algorithm for executing large tensor contractions on GPU, that is, tensor contractions which do not fit into individual GPU memory. Although the Summit node has the IBM AC922 Power9-Volta architecture with unified CPU-GPU coherent memory efficiently synchronized via NVLink, we did not rely on the unified memory abstraction provided by the CUDA framework, for two reasons. First, although hardware-assisted memory page migration is very efficient on this architecture, we believe that explicitly managed heterogeneous memory and data transfers deliver higher performance for tensor contractions (this is an indirect conclusion based on the performance of similar algorithms on Summit). Second, the TALSH library prioritizes performance portability, that is, it is meant to deliver high performance on other accelerated HPC systems as well, many of which do not have a hardware-assisted coherence mechanism between CPU and GPU.
The out-of-core tensor contraction algorithm implemented in TALSH is based on a recursive decomposition of a tensor operation (tensor contraction in this case) into smaller tensor operations (tensor contractions) operating on slices of the original tensors, following the general philosophy presented in Ref. [59]. During each decomposition step, the largest tensor dimension associated with the largest matrix dimension in the TTGT algorithm is split in half (in the TTGT algorithm, multiple tensor dimensions are combined into matrix dimensions, thus matricizing the tensors). This maximizes the arithmetic intensity of the individual derived tensor contractions, which is important for achieving high performance on modern flop-oriented computer architectures, like the NVIDIA Volta GPU architecture employed in this work (arithmetic intensity is the flop-to-byte ratio of a given operation). The recursive decomposition is repeated until all derived tensor operations fit within the available GPU resources. The final list of derived tensor operations is then executed by TALSH using a pipelined algorithm in which multiple tensor operations are progressed concurrently, thus overlapping CPU and GPU elementary execution steps. In general, TALSH allows tuning of the number of concurrently progressed tensor operations, but in this work we restricted it to two. Each tensor operation has five stages of progress:

1. Resource acquisition on the execution device (GPU);
2. Loading input tensor slices on Host (multi-threaded);
3. Asynchronous execution on GPU: fast additional resource acquisition/release, asynchronous data transfers, asynchronous tensor transposes, asynchronous GEMM;
4. Accumulating the output tensor slice on Host (multi-threaded);
5. Resource release on the execution device (GPU).
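The recursive splitting described above can be sketched as follows, modeling a contraction C[free_l + free_r] = A[free_l + contr] · B[contr + free_r] by its index extents. The index-name interface and the simple byte-budget criterion are illustrative assumptions, not the TALSH implementation:

```python
from math import prod

def decompose(extents, free_l, free_r, contr, budget_bytes, itemsize=8):
    """Count the slice contractions produced by recursively halving the
    largest index extent until the contraction fits in `budget_bytes`.
    `extents` maps index name -> extent; itemsize=8 bytes corresponds to
    complex FP32. Splitting an index halves it in every tensor that
    carries it, keeping slices consistent (splitting a contracted index
    requires accumulating partial results into the output slice).
    Illustrative sketch of the out-of-core strategy, not the TALSH code."""
    vol = lambda idx: prod(extents[i] for i in idx)
    footprint = itemsize * (vol(free_l + free_r)     # destination
                            + vol(free_l + contr)    # left input
                            + vol(contr + free_r))   # right input
    if footprint <= budget_bytes:
        return 1
    k = max(extents, key=extents.get)  # split the largest index in half
    lo, hi = dict(extents), dict(extents)
    lo[k] = extents[k] // 2
    hi[k] = extents[k] - lo[k]
    return (decompose(lo, free_l, free_r, contr, budget_bytes, itemsize)
            + decompose(hi, free_l, free_r, contr, budget_bytes, itemsize))
```

Halving the largest extent (rather than slicing thinly along one mode) keeps the derived contractions as "square" as possible, which preserves their arithmetic intensity.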
The pipelined progression algorithm is driven by a single CPU thread and is based on an "asynchronous yield" approach: each active (concurrent) tensor operation proceeds through its consecutive synchronous stages uninterrupted (unless there is a shortage of available resources, in which case it is interrupted), but it "yields" to the next concurrent tensor operation once an asynchronous stage starts. The asynchronous GPU execution step, mediated by the CUDA runtime library, involves asynchronous Host-to-Device and Device-to-Host data transfers, asynchronous (optimized) tensor transpose kernels, and asynchronous (default) CGEMM cuBLAS calls. Since each NVIDIA Volta GPU has multiple physical data transfer engines, all incoming and outgoing data transfers are overlapped with CUDA kernel execution in different CUDA streams, thus almost completely removing the CPU-GPU data transfer overhead. Overall, this algorithm, combined with the fast tensor transpose and efficient cuBLAS GEMM implementations, results in highly efficient execution of large tensor contractions on GPU, as demonstrated by our results (up to 96% of Volta's theoretical flop/s peak).
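The "asynchronous yield" control flow can be mimicked with Python generators. This is a schematic model of the driver only (operation names and stage strings are invented for illustration); the actual driver is a single CPU thread interacting with CUDA streams:

```python
from collections import deque

def tensor_op(name, stages):
    """Model a tensor operation as a generator that yields after
    launching each stage's asynchronous work, returning control to
    the single driver thread."""
    for stage in stages:
        # synchronous preparation for this stage would happen here ...
        yield f"{name}:{stage}"  # ... then yield once async work starts

def drive(ops, width=2):
    """Single-threaded driver: keep `width` operations in flight and
    round-robin them, mirroring the 'asynchronous yield' scheme (this
    work used width 2). Illustrative sketch, not the TALSH code."""
    trace, active, pending = [], deque(), deque(ops)
    while active or pending:
        while pending and len(active) < width:
            active.append(pending.popleft())
        op = active.popleft()
        try:
            trace.append(next(op))
            active.append(op)   # yielded: move to the back of the queue
        except StopIteration:
            pass                # operation finished; slot frees up
    return trace

stages = ["acquire", "load", "gpu_async", "accumulate", "release"]
trace = drive([tensor_op(f"op{i}", stages) for i in range(3)], width=2)
```

Running this, the first two operations interleave stage by stage, and the third enters the pipeline only once a slot frees up, just as described in the text.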
IV How Performance Was Measured
We used an analytical approach to measuring the number of floating point operations performed by our algorithm. Our computational workload consists of a series of tensor contractions operating on dense tensors of rank 1 to 10. Thus, the total number of flops executed by an MPI process is the sum of the flop counts of the individual tensor contractions executed by that MPI process. Each individual tensor contraction has a well-defined flop count:
F = 8 · sqrt(V_D · V_L · V_R),   (1)
where V_D, V_L, and V_R are the volumes of the destination, left, and right tensors participating in the tensor contraction (the volume is the number of elements in a tensor), and the factor of 8 appears due to the use of the complex multiply-add operation (4 real multiplications plus 4 real additions). All tensors used in our simulations are of complex single precision (complex FP32), SP.
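Assuming Eq. (1) is the standard TTGT flop count 8·sqrt(V_D · V_L · V_R), which follows from the matricized GEMM shape m×k · k×n (so V_L = mk, V_R = kn, V_D = mn and mnk = sqrt(V_D·V_L·V_R)), the per-contraction count can be computed as follows (illustrative helper, not the qFlex code):

```python
from math import prod, sqrt

def contraction_flops(dest_shape, left_shape, right_shape):
    """Flop count of a complex tensor contraction: 8*sqrt(V_D*V_L*V_R),
    where each volume V is the element count of the corresponding tensor
    and the factor 8 accounts for the complex multiply-add
    (4 real multiplications + 4 real additions)."""
    v_d, v_l, v_r = (prod(s) for s in (dest_shape, left_shape, right_shape))
    return 8 * round(sqrt(v_d * v_l * v_r))

# Example: C[m,n] = A[m,k] B[k,n] with m=2, n=3, k=4 gives 8*m*n*k = 192
flops = contraction_flops((2, 3), (2, 4), (4, 3))
```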
For the timing of the simulations, we do not include job launching, MPI_Init(), or the initialization of the TALSH library on each MPI process, nor the final time taken by the master MPI process (scheduler) to write the results to file. To this end, the start and stop timestamps are recorded after two MPI_Barrier() synchronization calls.
Based on the analytical flop count, we have computed multiple performance metrics in this work. The average SP flop/s rate is computed by dividing the total flop count of our simulation by its execution time. The peak SP flop/s rate is computed by dividing the total flop count of the largest tensor contraction by the sum of its execution times on each node, times the number of nodes. Additionally, we have computed the energy efficiency of the Summit hardware with respect to our simulation. The average flop/s/watt value is computed by dividing the total flop count of our simulation by the total energy consumed by Summit during the simulation. This energy consumption value is an upper bound, because it includes all components of Summit, some of which, like disk storage, were not actively used by our simulation. The Summit HPC system at Oak Ridge National Laboratory has a total power capacity of 14 MW, available to achieve a design specification of slightly more than 200 Pflop/s peak double-precision performance, and more than 400 Pflop/s single-precision performance. This energy powers 4608 GPU-enabled compute nodes. Each Summit node contains two Power9 CPUs, 6 NVIDIA Volta V100 GPUs, 512 GB of DDR4 RAM, 16 GB of HBM2 memory in each GPU, 1.6 TB of non-volatile NVMe storage, and two physical network interface cards, with a total power cap of approximately 2400 W, of which up to 1800 W are consumed by the 6 NVIDIA Volta GPUs.
V Performance Results
Table I: Summary of performance results.

Circuit Size | Nodes Used | Runtime (h) | Peak PFlop/s | Sust. PFlop/s | Peak Eff. (%) | Sust. Eff. (%) | Power (MW) | Energy Cost (MWh) | PFlop/s/MW
7x7          | 2300       | 4.84        | 191          | 142           | 92.0          | 68.5          | --         | --                | --
7x7          | 4600       | 2.44        | 381          | 281           | 92.1          | 68.0          | 8.075      | 21.1              | 34.8
11x11        | 4550       | 0.278       | 368          | 261           | 89.8          | 63.7          | 7.3        | 2.32              | 35.8
We have simulated RCS on two large, hard RQCs, running on either half of or the entire Summit supercomputer. Both circuits belong to the class of revised, worst-case circuits on planar graphs (no design weaknesses exploited) [29] and can be found in [54] (inst_7x7_41_0.txt and inst_11x11_25_0.txt). See Table I for a summary of our performance results.
V-A GPU results and peak performance
Most of the simulation time is spent in arithmetically intensive, out-of-core contractions of large tensors that do not necessarily fit in GPU memory. On these contractions, we achieve a high performance efficiency with respect to the theoretical single-precision peak of 15 Tflop/s for the NVIDIA Volta V100 GPU, reaching as high as 96%. For this reason, we compute our peak performance as the average performance of these largest tensor contractions times the number of GPUs corresponding to the number of nodes used in each simulation. We thereby achieve a peak performance efficiency of 92% (381 Pflop/s when running on 100% of Summit) for the 7x7 circuit, and 90% (367 Pflop/s when running on 99% of Summit) for the 11x11 circuit.
V-B Sustained performance
While we successfully slice ("cut") the tensors in our circuits and contract them in an ordering that leaves most of the computation to large, arithmetically intensive tensor contractions, we inevitably must deal with a non-negligible number of less arithmetically intensive tensor contractions, which achieve suboptimal performance. Averaged over the entire simulation time, we reach a sustained performance efficiency of about 68% (281 Pflop/s on 100% of Summit) for the 7x7 circuit, and about 64% (261 Pflop/s on 99% of Summit) for the 11x11 circuit.
V-C Scaling
Due to the communication-avoiding design of our algorithm, the impact of communication on scaling is negligible, and the performance is therefore stable as a function of the number of nodes used, demonstrating excellent strong scaling (see Table I).
V-D Energy consumption
We report the energy consumed by Summit in our full-scale simulations. We achieve a rate of 34.8 Pflop/s/MW for the 7x7 simulation, and 35.8 Pflop/s/MW for the 11x11 simulation. Note that the power consumed should be taken as an upper bound, since it takes into account the consumption of the entire machine, including components that were not used in the simulations. Furthermore, we report the energy consumed by the entire job, including job launch and the initialization of the TALSH library, which again slightly lifts the upper bound; this might be part of the reason for the slightly larger flop-per-energy rate obtained on the shorter 11x11 simulation (see Fig. 4), where the actual computing time is smaller in relative terms.
Table II: Comparison of runtime and energy cost across architectures.

Circuit Size | Target Fidelity (%) | Electra Runtime (h) | Summit Runtime (h) | QPU Runtime (h) | Electra Energy (MWh) | Summit Energy (MWh) | QPU Energy (MWh)
7x7          | 0.5                 | 59.0                | 2.44               | 0.028           | 96.8                 | 21.1                | --
VI Implications and Conclusions
As we explore technologies beyond Moore's law, it is important to understand the potential advantages of quantum computing both in terms of time-to-solution and energy consumption. The research presented here also provides a comparison of different classical computing architectures, with Random Circuit Sampling implementations on both CPU-based and GPU-based supercomputers (see Table II).
In conclusion, the implications of the proposed research are the following:


Establishes minimum hardware requirements for quantum computers to exceed available classical computation. qFlex is able to calculate the fidelity, number of qubits, and gate depth required for a quantum computer to exceed the computational ability of the most powerful available supercomputer for at least one well-defined computational task: RCS.

Objectively compares quantum computers in terms of computational capability. Different architectures (e.g., ion traps vs. superconducting qubits) vary in:

- Number of qubits
- Types of gates
- Fidelity
- Connectivity
- Number of qubits per multi-qubit operation
- Number of parallel executions

RCS allows comparisons across different qubit architectures, estimating the equivalent amount of classical computation with commonly used metrics.


Establishes a multi-qubit benchmark for non-Clifford gates. Prior to XEB, there was no benchmarking proposal to measure multi-qubit fidelity for universal (non-Clifford) quantum gate sets. qFlex enables XEB on a large number of qubits and large circuit depths by efficiently using classical computing resources. XEB allows quantum hardware vendors to calibrate their NISQ devices.

Objectively compares classical computers for simulating large quantum many-body systems. Using RCS as a benchmark for classical computers, one can compare how different classical computers perform on the simulation of large quantum many-body systems across different classical computational architectures. In this paper, we specifically compare NASA's Electra supercomputer, which is primarily powered by Intel Skylake CPUs, with ORNL's Summit supercomputer, which is primarily powered by NVIDIA Volta GPUs.

Objectively compares energy consumption requirements across a variety of quantum and classical architectures for one specific computational task. We estimated the energy cost to produce a sample of bitstrings for two different circuits across classical CPU, classical GPU, and superconducting quantum computer architectures (see Table II). The classical computers would take 96.8 MWh and 21.1 MWh for NASA's Electra and ORNL's Summit, respectively. The almost fivefold improvement of Summit over Electra is due to Summit's GPU architecture. Compared to this, a superconducting quantum computer with a sampling rate of 10 kHz would offer a five-orders-of-magnitude improvement (see Table II). This separation in energy consumption is much greater than the time-to-solution improvement of two orders of magnitude. Furthermore, we expect the separation in energy performance to continue to grow faster than the time-to-solution performance. We emphasize that RCS is a computational task particularly favorable to quantum computers, and such an advantage is presently not achievable in practice for most other computational problems.

Guides near-term quantum algorithm and hardware design. The simulation capabilities of qFlex can be harnessed for such tasks as verifying that a hardware implementation of an algorithm is behaving as expected, evaluating the relative performance of quantum algorithms that could be run on near-term devices, and suggesting the hardware architectures most suitable for experimenting with specific quantum algorithms in the near term. The specific capabilities of qFlex are particularly useful for evaluating quantum circuits large enough that the full quantum state cannot be produced, but for which qFlex can still compute the probability of individual outputs of interest. In the case of RCS, it suffices to compute the amplitudes of random bitstrings and perform rejection sampling in order to produce a sample of bitstrings [29, 30]. In other cases, it might be possible to use classical heuristics to determine which probabilities to calculate.
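The rejection step mentioned above can be sketched as follows. This is a hedged illustration with a uniform proposal distribution; the function name and interface are invented, and m is any bound on p(x)·2^n:

```python
import random

def rejection_sample(bitstrings, probs, m, rng=random.Random(0)):
    """Given computed output probabilities p(x) for a batch of uniformly
    random n-bit strings, accept each x with probability
    p(x) / (m * p_uniform), where p_uniform = 1/2**n and m bounds
    p(x) * 2**n. The accepted strings are distributed according to the
    circuit's output distribution. Illustrative sketch of the rejection
    step described in the text, not the qFlex implementation."""
    n = len(bitstrings[0])
    accepted = []
    for x, p in zip(bitstrings, probs):
        if rng.random() < p * (2 ** n) / m:
            accepted.append(x)
    return accepted
```

With a uniform distribution (p(x) = 1/2^n) and m = 1, every candidate is accepted; strings with zero amplitude are always rejected.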
Acknowledgments
We are grateful for support from NASA Ames Research Center, the NASA Advanced Exploration Systems (AES) program, the NASA Earth Science Technology Office (ESTO), and the NASA Transformative Aeronautic Concepts Program (TACP), and also for support from the AFRL Information Directorate under grant F4HBKC4162G001. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. We would like to thank Jack Wells, Don Maxwell, and Jim Rogers for their help in making the Summit simulations possible and for providing hardware utilization statistics. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains, and the publisher, by accepting the article for publication, acknowledges that the United States Government retains, a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doepublicaccessplan).
References
 [1] R. P. Feynman, “Simulating Physics with Computers,” International Journal of Theoretical Physics, vol. 21, no. 6-7, pp. 467–488, 1982.
 [2] R. P. Feynman, “Quantum mechanical computers,” Optics news, vol. 11, no. 2, pp. 11–20, 1985.
 [3] R. Barends, J. Kelly, A. Megrant, A. Veitia, D. Sank, E. Jeffrey, T. C. White, J. Mutus, A. G. Fowler, B. Campbell, and others, “Superconducting quantum circuits at the surface code threshold for fault tolerance,” Nature, vol. 508, no. 7497, pp. 500–503, 2014.
 [4] J. Kelly, R. Barends, A. G. Fowler, A. Megrant, E. Jeffrey, T. C. White, D. Sank, J. Y. Mutus, B. Campbell, Y. Chen, and others, “State preservation by repetitive error detection in a superconducting quantum circuit,” Nature, vol. 519, no. 7541, pp. 66–69, 2015.
 [5] Y. Wang, Y. Li, Z.-q. Yin, and B. Zeng, “16-qubit IBM universal quantum computer can be fully entangled,” npj Quantum Information, vol. 4, no. 1, p. 46, 2018.
 [6] M. A. Nielsen and I. L. Chuang, Quantum computation and quantum information. Cambridge University Press, Cambridge, 2000.
 [7] E. G. Rieffel and W. Polak, Quantum Computing: A Gentle Introduction. Cambridge, MA: MIT Press, 2011.

 [8] L. K. Grover, “A fast quantum mechanical algorithm for database search,” in Proceedings of the twenty-eighth annual ACM symposium on Theory of computing - STOC ’96, pp. 212–219, ACM, 1996.
 [9] A. Aspuru-Guzik, A. D. Dutoi, P. J. Love, and M. Head-Gordon, “Chemistry: Simulated quantum computation of molecular energies,” Science, vol. 309, pp. 1704–1707, sep 2005.
 [10] R. Babbush, N. Wiebe, J. McClean, J. McClain, H. Neven, and G. K.-L. Chan, “Low-Depth Quantum Simulation of Materials,” Physical Review X, vol. 8, may 2018.
 [11] Z. Jiang, K. J. Sung, K. Kechedzhi, V. N. Smelyanskiy, and S. Boixo, “Quantum Algorithms to Simulate Many-Body Physics of Correlated Fermions,” Physical Review Applied, vol. 9, p. 44036, apr 2018.
 [12] R. Babbush, C. Gidney, D. W. Berry, N. Wiebe, J. McClean, A. Paler, A. Fowler, and H. Neven, “Encoding Electronic Spectra in Quantum Circuits with Linear T Complexity,” Physical Review X, vol. 8, p. 41015, oct 2018.
 [13] P. Shor, “Algorithms for quantum computation: discrete logarithms and factoring,” in Proceedings 35th Annual Symposium on Foundations of Computer Science, pp. 124–134, IEEE, 1994.
 [14] V. N. Smelyanskiy, K. Kechedzhi, S. Boixo, S. V. Isakov, H. Neven, and B. Altshuler, “Nonergodic delocalized states for efficient population transfer within a narrow band of the energy landscape,” arXiv:1802.09542, 2018.
 [15] A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, “Surface codes: Towards practical largescale quantum computation,” Phys. Rev. A, vol. 86, no. 3, p. 032324, 2012.
 [16] S. Boixo, S. V. Isakov, V. N. Smelyanskiy, R. Babbush, N. Ding, Z. Jiang, M. J. Bremner, J. M. Martinis, and H. Neven, “Characterizing quantum supremacy in near-term devices,” Nature Physics, vol. 14, pp. 1–6, jun 2018.
 [17] M. J. Bremner, A. Montanaro, and D. J. Shepherd, “Achieving quantum supremacy with sparse and noisy commuting quantum computations,” Quantum, vol. 1, p. 8, apr 2016.
 [18] C. Neill, P. Roushan, K. Kechedzhi, S. Boixo, S. V. Isakov, V. Smelyanskiy, A. Megrant, B. Chiaro, A. Dunsworth, K. Arya, et al., “A blueprint for demonstrating quantum supremacy with superconducting qubits,” Science, vol. 360, no. 6385, pp. 195–199, 2018.
 [19] K. De Raedt, K. Michielsen, H. De Raedt, B. Trieu, G. Arnold, M. Richter, T. Lippert, H. Watanabe, and N. Ito, “Massively parallel quantum computer simulator,” Computer Physics Communications, vol. 176, pp. 121–136, jan 2007.
 [20] J. Preskill, “Quantum Computing in the NISQ era and beyond,” arXiv:1801.00862 [quant-ph], jan 2018.
 [21] M. Smelyanskiy, N. P. D. Sawaya, and A. AspuruGuzik, “qHiPSTER: The Quantum High Performance Software Testing Environment,” arXiv:1601.07195, 2016.
 [22] T. Häner and D. S. Steiger, “0.5 Petabyte Simulation of a 45-Qubit Quantum Circuit,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 33, ACM, 2017.
 [23] S. Aaronson and L. Chen, “Complexity-theoretic foundations of quantum supremacy experiments,” arXiv:1612.05903, 2016.
 [24] A. Bouland, B. Fefferman, C. Nirkhe, and U. Vazirani, “Quantum Supremacy and the Complexity of Random Circuit Sampling,” arXiv:1803.04402, 2018.
 [25] E. Pednault, J. A. Gunnels, G. Nannicini, L. Horesh, T. Magerlein, E. Solomonik, and R. Wisnieff, “Breaking the 49-Qubit Barrier in the Simulation of Quantum Circuits,” arXiv:1710.05867, 2017.
 [26] Z.-Y. Chen, Q. Zhou, C. Xue, X. Yang, G.-C. Guo, and G.-P. Guo, “64-Qubit Quantum Circuit Simulation,” Science Bulletin, vol. 63, pp. 964–971, aug 2018.
 [27] R. Li, B. Wu, M. Ying, X. Sun, and G. Yang, “Quantum Supremacy Circuit Simulation on Sunway TaihuLight,” arXiv:1804.04797 [quant-ph], apr 2018.
 [28] J. Chen, F. Zhang, C. Huang, M. Newman, and Y. Shi, “Classical Simulation of Intermediate-Size Quantum Circuits,” arXiv:1805.01450 [quant-ph], may 2018.
 [29] I. L. Markov, A. Fatima, S. V. Isakov, and S. Boixo, “Quantum Supremacy Is Both Closer and Farther than It Appears,” arXiv:1807.10749 [quant-ph], jul 2018.
 [30] B. Villalonga, S. Boixo, B. Nelson, C. Henze, E. Rieffel, R. Biswas, and S. Mandrà, “A flexible highperformance simulator for the verification and benchmarking of quantum circuits implemented on real hardware,” arXiv preprint arXiv:1811.09599, 2018.
 [31] H. De Raedt, F. Jin, D. Willsch, M. Nocon, N. Yoshioka, N. Ito, S. Yuan, and K. Michielsen, “Massively parallel quantum computer simulator, eleven years later,” arXiv:1805.04708, 2018.
 [32] “IBM Announces Advances to IBM Q Systems & Ecosystem, IBM Press Release,” November 2017.
 [33] “A Preview of Bristlecone, Google’s New Quantum Processor, Google AI Blog,” March 2018.
 [34] S. Aaronson and A. Arkhipov, “The Computational Complexity of Linear Optics,” in Proceedings of the forty-third annual ACM symposium on Theory of computing, pp. 333–342, ACM, 2010.
 [35] M. J. Bremner, A. Montanaro, and D. J. Shepherd, “Average-Case Complexity Versus Approximate Simulation of Commuting Quantum Computations,” Physical Review Letters, vol. 117, p. 80501, aug 2016.
 [36] S. Aaronson and L. Chen, “Complexity-Theoretic Foundations of Quantum Supremacy Experiments,” in LIPIcs-Leibniz International Proceedings in Informatics, vol. 79, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
 [37] A. Harrow and S. Mehraban, “Approximate unitary designs by short random quantum circuits using nearest-neighbor and long-range gates,” arXiv preprint arXiv:1809.06957, 2018.
 [38] R. Movassagh, “Efficient unitary paths and quantum computational supremacy: A proof of average-case hardness of Random Circuit Sampling,” arXiv:1810.04681, 2018.
 [39] L. S. Bishop, S. Bravyi, A. Cross, J. M. Gambetta, and J. Smolin, “Quantum volume,” Technical Report, 2017.
 [40] E. Magesan, J. M. Gambetta, and J. Emerson, “Scalable and robust randomized benchmarking of quantum processes,” Phys. Rev. Lett., vol. 106, no. 18, p. 180504, 2011.
 [41] E. Knill, D. Leibfried, R. Reichle, J. Britton, R. Blakestad, J. Jost, C. Langer, R. Ozeri, S. Seidelin, and D. Wineland, “Randomized benchmarking of quantum gates,” Phys. Rev. A, vol. 77, no. 1, p. 012307, 2008.
 [42] E. Magesan, J. M. Gambetta, and J. Emerson, “Characterizing Quantum Gates via Randomized Benchmarking,” Phys. Rev. A, vol. 85, April 2012.
 [43] A. Erhard, J. J. Wallman, L. Postler, M. Meth, R. Stricker, E. A. Martinez, P. Schindler, T. Monz, J. Emerson, and R. Blatt, “Characterizing large-scale quantum computers via cycle benchmarking,” arXiv:1902.08543, 2019.
 [44] K. A. Britt, F. A. Mohiyaddin, and T. S. Humble, “Quantum accelerators for high-performance computing systems,” in 2017 IEEE International Conference on Rebooting Computing (ICRC), pp. 1–7, Nov 2017.
 [45] T. S. Humble, R. J. Sadlier, and K. A. Britt, “Simulated execution of hybrid quantum computing systems,” in Quantum Information Science, Sensing, and Computation X, vol. 10660, p. 1066002, International Society for Optics and Photonics, 2018.
 [46] D. Gottesman, “The Heisenberg representation of quantum computers,” arXiv preprint quant-ph/9807006, 1998.
 [47] S. Aaronson and D. Gottesman, “Improved simulation of stabilizer circuits,” Physical Review A, vol. 70, no. 5, p. 052328, 2004.
 [48] S. Bravyi and D. Gosset, “Improved Classical Simulation of Quantum Circuits Dominated by Clifford Gates,” Physical Review Letters, vol. 116, p. 250501, jun 2016.
 [49] R. S. Bennink, E. M. Ferragut, T. S. Humble, J. A. Laska, J. J. Nutaro, M. G. Pleszkoch, and R. C. Pooser, “Unbiased simulation of near-Clifford quantum circuits,” Physical Review A, vol. 95, no. 6, p. 062337, 2017.
 [50] I. L. Markov and Y. Shi, “Simulating quantum computation by contracting tensor networks,” SIAM Journal on Computing, vol. 38, pp. 963–981, jan 2008.
 [51] J. Biamonte and V. Bergholm, “Tensor networks in a nutshell,” arXiv preprint arXiv:1708.00006, 2017.
 [52] A. McCaskey, E. Dumitrescu, M. Chen, D. Lyakh, and T. Humble, “Validating quantum-classical programming models with tensor network simulations,” PLOS ONE, vol. 13, pp. 1–19, 12 2018.
 [53] S. Boixo, S. V. Isakov, V. N. Smelyanskiy, and H. Neven, “Simulation of low-depth quantum circuits as complex undirected graphical models,” arXiv:1712.05384 [quant-ph], dec 2017.
 [54] S. Boixo and C. Neill, “The question of quantum supremacy,” Google AI Blog, May 2018. Available on GitHub at https://github.com/sboixo/GRCS.
 [55] M.-C. Chen, R. Li, L. Gan, X. Zhu, G. Yang, C.-Y. Lu, and J.-W. Pan, “Quantum Teleportation-Inspired Algorithm for Sampling Large Random Quantum Circuits,” arXiv:1901.05003 [quant-ph], January 2019.
 [56] D. Lyakh, “Tensor algebra library routines for shared memory systems (TALSH).” https://github.com/DmitryLyakh/TAL_SH.
 [57] D. I. Lyakh, “An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVIDIA Tesla GPU,” Computer Physics Communications, vol. 189, pp. 84–91, 2015.
 [58] A.-P. Hynninen and D. I. Lyakh, “cuTT: A high-performance tensor transpose library for CUDA compatible GPUs,” CoRR, vol. abs/1705.01598, 2017.
 [59] D. I. Lyakh, “Domain-specific virtual processors as a portable programming and execution model for parallel computational workloads on modern heterogeneous high-performance computing architectures,” International Journal of Quantum Chemistry, p. e25926, 2019.