1. Introduction
A central design challenge for future generations of wireless networks is to meet users’ everincreasing demand for capacity and throughput. Recent advances in the design of wireless networks to this end, including the 5G efforts underway in industry and academia, call in particular for the use of Large and Massive multiple input multiple output (MIMO) antenna arrays to support many users near a wireless access point (AP) or base station sharing the wireless medium at the same time.^{1}^{1}1We use the terms ”access point” and ”base station” interchangeably in this paper.
Much effort has been and is currently being dedicated to large and massive MIMO, and these techniques are coming to fruition, yielding significant gains in network throughput. An apropos example is Massive MIMO: in LTE cellular and 802.11ac localarea networks, up to eight antennas are supported at the AP spatially multiplexing (telatarcapacity99) parallel information streams concurrently to multiple receive antennas. The technique is also known as multiuser MIMO (MUMIMO) and can be used both in the uplink and the downlink of multiuser MIMO networks: in the uplink case, several users concurrently transmit to a multiantenna AP, while over the downlink, the AP multiplexes different information streams to a number of mobile users.
From a design standpoint, one of the most promising and cost effective architectures to implement 5G technologies is the centralized radio access network (CRAN) architecture (namba2012bbu; rost2014opportunistic). CRAN pushes most of the physicallayer processing that currently takes place at the AP to a centralized data center, where it is aggregated with other APs’ processing on the same hardware. The CRAN concept has undergone several iterations since its inception, with more recent work treating the unique demands of small cells (fluidnetton16), centralizing most of the computation and supporting hundreds or thousands of the APs from a centralized data center.
To fully realize Massive MIMO’s potential throughput gains, however, the system must effectively and efficiently demultiplex mutuallyinterfering information streams as they arrive. Current large MIMO designs such as Argos (argosmobicom12), BigStation (BigStation), and SAM (SAM) use linear processing methods such as zeroforcing and minimum mean squared error (MMSE) filters. These methods have the advantage of very low computational complexity, but suffer when the MIMO channel is poorlyconditioned (Geosphere), as is often the case when the number of user antennas approaches the number of antennas at the AP (Geosphere; BigStation). Sphere Decoderbased maximum likelihood (ML) MIMO decoders/detectors (Geosphere; ETH_HARD) can significantly improve throughput over such linear filters, but suffer from increased computational complexity: compute requirements increase exponentially with the number of antennas (SDComp1), soon becoming prohibitive (Section 2.1).
The problem of limited computational capacity stems from the requirement that a receiver’s physical layer finish decoding a frame before the sender requires feedback about its decoding success or failure. For WiFi networks, e.g., this quantity is on the order of tens of s, the spacing in time between the data frame and its acknowledgement. More sophisticated physical layers, such as 4G LTE, require the receiver to give feedback in the context of incremental redundancy schemes, for the same reason; the processing time available is 3 ms for 4G LTE and 10 ms for WCDMA (BigStation; dahlman20134g). As a result, most current systems adopt linear filters, accepting a drop in performance.
New approach: Quantum computation in the data center. This paper explores the leveraging of quantum computation (QC) to speed up the computation required for the ML MIMO decoder. We place our ideas in the context of the QC currently already realized in experimental hardware, and in the context of the dominant CRAN architecture. Here we imagine a future quantum computer, colocated with CRAN computational resources in a data center, connected to the APs via highspeed, lowlatency fiber or millimeterwave links.
Optimization is one of the key applications the quantum community has identified as viable in the shortterm (i.e. before quantum processors become scalable devices capable of error correction and universality). While their potential in optimization is largely unproven, it is believed that it may be possible for Noisy IntermediateScale Quantum (NISQ) devices to achieve polynomial or exponential speedups over the best known classical algorithms (preskill2018quantum). It is, however, important to leverage understanding from current prototypes in order to inform the design of realworld systems, since performance cannot be predicted or simulated efficiently, especially in the presence of devicespecific noise. This is the approach we therefore advocate here. Two main approaches have been identified for quantum optimization in NISQs: Quantum Annealing (QA) and Quantum Approximate Optimization Algorithms (QAOA). The former approach is a form of analog computation that has been developed theoretically in the early nineties (finnila1994quantum), but realized experimentally in a programmable device only in 2011 by DWave Systems. We focus on QA here, discussing QAOA briefly in Section 6.
This paper presents QuAMax
, the first system to apply QA to the computationally challenging ML MIMO wireless decoding problem in the context of a centralized RAN architecture where a QA is colocated in a data center serving one or more wireless APs. The contributions of our paper can be summarized as follows: Firstly, we present the first reduction of the ML MIMO decoding problem to a form that a QA solver can process. Secondly, we introduce a new, communicationsspecific evaluation metric,
TimetoBER (TTB), which evaluates the performance of the QA as it aims to achieve a target bit error rate (BER) on the decoded data. Finally, we evaluate QuAMax with various scenarios and parameter settings and test their impact on computational performance. To achieve a BER of and a frame error rate of , ML MIMO detection on the DWave 2000Q quantum annealer requires 10–20 s of computation time for 48user, 48AP antenna binary modulation or 30–40 s for QPSK at 20 dB SNR, and with the realworld trace of MIMO channel, the largest spatial multiplexing MIMO size publicly available for experiments (Argos), QuAMax requires 2 s for BPSK and 2–10 s for QPSK. Paper roadmap. The next section is a background primer on ML detection and QA. Section 3 details our programming of the ML problem on the QA hardware. Section 4 describes QuAMax implementation in further detail, followed by our evaluation in Section 5. We conclude with a review of related work (Section 6), discussion of current status of technology and practical considerations (Section 7), and final considerations (Section 8).2. Background
In this section, we present primer material on the MIMO ML Detection problem (§2.1), and Quantum Annealing (§2.2).
BPSK  QPSK  16QAM  Complexity (Visited Nodes) 

40 (feasible)  
270 (borderline)  
1,900 (unfeasible) 
2.1. Primer: Maximum Likelihood Detection
Suppose there are mobile users, each of which has one antenna, each sending data bits to a multiantenna () MIMO AP based on OFDM, the dominant physical layer technique in broadband wireless communication systems (nee2000ofdm)
. Considering all users’ data bits together in a vector whose elements each comprise a single user’s data bits, the users first map those data bits into a complexvalued
symbol that is transmitted over a radio channel: . Each user sends from a constellation of size = ( bits per symbol). The MIMO decoding problem, whose optimal solution is called the ML solution, consisting of a search over the sets of transmitted symbols, looking for the set that minimizes the error with respect to what has been received at the AP:(1) 
The ML decoder then “demaps” the decoded symbols to decoded bits . In Eq. 1, is the wireless channel^{2}^{2}2The channel changes every channel coherence time
, and is practically estimated and tracked via preambles andor pilot tones. Typical coherence time at 2 GHz and a walking speed is
ca. 30 ms (tseviswanath). on each OFDM subcarrier and () is the received set of symbols, perturbed by , additive white Gaussian noise (AWGN). This solution minimizes detection errors, thus maximizing throughput (i.e., throughputoptimal decoding).The Sphere Decoder (Agrell02; Damen03) is a ML detector that reduces complexity with respect to bruteforce search by constraining its search to only possible sets that lie within a hypersphere of radius centered around (i.e., Eq. 1 with constraint ). It transforms Eq. 1 into a tree search (SD)
, where Q is orthonormal and R uppertriangular, resulting in , with . The resulting tree has a height of , branching factor of , and nodes. ML detection becomes the problem of finding the single leaf among with minimum metric; the corresponding tree path is the ML solution. Thus, the in Eq. 1 is a search in an exponentiallylarge space of transmitted symbols , despite Sphere Decoder reductions in the search space size (SD).Table 1
shows the average number of tree nodes visited to perform ML Sphere decoding, with clients transmitting modulation symbols on 50 subcarriers over a 20 MHz, 13 dB SNR (Signal to Noise Ratio) Rayleigh channel. The table is parameterized on the number of clients and AP antennas, and the modulation, highlighting the exponential increase in computation. For 8 clients with 16QAM symbols, 15 clients with QPSK symbols, or 30 clients sending binary (BPSK) symbols, the Sphere Decoder visits close to 2,000 tree nodes, saturating, for example, Intel’s Skylake core i7 architecture, whose arithmetic subsystem achieves an order of magnitude less computational throughput
(flexcorensdi17). Since traditional silicon’s clock speed is plateauing (elec:courtland), the problem is especially acute.2.2. Primer: Quantum Annealing
Quantum Annealers (AQC; QA) are specialized, analog computers that solve NPcomplete and NPhard optimization problems on current hardware, with future potential for substantial speedups over conventional computing (Mc). Many NPhard problems can be formulated in the Ising model (10.3389/fphy.2014.00005) (cf. §3.2), which many QA machines use as input (DW_part; DW_map). NPcomplete and NPhard problems other than ML MIMO detection in the field of (wireless) networking that potentially benefit from QA include MIMO downlink precoding (mazrouei2012vector), channel coding (jose2015analysis; wu2003block), network routing (chen2002efficient), security (forouzan2007cryptography), and scheduling (liu2001opportunistic; georgiadis2006resource).
Quantum Annealing hardware.
Compared to simulated annealing, the classical algorithm from which QA inherits its name, QA aims to exploit quantum effects such as tunneling, manybody delocalization and quantum relaxation to circumvent computational bottlenecks that would otherwise trap Monte Carlo methods in local minima of the solution landscape. While exploiting QA is technologically challenging, with the appearance of the DWave quantum annealer (Fig. 1), the research community is now able to run experiments, and critically, to study under what conditions a noisyintermediatescalequantum (NISQ) machine (preskill2018quantum) can use quantum resources to deliver a speedup (job2018test). For instance, recently Boixo et al. (boixo45467) and Denchev et al. (denchev44814) have found evidence that tunneling under ideal conditions can be exploited on an earlier model of the DWave 2000Q (DW2Q) machine, delivering many orders of magnitude speedup against CPUbased simulated annealing, which is considered to be one of the best classical competitions to Quantum Processing Units (QPUs). QPUs also outperform GPU implementations by several orders of magnitude in random problems whose structure is related to real world optimization problems (king2019quantum).
The DW2Q is an analog optimizer, meaning that it computes continuously rather than in discrete clock cycles, and that it represents numerical quantities as analog instead of digital quantities. The hardware initializes each of its constituent quantum bits, or qubits, to begin in a superposition state that has no classical counterpart. In concrete terms, these qubits are metallic circuits in a chip that are maintained in a superconducting state by low temperature and subjected to the influence of tailored magnetic fluxes. The collection of qubits at this point in time encodes all the possible outputs in a single state. This initial setting is achieved by exposing all the qubits in the chip to a signal whose magnitude at this point in time is maximal. Then the system implements an objective function which is represented by another signal and is ramped up from zero, while is decreased progressively at the same time. The synchronized sequence of signals and and their time dependence is the annealing schedule. The schedule is essentially the QA algorithm, and has to be optimized so that at the end of the run ( and ), each qubit in the chip assumes either a value of or , corresponding to classical bit values, 0 or 1, respectively. This final state of these qubits collectively represents a candidate solution of the problem, ideally the ground state of the system (i.e., the minimum of the optimization objective function) (johnsonnature11; quantph/0001106).
In practice, at the end of the run, the ground state will be found with a probability that depends on the degree to which the schedule is optimal for the problem at hand, as well as on the effect of uncontrollable QA noise and environmental interference on the annealer. While the quantum community is investigating physics principles to guide schedule parameters, most clearlyunderstood theoretical principles do not apply to current, imperfect experimental systems
(job2018test). Hence the empirical approach, which we take in this paper, represents current stateoftheart (ronnow2014defining). Three degrees of freedom are specifically investigated in this work.

First, there are many ways of mapping a problem to an equivalent Ising formulation that runs on the machine (we investigate one such mapping in Section 3).

Second, the user may accelerate or delay / evolution, thus determining annealing time (1–300 s), the duration of the machine’s computation.

Finally, the user may introduce stops (anneal pause) in the annealing process, which have been shown to improve performance in certain settings (DWPauseMarshall).
3. Design
Starting from the abstract QA problem form (§3.1), QuAMax’s design reduces ML detection to form (§3.2), then compiles it on actual hardware, a process called embedding (§3.3).
3.1. QA Problem Formulation
The first step in leveraging QA for any problem is to define the problem of interest as an objective function to be minimized, consisting of a quadratic polynomial of binary variables. We now introduce two equivalent forms of this objective functions, as is customary in the QA application literature.
1. Ising spin glass form.
In this form the solution variables are traditionally referred to as spins .
(2) 
where is the number of spin variables, and and are the Ising model parameters that characterize the problem. The characterize the preference for each spin to be or : positive indicates a preference for while negative indicates a preference for , with the magnitude corresponding to the magnitude of the preference for either state. The capture preferred correlations between spins: positive causes the QA to prefer , while negative causes the QA to prefer in its optimization outcome. Analogously to , the magnitude of corresponds to the magnitude of its preference.
2. QUBO form.
The Quadratic Unconstrained Binary Optimization (QUBO) has solution variables that are classical binary bits (zero or onevalued):
(3) 
where is the qubit count and is upper triangular. The offdiagonal matrix elements () correspond to in Eq. 2, and the diagonal elements correspond to .
The two forms are equivalent, their solutions related by:
(4) 
leading to and .
3.2. MLtoQA Problem Reduction
We now explain our process for transforming the ML detection problem into the QUBO and Ising forms. Since QuAMax also assumes OFDM where the wireless channel is subdivided into multiple flatfading orthogonal subcarriers (nee2000ofdm), this MLtoQA reduction is required at each subcarrier.
3.2.1. MLtoQUBO problem reduction.
Let’s first consider the transformation of the ML problem into QUBO form—the key idea is to find a variabletosymbol transform function that represents the “candidate” vector in the ML search process (Eq. 1 on p. 2) instead with a number of QUBO solution variables. Specifically, we represent each of the senders’ candidate symbols (), with QUBO solution variables, naturally requiring QUBO variables for transmitters, and form these QUBO variables into a vector for each sender : . For example, recasts a QPSK () problem into a QUBO problem with four solution variables, split into two vectors and . In general, the transform recasts the ML problem of Eq. 1 into the form
(5) 
where . Then, the resulting vectors correspond to the QUBO solution variables, . Continuing our QPSK example, . Then, Eq. 5 results in two MLdecoded vectors , (noting that , corresponds to the ML solution in Eq. 1, the nearest symbol vector around received ). The decoded vectors , correspond to the four decoded QUBO variables in Eq. 3. If the transmitter’s bittosymbol mapping and QuAMax’s variabletosymbol transform are equivalent, then the decoded are the directly demapped bits, from the ML solution in Eq. 1.
When transform is linear the expansion of the norm in Eq. 5 yields a quadratic polynomial objective function, since for any 0 or 1valued . Then the ML problem (Eq. 1) transforms directly into QUBO form (Eqs. 3 and 5
). Our task, then, is to find variabletosymbol linear transform functions
for each of BPSK, QPSK, and 16QAM.Binary modulation. If the two mobile transmitters send two signals simultaneously, each with one of two possible information symbols, their transmissions can be described with a twovector of symbols , . This type of data transmission is called binary modulation, of which one popular kind is binary phase shift keying (BPSK). The ML problem applied to the BPSK case where symbols are represented by thus results in a QUBO form (a detailed derivation can be found in Appendix A).
We next consider higherorder modulations, which send one of possible information symbols with each channel use (where ), resulting in higher communication rates.
QPSK modulation. In the case of quadrature phase shift keying (QPSK), each sender transmits one of four possible symbols . Since it can be viewed as a twodimensional BPSK , we represent each possiblytransmitted QPSK information symbol with the linear combination of one QUBO variable, plus the other QUBO variable times the imaginary unit. Transforming and to and respectively leads to the transform .
Higherorder modulation. 16 quadrature amplitude modulation (16QAM) and higherorder modulations increase spectral efficiency, but utilize multiple amplitudes (levels) so require a that inputs more than one (binary) solution variable per I or Q dimension. First consider a transform for the simplest multilevel 1D constellation: . maps these bits to the values . Now to generalize this to 2D, let the first two arguments of , , represent the I part and the next two, represent the Q part. We call this transform, shown in Fig. 1(a), the 16QAM QuAMax transform. It has the desirable property that it maps solution variables to symbols linearly, viz. , thus results in a QUBO form.
However, transmitters in practical wireless communication systems use a different bittosymbol mapping, the Gray code shown in Fig. 1(d), which minimizes bit errors. This means that the QuAMax receiver’s bit to symbol mapping differs from the sender’s. Thus one further step remains so that we may map the decoded QUBO variables into the correct Graycoded transmitted bits.
A naïve approach is simply for QuAMax to use the Graycoded bittosymbol mapping as its transform . The Graycoded mapping results in a onedimensional 4PAM constellation assuming bits , , , and are transformed to , , , and without loss of generality. The transform would map between a 4PAM symbol and two QUBO variables , but the resulting expansion of the ML norm would yield cubic and quartic terms for , requiring quadratization with additional variables to represent the problem in QUBO form (ishikawa2009higher; boros2014quadratization).
Instead, we retain Gray coding at the transmitter and the QuAMax transform at the receiver. To correct the disparity, we develop a bitwise posttranslation that operates on QuAMaxtransformed solution output bits at the receiver, translating them back into Graycoded bits (i.e., moving from Fig. 1(a) to Fig. 1(d)). Starting with the QuAMax transform shown in Figure 1(a), if the second bit of the QUBO solution bits , , , is 1, then the translation flips the third bit and the fourth bit (e.g. 1100 to 1111), otherwise it does nothing. This can be generalized to QAM () as an operation that flips even numbered columns in the constellation upside down. We term the result an intermediate code, shown in Figure 1(b). Next, we apply the differential bit encoding transformation of Figure 1(c) to the intermediate code to obtain the Graycoded bits in Figure 1(d) (e.g. translating 1111 to 1000).
QuAMax decoding example.
To clarify processing across all stages, here we present a complete QuAMax decoding example. Suppose a client maps a bit string onto , one of the Graycoded 16QAM symbols in Figure 1(d), and sends to an AP through wireless channel . The AP receives , the transmitted signal perturbed by AWGN. The steps of QuAMax’s decoding are:

Form the ML QUBO equation using , , and , where , a linear transform based on the QuAMax transform in Figure 1(a).

Solve the QUBO form of the ML detection problem on the QA machine, resulting an MLdecoded vector , comprised of QUBO variables .
If , decoding is done successfully, noting that in the case of a symbol error, we preserve the aforementioned advantage of Gray coding.
3.2.2. MLtoIsing problem reduction.
The Ising spin glass form of the ML problem can be obtained by simply transforming the resulting QUBO form (§3.2.1) into the Ising form by Eq. 4. Due to the fact that DW2Q implements an Ising model, QuAMax works by using the following generalized Ising model parameters:
BPSK modulation. Given a channel matrix and vector of received signals , we obtain the following Ising model parameters:
(6) 
where denotes the column of channel matrix .
QPSK modulation. In the case of QPSK, the following is the resulting Ising parameter for QPSK:
(7) 
Since the real and imaginary terms of each symbol are independent, the coupler strength between and (or and ) is 0. For other and , the Ising coupler strength for QPSK is:
where and the sign of the latter case of Eq. LABEL:eqn:QPSK_ising_coefficient is determined by whether (when , then ‘’ and ‘’).
16QAM modulation. Ising parameters follow the same structure as BPSK and QPSK and can be found in Appendix C.
In summary, the process to obtain the Ising spin glass form can be simplified with these generalized Ising model parameters; a QuAMax system simply inserts the given channel and received signal at the receiver into these generalized forms accordingly, not requiring any computationally expensive operations (i.e. directly considering the expansion of the norm in Eq. 5). Thus, computational time and resources required for MLtoQA problem conversion are insignificant and can be neglected.
Config.  BPSK  QPSK  16QAM  64QAM 

3.3. Embedding into QA hardware
Once the ML detection problem is in quadratic form, we still have to compile the corresponding Ising model onto actual QA hardware. The DWave machine works by implementing an Ising model objective function energetically hardcoded into the chip, so the problem (Eq. 2 on p. 2) can support a certain coefficient to be nonzero only if variables and are associated to physical variables (qubits or physical qubits) located on the chip in such a way that the qubits are energetically coupled. In the case of the DW2Q machine we use the coupling matrix is a Chimera graph, shown in Figure 2(a), with each node corresponding to a qubit. Once Ising coefficients are passed to the annealer, the hardware assigns them to the edges of the Chimera graph, which are divided (along with their connected nodes) into unit cells. Note however that, while the Ising problem generated from Eq. 1 is almost fully connected (i.e., for most pairs), the Chimera graph itself has far from full connectivity, and so a process of embedding the Ising problem into the Chimera graph is required.
One standard method of embedding is to “clone” variables in such a way that a binary variable becomes associated not to a single qubit but to a connected linear chain of qubits instead: a logical qubit, as shown in Figure 2(b).^{3}^{3}3The optimal assignment problem, in the general case, is equivalent to the NPHard “minor embedding” problem of graph theory (Choi2011), however for fullyconnected graphs very efficient embeddings are known (klymko2014adiabatic; PhysRevX.5.031040; boothby2016fast). We show an embedding of a fullyconnected graph of 12 nodes. Each unit cell on the diagonal holds four logical qubits (a chain of two qubits), while the other unit cells are employed in order to interconnect the diagonal cells. Specifically, suppose unit cell includes logical qubits 1–4 and unit cell includes logical qubits 5–8. The left side of unit cell has a vertical clone of qubits 5–8 and the right side has a horizontal clone of logical qubits 1–4. Then, logical qubits 1–4 and 5–8 are all connected by means of the single unit cell . The unit cell hosting the next four logical qubits 9–12 is placed at coordinates . Two unit cells below, and , are used for connections between 9–12 and 1–4, and 9–12 and 5–8 respectively. Given a number of spin variables (i.e., logical qubits) in Ising form, this embedding represents each with a chain of qubits, for a total of qubits. Recall that .
Table 2 summarizes the size of the embedding in both logical and physical qubits, as a function of the MIMO detection problem’s parameters—number of users and AP antennas, and modulation type. Color coding and bold font indicate whether or not the given parameters fit into the number of qubits available on current DWave machines.
The embedded version of the Ising problem.
After embedding into Chimera graph we need to recast the Ising problem into an equivalent problem that has the same ground state, but also satisfies the Chimera graph constraints. We also need to introduce a constant penalty term () to quantify the relatively large coupling that constrains all physical qubits belonging to the same logical qubit to prefer the same state. Appendix B contains additional detail, but we discuss important experimental considerations for choosing in Section 5.3.
Unembedding with majority voting. The bit string that the DW2Q returns is expressed in terms of the embedded Ising problem, and so must be unembedded in order to have the values of the bits expressed in terms of our ML Ising problem. This is done by checking that all the qubits of a logical chain are either or . Should not all spins be concordant, the value of the corresponding logical variable is obtained by majority voting (in case of a vote tie, the value is randomized). Once the logical variables are determined, each configuration yields the corresponding energy of the Ising objective function by substituting it into the original Ising spin glass equation (Eq. 2).
4. Implementation
This section describes our implementation on the DWave 2000Q quantum annealer (DW2Q), explaining the API between the annealer’s control plane and its quantum substrate, machine parameters, and their tuning to the problem at hand.
Each anneal cycle on the DW2Q yields a configuration of spins (i.e., one decoded bit string). The user programs the annealer to run a batch of annealing cycles (one QA run) with the same parameters to accumulate statistics, which means that we have a set of configurations from a DW2Q job submission. The lowest energy configuration among anneals is the best answer found.
Parallelization.
Multiple instances (identical or not) can be run physically alongside each other, reducing run time by the parallelization factor^{4}^{4}4While asymptotically the parallelization factor is just the ratio of used physical qubits after embedding to the number of qubits in the chip , in finitesize chips, chip geometry comes into play. —a small 16qubit problem employing just 80 physical qubits (e.g. 16user BPSK, 8user QPSK, and 4user 16QAM) could in fact be run more than 20 times in parallel on the DW2Q.
Precision Issues.
As analog devices, the desired embedded Ising coefficients (Eqs. 1012 in Appendix B) do not perfectly match their real energy values once hardcoded in the real machine, and hence give rise to intrinsic control errors (ICE), an uncontrollable shift in the actual programmed values of the objective function. ICE is appropriately modeled as a noise fluctuating at a time scale of the order of the anneal time, i.e., on each anneal, Ising coefficients are perturbed: ,
. where the noise is Gaussian with mean and variance measured
and in the most delicate phase of the annealing run (ICE). The impact of ICE on performance depends on the problem at hand (PhysRevA.93.012317; PhysRevA.65.012322), but it is clear that precision issues will arise if the largest energy scale squeezes the value of the coefficients (in Eqs. 11–12 in Appendix B) to a level where ICE is likely to erase significant information of the problem’s ground state configuration.Annealer Parameter Setting. As discussed in Section 3.3, the that enforces a chain of qubits to return a series of values which are all in agreement (either all or ), and the annealing time are both key performance parameters that determine the net time to find a solution, and hence overall QA performance. We also introduce 1, 10, and 100 s pause time in the middle of annealing ( s) with various pause positions , to see the effect of pausing (DWPauseMarshall) on our problems. Setting too large would wash out the problem information due to ICE, however on average should increase with the number of logical chains in fullyconnected problems in the absence of ICE (PhysRevX.5.031040). Due to the lack of a firstprinciples predictive theory on the correct value for a given instance, we present in Section 5.3 an empirical investigation of the best embedding parameters, employing the microbenchmarking methodologies generally accepted (Rieffel2015; o15compiling; PhysRevX.5.031040). Below we perform a sensitivity analysis on , , , and (§5.3.1) over the ranges , , , and .
Improved coupling dynamic range. The dynamic range of coupler strengths is defined as the ratio between the maximum and minimum values that can be set ( in Eq. 2). To strengthen interactions between embedded qubits, the DW2Q is able to double the magnitude of valid negative coupler values, effectively increasing the precision of embedded problems and reducing ICE. However, this improved range option, when enabled, breaks the symmetry of the Ising objective function (substituting the opposite signs for connected coefficients and their couplings, into the same problem), and hence precludes averaging over these symmetrical instances as the DW2Q does without the improved range option, to mitigate leakage errors (boixo2014evidence). It is thus unclear whether the use of this feature is beneficial in the end without experimentation, and so we benchmark in Section 5 both with and without improved range.
5. Evaluation
We evaluate QuAMax on the DW2Q Quantum Annealer machine shown in Figure 1. We consider the same number of antennas at the clients and AP (i.e., ). Section 5.1 introduces QA results, while Section 5.2 explains our experimental methodology. After that in Section 5.3 we present results under only the presence of the annealer’s internal thermal noise (ICE). Sections 5.4 and 5.5 add wireless AWGN channel noise and tracebased realworld wireless channels, respectively, quantifying their interactions with ICE on endtoend performance. Over anneals are used in our performance evaluation.
5.1. Understanding Empirical QA Results
We begin with a close look at two runs of the QA machine, to clarify the relationships between Ising energy, Ising energyranked solution order, and BER. Figure 4 shows six QA problem instances (all of which require 36 logical qubits), corresponding to two different wireless channel uses for each of varying modulation and number of users. The solutions are sorted (ranked) by their relative Ising energy difference from the minimum Ising energy (blue numbers atop selected solutions), where red bars show each solution’s frequency of occurrence (in the rare case of two or more tied distinct solutions, we split those solutions into multiple solution ranks). The number of bit errors associated with each solution appears as a green curve. For statistical significance, this data summarizes 50,000 anneals, more than QuAMax’s practical operation. As modulation order increases and number of users decreases (from left to right in Figure 4), the probability of finding the ground state tends to be lower, while the search space size remains constant, leading eventually to higher BER and FER.^{5}^{5}5See section 5.2.2. Frame error rate FER is computed as . The relative Ising energy gap also trends smaller,^{6}^{6}6The energy distribution of the Ising objective function (Eq. 2) corresponds to the distribution of ML decoder Euclidean distances (Eq. 1). and is likely to be inversely correlated with the impact of ICE on the problem instance (PhysRevA.93.012317; 2018arXiv180603744A).
5.2. Experimental methodology
In this section, we introduce performance metrics and figures of merit that give insight into the operation of the QA machine as it solves the ML MIMO decoding problem.
We note that in our performance evaluation we exclude from consideration programming time and postprogramming time of the Ising coefficients on the chip, and readout latency of the qubit states after a single anneal. Currently, these times dominate the pure computation time (i.e. total anneal time) by several orders of magnitude (milliseconds), due to engineering limitations of the technology. However, these overheads do not scale with problem size and are not fundamental performance factors of the fully integrated QuAMax system, and so this is in accordance with experimental QA literature convention. We discuss these overheads in Section 7.
5.2.1. Metric: TimetoSolution (TTS)
Suppose we find the ground state (corresponding to the minimum energy solution within the search space of bit strings, where is the variable count) with probability . In the absence (but not presence) of channel noise, this ground state corresponds to a correct decoding. Each anneal is an independent, identicallydistributed random process, meaning that the expected time to solution, or , is the anneal time of each anneal multiplied by the expected number of samples to be able to find the ML solution with probability : . TTS is commonly used in the QA literature, setting (ronnow2014defining).
5.2.2. Our Metrics: BER and TimetoBER (TTB)
TTS reflects the expected time to find the ground state, but does not characterize the expected time our system takes to achieve a certain Bit Error Rate (BER, averaged across users), the figure of merit at the physical layer. This quantity differs from TTS, because TTS only considers the ground state, and as illustrated in the example shown in Figure 4, solutions with energy greater than the ground state may also have (rarely) no or relatively few bit errors, while wireless channel noise may induce bit errors in the ground state solution itself. Hence we introduce a metric to characterize the time required to obtain a certain BER , TimetoBER: TTB(p). This is preferred in our setting, since a low but nonzero bit error rate is acceptable (error control coding operates above MIMO detection).
TTB for a single channel use.
Since one QA run includes multiple () anneals, we return the annealing solution with minimum energy among all anneals in that run. We show an example of this process for one instance (i.e., channel use, comprised of certain transmitted bits and a certain wireless channel) in Fig. 4. The annealer finds different solutions, with different Ising energies, ranking them in order of their energy. Considering this order statistic, and the fact that QuAMax considers only the best solution found by all the anneals in a run, the expected BER of instance after anneals can be expressed as
(9) 
where is qubit count, () is the number of distinct solutions, () is the rank index of each solution, is the probability of obtaining the solution, and is the number of bit errors of the solution against ground truth.^{7}^{7}7Note that the metric has omniscient knowledge of ground truth transmitted bits, while the machine does not. To calculate TTB(), we replace the left hand side of Eq. 9 with , solve for , and compute TTB() .
Generalizing to multiple channel uses.
The preceding predicts TTB for a fixed instance. In the following study we compute TTB and BER across multiple instances (random transmitted bits and randomlyselected wireless channel), reporting statistics on the resulting sampled distributions.
5.3. Performance Under Annealer Noise
This section presents results from the DW2Q annealer for wireless channel noisefree scenarios, in order to characterize the machine’s performance itself as a function of time spent computing. Sections 5.4 and 5.5 experiment with Gaussian noise and tracebased wireless channels, respectively.
In this section, we run several instances with unit fixed channel gain and average transmitted power. Each instance has a randomphase channel, randomly chosen transmitted bit string, and is repeated for each of three different modulations (BPSK, QPSK, 16QAM) and varying numbers of users and AP antennas. Each instance is reduced to Ising as described in Section 3.2, for a total of 780 different problems per QA parameter setting. Unless otherwise specified, this and subsequent sections use the fixed parameter settings defined in §5.3.1. We obtain significant statistics by postprocessing up to 50,000 anneals per QA run (except 10,000 anneals for anneal pause analysis in Figure 7).
5.3.1. Choosing Annealer Parameters
In order to isolate the effect of different parameter settings on individual problems, we employ microbenchmarks on TTS. This section explains our choice of parameter settings for our main performance results in §5.3.3, §5.4, and §5.5. Note that while we plot results here only for BPSK and QPSK to save space, our results show that the methods, arguments and observations generalize to higher modulations, unless otherwise indicated. For the purpose of setting the parameters, we restrict the dataset to the ML problems that solve within a median TTS(0.99) of 10 ms for which we have low uncertainty on the measured success probability. We use the determined parameters for all instances regardless of their TTS for the performance analysis.
Ferromagnetic couplings: We examine median versus over 10 random instances of different sizes both with and without extended dynamic range. In Fig. 5, we observe that while there is a performance optimum that depends on the problem size for standard dynamic range, extended dynamic range shows less sensitivity to , obtaining roughly the optimal performance of standard dynamic range. Anneal time: As we vary , we observe greater sensitivity when we use nonoptimal , as the scatter points next to each data point in Fig. 6 (left) show. On the other hand, Fig. 6 (right) shows that an extended dynamic range setting achieves best results at s regardless of problem size, showing less sensitivity to different . Anneal Pause Time and Location: When we apply improved dynamic range at s, we observe a slight independence (Fig. 7) of and on , and as increases, so does TTS. While the dynamic range setting has shown less sensitivity to , anneal pause with extended dynamic range shows more sensitivity. Note that TTS of 18user QPSK at s is slightly improved, compared to the best results in Figs. 5 and 6.
Annealer Parameter Optimization.
Based on the previous sensitivity analysis, we select a default QA parameter setting. First, we choose improved dynamic range since it is relatively robust to choice of , nearly equaling the best performance of the standard dynamic range. Second, we choose s, since it shows better results and greater pause times dominate the anneal time.
5.3.2. Choosing whether to pause
With the above default QA parameters, we now use TTB to explore whether or not we should use the QA pause functionality, as TTB encompasses both algorithms’ BER performance as well as wall clock running time (cf. TTS). We first define a fixed parameter setting by selecting the best estimated choices for the nonpausing algorithm and for the pausing algorithm, meaning the parameters which optimize medians across a sample of instances belonging to the same problem class (e.g. 1818 QPSK). This approach is to be compared against an oracle that optimizes {, } or {, } instance by instance. In the figures, we denote the two parameter setting methods as Fix (fixed) and Opt (optimal), respectively.
Our motivation for considering Opt is that it provides a bound to what can be achieved by the methods that seek to optimize machine parameter settings instance by instance (ReverseVenturelli; venturelli2015quantum), currently under investigation. With our traces we compute BER as a function of using Eq. 9; the median result across 20 random instances is shown in Figure 8 (upper). Figure 8 (lower) shows the corresponding BER as a function of time (i.e., TTB). Note that the pausing algorithm has a better performance than the nonpausing algorithm (regardless of Opt and Fix strategy) despite the fact that each anneal in the former () takes twice as much time than the latter (when ). Based on this empirical finding, we will present the results in the following section only for the protocol that includes a pause.
5.3.3. QuAMax: EndtoEnd performance
We now evaluate the TTB and TTF (TimetoFER) of QuAMax, comparing:

QuAMax: Fixedparameter, averagecase performance.

Oracle: Mediancase Opt performance (§5.3.2: outlier data points have minimal influence on the median order statistic), optimizing QA parameters.^{8}^{8}8Outlier mitigation methods for QA may address such outliers in future work (qubopreproc).
Figure 9 shows the TTB with varying user numbers and modulations at the edge of QuAMax’s performance capabilities. Solid and dashed lines report median and average BER, respectively. We note that mean TTB dominates median TTB due to a small number of longrunning outliers. QuAMax accordingly sets a time deadline (measured median TTB for the target BER) for decoding and after that discards bits, relying on forward error correction to drive BER down. Next considering the relationship between TTB and problem size, Fig. 10 explores TTB for target BER , for each instance that reaches a BER of within 10 ms as well as average performance. ML problems of these sizes are well beyond the capability of conventional decoders (cf. Table 1), and we observe that Opt achieves superior BER within 1–100 and that QuAMax achieves an acceptable BER for use below error control coding. Note that instances with TTB below the minimum required time (i.e., + ) caused by parallelization require (an amortized) 2 .
Next, we consider frame error rate performance, measuring mean and median FER QuAMax achieves. Results in Fig. 11 show that tens of microseconds suffice to achieve a low enough (below ) FER to support high throughput communication for 60user BPSK, 18user QPSK, or fouruser 16QAM suffices to serve four users with the idealized median performance of Opt. QuAMax (mean Fix) achieves a similar performance with slightly smaller numbers of users. Furthermore, our results show low sensitivity to frame size, considering maximalsized internet data frames (1,500 bytes) all the way down to TCP ACKsized data frames (50 bytes).
5.4. Performance under AWGN Noise
We next evaluate the impact of AWGN from the wireless channel, testing six SNRs ranging from 10 dB to 40 dB. In order to isolate the effect of noise, the results in this subsection fix the channel and transmitted bitstring and consider ten AWGN noise instances. Looking at the data in depth to begin with, the effect of AWGN channel noise, which is itself additive to ICE, is shown in Fig. 12 for six illustrative examples. As SNR increases, the probability of finding the ground state and the relative energy gap tend to increase. At 10 dB SNR the energy gap between the lowest and second lowest energy solutions narrows to just three percent, leaving minimal room for error. In terms of overall performance, Fig. 13 (left) shows TTB at 20 dB SNR, varying number of users and modulation. At a fixed SNR, we observe a graceful degradation in TTB as the number of users increases, across all modulations. Fig. 13 (right) shows TTB at a certain user number, varying SNR and modulation. At a fixed user number, as SNR increases, performance improves, noting that the idealized median performance of Opt shows little sensitivity to SNR, achieving BER within s in all cases.
Fig. 14 compares QuAMax’s performance versus zeroforcing decoder at bad SNR scenarios, showing the necessity of MLbased MIMO decoders for large MIMO system. Linear decoders such as zeroforcing and MMSE suffer from the effect of the poor channel condition (when ), requiring (i.e., more antennas) for appropriate BER performance. In Figure 14, QuAMax reaches the zeroforcing’s BER (or even better BER) approximately 101000 times faster than zeroforcing in both BPSK and QPSK modulation. Here, computation times for zeroforcing are inferred from processing time using a single core in BigStation (BigStation), one of the current large MIMO designs based on zeroforcing. While this processing time can be reduced proportionally with more cores, BER (i.e., quality of solutions) remains unchanged. The Sphere Decoder achieves comparable BER, but processing time cannot fall below a few hundreds of s with the given numbers of users and SNRs.^{9}^{9}9Extreme levels of parallelization or GPU implementation might be able to resolve the issue. However, practical constraints will eventually limit the increase in performance on classical platforms (king2019quantum). Contrarily, overheads in QuAMax are apart from pure computation, which can be resolved by engineering design.
5.5. TraceDriven Channel Performance
We evaluate system performance with real wideband MIMO channel traces at 2.4 GHz, between 96 base station antennas and eight static users (Argos). This dataset comprises the largest MIMO trace size currently available. For each channel use, we randomly pick eight base station antennas to evaluate the MIMO channel use at SNR ca. 25–35 dB: Fig. 15 shows the resulting TTB and TTF. We achieve BER and FER within s for QPSK. For BPSK, considering multiple problem instances operating in parallel, we achieve the same BER and FER within (an amortized) s period. This implies that tens of microseconds suffice to achieve a low BER and FER even without parallelization of identical problems, which creates an opportunity for QuAMax to parallelize different problems (e.g., different subcarriers’ ML decoding).
6. Related Work
Applications of QA.
Despite the immaturity of software toolchains, existing quantum annealing machines have been already programmed successfully to solve problems in Planning and Scheduling (Rieffel2015), Databases (Trummer:2016:MQO:2947618.2947621), Fault Diagnostics (PerdomoOrtiz2015)
(perdomo2017opportunities), Finance (Rosenberg:2015:SOT:2830556.2830563), Data Analysis (mott2017solving), Chemistry (hernandez2017enhancing), and Space SciencesAeronautics (biswas2017nasa). A Similar problem to ML detection, CDMA multiuser demodulation, was solved using quantum fluctuations controlled by the transverse field (similar as QA) in (otsubo2014code). Of particular relevance is work on optimization of fullyconnected graphs, such as the ones used to map the ML problem (PhysRevX.5.031040); the results of which showed that QA performance could match the most highly optimized simulated annealing code run on the latest Intel processors. For further details on the logical to physical qubit embedding process, see Venturelli et al. (PhysRevX.5.031040). Efficient embeddings which do not force the chip coverage to be a triangle are also known (boothby2016fast).Qaoa.
Quantum Approximate Optimization Algorithms, invented in 2014 (farhi2014quantum)
, and recently generalized for constrained combinatorial optimization
(Hadfield:2017:QAO:3149526.3149530), require digital gatemodel QC, which became available at reasonable scale only in 2017 (prototypes from IBM, Rigetti Computing, and Google are available). While QA and QAOA require different hardware (the former is analog, the latter digital) they have in common that: (1) For problems that don’t have hard constraints, the programming step consists in defining a classical combinatorial problem which is cast into QUBO (boros2007local; QUBO) or Ising form, hence they both may leverage our formulation §3.2. (2) QAOA can be seen in some parameter range as a “digitized” version of QA, and it has been formally demonstrated that it can simulate the results and performance of QA and outperform it, in principle (yang2017optimizing). The first commonality is particularly important since it opens the door to application of our techniques on future hardware capable of running QAOA.Conventional ML Detectors.
Faster silicon based ML detector strategies typically approximate and parallelize the ML computation (Wenk10; cuitvt13). In these general directions, much progress has been made to the point that Sphere Decoders have been realized in ASIC hardware (ETH_HARD; winter12) but fall short for the reasons noted in Table 1 when the setting demands more antennas at the AP (serving more users), or when the modulation chosen increases (flexcorensdi17; BigStation).
7. Discussion
In this section, we discuss the current status of QA technology and practical considerations.
Computational Power Consumption.
The computation in the DWQ2 is performed at zero energy consumption, as dictated by reversible computing, although energy is dissipated in the initialization and readout. The DW2Q draws 16 kW of power, primarily used by the cryogenic refrigeration unit (DWaveWP). The computational power (per watt) for QPUs is expected to increase much more rapidly than for conventional computing platforms since the DW2Q power draw is not expected to change much as qubit and coupler counts grow in future generation systems while the computational power substantially increases.
Operating Expenses.
Operating the DW2Q results in significant electricity cost, and the dilution refrigerator requires liquid nitrogen 12 times a month, for a total yearly cost of about USD $17,000.
Processing Times.
The scenario envisioned by QuAMax assumes a centralized RAN architecture where a QPU, colocated with centralized RAN computational resources in a data center, is connected to the APs via highspeed fiber or millimeterwave links. In this setting, a latency between the APs and data center will not be significant. Nonetheless, QuAMax cannot be deployed today, since additional processing times in the current QPU include approximately 3050 ms preprocessing time, 68 ms programming time, and 0.125 ms readout time per anneal. These overheads are well beyond the processing time available for wireless technologies (at most 3–10 ms). However, these overhead times are not of a fundamental nature and can be reduced by several orders of magnitude by efforts in system integration. By means of extrapolation of improvement trends it is expected that quantum engineering advances in superconducting qubit technology will enable QuAMax to be viable within a decade. Moreover, QuAMax’s Ising form (in Section 3.2.2) can be adapted to be run in other emerging physicsbased optimization devices based on photonic technologies (hamerly2019) whose processing times overhead are in principle much faster. Hence, we leave an endtoend evaluation in a fully centralized RAN architecture, with more advanced hardware, as future work.
8. Conclusion
QuAMax is the first design, implementation and experimental evaluation of a quantumcomputing solver for the computationally challenging ML MIMO decoding problem. Our performance results establish a baseline for a future fullyintegrated systems in the context of the centralized RAN architecture. We show that once engineering efforts optimize the integration between quantum and conventional computation, quantum computation should be considered a competitive technology for the future design of highcapacity wireless networks.
Future Work.
There are several improvements over the design we have evaluated here. First, we anticipate that further optimization of , , and as well as new QA techniques such as reverse annealing (ReverseVenturelli) may close the gap to Opt performance. Second, there are changes in QA architecture expected in annealers due this year (2019arXiv190107636D) featuring qubits with the degree of Chimera, the number of qubits and with longer range couplings. Based on similar gains in recent results on different problem domains (hamerly2019), we anticipate this will permit ML problems of size, e.g. for QPSK and dramatically increase the parallelization opportunity of the chip due to the reduced embedding overhead where each chain now only requires qubits.
Going forward, we will benefit from QA technology improvements from the international community manufacturing quantum annealers with advanced capabilities. According to the development roadmap for these nextgeneration quantum optimizers, it is expected that in ca. a decade a system such as QuAMax could be based on chips with tens of thousands of highlyconnected qubits, with annealing schedules capable of more advanced quantum effects (e.g. nonstoquasticity (novikov2018exploring)) and engineering advances will have orderofmagnitude improvements on the aforementioned overhead operation times. While quantum annealers are ahead in terms of number of qubits, gatemodel systems offer additional controls that may conceivably increase performance in the future. We will investigate MIMO ML decoding on gatemodel QPUs in future work, which currently cannot support algorithms that decode more than 44 BPSK.
Acknowledgements
We thank our shepherd John Heidemann and the anonymous reviewers for their insightful feedback. We thank the NASA Quantum AI Laboratory (QuAIL) and DWave Systems for useful discussions. The research is supported by National Science Foundation (NSF) Awards #1824357 and #1824470 and by the USRA Cycle 3 Research Opportunity Program that allowed machine time on the DW2Q hosted at NASA Ames Research Center. Kyle Jamieson and Minsung Kim are partially supported by the Princeton University School of Engineering and Applied Science Innovation Fund. This work does not raise any ethical issues.
References
Appendix A QUBO Forms
We demonstrate how to transform BPSK MIMO Maximum Likelihood (ML) detection into the QUBO form. ML detection solves Eq. 1, where
The norm expansion in Eq. 1 can be expressed as
In the case of BPSK, symbol is represented by a QUBO variable . One possible transform is where corresponds to and to . This leads to , where and . Using these relationships, we can express the above norm as
Then we obtain the objective function of ML problem with QUBO variables. Using , minimization of this objective function becomes the QUBO form (Eq. 3):
Appendix B Embedded Ising
Embedding maps the Ising problem to an equivalent one that has the same ground state, but also satisfies Chimera graph constraints. The QuAMax compiled objective function is:
(10)  
(11)  
(12) 
where the original logical variables are now associated to a chain of qubits, indexed with new spins . penalizes the condition that , i.e., enforces that all qubits in the chain assume the same value (). This enforcement is more likely to happen for large values of , however the maximum negative energy value is set to by hardware design. In (11) and (12), effectively renormalizes all terms in the objective function by the factor . The linear term value is additionally divided by the number of qubits in a chain (). The term in (12) shows that the duplication of variables ensures the existence of a pair of qubits in the chains such that a physical coupler in the Chimera graph exists ( is the set of pairs of qubits that are connected by a physical bond once the chains and are specified).
Appendix C 16QAM Ising Model Parameters
Following are the Ising parameters for 16QAM:
(13) 
Since real and imaginary terms of each symbol are independent, the coupler strength between , and , is 0. For other and , the Ising coupler strength for 16QAM is:
(14) 
Comments
There are no comments yet.