## 1. Introduction

Multi-User Multiple-Input Multiple-Output (MU-MIMO) has proven an
essential technique to maximize capacity in many different kinds of wireless systems
such as 802.11 wireless LAN and 5G New Radio cellular networks. In MU-MIMO, the
uplink receiver (*i.e.*, an *access point*—AP—in a wireless
LAN, or a *base station*—BS—in a cellular network) with
multiple antennas supports many users simultaneously by striping data over parallel streams
(a technique known as *spatial multiplexing*), and thus enables significantly
higher data capacities. In an ideal world, the number of parallel streams that MU-MIMO
can support would be the lesser of the number of mobile users and the number of
radios at the base station, and overall system capacity
would increase proportionally to the number of spatial streams.

In practice, however, the *channel hardening* phenomenon
complicates this situation, in the following way.
MU-MIMO requires signal processing to disentangle the spatial
streams from each other, a technique called *MIMO detection*.
For a base station
with as many antennas as radio front ends, when the number
of users approaches the number of base station antennas,
MIMO detection becomes extremely difficult
resulting in poor performance for conventional linear detection algorithms (BigStation): this is the
*Large MIMO* regime that lies along
the points where the number of users equals
the number of base station antennas , as depicted
in Figure 1.^{1}^{1}1For simplicity, we call MIMO regimes, “Large MIMO” when , while “Massive MIMO” when , regardless of size.
For Large MIMO, there exist *maximum-likelihood* (ML) *exact*
solvers, that can achieve the lowest possible bit error rate and,
therefore, restore a high throughput. Unfortunately, these detection algorithms come at the expense of
an exponential increase of the required computational resources
as MIMO size increases, eventually becoming infeasible for many users
because of the processing time limits in wireless systems.
For example, at most
three milliseconds of BS’s computation are available for both the 4G
LTE uplink and downlink (BigStation; dahlman20134g).

*Massive MIMO* systems such as LuMaMi and Lund (6-12 users, 100-128 BS antennas)
(rusek2012scaling; vieira2014flexible; malkowsky2017world), Argos
(eight users, 96 BS antennas) (argos-mobicom12; argos-asilomar17), BigStation (BigStation), Agora (ding2020agora), and Samsung’s
5G base stations (16 users, 64 BS antennas) (samsung)
mitigate channel hardening
in the following way. Since linear detectors such as *ZeroForcing*
(ZF) and *Minimum MeanSquared
Error* (MMSE) can achieve nearML
performance when the wireless channel is wellconditioned,
systems that use many more base station antennas than usersspatial streams
(*i.e.,* for MIMO) may offer each base station radio a choice
of one out of a number of antennas to use. This largely negates the effect of
channel hardening, but requires base station antennas numbering
a sufficient factor greater than users (*e.g.* (bjornson2015massive) or for 16 users or below (argos-asilomar17), while there is no proven rule-of-thumb of that maximizes the spectral efficiency (bjornson2015massive)),
as shown in Figure 1, to achieve the full
throughput of spatial streams.
In addition, the deployment of larger numbers of antennas
eventually becomes challenging from a practical standpoint, most acutely
in wireless local area networks, but also in small, denselydeployed
5G base stations where form factors preclude excessive numbers of antennas,
and eventually in normal base stations where tower size faces practically
limited.

In this paper, we take a complementary approach to Massive MIMO: we begin
with a particular Massive MIMO configuration in which the number of
base station antennas is practically at its maximum, and then ask the
question *how can performance be further improved via
additional spatial streams?* The answer lies in a fusion of two
preceding ideas: add radio chains at the Massive MIMO
base station to equal the number of antennas, and at the same
time, utilize nearML detection algorithms. This pushes us
out towards the upperright corner of the space in Figure 1
and maximizes computational complexity, yet offers the promise of the greatest
spatial multiplexing gains, given our practical constraint on
base station antenna count.

A shift to Physics-inspired approaches.
Over the last few years,
there has been a surge of interest in alternative computation approaches to reduce
the complexity of current detectors by leveraging algorithms that relate
optimization convergence to Physics principles. This interest
is further accelerating in view of experimental initiatives featuring hardware-native
implementations of these approaches, using both quantum and classical physics-based computations (kim2019leveraging; hamerly2019; aramon2019physics; csaba2020coupled; Hadfield:2017:QAO:3149526.3149530; kasi2020towards; kim2020towards).
One common aspect of these algorithms is that they frame the
computational problem as an energy minimization problem of a magnetic spin system,
also known as the *Ising spin model* (ising1925beitrag).
Beside being an important model to understand the physics behind magnetic
systems, *any* NP computational problem can be expressed as the energy
minimization of an appropriate Ising spin model (lucas2014ising) (that is, the Ising spin model
is NP-Complete (welsh1990computational)).
In this regard, physicsinspired algorithms
can be seen as parametric “black boxes” that accept an Ising spin problem as input,
and output the configuration with lowest associated energy. What distinguishes one algorithm from another is the underlying mechanism used to find the global minimum, which corresponds to the ML optimal solution in MIMO detection.

This paper presents the design and implementation
of ParaMax, a soft MU-MIMO detector system for Large and Massive MIMO
networks that uses *parallel tempering*, a
physics-inspired heuristic algorithm, on classical platforms.
ParaMax operates flexibly in parallel for any number of available processors,
supporting fixed latency and highlyscalable parallelism.
We design the *ParaMax Ising Solver* (§4.1),
a parallel temperingbased
solver that is tailored for MIMO detection, implemented as
a fully classical algorithm that does not require any specific
hardware, and integrate it into the overall design of
our system (§4.2).
We also introduce a new algorithm (§4.2.2)
to generate soft information for
heuristic detectors that enables a more reliable detection and
decoding. The proposed algorithm utilizes heuristic detection outputs and generates soft information,
defined as the
bitwise detection confidences that implicitly take channel conditions
and noise into consideration.
To our best knowledge, this is the first application of parallel tempering
to wireless networks, and ParaMax is the first heuristic-based
MIMO detector that demonstrates nearML performance for both very Large
and Massive MIMO successfully.

Our experiments show that ParaMax achieves a constantlyincreasing performance as the number of processing elements increases. In the case of lowerorder BPSK and QPSK modulations, very large MIMO of and respectively, can achieve near-ML performance for less than tens of processing elements, as depicted in Figure 2. With respect to Massive MIMO systems, in MIMO with 16-QAM modulation at SNR 16 dB, ParaMax achieves a 330 Mbits/s near-optimal system throughput with 48 processing elements (PEs) per subcarrier, approximately better throughput than linear detectorbased Massive MIMO systems.

## 2. Background

This section introduces background knowledge, indicating relevant literature. Sections 2.1 and 2.2 respectively explain ParaMax’s algorithmic foundations, simulated annealing and parallel tempering. Section 2.3 describes the MU-MIMO model and detection problem.

### 2.1. Simulated Annealing

*Simulated annealing* (SA) is a classical heuristic optimization technique typically used to find the state or *configuration* with the lowest energy of *Ising spin* problems, where

is a vector consisting of

spins, with each spins assuming the values {-1, +1}. In general, the energy objective function of Ising spin problems (also called*Hamiltonian*) is represented as a quadratic cost function of the following form:

(1) |

with being *(anti-)ferromagnetic couplings* that indicate a preference of correlation ( or ) between two spins, and *local magnetic fields* that individually
act on . Any optimization problem, including
MIMO detection, can in theory be translated to an
Ising spin model by properly choosing the and
(welsh1990computational; lucas2014ising).

Simulated Annealing (SA) Heuristic.
SA is inspired by the physical process of *annealing*, where a
metallic material is slowly cooled from high temperature to
eventually reach a molecular state or atomic configuration where
the potential energy of the material is minimized. SA
numerically emulates this process in order to
find the global optimum (or *ground state*) of
Eq. 1. To enable SA, it is necessary
to simulate a *thermal bath*

which imitates the cooling or annealing process interacting with the Ising spin model. More precisely, the probability that a given spinconfiguration

is explored by the Ising spin system at a given*inverse temperature*follows the

*Gibbs distribution*, with usually called

*partition*function (georgii2011gibbs; crooks2007measuring). As the temperature is lowered, the probability of finding a state with an energy larger than the minimum energy becomes exponentially lower. Therefore, sampling from the low-temperature Gibbs distribution allows rapid detection of the spin configuration with the lowest energy with high probability.

However, the calculation of the partition function , and thus , is computationally challenging, particularly for low temperatures.
To avoid the direct calculation of the Gibbs distribution , Metropolis *et al*. (metropolis1949monte)

proposed the use of Markov chain processes to help the system emulate the annealing and heuristic exploration of configurations at a given temperature. Specifically, they proposed a random process to “flip” a spin, with probability depending only on temperature and the Hamiltonian (but not on

),*i.e.*:

(2) |

with the variation of energy once
the spin () is flipped for a given initial configuration.
Hence,
moves that would eventually reduce the overall energy of the spin system are always accepted. Otherwise,
there is a chance that such spin flip is either accepted or rejected.
Metropolis *et al*. showed that the spin system will eventually thermalize
to the corresponding temperature if the rejection rule in Eq. 2
(also called *Metropolis updates* or *sweeps*)
is iteratively applied. Therefore, it is in principle possible to find the lowest energy spin configuration and, consequently,
the solution to the original problem, by starting from a very large temperature and slowly decreasing
it by iteratively applying the rule in Eq. 2.

### 2.2. Parallel Tempering

SA guarantees that the spin system
will eventually find the lowest energy spin configurations if the temperature is lowered
slowly enough. However, for hard optimization problems, it may require an exponentially long time. Indeed,
a rugged energy landscape “traps” the spin system in local minima which are hard to escape: *parallel tempering* (swendsen1986replica) helps the spin system escaping local minima and, therefore, thermalize
faster at a low temperature.
The basic principle of parallel tempering is simple: instead of a a single spin system, different *replicas* are simulated in parallel, each with
a different temperature. After a certain amount of Metropolis updated, the temperatures of the
two replicas and are exchanged following the updating rule:

(3) |

with and being the difference in the inverse of temperature and the difference in energies of the two replicas respectively. As one can see, two temperatures are always exchanged if a replica at higher temperature has a lower energy than a replica with a lower temperature. Otherwise, the exchange of the two temperatures is either accepted or rejected accordingly to Eq. 3. In a variety of hard optimization problems, parallel tempering drastically speeds up the thermalization of the spin system (katzgraber2006feedback; young2004absence; trebst2006optimized), including benchmark against quantum annealers (mandra2016strengths; mandra2017exponentially; mandra2018deceptive).

### 2.3. MIMO Detection

The input and output relationship of a spatial multiplexing MIMO system (per subcarrier in OFDM systems (nee2000ofdm)) with input antennas at user side (or single-antenna users for simplicity) and output antennas () with radios () at the receiver side is described as . With (*i.e.,* MIMO with radio streams), here is the received vector perturbed by *additive white Gaussian noise* (AWGN) , is the wireless channel, and is the transmitted set of symbols with *constellation* (*e.g.*, 4-, 16-, 64-QAM), representing bits per channel use, with the size of the modulation.
MIMO detection at the receiver side (AP or BS) is a technique to find a candidate solution with an objective of detecting the transmitted symbol vector (*i.e.*, the objective is ) based on the received signal

and estimated

. Pilot symbols enable the estimation of .Maximum Likelihood Detection.
*Maximum likelihood detection* (ML detection) is optimal in the
sense that it minimizes the error probability of the detection.
It is defined as

(4) |

The search set of (*i.e.*, the *search space* ) is the set of possible solutions that the
optimizer can take into account.
Each element in the search space is a candidate solution
with which the values of the *ML objective function*
in Eq. 4 (*i.e.*, Euclidean distances) are measured and
compared with each other. The best candidate, with minimum value,
becomes the ML solution . Further, is
an indicator of complexity.
In principle, the ML search involves all possible candidates (*i.e.*, )
which makes the brute-force approach intractable for large MIMO sizes with high-order modulations.

Sphere Decoder.
The *Sphere Decoder* (SD) achieves optimal performance even with by applying an adaptable search constraint in a sequential manner (fincke1985improved; Agrell02; Damen03; SD). The SD transforms Eq. 4 into an equivalent tree search
and applies tree pruning,
visiting fewer nodes and leaves without loss of optimality.
However, since it is an exact algorithm (*i.e.*, achieves ML performance), the search space for SD is still exponentially large in the worst case (SDComp1). Further, because of its sequential nature, its processes cannot be fully parallelized and the complexity (latency) varies per detection, which is not desirable for hardware implementation.

## 3. Related Work

In this section we introduce related work on MIMO detection.

Parallel Sub-Optimal Architectures. These approaches divide the optimal SD tree search into parallel tasks in order to make use of hardware containing many processing elements (PEs) such as a GPU
or FPGA (flexcore-nsdi17; nikitopoulos2018massively), while search algorithms become approximate.
For these methods such as the Fixed-Complexity Sphere Decoder (FCSD) (FCSD; barbero2008extending; jalden2009error; barbero2008low) and K-best SD (Guo06; mondal2009design; li2008reduced; zheng2017k), is a subset of ,
so how to select for comparing is a key factor.
For instance,
the FCSD splits the SD tree of levels into two separate search areas, one for *full search* (FS) and the other for *greedy search* (GS). During the FS, the FCSD visits all nodes at the first levels and then switches to GS, where only one child node with minimum partial Euclidean distance is explored for the remaining levels (). This exploration process can run in parallel.^{2}^{2}2For the maximum effect of the algorithm, a channel ordering scheme is used to ensure users with poor channel are detected in the FS phase of the FCSD.
The FCSD results in consisting of candidate solutions.
Here, is a controllable positive integer parameter that trades off the FCSD’s detection performance with its computational complexity. Note that
the complexity of the FCSD (even with small ) is still larger than linear methods and the FCSD enables only parallel processes such as 16, 256, 4096 for 16-QAM with , respectively (*i.e.,* bounded complexity but not flexible).
ParaMax features flexible and scalable parallelism.

Heuristics for MIMO Detection.

Heuristic approaches inspired by Biology or Combinatorial Optimization methods such as genetic algorithms, reactive tabu search, and particle swarm optimization for MIMO detection exist

(abrao2010s; vsvavc2013soft; hedstrom2017achieving), but significant performance gains are not observed. Some studies have used analog quantum hardware platforms (kim2019leveraging; kim2020towards), but these are specialized platforms that not yet generally available. Other studies on Physics-inspired SA, Gibbs distribution, and quantum search algorithm show feasibility to some extent (guangda2009multi; hansen2009near; farhang2006markov; abrao2010s; botsinis2013quantum), but lack comprehensive evaluations for Large and Massive MIMO systems and comparisons against other state of the art detectors.## 4. Design

In this section, we describe the design of ParaMax:
Section 4.1 introduces the key building block of ParaMax’s design,
a SAparallel tempering solver.
Section 4.2 describes the complete
design of ParaMax. Section 4.3 then
introduces a refinement of ParaMax, *2R-ParaMax*, which uses soft
information to enhance performance at the cost of some computational
complexity. We evaluate both designs in Section 6.

### 4.1. ParaMax Ising Solver (PMIS)

The *ParaMax Ising Solver* (PMIS) is the main solver
module in ParaMax. It is based on SA, featuring a parallel tempering
algorithm highlytailored to optimize the Ising model of
MIMO detection.
PMIS is a completely classical algorithm that does not
require any specialized hardware for its implementation.
It performs a local search by updating the spin values of a given
random initial configuration according to Eq. 2. Each
replica is associated with a different temperature, and temperatures
may be exchanged according to the update rule in Eq. 3. Since
the calculation of the energy associated with each replica can be trivially
reduced to matrixvector and vectorvector multiplications,
couplings and local fields of the Ising cost
function are stored as a matrix and as
a vector respectively.
Therefore, the calculation of , critical for the update
rules in Eq. 2 and Eq. 3, is reduced to:

(5) |

with the vector representing the spin configuration, where the factor takes into account the symmetry of the matrix . During our implementation, PMIS is optimized to maximize the performance for operations involving spin variables which cover up to 512, 256, and 128 singleantenna users with BPSK, QPSK, and 16QAM modulations, respectively. We provide further details on our PMIS implementation in Section 5.

Computational complexity. In MIMO detection, compute time complexity is a fundamental metric, along with BER and network throughput. From Eq. 5, it is clear that complexity is proportional to the square of the number of spin variables (). Therefore, recalling that every replica is independently updated, overall PMIS complexity scales as , with and the number of replicas and Metropolis sweeps, respectively.

Replicas and Metropolis Sweeps. To reduce the computational
cost to the bare minimum, we have opted for a “bang-bang” parallel
tempering approach. That is, only two replicas are used ():
one at very low temperature and one at higher temperature: the replica
at lower temperature acts as a *greedy* searcher while the
replica at a higher temperature acts as an *observer*. When
the greedy searcher is stuck in a local minimum, the two replicas
can exchange roles (*i.e.*, temperature) to resolve the bottleneck.
While this has been successfully used in the context of
quantum annealing (yang2017optimizing), ours is one of the first
reports of a bangbang parallel tempering schedule on a classical platform.
To choose the number of Metropolis sweeps (§2.1),
we empirically examine different numbers of sweeps from one to 80 with
16-QAM in Figure 3 out of 20 instances and 10,000 PMIS
runs per instance.
Not surprisingly, we observe a tradeoff between latency (*upper*) and
sampling quality (*lower*). We choose as an appropriate
point that satisfies the current LTE standard’s latency requirements.

Choice of temperature range.
Unlike and , the temperature range does
not influence ParaMax’s complexity, so we choose a PMIS temperature
range where it achieves the highest probability of finding the ML
solution (*i.e.,* the ground state). For benchmarks we
select the values and , which
perform well, as shown in Figure 4.

The foregoing description has described a single PMIS run. In ParaMax, multiple PMIS runs on multiple PEs in parallel, one PMIS run per PE. Each PMIS run is independent from the others, accepting the Ising model of the MIMO detection as input, and outputting a candidate solution.

### 4.2. ParaMax Design

In this section we describe the complete ParaMax design (Figure 5). We describe the function of each block required for MIMO detection in §4.2.1 and the soft output generator module in §4.2.2.

#### 4.2.1. ParaMax Detection Algorithm

We assume the base station receives a signal perturbed by AWGN and estimates the wireless channel as stated in Section 2.3.

1. ML-to-Ising Reduction.
The procedure and the generalized formula of reducing the MIMO
detection
(
from Eq. 4) to the Ising form
using the *spin-to-symbol mapping* were first introduced
in (kim2019leveraging); our system assumes the same
mapping. In the mapping, spins represent all
possible symbol combinations (*i.e.*,
spins for a possible symbol per user),
so its ground state always corresponds to the ML solution.

2. PMIS Parallel Processing.
Each PMIS run samples an independent solution candidate
(*i.e.,* an Ising configuration ). Since detection
is based on a heuristic,
a single PMIS run’s solution may not be optimal, and thus multiple
runs are required to form a set of candidate solutions, gradually
increasing the probability of collecting the optimal ML solution.
Since each PMIS optimization run on a single PE completely independent of the
others, ParaMax flexibly operates on any number of independent
processing elements (), with highly scalable parallelism.
The number of available processing elements
is equal to the number of PMIS outputs that the ParaMax
system can generate with full parallelism.

3. Ising Solution Filter and Demapping.
After all PMIS parallel runs, the corresponding outputs
are collected in a table of Ising spin configurations.
Before further processing, the list is sorted in order of solution quality,
based on the Ising energy of each output.
The *Ising solution filter* returns only the configuration
with the best (minimum) , which
is equivalent to the wireless symbol , *i.e.,* with the minimum (among candidate solutions), after proper
demapping (spins symbols). Finally, is converted into MIMO detected bits.

#### 4.2.2. Spinwise Soft Information Output

For most heuristicsbased solvers, only the lowestenergy Ising configuration
is returned (regardless of how many times it occurs among PMIS outputs)
and any outputs other than it are discarded. In ParaMax, however,
we utilize all PMIS outputs to generate soft information
(*i.e.,* detection confidences, for each spin in a given configuration).
In general, softoutput MIMO detectors’ soft values are
utilized for iterative MIMO detection or channel coding (roger2012efficient; larsson2002maximum; larsson2008fixed; barbero2008low).
In this work, we design the former (2R-ParaMax) in Section 4.3.

ParaMax collects candidate solutions from independent
PMIS runs. Among these, multiple occurrences of a certain spin
configuration (with agreeing spin variables) are very likely to be observed,
which could be used to identify spins easy (or hard) to detect (*i.e.*
variables that are very likely to be assigned a certain value in
the unknown optimal ML solution). Figure 6 shows
an illustrative example of detecting a received QPSK signal in
two equivalent representations, one in the I-Q plane with
Euclidean distance *(left)*, and the other with
Ising energies *(right)*.
In this example, the first spin variable (corresponding
to the symbol’s real part) is likely to be detected as for
most PMIS runs, since the difference in Ising energy from all
configurations that have (resulting in or
in Figure 6, *right*)
and (resulting in or in
Figure 6, *right*) is significant.
The spin is easy to detect compared to spin (corresponding
to the symbol’s imaginary part). Multiple occurrences of PMIS runs
agreeing on indicate this, while PMIS runs will
disagree on the value of , because the two
most frequent spin configurations’ energies ( and )
themselves disagree on the value of .

This phenomenon becomes even clearer for highorder modulations, since in 16-QAM or higher modulations, the value of the spin coefficients in the Ising spintosymbol mapping varies across spins. For example, Figure 7 shows the spintosymbol mapping of 16-QAM for the user, where the user’s possible symbol maps onetoone with spins , that is . Here, spins and , the oddnumbered spins, determine the symbol’s quadrant, while spins and , the evennumbered spins, determine the symbol in the given quadrant. Here, the oddnumbered spins for 16-QAM are in general easier to detect than the evennumbered spins because of higher robustness to AWGN (detection reliability). Table 1 presents empirical spinwise error rates of ParaMax for 16-QAM detection. These differences in robustness indicate that using ParaMax’s soft information would be particularly helpful for further processing, similarly for unequal error protection (UEP) (masnick1967linear; wei1993coded; horn1999robust; boyarinov1981linear; 237878)).

Mean Spinwise Error Rate | |||
---|---|---|---|

Ovennumbered spins: | Evennumbered spins: | ||

0.167 | 0.152 | 0.329 | 0.352 |

Either/both: 0.32 | Either/both: 0.68 |

Soft information computation. Based on the equivalence of
of the I-Q and spin configuration representations, the occurrence count
of a given spin value for a certain spin across all
PMIS outputs samples the distance of the symbol corresponding to
that spin’s value, and hence estimates that spin’s likelihood
of correctness.
More specifically, after collecting PMIS outputs sorted
by Ising energies (), the
system has () unique outputs with
corresponding *occurrence counts* ().
The detection confidence of spin ()
is defined as:

(6) |

where is a count of
occurrences of the
ranked configuration (defined in §2.1 on p. 2.1),
only when the
configuration’s spin is equal to the firstranked
configuration’s spin (*i.e.*,
is either or zero).
The spinwise detection confidences () are the
soft values ParaMax outputs in this step.
Note that the reliability of each increases as the best observed
Ising energy among the collected outputs ()
becomes closer to the unknown ground state (of energy ),
which implies as increases the quality of soft values
improves.^{3}^{3}3While ParaMax is an inherent softoutput MIMO detector,
requiring simple computations, conventional soft-output MIMO detectors
require additional computations of exponential complexity, of loglikelihood
ratio (LLR) for all coded bits to generate soft values at channel decoder (roger2012efficient; studer2006soft; wang2004approaching). Similar algorithm is introduced using quantum
annealing (karimi2017boosting), where only partial outputs are used.

### 4.3. 2R-ParaMax: Iterative Soft Detection

We now introduce a method of using the soft information described in the prior section to enhance the operation of ParaMax. We call this protocol 2R-ParaMax. The main idea is to iterate the PMIS block twice, once for generating soft confidence information, and again to obtain a final detection result based on the confidences from the first iteration. Intermediate processing between the first and second iterations functions pre-decision of spins with high detection confidence. An error correction post-processing is applied at the end of the second round, both of which have linear complexity. The end result is a more accurate MIMO detection result, at the expense of a modestly increased latency, and so this might be employed for challenging wireless channels andor large numbers of users. With reference to Figure 5, the structure of 2R-ParaMax PMIS block is shown in Figure 8. This block is replacing the third block marked in red of Figure 5. The other blocks are exactly the same as described in ParaMax. We also note that the soft information generated by the second round can also be used for the channel decoding or further iterations of the algorithm.

Intermediate Predecision. The intermediate predecision
module identifies those spins with a high detection confidence
(over a threshold ) from the first round
of PMIS outputs in order to reduce the number of spin variables involved
in secondround of PMIS runs, simplifying secondround detection.
That is, if the spin’s detection confidence
(), then we predecide
the spin variable to be the value of the corresponding
spin of the best solution in the first round (*i.e.*, ).

After thus obtaining a predecided set of spin indices, the next step is to update the Ising form accordingly. For each spin index in and for each Ising problem index , we set , and then remove , and . The result is a reduced Ising problem that contains spins only. Table 2 summarizes the average ratio of decided spins () and the success ratio () of the pre-decision process. Here, denotes an index group of spins where spins decided by the pre-decision process are exactly the same as the corresponding spins in the ML solution. In 2R-ParaMax, we apply , which ensures (for lower , higher applied). With the updated Ising form (with and for spins), we execute a second round of PMIS to generate outputs and then filter the best output consisting of spins with minimum Ising energy in terms of . When the filtered configuration is combined with the pre-decided spins appropriately, the full configuration consisting of spins can be restored and demapped into symbols. This full configuration is further compared against the best PMIS outputs of the first round based on the original Ising form . The final best configuration is returned as 2R-ParaMax’s detection solution, which guarantees that 2R-ParaMax’s bare minimum performance is ParaMax’s performance.

0.91 | 0.93 | 0.95 | 0.97 | 0.99 | |
---|---|---|---|---|---|

0.35 | 0.32 | 0.28 | 0.21 | 0.01 | |

0.97 | 0.99 | 1.0 | 1.0 | 1.0 |

## 5. Implementation

We now describe our ParaMax implementation.

Computing Environments. CPU-based experiments are executed on an Intel i9-9820X at 3.30GHz with 20 cores, 2,189 threads, and 126 GB RAM. GPU-based experiments are tested based on the CUDA (Compute Unified Device Architecture (sanders2010cuda)) 10.2 with GeForce RTX 2080 Ti of 4,352 CUDA cores and 68 streaming multiprocessors.

Wireless MIMO Channels. Both simulationbased and
tracedriven real world wireless channels are used for
our experiments. In the case of the simulationbased channel,
independent and identically distributed (i.i.d) Gaussian channels with
AWGN are synthesized for various SNR settings. For tracedriven channels,
we use nonline of sight wideband MIMO channel traces at 2.4 GHz,
between 96 base station antennas () and eight static users (),
the largest MU-MIMO dataset provided in Argos (Argos).
Among , we single out 8 to 32 (in steps of four) antennas
to test the most challenging MIMO regimes (*e.g.* ).
Since trace based channels include measured noise and limited user numbers,
we use synthesized channels, unless otherwise stated, in order
to precisely control SNRs and evaluate various MIMO regimes
such as . Based on both channel settings, we generate
large-scale Ising models of MIMO detection
(100,0001,000,000 random instances per scenario) in
order to measure detection BER up to ,
approximately .

PMIS CPU-Implementation.
While the front-end of our PMIS implementation is in Python, the core is completely written in standard (josuttis2012c++). We assign only a single core and a single thread (a single PE) to the calculation of each PMIS run by manually modifying the OpenMP (chandra2001parallel; dagum1998openmp) and C++ parallelization settings.
Furthemore, to maximize the performance of PMIS (to satisfy limited processing time in wireless standards), the following *innovations* have been implemented: (1) Use of static memory: Static allocation, unlike dynamic allocation, happens at global scope and it is pre-populated when the library is loaded. Moreover, since the size of arrays is known in advance, compilers can further optimize math operations on static arrays. (2) Parameter pack expansion: Loops in the matrix-matrix and vector-matrix multiplications are the most expensive part in PMIS implementation. To further reduce the computational cost, most of the critical loops are statically unrolled using features like the parameter pack expansion, introduced in the standard. (3) Intel SIMD instructions: Most of the modern CPU architecture have intrinsic operations to allow multiple operations on contiguous arrays of floats. In PMIS, we have used Intel SIMD instructions to vectorialize operations like matrix-vector and vector-vector multiplications (plank2013screaming).

PMIS GPU-Implementation. The core design of CPU-ParaMax is based on a highly optimized C++ implementation of SA. Therefore, the natural extension of ParaMax to GPU consists into the implementation of the core SA engine to GPU. However, unlike CPUs, GPUs achieve the best performance for large arrays where multiple synchronous operations are applied at the same time. Indeed, while GPUs have more cores than CPUs, each single GPU core is typically much slower. Therefore, to maximize the performance of SA implemented on GPUs, we have designed a GPU kernel based on the JAX/XLA language that updates multiple PMIS runs at the same time. More precisely, the spin configuration for a single PMIS run (in Eq. 5) is now extended to a matrix , with corresponding to the PMIS index. Since PMIS runs are completely independent from each other, () can be updated independently and synchronously.

## 6. Evaluation

In this section, we evaluate ParaMax in various aspects. Section 6.1 evaluates ParaMax’s detection latency against other CPU- and GPU-based detectors. Section 6.2 illustrates sampling performance of ParaMax comparing against simulated annealing and required the number of processing elements for ParaMax to achieve near-ML performance in both Large and Massive MIMO. Section 6.3 and 6.4 show the ParaMax’s bit error rate and system throughput performance respectively, compared against other state-of-the-art detectors in both Large and Massive MIMO.

### 6.1. Detection Latency

Fully-Parallel Full-Blown ParaMax.
Figure 8(a) shows the detection time of ParaMax as a function of ( per channel use, where the background color coding indicates approximate feasibility for the wireless standards (WLAN and LTE). As increases (*i.e.*, and/or modulation increases), computing time tends to scale as (cf. does not affect and thus compute time). The available largest MIMO sizes that reach the borderline-limit of acceptable detecting time
in the LTE standards are , , for BPSK, QPSK, 16-QAM modulation, respectively.
While slight variations of runtime are observed that can cause overall latency increase and hardware synchronization issues, this can be resolved by an integrated system hardware since
the origin of the variations is
caused
related to our system
by the kernel allocation of jobs and concurrent
(unrelated) system processes.
In general, 2R-ParaMax requires approximately ParaMax latency.

CPU- vs GPU-ParaMax Comparison.
Figure 8(b)
shows the comparison of compute time between the XLA kernel (compiled for both CPU and GPU) and the original C++ kernel. The runtime for the C++ kernel has been obtained by computing the average runtime for a single PMIS run and then projected for multiple PMIS runs. XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can be compiled and optimized separately for either CPU or GPU.
As one can see, for a sufficiently large number of parallel PMIS runs, while the kernel runtime compiled for CPUs have a consistent runtime with the full-blown C++ implementation of ParaMax, the kernel compiled for GPUs shows a speed-up, where total PMIS runs can be defined as multiplied by the number of subcarriers ().
While GPU-ParaMax can achieve the speed-up for over hundreds of PMIS runs, it cannot satisfy time requirements for standards. Recall that current GPUs are not designed to make full use of resources for small-size systems (*i.e.,* few PMIS runs). Thus we also extrapolate its performance to estimate what we could achieve in GPU without these limitations. Note that unlike on CPUs where a single core can be used to carry out any calculation, GPU cores are designed to work in concert to manipulate large block of data in parallel, and users cannot assign specific resources to a certain computation (flexcore-nsdi17). Therefore, we define a single PE for GPU-ParaMax as the extrapolation of a single PMIS run from large number of PMIS runs. In 5G New Radio, this extrapolation becomes more reasonable (while still being approximate), since 5G systems will support over three thousands of subcarriers, and slightly more time^{4}^{4}4Available compute time is for all BS-processing including channel decoding. (4 ms) than LTE (3 ms) will be allowed for enhanced mobile broadband (eMBB) (ding2020agora; BigStation; 3gpp-ts-36-211) (cf. 1 ms for 5G Ultra-Reliable Low-Latency Communication (URLLC)).

20-user MIMO (16-QAM) | ZF-SIC | ParaMax | FCSD | |||

=2 | =3 | =4 | ||||

Parallelism # | Flexible | |||||

Required | 1 | 256 | 4,096 | 65,536 | ||

Latency [us] |
CPU | 25 | 357 | 405 | 5,821 | 93,714 |

GPU | 83,861 | extr. 31 | 319 | 378 | 1,841 | |

CPU | CPU | GPU | GPU | GPU | ||

Min time |
25 | 357 | 319 | 378 | 1,841 |

*i.e.,*fully-parallel ParaMax) and GPU-ParaMax reports extrapolated compute time.

Comparison against Conventional Detectors. We compare ParaMax latency against various detectors implemented on the MIMOPACK library (ramiro2015mimopack)

, which is one of the fastest open-source MIMO detector implementations based on the (CUDA) C programming. The results for 20-user 16-QAM are summarized in Table

3. In the case of the zero-forcing successive interference cancellation (ZF-SIC or V-BLAST with ordering scheme) (wolniansky1998v), while its complexity is slightly higher than linear detectors such as ZF and MMSE, compute time is still few tens of microseconds. However, their computations (both ZF and ZF-SIC) are not appropriate for parallel processing, causing extra overheads such as job scheduling and data transition among computing resources. In the case of the FCSD, we consider three different that trade-offs the FCSD’s detection performance with its complexity. As long as the available number of PEs is large enough to allow full parallelism ( total PEs), the compute time remains in a few hundreds of microseconds, satisfying LTE requirements.### 6.2. Heuristic Detection Sampling

For ParaMax, we can report the expected number of sampling repetitions to reach ML-performance, which can be computed using the *probability of obtaining the ML-solution in one sample* of a given MIMO detection scenario (*i.e.,* MIMO size, modulation, and SNR), averaged across the problem distribution () (mandra2016strengths). Since cannot be determined *a priori* by theoretical means, we obtain it through empirical evaluation of statistically significant 1,000,000 PMIS runs across 100 detection instances per scenario. In order to compute this average probability, we use the ML-solutions found by expensive runs of the Sphere Decoder. Since each run is independent, the probability for ParaMax’s to find the optimal ML-solution:

(7) |

Inverting Eq. 7, we can obtain the required number of PMIS repetitions (samples) to achieve the ML-detection with a target probability as:

(8) |

In Figure 9(c),^{5}^{5}5Note that the formulas hold also for any non-parallel iterative method with independent sampling, where is simply the number of required repetitions. we plot and corresponding required for different .

Very Large MIMO with Low-Order Modulations.
Figure 9(a) plots as a function of Large MIMO detection with different heuristicbased detectors (SA, ParaMax, and 2RParaMax) for various and modulations. Surprisingly, for the BPSK and QPSK modulations, all tested heuristic detectors achieve , which implies nearly all PMIS runs successfully reach the ML-solution. For ParaMax and 2RParaMax, this tendency is observed up to while we plot here only up to to save space. Only a few processing elements are enough to perform ML-detection up to MIMO with BPSK and MIMO with QPSK. While
ParaMax becomes currently unpractical at around (Figure 9), this MIMO size and its requirement for optimal detection is promising for city-scale *Internet of Things* (IoT) applications envisioned in 5G networks or beyond. Those scenarios will handle hundreds or thousands of devices per BS with low-order modulations (miller2015internet; mazaheri2019millimeter; miorandi2012internet), and may accept longer processing time than ordinary data communications.

From Large to Massive MIMO with 16-QAM. In the case of 16-QAM in Figure 9(a), notably drops as increases for all heuristic-based detectors and we observe higher for 2RParaMax, ParaMax, and SA.
Given that Large MIMO detection with high-order modulations is a challenging problem in general, we add more receiver antennas () to see the impact of / ratio on . Figure 9(b) shows this relationship for various user numbers for different SNRs. As increases (*i.e,* from Large MIMO to Massive MIMO), rapidly increases and then is converged to 1.0. While for larger number of users () at lower SNRs tends to increase slower, can still be achieved around , where the required for ML-detection is around 1,000 (see Figure 9(c), where we summarize the applicability of ParaMax). We observed that the trace-driven channel with noise shows better performance (faster convergence than 20 dB SNR).

### 6.3. Bit Error Rate (BER) Performance

This section presents ParaMax’s detection BER. Recall that is the number of processing elements (PEs) assigned to ParaMax per subcarrier, where each PE performs a PMIS run. Since we assume fully-parallel ParaMax for minimum detection latency, is also equal to the number of PMIS runs. Note that regardless of computing platforms (CPU, GPU, or FPGA), the detection performance (BER and throughput) as a function of is the same, as long as they can satisfy limited time requirements supporting all subcarriers (unless there exists a serious precision issue), while the definition of a single PE and total available PEs per device can vary depending on platforms and/or implementation details. In the next subsection (Sec 6.4), we evaluate ParaMax on multi-subcarrier systems, considering its detection latency, available parallelism, available compute time in wireless standards, and impact of forward error control (FEC).

Overview: BER from Massive to Large MIMO. Figure 11 shows BER performance in various MIMO regimes with 16-QAM. We consider ParaMax and 2R-ParaMax with 16 and 256 PEs, comparing them against other detectors such as ZF (linear), SA (heuristic), FCSD (tree search-based), and optimal SD (ML), where SA and FCSD (with channel ordering (dai2010simplified)) are comparison schemes of parallel architecture-based detectors. As expected, linear-based ZF, which requires high ratio for proper detection, performs poorly as the regime goes from Massive MIMO to Large MIMO (*upper* to *lower* in Figure 10(a)) and with more users (*left* to *right* in Figure 10(a)), showing several orders of magnitude worse BER performance against the other detectors, particularly at high SNRs.
In the case of the parallel architecture-based detectors, it is observed that for the same PEs, ParaMax and 2R-ParaMax outperform SA and FCSD detectors in all MIMO regimes and SNRs tested except that
in MIMO at high SNRs, FCSD outperforms ParaMax and 2R-ParaMax when with 16 PEs. However, 2R-ParaMax reaches lower BER than FCSD when with 256 PEs. Figure 10(b) plots BER with 16-QAM as a function ratio with smaller such as 2, 8, and 16. For low ratios, parallel architecture-based detectors even with 2 PEs can obtain lower BER than ZF. As the ratio increases, all detectors achieve better BER for the same , but more PEs are required to beat ZF.

Detailed View: BER as a function of . We evaluate BER as a function of for 12-user Large and Massive MIMO in Figure 12 to show the detailed performance comparison. Note that ZF is not suitable for parallelization, so it achieves the same performance, regardless of the number of PEs.
Figure 11(a) presents BER for Large MIMO () with 16-QAM at various SNRs.
ParaMax can support any number of PEs and approach the optimal performance as increases (*i.e.,* fine parallelism granularity), while the FCSD requires at least 16 PEs to operate the fully-parallel algorithm for the minimum , and the FCSD with = performs equivalently until reaches (*i.e.,* no gain between 16 PEs and 256 PEs). Figure 11(b) focuses on Massive MIMO (), showing the impact of ratio on both BER and . Higher ratios (*upper* to *lower* in Figure 11(b)) lead to lower BER for the same PEs and smaller for the near-ML BER, especially compared against Large MIMO (Figure 11(a)). Precisely, to reach the near-ML BER at SNR 16 dB for 12 users, 12-BS antenna MIMO requires around 60 PEs, 18-BS antenna MIMO requires 18 PEs, and 48-BS antenna MIMO requires only 5 PEs. Compared to SA and FCSD, ParaMax’s BER drops more rapidly as increases for any ratios. For example, to reach at MIMO, where the FCSD and SA requires (over) 16 PEs, ParaMax requires 6 PEs.

We also test 128-user Very Large MIMO with the QPSK modulation in Figure 11(c). As analyzed in Figure 9(a), we observe that very small can result in BER convergence for the QPSK, which is very likely the optimal BER, although we cannot evaluate SD because of extremely high complexity. At SNR 16 dB, ParaMax achieves over five orders of magnitude better BER than SA at 2 PEs and over six orders of magnitude better than FCSD at 4 PEs.
Note that the FCSD even with thousands of PEs (*i.e.,* with high ) cannot reach the ParaMax’s performance with a single PE.

### 6.4. System Throughput Performance

This section evaluates throughput on multi-subcarrier systems. While detection BER is a fundamental metric for MIMO detection, the detector of the lowest BER does not necessarily imply the best throughput scheme, since real-world wireless systems include FEC techniques for error correction at the channel decoder under MIMO detector. Further, since the systems support many subcarriers with limited compute time, the total required computing resources to support them are another important metric for evaluation.

We first consider a WLAN wireless system with 64 OFDM subcarriers with 1/2 rate convolutional coding, where optimal achievable (ML-based) throughput on the system has been measured via over-the-air experiments in (flexcore-nsdi17). We translate the measured detection BER into the corresponding convolutional code-applied BER (*i.e.,* coded BER). Among the provided data we select achievable optimal throughput for MIMO and MIMO at SNR 21.6 dB as a baseline throughput, assuming the coded BER of optimal Sphere Decoder (SD) we test at the same scenario (*i.e.,* same MIMO size and SNR) is close to their optimal coded BER. We compute the achievable optimal throughput for various scenarios, considering SNR and optimal *Frame Error Rate* (FER) difference (ratio) against the baseline, for the frame size of 1500-byte and FER obtained from our coded BER. Then we compute throughput of various detectors considering FER difference between SD and corresponding detectors at each scenario. In WLAN scenarios, we maintain the minimum detection latency. For this, the system is expected to have total PEs of on a computing platform, where is the number of subcarriers. For example, for ParaMax to support 2 PMIS runs (*i.e.,* 2 PEs) per subcarrier, a single state-of-the-art CPU (with 128 cores) is enough, while multiple CPUs are required to support more PEs per subcarrier. Since MIMO detection is completely independent across subcarriers, ParaMax’s job scheduling and allocation for multiple devices are quite straightforward.

Figure 13 shows throughput comparison against ZF and FCSD for various scenarios in Massive MIMO. Figure 12(a) demonstrates the impact of SNRs for the given MIMO size. While for ParaMax and 2R-ParaMax, 8 PEs (per subcarrier) are enough to reach the optimal performance for all tested SNRs, ZF could not reach the optimal performance except over SNR 20 dB including trace-driven channel and noise. Figure 12(b) shows the impact of ratio varying receiver antenna numbers at fixed SNR 16 dB. As seen in the previous section, less PEs are required at high ratios to achieve the near-ML performance. We observe that for the same scenarios, even less PEs are required to achieve the near-optimal “throughput” performance (Figure 12(b)) than to achieve the near-optimal “BER” performance (Figure 11(b)) because of the impact of FEC. Figure 12(c) plots ParaMax’s throughput gains versus ZF for various ratios and SNRs. The high gains are achieved at low SNRs and/or low ratios. These throughput gains can be generalized to any (even at different standards with assumptions/modifications), as long as both schemes can support all subcarriers satisfying the corresponding limited time requirements.

In general, cellularnetworked systems, such as 4G or 5G systems, support many more subcarriers, allowing more compute time than WLAN. To project the throughput performance onto 5G scenarios, we plot the maximum that can be supported per millisecond with 5G target, as a function of total available PEs on a computing platform, considering detector’s latency and parallelism based on assigned PEs per subcarrier in Figure 13(a) (we report GPU-ParaMax based on extrapolated data, as discussed in section 6.1). The figure implies how many computing resources are required to support 5G systems. A ZF-based system can support 5G target even with a single state-of-the-art 128-core CPU due to its short latency and minimum PE usage. In the case of ParaMax, tens of CPUs are required to support 2 PEs per subcarrier and hundreds for 16 PEs.

Furthermore, we estimate the best throughput scheme (ZF vs. ParaMax) for various MIMO regimes and SNRs, assuming a computing platform with ten CPUs in Figure 13(b) (*left*) for 12-user 16-QAM and Figure 13(b) (*right*) for 128-user QPSK based on achievable ParaMax throughput gains (vs ZF), although 128-user QPSK MIMO is currently unpractical (Figure 9).
For gain 1.0, we report the ZF as the best scheme since it takes less compute time, while we report challenging regimes where ZF does not perform well and ParaMax requires at least several tens of PEs per subcarrier for the near-ML performance. We observe that ParaMax enables many challenging regimes of ZF (*i.e.*, low ratio and/or low SNRs) by assigning reasonably more PEs.^{6}^{6}6Advanced FEC schemes such as LDPC and Polar codes that are applied in 5G systems can enable the near-ML performance with ZF for more MIMO regimes and SNRs. However, even more users and lower SNRs will keep bringing out the same scenarios, where ParaMax outperforms ZF, due to the fundamental detection BER gap. In the case of the QPSK, at (relatively small ratio), ZF can outperform ParaMax at some SNRs, but for this, 256-BS antennas are required to support 128 users, which is the double size of the-state-of-the arts. Of course, ParaMax requires more computing resources (10100) than ZF, but the trend at emerging system-on-chip architectures with more and more PEs, as well as C-RAN architectures promisingly envisioned in 5G, support the direction of massively parallel architectures-based designs requiring low interaction among PEs.

## 7. Discussion

In this section, we investigate several challenges and opportunities of ParaMax that are likely to further advance the system.

Fully-optimized and adaptive ParaMax.
Considering that parallel tempering-related parameters are selected within a challenging scenario (16-QAM) in Section 4.1, ParaMax could be fullyoptimized for many different scenarios based on given user numbers, ratios, SNRs, modulation sizes, available total PEs, and/or wireless standards.
Moreover there is a well known trade-off between number of sweeps (latency) and required for near-ML performance (compute resources) that could be explored (*e.g.,* for 5G URLLC).

Higher-order modulations. As modulation size increases, ParaMax’s detection is degraded rapidly and becomes not operable for Large MIMO with high-order modulations such as 64-QAM or higher, requiring over PEs even for Large MIMO to achieve near ML-performance. For Massive MIMO, it is expected that even higher ratios than 16-QAM are required. Perhaps, more replicas or Metropolis sweeps ease the problem along with further optimization on ParaMax’s free-parameters related to parallel tempering such as temperature range for PMIS tuning. However, these gains will be obtained at the expense of longer latency. An implementation of ParaMax on dedicated hardware might improve the performance and reduce the computational cost order further.

Compatibility with specialized hardware. ParaMax does not require any specific hardware. However, another important aspect of ParaMax is that it is immediately compatible with future implementations that aim to deploy programmable specialized hardware (for Physics-based algorithms) designed to optimize problems in the Ising form including quantum devices such as quantum annealers (kim2019leveraging) and gate-model quantum computers running the QAOA algorithm (Hadfield:2017:QAO:3149526.3149530), as well as novel paradigm of classical calculation such as Optical Coherent Ising Machines (hamerly2019), CMOS-based annealers (aramon2019physics), and Oscillator-based platforms (csaba2020coupled).

## 8. Conclusion

In this work, we present ParaMax, a soft MU-MIMO detector system for Large and Massive MIMO networks that first makes use of parallel tempering for MIMO detection. Our performance evaluation shows that ParaMax enables currently-challenging MIMO regimes for commonly-used linear detectors, achieving the near-ML performance by assigning reasonably more compute resources. ParaMax also outperforms conventional parallel architecture-based detectors such as FCSD and SA-based detectors, requiring less processing elements to achieve the near-ML performance.

## Acknowledgements

We thank the anonymous shepherd and reviewers of this paper, the NASA Quantum AI Laboratory (QuAIL), and the Princeton Advanced Wireless Systems (PAWS) Group for their extensive technical feedback. This research is supported by National Science Foundation (NSF) Award CNS1824357 and CNS1824470, and an award from the Princeton University School of Engineering and Applied Science Innovation Fund. Salvatore Mandrá is supported by NASA Ames Research Center (ARC), particularly the autonomous ATM part of the Transformative Tools and Technologies Project under the NASA Transformative Aeronautic Concepts Program, and DARPA under IAA 8839, Annex 125. Minsung Kim was also supported by the USRA Feynman Quantum Academy funded by the NAMS R&D Student Program at NASA ARC (FA8750-19-3-6101).

Comments

There are no comments yet.