Physics-Inspired Heuristics for Soft MIMO Detection in 5G New Radio and Beyond

by   Minsung Kim, et al.

Overcoming the conventional trade-off between throughput and bit error rate (BER) performance, versus computational complexity is a long-term challenge for uplink Multiple-Input Multiple-Output (MIMO) detection in base station design for the cellular 5G New Radio roadmap, as well as in next generation wireless local area networks. In this work, we present ParaMax, a MIMO detector architecture that for the first time brings to bear physics-inspired parallel tempering algorithmic techniques [28, 50, 67] on this class of problems. ParaMax can achieve near optimal maximum-likelihood (ML) throughput performance in the Large MIMO regime, Massive MIMO systems where the base station has additional RF chains, to approach the number of base station antennas, in order to support even more parallel spatial streams. ParaMax is able to achieve a near ML-BER performance up to 160x160 and 80x80 Large MIMO for low-order modulations such as BPSK and QPSK, respectively, only requiring less than tens of processing elements. With respect to Massive MIMO systems, in 12x24 MIMO with 16-QAM at SNR 16 dB, ParaMax achieves 330 Mbits/s near-optimal system throughput with 4–8 processing elements per subcarrier, which is approximately 1.4x throughput than linear detector-based Massive MIMO systems.



There are no comments yet.


page 9


Efficient Soft-Output Gauss-Seidel Data Detector for Massive MIMO Systems

For massive multiple-input multiple-output (MIMO) systems, linear minimu...

A 354Mb/s 0.37mm^2 151mW 32-User 256-QAM Near-MAP Soft-Input Soft-Output Massive MU-MIMO Data Detector in 28nm CMOS

This paper presents a novel data detector ASIC for massive multiuser mul...

Ising Machines' Dynamics and Regularization for Near-Optimal Large and Massive MIMO Detection

Optimal MIMO detection has been one of the most challenging and computat...

Low-Complexity Multiuser QAM Detection for Uplink 1-bit Massive MIMO

This work studies multiuser detection for one-bit massive multiple-input...

Towards Radio Designs with Non-Linear Processing for Next Generation Mobile Systems

MIMO mobile systems, with a large number of antennas at the base-station...

Interference-Aware Flexible TDD Design for Massive MIMO 5G Systems

Both the use of very large arrays of antennas and flexible time division...

Fully Decentralized Massive MIMO Detection Based on Recursive Methods

Algorithms for Massive MIMO uplink detection typically rely on a central...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Figure 1. Fundamental MIMO regimes in 5G New Radio and next generation local-area networks, and approximate feasibility of various detection approaches.

Multi-User Multiple-Input Multiple-Output (MU-MIMO) has proven an essential technique to maximize capacity in many different kinds of wireless systems such as 802.11 wireless LAN and 5G New Radio cellular networks. In MU-MIMO, the uplink receiver (i.e., an access point—AP—in a wireless LAN, or a base station—BS—in a cellular network) with multiple antennas supports many users simultaneously by striping data over parallel streams (a technique known as spatial multiplexing), and thus enables significantly higher data capacities. In an ideal world, the number of parallel streams that MU-MIMO can support would be the lesser of the number of mobile users and the number of radios at the base station, and overall system capacity would increase proportionally to the number of spatial streams.

In practice, however, the channel hardening phenomenon complicates this situation, in the following way. MU-MIMO requires signal processing to disentangle the spatial streams from each other, a technique called MIMO detection. For a base station with as many antennas as radio front ends, when the number of users approaches the number of base station antennas, MIMO detection becomes extremely difficult resulting in poor performance for conventional linear detection algorithms (BigStation): this is the Large MIMO regime that lies along the points where the number of users equals the number of base station antennas , as depicted in Figure 1.111For simplicity, we call MIMO regimes, “Large MIMO” when , while “Massive MIMO” when , regardless of size. For Large MIMO, there exist maximum-likelihood (ML) exact solvers, that can achieve the lowest possible bit error rate and, therefore, restore a high throughput. Unfortunately, these detection algorithms come at the expense of an exponential increase of the required computational resources as MIMO size increases, eventually becoming infeasible for many users because of the processing time limits in wireless systems. For example, at most three milliseconds of BS’s computation are available for both the 4G LTE uplink and downlink (BigStation; dahlman20134g).

Massive MIMO systems such as LuMaMi and Lund (6-12 users, 100-128 BS antennas) (rusek2012scaling; vieira2014flexible; malkowsky2017world), Argos (eight users, 96 BS antennas) (argos-mobicom12; argos-asilomar17), BigStation (BigStation), Agora (ding2020agora), and Samsung’s 5G base stations (16 users, 64 BS antennas) (samsung) mitigate channel hardening in the following way. Since linear detectors such as ZeroForcing (ZF) and Minimum MeanSquared Error (MMSE) can achieve nearML performance when the wireless channel is wellconditioned, systems that use many more base station antennas than usersspatial streams (i.e., for MIMO) may offer each base station radio a choice of one out of a number of antennas to use. This largely negates the effect of channel hardening, but requires base station antennas numbering a sufficient factor greater than users (e.g.  (bjornson2015massive) or for 16 users or below (argos-asilomar17), while there is no proven rule-of-thumb of that maximizes the spectral efficiency (bjornson2015massive)), as shown in Figure 1, to achieve the full throughput of spatial streams. In addition, the deployment of larger numbers of antennas eventually becomes challenging from a practical standpoint, most acutely in wireless local area networks, but also in small, denselydeployed 5G base stations where form factors preclude excessive numbers of antennas, and eventually in normal base stations where tower size faces practically limited.

In this paper, we take a complementary approach to Massive MIMO: we begin with a particular Massive MIMO configuration in which the number of base station antennas is practically at its maximum, and then ask the question how can performance be further improved via additional spatial streams? The answer lies in a fusion of two preceding ideas: add radio chains at the Massive MIMO base station to equal the number of antennas, and at the same time, utilize nearML detection algorithms. This pushes us out towards the upperright corner of the space in Figure 1 and maximizes computational complexity, yet offers the promise of the greatest spatial multiplexing gains, given our practical constraint on base station antenna count.

A shift to Physics-inspired approaches. Over the last few years, there has been a surge of interest in alternative computation approaches to reduce the complexity of current detectors by leveraging algorithms that relate optimization convergence to Physics principles. This interest is further accelerating in view of experimental initiatives featuring hardware-native implementations of these approaches, using both quantum and classical physics-based computations (kim2019leveraging; hamerly2019; aramon2019physics; csaba2020coupled; Hadfield:2017:QAO:3149526.3149530; kasi2020towards; kim2020towards). One common aspect of these algorithms is that they frame the computational problem as an energy minimization problem of a magnetic spin system, also known as the Ising spin model (ising1925beitrag). Beside being an important model to understand the physics behind magnetic systems, any NP computational problem can be expressed as the energy minimization of an appropriate Ising spin model (lucas2014ising) (that is, the Ising spin model is NP-Complete (welsh1990computational)). In this regard, physicsinspired algorithms can be seen as parametric “black boxes” that accept an Ising spin problem as input, and output the configuration with lowest associated energy. What distinguishes one algorithm from another is the underlying mechanism used to find the global minimum, which corresponds to the ML optimal solution in MIMO detection.

This paper presents the design and implementation of ParaMax, a soft MU-MIMO detector system for Large and Massive MIMO networks that uses parallel tempering, a physics-inspired heuristic algorithm, on classical platforms. ParaMax operates flexibly in parallel for any number of available processors, supporting fixed latency and highlyscalable parallelism. We design the ParaMax Ising Solver4.1), a parallel temperingbased solver that is tailored for MIMO detection, implemented as a fully classical algorithm that does not require any specific hardware, and integrate it into the overall design of our system (§4.2). We also introduce a new algorithm (§4.2.2) to generate soft information for heuristic detectors that enables a more reliable detection and decoding. The proposed algorithm utilizes heuristic detection outputs and generates soft information, defined as the bitwise detection confidences that implicitly take channel conditions and noise into consideration. To our best knowledge, this is the first application of parallel tempering to wireless networks, and ParaMax is the first heuristic-based MIMO detector that demonstrates nearML performance for both very Large and Massive MIMO successfully.

Figure 2. Summary of ParaMax’s feasible MIMO regimes (cf. Figure 1) and required processing element count per subcarrier.

Our experiments show that ParaMax achieves a constantlyincreasing performance as the number of processing elements increases. In the case of lowerorder BPSK and QPSK modulations, very large MIMO of and respectively, can achieve near-ML performance for less than tens of processing elements, as depicted in Figure 2. With respect to Massive MIMO systems, in MIMO with 16-QAM modulation at SNR 16 dB, ParaMax achieves a 330 Mbits/s near-optimal system throughput with 48 processing elements (PEs) per subcarrier, approximately better throughput than linear detectorbased Massive MIMO systems.

2. Background

This section introduces background knowledge, indicating relevant literature. Sections 2.1 and 2.2 respectively explain ParaMax’s algorithmic foundations, simulated annealing and parallel tempering. Section 2.3 describes the MU-MIMO model and detection problem.

2.1. Simulated Annealing

Simulated annealing (SA) is a classical heuristic optimization technique typically used to find the state or configuration with the lowest energy of Ising spin problems, where

is a vector consisting of

spins, with each spins assuming the values {-1, +1}. In general, the energy objective function of Ising spin problems (also called Hamiltonian) is represented as a quadratic cost function of the following form:


with being (anti-)ferromagnetic couplings that indicate a preference of correlation ( or ) between two spins, and local magnetic fields that individually act on . Any optimization problem, including MIMO detection, can in theory be translated to an Ising spin model by properly choosing the and (welsh1990computational; lucas2014ising).

Simulated Annealing (SA) Heuristic. SA is inspired by the physical process of annealing, where a metallic material is slowly cooled from high temperature to eventually reach a molecular state or atomic configuration where the potential energy of the material is minimized. SA numerically emulates this process in order to find the global optimum (or ground state) of Eq. 1. To enable SA, it is necessary to simulate a thermal bath

which imitates the cooling or annealing process interacting with the Ising spin model. More precisely, the probability that a given spinconfiguration

is explored by the Ising spin system at a given inverse temperature follows the Gibbs distribution , with usually called partition function (georgii2011gibbs; crooks2007measuring). As the temperature is lowered, the probability of finding a state with an energy larger than the minimum energy becomes exponentially lower. Therefore, sampling from the low-temperature Gibbs distribution allows rapid detection of the spin configuration with the lowest energy with high probability.

However, the calculation of the partition function , and thus , is computationally challenging, particularly for low temperatures. To avoid the direct calculation of the Gibbs distribution , Metropolis et al(metropolis1949monte)

proposed the use of Markov chain processes to help the system emulate the annealing and heuristic exploration of configurations at a given temperature. Specifically, they proposed a random process to “flip” a spin, with probability depending only on temperature and the Hamiltonian (but not on

), i.e.:


with the variation of energy once the spin () is flipped for a given initial configuration. Hence, moves that would eventually reduce the overall energy of the spin system are always accepted. Otherwise, there is a chance that such spin flip is either accepted or rejected. Metropolis et al. showed that the spin system will eventually thermalize to the corresponding temperature if the rejection rule in Eq. 2 (also called Metropolis updates or sweeps) is iteratively applied. Therefore, it is in principle possible to find the lowest energy spin configuration and, consequently, the solution to the original problem, by starting from a very large temperature and slowly decreasing it by iteratively applying the rule in Eq. 2.

2.2. Parallel Tempering

SA guarantees that the spin system will eventually find the lowest energy spin configurations if the temperature is lowered slowly enough. However, for hard optimization problems, it may require an exponentially long time. Indeed, a rugged energy landscape “traps” the spin system in local minima which are hard to escape: parallel tempering (swendsen1986replica) helps the spin system escaping local minima and, therefore, thermalize faster at a low temperature. The basic principle of parallel tempering is simple: instead of a a single spin system, different replicas are simulated in parallel, each with a different temperature. After a certain amount of Metropolis updated, the temperatures of the two replicas and are exchanged following the updating rule:


with and being the difference in the inverse of temperature and the difference in energies of the two replicas respectively. As one can see, two temperatures are always exchanged if a replica at higher temperature has a lower energy than a replica with a lower temperature. Otherwise, the exchange of the two temperatures is either accepted or rejected accordingly to Eq. 3. In a variety of hard optimization problems, parallel tempering drastically speeds up the thermalization of the spin system (katzgraber2006feedback; young2004absence; trebst2006optimized), including benchmark against quantum annealers (mandra2016strengths; mandra2017exponentially; mandra2018deceptive).

2.3. MIMO Detection

The input and output relationship of a spatial multiplexing MIMO system (per subcarrier in OFDM systems (nee2000ofdm)) with input antennas at user side (or single-antenna users for simplicity) and output antennas () with radios () at the receiver side is described as . With (i.e., MIMO with radio streams), here is the received vector perturbed by additive white Gaussian noise (AWGN) , is the wireless channel, and is the transmitted set of symbols with constellation (e.g., 4-, 16-, 64-QAM), representing bits per channel use, with the size of the modulation. MIMO detection at the receiver side (AP or BS) is a technique to find a candidate solution with an objective of detecting the transmitted symbol vector (i.e., the objective is ) based on the received signal

and estimated

. Pilot symbols enable the estimation of .

Maximum Likelihood Detection. Maximum likelihood detection (ML detection) is optimal in the sense that it minimizes the error probability of the detection. It is defined as


The search set of (i.e., the search space ) is the set of possible solutions that the optimizer can take into account. Each element in the search space is a candidate solution with which the values of the ML objective function in Eq. 4 (i.e., Euclidean distances) are measured and compared with each other. The best candidate, with minimum value, becomes the ML solution . Further, is an indicator of complexity. In principle, the ML search involves all possible candidates (i.e., ) which makes the brute-force approach intractable for large MIMO sizes with high-order modulations.

Sphere Decoder. The Sphere Decoder (SD) achieves optimal performance even with by applying an adaptable search constraint in a sequential manner (fincke1985improved; Agrell02; Damen03; SD). The SD transforms Eq. 4 into an equivalent tree search and applies tree pruning, visiting fewer nodes and leaves without loss of optimality. However, since it is an exact algorithm (i.e., achieves ML performance), the search space for SD is still exponentially large in the worst case (SDComp1). Further, because of its sequential nature, its processes cannot be fully parallelized and the complexity (latency) varies per detection, which is not desirable for hardware implementation.

3. Related Work

In this section we introduce related work on MIMO detection.

Parallel Sub-Optimal Architectures. These approaches divide the optimal SD tree search into parallel tasks in order to make use of hardware containing many processing elements (PEs) such as a GPU or FPGA (flexcore-nsdi17; nikitopoulos2018massively), while search algorithms become approximate. For these methods such as the Fixed-Complexity Sphere Decoder (FCSD) (FCSD; barbero2008extending; jalden2009error; barbero2008low) and K-best SD (Guo06; mondal2009design; li2008reduced; zheng2017k), is a subset of , so how to select for comparing is a key factor. For instance, the FCSD splits the SD tree of levels into two separate search areas, one for full search (FS) and the other for greedy search (GS). During the FS, the FCSD visits all nodes at the first levels and then switches to GS, where only one child node with minimum partial Euclidean distance is explored for the remaining levels (). This exploration process can run in parallel.222For the maximum effect of the algorithm, a channel ordering scheme is used to ensure users with poor channel are detected in the FS phase of the FCSD. The FCSD results in consisting of candidate solutions. Here, is a controllable positive integer parameter that trades off the FCSD’s detection performance with its computational complexity. Note that the complexity of the FCSD (even with small ) is still larger than linear methods and the FCSD enables only parallel processes such as 16, 256, 4096 for 16-QAM with , respectively (i.e., bounded complexity but not flexible). ParaMax features flexible and scalable parallelism.

Heuristics for MIMO Detection.

Heuristic approaches inspired by Biology or Combinatorial Optimization methods such as genetic algorithms, reactive tabu search, and particle swarm optimization for MIMO detection exist

(abrao2010s; vsvavc2013soft; hedstrom2017achieving), but significant performance gains are not observed. Some studies have used analog quantum hardware platforms (kim2019leveraging; kim2020towards), but these are specialized platforms that not yet generally available. Other studies on Physics-inspired SA, Gibbs distribution, and quantum search algorithm show feasibility to some extent (guangda2009multi; hansen2009near; farhang2006markov; abrao2010s; botsinis2013quantum), but lack comprehensive evaluations for Large and Massive MIMO systems and comparisons against other state of the art detectors.

4. Design

In this section, we describe the design of ParaMax: Section 4.1 introduces the key building block of ParaMax’s design, a SAparallel tempering solver. Section 4.2 describes the complete design of ParaMax. Section 4.3 then introduces a refinement of ParaMax, 2R-ParaMax, which uses soft information to enhance performance at the cost of some computational complexity. We evaluate both designs in Section 6.

4.1. ParaMax Ising Solver (PMIS)

The ParaMax Ising Solver (PMIS) is the main solver module in ParaMax. It is based on SA, featuring a parallel tempering algorithm highlytailored to optimize the Ising model of MIMO detection. PMIS is a completely classical algorithm that does not require any specialized hardware for its implementation. It performs a local search by updating the spin values of a given random initial configuration according to Eq. 2. Each replica is associated with a different temperature, and temperatures may be exchanged according to the update rule in Eq. 3. Since the calculation of the energy associated with each replica can be trivially reduced to matrixvector and vectorvector multiplications, couplings and local fields of the Ising cost function are stored as a matrix and as a vector respectively. Therefore, the calculation of , critical for the update rules in Eq. 2 and Eq. 3, is reduced to:


with the vector representing the spin configuration, where the factor takes into account the symmetry of the matrix . During our implementation, PMIS is optimized to maximize the performance for operations involving spin variables which cover up to 512, 256, and 128 singleantenna users with BPSK, QPSK, and 16QAM modulations, respectively. We provide further details on our PMIS implementation in Section 5.

Figure 3. Metropolis Sweep analysis of the PMIS solving MIMO detection (16-QAM at 20 dB SNR): overview (left) varying user numbers and detailed view (right).
Figure 4. Temperature range analysis of the PMIS solving MIMO detection for 20-user (16-QAM at 20 dB SNR).

Computational complexity. In MIMO detection, compute time complexity is a fundamental metric, along with BER and network throughput. From Eq. 5, it is clear that complexity is proportional to the square of the number of spin variables (). Therefore, recalling that every replica is independently updated, overall PMIS complexity scales as , with and the number of replicas and Metropolis sweeps, respectively.

Replicas and Metropolis Sweeps. To reduce the computational cost to the bare minimum, we have opted for a “bang-bang” parallel tempering approach. That is, only two replicas are used (): one at very low temperature and one at higher temperature: the replica at lower temperature acts as a greedy searcher while the replica at a higher temperature acts as an observer. When the greedy searcher is stuck in a local minimum, the two replicas can exchange roles (i.e., temperature) to resolve the bottleneck. While this has been successfully used in the context of quantum annealing (yang2017optimizing), ours is one of the first reports of a bangbang parallel tempering schedule on a classical platform. To choose the number of Metropolis sweeps (§2.1), we empirically examine different numbers of sweeps from one to 80 with 16-QAM in Figure 3 out of 20 instances and 10,000 PMIS runs per instance. Not surprisingly, we observe a tradeoff between latency (upper) and sampling quality (lower). We choose as an appropriate point that satisfies the current LTE standard’s latency requirements.

Choice of temperature range. Unlike and , the temperature range does not influence ParaMax’s complexity, so we choose a PMIS temperature range where it achieves the highest probability of finding the ML solution (i.e., the ground state). For benchmarks we select the values and , which perform well, as shown in Figure 4.

The foregoing description has described a single PMIS run. In ParaMax, multiple PMIS runs on multiple PEs in parallel, one PMIS run per PE. Each PMIS run is independent from the others, accepting the Ising model of the MIMO detection as input, and outputting a candidate solution.

Figure 5. Overview of ParaMax’s detection algorithm.

4.2. ParaMax Design

In this section we describe the complete ParaMax design (Figure 5). We describe the function of each block required for MIMO detection in §4.2.1 and the soft output generator module in §4.2.2.

4.2.1. ParaMax Detection Algorithm

We assume the base station receives a signal perturbed by AWGN and estimates the wireless channel as stated in Section 2.3.

1. ML-to-Ising Reduction. The procedure and the generalized formula of reducing the MIMO detection ( from Eq. 4) to the Ising form using the spin-to-symbol mapping were first introduced in (kim2019leveraging); our system assumes the same mapping. In the mapping, spins represent all possible symbol combinations (i.e., spins for a possible symbol per user), so its ground state always corresponds to the ML solution.

2. PMIS Parallel Processing. Each PMIS run samples an independent solution candidate (i.e., an Ising configuration ). Since detection is based on a heuristic, a single PMIS run’s solution may not be optimal, and thus multiple runs are required to form a set of candidate solutions, gradually increasing the probability of collecting the optimal ML solution. Since each PMIS optimization run on a single PE completely independent of the others, ParaMax flexibly operates on any number of independent processing elements (), with highly scalable parallelism. The number of available processing elements is equal to the number of PMIS outputs that the ParaMax system can generate with full parallelism.

3. Ising Solution Filter and Demapping. After all PMIS parallel runs, the corresponding outputs are collected in a table of Ising spin configurations. Before further processing, the list is sorted in order of solution quality, based on the Ising energy of each output. The Ising solution filter returns only the configuration with the best (minimum) , which is equivalent to the wireless symbol , i.e., with the minimum (among candidate solutions), after proper demapping (spins symbols). Finally, is converted into MIMO detected bits.

4.2.2. Spinwise Soft Information Output

For most heuristicsbased solvers, only the lowestenergy Ising configuration is returned (regardless of how many times it occurs among PMIS outputs) and any outputs other than it are discarded. In ParaMax, however, we utilize all PMIS outputs to generate soft information (i.e., detection confidences, for each spin in a given configuration). In general, softoutput MIMO detectors’ soft values are utilized for iterative MIMO detection or channel coding (roger2012efficient; larsson2002maximum; larsson2008fixed; barbero2008low). In this work, we design the former (2R-ParaMax) in Section 4.3.

Figure 6. Equivalent representations of 11 QPSK detection: Euclidean distances in the I-Q plane (left), and Ising energies (right). Shadings highlight likely solutions.

ParaMax collects candidate solutions from independent PMIS runs. Among these, multiple occurrences of a certain spin configuration (with agreeing spin variables) are very likely to be observed, which could be used to identify spins easy (or hard) to detect (i.e. variables that are very likely to be assigned a certain value in the unknown optimal ML solution). Figure 6 shows an illustrative example of detecting a received QPSK signal in two equivalent representations, one in the I-Q plane with Euclidean distance (left), and the other with Ising energies (right). In this example, the first spin variable (corresponding to the symbol’s real part) is likely to be detected as for most PMIS runs, since the difference in Ising energy from all configurations that have (resulting in or in Figure 6, right) and (resulting in or in Figure 6, right) is significant. The spin is easy to detect compared to spin (corresponding to the symbol’s imaginary part). Multiple occurrences of PMIS runs agreeing on indicate this, while PMIS runs will disagree on the value of , because the two most frequent spin configurations’ energies ( and ) themselves disagree on the value of .

This phenomenon becomes even clearer for highorder modulations, since in 16-QAM or higher modulations, the value of the spin coefficients in the Ising spintosymbol mapping varies across spins. For example, Figure 7 shows the spintosymbol mapping of 16-QAM for the user, where the user’s possible symbol maps onetoone with spins , that is . Here, spins and , the oddnumbered spins, determine the symbol’s quadrant, while spins and , the evennumbered spins, determine the symbol in the given quadrant. Here, the oddnumbered spins for 16-QAM are in general easier to detect than the evennumbered spins because of higher robustness to AWGN (detection reliability). Table 1 presents empirical spinwise error rates of ParaMax for 16-QAM detection. These differences in robustness indicate that using ParaMax’s soft information would be particularly helpful for further processing, similarly for unequal error protection (UEP) (masnick1967linear; wei1993coded; horn1999robust; boyarinov1981linear; 237878)).

Mean Spinwise Error Rate
Ovennumbered spins: Evennumbered spins:
0.167 0.152 0.329 0.352
Either/both: 0.32 Either/both: 0.68
Table 1. ParaMax’s spinwise error rate (conditioned on 103,850 incorrect outputs) for eightuser, 16QAM MIMO detection.
Figure 7. Mapping between four Ising spins (variables) to n-th user’s 16-QAM symbols (

symbols) on the constellation. Shading denotes an inner quadrant. In this mapping, odd numbered spins are easier to detect than even numbered spins.

Soft information computation. Based on the equivalence of of the I-Q and spin configuration representations, the occurrence count of a given spin value for a certain spin across all PMIS outputs samples the distance of the symbol corresponding to that spin’s value, and hence estimates that spin’s likelihood of correctness. More specifically, after collecting PMIS outputs sorted by Ising energies (), the system has () unique outputs with corresponding occurrence counts (). The detection confidence of spin () is defined as:


where is a count of occurrences of the ranked configuration (defined in §2.1 on p. 2.1), only when the configuration’s spin is equal to the firstranked configuration’s spin (i.e., is either or zero). The spinwise detection confidences () are the soft values ParaMax outputs in this step. Note that the reliability of each increases as the best observed Ising energy among the collected outputs () becomes closer to the unknown ground state (of energy ), which implies as increases the quality of soft values improves.333While ParaMax is an inherent softoutput MIMO detector, requiring simple computations, conventional soft-output MIMO detectors require additional computations of exponential complexity, of loglikelihood ratio (LLR) for all coded bits to generate soft values at channel decoder (roger2012efficient; studer2006soft; wang2004approaching). Similar algorithm is introduced using quantum annealing (karimi2017boosting), where only partial outputs are used.

4.3. 2R-ParaMax: Iterative Soft Detection

We now introduce a method of using the soft information described in the prior section to enhance the operation of ParaMax. We call this protocol 2R-ParaMax. The main idea is to iterate the PMIS block twice, once for generating soft confidence information, and again to obtain a final detection result based on the confidences from the first iteration. Intermediate processing between the first and second iterations functions pre-decision of spins with high detection confidence. An error correction post-processing is applied at the end of the second round, both of which have linear complexity. The end result is a more accurate MIMO detection result, at the expense of a modestly increased latency, and so this might be employed for challenging wireless channels andor large numbers of users. With reference to Figure 5, the structure of 2R-ParaMax PMIS block is shown in Figure 8. This block is replacing the third block marked in red of Figure 5. The other blocks are exactly the same as described in ParaMax. We also note that the soft information generated by the second round can also be used for the channel decoding or further iterations of the algorithm.

Figure 8. Structure of 2R-ParaMax PMIS Block (cf. Figure 5’s third block in red). The other blocks in 2R-ParaMax are exactly the same as ones in ParaMax.

Intermediate Predecision. The intermediate predecision module identifies those spins with a high detection confidence (over a threshold ) from the first round of PMIS outputs in order to reduce the number of spin variables involved in secondround of PMIS runs, simplifying secondround detection. That is, if the spin’s detection confidence (), then we predecide the spin variable to be the value of the corresponding spin of the best solution in the first round (i.e., ).

After thus obtaining a predecided set of spin indices, the next step is to update the Ising form accordingly. For each spin index in and for each Ising problem index , we set , and then remove , and . The result is a reduced Ising problem that contains spins only. Table 2 summarizes the average ratio of decided spins () and the success ratio () of the pre-decision process. Here, denotes an index group of spins where spins decided by the pre-decision process are exactly the same as the corresponding spins in the ML solution. In 2R-ParaMax, we apply , which ensures (for lower , higher applied). With the updated Ising form (with and for spins), we execute a second round of PMIS to generate outputs and then filter the best output consisting of spins with minimum Ising energy in terms of . When the filtered configuration is combined with the pre-decided spins appropriately, the full configuration consisting of spins can be restored and demapped into symbols. This full configuration is further compared against the best PMIS outputs of the first round based on the original Ising form . The final best configuration is returned as 2R-ParaMax’s detection solution, which guarantees that 2R-ParaMax’s bare minimum performance is ParaMax’s performance.

0.91 0.93 0.95 0.97 0.99
0.35 0.32 0.28 0.21 0.01
0.97 0.99 1.0 1.0 1.0
Table 2. 2R-ParaMax’s intermediate pre-decision process tested for 5,000 different instances of 20 20 16-QAM detection at SNR 20 dB on .

5. Implementation

We now describe our ParaMax implementation.

Computing Environments. CPU-based experiments are executed on an Intel i9-9820X at 3.30GHz with 20 cores, 2,189 threads, and 126 GB RAM. GPU-based experiments are tested based on the CUDA (Compute Unified Device Architecture (sanders2010cuda)) 10.2 with GeForce RTX 2080 Ti of 4,352 CUDA cores and 68 streaming multiprocessors.

Wireless MIMO Channels. Both simulationbased and tracedriven real world wireless channels are used for our experiments. In the case of the simulationbased channel, independent and identically distributed (i.i.d) Gaussian channels with AWGN are synthesized for various SNR settings. For tracedriven channels, we use nonline of sight wideband MIMO channel traces at 2.4 GHz, between 96 base station antennas () and eight static users (), the largest MU-MIMO dataset provided in Argos (Argos). Among , we single out 8 to 32 (in steps of four) antennas to test the most challenging MIMO regimes (e.g. ). Since trace based channels include measured noise and limited user numbers, we use synthesized channels, unless otherwise stated, in order to precisely control SNRs and evaluate various MIMO regimes such as . Based on both channel settings, we generate large-scale Ising models of MIMO detection (100,0001,000,000 random instances per scenario) in order to measure detection BER up to , approximately .

PMIS CPU-Implementation. While the front-end of our PMIS implementation is in Python, the core is completely written in standard (josuttis2012c++). We assign only a single core and a single thread (a single PE) to the calculation of each PMIS run by manually modifying the OpenMP (chandra2001parallel; dagum1998openmp) and C++ parallelization settings. Furthemore, to maximize the performance of PMIS (to satisfy limited processing time in wireless standards), the following innovations have been implemented: (1) Use of static memory: Static allocation, unlike dynamic allocation, happens at global scope and it is pre-populated when the library is loaded. Moreover, since the size of arrays is known in advance, compilers can further optimize math operations on static arrays. (2) Parameter pack expansion: Loops in the matrix-matrix and vector-matrix multiplications are the most expensive part in PMIS implementation. To further reduce the computational cost, most of the critical loops are statically unrolled using features like the parameter pack expansion, introduced in the standard. (3) Intel SIMD instructions: Most of the modern CPU architecture have intrinsic operations to allow multiple operations on contiguous arrays of floats. In PMIS, we have used Intel SIMD instructions to vectorialize operations like matrix-vector and vector-vector multiplications (plank2013screaming).

PMIS GPU-Implementation. The core design of CPU-ParaMax is based on a highly optimized C++ implementation of SA. Therefore, the natural extension of ParaMax to GPU consists into the implementation of the core SA engine to GPU. However, unlike CPUs, GPUs achieve the best performance for large arrays where multiple synchronous operations are applied at the same time. Indeed, while GPUs have more cores than CPUs, each single GPU core is typically much slower. Therefore, to maximize the performance of SA implemented on GPUs, we have designed a GPU kernel based on the JAX/XLA language that updates multiple PMIS runs at the same time. More precisely, the spin configuration for a single PMIS run (in Eq. 5) is now extended to a matrix , with corresponding to the PMIS index. Since PMIS runs are completely independent from each other, () can be updated independently and synchronously.

6. Evaluation

In this section, we evaluate ParaMax in various aspects. Section 6.1 evaluates ParaMax’s detection latency against other CPU- and GPU-based detectors. Section 6.2 illustrates sampling performance of ParaMax comparing against simulated annealing and required the number of processing elements for ParaMax to achieve near-ML performance in both Large and Massive MIMO. Section 6.3 and  6.4 show the ParaMax’s bit error rate and system throughput performance respectively, compared against other state-of-the-art detectors in both Large and Massive MIMO.

(a) Compute latency of fully-parallel full-blown C++ kernel CPU-ParaMax.
(b) Single Device: 9.3 TFLOP GPU-ParaMax vs 0.028 TFLOP CPU-ParaMax.
Figure 9. ParaMax Detection Latency.

6.1. Detection Latency

Fully-Parallel Full-Blown ParaMax. Figure 8(a) shows the detection time of ParaMax as a function of ( per channel use, where the background color coding indicates approximate feasibility for the wireless standards (WLAN and LTE). As increases (i.e., and/or modulation increases), computing time tends to scale as (cf. does not affect and thus compute time). The available largest MIMO sizes that reach the borderline-limit of acceptable detecting time in the LTE standards are , , for BPSK, QPSK, 16-QAM modulation, respectively. While slight variations of runtime are observed that can cause overall latency increase and hardware synchronization issues, this can be resolved by an integrated system hardware since the origin of the variations is caused related to our system by the kernel allocation of jobs and concurrent (unrelated) system processes. In general, 2R-ParaMax requires approximately ParaMax latency.

CPU- vs GPU-ParaMax Comparison. Figure 8(b) shows the comparison of compute time between the XLA kernel (compiled for both CPU and GPU) and the original C++ kernel. The runtime for the C++ kernel has been obtained by computing the average runtime for a single PMIS run and then projected for multiple PMIS runs. XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can be compiled and optimized separately for either CPU or GPU. As one can see, for a sufficiently large number of parallel PMIS runs, while the kernel runtime compiled for CPUs have a consistent runtime with the full-blown C++ implementation of ParaMax, the kernel compiled for GPUs shows a speed-up, where total PMIS runs can be defined as multiplied by the number of subcarriers (). While GPU-ParaMax can achieve the speed-up for over hundreds of PMIS runs, it cannot satisfy time requirements for standards. Recall that current GPUs are not designed to make full use of resources for small-size systems (i.e., few PMIS runs). Thus we also extrapolate its performance to estimate what we could achieve in GPU without these limitations. Note that unlike on CPUs where a single core can be used to carry out any calculation, GPU cores are designed to work in concert to manipulate large block of data in parallel, and users cannot assign specific resources to a certain computation (flexcore-nsdi17). Therefore, we define a single PE for GPU-ParaMax as the extrapolation of a single PMIS run from large number of PMIS runs. In 5G New Radio, this extrapolation becomes more reasonable (while still being approximate), since 5G systems will support over three thousands of subcarriers, and slightly more time444Available compute time is for all BS-processing including channel decoding. (4 ms) than LTE (3 ms) will be allowed for enhanced mobile broadband (eMBB) (ding2020agora; BigStation; 3gpp-ts-36-211) (cf. 1 ms for 5G Ultra-Reliable Low-Latency Communication (URLLC)).

20-user MIMO (16-QAM) ZF-SIC ParaMax FCSD
=2 =3 =4
Parallelism # Flexible
Required 1 256 4,096 65,536

Latency [us]

CPU 25 357 405 5,821 93,714
GPU 83,861 extr. 31 319 378 1,841
Min time 25 357 319 378 1,841
Table 3. Available number of parallel processes, required for fully parallel processing, and average detection runtime of various MIMO detectors both on CPU and GPU. ParaMax’s compute time is for a single PMIS run on a single PE (i.e., fully-parallel ParaMax) and GPU-ParaMax reports extrapolated compute time.

Comparison against Conventional Detectors. We compare ParaMax latency against various detectors implemented on the MIMOPACK library (ramiro2015mimopack)

, which is one of the fastest open-source MIMO detector implementations based on the (CUDA) C programming. The results for 20-user 16-QAM are summarized in Table 

3. In the case of the zero-forcing successive interference cancellation (ZF-SIC or V-BLAST with ordering scheme) (wolniansky1998v), while its complexity is slightly higher than linear detectors such as ZF and MMSE, compute time is still few tens of microseconds. However, their computations (both ZF and ZF-SIC) are not appropriate for parallel processing, causing extra overheads such as job scheduling and data transition among computing resources. In the case of the FCSD, we consider three different that trade-offs the FCSD’s detection performance with its complexity. As long as the available number of PEs is large enough to allow full parallelism ( total PEs), the compute time remains in a few hundreds of microseconds, satisfying LTE requirements.

6.2. Heuristic Detection Sampling

For ParaMax, we can report the expected number of sampling repetitions to reach ML-performance, which can be computed using the probability of obtaining the ML-solution in one sample of a given MIMO detection scenario (i.e., MIMO size, modulation, and SNR), averaged across the problem distribution ((mandra2016strengths). Since cannot be determined a priori by theoretical means, we obtain it through empirical evaluation of statistically significant 1,000,000 PMIS runs across 100 detection instances per scenario. In order to compute this average probability, we use the ML-solutions found by expensive runs of the Sphere Decoder. Since each run is independent, the probability for ParaMax’s to find the optimal ML-solution:


Inverting Eq. 7, we can obtain the required number of PMIS repetitions (samples) to achieve the ML-detection with a target probability as:


In Figure 9(c),555Note that the formulas hold also for any non-parallel iterative method with independent sampling, where is simply the number of required repetitions. we plot and corresponding required for different .

Very Large MIMO with Low-Order Modulations. Figure 9(a) plots as a function of Large MIMO detection with different heuristicbased detectors (SA, ParaMax, and 2RParaMax) for various and modulations. Surprisingly, for the BPSK and QPSK modulations, all tested heuristic detectors achieve , which implies nearly all PMIS runs successfully reach the ML-solution. For ParaMax and 2RParaMax, this tendency is observed up to while we plot here only up to to save space. Only a few processing elements are enough to perform ML-detection up to MIMO with BPSK and MIMO with QPSK. While ParaMax becomes currently unpractical at around (Figure 9), this MIMO size and its requirement for optimal detection is promising for city-scale Internet of Things (IoT) applications envisioned in 5G networks or beyond. Those scenarios will handle hundreds or thousands of devices per BS with low-order modulations (miller2015internet; mazaheri2019millimeter; miorandi2012internet), and may accept longer processing time than ordinary data communications.

(a) Average probability of finding the ML-solution per run ().
(b) Impact of ratios (from Large to Massive MIMO).
(c) Required for ParaMax to perform near-ML detection.
Figure 10. Detection Sampling Evaluation. Figure 9(a) plots for three heuristic-based detectors (colors) varying modulations (hatch patterns), SNRs, and . Figure 9(b) shows impact of (from 1 to 4, from Large to Massive MIMO) on ParaMax’s for different user numbers (colors) and SNRs (hatch patterns). Figure 9(c) plots relationship between and for ML performance with various . Approximate MIMO feasibility of ParaMax is provided for two blocks of high .

From Large to Massive MIMO with 16-QAM. In the case of 16-QAM in Figure 9(a), notably drops as increases for all heuristic-based detectors and we observe higher for 2RParaMax, ParaMax, and SA. Given that Large MIMO detection with high-order modulations is a challenging problem in general, we add more receiver antennas () to see the impact of / ratio on . Figure 9(b) shows this relationship for various user numbers for different SNRs. As increases (i.e, from Large MIMO to Massive MIMO), rapidly increases and then is converged to 1.0. While for larger number of users () at lower SNRs tends to increase slower, can still be achieved around , where the required for ML-detection is around 1,000 (see Figure 9(c), where we summarize the applicability of ParaMax). We observed that the trace-driven channel with noise shows better performance (faster convergence than 20 dB SNR).

(a) 8-user and 12-user MIMO: BER as a function SNRs.
(b) 12-user MIMO: BER as a function of ratio.
Figure 11. (Overview) BER from Massive to Large MIMO. Comparisons of detection BER in Large and Massive MIMO for various detectors across MIMO regimes and/or SNRs with 16-QAM.

6.3. Bit Error Rate (BER) Performance

This section presents ParaMax’s detection BER. Recall that is the number of processing elements (PEs) assigned to ParaMax per subcarrier, where each PE performs a PMIS run. Since we assume fully-parallel ParaMax for minimum detection latency, is also equal to the number of PMIS runs. Note that regardless of computing platforms (CPU, GPU, or FPGA), the detection performance (BER and throughput) as a function of is the same, as long as they can satisfy limited time requirements supporting all subcarriers (unless there exists a serious precision issue), while the definition of a single PE and total available PEs per device can vary depending on platforms and/or implementation details. In the next subsection (Sec 6.4), we evaluate ParaMax on multi-subcarrier systems, considering its detection latency, available parallelism, available compute time in wireless standards, and impact of forward error control (FEC).

(a) 12-user Large MIMO: BER as a function of varying SNRs.
(b) 12-user Massive MIMO: BER as a function of at SNR 16 dB.
(c) 128 128 Very Large MIMO with QPSK modulation.
Figure 12. (Detailed View) BER as a function of . Comparisons of measured BER for various MIMO regimes and detectors with 16-QAM (Fig. 11(a),11(b)) and QPSK (Fig. 11(c)) modulation.

Overview: BER from Massive to Large MIMO. Figure 11 shows BER performance in various MIMO regimes with 16-QAM. We consider ParaMax and 2R-ParaMax with 16 and 256 PEs, comparing them against other detectors such as ZF (linear), SA (heuristic), FCSD (tree search-based), and optimal SD (ML), where SA and FCSD (with channel ordering (dai2010simplified)) are comparison schemes of parallel architecture-based detectors. As expected, linear-based ZF, which requires high ratio for proper detection, performs poorly as the regime goes from Massive MIMO to Large MIMO (upper to lower in Figure 10(a)) and with more users (left to right in Figure 10(a)), showing several orders of magnitude worse BER performance against the other detectors, particularly at high SNRs. In the case of the parallel architecture-based detectors, it is observed that for the same PEs, ParaMax and 2R-ParaMax outperform SA and FCSD detectors in all MIMO regimes and SNRs tested except that in MIMO at high SNRs, FCSD outperforms ParaMax and 2R-ParaMax when with 16 PEs. However, 2R-ParaMax reaches lower BER than FCSD when with 256 PEs. Figure 10(b) plots BER with 16-QAM as a function ratio with smaller such as 2, 8, and 16. For low ratios, parallel architecture-based detectors even with 2 PEs can obtain lower BER than ZF. As the ratio increases, all detectors achieve better BER for the same , but more PEs are required to beat ZF.

Detailed View: BER as a function of . We evaluate BER as a function of for 12-user Large and Massive MIMO in Figure 12 to show the detailed performance comparison. Note that ZF is not suitable for parallelization, so it achieves the same performance, regardless of the number of PEs. Figure 11(a) presents BER for Large MIMO () with 16-QAM at various SNRs. ParaMax can support any number of PEs and approach the optimal performance as increases (i.e., fine parallelism granularity), while the FCSD requires at least 16 PEs to operate the fully-parallel algorithm for the minimum , and the FCSD with = performs equivalently until reaches (i.e., no gain between 16 PEs and 256 PEs). Figure 11(b) focuses on Massive MIMO (), showing the impact of ratio on both BER and . Higher ratios (upper to lower in Figure 11(b)) lead to lower BER for the same PEs and smaller for the near-ML BER, especially compared against Large MIMO (Figure 11(a)). Precisely, to reach the near-ML BER at SNR 16 dB for 12 users, 12-BS antenna MIMO requires around 60 PEs, 18-BS antenna MIMO requires 18 PEs, and 48-BS antenna MIMO requires only 5 PEs. Compared to SA and FCSD, ParaMax’s BER drops more rapidly as increases for any ratios. For example, to reach at MIMO, where the FCSD and SA requires (over) 16 PEs, ParaMax requires 6 PEs.

We also test 128-user Very Large MIMO with the QPSK modulation in Figure 11(c). As analyzed in Figure 9(a), we observe that very small can result in BER convergence for the QPSK, which is very likely the optimal BER, although we cannot evaluate SD because of extremely high complexity. At SNR 16 dB, ParaMax achieves over five orders of magnitude better BER than SA at 2 PEs and over six orders of magnitude better than FCSD at 4 PEs. Note that the FCSD even with thousands of PEs (i.e., with high ) cannot reach the ParaMax’s performance with a single PE.

(a) Varying SNRs for 8 16 MIMO.
(b) Varying for 12-user MIMO at SNR 16dB.
(c) Achievable gain (vs ZF) for the same number of subcarriers ().
Figure 13. System Throughput of Massive MIMO in WLAN. Achievable system throughput comparisons of various detectors as a function of with 16-QAM in different scenarios (varying MIMO sizes or SNRs) for minimum detection latency.

6.4. System Throughput Performance

This section evaluates throughput on multi-subcarrier systems. While detection BER is a fundamental metric for MIMO detection, the detector of the lowest BER does not necessarily imply the best throughput scheme, since real-world wireless systems include FEC techniques for error correction at the channel decoder under MIMO detector. Further, since the systems support many subcarriers with limited compute time, the total required computing resources to support them are another important metric for evaluation.

(a) Maximum number of subcarriers () supported per millisecond.
(b) Best throughput scheme for various MIMO/SNR regimes: 12-user MIMO with 16-QAM (left) and 128-user MIMO QPSK (right).
Figure 14. Projection to 5G Systems. Fig 13(a) shows maximum number of subcarriers that ParaMax can support as a function of total available PEs on computing devices. Fig 13(b) presents the estimated bestthroughput detectors for various regimes, supporting 5G target , on a CPU platform with ten state-of-the-art 128-core CPUs (). Here, ParaMax’s .

We first consider a WLAN wireless system with 64 OFDM subcarriers with 1/2 rate convolutional coding, where optimal achievable (ML-based) throughput on the system has been measured via over-the-air experiments in (flexcore-nsdi17). We translate the measured detection BER into the corresponding convolutional code-applied BER (i.e., coded BER). Among the provided data we select achievable optimal throughput for MIMO and MIMO at SNR 21.6 dB as a baseline throughput, assuming the coded BER of optimal Sphere Decoder (SD) we test at the same scenario (i.e., same MIMO size and SNR) is close to their optimal coded BER. We compute the achievable optimal throughput for various scenarios, considering SNR and optimal Frame Error Rate (FER) difference (ratio) against the baseline, for the frame size of 1500-byte and FER obtained from our coded BER. Then we compute throughput of various detectors considering FER difference between SD and corresponding detectors at each scenario. In WLAN scenarios, we maintain the minimum detection latency. For this, the system is expected to have total PEs of on a computing platform, where is the number of subcarriers. For example, for ParaMax to support 2 PMIS runs (i.e., 2 PEs) per subcarrier, a single state-of-the-art CPU (with 128 cores) is enough, while multiple CPUs are required to support more PEs per subcarrier. Since MIMO detection is completely independent across subcarriers, ParaMax’s job scheduling and allocation for multiple devices are quite straightforward.

Figure 13 shows throughput comparison against ZF and FCSD for various scenarios in Massive MIMO. Figure 12(a) demonstrates the impact of SNRs for the given MIMO size. While for ParaMax and 2R-ParaMax, 8 PEs (per subcarrier) are enough to reach the optimal performance for all tested SNRs, ZF could not reach the optimal performance except over SNR 20 dB including trace-driven channel and noise. Figure 12(b) shows the impact of ratio varying receiver antenna numbers at fixed SNR 16 dB. As seen in the previous section, less PEs are required at high ratios to achieve the near-ML performance. We observe that for the same scenarios, even less PEs are required to achieve the near-optimal “throughput” performance (Figure 12(b)) than to achieve the near-optimal “BER” performance (Figure 11(b)) because of the impact of FEC. Figure 12(c) plots ParaMax’s throughput gains versus ZF for various ratios and SNRs. The high gains are achieved at low SNRs and/or low ratios. These throughput gains can be generalized to any (even at different standards with assumptions/modifications), as long as both schemes can support all subcarriers satisfying the corresponding limited time requirements.

In general, cellularnetworked systems, such as 4G or 5G systems, support many more subcarriers, allowing more compute time than WLAN. To project the throughput performance onto 5G scenarios, we plot the maximum that can be supported per millisecond with 5G target, as a function of total available PEs on a computing platform, considering detector’s latency and parallelism based on assigned PEs per subcarrier in Figure 13(a) (we report GPU-ParaMax based on extrapolated data, as discussed in section 6.1). The figure implies how many computing resources are required to support 5G systems. A ZF-based system can support 5G target even with a single state-of-the-art 128-core CPU due to its short latency and minimum PE usage. In the case of ParaMax, tens of CPUs are required to support 2 PEs per subcarrier and hundreds for 16 PEs.

Furthermore, we estimate the best throughput scheme (ZF vs. ParaMax) for various MIMO regimes and SNRs, assuming a computing platform with ten CPUs in Figure 13(b) (left) for 12-user 16-QAM and Figure 13(b) (right) for 128-user QPSK based on achievable ParaMax throughput gains (vs ZF), although 128-user QPSK MIMO is currently unpractical (Figure 9). For gain 1.0, we report the ZF as the best scheme since it takes less compute time, while we report challenging regimes where ZF does not perform well and ParaMax requires at least several tens of PEs per subcarrier for the near-ML performance. We observe that ParaMax enables many challenging regimes of ZF (i.e., low ratio and/or low SNRs) by assigning reasonably more PEs.666Advanced FEC schemes such as LDPC and Polar codes that are applied in 5G systems can enable the near-ML performance with ZF for more MIMO regimes and SNRs. However, even more users and lower SNRs will keep bringing out the same scenarios, where ParaMax outperforms ZF, due to the fundamental detection BER gap. In the case of the QPSK, at (relatively small ratio), ZF can outperform ParaMax at some SNRs, but for this, 256-BS antennas are required to support 128 users, which is the double size of the-state-of-the arts. Of course, ParaMax requires more computing resources (10100) than ZF, but the trend at emerging system-on-chip architectures with more and more PEs, as well as C-RAN architectures promisingly envisioned in 5G, support the direction of massively parallel architectures-based designs requiring low interaction among PEs.

7. Discussion

In this section, we investigate several challenges and opportunities of ParaMax that are likely to further advance the system.

Fully-optimized and adaptive ParaMax. Considering that parallel tempering-related parameters are selected within a challenging scenario (16-QAM) in Section 4.1, ParaMax could be fullyoptimized for many different scenarios based on given user numbers, ratios, SNRs, modulation sizes, available total PEs, and/or wireless standards. Moreover there is a well known trade-off between number of sweeps (latency) and required for near-ML performance (compute resources) that could be explored (e.g., for 5G URLLC).

Higher-order modulations. As modulation size increases, ParaMax’s detection is degraded rapidly and becomes not operable for Large MIMO with high-order modulations such as 64-QAM or higher, requiring over PEs even for Large MIMO to achieve near ML-performance. For Massive MIMO, it is expected that even higher ratios than 16-QAM are required. Perhaps, more replicas or Metropolis sweeps ease the problem along with further optimization on ParaMax’s free-parameters related to parallel tempering such as temperature range for PMIS tuning. However, these gains will be obtained at the expense of longer latency. An implementation of ParaMax on dedicated hardware might improve the performance and reduce the computational cost order further.

Compatibility with specialized hardware. ParaMax does not require any specific hardware. However, another important aspect of ParaMax is that it is immediately compatible with future implementations that aim to deploy programmable specialized hardware (for Physics-based algorithms) designed to optimize problems in the Ising form including quantum devices such as quantum annealers (kim2019leveraging) and gate-model quantum computers running the QAOA algorithm (Hadfield:2017:QAO:3149526.3149530), as well as novel paradigm of classical calculation such as Optical Coherent Ising Machines (hamerly2019), CMOS-based annealers (aramon2019physics), and Oscillator-based platforms (csaba2020coupled).

8. Conclusion

In this work, we present ParaMax, a soft MU-MIMO detector system for Large and Massive MIMO networks that first makes use of parallel tempering for MIMO detection. Our performance evaluation shows that ParaMax enables currently-challenging MIMO regimes for commonly-used linear detectors, achieving the near-ML performance by assigning reasonably more compute resources. ParaMax also outperforms conventional parallel architecture-based detectors such as FCSD and SA-based detectors, requiring less processing elements to achieve the near-ML performance.


We thank the anonymous shepherd and reviewers of this paper, the NASA Quantum AI Laboratory (QuAIL), and the Princeton Advanced Wireless Systems (PAWS) Group for their extensive technical feedback. This research is supported by National Science Foundation (NSF) Award CNS1824357 and CNS1824470, and an award from the Princeton University School of Engineering and Applied Science Innovation Fund. Salvatore Mandrá is supported by NASA Ames Research Center (ARC), particularly the autonomous ATM part of the Transformative Tools and Technologies Project under the NASA Transformative Aeronautic Concepts Program, and DARPA under IAA 8839, Annex 125. Minsung Kim was also supported by the USRA Feynman Quantum Academy funded by the NAMS R&D Student Program at NASA ARC (FA8750-19-3-6101).