# Faster Activity and Data Detection in Massive Random Access: A Multi-armed Bandit Approach

This paper investigates grant-free random access with massive IoT devices. By embedding the data symbols in the signature sequences, joint device activity detection and data decoding can be achieved, which, however, significantly increases the computational complexity. Coordinate descent algorithms that enjoy a low per-iteration complexity have been employed to solve the detection problem, but previous works typically employ a random coordinate selection policy, which leads to slow convergence. In this paper, we develop multi-armed bandit approaches for more efficient detection via coordinate descent, making a delicate trade-off between exploration and exploitation in coordinate selection. Specifically, we first propose a bandit-based strategy, i.e., Bernoulli sampling, to speed up the convergence rate of coordinate descent, by learning which coordinates will result in more aggressive descent of the objective function. To further improve the convergence rate, an inner multi-armed bandit problem is established to learn the exploration policy of Bernoulli sampling. Both convergence rate analysis and simulation results are provided to show that the proposed bandit-based algorithms enjoy faster convergence rates with a lower time complexity than the state-of-the-art algorithm. Furthermore, our proposed algorithms are applicable to different scenarios, e.g., massive random access with low-precision analog-to-digital converters (ADCs).


## I Introduction

The advancements in wireless technologies have enabled connecting sensors, mobile devices, and machines for various mobile applications, leading to the era of the Internet-of-Things (IoT) [1]. IoT connectivity involves connecting a massive number of devices, which forms the foundation for many applications, e.g., smart home, smart city, healthcare, and transportation systems, and has thus been regarded as an indispensable component of future wireless networks [2]. With a massive number of devices to be connected to the base station (BS), massive connectivity brings formidable technical challenges and has attracted much attention from both academia and industry [3, 4].

Sporadic traffic is a unique feature of massive IoT connectivity: only a small fraction of devices are active at any given time instant [5]. This is because IoT devices are often designed to sleep most of the time to save energy, and are activated only when triggered by external events. Therefore, the BS needs to manage the massive random access by detecting the active users before data transmission. The grant-based random access scheme has been widely applied to allow multiple users to access the network over limited radio resources, e.g., in 4G LTE networks [4]-[6]. Under this scheme, each active device is randomly assigned a pilot sequence from a pre-defined set of preamble sequences to notify the BS of its activity state. A connection between an active device and the BS will be established if the pilot sequence of this device is not also used by other devices. Besides the overhead caused by the pilot sequences, a major drawback of the grant-based random access scheme is the collision issue due to the massive number of devices [5].

To avoid the excessive access latency caused by collisions, a grant-free random access scheme has been proposed [5]. Under this scheme, the active devices do not need to wait for any grant to access the network, and can directly transmit the payload data following the metadata to the BS. After activity detection and channel estimation based on the pilot sequences, the payload data of the active devices can be decoded. The key idea of activity detection and data decoding under the sporadic pattern is to connect the problem with sparse signal processing and leverage compressed sensing techniques [7]. Compared with the grant-based access scheme [5], the grant-free random access paradigm enjoys a much lower access latency. In scenarios where the payload contains only a few bits, e.g., an alarm signal, the efficiency can be further improved by embedding the data symbols in the signature sequences [8, 9]. Nevertheless, with massive devices and massive BS antennas, the resulting high-dimensional detection problem brings formidable computational challenges, which motivates our investigation.

### I-A Related Works

We consider the grant-free massive random access scheme in a network consisting of one multi-antenna BS and a massive number of devices with small data payloads, where each message is assigned a unique signature sequence. By exploiting the sparsity structure in both the device activity state and data transmission, joint device activity detection and data decoding can be achieved by leveraging compressed sensing techniques [10, 7]. Recently, a covariance-based method has been proposed to improve the performance of device activity detection [11], where the detection problem is solved by a coordinate descent algorithm with random sampling, i.e., it randomly selects a coordinate to update in each iteration. This covariance-based method has also been applied to joint detection and data decoding [9]. Furthermore, a phase transition analysis for covariance-based massive random access with massive MIMO has been provided in [12].

Although coordinate descent is an effective algorithm for solving the maximum likelihood estimation problem for joint activity detection and data decoding [9], existing works adopted a random coordinate selection strategy, which yields a slow convergence rate. Besides, a rigorous convergence rate analysis for this strategy has not yet been obtained. In this paper, our principal goal is to develop coordinate descent algorithms with more effective coordinate selection strategies for faster activity and data detection in massive random access, supported by rigorous convergence rate analysis.

Coordinate descent algorithms [13] with various coordinate selection strategies have been widely applied to optimization problems for which computing the full gradient of the objective function is computationally prohibitive. They enjoy a low per-iteration complexity, as only one or a few coordinates are updated in each iteration. In most previous works, e.g., [14, 15], each coordinate is selected uniformly at random at each time step. Recent studies have proposed more advanced coordinate selection strategies that exploit the structure of the data and sample the coordinates from an appropriate non-uniform distribution, e.g., [16]-[17], which outperform the random sampling strategy in terms of convergence rate.

Specifically, a convex optimization problem minimizing a strongly convex objective function was considered in [16], which proposed a Gauss-Southwell-Lipschitz rule that gives a faster convergence rate than choosing random coordinates. Subsequently, Perekrestenko et al. [18] improved the convergence rates of coordinate descent with an adaptive scheme for general convex objectives. Additionally, Zhao and Zhang [19] developed an importance sampling rule where the sampling distribution depends on the Lipschitz constants of the loss functions. The adaptive sampling strategies in [18, 19] require full information on all the coordinates, which yields a high computational complexity at each step. To address this issue, a recent study [17] exploited a bandit algorithm to learn a good approximation of the reward function, which characterizes how much the cost function decreases when the corresponding coordinate is updated. The coordinate descent algorithms proposed in all the works mentioned above solve convex optimization problems. Different from these works, the covariance-based estimation problem is non-convex. Hence, efficient algorithms with new reward functions and corresponding theoretical analysis are required, which brings unique challenges.

### I-B Contributions

In this paper, we propose coordinate descent algorithms with effective coordinate sampling strategies for faster activity and data detection in massive random access. Specifically, we develop a novel algorithm, i.e., coordinate descent with Bernoulli sampling. Inspired by [17], we cast the coordinate selection procedure as a multi-armed bandit (MAB) problem where a reward is received when selecting an arm (i.e., a coordinate), and we aim to maximize the cumulative reward over iterations. At each iteration, with probability $\varepsilon$ the coordinate with the largest reward is selected, and otherwise a coordinate is chosen uniformly at random. We provide a convergence rate analysis of coordinate descent with both Bernoulli sampling and random sampling in Theorem 1, which theoretically validates the advantages of the proposed algorithm. While the algorithm and analysis in [17] only considered convex objective functions, we extend them to the non-convex case.

The value of $\varepsilon$ plays a vital role in both the convergence rate and the computational cost. As demonstrated in Theorem 1, the larger the value of $\varepsilon$, the higher the probability of selecting the coordinate endowed with the largest reward. On the other hand, a larger value of $\varepsilon$ leads to a higher computational cost, since the rule of selecting the coordinate with the largest reward requires computing the rewards for all the coordinates. This motivates us to develop a more advanced algorithm, called coordinate descent with Thompson sampling, which adaptively adjusts the value of $\varepsilon$. In this algorithm, an inner MAB problem is established to learn the optimal value of $\varepsilon$, which is solved by a Thompson sampling algorithm. Theoretical analysis is provided to demonstrate that a logarithmic expected regret for the inner MAB problem is achieved. Different from the analysis of Thompson sampling in previous works, where the parameters of the beta distribution are required to be integers, i.e., [20, 21], our analysis applies to beta distributions whose parameters take more general real-valued forms.

Simulation results show that the proposed algorithms enjoy faster convergence rates with lower time complexity than the state-of-the-art algorithm. It is also demonstrated that coordinate descent with Thompson sampling further improves the convergence rate compared to coordinate descent with Bernoulli sampling. Furthermore, we show that the proposed algorithms can be applied to faster activity and data detection in more general scenarios, i.e., with low-precision (e.g., 1-4 bit) analog-to-digital converters (ADCs).

## II System Model and Problem Formulation

In this section, we introduce the system model for massive random access, a.k.a. massive connectivity. A covariance-based formulation is then presented for joint device activity detection and data decoding, which is solved by a coordinate descent algorithm with random sampling.

### II-A System Model

Consider an IoT network consisting of one BS equipped with $M$ antennas and $N$ single-antenna IoT devices. The channel state vector from device $i$ to the BS is denoted by

$$g_i \mathbf{h}_i \in \mathbb{C}^{M}, \quad i = 1, \ldots, N, \tag{1}$$

where $g_i$ is the pathloss component depending on the device location, and $\mathbf{h}_i$ is the Rayleigh fading component over multiple antennas that obeys the i.i.d. standard complex Gaussian distribution, i.e., $\mathbf{h}_i \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}_M)$. Due to the sporadic communications, only a few devices are active out of all $N$ devices at a given time instant [22]. For each active device, $k$ bits of data are transmitted, where $k$ is typically a small number. This is the case for many applications, e.g., sending an alarm signal requires only 1 bit. Our goal is to achieve joint device activity detection and data detection.

Assume the channel coherence block has length $L$. The length of the signature sequences ($L$) is generally smaller than the number of devices, i.e., $L < N$, due to the massive number of devices and the limited channel coherence block [9, 22]. We first define a unique signature sequence set for the $N$ devices: each $k$-bit message of each device is assigned a unique sequence. With $R = 2^k$, this sequence set is known at the BS:

$$\mathbf{Q} = [\mathbf{Q}_1 \ \cdots \ \mathbf{Q}_N] \in \mathbb{C}^{L \times NR}, \tag{2}$$

where $\mathbf{Q}_i = [\mathbf{q}_i^1, \ldots, \mathbf{q}_i^R] \in \mathbb{C}^{L \times R}$ for $i = 1, \ldots, N$. We assume that all the signature sequences are generated from the i.i.d. standard complex Gaussian distribution, and are known to the BS. If the $i$-th device is active and aims to send a certain message of $k$ bits, it transmits the corresponding sequence from $\mathbf{Q}_i$. Specifically, the indicator of whether the $r$-th sequence of the $i$-th device is transmitted is defined as follows: $a_i^r = 1$ if the $i$-th device transmits the $r$-th sequence; otherwise, $a_i^r = 0$. By detecting which sequences are transmitted based on the received signal, i.e., estimating $\{a_i^r\}$, the BS achieves joint activity detection and data decoding. In this way, the information bits are embedded in the transmitted sequence, and no extra payload data need to be transmitted, which is very efficient for transmitting a small number of bits [8]. Since at most one sequence is transmitted by each device, it holds that $\sum_{r=1}^{R} a_i^r \leq 1$, where $\sum_{r=1}^{R} a_i^r = 0$ indicates that device $i$ is inactive; otherwise, it is active. The received signal $\mathbf{y}(\ell) \in \mathbb{C}^{M}$ at the BS is represented as

$$\mathbf{y}(\ell) = \sum_{i=1}^{N} \sum_{r=1}^{R} \mathbf{h}_i a_i^r q_i^r(\ell) + \mathbf{n}(\ell), \tag{3}$$

where $\mathbf{n}(\ell) \in \mathbb{C}^{M}$ is the additive noise such that $\mathbf{n}(\ell) \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I}_M)$ for all $\ell$.

Collect the received signal over the $L$ symbol durations as $\mathbf{Y} = [\mathbf{y}(1), \ldots, \mathbf{y}(L)]^{\top} \in \mathbb{C}^{L \times M}$, and the additive noise as

$$\mathbf{N} = [\mathbf{n}(1), \ldots, \mathbf{n}(L)]^{\top} \in \mathbb{C}^{L \times M}. \tag{4}$$

The channel matrix is concatenated as

$$\mathbf{H} = [\mathbf{H}_1, \ldots, \mathbf{H}_N]^{\top} \in \mathbb{C}^{NR \times M} \tag{5}$$

with $\mathbf{H}_i$ consisting of $R$ repeated rows $\mathbf{h}_i^{\top}$ for $i = 1, \ldots, N$. Recall the signature sequences defined in (2), and then the model (3) can be reformulated as [9]:

$$\mathbf{Y} = \mathbf{Q} \boldsymbol{\Gamma}^{\frac{1}{2}} \mathbf{H} + \mathbf{N}, \tag{6}$$

where the block diagonal matrix $\boldsymbol{\Gamma} = \mathrm{diag}(\boldsymbol{\Gamma}_1, \ldots, \boldsymbol{\Gamma}_N) \in \mathbb{R}^{NR \times NR}$ with $\boldsymbol{\Gamma}_i = g_i\, \mathrm{diag}(a_i^1, \ldots, a_i^R)$ being the diagonal activity matrix of the $i$-th device. Let $\boldsymbol{\gamma} \in \mathbb{R}^{NR}$ denote the diagonal entries of $\boldsymbol{\Gamma}$, where $\gamma_i^r = g_i a_i^r$ for $i = 1, \ldots, N$ and $r = 1, \ldots, R$. Our goal is to detect the values of the indicators $\{a_i^r\}$ from the received matrix $\mathbf{Y}$ with the knowledge of the pre-defined sequence matrix $\mathbf{Q}$.

### II-B Problem Analysis

To achieve this goal, recent works have developed compressed sensing based approaches [10, 23, 24] that recover $\boldsymbol{\Gamma}^{\frac{1}{2}}\mathbf{H}$ from $\mathbf{Y}$ by exploiting the group sparsity structure of $\boldsymbol{\Gamma}^{\frac{1}{2}}\mathbf{H}$. The indicators can then be determined from the rows of the recovered matrix. However, such an approach usually suffers from an algorithmic complexity dominated by the high dimension of $\boldsymbol{\Gamma}^{\frac{1}{2}}\mathbf{H}$ in massive IoT networks. Furthermore, with messages embedded in the signature sequences, there is no need to estimate the channel state information [9], and thus recent papers [9, 11] have focused on directly detecting activity via estimating $\boldsymbol{\gamma}$ instead.

Specifically, the estimation of $\boldsymbol{\gamma}$ can be formulated as a maximum likelihood estimation problem. Given $\boldsymbol{\gamma}$, each column of $\mathbf{Y}$, denoted as $\mathbf{y}_m \in \mathbb{C}^{L}$ for $m = 1, \ldots, M$, can be treated as an independent sample from a multivariate complex Gaussian distribution such that [9]:

$$\mathbf{y}_m \sim \mathcal{CN}(\mathbf{0}, \boldsymbol{\Sigma}), \tag{7}$$

where $\boldsymbol{\Sigma} = \mathbf{Q} \boldsymbol{\Gamma} \mathbf{Q}^{\mathsf{H}} + \sigma^2 \mathbf{I}_L$ with the identity matrix $\mathbf{I}_L$. Based on (7), the likelihood of $\mathbf{Y}$ given $\boldsymbol{\gamma}$ is represented as [9]

$$p(\mathbf{Y} \mid \boldsymbol{\gamma}) = \frac{1}{|\pi \boldsymbol{\Sigma}|^{M}} \exp\left(-\mathrm{Tr}\left(\boldsymbol{\Sigma}^{-1} \mathbf{Y} \mathbf{Y}^{\mathsf{H}}\right)\right),$$

where $|\cdot|$ and $\mathrm{Tr}(\cdot)$ are operators that return the determinant and the trace of a matrix, respectively. The maximum likelihood estimation problem can then be formulated as minimizing the negative log-likelihood (scaled by $1/M$ and up to an additive constant):

$$\underset{\boldsymbol{\gamma} \in \mathbb{R}^{NR}}{\text{minimize}} \quad \log|\boldsymbol{\Sigma}| + \frac{1}{M} \mathrm{Tr}\left(\boldsymbol{\Sigma}^{-1} \mathbf{Y} \mathbf{Y}^{\mathsf{H}}\right) \quad \text{subject to} \quad \boldsymbol{\gamma} \geq 0, \ \|\boldsymbol{\gamma}_i\|_0 \leq 1, \ i = 1, 2, \ldots, N, \tag{8}$$

where $\boldsymbol{\gamma} \geq 0$ means that each element of $\boldsymbol{\gamma}$ is greater than or equal to $0$, and $\|\cdot\|_0$ denotes the $\ell_0$ norm. This covariance-based approach was first proposed in [11] for activity detection, and then extended to joint activity and data detection in [9]. Based on the estimated $\hat{\boldsymbol{\gamma}}$ and a pre-defined threshold $s_{\mathrm{th}}$, the indicator $a_i^r$ can be determined by

$$a_i^r = \begin{cases} 1, & \text{if } \hat{\gamma}_i^r \geq s_{\mathrm{th}} \text{ and } \hat{\gamma}_i^r = \max_{j=1,\ldots,R} \hat{\gamma}_i^j, \\ 0, & \text{else.} \end{cases} \tag{9}$$

From $a_i^r$, which indicates whether the $r$-th sequence is transmitted by the $i$-th device, the activity state of the $i$-th device and the transmitted data can be determined, i.e., achieving joint activity detection and data decoding.
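The detection rule (9) can be sketched as follows, vectorized over devices; the threshold value used here is purely illustrative.

```python
import numpy as np

def detect(gamma_hat, N, R, s_th):
    """Apply Eq. (9): a_i^r = 1 iff gamma_hat_i^r clears the threshold s_th
    and is the largest among device i's R entries; at most one 1 per device."""
    G = gamma_hat.reshape(N, R)
    a = np.zeros((N, R), dtype=int)
    rows = np.arange(N)
    best = G.argmax(axis=1)        # index of each device's largest entry
    mask = G[rows, best] >= s_th   # keep only devices above the threshold
    a[rows[mask], best[mask]] = 1
    return a
```

For instance, `detect(np.array([0.9, 0.1, 0.0, 0.05]), N=2, R=2, s_th=0.5)` declares device 0 active with message 0 and device 1 inactive.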

For ease of algorithm design, an alternative way to solve problem (8) was developed in [9]. By dropping the $\ell_0$-norm constraints, it yields

$$\underset{\boldsymbol{\gamma} \geq 0}{\text{minimize}} \quad F(\boldsymbol{\gamma}) := \log|\boldsymbol{\Sigma}| + \frac{1}{M} \mathrm{Tr}\left(\boldsymbol{\Sigma}^{-1} \mathbf{Y} \mathbf{Y}^{\mathsf{H}}\right). \tag{10}$$

The first term in (10) is a concave function that makes the objective nonconvex, thereby bringing a unique challenge. The paper [9] showed that the estimator of problem (10) obtained by coordinate descent is approximately sparse, and thus the dropped $\ell_0$ constraints can be approximately satisfied. Specifically, it demonstrated that as the sample size, i.e., $M$, increases, the estimator of problem (10) concentrates around the ground truth and becomes an approximately sparse vector, which implies that the $\ell_0$ constraints are satisfied approximately when $M$ is large. Motivated by its low per-iteration complexity, the papers [9, 11] developed a coordinate descent algorithm to solve the relaxed problem (10), which updates the coordinates of $\boldsymbol{\gamma}$ randomly until convergence (illustrated in Algorithm 1). However, such a simple coordinate update rule yields a less aggressive convergence rate, and lacks rigorous convergence rate analysis with theoretical guarantees. In this paper, we aim to design a novel sampling strategy for coordinate descent to improve its convergence rate.
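As a reference for the algorithms that follow, the relaxed objective $F(\boldsymbol{\gamma})$ in (10) can be evaluated directly; a minimal sketch, where `sigma2` is the noise variance $\sigma^2$:

```python
import numpy as np

def objective_F(gamma, Q, Y, sigma2):
    """F(gamma) = log|Sigma| + Tr(Sigma^{-1} Y Y^H) / M with
    Sigma = Q diag(gamma) Q^H + sigma2 * I, as in Eq. (10)."""
    L, M = Y.shape
    Sigma = (Q * gamma) @ Q.conj().T + sigma2 * np.eye(L)  # Q diag(gamma) Q^H + sigma2 I
    _, logdet = np.linalg.slogdet(Sigma)                   # stable log-determinant
    Sigma_hat = Y @ Y.conj().T / M                         # sample covariance
    return logdet + np.trace(np.linalg.solve(Sigma, Sigma_hat)).real
```

With $\boldsymbol{\gamma} = \mathbf{0}$ and $\sigma^2 = 1$, the value reduces to $\mathrm{Tr}(\mathbf{Y}\mathbf{Y}^{\mathsf{H}})/M$, which gives a quick sanity check.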

There have been many efforts to push the efficiency of coordinate descent algorithms by developing more sophisticated coordinate update rules. Concerning supervised learning problems, previous works [18, 15] have demonstrated that the coordinate descent algorithm can yield better convergence guarantees when exploiting the structure of the data and sampling the coordinates from an appropriate non-uniform distribution. Furthermore, the paper [17] proposed a multi-armed bandit based coordinate selection method that can be applied to minimize convex objective functions, e.g., Lasso, logistic and ridge regression. Inspired by [17], we shall apply the idea of Bernoulli sampling to solve the estimation problem (10) with a non-convex objective function for joint activity and data detection. In the remainder of the paper, we first present a basic coordinate descent algorithm with Bernoulli sampling in Section III, followed by a more efficient algorithm with Thompson sampling in Section IV, both with rigorous analysis. Simulation results are provided in Section VI.
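As a baseline, one step of the random-sampling coordinate descent can be sketched as follows. The closed-form single-coordinate minimizer of $F$ and the rank-one refresh of $\boldsymbol{\Sigma}^{-1}$ follow the covariance-based approach of [11]; the exact form used in Algorithm 1 may differ in details, and the noise variance is normalized to one (the caller initializes `Sigma_inv` accordingly).

```python
import numpy as np

def cd_random_step(gamma, Sigma_inv, Q, Sigma_hat, rng):
    """One coordinate-descent step with random coordinate selection (a sketch of
    Algorithm 1). delta minimizes F along coordinate k in closed form, clipped
    so gamma_k stays nonnegative; Sigma^{-1} is refreshed via Sherman-Morrison."""
    k = int(rng.integers(len(gamma)))                 # random sampling
    q = Q[:, k:k + 1]
    s = (q.conj().T @ Sigma_inv @ q).real.item()      # q^H Sigma^{-1} q
    t = (q.conj().T @ Sigma_inv @ Sigma_hat @ Sigma_inv @ q).real.item()
    delta = max((t - s) / (s * s), -gamma[k])         # feasible closed-form step
    gamma[k] += delta
    u = Sigma_inv @ q                                  # rank-one inverse update
    Sigma_inv -= (delta / (1.0 + delta * s)) * (u @ u.conj().T)
    return k, delta
```

Each step can only decrease the objective, since the unconstrained one-dimensional minimizer is clipped back to the feasible region.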

## III Coordinate Descent with Bernoulli Sampling

In this section, a basic algorithm, coordinate descent with Bernoulli sampling, is developed. We begin by introducing a reward function for each coordinate, which quantifies the decrease of the objective function in (10) achieved by updating the corresponding coordinate. Based on the reward function, a coordinate descent algorithm with Bernoulli sampling (CD-Bernoulli) is proposed for joint device activity and data detection. The convergence rate of the proposed algorithm will be provided, and compared with that of coordinate descent with random sampling [11].

### III-A Reward Function

The coordinate selection strategy depends on the update rule for the decision variable $\gamma_k$, $k = 1, \ldots, NR$. The update rule with respect to the $k$-th coordinate is denoted as $\mathcal{U}_k$ and is illustrated by Lines 5-7 in Algorithm 1. The following lemma quantifies the decrease of the objective obtained by updating a coordinate according to the update rule $\mathcal{U}_k$; this decrease serves as the reward function in our proposed algorithm and is denoted as $r_k^t$.

###### Lemma 1.

Considering problem (10), and choosing coordinate $k$ and updating it with the update rule $\mathcal{U}_k$, we have the bound $F(\boldsymbol{\gamma}^t) - F(\mathcal{U}_k(\boldsymbol{\gamma}^t)) \geq r_k^t$, where

$$r_k^t = \frac{\mathbf{a}_k^{\mathsf{H}} \boldsymbol{\Sigma}^{-1} \hat{\boldsymbol{\Sigma}}_y \boldsymbol{\Sigma}^{-1} \mathbf{a}_k}{1 + \delta\, \mathbf{a}_k^{\mathsf{H}} \boldsymbol{\Sigma}^{-1} \mathbf{a}_k}\, \delta - \log\left(1 + \delta\, \mathbf{a}_k^{\mathsf{H}} \boldsymbol{\Sigma}^{-1} \mathbf{a}_k\right), \tag{11}$$

with $\hat{\boldsymbol{\Sigma}}_y = \frac{1}{M}\mathbf{Y}\mathbf{Y}^{\mathsf{H}}$ the sample covariance and $\delta$ the update applied to the $k$-th coordinate.
###### Proof.

Please refer to Appendix A for details. ∎

A greedy algorithm based on Lemma 1 would simply select, at each time $t$, the coordinate with the largest $r_k^t$. However, the cost of computing the reward functions for all the coordinates is prohibitively high, especially with a large number of devices. To address this issue, the paper [17] adopted a principled approach using a bandit framework for learning the best rewards, instead of exactly computing all of them. Inspired by this idea, at each step $t$ we select a single coordinate $k_t$ and update it according to the rule $\mathcal{U}_{k_t}$. The reward $r_{k_t}^t$ is computed and used as feedback to adapt the coordinate selection strategy with Bernoulli sampling. Thus, only partial information is needed for coordinate selection, which reduces the computational complexity of each iteration. Details of the algorithm are provided in the following subsection.
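The reward (11) can be computed from cached quantities; a sketch, where `q_k` denotes the signature column $\mathbf{a}_k$ and `delta` the coordinate update:

```python
import numpy as np

def reward_rk(q_k, Sigma_inv, Sigma_hat, delta):
    """Reward r_k^t of Eq. (11): the decrease of F obtained by updating
    coordinate k by delta, computed from Sigma^{-1} and the sample covariance."""
    s = (q_k.conj().T @ Sigma_inv @ q_k).real.item()
    t = (q_k.conj().T @ Sigma_inv @ Sigma_hat @ Sigma_inv @ q_k).real.item()
    return t * delta / (1.0 + delta * s) - np.log1p(delta * s)
```

Note that the reward is zero at `delta = 0` and positive at the descent-optimal step, matching its role as a measure of objective decrease.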

### III-B Algorithm and Analysis

Consider a multi-armed bandit (MAB) problem where there are $NR$ arms (coordinates in our setting) from which a bandit algorithm can select for a reward, i.e., $r_k^t$ as in (11) at time $t$. The MAB aims to maximize the cumulative reward received over $T$ rounds, i.e., $\sum_{t=1}^{T} r_{k_t}^t$, where $k_t$ is the arm (coordinate) chosen at time $t$. After the $t$-th round, the MAB only receives the reward of the selected arm (coordinate), which is used to adjust its arm (coordinate) selection strategy for the next round. For more background on the MAB problem, please refer to [25].

Based on the MAB problem introduced above, the CD-Bernoulli algorithm is illustrated in Algorithm 2.

To address the computational complexity issue of the greedy algorithm, which requires computing the reward function $r_k^t$ for all $k$ at each round $t$, Algorithm 2 only periodically recomputes the reward function of all the coordinates (please refer to Lines 4-6 in Algorithm 2). In the remaining rounds, $r_k^t$ is estimated based on the most recently observed rewards in the MAB. The coordinate selection policy is as follows: with probability $1 - \varepsilon$ a coordinate is chosen uniformly at random, while with probability $\varepsilon$ the coordinate endowed with the largest estimated reward is chosen. This mimics the $\varepsilon$-greedy approach for conventional MAB problems [25], and achieves a trade-off between exploration and exploitation, that is, between choosing the coordinate with currently the largest reward and exploring other coordinates. The $k_t$-th coordinate of $\boldsymbol{\gamma}$ is then updated according to the update rule $\mathcal{U}_{k_t}$, and the $k_t$-th entry of the estimated reward function is updated with the rest unchanged.
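The Bernoulli sampling policy itself is a one-liner; a sketch, where `r_est` holds the current reward estimates:

```python
import numpy as np

def select_coordinate(r_est, eps, rng):
    """Bernoulli sampling: with probability eps exploit the coordinate with the
    largest estimated reward; otherwise explore a coordinate uniformly at random."""
    if rng.random() < eps:
        return int(np.argmax(r_est))        # exploitation
    return int(rng.integers(len(r_est)))    # exploration
```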

The following result characterizes the convergence rate of coordinate descent for joint activity and data detection with two different coordinate selection strategies, i.e., random sampling and Bernoulli sampling. The estimation error is defined as

$$\epsilon(\boldsymbol{\gamma}) = F(\boldsymbol{\gamma}) - F(\boldsymbol{\gamma}^{\star}) \tag{12}$$

with $\boldsymbol{\gamma}^{\star}$ an optimal solution of problem (10). In contrast to the previous work [17], which concerns an objective consisting of a smooth convex function and a convex regularizer, this paper considers $F$ in (10), which consists of a concave function and a convex function. Denoting the best arm (coordinate) under the estimated rewards in Algorithm 2 as $k^{\star}$, we have the following convergence result.

###### Theorem 1.

Assume that at each iteration $t$ the reward estimates are sufficiently accurate, with an accuracy constant depending on $\varepsilon$ and the rewards in (11). Then the iterate at the $t$-th iteration of the CD-Bernoulli algorithm (illustrated in Algorithm 2) for solving problem (10) obeys

$$\mathbb{E}[\epsilon(\boldsymbol{\gamma}^t)] \leq \frac{\alpha}{1 + t - t_0}, \tag{13}$$

where $\alpha$ is a constant determined by $\varepsilon$ and the rewards $r_k^t$ defined in (11), and $t_0$ is a fixed initial iteration index. Furthermore, the CD-Random algorithm (illustrated in Algorithm 1) for solving problem (10) yields a bound of the same $O(1/t)$ form with a larger constant.

###### Proof.

Please refer to Appendix E for details. ∎

We conclude from Theorem 1 that, by choosing proper values of the algorithm parameters (the specific values used in the experiments are given in Section VI) so that the constant $\alpha$ is sufficiently small, the bound for CD-Bernoulli improves upon the bound for CD-Random. Hence, Theorem 1 demonstrates that for solving covariance-based joint device activity detection and data decoding, CD-Bernoulli yields a faster convergence rate than CD-Random.

In Algorithm 2, the value of $\varepsilon$ plays a vital role in the balance between exploitation and exploration. The larger the value of $\varepsilon$, the higher the probability of selecting the coordinate endowed with the largest current reward (11) at each iteration. However, a larger value of $\varepsilon$ leads to insufficient exploration, which may result in a slow convergence rate. Instead of fixing $\varepsilon$, we prefer to develop a more flexible strategy for choosing it. This motivates the improved algorithm presented in the next section.

## IV Coordinate Descent with Thompson Sampling

In this section, we improve the convergence rate of the CD-Bernoulli algorithm by incorporating another bandit problem to adaptively choose $\varepsilon$. Specifically, we formulate the choice of the parameter $\varepsilon$ as a general Bernoulli bandit problem, and develop a Thompson sampling algorithm to solve it. Theoretical analysis is also presented to verify the advantage of Algorithm 3 over Algorithm 2.

### IV-A A Stochastic MAB Problem for Choosing ε

We first introduce a stochastic multi-armed bandit problem for optimizing the parameter $\varepsilon$ in Algorithm 2. In this paper, we assume that the reward distribution with respect to choosing $\varepsilon$ is Bernoulli, i.e., the rewards are either $0$ or $1$. Note that the reward with respect to choosing $\varepsilon$ is different from the reward function for selecting the coordinates defined by (11).

An algorithm for the MAB problem needs to decide which arm to play at each time step $t$, based on the outcomes of the previous plays. Let $\mu_j$ denote the (unknown) expected reward for arm $j$. The means of the arms are unknown and must be learned by playing the corresponding arms. A general objective is to maximize the expected total reward by time $T$, i.e., $\mathbb{E}[\sum_{t=1}^{T} \mu_{j_t}]$, where $j_t$ is the arm played at step $t$, and the expectation is over the random choices of $j_t$ made by the algorithm. The expected total regret can also be represented as the loss incurred by not playing the optimal arm in each step. Let $\mu^{\star} = \max_j \mu_j$ and $\Delta_j = \mu^{\star} - \mu_j$. Also, let $n_j(t)$ denote the number of times arm $j$ has been played up to step $t$. Then the expected total regret in time $T$ is given by [20]

$$\mathbb{E}[\mathcal{R}(T)] = \mathbb{E}\left[\sum_{t=1}^{T} \left(\mu^{\star} - \mu_{j_t}\right)\right] = \sum_{j} \Delta_j\, \mathbb{E}[n_j(T)].$$
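Given the per-arm gaps and play counts, the standard regret decomposition is easy to evaluate; a sketch:

```python
import numpy as np

def expected_regret(mu, plays):
    """Regret decomposition: sum_j Delta_j * n_j(T), with Delta_j = max(mu) - mu_j."""
    mu = np.asarray(mu, dtype=float)
    plays = np.asarray(plays, dtype=float)
    return float((mu.max() - mu) @ plays)
```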

### IV-B Thompson Sampling

We first present some background on the Thompson sampling algorithm for the Bernoulli bandit problem, i.e., the setting where the rewards are either $0$ or $1$, and for arm $j$ the probability of success (reward $= 1$) is $\mu_j$. More details on Thompson sampling can be found in [26] and [20].

It is convenient to adopt the beta distribution as the Bayesian prior on the Bernoulli means $\mu_j$. Specifically, the probability density function (pdf) of $\mathrm{Beta}(\alpha, \beta)$, i.e., the beta distribution with parameters $\alpha > 0$ and $\beta > 0$, is given by

$$f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1},$$

with $\Gamma(\cdot)$ being the gamma function. If the prior is a $\mathrm{Beta}(\alpha, \beta)$ distribution, then after a Bernoulli trial, the posterior distribution is $\mathrm{Beta}(\alpha + 1, \beta)$ when the trial is a success; otherwise, it is updated as $\mathrm{Beta}(\alpha, \beta + 1)$.

Previous studies of the Thompson sampling algorithm, e.g., [20], generally assumed that $\alpha$ and $\beta$ are integers. The algorithm initially places the prior $\mathrm{Beta}(1, 1)$ on each arm's mean $\mu_j$, which is natural because $\mathrm{Beta}(1, 1)$ is the uniform distribution on the interval $[0, 1]$. At time $t$, having observed $s_j(t)$ successes (reward $= 1$) and $f_j(t)$ failures (reward $= 0$) in $s_j(t) + f_j(t)$ plays of arm $j$, the algorithm updates the distribution on $\mu_j$ to $\mathrm{Beta}(1 + s_j(t), 1 + f_j(t))$. The algorithm then samples from these posterior distributions of the $\mu_j$'s, and plays an arm according to the probability of its mean being the largest.
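The classic integer-parameter scheme described above corresponds to the following sketch:

```python
import numpy as np

def thompson_select(successes, failures, rng):
    """Thompson sampling for Bernoulli bandits: sample mu_j ~ Beta(1 + s_j, 1 + f_j)
    for every arm and play the arm with the largest sample."""
    samples = rng.beta(1.0 + np.asarray(successes), 1.0 + np.asarray(failures))
    return int(np.argmax(samples))
```

An arm with many observed successes gets a posterior concentrated near 1 and is therefore played almost always, while rarely played arms retain wide posteriors and are still occasionally explored.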

Different from previous methods, in this paper we consider a more general way to update the parameters $\alpha$ and $\beta$ by evaluating the reward function $r_k^t$, as presented in the following subsection.

### IV-C CD-Thompson

The coordinate descent algorithm via Thompson sampling (CD-Thompson) is illustrated in Algorithm 3. In this algorithm, a stochastic MAB problem for learning the best $\varepsilon$ among a set of candidate arms is established, and a Thompson sampling algorithm is developed to solve this bandit problem. In Algorithm 3, the reward $r_{k_t}^t$ for selecting the $k_t$-th coordinate at time step $t$ is used to update the parameters $(\alpha_{j_t}, \beta_{j_t})$, thereby choosing $\varepsilon$ adaptively. To be specific, for the arm index $j_t$ and the Bernoulli variable $\nu_t^{j_t}$, if $\nu_t^{j_t} = 1$, we update

$$\alpha_{j_t} = \alpha_{j_t} + \nu_t^{j_t} \cdot r_{k_t}^t / F(\boldsymbol{\gamma}^t); \tag{14}$$

otherwise, we update

$$\beta_{j_t} = \beta_{j_t} + \left(1 - \nu_t^{j_t}\right) r_{k_t}^t / F(\boldsymbol{\gamma}^t), \tag{15}$$

where $r_{k_t}^t$ is defined in (11) and $F(\boldsymbol{\gamma}^t)$ is defined in (10). For illustration, the main processes of CD-Bernoulli and CD-Thompson are illustrated in Fig. 1.
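Unlike the integer updates of classic Thompson sampling, (14)-(15) add the normalized reward $r_{k_t}^t / F(\boldsymbol{\gamma}^t)$ as a fractional pseudo-count; a sketch:

```python
import numpy as np

def update_beta_params(alpha, beta, j, nu, r_k, F_val):
    """Fractional posterior update of Eqs. (14)-(15): the normalized reward is
    credited to alpha_j on a success (nu = 1) and to beta_j otherwise (nu = 0)."""
    alpha[j] += nu * r_k / F_val
    beta[j] += (1 - nu) * r_k / F_val
    return alpha, beta
```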

Recall that $\mu_j$ denotes the (unknown) expected reward for arm $j$. At time $t$, if arm $j$ has been played a sufficient number of times, its empirical mean is tightly concentrated around $\mu_j$ with high probability. In the following analysis, we assume that the first arm is the unique optimal arm, i.e., $\mu_1 > \mu_j$ for all $j \neq 1$. The expected regret for the stochastic MAB problem in Algorithm 3 is presented as follows.

###### Theorem 2.

The stochastic bandit problem for choosing $\varepsilon$ in Algorithm 3 has an expected regret of $O(\log T)$ in time $T$.

###### Proof.

Please refer to Appendix F for a brief summary of the proof. ∎

###### Remark 1.

Algorithm 2 adopts a fixed constant $\varepsilon$ as the probability of updating the coordinate with the largest reward function (i.e., coordinate-wise descent value) (11) at time step $t$, which lacks flexibility for a better exploration-exploitation trade-off. In contrast, Algorithm 3 improves the strategy of choosing the parameter $\varepsilon$ in Algorithm 2. This is achieved by establishing a stochastic multi-armed bandit problem for choosing the corresponding probability. This bandit problem addresses the exploitation/exploration trade-off by sequentially designing $\varepsilon$ at each time step $t$. Through this sequential decision process, Algorithm 3 is able to approximate the optimal value of the probability. Theoretically, Theorem 2 demonstrates that Algorithm 3 enjoys a logarithmic expected regret for the stochastic multi-armed bandit problem, which is typically the best one can expect. Furthermore, the exploitation/exploration trade-off in Algorithm 3 avoids maintaining a large value of $\varepsilon$ over many time steps, and thus avoids the high computational cost of computing $r_k^t$ for all $k$ at each time step.

###### Remark 2.

Different from the previous MAB based coordinate descent algorithm [17] that solves convex optimization problems, our proposed algorithms solve a covariance-based estimation problem that is non-convex. The beta distribution $\mathrm{Beta}(\alpha, \beta)$ is a powerful tool for learning priors on Bernoulli rewards. Specifically, we consider a more general way to update the parameters $\alpha$ and $\beta$ based on the reward function $r_k^t$. Our proposed algorithms turn out to enjoy faster convergence rates with modest time complexity.

## V Application to Massive Connectivity with Low-precision ADCs

While the formulation in Section II presents a basic massive connectivity system, the proposed algorithms, i.e., CD-Bernoulli and CD-Thompson, can also be applied to more general activity detection problems. In this section, we introduce massive connectivity with low-precision analog-to-digital converters (ADCs) as an example. Recently, the use of low-precision (e.g., 1-4 bit) ADCs in massive MIMO systems has been proposed to reduce cost and power consumption [27, 28, 29]. In the following, we illustrate how the proposed algorithms can be applied to this new scenario.

At each of the receive antennas, the ADC samples the received signal and uses a finite number of bits to represent the samples. Each entry of the received signal in (6) is quantized into a finite set of pre-defined values by a b-bit complex-valued quantizer \mathcal{Q}_c. The quantized received signal is thus represented by [29]

 \mathbf{Y}_q=\mathcal{Q}_c(\mathbf{Y})=\mathcal{Q}_c\big(\mathbf{Q}\boldsymbol{\Gamma}^{1/2}\mathbf{H}+\mathbf{N}\big), (16)

where the complex-valued quantizer \mathcal{Q}_c quantizes the real and imaginary parts separately. The underlying real-valued quantizer maps a real-valued input to one of 2^b bins, which are characterized by the set of thresholds r_1 < r_2 < \dots < r_{2^b-1}. An element of the output is assigned the pre-defined value of the z-th bin when the corresponding entry of the input falls in the z-th bin, i.e., the interval (r_{z-1}, r_z].

Generally, the quantization operation is nonlinear. For ease of applying coordinate descent algorithms to the quantized model, we linearize the quantizer. Based on Bussgang’s theorem, the quantizer output can be decomposed into a signal component plus a distortion term that is uncorrelated with the signal component [27], i.e.,

 \mathbf{Y}_q=(\mathbf{I}_M-\boldsymbol{\rho})\mathbf{Y}+\mathbf{W}_q, (17)

where \boldsymbol{\rho} is the real-valued diagonal matrix containing the distortion factors:

 \boldsymbol{\rho}=\begin{bmatrix}\rho_1 & & \\ & \ddots & \\ & & \rho_M\end{bmatrix}\approx\begin{bmatrix}2^{-2b_1} & & \\ & \ddots & \\ & & 2^{-2b_M}\end{bmatrix}, (18)

with b_m for m=1,\dots,M denoting the bit resolution of the scalar quantizer at the m-th antenna.

Since \mathbf{W}_q is uncorrelated with the signal component (\mathbf{I}_M-\boldsymbol{\rho})\mathbf{Y}, the covariance matrix of the quantizer output can be represented as

 \boldsymbol{\Sigma}_q=\mathbb{E}\big[\mathbf{Y}_q\mathbf{Y}_q^H\big]=(\mathbf{I}_M-\boldsymbol{\rho})\boldsymbol{\Sigma}(\mathbf{I}_M-\boldsymbol{\rho})+\boldsymbol{\rho}(\mathbf{I}_M-\boldsymbol{\rho})\,\mathrm{diag}(\boldsymbol{\Sigma}), (19)

where \boldsymbol{\Sigma} is defined in (7). Hence, the joint device activity detection and data decoding problem with low-precision ADCs can be formulated as

 \underset{\boldsymbol{\gamma}\ge 0}{\text{minimize}}\quad F(\boldsymbol{\gamma}):=\log\big|\boldsymbol{\Sigma}_q\big|+\frac{1}{M}\mathrm{Tr}\big(\boldsymbol{\Sigma}_q^{-1}\mathbf{Y}_q\mathbf{Y}_q^H\big). (20)

Problem (20) can be efficiently solved by the proposed algorithms, i.e., Algorithm 2 and Algorithm 3. Simulations will be presented in the next section.
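To make the quantized objective concrete, the following sketch evaluates the cost in (20) under the simplifying assumption of a common bit resolution across antennas, so that the distortion matrix reduces to a scalar rho ≈ 2^(-2b). All names and dimensions are illustrative, not the paper's exact implementation.

```python
import numpy as np

def quantized_cost(gamma, Q, rho, sigma2, Yq):
    """Sketch of the ML cost in (20) with a scalar distortion factor.

    gamma  : (N,) nonnegative activity/data variables
    Q      : (L, N) signature matrix
    rho    : scalar distortion factor, roughly 2**(-2*b)
    sigma2 : noise variance
    Yq     : (L, M) quantized received signal
    """
    L, M = Yq.shape
    Sigma = Q @ np.diag(gamma) @ Q.conj().T + sigma2 * np.eye(L)
    # Quantized covariance per (19): signal part shrunk by (1 - rho),
    # plus an uncorrelated distortion term on the diagonal.
    Sigma_q = (1 - rho) ** 2 * Sigma + rho * (1 - rho) * np.diag(np.diag(Sigma))
    _, logdet = np.linalg.slogdet(Sigma_q)
    return logdet + np.trace(np.linalg.solve(Sigma_q, Yq @ Yq.conj().T)).real / M

# Toy evaluation on a small random instance.
rng = np.random.default_rng(0)
L, N, M = 8, 12, 4
Q = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2 * L)
gamma = np.zeros(N)
gamma[3] = 1.0
Yq = rng.standard_normal((L, M)) + 1j * rng.standard_normal((L, M))
cost = quantized_cost(gamma, Q, 0.25, 0.1, Yq)
```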

## VI Simulation Results

In this section, we provide simulation results to demonstrate that the proposed algorithms enjoy faster convergence rates than coordinate descent with random sampling for joint device activity detection and data decoding. Furthermore, we apply our proposed algorithms to massive connectivity with low-precision ADCs.

### VI-A Simulation Settings and Performance Metric

Consider a single cell containing multiple devices, among which only a small number are active. The performance is characterized by the probability of missed detection.

The simulation settings are given as follows:

• The signature matrix \mathbf{Q} in (2) is generated from an i.i.d. standard complex Gaussian distribution, followed by normalization, i.e.,

 \mathbf{Q}\sim\mathcal{N}\big(\mathbf{0},\tfrac{1}{2L}\mathbf{I}_L\big)+i\,\mathcal{N}\big(\mathbf{0},\tfrac{1}{2L}\mathbf{I}_L\big).
• The channel matrix \mathbf{H} consists of Rayleigh fading components that follow an i.i.d. standard complex Gaussian distribution, i.e.,

 \mathbf{H}\sim\mathcal{N}\big(\mathbf{0},\tfrac{1}{2}\mathbf{I}_{NR}\big)+i\,\mathcal{N}\big(\mathbf{0},\tfrac{1}{2}\mathbf{I}_{NR}\big).

Meanwhile, the large-scale fading component in (1) for each device is given by a path-loss model in dB.

• The additive noise matrix \mathbf{N} is generated from an i.i.d. complex Gaussian distribution, i.e.,

 \mathbf{N}\sim\mathcal{N}\big(\mathbf{0},\tfrac{1}{2}\sigma_n^2\mathbf{I}_L\big)+i\,\mathcal{N}\big(\mathbf{0},\tfrac{1}{2}\sigma_n^2\mathbf{I}_L\big),

where the variance \sigma_n^2 is the background noise power normalized by the device transmit power. In the simulations, the background noise power and the transmit power of each device are set to fixed values in dBm.

• The performance metric is defined as follows. A missed detection occurs when a device is active but detected as inactive, or when a device is active and detected as active but its data is decoded incorrectly. Different probabilities of missed detection can be obtained by adjusting the threshold in (9). In the simulations, we choose a threshold that determines the active devices from the estimated \boldsymbol{\gamma}.
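The settings above can be sketched as follows. The specific dimensions, the unit large-scale fading, and the function name are illustrative assumptions; the paper's actual path-loss values are not reproduced.

```python
import numpy as np

def generate_instance(L, N, K, M, sigma2_n, rng):
    """Generate one simulation instance: i.i.d. complex Gaussian
    signatures normalized by 1/sqrt(L) in standard deviation (per
    column-energy convention), Rayleigh channels, and complex Gaussian
    noise; K of the N devices are active."""
    Q = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2 * L)
    H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    noise = np.sqrt(sigma2_n / 2) * (rng.standard_normal((L, M)) + 1j * rng.standard_normal((L, M)))
    active = rng.choice(N, size=K, replace=False)
    gamma = np.zeros(N)
    gamma[active] = 1.0
    # Received signal Y = Q * Gamma^{1/2} * H + N; here Gamma is 0/1 so
    # Gamma^{1/2} = diag(gamma).
    Y = Q @ np.diag(gamma) @ H + noise
    return Q, H, gamma, Y

Q, H, gamma, Y = generate_instance(L=16, N=40, K=5, M=8, sigma2_n=0.1,
                                   rng=np.random.default_rng(1))
```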

The following three algorithms are compared:

• Proposed coordinate descent with Bernoulli sampling (CD-Bernoulli): Problem (10) is solved by Algorithm 2. Note that the computational time increases as the probability of updating the coordinate with the largest reward increases, while the convergence rate of CD-Bernoulli decreases as this probability decreases. We thus pick a modest value to illustrate the performance of CD-Bernoulli.

• Proposed coordinate descent with Thompson sampling (CD-Thompson): Problem (10) is solved by Algorithm 3.

• Coordinate descent with random sampling (CD-Random): Problem (10) is solved by Algorithm 1, with the coordinate to update chosen uniformly at random.

All algorithms stop when the relative change of the objective function falls below a given level, i.e.,

 \frac{\big|F(\boldsymbol{\gamma}^{t+1})-F(\boldsymbol{\gamma}^{t})\big|}{\big|F(\boldsymbol{\gamma}^{t})\big|}\le 10^{-6},

or when the number of iterations exceeds a prescribed maximum.
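The stopping rule can be wrapped in a small generic driver; the toy objective and step used below are illustrative stand-ins for the actual coordinate descent updates.

```python
def run_until_converged(step, F, x0, tol=1e-6, max_iters=10**4):
    """Iterate `step` until the relative change of the objective F
    falls below `tol`, or until `max_iters` iterations are reached."""
    x = x0
    f_prev = F(x)
    for t in range(1, max_iters + 1):
        x = step(x)
        f_curr = F(x)
        if abs(f_curr - f_prev) <= tol * abs(f_prev):
            return x, t
        f_prev = f_curr
    return x, max_iters

# Toy objective with minimum value 1 at x = 1; each step halves the error,
# so the relative change of F shrinks geometrically.
x_star, iters = run_until_converged(
    step=lambda x: x + 0.5 * (1.0 - x),
    F=lambda x: (x - 1.0) ** 2 + 1.0,
    x0=0.0,
)
```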

### VI-B Convergence Rate

In the simulations, the length of the signature sequences, the number of antennas, and the number of message bits per device are fixed. The convergence rates of different algorithms are illustrated in Fig. 2. We validate the convergence rate analysis in Theorem 1 by comparing CD-Bernoulli (i.e., Algorithm 2) with CD-Random (i.e., Algorithm 1). Furthermore, Fig. 2 shows that CD-Thompson, with its more sophisticated strategy for choosing the probability of updating the best coordinate, outperforms Algorithm 2. As illustrated in Fig. 2 and demonstrated in Theorem 1, a larger message size yields a larger number of coordinates, which leads to a slower convergence rate. In summary, this simulation shows that the proposed algorithms yield faster convergence rates than the state-of-the-art algorithm [9].

### VI-C Probability of Missed Detection

Under this setting, the computational time of the three algorithms is further illustrated in Fig. 3. It shows that the proposed algorithms achieve the same level of detection accuracy with much less computational time than the algorithm in [9]. The reason is that coordinate selection with Bernoulli sampling or Thompson sampling is able to choose a coordinate that yields a larger descent in the objective value. Additionally, Fig. 3 shows that Algorithm 3 further reduces the computational time compared to Algorithm 2. This is achieved by the better exploitation/exploration trade-off in Algorithm 3, which avoids maintaining a large updating probability over many time steps, and hence the high computational cost of evaluating the reward for all coordinates at each time step.

### VI-D Applications in Low-precision ADCs

In this part, we test the proposed algorithms with low-precision ADCs. For the quantization procedure, we use a typical uniform quantizer with quantization step-size s_q. For b-bit quantization, the thresholds of this uniform quantizer are given by

 r_z=\big(-2^{b-1}+z\big)s_q,\quad \text{for } z=1,\dots,2^b-1, (21)

and an element of the quantization output in (17) is assigned the pre-defined value of the z-th bin when the corresponding input entry falls in the z-th bin, i.e., the interval (r_{z-1}, r_z].
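A sketch of this uniform quantizer using the thresholds in (21), applied separately to the real and imaginary parts; assigning each bin its midpoint is a common convention assumed here, since the pre-defined output values are not restated in this section.

```python
import numpy as np

def uniform_quantize(x, b, sq):
    """b-bit uniform quantizer with thresholds r_z = (-2**(b-1) + z) * sq,
    applied elementwise to real and imaginary parts; outputs are bin
    midpoints (an assumption, see lead-in)."""
    def q_real(u):
        # Bin index z in {0, ..., 2**b - 1}: count thresholds lying below u,
        # clipping the two unbounded outer bins.
        z = np.clip(np.floor(u / sq) + 2 ** (b - 1), 0, 2 ** b - 1)
        # Midpoint of the z-th bin.
        return (z - 2 ** (b - 1) + 0.5) * sq
    return q_real(np.real(x)) + 1j * q_real(np.imag(x))

# Example: b = 2, sq = 1 gives thresholds {-1, 0, 1} for each part.
yq = uniform_quantize(np.array([0.3 - 1.7j]), b=2, sq=1.0)
```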

Under the same setting as Section VI-C, Fig. 4 shows the unquantized case and the quantized cases with different quantization levels. To further illustrate the computational cost of the proposed algorithms applied to low-precision ADCs, Fig. 5 shows the probability of missed detection with respect to the computational time. These results demonstrate that low-precision quantization is sufficient to achieve a convergence rate and accuracy similar to those of the unquantized scenario.

## VII Conclusions

In this paper, we developed efficient algorithms based on multi-armed bandits to solve the joint device activity detection and data decoding problem in massive random access. Specifically, we exploited a multi-armed bandit algorithm to learn which coordinate to update, thereby achieving a more aggressive descent of the objective function. To further improve the convergence rate, an inner multi-armed bandit problem was established to improve the exploration policy. The performance gains of the proposed algorithms over the state-of-the-art algorithm, in both convergence rate and time complexity, were demonstrated theoretically and empirically. Furthermore, our proposed algorithms can be applied to a more general scenario, i.e., activity and data detection with low-precision analog-to-digital converters (ADCs), thereby reducing cost and power consumption.

Our proposed algorithm only updates a single coordinate at each time step. It is interesting to further investigate the effect of choosing multiple coordinates from a budget at each time step. At a high level, the proposed approach can be regarded as an instance of “learning to optimize”, i.e., applying machine learning to solve optimization problems. Specifically, it belongs to optimization policy learning [30], which learns a specific policy for some optimization algorithm. One related work is [31], which learns the pruning policy of the branch-and-bound algorithm. It is interesting to apply such an approach to other optimization algorithms to improve the computational efficiency for massive connectivity.

## Appendix A Computation of the Reward Function

In this section, we derive the reward function for the multi-armed bandit problem for coordinate descent. Define k as the index of the selected coordinate and let F_k(d) := F(\boldsymbol{\gamma} + d\,\mathbf{e}_k), where \mathbf{e}_k denotes the k-th canonical basis vector with a single one at its k-th coordinate and zeros elsewhere. We can simplify F_k(d) as follows:

 F_k(d)=\log\big|\boldsymbol{\Sigma}\big|+\frac{1}{M}\mathrm{Tr}\big(\boldsymbol{\Sigma}^{-1}\mathbf{Y}\mathbf{Y}^H\big)+\log\big(1+d\,\mathbf{a}_k^H\boldsymbol{\Sigma}^{-1}\mathbf{a}_k\big)-\frac{\mathbf{a}_k^H\boldsymbol{\Sigma}^{-1}\hat{\boldsymbol{\Sigma}}_y\boldsymbol{\Sigma}^{-1}\mathbf{a}_k}{1+d\,\mathbf{a}_k^H\boldsymbol{\Sigma}^{-1}\mathbf{a}_k}\,d. (22)

According to [11], the global minimum of F_k(d) in d is attained at d=\delta, so the descent value of the cost function is:

 F(\boldsymbol{\gamma})-F_k(\delta)=F(\boldsymbol{\gamma})-F(\boldsymbol{\gamma}+\delta\,\mathbf{e}_k)=\frac{\mathbf{a}_k^H\boldsymbol{\Sigma}^{-1}\hat{\boldsymbol{\Sigma}}_y\boldsymbol{\Sigma}^{-1}\mathbf{a}_k}{1+\delta\,\mathbf{a}_k^H\boldsymbol{\Sigma}^{-1}\mathbf{a}_k}\,\delta-\log\big(1+\delta\,\mathbf{a}_k^H\boldsymbol{\Sigma}^{-1}\mathbf{a}_k\big). (23)

Hence, the reward function is defined as the coordinate-wise descent value in (23).
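The descent value in (23) is cheap to evaluate once Σ^{-1} is maintained. The following sketch computes it; the sample covariance is assumed to be Σ̂_y = Y Y^H / M, matching the trace term in (22), and the toy check verifies the formula against a direct evaluation of F on a 3×3 instance.

```python
import numpy as np

def descent_value(Sigma_inv, Sigma_hat, a_k, delta):
    """Coordinate-wise descent value F(gamma) - F(gamma + delta * e_k)
    per (23). Sigma_inv is the current Sigma^{-1}, Sigma_hat the sample
    covariance (assumed Y Y^H / M), and a_k the k-th signature column."""
    u = Sigma_inv @ a_k
    s = (a_k.conj() @ u).real            # a_k^H Sigma^{-1} a_k
    q = (u.conj() @ Sigma_hat @ u).real  # a_k^H Sigma^{-1} Sigma_hat Sigma^{-1} a_k
    return q * delta / (1.0 + delta * s) - np.log(1.0 + delta * s)

# Toy check: with Sigma = Sigma_hat = I and a_k = e_1, a step delta = 1
# changes Sigma to diag(2, 1, 1), so F drops by exactly 1/2 - log(2).
val = descent_value(np.eye(3), np.eye(3),
                    np.array([1.0, 0.0, 0.0], dtype=complex), 1.0)
```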

## Appendix B Preliminary Theorems for the Proof of Theorem 1

Several theorems are needed to pave the way for the proof of Theorem 1.

###### Theorem 3.

Recall the reward function defined in (11). Under the assumptions of Lemma 1, if we choose the coordinate with the largest reward at the t-th iteration, it yields the following linear convergence guarantee:

 \epsilon(\boldsymbol{\gamma}^{t})\le\epsilon(\boldsymbol{\gamma}^{0})\prod_{j=1}^{t}\bigg(1-\frac{\max_{k\in[d]}r_k^{j}}{\sum_{\ell}r_\ell^{j}}\bigg), (24)

for all t, where \epsilon(\boldsymbol{\gamma}^{t}) is the sub-optimality gap at \boldsymbol{\gamma}^{t}.

###### Proof.

Please refer to Appendix C for details. ∎

###### Theorem 4.

Under the assumptions of Lemma 1, we have the following convergence guarantee:

 \epsilon(\boldsymbol{\gamma}^{t})\le\frac{\eta}{2NR+t-t_0} (25)

for all t \ge t_0, where \epsilon(\boldsymbol{\gamma}^{t}) is the sub-optimality gap at \boldsymbol{\gamma}^{t} and \eta is an upper bound on \epsilon(\boldsymbol{\gamma}^{t}) for all iterations t \le t_0.

## Appendix C Proof of Theorem 3

The selection strategy concerned in this proof is to choose the coordinate with the largest reward function defined in (11), whose index is denoted by k^{\star}. Hence, based on the fact that \sum_{\ell} r_\ell^{t} \ge \epsilon(\boldsymbol{\gamma}^{t}), it yields that

 \epsilon(\boldsymbol{\gamma}^{t+1})-\epsilon(\boldsymbol{\gamma}^{t})=F(\boldsymbol{\gamma}^{t+1})-F(\boldsymbol{\gamma}^{t})\le-r_{k^{\star}}^{t}=-\sum_{\ell}r_\ell^{t}\cdot\frac{\max_{k\in[NR]}r_k^{t}}{\sum_{\ell}r_\ell^{t}}\le-\epsilon(\boldsymbol{\gamma}^{t})\,\frac{\max_{k\in[NR]}r_k^{t}}{\sum_{\ell}r_\ell^{t}}, (26)

which implies that

 \epsilon(\boldsymbol{\gamma}^{t+1})\le\epsilon(\boldsymbol{\gamma}^{t})-\epsilon(\boldsymbol{\gamma}^{t})\,\frac{\max_{k\in[NR]}r_k^{t}}{\sum_{\ell}r_\ell^{t}},