I Introduction
I-A Cognitive Radio
Emerging wireless technologies, such as 5G cellular networks and the machine-to-machine enabled industrial Internet of Things, are fueling an ever-increasing demand for access to the radio frequency spectrum. Cognitive radio (CR), an intelligent wireless technology able to recognize the surrounding radio environment [1], offers a communication paradigm for more efficient and flexible spectrum usage. A secondary user (SU) with CR capability monitors the spectrum utilization of a primary user (PU) and determines its access to that spectrum accordingly. Two fundamental challenges arise in the process: how to explore possible spectrum opportunities (spectrum sensing) and how to exploit such opportunities efficiently (spectrum access).
Spectrum sensing measures and perceives the surrounding radio spectrum state based on various signal processing methods, including matched filter detection [2, 3], cyclostationary detection [4, 5], and energy detection [6, 7, 8, 9, 10]. Matched filter detection can achieve the optimum performance, but the SU requires perfect a priori knowledge of the PU signaling features. Cyclostationary detection takes advantage of the cyclostationary feature of the signal to distinguish it from the stationary noise. In contrast, energy detection carries out a hypothesis test to determine the PU spectrum state, based on the energy of the collected PU signals. It features low computational complexity, and is widely adopted in the literature. Upon obtaining the radio spectrum state, spectrum access dynamically adjusts the resources available to the SU, including frequency band, transmission time and transmission power, and accesses the licensed spectrum by taking into account the interference to the PU. The SU can access the licensed spectrum either when the PU is idle (opportunistic access) [11], or concurrently with the PU following a power control strategy to constrain the interference to the PU (spectrum sharing) [12, 13]. Because the SU is then not restricted to the idle periods of the PU, appropriately designed spectrum sharing achieves higher throughput for the SU than opportunistic spectrum access.
It is worth noting that many contemporary wireless standards, such as IEEE 802.11 [14], GSM [15], and LTE [16], specify multiple transmit power levels to adapt dynamically to the fast-changing environment and varying quality of service (QoS). The majority of CR studies in the literature do not take this into account, and the SU usually adopts a binary approach, reporting the radio spectrum state as either idle or busy. In this multiple power level scenario, multi-parameter cognition is required, and the binary approach may not represent the most efficient spectrum utilization for the SU. The question then arises: how can the variation in the PU power levels be exploited to design an intelligent spectrum sharing strategy? Naturally, the answer to this question consists of spectrum sensing and spectrum access, which are elaborated on in the next two subsections.
I-B Multilevel Spectrum Sensing
The priority in the design of an agile spectrum sensing method is to accurately map the sensing samples received by the secondary transmitter (ST) to the corresponding primary transmitter (PT) power levels. In the conventional binary case (the PT is ON/OFF), there are two kinds of errors, and the goal of spectrum sensing is to determine a single detection threshold (the upper part of Fig. 1). For example, given the target probability of false alarm and the noise power, this threshold can be simply determined by the Neyman-Pearson criterion [17]. By contrast, the goal of the multilevel case is to jointly determine multiple thresholds to separate different power levels through a multiple hypothesis test (the middle part of Fig. 1), which is far more complicated than the binary one. In essence, there are many more kinds of errors, which are intertwined and exacerbate the complexity of the threshold calculation. In [18] and [19], the authors proposed an energy detection based multiple hypothesis test to derive the decision thresholds for multiple power level identification. The results were extended to scenarios with noise uncertainty [20] and non-Gaussian transmission signals [21]. However, the calculation of the thresholds in [18, 19, 20, 21] requires a large amount of prior knowledge at the ST, including the noise power and the PT transmit power mode (i.e., the number and exact values of the different transmit powers, and the prior probability of each hypothesis). In practice, these parameters are unlikely to be available to the ST a priori.
In this paper, we aim to break the limits of the existing work, and achieve multilevel spectrum sensing with no or minimal prior information. We deviate from the above classical signal processing approaches, and choose machine learning as the tool for knowledge discovery, to mine and extract the latent patterns that reflect the PT power level variation in the PT traffic data flow. On this basis, we propose a data-driven/machine learning based multilevel spectrum sensing scheme. It is fully blind in the sense that the ST does not require any prior knowledge of the noise power or the PT transmit power mode. Specifically, the proposed spectrum sensing scheme spans two stages, as shown in Fig. 3. In Stage I (spectrum learning, a.k.a. the training phase in machine learning), the ST collects a multitude of signals that experience multiple PT transmit power level variations, and uses Gaussian mixture models (GMMs) to capture the multilevel power characteristics inherent in the signals. Then, we introduce a Bayesian nonparametric method, referred to as the conditionally conjugate Dirichlet process GMM (DPGMM), to automatically cluster the signals with the same PT transmit power level (the lower part of Fig. 1) and infer the model parameters (GMM parameters and PT power level duration distribution parameters). With the model parameters inferred in Stage I, the prediction part in Stage II (see Fig. 3) can easily identify the current PT power level by collecting PT signal samples. In this way, Stage I together with the prediction part in Stage II achieves fully blind multilevel spectrum sensing. Note that the second part in Stage II (ST transmission) will be detailed next.
I-C Multilevel Spectrum Access
With the big picture of multilevel PT radio environments learned in Section I-B, we next aim to establish a multilevel spectrum access strategy, which is the ultimate goal of spectrum sensing. To the best of our knowledge, this represents the first effort in this direction. For the SU, there is a fundamental tradeoff between two conflicting goals, namely, maximization of its own throughput and minimization of its interference to the PU. As there is typically no cooperation between the PU and SU, it is extremely difficult to optimize this tradeoff in practice. To provide a pragmatic solution to this dilemma, we first propose a new metric, referred to as the normalized power level alignment (NPLA), defined as the proportion of time during which the ST matches its transmit power level to that of the PT.
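As a rough illustration of the NPLA definition above, the sketch below computes the fraction of discretized time units in which the ST level equals the PT level. The function name `npla` and the two traces are hypothetical and chosen only for illustration; the paper's own formal definition may include details not visible here.

```python
from typing import Sequence


def npla(pt_levels: Sequence[int], st_levels: Sequence[int]) -> float:
    """Fraction of time units in which the ST power level equals the PT power level."""
    if len(pt_levels) != len(st_levels):
        raise ValueError("both traces must cover the same time units")
    if not pt_levels:
        raise ValueError("at least one time unit is required")
    matches = sum(1 for p, s in zip(pt_levels, st_levels) if p == s)
    return matches / len(pt_levels)


# Hypothetical traces: the ST lags one time unit behind each PT level change.
pt_trace = [0, 0, 1, 1, 1, 2, 2, 0]
st_trace = [0, 0, 0, 1, 1, 1, 2, 2]
print(npla(pt_trace, st_trace))
```

In this toy trace the ST is misaligned in exactly the three time units that follow a PT level change, which is the kind of loss the prediction interval design aims to reduce.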
To optimize the NPLA performance, we propose two prediction-transmission structures (periodic and non-periodic) in Stage II for spectrum access, which enable the ST to closely follow the PT power level variation. As discussed before, the prediction part can identify the current PT power level. On this basis, the transmission part adjusts the ST power according to the required signal-to-interference-plus-noise ratio (SINR) at the primary receiver (PR). Specifically, the periodic structure features a fixed prediction interval, and is straightforward to implement. By contrast, the non-periodic structure dynamically determines the interval, which can be formulated as a partially observable decision problem. This motivates us to develop a new algorithm based on reinforcement learning [22], exploiting the PT power level duration distribution. This structure further improves the NPLA performance. Finally, we extend the prediction-transmission structure to an online scenario, where the number of PT power levels might change as a consequence of the PT adapting to environment fluctuation or QoS variation.
I-D Contribution
In a nutshell, we propose a learning-based two-stage spectrum sharing strategy for a CR network, enabling fully blind spectrum sensing when the PT power varies with time in multiple levels, and design an adaptive spectrum access strategy for the NPLA optimization. The main contributions of this paper can be summarized as follows.

We propose a novel data-driven/machine learning based multilevel blind spectrum sensing scheme. The conditionally conjugate DPGMM with Bayesian inference is introduced to automatically cluster the signals and infer the model parameters, which is key to predicting the PT power levels.

We propose a new metric, NPLA, to strike an excellent tradeoff between the secondary network throughput and the interference to the primary network.

To optimize the NPLA performance, we propose a prediction-transmission structure for spectrum access which enables the ST to closely follow the PT power level variation. Furthermore, the ST prediction interval is dynamically adjusted to achieve better performance.

The spectrum access method is extended to the online scenario to accommodate a more realistic situation, where the number of PT power levels might change after the initial spectrum learning.
I-E Related Work
Machine learning technology has recently played an important role in improving spectrum sensing. The work in [23] presented an adversarial machine learning approach to launch jamming attacks on CR and introduced a defense strategy. Several supervised and unsupervised machine learning algorithms for cooperative spectrum sensing (CSS) were investigated in [24]. In [25], the combination of the infinite GMM and CSS was proposed to detect primary user emulation attacks. In [26], a convolutional neural network-based CSS scheme was developed to detect multiple bands simultaneously. A mobile CSS framework was proposed in [27] for large-scale heterogeneous cognitive networks.
There are many efforts in sensing policy design for real-time decisions on which channel(s) to sense (dynamic multichannel selection). By contrast, our paper considers the single-user and single-channel case, and focuses on the design of multilevel spectrum sensing to differentiate PT power levels. On this basis, we also consider the policy design to dynamically adjust the sensing intervals to improve the NPLA performance.
The dynamic multichannel selection can be modelled as a partially observable Markov decision process (POMDP) [28]. The partial observation in [29, 30, 31, 32, 33, 34, 35] originates from each SU being unable to scan all the channels at any one time due to energy and hardware constraints. Therefore, a sensing policy needs to be developed to balance utilizing a spectrum opportunity for immediate access against collecting spectrum occupancy statistics to track spectrum opportunities for future exploitation. As finding an optimal channel sensing policy is in general computationally prohibitive as the number of channels increases, several efforts endeavor to find the optimal/near-optimal policy with low computational cost. In [29, 30, 31], the dynamic multichannel access problem is modelled as a restless multi-armed bandit problem. The time horizon is divided by interleaving exploration and exploitation epochs with growing lengths, and the optimal policy can be translated into determining the length and allocation of each epoch. Recently, deep reinforcement learning (DRL) based channel selection [32, 33, 34, 35] has attracted great attention, and it aims to handle correlated channels with unknown channel dynamics. The essence of DRL is to provide a good approximation of the objective value (Q-value), facilitating the handling of large state and action spaces.
It is worth noting that the access policy design (sensing interval) in our paper is also formulated as a POMDP, but the nature of our formulation is fundamentally different from that in [29, 30, 31, 32, 33, 34, 35]. The partial observation in our work comes from imperfect multilevel sensing results and access feedback. To tackle this challenging POMDP, we reduce the infinite time horizon to a finite one, leading to a computationally tractable solution. Most importantly, we mathematically prove that this reduction does not sacrifice the optimality of the utility.
I-F Organization and Notation
The rest of the paper is organized as follows. In Section II, we discuss the system model for our proposed two-stage spectrum sharing strategy. In Section III, we introduce a Bayesian nonparametric method and its inference for the model parameters. The prediction-transmission structure with an online extension, which is adaptive to the PT power level variation, is presented in Section IV. Simulation results and discussion are presented in Section V, followed by conclusions in Section VI.
denotes the Gaussian distribution with mean and precision , denotes the complex Gaussian distribution with mean and precision , and denotes the Gamma distribution with shape parameter and scale parameter . denotes the Gamma function and denotes the incomplete Gamma function. is the floor function. For convenience, we list the most important symbols in Table I.
Symbol  Definition 

,  The index and total number of the actual PT transmit power levels. 
,  The index and total number of the PT transmit power level estimated by the ST. 
,  The index of the PT hypothesis and the total number of the PT hypotheses in Stage I. 
,  The index of the ST action and the total number of sensing slots in Stage I. 
The PT transmission power value on the th level.  
The hypothesis that the PT transmits with .  
The hypothesis determined by the ST that the PT transmits with .  
The test statistic in the th single sensing slot. 

The total number of samples collected by the ST in a single sensing slot.  
The duration of a ST sensing slot, a ST transmission block, and a PT hypothesis.  
,  The discretized time of a ST transmission block and a PT hypothesis. 
,  The concentration parameter and the base probability distribution of the Dirichlet process. 
The latent variable indicating which component that is associated with.  
The total number of observations assigned to the th component.  
The value, precision, and mixing proportion of the th component in the GMM.  
, , ,  The hyperparameters in the conditionally conjugate DPGMM. 
The probability that the PT is operating under while the detection by the ST is in favor of .  
The probability that the PT transfers from the th transmit power level to the th level.  
The probability that the PT keeps operating with the th transmit power level at time .  
The action that the ST will take at time .  
The longest time that the SU should transmit when operating in the th transmit power level.  
The probability of correct PT power level prediction in Stage II.  
The NPLA performance from time 0 to . 
II System Model
We consider a spectrum sharing CR network in Fig. 2, with a primary network consisting of a PT and a PR, a secondary network consisting of a ST and a secondary receiver (SR), and a central site (broadcast tower). Transmission happens simultaneously in both networks sharing the same frequency band. Different from most CR networks considered in the literature, the PT operates with multiple (instead of binary: ON/OFF) power levels. The ST attempts to learn the model parameters, which will be defined later, and then optimize the spectrum access accordingly. The ultimate goal is to optimize the NPLA performance.
To achieve this target, we propose a novel two-stage spectrum sharing strategy, as illustrated in Fig. 3. Let , be the transmit powers of the PT, where and indicates an idle PT. For convenience, hypothesis indicates that the PT transmits with power . It is assumed that undergoes a slow change, as shown in the figure. We define the time duration of each hypothesis as a random variable , which is usually much larger than that of the ST sensing slot and the ST transmission block . In this paper, we consider a time discretized model, where is the minimum time unit. We define as the fixed discretized time duration of the ST transmission block, and as the varied discretized PT power level duration. As discussed before, the prior knowledge on the PT transmit power mode, defined as the number of transmit power levels , the exact values , and the prior probability of each hypothesis , is unknown to the ST, which is fundamentally different from the assumptions in [18, 19, 20, 21]. In the sequel, we describe in detail the operations of these two stages in Fig. 3.
II-A Stage I
In this stage, the ST samples the received PT signals at a sampling frequency and collects samples in each sensing slot with duration (for notation simplicity, we assume that is an integer). The ST observes sensing slots in Stage I and collects a total of samples. It is assumed that the learning period is reasonably large so that it covers all hypotheses (Footnote 1: There is a nonzero probability that some transmit power levels do not occur and are not observed by the ST during Stage I, even if the learning period is relatively large. In this case, these missed hypotheses can be viewed as small-probability events. Consequently, they will have negligible impact on the performance of the proposed spectrum access method, and can be ignored). Thus, the th sample in the th sensing slot under hypothesis can be given by
(1) 
where is the received primary signal in the th sensing slot with average power , and is the additive white Gaussian noise. Following [36], we assume that is an independent and identically distributed (i.i.d.) random variable with mean and variance . Following [36] and without loss of generality, we assume that is a complex PSK modulated signal (Footnote 2: For other modulation schemes, the test statistic still follows a Gaussian distribution [36]. Therefore, our proposed method remains valid for other modulation and/or adaptive modulation schemes).
The test statistic in the th sensing slot can be calculated as
(2) 
When is large, according to the central limit theorem, the distribution of under hypothesis can be approximated by a Gaussian one, and we have
(3) 
where is the received signaltonoise ratio (SNR) at the ST. Considering all the hypotheses, we establish that follows a mixed Gaussian distribution [37]
(4) 
where are the mixing coefficients with . Each Gaussian density is a component of the mixture with mean value and precision .
Given the observation set , the proposed Bayesian nonparametric method aims to infer the GMM parameter set , where with and . In other words, our method automatically clusters the signals with the same PT transmit power levels. In summary, Stage I establishes a big picture of the PT activities at the cost of a one-off overhead. After learning, the ST allocates the same number of power levels as that in the PT, with an initial ST power value for each level .
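The Stage I test statistic can be illustrated with a small simulation. The sketch below assumes a standard energy detector (average squared magnitude over one sensing slot), a constant-modulus QPSK signal, and circularly symmetric complex Gaussian noise; under those assumptions the per-slot statistic has mean noise_power * (1 + SNR) and variance noise_power^2 * (1 + 2 * SNR) / N. All function names and numeric values are hypothetical, and the exact forms of (2) and (3) in the paper are not reproduced here.

```python
import cmath
import math
import random


def sensing_slot_statistic(num_samples, signal_power, noise_power, rng):
    """Average received energy over one sensing slot (assumed energy detector)."""
    sigma = math.sqrt(noise_power / 2)  # per-dimension noise standard deviation
    total = 0.0
    for _ in range(num_samples):
        # QPSK symbol with constant modulus sqrt(signal_power).
        phase = math.pi / 4 + (math.pi / 2) * rng.randrange(4)
        symbol = math.sqrt(signal_power) * cmath.exp(1j * phase)
        # Circularly symmetric complex Gaussian noise with variance noise_power.
        noise = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        total += abs(symbol + noise) ** 2
    return total / num_samples


def gaussian_approximation(num_samples, signal_power, noise_power):
    """Mean and variance of the statistic under the central limit approximation."""
    snr = signal_power / noise_power
    mean = noise_power * (1 + snr)
    variance = noise_power ** 2 * (1 + 2 * snr) / num_samples
    return mean, variance


rng = random.Random(0)
stats = [sensing_slot_statistic(200, 2.0, 1.0, rng) for _ in range(500)]
empirical_mean = sum(stats) / len(stats)
empirical_var = sum((s - empirical_mean) ** 2 for s in stats) / (len(stats) - 1)
print(empirical_mean, empirical_var, gaussian_approximation(200, 2.0, 1.0))
```

Repeating the simulation for several signal powers produces one Gaussian cluster per power level, which is the mixture structure that Stage I sets out to learn.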
II-B Stage II
In this stage, we propose two prediction-transmission structures (periodic and non-periodic) adapting to the PT power level variation for spectrum access. The main features of the structures can be summarized as follows.

As shown in Fig. 3, Stage II consists of two parts: prediction and transmission actions. In the prediction action (the sensing slots with purple color), the ST can easily identify the current PT power level , which is jointly determined by the test statistic in (2) and the inferred GMM parameter set . In the transmission action, the ST allocates its transmit power level to match the latest identified PT power level (). Here, the corresponding can be determined as follows. Assume that the required SINR for the PR is and the current received SINR is . A nearby monitoring station (see Fig. 2) of the PR transmits to the central site, likely through optical fiber or microwave, which then broadcasts this information on a dedicated frequency. We assume that the ST is able to decode broadcast signals from the central site and communicate with the SR on different frequency bands. Through the broadcast nature, the ST obtains . If , the ST should reduce the transmit power and vice versa. In other words, for each level will gradually approach a desired power value. This guarantees that the PR is well protected, while the secondary network obtains the highest possible throughput (Footnote 3: A similar idea of using a broadcasting mechanism was also suggested by the Federal Communications Commission [38, p. 6] and adopted in 4G LTE systems in the form of the inter-cell interference overload indicator [39]). On this basis, it is clear that the alignment between the ST and PT power levels can optimize the tradeoff between the interference to the primary network and the throughput of the secondary network (Footnote 4: Note that the channel state information for PT-ST and ST-PR is not required in our approach. In addition, the use of the received SINR already includes the impact of the channel fading). If the ST mismatches the PT power level variation, either the SINR at the PR falls below the required level or the secondary network obtains unnecessarily low throughput.

For the periodic structure, the prediction intervals are fixed. By contrast, in the non-periodic structure, the intervals are dynamically determined to enhance the NPLA performance. As shown in Fig. 3, zero intervals are used to track the PT power level variation, while long intervals are selected to avoid unnecessary prediction. The non-periodic structure will be elaborated on in Section IV-C.
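The feedback-driven power adjustment in the transmission action can be sketched as a simple closed loop: the ST backs off when the broadcast SINR at the PR is below the requirement and ramps up otherwise. The multiplicative step rule, the toy SINR model, and every numeric value below are assumptions made for illustration; the text only states the direction of the adjustment, not its exact form.

```python
def adjust_st_power(st_power, received_sinr, required_sinr, step=0.05, max_power=10.0):
    """One adjustment step: back off if the PR is under-protected, otherwise ramp up."""
    if received_sinr < required_sinr:
        st_power *= 1.0 - step
    else:
        st_power *= 1.0 + step
    return min(max(st_power, 0.0), max_power)


def pr_sinr(pt_power, st_power, gain_pt_pr=1.0, gain_st_pr=0.2, noise_power=0.1):
    """Toy SINR at the PR; the gains and noise power are unknown to the ST."""
    return pt_power * gain_pt_pr / (st_power * gain_st_pr + noise_power)


required = 5.0   # hypothetical required SINR at the PR (linear scale)
st_power = 0.01  # hypothetical initial ST power for this level
for _ in range(300):
    # The ST only sees the broadcast SINR value, never the channel gains.
    st_power = adjust_st_power(st_power, pr_sinr(2.0, st_power), required)
print(st_power, pr_sinr(2.0, st_power))
```

In this toy model the ST power settles into a narrow band around the value that makes the received SINR equal to the requirement, without the ST ever using the channel gains directly.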
III Spectrum Learning Based on Bayesian Inference
In this section, we focus on Stage I, and introduce a Bayesian nonparametric method to infer the GMM parameter set based on the observation set . As is unknown a priori, traditional methods, such as K-means and expectation maximization, are inapplicable. This motivates us to resort to the Dirichlet process Gaussian mixture model (DPGMM) [40], which takes into account the Gaussian mixture property and is able to identify the unknown number of Gaussian components. For the specific Bayesian inference, we choose the Markov chain Monte Carlo based Gibbs sampling method [41] considering its simplicity.
In the following, we first review the preliminary knowledge on the Dirichlet process mixture model. On this basis, we introduce the DPGMM considering the specific distribution of the observation set . Furthermore, we modify the DPGMM to the conditionally conjugate case to simplify the inference process. Finally, we carry out Bayesian inference with the Gibbs sampling method to infer .
III-A Dirichlet Process Mixture Model
The Dirichlet distribution is an extension of the Beta distribution for multivariate cases. It represents the probability of events given that the th event () has been observed times. The probability density function can be expressed as
(5) 
where is the probability of the th event with and .
In our application, the event represents the th possible PT transmit power level, which can not be observed explicitly. Instead, the explicit observation is the test statistic . As is drawn from a distribution based on event , we introduce the Dirichlet process (DP) to define the distribution of . A random measure is said to be a Dirichlet process distributed with a base probability distribution and a concentration parameter , if we have
(6) 
for every finite measurable partition of . It is written as .
Next we model the observation set using the parameter based on the DP mixture model. A DP mixture model is suitable for the clustering purposes, where the number of Gaussian components is not known a priori. Here, can be regarded as an independent draw from the distribution , where each is an i.i.d. draw from a DP . Mathematically, the DP mixture model can be expressed as [42]
(7) 
Note that the formulation (7) represents the most general case. In our case, two different observations and () may follow the same distribution, and (7) can not explicitly reveal such property. Therefore, following [41], a latent variable is introduced to explicitly indicate which transmit power level that is associated with, and will be referred to as an indicator hereafter. Accordingly, an equivalent model can be obtained as
(8) 
where is the set of indicators, is the set of unique values in , and . Hereafter, refers to the total number of Gaussian components, and each component consists of the observations that are determined by the ST as having the same transmit power level. Note that is assumed to have a symmetric Dirichlet distribution, where all the concentration parameters are . This assumption is widely adopted when there is no prior knowledge of the mixing proportions [41]. Let denote the number of observations assigned to the th component, then follows a multinomial distribution
(9) 
where , and the distribution of the indicators is
(10) 
We can integrate out the mixing proportions of the product of in (8) and in (10), and the prior on in terms of is expressed as [40]
(11) 
As all the observations are exchangeable, if we assume that has been obtained, the conditional distribution for the individual indicator can be given by
(12) 
where is the number of samples excluding in the th component. Similarly, the prior distribution of over can be written as
(13) 
III-B Conditionally Conjugate Dirichlet Process Gaussian Mixture Model
Recall that the observation in (4) follows a mixed Gaussian distribution given by
(14) 
where replaces , as denotes the index of inferred hypotheses while is the index of the real hypothesis. Therefore, can be modeled as a DPGMM and expressed as
(15) 
In (15), represents a prior guess of the distributions of and in the DPGMM. Its choice is usually guided by mathematical convenience, and the conjugate form is widely adopted. In our case, the distribution of specifies the prior on the mixture Gaussian distributions parameters and , and it can be expressed in a conjugate form [37]
(16) 
where , , and are the hyperparameters for the DPGMM. It is clear that the prior distribution of is dependent on . This undesirable property is inevitable due to the conjugacy requirement for . To remove such dependency, we modify the original conjugate feature in the DPGMM and introduce the conditionally conjugate version of DPGMM to model . In a conditionally conjugate DPGMM, (16) can be rewritten as [37]
(17) 
where is the hyperparameter. To complete the conditionally conjugate DPGMM and capture the features inherent in X, we need to give suitable values for the hyperparameters. However, the exact values are hard to know a priori, and small changes on them will dramatically affect the model performance. To achieve the robustness for the model, we impose vague priors for the hyperparameters following [40],
(18) 
where the hyperpriors and refer to the empirical mean and precision of , respectively. In theory, the prior should not depend on the observations. However, as shown in [40], the formulation for the priors in (18) is equivalent to normalizing observations, and a wide range of parameters in the priors lead to similar inference results.
III-C Inference Using Gibbs Sampling
Given the conditionally conjugate DPGMM and observation X, now we use the Markov chain Monte Carlo algorithm based on Gibbs sampling to infer . With the latent variable set , we can obtain the mixing proportion set . In the algorithm, we update the variables iteratively by sampling each variable from the posterior distribution conditioned on the others.
Given the likelihood of and in (14) and their priors in (17), we can multiply the priors by the likelihood conditioned on and obtain the conditional posterior distributions of and , which can be given by
(19) 
where and the indicator function when and otherwise. Similarly, with the likelihood of the hyperparameters , , and given in (17), and their priors given in (18), the conditional posteriors of the hyperparameters can be written as
(20) 
Note that the exact distribution of is not given, but its samples can be obtained following [37]. In detail, we capture the log-concave property of and generate samples independently from the distribution of . Then, we use the adaptive rejection sampling technique [43] to transform these samples and obtain the value of .
Before introducing the conditional posterior for , we note that its prior in (12) only suits the case where is a fixed finite parameter. To make the conditionally conjugate DPGMM applicable to the scenario with an infinite number of Gaussian components, we let in (12) and the conditional prior reaches the following limits
(21) 
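The limit taken in (21) can be checked numerically. The sketch below assumes the standard finite-mixture conditional prior, in which the probability of joining a component is (count excluding the current observation + concentration / K) divided by (n - 1 + concentration), and its well-known limit as K grows. The function names and the example counts are hypothetical.

```python
def finite_conditional_prior(counts_excl, alpha):
    """Conditional prior of one indicator for a finite mixture with K = len(counts_excl)."""
    k = len(counts_excl)
    n_minus_1 = sum(counts_excl)
    return [(n_j + alpha / k) / (n_minus_1 + alpha) for n_j in counts_excl]


def infinite_conditional_prior(counts_excl, alpha):
    """Limit K -> infinity: occupied components first, then 'any new component' last."""
    n_minus_1 = sum(counts_excl)
    occupied = [n_j / (n_minus_1 + alpha) for n_j in counts_excl]
    new_component = alpha / (n_minus_1 + alpha)
    return occupied + [new_component]


# Hypothetical counts of the other observations in three occupied components.
counts = [6, 3, 1]
limit = infinite_conditional_prior(counts, alpha=1.0)
# A finite mixture with many empty components approaches the same probabilities.
finite = finite_conditional_prior(counts + [0] * 997, alpha=1.0)
print(limit, sum(finite[3:]))
```

The total mass that the finite model spreads over its empty components converges to the single "new component" probability of the infinite model, which is what allows the number of components to be left unspecified.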
Now we combine the priors of with its likelihood given in (12), and the conditional posterior can be given by (22).
(22) 
Unfortunately, the integral in the second case of (22) is not analytically tractable. Therefore, the auxiliary variable sampling algorithm [41] is employed. In detail, we add auxiliary components in each sampling iteration to represent the effect of the auxiliary components. Noting that the posterior probability that belongs to the th component is proportional to , we use for the auxiliary components and rewrite (22) as
(23) 
where is the index of unique components in each iteration during the Gibbs sampling algorithm, is the appropriate constant for normalization, and is the number of active components. To this end, we summarize the sampling algorithm in Algorithm 1.
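As one concrete instance of the per-variable updates in a Gibbs sweep, the sketch below samples a component mean given the component precision and the observations currently assigned to that component. It uses the textbook normal-normal conjugate update with a Gaussian prior on the mean that does not depend on the component precision, which is the property the conditionally conjugate model relies on. The parameter names are hypothetical, and this is not a reproduction of (19) or of Algorithm 1.

```python
import math
import random


def mean_posterior_params(observations, precision, prior_mean, prior_precision):
    """Posterior mean and precision of a Gaussian mean with known component precision."""
    n = len(observations)
    post_precision = n * precision + prior_precision
    post_mean = (precision * sum(observations) + prior_precision * prior_mean) / post_precision
    return post_mean, post_precision


def sample_component_mean(observations, precision, prior_mean, prior_precision, rng):
    """Draw one Gibbs sample of the component mean from its conditional posterior."""
    post_mean, post_precision = mean_posterior_params(
        observations, precision, prior_mean, prior_precision
    )
    return rng.gauss(post_mean, 1.0 / math.sqrt(post_precision))


rng = random.Random(1)
assigned = [2.0, 4.0]  # hypothetical test statistics assigned to one component
print(mean_posterior_params(assigned, 1.0, 0.0, 2.0))
print(sample_component_mean(assigned, 1.0, 0.0, 2.0, rng))
```

A component with no assigned observations simply falls back to the prior, which is how newly created components obtain their initial parameters in samplers of this kind.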
IV Proposed Prediction-Transmission Spectrum Access Structure
In this section, we propose two prediction-transmission structures (periodic and non-periodic) for spectrum access. We first introduce the functions of the prediction and transmission parts. Then, we present the details of how to determine the prediction intervals. As directly optimizing the NPLA performance is intractable, we propose to maximize an expected average utility by imposing a reward (penalty) for a power level match (mismatch). Finally, we extend the prediction-transmission structure to an online scenario, which can handle the case where the number of PT power levels changes after Stage I.
IV-A Functions of the Prediction and Transmission Parts
In the prediction part, with the inferred GMM parameters , the ST can easily identify the current PT power level by a single sensing slot with test statistic , based on the following criterion
(24) 
In the transmission part, the ST allocates its transmit power level to match the PT power level , which means . Note that is an estimate of . In the simulation, we find that the conditionally conjugate DPGMM is able to identify () with a high probability (Footnote 5: In the rare case of , the NPLA performance will degrade and another round of learning will be necessary). When the ST matches the PT power level, it adjusts its transmit power for each level, as explained in Section II-B.
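The prediction step can be illustrated with a maximum a posteriori rule over the learned mixture: pick the component whose weighted Gaussian density is largest at the observed test statistic. This is an assumed decision rule offered as a sketch, since the exact form of criterion (24) is not reproduced in this text; the function names and the GMM parameters below are hypothetical.

```python
import math


def gaussian_pdf(y, mean, precision):
    """Gaussian density parameterized by mean and precision (inverse variance)."""
    return math.sqrt(precision / (2 * math.pi)) * math.exp(-0.5 * precision * (y - mean) ** 2)


def predict_power_level(statistic, weights, means, precisions):
    """Index of the component maximizing weight * density (assumed MAP rule)."""
    scores = [w * gaussian_pdf(statistic, m, s) for w, m, s in zip(weights, means, precisions)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical GMM parameters inferred in Stage I for three power levels.
weights = [0.5, 0.3, 0.2]
means = [1.0, 2.0, 3.5]
precisions = [100.0, 50.0, 25.0]
print([predict_power_level(y, weights, means, precisions) for y in (1.05, 2.1, 3.4)])
```

Because the rule needs only one test statistic and the stored mixture parameters, a single sensing slot suffices for each prediction.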
IV-B Periodic Structure for the Prediction-Transmission Structure
In the periodic structure, the ST periodically predicts the PT power level in a sensing slot with duration , and then transmits for a transmission block with duration . This structure can be implemented in a straightforward way, and the prediction interval remains constant whether the SR decodes the signal successfully or not.
IV-C Non-periodic Structure for the Prediction-Transmission Structure
For the non-periodic structure, an essential question is how to determine the prediction intervals. This requires finding the distribution of the PT power level duration , and the corresponding observation of each action. If the prediction action is taken, the observation will be the PT power level identified from the received test statistic according to (24). If the transmission action is taken, the observation will be a positive or negative acknowledgment (ACK) received by the ST from the SR, which indicates whether the SR correctly receives the signal from the ST. Based on these observations, the ST can infer the PT transmit power level, and then dynamically adjust the prediction intervals. As the inference may not be correct all the time, the dynamic adjustment of intervals can be formulated as a partially observable decision problem [44]. To address this problem, we first estimate the distribution parameter of based on . Then, we develop a reinforcement learning algorithm that correlates the observations with the current PT power level identification.
IvC1 PT Power Level Duration Distribution
Without loss of generality, the discretized PT power level duration of all hypotheses is assumed to follow a Poisson distribution with the same mean value . (Footnote 6: If the time duration is correlated between adjacent hypotheses, the proposed algorithm also works, but at the cost of significantly increased computational complexity.) Its cumulative distribution function can be given by
(25) 
where is the mean value of the Poisson distribution. In Stage I, we have attributed all signals to different components by learning the GMM parameter set , thus we can easily obtain the samples of the PT operation durations . Then, can be estimated through maximum likelihood estimation as
(26) 
where is the discretized duration of the th PT hypothesis, and is the number of PT hypotheses detected in Stage I.
If the PT has remained at the same power level for time immediately after a power level change, the probability that it will stay at that power level during the following discretized time duration can be expressed as
(27) 
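As an illustration of the duration model in (25)-(27), the following minimal sketch evaluates the Poisson cumulative distribution function, the maximum likelihood estimate of its mean (the sample average of the observed durations), and a conditional stay probability. The function names are hypothetical, and the conditional probability is written as a ratio of survival probabilities with strict inequalities, which is an assumption about the form of (27) rather than a transcription of it.

```python
import math


def poisson_cdf(k, mean):
    """P(D <= k) for a Poisson-distributed duration D with the given mean."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)  # P(D = 0)
    total = term
    for i in range(1, int(k) + 1):
        term *= mean / i  # P(D = i) obtained from P(D = i - 1)
        total += term
    return min(total, 1.0)  # guard against rounding above 1


def estimate_mean(durations):
    """Maximum likelihood estimate of the Poisson mean: the sample average."""
    if not durations:
        raise ValueError("need at least one observed duration")
    return sum(durations) / len(durations)


def stay_probability(elapsed, ahead, mean):
    """Assumed form: P(D > elapsed + ahead | D > elapsed)."""
    survive_now = 1.0 - poisson_cdf(elapsed, mean)
    if survive_now <= 0.0:
        return 0.0  # the level has (numerically) certainly ended already
    return (1.0 - poisson_cdf(elapsed + ahead, mean)) / survive_now


# Hypothetical durations collected in Stage I (in discretized slots).
observed = [8, 12, 10, 9, 11]
mean_hat = estimate_mean(observed)
print(mean_hat, stay_probability(elapsed=5, ahead=2, mean=mean_hat))
```

The stay probability shrinks as the look-ahead grows, which is what lets the ST shorten its prediction intervals as a power level ages.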
IvC2 PT Power Transition Probability
In Stage I, the observations in the conditionally conjugate DPGMM are infinitely exchangeable. Therefore, the ST cannot infer the PT transmit power level transition probability directly. Alternatively, we take a pragmatic approach and use the mixing proportion set , obtained by counting the number of observations in each power level, to represent the occupancy frequency of different power levels. Thus, we define an transition probability matrix for the PT transmit power levels. The element of refers to the probability that the PT transitions from the th to the th transmit power level, and is given by
(28) 
If the PT does not transmit with the same power level under two consecutive hypotheses then . We also define the vector as the th row of matrix .
IvC3 Estimation Probability Matrix of the PT Power Level
If the PT transmits with binary power levels, the prediction performance of the ST is dictated by the detection and false alarm probabilities. However, this is no longer the case when the PT has multiple transmit power levels. Instead, we define an prediction probability matrix , with the element , representing the probability that the PT is operating under hypothesis while the detection by the ST is in favor of hypothesis . represents that the ST identifies the PT operating in the th transmit power level following (24). Thus the element refers to the detection probability for hypothesis . We also define vector as the th column of matrix .
IvC4 Benefits of the ST from Prediction and Transmission Actions
We denote the prediction action of the ST at time as , and its observation as . We also denote the transmission action at time as , and its observation as . It is assumed that the positive/negative ACK is returned to the ST through a dedicated feedback channel. Let denote the conditional probability that the PT keeps operating with the th transmit power level at time given and , where . Based on Bayes' rule, the probability of the PT staying in the th power level at time can be given as follows.
When , we have
(29) 
When , we have
(30) 
where we assume the positive and negative ACKs from the SR can be received by the ST error-free. Besides, denotes the probability that the SR decodes signals correctly when the ST is on the th level and the PT is on the th level. For simplicity, we set when and otherwise. In (29) and (30), and , where the superscripts and represent the prediction and transmission, respectively. The expected utility that the ST obtains at time with the PT operating with the th transmit power level, which is , can be given by
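The Bayesian updates after a prediction and after a transmission can be illustrated with a generic discrete filter. The sketch below is not a transcription of (29) and (30): it assumes that the transition matrix is built from the mixing proportions with a zero diagonal and renormalized rows, that a single stay probability summarizes the duration model for the current slot, and that the SR decodes correctly exactly when the ST and PT levels coincide. All function names are hypothetical, and at least two power levels are assumed.

```python
def transition_matrix(proportions):
    """Assumed form: move to level j with weight proportions[j], never to the same level."""
    n = len(proportions)
    total = sum(proportions)
    return [
        [0.0 if j == i else proportions[j] / (total - proportions[i]) for j in range(n)]
        for i in range(n)
    ]


def propagate(belief, stay, trans):
    """Prior for the next slot: keep the level with probability `stay`, else jump via `trans`."""
    n = len(belief)
    return [
        belief[j] * stay + sum(belief[i] * (1.0 - stay) * trans[i][j] for i in range(n))
        for j in range(n)
    ]


def normalize(weights):
    total = sum(weights)
    if total == 0.0:
        raise ValueError("observation has zero probability under the prior")
    return [w / total for w in weights]


def update_after_prediction(belief, observed, stay, trans, est):
    """est[i][j]: probability that the ST decides level j while the PT is at level i."""
    prior = propagate(belief, stay, trans)
    return normalize([prior[i] * est[i][observed] for i in range(len(prior))])


def update_after_transmission(belief, st_level, ack, stay, trans):
    """Error-free ACK, assumed positive exactly when the PT level equals st_level."""
    prior = propagate(belief, stay, trans)
    likelihood = [
        1.0 if (i == st_level) == ack else 0.0 for i in range(len(prior))
    ]
    return normalize([p * l for p, l in zip(prior, likelihood)])


# Hypothetical three-level example.
trans = transition_matrix([0.5, 0.3, 0.2])
est = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
belief = [1.0, 0.0, 0.0]
print(update_after_prediction(belief, 0, 0.9, trans, est))
print(update_after_transmission(belief, 0, False, 0.9, trans))
```

A negative ACK removes all mass from the level the ST was matching and redistributes it according to the transition row, while a prediction only reweights the prior by the corresponding column of the estimation matrix.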
(31) 
Hereafter, denotes the ST transmit power level, is the PT transmit power level, and . In (31), is the reward that the ST will receive when the ST transmits with the th power level and . Meanwhile, is the penalty that the ST will receive when the ST transmits with the th power and .
IvC5 Prediction-Transmission Structure Optimization
An ST access policy maps the ST belief space to the action space . Thus, the optimal prediction-transmission policy aims to maximize the expected average utility, which can be given by
(32) 
Note that if and , is not defined in Section IVC4. This is because the ST keeps transmitting between time and , thus no action is taken during this period. Therefore, we assign if and in the action space . The total utility obtained by the ST during each PT hypothesis is i.i.d. Thus, by the law of large numbers, the maximization problem in (32) can be rewritten as
(33) 
In other words, instead of maximizing the average expected utility, we translate the problem to the maximization of the utility in each PT hypothesis. For convenience, let denote the expected utility that can be achieved in each PT hypothesis following policy , which is
(34) 
Then, we can define the maximum utility that can be achieved by the ST in each PT hypothesis as
(35) 
In (35), is directly determined by the action taken at time , thus it can be expressed as
(36) 
where and are the expected utilities that can be obtained by the ST through prediction and transmission, respectively. We have
(37) 
and
(38) 
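The backward recursion behind (36)-(38) can be illustrated with a deliberately simplified surrogate. The sketch below is not the paper's formulation: it collapses the belief to a single number (the probability that the PT is still at the level the ST matches), treats a level change as the end of the episode, and uses one symmetric accuracy for the prediction. The names `solve`, `stay`, `reward`, `penalty`, and `detect` are hypothetical.

```python
def solve(stay, reward, penalty, detect, grid=101):
    """Backward induction over slots on a belief grid (two-state surrogate).

    stay[t]  : probability that the PT keeps its level through slot t
    reward   : utility of one successful transmission slot
    penalty  : loss when the ST transmits after the PT has changed level
    detect   : probability that a prediction reports the true state
    Returns (values, actions); values[t][k] is the value at belief k / (grid - 1).
    """
    horizon = len(stay)
    beliefs = [k / (grid - 1) for k in range(grid)]

    def interp(row, b):
        # Linear interpolation of a value row at belief b in [0, 1].
        pos = min(max(b, 0.0), 1.0) * (grid - 1)
        lo = min(int(pos), grid - 2)
        frac = pos - lo
        return row[lo] * (1.0 - frac) + row[lo + 1] * frac

    values = [[0.0] * grid for _ in range(horizon + 1)]  # terminal row is zero
    actions = [[None] * grid for _ in range(horizon)]
    for t in range(horizon - 1, -1, -1):
        nxt = values[t + 1]
        for k, b in enumerate(beliefs):
            same = b * stay[t]  # prior that the PT is still at the matched level
            # Transmit: a positive ACK confirms the level (belief 1);
            # a negative ACK ends the episode after paying the penalty.
            v_tx = same * (reward + nxt[-1]) - (1.0 - same) * penalty
            # Predict: no reward in this slot, but the belief is refined.
            p_same = same * detect + (1.0 - same) * (1.0 - detect)
            p_diff = 1.0 - p_same
            v_pr = 0.0
            if p_same > 0.0:
                v_pr += p_same * interp(nxt, same * detect / p_same)
            if p_diff > 0.0:
                v_pr += p_diff * interp(nxt, same * (1.0 - detect) / p_diff)
            values[t][k] = max(v_tx, v_pr)
            actions[t][k] = "transmit" if v_tx >= v_pr else "predict"
    return values, actions


values, actions = solve([0.95, 0.9, 0.8, 0.6, 0.3], reward=1.0, penalty=2.0, detect=0.9)
print(values[0][-1], actions[0][-1], actions[4][-1])
```

In this surrogate the value at each slot is the maximum of a function linear in the belief (transmit) and an expectation over posteriors (predict), so it is non-decreasing in the belief, and the ST stops transmitting once the stay probability is too low to cover the expected penalty.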
Lemma 1.
is a convex function of for given and .
The proof of the above lemma follows directly from [45, pages 58-59], and is omitted here.
It is clear that is derived backward in the time domain in (36), (37) and (38). Thus it will be helpful for the derivation of the optimal policy if an upper bound of can be established, which is given by
(39) 
In each PT hypothesis with the th transmit power level, the transmission action will not be taken by the ST after . This is because, , which means the ST will certainly receive negative reward if it transmits after , even if the PT is estimated as idle at time . Therefore, . When it comes to the range of , as follows a Poisson distribution in (25), we find that always holds according to [46].
Lemma 2.
The optimal utility function increases with for given and .
Proof.
The proof is given in Appendix VI. ∎
Lemma 3.
and are convex functions increasing with for given and .
Proof.
We have proved that and increase with in Lemma 2. Next, to prove their convexity, we derive their second-order derivatives with respect to . Combining (29) and (37), the second-order derivative of can be given by
(40) 
which is positive as is convex. The second derivative of can be proved positive similarly. Therefore, we complete the proof. ∎
We note that in (37) and