I-A Cognitive Radio
Emerging wireless technologies, such as 5G cellular networks and the machine-to-machine enabled industrial Internet of Things, are fueling an ever-increasing demand for access to the radio frequency spectrum. Cognitive radio (CR), an intelligent wireless technology able to recognize the surrounding radio environment, creates a potential communication paradigm to achieve more efficient and flexible spectrum usage. A secondary user (SU) with CR capability monitors the spectrum utilization of a primary user (PU) and determines its access to that spectrum accordingly. Two fundamental challenges arise in the process: how to explore possible spectrum opportunities (spectrum sensing) and how to exploit such opportunities efficiently (spectrum access).
Spectrum sensing measures and perceives the surrounding radio spectrum state based on various signal processing methods, including matched filter detection [2, 3], cyclostationary detection [4, 5], and energy detection [6, 7, 8, 9, 10]. Matched filter detection can achieve the optimum performance, but the SU requires perfect a priori knowledge of the PU signaling features. Cyclostationary detection takes advantage of the cyclostationary feature of the signal to distinguish it from stationary noise. In contrast, energy detection carries out a hypothesis test to determine the PU spectrum state based on the energy of the collected PU signals. It features low computational complexity and is widely adopted in the literature. Upon obtaining the radio spectrum state, spectrum access dynamically adjusts the resources available to the SU, including frequency band, transmission time, and transmission power, and accesses the licensed spectrum while taking into account the interference to the PU. The SU can access the licensed spectrum either when the PU is idle (opportunistic access), or concurrently with the PU following a power control strategy that constrains the interference to the PU (spectrum sharing) [12, 13]. It is clear that appropriately designed spectrum sharing achieves higher throughput for the SU than opportunistic spectrum access.
It is worth noting that many contemporary wireless standards, such as IEEE 802.11, GSM, and LTE, have specified multiple transmit power levels to dynamically adapt to the fast changing environment and varying quality of service (QoS). The majority of CR studies in the literature did not take this into account, and the SU usually adopts a binary approach in reporting the radio spectrum state as idle or busy. In fact, for this multiple power level scenario, multi-parameter cognition is required, and the binary approach may not represent the most efficient spectrum utilization for the SU. The question then arises: how can the variation in the PU power levels be exploited to design an intelligent spectrum sharing strategy? Naturally, the answer to this question consists of spectrum sensing and spectrum access, which are elaborated on in the next two subsections.
I-B Multi-level Spectrum Sensing
The design of an agile spectrum sensing method should prioritize accurately mapping the sensing samples received by the secondary transmitter (ST) to the corresponding primary transmitter (PT) power levels. In the conventional binary case (PT is ON/OFF), there are two kinds of errors, and the goal of spectrum sensing is to determine a single detection threshold (the upper part of Fig. 1). For example, given the target probability of false alarm and the noise power, the threshold can be simply determined by the Neyman-Pearson criterion. By contrast, the goal of the multi-level case is to jointly determine multiple thresholds to separate different power levels through a multiple hypothesis test (the middle part of Fig. 1), which is far more complicated than the binary one. In essence, the number of error types grows with the number of power levels, and these errors are intertwined, which exacerbates the complexity of the threshold calculation. Prior work proposed an energy detection based multiple hypothesis test to derive the decision thresholds for multiple power level identification, and the results were extended to scenarios with noise uncertainty and non-Gaussian transmission signals. However, the calculation of the thresholds in [18, 19, 20, 21] requires a large amount of prior knowledge at the ST, including the noise power and the PT transmit power mode (i.e., the number and exact values of the different transmit powers, and the prior probability of each hypothesis). In practice, these parameters are unlikely to be available to the ST a priori.
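To make the binary baseline concrete, the following is a minimal sketch of a Neyman-Pearson threshold for an average-energy detector. It assumes circularly symmetric complex Gaussian noise and the usual Gaussian approximation of the statistic; the function names, the noise power of 1.0, the 200 samples per slot, and the target false alarm probability of 0.1 are illustrative choices, not values from the paper.

```python
import math
import random
from statistics import NormalDist

def np_threshold(noise_power, num_samples, target_pfa):
    """Threshold for the average-energy statistic under a Gaussian approximation."""
    # With the PT idle, the statistic is roughly N(noise_power, noise_power^2 / num_samples).
    q_inverse = NormalDist().inv_cdf(1.0 - target_pfa)
    return noise_power * (1.0 + q_inverse / math.sqrt(num_samples))

def noise_only_statistic(noise_power, num_samples, rng):
    """Average energy of complex Gaussian noise samples (PT idle)."""
    sigma = math.sqrt(noise_power / 2.0)  # standard deviation per real/imaginary part
    total = 0.0
    for _ in range(num_samples):
        total += rng.gauss(0.0, sigma) ** 2 + rng.gauss(0.0, sigma) ** 2
    return total / num_samples

rng = random.Random(0)
threshold = np_threshold(noise_power=1.0, num_samples=200, target_pfa=0.1)
trials = 2000
false_alarms = sum(noise_only_statistic(1.0, 200, rng) > threshold for _ in range(trials))
empirical_pfa = false_alarms / trials
print(f"threshold={threshold:.4f}, empirical false alarm rate={empirical_pfa:.3f}")
```

Note that the threshold depends directly on the noise power, which is exactly the kind of prior knowledge the blind scheme proposed in this paper aims to avoid.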
In this paper, we aim to break the limits of the existing work and achieve multi-level spectrum sensing with no or minimal prior information. We deviate from the above classical signal processing approaches and adopt machine learning as the tool for knowledge discovery, mining and extracting the latent patterns that reflect the PT power level variation in the PT traffic data flow. On this basis, we propose a data-driven/machine learning based multi-level spectrum sensing scheme. It is fully blind in the sense that the ST does not require any prior knowledge of the noise power and the PT transmit power mode. Specifically, the proposed spectrum sensing scheme spans two stages, as shown in Fig. 3. In Stage I (spectrum learning, a.k.a. the training phase in machine learning), the ST collects a multitude of signals that experience multiple PT transmit power level variations, and uses Gaussian mixture models (GMMs) to capture the multi-level power characteristics inherent in the signals. Then, we introduce a Bayesian nonparametric method, referred to as the conditionally conjugate Dirichlet process GMM (DPGMM), to automatically cluster the signals with the same PT transmit power level (the lower part of Fig. 1) and infer the model parameters (GMM parameters and PT power level duration distribution parameters). With the model parameters inferred in Stage I, the prediction part in Stage II (see Fig. 3) can easily identify the current PT power level by collecting PT signal samples. In this case, Stage I together with the prediction part in Stage II achieves fully blind multi-level spectrum sensing. Note that the second part in Stage II (ST transmission) is detailed next.
I-C Multi-level Spectrum Access
With the big picture of multi-level PT radio environments learned in Section I-B, we next aim to establish a multi-level spectrum access strategy, which is the ultimate goal of spectrum sensing. To the best of our knowledge, this represents the first effort in such direction. For the SU, there is a fundamental tradeoff between two conflicting goals, namely, maximization of its own throughput and minimization of its interference to the PU. As typically there is no cooperation between the PU and SU, it is extremely difficult to optimize this tradeoff in practice. To provide a pragmatic solution to this dilemma, we first propose a new metric, referred to as the normalized power level alignment (NPLA), and it is defined as the time proportion that the ST matches its transmit power level to that of the PT.
To optimize the NPLA performance, we propose two prediction-transmission structures (periodic and non-periodic) in Stage II for spectrum access, which enable the ST to closely follow the PT power level variation. As discussed before, the prediction part can identify the current PT power level. On this basis, the transmission part adjusts the ST power according to the required signal-to-interference-plus-noise ratio (SINR) at the primary receiver (PR). Specifically, the periodic structure features a fixed prediction interval and is straightforward to implement. By contrast, the non-periodic structure dynamically determines the interval, which can be formulated as a partially observable decision problem. This motivates us to develop a new algorithm based on reinforcement learning, exploiting the PT power level duration distribution. This structure further improves the NPLA performance. Finally, we extend the prediction-transmission structure to an online scenario, where the number of PT power levels might change as a consequence of the PT adapting to environment fluctuation or QoS variation.
In a nutshell, we propose a learning-based two-stage spectrum sharing strategy for a CR network, which enables fully blind spectrum sensing when the PT power varies with time over multiple levels, and design an adaptive spectrum access strategy for NPLA optimization. The main contributions of this paper can be summarized as follows.
We propose a novel data-driven/machine learning based multi-level blind spectrum sensing scheme. The conditionally conjugate DPGMM with Bayesian inference is introduced to automatically cluster the signals and infer the model parameters, which is key to predicting the PT power levels.
We propose a new metric, NPLA, to strike an excellent tradeoff between the secondary network throughput and the interference to the primary network.
To optimize the NPLA performance, we propose a prediction-transmission structure for spectrum access which enables the ST to closely follow the PT power level variation. Furthermore, the ST prediction interval is dynamically adjusted to achieve better performance.
The spectrum access method is extended to the online scenario to accommodate a more realistic situation, where the number of PT power levels might change after the initial spectrum learning.
I-E Related Work
Machine learning technology has recently played an important role in improving spectrum sensing. Existing work has presented an adversarial machine learning approach to launch jamming attacks on CR and introduced a defense strategy. Several supervised and unsupervised machine learning algorithms for cooperative spectrum sensing (CSS) have been investigated. The combination of the infinite GMM and CSS has been proposed to detect primary user emulation attacks. A convolutional neural network-based CSS scheme has been developed to detect multiple bands simultaneously, and a mobile CSS framework has been proposed for large-scale heterogeneous cognitive networks.
There are many efforts in sensing policy design for real-time decisions on which channel(s) to sense (dynamic multichannel selection). By contrast, our paper considers the single user and single channel case, and focuses on the design of multi-level spectrum sensing to differentiate different PT power levels. On this basis, we also consider the policy design to dynamically adjust the sensing intervals to improve the NPLA performance.
The dynamic multichannel selection can be modelled as a partially observable Markov decision process (POMDP). The partial observation in [29, 30, 31, 32, 33, 34, 35] originates from each SU being unable to scan all the channels at any one time due to energy and hardware constraints. Therefore, a sensing policy needs to be developed to balance between utilizing a spectrum opportunity for immediate access and collecting spectrum occupancy statistics to track spectrum opportunities for future exploitation. As finding an optimal channel sensing policy is in general computationally prohibitive as the number of channels increases, several efforts endeavor to find the optimal or near-optimal policy with low computational cost. In [29, 30, 31], the dynamic multichannel access problem is modelled as a restless multi-armed bandit problem. The time horizon is divided by interleaving exploration and exploitation epochs with growing lengths, and the optimal policy can be translated into determining the length and allocation of each epoch. Recently, deep reinforcement learning (DRL) based channel selection [32, 35, 33, 34] has attracted great attention, and it aims to handle correlated channels with unknown channel dynamics. The essence of DRL is to provide a good approximation of the objective value (Q-value), facilitating the handling of large state and action spaces.
It is worth noting that the access policy design (sensing interval) in our paper is also formulated as a POMDP, but the nature of our formulation is fundamentally different from that in [29, 30, 31, 32, 35, 33, 34]. The partial observation in our work comes from imperfect multi-level sensing results and access feedback. To tackle this challenging POMDP, we reduce the infinite time horizon to a finite one, leading to a computationally tractable solution. Most importantly, we mathematically prove that such practice does not sacrifice the optimality in the utility.
I-F Organization and Notation
The rest of the paper is organized as follows. In Section II, we discuss the system model for our proposed two-stage spectrum sharing strategy. In Section III, we introduce a Bayesian nonparametric method and its inference for the model parameters. The prediction-transmission structure with an online extension, which is adaptive to the PT power level variation, is presented in Section IV. Simulation results and discussion are presented in Section V, followed by conclusions in Section VI.
$\mathcal{N}(\mu, s^{-1})$ denotes the Gaussian distribution with mean $\mu$ and precision $s$, $\mathcal{CN}(\mu, s^{-1})$ denotes the complex Gaussian distribution with mean $\mu$ and precision $s$, and $\mathcal{G}(a, b)$ denotes the Gamma distribution with shape parameter $a$ and scale parameter $b$. $\Gamma(\cdot)$ denotes the Gamma function and $\Gamma(\cdot, \cdot)$ denotes the incomplete Gamma function. $\lfloor \cdot \rfloor$ is the floor function. For convenience, we list the most important symbols in Table I.
| Symbol | Description |
|---|---|
| , | The index and total number of the actual PT transmit power levels. |
| | The index and total number of the PT transmit power levels estimated by the ST. |
| , | The index of the PT hypothesis and the total number of the PT hypotheses in Stage I. |
| , | The index of the ST action and the total number of sensing slots in Stage I. |
| | The PT transmission power value on the -th level. |
| | The hypothesis that the PT transmits with . |
| | The hypothesis determined by the ST that the PT transmits with . |
| | The test statistic in the -th single sensing slot. |
| | The total number of samples collected by the ST in a single sensing slot. |
| | The duration of a ST sensing slot, a ST transmission block, and a PT hypothesis. |
| , | The discretized time of a ST transmission block and a PT hypothesis. |
| | The concentration parameter and the base probability distribution of the Dirichlet process. |
| | The latent variable indicating which component that is associated with. |
| | The total number of observations assigned to the -th component. |
| | The value, precision, and mixing proportion of the -th component in the GMM. |
| , , , | The hyperparameters in the conditionally conjugate DPGMM. |
| | The probability that the PT is operating under while the detection by the ST is in favor of . |
| | The probability that the PT transfers from the -th transmit power level to the -th level. |
| | The probability that the PT keeps operating with the -th transmit power level at time . |
| | The action that the ST will take at time . |
| | The longest time that the SU should transmit when operating in the -th transmit power level. |
| | The probability of correct PT power level prediction in Stage II. |
| | The NPLA performance from time 0 to . |
II System Model
We consider a spectrum sharing CR network in Fig. 2, with a primary network consisting of a PT and a PR, a secondary network consisting of a ST and a secondary receiver (SR), and a central site (broadcast tower). Transmission happens simultaneously in both networks sharing the same frequency band. Different from most CR networks considered in the literature, the PT operates with multiple (instead of binary: ON/OFF) power levels. The ST attempts to learn the model parameters, which will be defined later, and then optimize the spectrum access accordingly. The ultimate goal is to optimize the NPLA performance.
To achieve this target, we propose a novel two-stage spectrum sharing strategy, as illustrated in Fig. 3. The PT operates with a finite set of transmit power levels, one of which indicates an idle PT. For convenience, each hypothesis indicates that the PT transmits with the corresponding power level. It is assumed that the PT transmit power undergoes a slow change, as shown in the figure. We define the time duration of each hypothesis as a random variable, which is usually much larger than the durations of the ST sensing slot and the ST transmission block. In this paper, we consider a time-discretized model built on a minimum time unit, in which the discretized duration of the ST transmission block is fixed and the discretized PT power level duration varies. As discussed before, the PT transmit power mode, defined as the number of transmit power levels, their exact values, and the prior probability of each hypothesis, is unknown to the ST, which is fundamentally different from the assumptions in [18, 19, 20, 21]. In the sequel, we describe in detail the operations of the two stages in Fig. 3.
II-A Stage I
In this stage, the ST samples the received PT signals at a fixed sampling frequency and collects a fixed number of samples in each sensing slot (for notational simplicity, we assume that the product of the sampling frequency and the sensing slot duration is an integer). The ST observes multiple sensing slots in Stage I, so the total number of samples is the number of sensing slots multiplied by the number of samples per slot. It is assumed that the learning period is reasonably large so that it covers all hypotheses. (There is a non-zero probability that some transmit power levels do not occur and are not observed by the ST during Stage I, even if the learning period is relatively large. In this case, these missed hypotheses can be viewed as small probability events. Consequently, they have negligible impact on the performance of the proposed spectrum access method and can be ignored.) Thus, each sample in a sensing slot under a given hypothesis is the sum of two terms: the received primary signal in that sensing slot, whose average power is determined by the hypothesis, and the additive white Gaussian noise. We assume that the noise is an independent and identically distributed (i.i.d.) random variable with zero mean and fixed variance. Without loss of generality, we assume that the primary signal is a complex PSK modulated signal. (For other modulation schemes, the test statistic still follows a Gaussian distribution. Therefore, our proposed method is still valid for other modulation and/or adaptive modulation schemes.)
The test statistic in each sensing slot is calculated as the average energy of the samples collected in that slot, as given in (2). When the number of samples per slot is large, according to the central limit theorem, the distribution of the test statistic under each hypothesis can be approximated by a Gaussian one, whose parameters depend on the noise power and on the received signal-to-noise ratio (SNR) at the ST. Considering all the hypotheses, we establish that the test statistic follows a mixed Gaussian distribution, as given in (4), in which the mixing coefficients are non-negative and sum to one. Each Gaussian density is a component of the mixture with its own mean value and precision.
Given the observation set of test statistics, the proposed Bayesian nonparametric method aims to infer the GMM parameter set, which comprises the mean value, precision, and mixing proportion of every component. In other words, our method automatically clusters the signals with the same PT transmit power level. In summary, Stage I establishes a big picture of the PT activities at the cost of a one-off overhead. After learning, the ST allocates the same number of power levels as that in the PT, with an initial ST power value for each level.
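The Gaussian approximation of the test statistic can be checked numerically. The sketch below assumes a QPSK primary signal, unit noise power, 400 samples per slot, and a received SNR of 0.5, all of which are illustrative. It compares the empirical mean and variance of the average-energy statistic with the standard values for a constant-modulus signal in complex Gaussian noise: mean (1 + SNR) times the noise power, and variance (1 + 2 SNR) times the squared noise power divided by the number of samples.

```python
import cmath
import math
import random

def slot_statistic(snr, noise_power, num_samples, rng):
    """Average energy of one sensing slot: QPSK signal plus complex Gaussian noise."""
    amplitude = math.sqrt(snr * noise_power)
    sigma = math.sqrt(noise_power / 2.0)  # standard deviation per real/imaginary part
    energy = 0.0
    for _ in range(num_samples):
        phase = rng.choice((0.25, 0.75, 1.25, 1.75)) * math.pi
        signal = amplitude * cmath.exp(1j * phase)
        noise = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        energy += abs(signal + noise) ** 2
    return energy / num_samples

rng = random.Random(1)
snr, noise_power, num_samples = 0.5, 1.0, 400
stats = [slot_statistic(snr, noise_power, num_samples, rng) for _ in range(500)]

emp_mean = sum(stats) / len(stats)
emp_var = sum((x - emp_mean) ** 2 for x in stats) / (len(stats) - 1)
pred_mean = (1.0 + snr) * noise_power
pred_var = (1.0 + 2.0 * snr) * noise_power ** 2 / num_samples
print(f"mean: {emp_mean:.4f} vs {pred_mean:.4f}, variance: {emp_var:.5f} vs {pred_var:.5f}")
```

Repeating this for several SNR values and pooling the results gives a one-dimensional Gaussian mixture, which is the kind of data the Stage I clustering operates on.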
II-B Stage II
In this stage, we propose two prediction-transmission structures (periodic and non-periodic) adapting to the PT power level variation for spectrum access. The main features of the structures can be summarized as follows.
As shown in Fig. 3, Stage II consists of two parts: prediction and transmission actions. In the prediction action (the sensing slots in purple), the ST can easily identify the current PT power level, which is jointly determined by the test statistic in (2) and the inferred GMM parameter set. In the transmission action, the ST allocates its transmit power level to match the latest identified PT power level. Here, the corresponding ST power value can be determined as follows. Assume that the PR has a required SINR and a current received SINR. A nearby monitoring station (see Fig. 2) of the PR transmits the current received SINR to the central site, likely through optical fiber or microwave, and the central site then broadcasts this information on a dedicated frequency. We assume that the ST is able to decode broadcast signals from the central site and communicate with the SR on different frequency bands. Through the broadcast nature, the ST obtains the current received SINR at the PR. If it is lower than the required SINR, the ST should reduce the transmit power, and vice versa. In other words, the ST power value for each level will gradually approach a desired value. This guarantees that the PR is well protected, while the secondary network obtains the highest possible throughput. (A similar idea of using a broadcasting mechanism was also suggested by the Federal Communications Commission [38, p. 6] and adopted in 4G LTE systems in the form of the inter-cell interference overload indicator.) On this basis, it is clear that the alignment between the ST and PT power levels can optimize the trade-off between the interference to the primary network and the throughput of the secondary network. (Note that the channel state information for PT-ST and ST-PR is not required in our approach. In addition, the use of the received SINR already includes the impact of channel fading.) If the ST mismatches the PT power level variation, either the interference pushes the PR below its required SINR, or the secondary network obtains unnecessarily low throughput.
For the periodic structure, the prediction intervals are fixed. By contrast, in the non-periodic structure, the intervals are dynamically determined to enhance the NPLA performance. As shown in Fig. 3, zero intervals are used to track the PT power level variation, while long intervals are selected to avoid unnecessary prediction. The non-periodic structure will be elaborated on in Section IV-C.
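The SINR feedback rule used in the transmission action can be sketched as a simple closed loop. The toy SINR model below (received PT power, cross-link gain, and noise at the PR), the fixed 0.5 dB step, and all function names are assumptions made for illustration; the paper does not specify the adjustment step.

```python
import math

def adjust_power(st_power, measured_sinr_db, required_sinr_db, step_db=0.5):
    """One feedback step: back off if the PR is below its required SINR, else ramp up."""
    if measured_sinr_db < required_sinr_db:
        return st_power * 10 ** (-step_db / 10)
    return st_power * 10 ** (step_db / 10)

def pr_sinr_db(st_power, pt_rx_power=10.0, cross_gain=0.2, noise=0.1):
    """Toy SINR at the PR; the gains and the noise level are hypothetical."""
    return 10 * math.log10(pt_rx_power / (cross_gain * st_power + noise))

required_sinr_db = 10.0
power = 0.1
for _ in range(200):
    # The ST only needs the broadcast SINR value, not the channel gains themselves.
    power = adjust_power(power, pr_sinr_db(power), required_sinr_db)

final_gap_db = pr_sinr_db(power) - required_sinr_db
print(f"ST power={power:.3f}, SINR gap at the PR={final_gap_db:+.2f} dB")
```

In this sketch the ST power settles into a small oscillation around the value at which the PR just meets its required SINR, which mirrors the statement that the power value for each level gradually approaches a desired value.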
III Spectrum Learning Based on Bayesian Inference
In this section, we focus on Stage I and introduce a Bayesian nonparametric method to infer the GMM parameter set based on the observation set. As the number of PT power levels is unknown a priori, we adopt the DPGMM [40], which takes into account the Gaussian mixture property and is able to identify the unknown number of Gaussian components. For the specific Bayesian inference, we choose the Markov chain Monte Carlo based Gibbs sampling method considering its simplicity.
In the following, we first review the preliminary knowledge on the Dirichlet process mixture model. On this basis, we introduce the DPGMM considering the specific distribution of the observation set. Furthermore, we modify the DPGMM to the conditionally conjugate case to simplify the inference process. Finally, we carry out Bayesian inference with the Gibbs sampling method to infer the GMM parameter set.
III-A Dirichlet Process Mixture Model
The Dirichlet distribution is an extension of the Beta distribution to the multivariate case. It represents the probabilities of a finite number of events given that each event has been observed a known number of times. Its probability density function is defined over the vector of event probabilities, in which every entry is non-negative and the entries sum to one.
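As a small numerical illustration, the standard Dirichlet density can be evaluated as follows. The function name and the example concentration values are illustrative, and for simplicity the sketch treats points on the boundary of the simplex as outside the support.

```python
import math

def dirichlet_pdf(probs, concentrations):
    """Standard Dirichlet density at an interior point of the probability simplex."""
    if len(probs) != len(concentrations):
        raise ValueError("probs and concentrations must have the same length")
    if any(p <= 0.0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        return 0.0  # simplification: only interior points of the simplex are supported
    # Work in the log domain to avoid overflow in the Gamma functions.
    log_norm = math.lgamma(sum(concentrations)) - sum(math.lgamma(a) for a in concentrations)
    log_kernel = sum((a - 1.0) * math.log(p) for p, a in zip(probs, concentrations))
    return math.exp(log_norm + log_kernel)

# All concentrations equal to one give a flat density over the simplex.
uniform_density = dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])
# With two events the Dirichlet distribution reduces to the Beta distribution.
beta_density = dirichlet_pdf([0.5, 0.5], [2.0, 3.0])
print(uniform_density, beta_density)
```

The second call shows the reduction to the Beta distribution mentioned above: with two events, the density coincides with a Beta(2, 3) density evaluated at 0.5.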
In our application, each event represents one possible PT transmit power level, which cannot be observed explicitly. Instead, the explicit observation is the test statistic. As the test statistic is drawn from a distribution determined by the event, we introduce the Dirichlet process (DP) to define a prior over that distribution. A random measure is said to be Dirichlet process distributed with a base probability distribution and a concentration parameter if, for every finite measurable partition of the underlying space, the vector of measures assigned to the partition cells follows a Dirichlet distribution whose parameters are the base-distribution probabilities of the cells scaled by the concentration parameter.
Next, we model the observation set based on the DP mixture model. A DP mixture model is suitable for clustering purposes when the number of Gaussian components is not known a priori. Here, each observation can be regarded as an independent draw from a distribution governed by its own parameter, where each parameter is an i.i.d. draw from a random measure that follows a DP. This hierarchy constitutes the DP mixture model in (7).
Note that the formulation in (7) represents the most general case. In our case, two different observations may follow the same distribution, and (7) cannot explicitly reveal this property. Therefore, a latent variable is introduced for each observation to explicitly indicate which transmit power level the observation is associated with; it is referred to as an indicator hereafter. Accordingly, an equivalent model can be obtained in terms of the set of indicators, the set of unique component parameters, and the mixing proportions. Hereafter, the number of unique values refers to the total number of Gaussian components, and each component consists of the observations that the ST determines to have the same transmit power level. Note that the mixing proportions are assumed to have a symmetric Dirichlet distribution, in which all the concentration parameters are equal. This assumption is widely adopted when there is no prior knowledge of the mixing proportions. Given the mixing proportions, the numbers of observations assigned to the individual components follow a multinomial distribution, from which the distribution of the indicators is obtained.
As all the observations are exchangeable, if we assume that all the other indicators have been obtained, the conditional distribution of an individual indicator can be expressed through the number of observations in each component excluding the current one, as given in (12). Similarly, the prior distribution of one variable conditioned on all the others can be written in closed form.
III-B Conditionally Conjugate Dirichlet Process Gaussian Mixture Model
Recall that the observation in (4) follows a mixed Gaussian distribution, restated in (14) with the index of the real hypotheses replaced by the index of the inferred hypotheses. Therefore, the observation set can be modeled as a DPGMM, as expressed in (15).
In (15), the base distribution represents a prior guess of the distributions of the component means and precisions in the DPGMM. Its choice is usually guided by mathematical convenience, and the conjugate form is widely adopted. In our case, the base distribution specifies the prior on the parameters of the Gaussian mixture components, namely their means and precisions, and it can be expressed in the conjugate form in (16), which is governed by a set of hyperparameters for the DPGMM. It is clear that, in this form, the prior distribution of the component mean depends on the component precision. This undesirable property is inevitable due to the conjugacy requirement on the base distribution. To remove such dependency, we modify the original conjugate feature in the DPGMM and introduce the conditionally conjugate version of the DPGMM to model the observation set. In a conditionally conjugate DPGMM, (16) is rewritten as (17), which introduces one additional hyperparameter.
To complete the conditionally conjugate DPGMM and capture the features inherent in the observation set, we need to give suitable values to the hyperparameters. However, the exact values are hard to know a priori, and small changes in them can dramatically affect the model performance. To make the model robust, we impose vague priors on the hyperparameters, as given in (18), where the hyperprior parameters are set from the empirical mean and precision of the observation set. In theory, the prior should not depend on the observations. However, the formulation of the priors in (18) is equivalent to normalizing the observations, and a wide range of parameters in the priors leads to similar inference results.
III-C Inference Using Gibbs Sampling
Given the conditionally conjugate DPGMM and the observation set, we now use a Markov chain Monte Carlo algorithm based on Gibbs sampling to infer the GMM parameter set. With the set of latent indicators, we can obtain the set of mixing proportions. In the algorithm, we update the variables iteratively by sampling each variable from its posterior distribution conditioned on the others.
Given the likelihood of the component means and precisions in (14) and their priors in (17), we multiply the priors by the likelihood conditioned on the indicators to obtain the conditional posterior distributions of each component mean and precision. These posteriors depend on the number of observations currently assigned to the component, which is counted with an indicator function that equals one when an observation's indicator points to the component and zero otherwise. Similarly, with the likelihood of the hyperparameters given in (17) and their priors given in (18), the conditional posteriors of the hyperparameters are obtained in the same way.
Note that the conditional posterior of one hyperparameter does not take a standard form, but its samples can still be obtained. In detail, we exploit the log-concavity of the density of a transformed version of this hyperparameter, generate independent samples from it with the adaptive rejection sampling technique, and then transform these samples back to obtain the value of the hyperparameter.
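The step of multiplying a prior by the likelihood can be illustrated for the component mean alone. The sketch below uses the textbook Gaussian update for a mean with known precision and a Gaussian prior that does not depend on that precision, which is the situation the conditionally conjugate prior creates. The parameter names are hypothetical, and the updates for the precisions and hyperparameters are not shown.

```python
import math
import random

def mean_posterior(observations, component_precision, prior_mean, prior_precision):
    """Posterior mean and precision of a Gaussian mean with known data precision."""
    count = len(observations)
    post_precision = count * component_precision + prior_precision
    post_mean = (component_precision * sum(observations)
                 + prior_precision * prior_mean) / post_precision
    return post_mean, post_precision

def sample_component_mean(observations, component_precision, prior_mean, prior_precision, rng):
    """One Gibbs draw of the component mean from its conditional posterior."""
    post_mean, post_precision = mean_posterior(
        observations, component_precision, prior_mean, prior_precision)
    return rng.gauss(post_mean, 1.0 / math.sqrt(post_precision))

post_mean, post_precision = mean_posterior([1.0, 2.0, 3.0], 2.0, 0.0, 1.0)
draw = sample_component_mean([1.0, 2.0, 3.0], 2.0, 0.0, 1.0, random.Random(0))
print(post_mean, post_precision, draw)
```

In a full sampler this draw would alternate with draws of the component precision, the indicators, and the hyperparameters, each conditioned on the current values of the others.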
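Before introducing the conditional posterior for the indicators, we note that the prior in (12) only suits the case where the number of components is a fixed finite parameter. To make the conditionally conjugate DPGMM applicable to the scenario with an infinite number of Gaussian components, we let the number of components tend to infinity in (12), and the conditional prior reaches the limits in (22): a component that already holds observations is selected with probability proportional to its number of observations (excluding the current one), and all remaining components combined are selected with probability proportional to the concentration parameter. To handle the components that hold no observations, we introduce a fixed number of auxiliary components in each sampling iteration to represent their effect. Note that the posterior probability that an observation belongs to a given component is proportional to the product of the conditional prior and the likelihood. We therefore assign the prior mass of the unoccupied components to the auxiliary components and rewrite (22) as (23), in which the index runs over the unique components in each iteration of the Gibbs sampling algorithm, a constant normalizes the probabilities, and the number of active components is tracked explicitly. Finally, we summarize the sampling algorithm in Algorithm 1.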
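The limiting conditional prior of an indicator can be sketched in a few lines. The sketch assumes the standard infinite-mixture limit, in which occupied components are weighted by their counts and the remaining mass is proportional to the concentration parameter, and it splits that remaining mass evenly among the auxiliary components, as is common in auxiliary-variable samplers for DP mixtures. The even split and all names are assumptions rather than a transcription of (22) and (23).

```python
def indicator_prior(counts_excluding_current, concentration, num_auxiliary=1):
    """Conditional prior of one indicator in the infinite-component limit."""
    # The denominator equals (number of other observations + concentration).
    total = sum(counts_excluding_current) + concentration
    occupied = [count / total for count in counts_excluding_current]
    # Assumption: the mass of unoccupied components is shared evenly by the auxiliary ones.
    auxiliary = [concentration / num_auxiliary / total] * num_auxiliary
    return occupied, auxiliary

occupied, auxiliary = indicator_prior([5, 3, 1], concentration=1.0, num_auxiliary=2)
print(occupied, auxiliary)
```

To turn these priors into sampling probabilities, each entry would be multiplied by the likelihood of the current observation under the corresponding component and the result normalized.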
IV Proposed Prediction-Transmission Spectrum Access Structure
In this section, we propose two prediction-transmission structures (periodic and non-periodic) for spectrum access. We first introduce the functions of the prediction and transmission parts. Then, we present the details of how to determine the prediction intervals. As directly optimizing the NPLA performance is intractable, we propose to maximize an expected average utility by imposing reward (penalty) for power level match (mismatch). At last, we extend the prediction-transmission structure to an online scenario, which can handle the case where the number of the PT power levels changes after Stage I.
IV-A Functions of the Prediction and Transmission Parts
In the prediction part, with the inferred GMM parameters, the ST can easily identify the current PT power level from the test statistic of a single sensing slot, based on the criterion in (24).
In the transmission part, the ST allocates its transmit power level to match the identified PT power level. Note that the number of power levels inferred by the ST is an estimate of the actual number of PT power levels. In the simulation, we find that the conditionally conjugate DPGMM is able to identify the correct number of PT power levels with a high probability. (In the rare case that the estimated number differs from the actual one, the NPLA performance will degrade and another round of learning will be necessary.) When the ST matches the PT power level, it adjusts its transmit power for each level, as explained in Section II-B.
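One natural way to identify the power level from a single test statistic is to pick the mixture component with the largest posterior weight. The sketch below implements that rule under the assumption that the criterion is of this maximum a posteriori form; the exact criterion in (24) is not reproduced here, and the component parameters are made-up values.

```python
import math

def identify_level(statistic, means, precisions, proportions):
    """Return the index of the component with the largest posterior weight."""
    best_index, best_score = 0, -math.inf
    for k, (mu, s, pi) in enumerate(zip(means, precisions, proportions)):
        # Log of proportion times Gaussian density, dropping the constant shared by all components.
        score = math.log(pi) + 0.5 * math.log(s) - 0.5 * s * (statistic - mu) ** 2
        if score > best_score:
            best_index, best_score = k, score
    return best_index

# Hypothetical parameters inferred in Stage I for three power levels.
means = [1.0, 1.5, 2.5]
precisions = [400.0, 250.0, 120.0]
proportions = [0.5, 0.3, 0.2]
levels = [identify_level(x, means, precisions, proportions) for x in (0.98, 1.55, 2.4)]
print(levels)
```

Because the rule needs only one test statistic and the stored component parameters, the prediction step costs a single sensing slot.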
IV-B Periodic Structure for the Prediction-Transmission Structure
In the periodic structure, the ST periodically predicts the PT power level in a sensing slot and then transmits for one transmission block. This structure can be implemented in a straightforward way, and the prediction interval remains constant whether or not the SR decodes the signal successfully.
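The effect of the fixed prediction interval can be explored with a toy slot-level simulation. The sketch below assumes error-free prediction, counts a sensing slot as not aligned because the ST does not transmit in it, and measures the fraction of slots in which the ST level equals the PT level as a stand-in for the NPLA. These conventions, the example PT trace, and the function names are all assumptions.

```python
def periodic_schedule(pt_levels, block_length):
    """ST level per slot under a periodic structure; None marks a sensing slot."""
    st_levels = []
    current = None
    for slot, level in enumerate(pt_levels):
        if slot % (block_length + 1) == 0:
            current = level          # assumes an error-free prediction
            st_levels.append(None)   # the ST senses, so it does not transmit
        else:
            st_levels.append(current)
    return st_levels

def aligned_fraction(pt_levels, st_levels):
    """Fraction of slots in which the ST transmits at the PT power level."""
    matches = sum(pt == st for pt, st in zip(pt_levels, st_levels))
    return matches / len(pt_levels)

# Hypothetical PT trace: level 0 for 5 slots, level 2 for 7 slots, level 1 for 6 slots.
pt_trace = [0] * 5 + [2] * 7 + [1] * 6
short_block = aligned_fraction(pt_trace, periodic_schedule(pt_trace, 2))
long_block = aligned_fraction(pt_trace, periodic_schedule(pt_trace, 5))
print(short_block, long_block)
```

In this toy trace the longer block wins because it spends fewer slots on sensing, while a faster-changing PT would favor shorter blocks; this tension is what the non-periodic structure tries to resolve adaptively.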
IV-C Non-periodic Structure for the Prediction-Transmission Structure
For the non-periodic structure, an essential question is how to determine the prediction intervals. This requires the distribution of the PT power level duration and the observation associated with each action. If the prediction action is taken, the observation is the PT power level identified from the received test statistic according to (24). If the transmission action is taken, the observation is a positive or negative acknowledgment (ACK) received by the ST from the SR, which indicates whether the SR correctly received the signal from the ST. Based on these observations, the ST can infer the PT transmit power level and then dynamically adjust the prediction intervals. As the inference may not always be correct, the dynamic adjustment of intervals can be formulated as a partially observable decision problem. To address this problem, we first estimate the distribution parameter of the PT power level duration based on the learning results of Stage I. Then, we develop a reinforcement learning algorithm that correlates the observations with the current PT power level identification.
IV-C1 PT Power Level Duration Distribution
Without loss of generality, the discretized PT power level duration of all hypotheses is assumed to follow a Poisson distribution with the same mean value. (Footnote 6: If the time duration is correlated between adjacent hypotheses, the proposed algorithm also works, but at the cost of significantly increased computational complexity.) Its cumulative distribution function can be given by
where is the mean value of the Poisson distribution. In Stage I, we have attributed all signals to different components by learning the GMM parameter set , thus we can easily obtain the samples of the PT operation durations . Then, can be estimated through maximum likelihood estimation as
where is the discretized duration of the -th PT hypothesis, and is the number of PT hypotheses detected in Stage I.
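As an illustration of the estimation step above, the following minimal sketch computes the maximum likelihood estimate of a Poisson mean (which is the sample average) and evaluates the corresponding cumulative distribution function. The function names and the sample durations are hypothetical and serve only as an example; they are not taken from the original algorithm.

```python
import math


def estimate_mean_duration(durations):
    """Maximum likelihood estimate of a Poisson mean: the sample average."""
    if not durations:
        raise ValueError("need at least one observed duration")
    return sum(durations) / len(durations)


def poisson_cdf(k, mean):
    """P(D <= k) for D ~ Poisson(mean), with k an integer."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)  # P(D = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i  # P(D = i) obtained from P(D = i - 1)
        total += term
    return total


# Hypothetical discretized durations of the PT hypotheses found in Stage I.
durations = [4, 6, 5, 7, 3]
mean_hat = estimate_mean_duration(durations)
print(mean_hat, poisson_cdf(5, mean_hat))
```

The recursive update of the probability mass avoids computing large factorials directly, which keeps the evaluation numerically stable for moderate values of the mean.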
If the PT has kept the same power level for time immediately after a power level change, the probability that the PT will remain at the same power level during the following discretized time duration can be expressed as
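For intuition, a conditional probability of this kind can be sketched as a ratio of two Poisson survival probabilities. The sketch below assumes the convention P(D > elapsed + ahead | D > elapsed); the exact inequality convention used in the paper's expression may differ, and the names `poisson_sf` and `stay_probability` are hypothetical.

```python
import math


def poisson_sf(k, mean):
    """Survival function P(D > k) for D ~ Poisson(mean), k an integer."""
    if k < 0:
        return 1.0
    term = math.exp(-mean)  # P(D = 0)
    cdf = term
    for i in range(1, k + 1):
        term *= mean / i
        cdf += term
    return max(0.0, 1.0 - cdf)  # clamp tiny negative rounding errors


def stay_probability(elapsed, ahead, mean):
    """Assumed form: P(D > elapsed + ahead | D > elapsed)."""
    denom = poisson_sf(elapsed, mean)
    if denom <= 0.0:
        return 0.0  # the duration has almost surely ended already
    return poisson_sf(elapsed + ahead, mean) / denom


# Hypothetical numbers: mean duration 5 slots, look one slot ahead.
print(stay_probability(2, 1, 5.0), stay_probability(8, 1, 5.0))
```

Under this assumed form, the probability of staying shrinks as the elapsed time grows, which is the property that motivates shorter prediction intervals late in a PT hypothesis.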
IV-C2 PT Power Transition Probability
In Stage I, the observations in the conditionally conjugate DPGMM are infinitely exchangeable. Therefore, the ST cannot infer the PT transmit power level transition probability directly. Instead, we take a pragmatic approach and use the mixing proportion set , obtained by counting the number of observations in each power level, to represent the occupancy frequency of different power levels. Thus, we define an transition probability matrix for the PT transmit power levels. The element of refers to the probability that the PT transitions from the -th to the -th transmit power level, and is given by
If the PT does not transmit with the same power level under two consecutive hypotheses, then . We also define the vector as the -th row of matrix .
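One plausible way to build such a matrix from the mixing proportions is sketched below. It is an assumption made for illustration, not the paper's stated formula: the diagonal is set to zero (no self-transition between consecutive hypotheses) and each row is renormalized over the remaining levels. The function name and the example proportions are hypothetical.

```python
def transition_matrix(mixing):
    """Build a row-stochastic matrix with zero diagonal from mixing proportions.

    Assumed construction: P[i][j] = mixing[j] / (sum(mixing) - mixing[i]) for j != i.
    """
    n = len(mixing)
    if n < 2:
        raise ValueError("need at least two power levels")
    total = sum(mixing)
    matrix = []
    for i in range(n):
        rest = total - mixing[i]  # mass of all levels other than i
        if rest <= 0:
            raise ValueError("level %d leaves no mass for other levels" % i)
        row = [0.0 if j == i else mixing[j] / rest for j in range(n)]
        matrix.append(row)
    return matrix


# Hypothetical mixing proportions for three PT power levels.
P = transition_matrix([0.5, 0.3, 0.2])
for row in P:
    print(row)
```

Each row of the result can then serve as the prior over the next PT power level once the current level is known.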
IV-C3 Estimation Probability Matrix of the PT Power Level
If the PT transmits with binary power levels, the prediction performance of the ST is dictated by the detection and false alarm probabilities. However, this is no longer the case when the PT has multiple transmit power levels. Instead, we define an prediction probability matrix , with the element , representing the probability that the PT is operating under hypothesis while the detection by the ST is in favor of hypothesis . represents that the ST identifies the PT operating in the -th transmit power level following (24). Thus, the element refers to the detection probability for hypothesis . We also define vector as the -th column of matrix .
IV-C4 Benefits of the ST from Prediction and Transmission Actions
We denote the prediction action of the ST at time as , and its observation as . We also denote the transmission action at time as , and its observation as . It is assumed that the positive/negative ACK is returned to the ST through a dedicated feedback channel. Let denote the conditional probability that the PT keeps operating with the -th transmit power level at time given and , where . Based on Bayes' rule, the probability that the PT stays in the -th power level at time can be given as follows.
When , we have
When , we have
where we assume that the positive and negative ACKs from the SR are received by the ST error-free. In addition, denotes the probability that the SR decodes signals correctly when the ST is on the -th level and the PT is on the -th level. For simplicity, we set when and otherwise. In (29) and (30), and , where the superscripts and represent the prediction and transmission, respectively. The expected utility that the ST obtains at time with the PT operating with the -th transmit power level, which is , can be given by
Hereafter, denotes the ST transmit power level, is the PT transmit power level, and . In (31), is the reward that the ST will receive when the ST transmits with the -th power level and . Meanwhile, is the penalty that the ST will receive when the ST transmits with the -th power and .
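The belief update and utility described in this subsection can be illustrated with a simplified sketch. It makes several assumptions that are not confirmed by the text: the duration-dependent term is ignored, the SR is taken to decode correctly only when the ST level equals the PT level, and the reward applies on a match while the penalty applies otherwise. All function names and numbers are hypothetical.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("observation has zero probability under the belief")
    return [w / total for w in weights]


def update_after_prediction(belief, confusion, observed):
    """Bayes update; confusion[i][j] = P(ST decides level j | PT on level i)."""
    return normalize([belief[i] * confusion[i][observed]
                      for i in range(len(belief))])


def update_after_transmission(belief, st_level, ack):
    """Bayes update from an error-free ACK/NACK.

    Assumption: decoding succeeds with probability 1 if the PT level equals
    st_level and with probability 0 otherwise.
    """
    weights = []
    for i, b in enumerate(belief):
        success = 1.0 if i == st_level else 0.0
        weights.append(b * (success if ack else 1.0 - success))
    return normalize(weights)


def expected_transmission_utility(belief, st_level, reward, penalty):
    """Assumed utility: reward on a level match, penalty on a mismatch."""
    p_match = belief[st_level]
    return p_match * reward[st_level] - (1.0 - p_match) * penalty[st_level]


# Hypothetical three-level example.
belief = [0.5, 0.3, 0.2]
confusion = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
posterior = update_after_prediction(belief, confusion, observed=1)
print(posterior)
print(expected_transmission_utility(posterior, 1, [1, 2, 3], [1, 1, 1]))
```

Separating the two update rules mirrors the two kinds of observation in the text: a detected power level after prediction and an ACK or NACK after transmission.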
IV-C5 Prediction-Transmission Structure Optimization
An ST access policy maps the ST belief space to the action space . Thus, the optimal prediction-transmission policy aims to maximize the expected average utility, which can be given by
Note that if and , is not defined in Section IV-C4. This is because the ST keeps transmitting between time and , so no action is taken during this period. Therefore, we assign if and in the action space . The total utility obtained by the ST during each PT hypothesis is i.i.d. Thus, by the law of large numbers, the maximization problem in (32) can be rewritten as
In other words, instead of maximizing the average expected utility, we translate the problem to the maximization of the utility in each PT hypothesis. For convenience, let denote the expected utility that can be achieved in each PT hypothesis following policy , which is
Then, we can define the maximum utility that can be achieved by the ST in each PT hypothesis as
In (35), is directly determined by the action taken at time , thus it can be expressed as
where and are the expected utilities that can be obtained by the ST through prediction and transmission, respectively. We have
Lemma 1: is a convex function of for given and .
The proof of the above lemma follows [45, pp. 58-59] and is omitted here.
In each PT hypothesis with the -th transmit power level, the transmission action will not be taken by the ST after . This is because, , which means the ST will certainly receive negative reward if it transmits after , even if the PT is estimated as idle at time . Therefore, . When it comes to the range of , as follows a Poisson distribution in (25), we find that always holds according to .
Lemma 2: The optimal utility function increases with for given and .
The proof is given in Appendix VI. ∎
and are convex functions increasing with for given and .
We have proved in Lemma 2 that and increase with . Next, to prove their convexity, we derive their second-order derivatives with respect to . Combining (29) and (37), the second-order derivative of can be given by
which is positive as is convex. The second derivative of can be proved positive similarly. Therefore, we complete the proof. ∎
We note that in (37) and