I Introduction
Dynamic spectrum sharing (DSS) has emerged as an effective solution for a smooth transition from 4G to 5G by introducing 5G systems in existing 4G bands without hard/static spectrum refarming [1]. Using DSS, 4G LTE [2] and 5G NR [3] can operate in the same frequency band, where a controller distributes the available spectrum resources dynamically over time between the two radio access technologies (RATs). For instance, in LTE-NR downlink (DL) sharing, the LTE scheduler loans resources to NR for a certain time, and NR avoids the symbols used by LTE for cell-specific signals. Moreover, DSS helps ease the transition from non-standalone to standalone 5G networks. Hence, it is important to investigate an effective scheme for the bandwidth (BW) split between LTE and NR to reap the benefits of DSS.
While some literature has recently studied the problem of spectrum sharing between LTE and WiFi (i.e., LTE-unlicensed) [4], NR and WiFi (i.e., NR-unlicensed) [5], aerial and ground networks [6], and radars and communication systems [7], performance analysis of 4G/5G DSS remains relatively scarce [8]. For instance, an instant spectrum sharing technique at subframe time scale has been proposed in [8]. The proposed scheme takes into account various information about the cell, such as the amount of data in the buffer, and splits the BW between 4G and 5G in every transmission time interval (TTI). Despite the promising results, this work considers a reactive spectrum sharing approach that does not account for future network states, which results in performance degradation. In a proactive approach, by contrast, rather than reactively splitting the BW based on incoming demands and serving them when requested, the network takes future states into account for 4G/5G spectrum sharing, thus improving the overall system-level performance.
The main contribution of this paper is to introduce a novel model-based deep reinforcement learning (RL) algorithm for DSS between LTE and NR. The main scope of the proposed scheme is planning in the time domain, whereby the controller distributes the communication resources dynamically over time and frequency between LTE and NR at a subframe level, while accounting for future network states over a specific time horizon. To enable efficient planning, we propose a deep RL technique based on Monte Carlo Tree Search (MCTS) [9]. When a model of the environment is available, algorithms like AlphaZero [10] have been used with great success. However, in the case of DSS, the LTE and NR schedulers are part of the environment, and these are not easily modelled. Inspired by the MuZero work [11], we use a learned model of the environment for planning in the time domain. When applied iteratively, the proposed solution predicts the quantities most directly relevant to planning, i.e., the reward, the action probabilities, and the value for each state. This in turn enables the controller to predict a sequence of future states of the wireless network by simulating hypothetical communication resource assignments over time, starting from the current network state, and evaluating a reward function for each hypothetical assignment over the time window. As such, the communication resources in the current subframe are assigned based on the simulated hypothetical BW split action associated with the maximum reward over the time window. To the best of our knowledge, this is the first work that exploits the framework of deep RL for DSS between 4G and 5G systems. Simulation results show that the proposed approach improves quality of service in terms of latency.
Results also show that the proposed algorithm yields gains in different scenarios by accounting for several features while planning in the time domain, such as multimedia broadcast single frequency network (MBSFN) subframes and diverse user service requirements.
II System Model
Consider the downlink of a wireless cellular system composed of a co-located cell operating over NR and LTE and serving a set of users. NR and LTE are assumed to operate in the 3.5 GHz frequency band and apply FDD as the duplexing method. We consider a 15 kHz NR numerology, so that LTE and NR subframes are aligned in time and frequency. Each RAT serves its own set of UEs. The total system bandwidth is divided into a set of resource blocks (RBs); each RAT is allocated a subset of these RBs, and each UE is allocated a set of RBs by its serving RAT.
For an efficient spectrum sharing model of LTE and NR, one must design a mechanism for dividing the available bandwidth between data and control transmission for each of the RATs. For the control region, we consider the following:
- LTE PDCCH is restricted to symbols #0 and #1 (if NR PDCCH is present).
- NR has no signals/channels in symbols #0 and #1.
- NR PDCCH is limited to symbol #2, assuming that the UE only supports type-A scheduling (no mini-slots).
- In LTE subframes where no NR PDCCH is transmitted in the overlapped NR slots, the LTE PDCCH can span 3 symbols.
For data transmission, a controller decides on the resource split between NR and LTE every subframe.
II-A Channel Model
We assume the 3GPP Urban Macro propagation model [12] with Rayleigh fading. The path loss (in dB) between a UE and its serving BS is given by Model 1 of [13], considering the 3.5 GHz frequency band:

PL(d) = 15.3 + 37.6 log10(d), (1)

where d is the distance between the UE and the BS in meters. The signal-to-noise ratio (SNR) of the link between UE u and its serving RAT r over RB b will be:

γ_{u,r,b} = (P_{r,b} g_{u,r,b}) / (N_0 B_RB), (2)

where P_{r,b} is the transmit power of RAT r to its UE over RB b. Here, the total transmit power of RAT r is assumed to be distributed uniformly among all of its allocated RBs. g_{u,r,b} = |h_{u,r,b}|^2 · 10^{−PL(d)/10} is the channel gain between UE u and RAT r on RB b, where h_{u,r,b} is the Rayleigh fading complex channel coefficient. N_0 is the noise power spectral density and B_RB is the bandwidth of an RB. Therefore, the achievable data rate of UE u associated with RAT r can be defined as:

R_{u,r} = Σ_{b ∈ B_{u,r}} B_RB log2(1 + γ_{u,r,b}), (3)

where B_{u,r} is the set of RBs allocated to UE u by RAT r.
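To make the link-budget computation concrete, the sketch below evaluates (2) and (3) in Python under the uniform per-RB power allocation of the system model. The function names, the example noise density in the usage note, and the 180 kHz default RB bandwidth are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

def snr_per_rb(p_rb_w, gain, n0_w_per_hz, b_rb_hz=180e3):
    """SNR of one UE-BS link on a single resource block (cf. Eq. (2)).

    p_rb_w:       transmit power on the RB, in watts
    gain:         channel gain |h|^2 * 10^(-PL/10), combining Rayleigh
                  fading and path loss
    n0_w_per_hz:  noise power spectral density
    b_rb_hz:      RB bandwidth (180 kHz for a 15 kHz numerology)
    """
    return (p_rb_w * gain) / (n0_w_per_hz * b_rb_hz)

def achievable_rate(p_total_w, gains, n0_w_per_hz, b_rb_hz=180e3):
    """Shannon rate (cf. Eq. (3)) summed over the RBs allocated to a UE,
    with the RAT's total power split uniformly across those RBs."""
    p_rb = p_total_w / len(gains)          # uniform power per RB
    return sum(b_rb_hz * np.log2(1.0 + snr_per_rb(p_rb, g, n0_w_per_hz, b_rb_hz))
               for g in gains)
```

For example, `achievable_rate(1.0, [1e-9, 2e-9], 4e-21)` returns the rate in bit/s of a UE holding two RBs with the given channel gains at roughly thermal noise density.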
II-B Traffic Model
We assume periodic traffic arrivals per UE, with a fixed periodicity and a fixed packet size. Time-domain scheduling is typically governed by a scheduling weight, whereby a high weight corresponds to a high priority for scheduling that particular UE. We adopt a similar mechanism for measuring the quality of bandwidth splits between LTE and NR, where a UE not fulfilling its QoS is associated with a high weight. The weight w_u(t) for user u in subframe t can be calculated as:

w_u(t) = ε + Ω_u · 1[τ_u(t) > D_u], (4)

when user u has data in its buffer, and w_u(t) = 0 otherwise, where τ_u(t) is the time the oldest packet of user u has been waiting in the buffer, D_u and Ω_u correspond to the step delay and step weight of the delay weight function of user u, respectively, and ε is a small positive factor that makes the weight non-zero whenever there is data in the buffer. Note that a UE with zero weight will not be scheduled. Here, the step delay D_u corresponds to the maximum delay that can be tolerated while maintaining QoS: if a packet remains in the buffer for longer than D_u, the weight for user u increases by Ω_u.
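The weight computation above can be sketched as a small Python helper; the function name and the default value of the small factor (here `eps=1e-3`) are illustrative assumptions.

```python
def ue_weight(oldest_packet_delay_ms, buffer_bits,
              step_delay_ms, step_weight, eps=1e-3):
    """Scheduling weight of one UE in a subframe (cf. Eq. (4)).

    Zero when the buffer is empty; a small positive value eps while
    data waits within the delay budget; eps + step_weight once the
    oldest packet has waited longer than the step delay.
    """
    if buffer_bits == 0:
        return 0.0
    w = eps
    if oldest_packet_delay_ms > step_delay_ms:
        w += step_weight
    return w
```

A UE with an empty buffer thus keeps weight 0 and is never scheduled, while a UE whose oldest packet exceeds its step delay jumps to a high weight.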
Given this system model, next, we develop an effective spectrum sharing scheme that can allocate the appropriate bandwidth to each RAT, at a subframe time scale, while accounting for future network states.
III Deep Reinforcement Learning for Dynamic Spectrum Sharing
In this section, we propose a proactive approach for DSS that enables LTE and NR to operate on the same BW simultaneously. In this regard, we propose a deep RL framework that enables the controller to learn the BW split between LTE and NR in each subframe while accounting for future network states over a time window. To realize this, we first present the adopted RL algorithm for training the controller to learn the optimal BW split policy. Then, we introduce the RL architecture and components for the DSS problem.
III-A Deep RL Algorithm
To enable a proactive BW split between LTE and NR, we adopt in this paper the MuZero algorithm [11]. One of the main challenges of the proposed solution technique is that planning requires a model of the individual LTE and NR schedulers, which is hard to devise. Instead, we propose to learn the scheduling dynamics via a model-based reinforcement learning algorithm that addresses this issue by simultaneously learning a model of the environment’s dynamics and planning with respect to the learned model [11]. This approach is more data efficient than model-free methods, where current state-of-the-art algorithms may require millions of samples before any near-optimal policy is learned.
During the training phase of the proposed algorithm, the prediction comprises performing an MCTS over the action space and over the time window to find the sequence of actions that maximizes the reward function. MCTS iteratively explores the action space, gradually biasing the exploration towards regions of states and actions where an optimal policy might exist. To enable our model to learn the best explored sequence of actions for each network state, we define three neural networks: the representation function (h), the dynamics function (g), and the prediction function (f). The role of each of these neural networks in the proposed algorithm is as follows:

- The representation function (h) encodes the observation in subframe t into an initial hidden state (s_0).
- The dynamics function (g) computes a new hidden state (s_{k+1}) and reward (r_{k+1}) given the current hidden state (s_k) and an action (a_{k+1}).
- The prediction function (f) outputs a policy (p_k) and a value (v_k) from a hidden state (s_k).
During the training phase, the model predicts the quantities most directly relevant to planning, i.e., the reward, the action probabilities, and the value for each state. The proposed training algorithm is summarized in Algorithm 1 and its main steps are as follows:

- Step 1: The model receives the observation of the network state as an input and transforms it into a hidden state (s_0) via the representation function (h).
- Step 2: The prediction function (f) is then used to predict the value v_k and policy vector p_k for the current hidden state s_k.
- Step 3: The hidden state is then updated iteratively to a next hidden state s_{k+1} by a recurrent process of K steps, using the dynamics function (g), whose inputs are the previous hidden state s_k and a hypothetical next action a_{k+1}, i.e., a communication resource assignment selected from the action space of allowable bandwidth splits between LTE and NR.
- Step 4: Having defined the policy, reward, and value targets, the representation (h), dynamics (g), and prediction (f) functions are trained jointly, end-to-end, by backpropagation through time (BPTT).
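The four steps above can be illustrated with a minimal NumPy sketch of the unroll, using the MuZero-style h/g/f decomposition. The layer widths and the single random dense layer standing in for each trained network are placeholders, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, HID, ACT = 64, 10, 4   # hypothetical sizes; actual widths are in Fig. 2

# One random dense layer per function, standing in for the trained networks.
W_h = rng.standard_normal((HID, OBS)) * 0.1            # representation h
W_g = rng.standard_normal((HID + 1, HID + ACT)) * 0.1  # dynamics g
W_f = rng.standard_normal((ACT + 1, HID)) * 0.1        # prediction f

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict(s):                       # f(s) -> (policy, value)
    y = W_f @ s
    return softmax(y[:ACT]), y[ACT]

def dynamics(s, a):                   # g(s, a) -> (next state, reward)
    y = W_g @ np.concatenate([s, a])
    return np.tanh(y[:HID]), y[HID]

def unroll(obs, actions):
    """Encode the observation (step 1), then roll the learned model
    forward through K hypothetical one-hot BW-split actions, collecting
    the predicted policy, value and reward at each step (steps 2-3)."""
    s = np.tanh(W_h @ obs)            # step 1: s_0 = h(o_t)
    out = []
    for a in actions:
        policy, value = predict(s)    # step 2
        s, reward = dynamics(s, a)    # step 3
        out.append((policy, value, reward))
    return out  # step 4: fit these predictions to MCTS targets via BPTT
```

In training, the returned (policy, value, reward) triples would be matched against the MCTS visit distribution, the observed return, and the observed reward, and the three weight matrices updated jointly.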
Meanwhile, the testing algorithm refers to the actual execution of the algorithm after the weights of h, g, and f have been optimized, and is implemented for execution at run time. Given that DSS is performed on a 1 ms basis, it is too demanding to run MCTS online. As such, we use only the representation (h) and prediction (f) functions at test time. The main steps performed by the controller at test time are summarized in Algorithm 2.
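Test-time execution then reduces to a single forward pass. The helper below is a hypothetical sketch of this greedy run-time choice, with the trained representation and prediction functions passed in as callables.

```python
import numpy as np

def act(obs, repr_fn, pred_fn):
    """Run-time BW split: one pass through the representation and
    prediction functions -- no MCTS, which would not fit in the 1 ms
    subframe budget -- then pick the bandwidth-split action with the
    highest policy probability."""
    s = repr_fn(obs)              # h: observation -> hidden state
    policy, _value = pred_fn(s)   # f: hidden state -> (policy, value)
    return int(np.argmax(policy))
```

The controller calls `act` once per subframe and forwards the chosen split to the LTE and NR schedulers.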
III-B Deep RL Components
In this subsection, we define the RL framework components, namely the observations, actions, and rewards.
- Action space: the BW split between LTE and NR for DL transmission in subframe t, denoted as a_t ∈ A, where |A| is the size of the action space. Here, an action corresponds to a horizontal line splitting the BW, with one side allocated to LTE and the other to NR. The possible BW splits are chosen by grouping multiple RBs, thereby resulting in a quantized action set. This in turn reduces the size of the action space and is valid because the gain between bandwidth splits that differ by a single RB is negligible.
- Observation: the observation for subframe t, denoted as o_t, is divided into two parts, where the first part consists of per-user components of size U×1 and the second part consists of components of size U×T, where U is the number of users and T is the time window consisting of a set of future subframes. The observation components are summarized as follows:
  - NR support: a vector with U×1 elements indicating whether each user is an NR user or not.
  - Buffer state: a vector with U×1 elements containing the number of bits in the buffer of each user.
  - MBSFN subframe: a matrix with U×T elements indicating, for each subframe in the time window, whether a UE is configured with MBSFN or not. By configuring LTE UEs with MBSFN subframes, some broadcast signalling can be avoided at the cost of decreased scheduling flexibility.
  - Predicted number of bits per PRB and TTI for each UE: a matrix with U×T elements, where each element contains the estimated average number of bits that can be transmitted for a user in a given subframe, taking into account that user's estimated channel quality during that subframe.
  - Predicted packet arrivals: a matrix with U×T elements indicating the number of bits that will arrive in the buffer of each user over the set of future subframes.
- Reward function: the reward is modelled via the exponential of the delay weights across users, each driven by that user's most delayed packet, and can be expressed as:

r(t) = exp( −Σ_u w_u(t) ), (5)

where w_u(t) is the delay weight of user u in subframe t, as described in (4). The intuition behind this reward function is that a high total weight is penalized with a low reward in subframe t. Meanwhile, if the controller manages to keep the user buffers empty, the reward per subframe will be one. If a highly prioritized UE is queued for several subframes, its weight will increase and the reward will approach zero.
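A minimal Python version of this per-subframe reward, under the exponential form of (5); the function name is illustrative.

```python
import math

def subframe_reward(ue_weights):
    """Per-subframe reward (cf. Eq. (5)): equals 1 when every buffer
    is empty (all delay weights zero) and decays toward 0 as the total
    delay weight across users grows."""
    return math.exp(-sum(ue_weights))
```

For instance, two users with empty buffers give a reward of 1, while a single heavily delayed user drives the reward toward zero.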

Figure 1 summarizes the relationship between the network state, controller, and LTE and NR schedulers. At each subframe, the LTE scheduler, NR scheduler, and controller receive the network state information. This information is then used by the controller to generate observations and thus take an action for the BW split between LTE and NR. This action is then conveyed to the LTE and NR schedulers. Given the network state information and the corresponding BW split, each of the schedulers allocates their respective users to the corresponding BW portion for the current subframe. Finally, the weights for the users are fed to the controller and used as an input for the calculation of the reward. Next, we provide simulation results and analysis for the proposed RL framework.
Table I: Simulation parameters.

Parameter | Value
---|---
Frequency | 3.5 GHz
Bandwidth | 25 PRBs (5 MHz)
Traffic model | Periodic
UE speed | 3 m/s
Transmit power | 0.8 W/PRB
Noise power (N_0) | −112.5 dBm/PRB
Antenna config | 1 Tx, 2 Rx
IV Simulation Results and Analysis
In this section, we provide simulation results and analysis for the performance of the proposed algorithm under four different scenarios where planning in the time domain for dynamic spectrum sharing is relevant. Tables I and II provide a summary of the main simulation parameters.

The structure of the representation, dynamics, and prediction neural networks is depicted in Figure 2. All dense layers except for the output layers use 64 units with ReLU activation. The representation output uses 10 units. The reward and value outputs are scalars with linear activation, and the policy output has one unit per action.

Table II: Learning parameters.

Parameter | Value
---|---
Number of MCTS simulations | 64
Episode length | 16 subframes
Discount factor | 0.99
Window size (T) | 10 subframes
Batch size | 32 examples
Number of unroll steps (K) | 3
Number of TD steps | 16
Optimizer | Adam
Learning rate |
Number of episodes per iteration | 100
Representation size | 10
Next, we provide a detailed description of the simulation results and analysis for each of the four studied scenarios. Note that in all of the scenarios, the episode length is 16 subframes, and thus the evaluation score for a perfectly solved scenario is also 16. Moreover, we assume that LTE users (if any) are scheduled on the lower part of the spectrum band and NR users (if any) on the upper part. As a baseline, we split the available spectrum between the LTE and NR users proportionally to the number of RBs each requires, as summarized in Algorithm 3. We also compare the performance of the proposed algorithm to an equal BW split and to alternating the BW between LTE and NR. The user weight is calculated using Eq. (4), with the same small factor and step weight for all users. The step delay is set appropriately for the different users in the different scenarios, as specified below.
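The proportional baseline can be sketched as follows; the function name, the `bits_per_rb` normalization of per-RAT demand, and the idle-case fallback are illustrative assumptions about Algorithm 3's details.

```python
def proportional_split(lte_demand_bits, nr_demand_bits, n_rbs,
                       bits_per_rb=1.0):
    """Baseline BW split: allocate RBs proportionally to the number of
    RBs each RAT needs, with LTE on the lower part of the band.
    Returns the number of RBs given to LTE (the rest go to NR)."""
    lte_rbs = lte_demand_bits / bits_per_rb
    nr_rbs = nr_demand_bits / bits_per_rb
    total = lte_rbs + nr_rbs
    if total == 0:
        return n_rbs // 2               # idle cell: arbitrary even split
    lte_share = round(n_rbs * lte_rbs / total)
    return min(max(lte_share, 0), n_rbs)
```

With the 25-PRB band of Table I, a subframe where only NR has data thus hands the whole band to NR, and equal demands split it roughly in half.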
IV-A Scenario 1: MBSFN subframes
LTE requires cell-specific reference signals (CRSes) to enable demodulation of data. Therefore, if only NR UEs are scheduled, the CRSes are not needed and are pure overhead. If there is a lot of NR traffic to be scheduled, LTE can be configured with so-called MBSFN subframes. In these subframes, no CRSes are transmitted; it is therefore not possible to schedule LTE users, but this can improve efficiency for NR users. This scenario investigates whether the controller can learn to account for MBSFN subframes during planning, enabling time-critical LTE traffic to be served before the MBSFN subframes.
IV-A1 Scenario description
We consider two users, one NR user and one LTE user, both having a traffic arrival periodicity of 4 ms and a step delay of 3 ms. The packet sizes are 45000 bits and 15000 bits for the NR and LTE users, respectively. The system is configured with a repeating MBSFN pattern with a periodicity of 4 subframes, where the first two subframes are non-MBSFN (i.e., both LTE and NR UEs can be scheduled) and the last two subframes in the pattern are MBSFN subframes (i.e., only NR UEs can be scheduled).
IV-A2 Optimal bandwidth split
To solve this scenario optimally, both packets must be served within 3 ms. As such, the LTE user should be served in the non-MBSFN subframes to make resources available for the NR user later in the cycle. Therefore, the optimal strategy is to start scheduling LTE such that its buffer is emptied before the MBSFN subframes.
IV-A3 Results and analysis
From Figure 3, we can see that the proposed algorithm converges to the optimal strategy in 12 iterations. Note also that the performance of the proposed scheme exceeds that of an equal bandwidth split between LTE and NR, as well as that of allocating the MBSFN subframes to the NR user and the non-MBSFN subframes to the LTE user. With this MBSFN configuration, the overhead due to, e.g., LTE CRSes can be minimized, which improves efficiency at the network level. The controller learns to account for the MBSFN subframes by scheduling in a way that maximizes the quality of service despite the reduced scheduling flexibility the MBSFN subframes impose.

IV-B Scenario 2: Periodic high interference
In this scenario, we investigate the controller’s ability to learn to account for future high interference on one of the users during planning. Periodic high interference can, for instance, occur in case a user is at the cell edge and is interfered by another base station or in case of unsynchronized time division duplexing scenarios.
IV-B1 Scenario description
We consider two users, one NR user and one LTE user, both having a traffic arrival periodicity of 2 ms. We assume a larger packet size for the NR user compared to that of the LTE user, so that we can observe the gain of NR benefiting from the 2 extra symbols of the LTE PDCCH if it is allocated the full bandwidth. Users have a small weight value when the delay is less than 2 ms, after which it increases abruptly to 2. Moreover, periodic high interference is observed on the LTE user every 3 subframes. Here, the periodic interference term is added artificially for analysis purposes.
IV-B2 Optimal bandwidth split
The optimal strategy for this scenario is to allocate the full bandwidth to NR during subframes with high interference on the LTE user.
IV-B3 Results and analysis
From Figure 4, we can see that the proposed algorithm converges to the optimal strategy in 18 and 28 iterations for action spaces of size 2 and 3, respectively. The proposed approach outperforms the baseline algorithm, the equal bandwidth split, and the alternating bandwidth split: the controller learns to allocate the full bandwidth to NR during subframes with high interference on the LTE user, as opposed to taking actions based on buffer status only. This allows the controller to split the bandwidth between LTE and NR such that the impact of the interference from neighboring cells is reduced, thus improving system-level performance.
IV-C Scenario 3: Mixed services
In this scenario, we investigate the controller’s ability to handle users with different delay requirements.
IV-C1 Scenario description
We consider two users: one high-priority NR user and one low-priority LTE user, each with 90000 bits in its buffer. Data arrives in subframe 1 for both users.
IV-C2 Optimal bandwidth split
The optimal strategy for this scenario is to postpone the scheduling of the low-priority LTE user in order to allow the high-priority NR user to be scheduled. When the buffer of the high-priority user is emptied, the controller can schedule the LTE user.
IV-C3 Results and analysis
From Figure 5, we can see that the proposed approach converges to the optimal policy within 5 iterations. The controller learns to prioritize the NR user with a tight delay requirement over the LTE user thus outperforming the baseline algorithm as well as the equal BW split.

IV-D Scenario 4: Time multiplexing
In this scenario, we investigate the controller’s ability to learn to do time multiplexing (as opposed to frequency multiplexing) between LTE and NR. Time multiplexing can result in two extra symbols for NR when no LTE is scheduled due to the fact that no LTE PDCCH needs to be transmitted. This in turn results in an increased efficiency when the RATs are scheduled in a time multiplexed fashion.
IV-D1 Scenario description
We consider two users, one NR user and one LTE user, both having a traffic arrival periodicity of 2 ms. The packet size of the NR user (14000 bits) is larger than that of the LTE user (10000 bits). Users have a small weight when the delay is less than 2 ms, after which it increases abruptly to 5.
IV-D2 Optimal bandwidth split
The optimal strategy for this scenario is to perform time multiplexing, whereby the full bandwidth is allocated to one RAT every other subframe. As such, NR can benefit from the 2 extra symbols of the LTE PDCCH when it is given the full bandwidth. This results in a larger transport block size, so the large NR packet can be served within one subframe.
IV-D3 Results and analysis
From Figure 6, we can see that the proposed approach converges to the optimal strategy within 14 and 15 iterations for action spaces of size 3 and 4, respectively. The proposed approach outperforms the baseline algorithm, the equal BW split, and the alternating BW split: the network learns to perform time multiplexing between LTE and NR, resulting in increased spectrum efficiency. For the studied scenario, when NR is scheduled alone, i.e., without overhead from the LTE PDCCH, the maximum transport block size is 14112 bits. When LTE is scheduled together with NR, the extra LTE PDCCH overhead reduces the maximum transport block size to 12576 bits. Consequently, the 14000-bit NR packet can be scheduled in one subframe only if NR is scheduled alone during that subframe.
V Conclusion
In this paper, we have proposed a novel AI planning framework for dynamic spectrum sharing of LTE and NR. Results have shown that the controller can split the bandwidth between LTE and NR in an intelligent way while accounting for future network states, such as MBSFN subframes and high interference level, thus resulting in an improved system level performance. This gain comes from the fact that the proposed algorithm uses knowledge (or beliefs) about future network states to make decisions that perform well on a longer timescale rather than being greedy in the current subframe. As part of future work, we aim to further investigate if the suggested algorithm can learn to account for uncertainties in the observations.
References
- [1] Ericsson, “Sharing for the best performance - stay ahead of the game with ericsson spectrum sharing,” Ericsson white paper, 2019.
- [2] E. Dahlman, S. Parkvall, and J. Skold, 4G: LTE/LTE-Advanced for Mobile Broadband, 1st ed. USA: Academic Press, Inc., 2011.
- [3] E. Dahlman, S. Parkvall, and J. Skold, 5G NR: The Next Generation Wireless Access Technology, 1st ed. USA: Academic Press, Inc., 2018.
- [4] U. Challita, L. Dong, and W. Saad, “Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective,” IEEE Transactions on Wireless Communications, vol. 17, no. 7, pp. 4674–4689, July 2018.
- [5] 3GPP TR 38.889, “Study on NR-based access to unlicensed spectrum.”
- [6] U. Challita, W. Saad, and C. Bettstetter, “Interference management for cellular-connected UAVs: A deep reinforcement learning approach,” IEEE Transactions on Wireless Communications, vol. 18, no. 4, pp. 2125–2140, March 2019.
- [7] A. Khawar, A. Abdelhadi, and C. Clancy, Spectrum Sharing Between Radars and Communication Systems. Springer, 2018.
- [8] S. Kinney, “Dynamic spectrum sharing vs. static spectrum sharing,” RCR wireless, March 2020.
- [9] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A survey of monte carlo tree search methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012.
- [10] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, “Mastering chess and shogi by self-play with a general reinforcement learning algorithm,” arXiv:1712.01815, Dec. 2017.
- [11] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, “Mastering atari, go, chess and shogi by planning with a learned model,” arXiv:1911.08265, Nov. 2019.
- [12] 3GPP, “Study on channel model for frequencies from 0.5 GHz to 100 GHz,” 3GPP TR 38.901, V15.0.0, June 2018.
- [13] 3GPP, “Evolved Universal Terrestrial Radio Access (E-UTRA); Further advancements for E-UTRA physical layer aspects,” 3GPP TR 36.814, version 9.2.0, March 2017.