I Introduction
To enable gigabit wireless access with reliable communication, a number of candidate solutions are currently being investigated for 5G: 1) higher frequency spectrum, e.g., millimeter wave (mmWave); 2) advanced spectral-efficient techniques, e.g., massive multiple-input multiple-output (MIMO); and 3) ultra-dense small cells [1]. This work explores the above techniques to enhance wireless access [1, 2, 3]. Massive MIMO yields remarkable properties such as a high signal-to-interference-plus-noise ratio due to large antenna gains, and an extreme spatial multiplexing gain [3, 4]. In particular, mmWave frequency bands offer huge bandwidth [5], while also allowing a massive number of antennas to be packed for highly directional beamforming [5]. A peculiarity of mmWave is that its links are very sensitive to blockage, which gives rise to unstable connectivity and unreliable communication [5]. To overcome this challenge, we leverage principles of risk-sensitive reinforcement learning (RSL) and exploit the multiple-antenna diversity and higher bandwidth to optimize transmission and achieve gigabit data rates, while accounting for the sensitivity of mmWave links to provide ultra-reliable communication (URC). The prime motivation behind using RSL stems from the fact that the risk-sensitive utility function to be optimized is a function of not only the average but also the variance [6], and thus it captures the tail of the rate distribution to enable URC. Moreover, our proposed algorithm is fully distributed and does not require full network observation, so the cost of channel estimation and signaling synchronization is reduced. Via numerical experiments, we showcase the inherent key tradeoffs between (i) reliability/data rates and network density, and (ii) availability and network density.

Related work: In [7, 8], the authors provided the principles of ultra-reliable and low latency communication (URLLC) and described some techniques to support URLLC. Recently, the problem of low latency communication [9] and URLLC [10, 11] for 5G mmWave networks was studied to evaluate the performance under the impact of traffic dispersion and network densification. Moreover, a reinforcement learning (RL) approach to power control and rate adaptation was studied in [12]. All these works focus on maximizing the time average of network throughput or minimizing the mean delay without providing any guarantees for higher order moments (e.g., variance, skewness, kurtosis, etc.). In this work, we depart from the classical average-based system design and instead take into account higher order moments in the utility function to formulate an RSL framework through which every small cell optimizes its transmission while taking into account signal fluctuations.
II System Model
Let us consider a mmWave downlink (DL) transmission of a small cell network consisting of a set of small cells (SCs), and a set of user equipments (UEs) equipped with antennas. We assume that each SC is equipped with a large number of antennas to exploit massive MIMO gain and adopt a hybrid beamforming architecture [13], and assume that . Without loss of generality, one UE per SC is considered (Footnote 1: For the multiple-UE case, additional channel estimation and user scheduling need to be considered; one example was studied in [3]). The data traffic is generated from SC to UE via mmWave communication. A co-channel time-division duplexing protocol is considered, in which the DL channel can be obtained via the uplink training phase.
Each SC adopts the hybrid beamforming architecture, which enjoys both analog and digital beamforming techniques [13]. Let and denote the analog transmitter and receiver beamforming gains at the SC and UE , respectively. In addition, we use and to represent the angles deviating from the strongest path between the SC and UE . Also, let and denote the beamwidth at the SC and UE, respectively. We denote as a vector of the transmitter beamwidth of all SCs. We adopt the widely used antenna radiation pattern model [13] to determine the analog beamforming gain as
(1)
where is the side lobe gain.
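As a hedged illustration of the gain model referenced in (1), the sketch below assumes the ideal sectored antenna pattern that is common in mmWave analyses: a constant main-lobe gain inside the beamwidth and a constant side-lobe gain outside it, normalized so that the gain averaged over all directions is one. The exact form used in (1) may differ, and the names `sectored_gain`, `theta`, `omega`, and `side_lobe` are hypothetical.

```python
import math


def sectored_gain(theta: float, omega: float, side_lobe: float) -> float:
    """Analog beamforming gain under an assumed ideal sectored pattern.

    theta     -- angle deviating from the strongest path (radians)
    omega     -- beamwidth (radians), 0 < omega <= 2*pi
    side_lobe -- side-lobe gain, 0 <= side_lobe < 1
    """
    if not 0.0 < omega <= 2.0 * math.pi:
        raise ValueError("beamwidth must lie in (0, 2*pi]")
    if not 0.0 <= side_lobe < 1.0:
        raise ValueError("side-lobe gain must lie in [0, 1)")
    if abs(theta) <= omega / 2.0:
        # Main lobe: chosen so that the gain integrates to 2*pi over the circle.
        return (2.0 * math.pi - (2.0 * math.pi - omega) * side_lobe) / omega
    # Outside the main lobe only the side-lobe gain remains.
    return side_lobe


narrow = sectored_gain(0.0, 0.2, 0.1)   # aligned, narrow beam
wide = sectored_gain(0.0, 0.4, 0.1)     # aligned, wider beam
off_axis = sectored_gain(1.0, 0.2, 0.1) # misaligned by 1 rad
```

Under this assumed model, halving the beamwidth roughly doubles the main-lobe gain, which is why the beamwidth is a natural optimization variable.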
Let denote the channel state from the SC to UE . We assume a time-varying channel state described by a Markov chain with states, i.e., for each . Considering imperfect channel state information (CSI), the estimated channel state between the SC and UE is modeled as in [10], where is the spatial channel correlation matrix that accounts for path loss and shadow fading. Here, is the small-scale fading channel matrix, modeled as a random matrix with zero mean and variance of . Here, reflects the estimation accuracy for UE ; if , we assume perfect channel state information. In addition, is the estimated noise vector, also modeled as a random matrix with zero mean and variance of . We denote as the network state.

By applying a linear precoding scheme [13], i.e., for the conjugate precoding, the achievable rate (Footnote 2: Note that we omit the beam search/track time, since it is short compared with the transmission time [14]. We assume that each BS sends a single stream to its users via the main beams.) of UE from SC can be calculated as
where and are the transmit powers of SC and SC , respectively. In addition, W denotes the system bandwidth of the mmWave frequency band. The thermal noise of user served by SC is . Here, we denote as the maximum transmit power of SC and as the transmit power vector.
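To make the structure of the rate expression concrete, the following is a minimal sketch that assumes the usual Shannon form, bandwidth times log2(1 + SINR), with the serving SC's power in the numerator and the other SCs' powers plus thermal noise in the denominator. The effective gains here lump together the transmit and receive beamforming gains and the precoded channel gain; this lumping, and the names `achievable_rate` and `interferers`, are simplifying assumptions rather than the exact expression above.

```python
import math


def achievable_rate(bandwidth_hz, p_signal, g_signal, interferers, noise_power):
    """Shannon-type rate in bit/s under an assumed SINR model.

    bandwidth_hz -- system bandwidth W in Hz
    p_signal     -- transmit power of the serving SC (linear scale)
    g_signal     -- effective gain of the serving link (beamforming x channel)
    interferers  -- list of (power, effective gain) pairs for the other SCs
    noise_power  -- thermal noise power at the UE (linear scale)
    """
    interference = sum(power * gain for power, gain in interferers)
    sinr = p_signal * g_signal / (interference + noise_power)
    return bandwidth_hz * math.log2(1.0 + sinr)


# With SINR = 1 and no interference, the rate equals the bandwidth.
isolated = achievable_rate(1e9, 1.0, 1.0, [], 1.0)
# One co-channel interferer lowers the rate of the same link.
interfered = achievable_rate(1e9, 1.0, 1.0, [(1.0, 0.5)], 1.0)
```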
III Problem Formulation
We model a decentralized optimization problem and harness tools from RSL to solve it, whereby SCs autonomously respond to the network states based on historical data. Let us consider a joint optimization of the transmitter beamwidth (Footnote 3: As studied in [13], for , the problem of selecting the beamwidth for the transmitter and receiver can be handled by adjusting the transmitter beamwidth with a fixed receiver beamwidth.) and transmit power allocation . We denote , which takes values in , where . Assume that each SC selects its beamwidth and transmit power drawn from a given probability distribution in which is the cardinality of the set of all combinations , i.e., . For each and , the mixed-strategy probability is defined as
(2)
We denote , in which is the set of all possible probability mass functions (PMFs). Let denote the instantaneous rates, in which . Let denote the rate region, which is defined as the convex hull of the rates [15], i.e., . Inspired by RSL [6], we consider the following utility function:
(3)
where the parameter denotes the desired risk-sensitivity, which penalizes the variability [6], and the operator denotes the expectation operation.
Remark 1
The Taylor expansion of the utility function given in (3) yields
Remark 1 shows that the utility function (3) accounts for both the mean and the variance (Var) of the mmWave links. We formulate the following distributed optimization problem for every SC:
(4a)  
subject to  (4b) 
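The mean-variance reading of the utility in Remark 1 can be checked numerically. The sketch below assumes the utility takes the exponential (entropic) form (1/mu) * log E[exp(mu * r)] that is standard in risk-sensitive control, with a negative risk-sensitivity parameter mu for a risk-averse agent; this form, the Gaussian rate samples, and the function names are assumptions made for illustration only.

```python
import math
import random


def entropic_utility(rates, mu):
    """(1/mu) * log(mean(exp(mu * r))), computed with a log-sum-exp shift."""
    shift = max(mu * r for r in rates)
    mean_exp = sum(math.exp(mu * r - shift) for r in rates) / len(rates)
    return (shift + math.log(mean_exp)) / mu


def mean_variance(rates, mu):
    """Second-order Taylor approximation: mean + (mu / 2) * variance."""
    n = len(rates)
    mean = sum(rates) / n
    var = sum((r - mean) ** 2 for r in rates) / n
    return mean + 0.5 * mu * var


random.seed(0)
# Hypothetical rate samples (e.g., in Gbps); not taken from the experiments.
rates = [random.gauss(2.0, 0.5) for _ in range(10000)]
mu = -0.1  # negative value: variability is penalized

exact = entropic_utility(rates, mu)
approx = mean_variance(rates, mu)
sample_mean = sum(rates) / len(rates)
```

For small |mu| the two values agree closely, and for negative mu the utility lies below the sample mean, so two links with equal average rate are ranked by how little their rate fluctuates.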
IV Proposed Algorithm
In Fig. 1, each SC acts as an agent that selects an action to maximize a long-term reward based on user feedback and the probability distribution for each action. The action is defined as the selection of , while the long-term utility in (4) is the reward, and the environment here contains the network state. To this end, we build the probability distribution for every action and provide an RL procedure to solve (4).
We denote as a utility function of SC when selecting . Here, denotes the composite variable of other agents’ actions excluding SC . From (3), the utility of SC at time slot , i.e., , is rewritten as
(5) 
where is the instantaneous rate of SC when choosing with probability .
Remark 2
For a small , (3) is approximated via the Taylor approximation (Footnote 4: For a small , the Taylor approximation of is .) of around as
(6)  
(7) 
where (7) is obtained by expanding the time average of (6). Each SC determines from based on the probability distribution from the previous stage , i.e.,
(8) 
We introduce the Boltzmann-Gibbs distribution, , to capture the tradeoff between exploitation and exploration, given by
(9)
where is the utility vector of SC for , and the tradeoff factor is used to balance between exploration and exploitation. If is small, the SC selects with the highest payoff. For , all decisions have an equal chance.
For a given and , we solve (9) to find the probability distribution. By adopting the notion of logit equilibrium [16], we have
(10)
where . Finally, we propose two coupled RL processes that run in parallel and allow SCs to decide their optimal strategies at each time instant as follows [16].
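The role of the tradeoff factor can be illustrated with a short sketch. It assumes the usual logit (softmax) closed form of the Boltzmann-Gibbs distribution, in which each action's probability is proportional to the exponential of its estimated utility divided by the tradeoff factor; the names `boltzmann_gibbs`, `utilities`, and `kappa` are hypothetical.

```python
import math


def boltzmann_gibbs(utilities, kappa):
    """Logit choice probabilities: p_i proportional to exp(u_i / kappa)."""
    if kappa <= 0.0:
        raise ValueError("tradeoff factor must be positive")
    peak = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp((u - peak) / kappa) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


estimated = [1.0, 2.0, 3.0]                 # hypothetical utility estimates
greedy = boltzmann_gibbs(estimated, 0.01)   # small factor: exploit the best action
uniform = boltzmann_gibbs(estimated, 1e6)   # very large factor: near-uniform exploration
```

A small tradeoff factor concentrates almost all probability on the highest-utility action, while a very large one spreads it almost evenly, matching the two limiting cases described above.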
Risk-sensitive learning procedure: We denote as the estimated utility of SC , where the estimated utility and the probability mass function are updated for each action as follows:
where and are the learning rates, which satisfy the following conditions (due to space limits, please see [16] for the convergence proof):
Finally, each SC determines as per (8).
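The two coupled processes can be sketched for a single agent as follows. This is a minimal illustration in the spirit of [16], not the exact update rules: it assumes that the utility estimate of the chosen action moves toward the observed per-slot utility, and that the PMF moves toward the Boltzmann-Gibbs distribution of the current estimates on a slower timescale. The learning-rate exponents, the toy reward, and all function names are hypothetical choices.

```python
import math
import random


def boltzmann_gibbs(utilities, kappa):
    """Logit choice probabilities: p_i proportional to exp(u_i / kappa)."""
    peak = max(utilities)
    weights = [math.exp((u - peak) / kappa) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def run_learning(reward_fn, n_actions, steps, kappa, seed=0):
    """Run the assumed two-timescale updates for one agent."""
    rng = random.Random(seed)
    u_hat = [0.0] * n_actions            # estimated utility per action
    pmf = [1.0 / n_actions] * n_actions  # mixed strategy over actions
    for t in range(1, steps + 1):
        zeta = t ** -0.55  # utility-estimate rate (assumed, faster timescale)
        iota = t ** -0.6   # PMF rate (assumed, slower: iota / zeta -> 0)
        action = rng.choices(range(n_actions), weights=pmf)[0]
        reward = reward_fn(action, rng)  # observed per-slot utility (user feedback)
        # Only the estimate of the action actually played is updated.
        u_hat[action] += zeta * (reward - u_hat[action])
        # The PMF tracks the Boltzmann-Gibbs distribution of the estimates.
        target = boltzmann_gibbs(u_hat, kappa)
        pmf = [p + iota * (b - p) for p, b in zip(pmf, target)]
    return u_hat, pmf


def toy_reward(action, rng):
    """Hypothetical environment: action 1 pays more on average."""
    return (1.0, 2.0)[action] + rng.gauss(0.0, 0.1)


u_hat, pmf = run_learning(toy_reward, n_actions=2, steps=5000, kappa=0.5)
```

Because each PMF update is a convex combination of two distributions, the PMF remains a valid distribution at every step, and the agent needs only its own feedback rather than full network observation.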
V Numerical Results
A dense set of SCs is randomly deployed in a area, and we assume one UE per SC and a fixed user association. We assume that each SC adjusts its beamwidth with a step of radian from the range , where radian and radian denote the minimum and maximum beamwidths of each SC, respectively. The transmit power level set of each SC is dBm and the SC antenna gain is dBi. The numbers of transmit antennas at the SC and receive antennas at the UE are set to and , respectively. The blockage is modeled as a distance-dependent probability state where the channel is either line-of-sight (LOS) or non-LOS for urban environments at GHz, and the system bandwidth is GHz [17]. Numerical results are obtained via Monte Carlo simulations over different random topologies. The risk-sensitive parameter is set to . For the learning algorithm, the tradeoff factor is set to , while the learning rates and are set to and , respectively [16]. Furthermore, we compare our proposed RSL scheme with the following baselines:

• Classical Learning (CSL) refers to the RL framework in which the utility function only considers the mean value of mmWave links [16].

• Baseline 1 (BL1) refers to the scheme in [13], which optimizes the beamwidth with maximum transmit power.
In Fig. 3, we plot the complementary cumulative distribution function (CCDF, i.e., the tail distribution) of user throughput (UT) at GHz when the number of SCs is per . The CCDF curves reflect the reliable probability (in both linear and logarithmic scales), defined as the probability that the UT is higher than a target rate Gbps, i.e., Pr. We also study the impact of imperfect CSI with and of noisy feedback from UEs. We observe that the performance of our proposed RSL framework degrades under these impairments. We next compare our proposed RSL method with the baselines under perfect CSI and user feedback. It is observed that the RSL scheme achieves better reliability, Pr, of more than , whereas the baselines CSL and BL1 obtain less than and , respectively. However, at very low rates (less than Gbps) or very high rates ( Gbps), captured by the cross-point, the RSL obtains a lower probability than the baselines. In other words, our proposed solution provides a UT that is more concentrated around its median in order to provide uniformly great service for all users. For instance, the UT distribution of our proposed algorithm has a small variance of , while the CSL has a higher variance of .
V-A Impact of network density
Fig. 3 reports the impact of network density on the reliability, which is defined as the fraction of UEs that achieve a given target rate , i.e., . Here, the number of SCs varies from to per . For given target rates of , , and Gbps, our proposed algorithm guarantees higher reliability than the baselines. Moreover, the higher the target rate, the bigger the performance gap between our proposed algorithm and the baselines. A linear increase in network density decreases reliability; for example, when the density increases from to , the fractions of users that achieve Gbps under the RSL, CSL, and BL1 are reduced by , and , respectively. This highlights a key tradeoff between reliability and network density.
In Fig. 5, we show the impact of network density on the availability, which defines how much rate is obtained for a target probability. We plot the and probabilities in which the system achieves a rate of at least Gbps. For a given target probability of , our proposed algorithm guarantees more than Gbps of UT, whereas the baselines guarantee less than Gbps of UT for . If we lower the target probability to , the achievable rate increases by . This gives rise to a tradeoff between the reliability and the data rate. In addition, for a given probability, the achievable rate decreases as the network density increases. For instance, when the network density increases from to , the achievable rate is reduced by . This highlights the tradeoff between availability and network density.
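The two metrics used above can be computed from throughput samples as sketched below. Reliability is taken as the empirical fraction of samples that reach a target rate, and availability as the largest sampled rate that is still reached with at least a target probability; these empirical estimators, the function names, and the sample values are illustrative assumptions, not the evaluation code behind the figures.

```python
import math


def reliability(throughputs, target_rate):
    """Empirical probability that the throughput is at least target_rate."""
    hits = sum(1 for x in throughputs if x >= target_rate)
    return hits / len(throughputs)


def availability(throughputs, target_prob):
    """Largest sampled rate r with empirical Pr(throughput >= r) >= target_prob."""
    if not 0.0 < target_prob <= 1.0:
        raise ValueError("target probability must lie in (0, 1]")
    ordered = sorted(throughputs, reverse=True)
    # The k-th largest sample is reached by at least k of the n samples.
    k = max(1, math.ceil(target_prob * len(ordered)))
    return ordered[k - 1]


# Hypothetical user-throughput samples in Gbps.
samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

rel_at_4 = reliability(samples, 4.0)    # 7 of 10 samples reach 4 Gbps
rate_at_90 = availability(samples, 0.9)
rate_at_50 = availability(samples, 0.5)
```

Lowering the target probability raises the guaranteed rate, which is the reliability versus data-rate tradeoff described above.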
We numerically observe that is long enough for agents to learn and reach the optimal solution. We assume that the channel condition changes after every . Our proposed algorithm converges faster than the classical learning baseline, as shown in Fig. 5. By harnessing the notion of risk aversion, the agents try to find the best strategy subject to the variations of the mmWave rates.
VI Conclusions
In this letter, we studied the problem of providing multi-gigabit wireless access with reliable communication by optimizing the transmit beam and considering the link sensitivity in 5G mmWave networks. A distributed risk-sensitive RL-based approach was proposed, taking into account both the mean and the variance of the mmWave links. Numerical results show that our proposed approach provides better service for all users. For instance, our proposed approach achieves a Pr higher than , whereas the baselines obtain less than and with small cells.
References
 [1] J. G. Andrews et al., “What Will 5G Be?” IEEE Journal on Selected Areas in Communications, vol. 32, no. 6, pp. 1065–1082, June 2014.
 [2] A. Anpalagan, M. Bennis, and R. Vannithamby, Design and Deployment of Small Cell Networks. Cambridge University Press, 2015.
 [3] T. K. Vu et al., “Joint load balancing and interference mitigation in 5G heterogeneous networks,” IEEE Transactions on Wireless Communications, vol. 16, no. 9, pp. 6032–6046, Sep. 2017.
 [4] Y. Wu, R. Schober, D. W. K. Ng, C. Xiao, and G. Caire, “Secure massive MIMO transmission with an active eavesdropper,” IEEE Transactions on Information Theory, vol. 62, no. 7, pp. 3880–3900, 2016.
 [5] T. S. Rappaport et al., “Millimeter wave mobile communications for 5G cellular: It will work!” IEEE Access, vol. 1, pp. 335–349, 2013.
 [6] O. Mihatsch and R. Neuneier, “Risk-sensitive reinforcement learning,” Machine Learning, vol. 49, no. 2-3, pp. 267–290, 2002.
 [7] P. Popovski et al., “Wireless access for ultra-reliable low-latency communication (URLLC): Principles and building blocks,” submitted to IEEE Network, 2017.
 [8] M. Bennis, M. Debbah, and H. V. Poor, “Ultra-Reliable and Low-Latency Wireless Communication: Tail, Risk and Scale,” submitted to Proceedings of the IEEE, 2018.
 [9] G. Yang, M. Xiao, and H. V. Poor, “Low-latency millimeter-wave communications: Traffic dispersion or network densification?” submitted to IEEE Transactions on Communications, 2017.
 [10] T. K. Vu et al., “Ultra-reliable and low latency communication in mmWave-enabled massive MIMO networks,” IEEE Communications Letters, vol. 21, no. 9, pp. 2041–2044, Sep. 2017.
 [11] ——, “Path Selection and Rate Allocation in Self-Backhauled mmWave Networks,” in Proc. IEEE Int. Conf. on Wireless Communications and Networking Conference (WCNC), Barcelona, Spain, 2018, pp. 1–6.
 [12] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, “A reinforcement learning approach to power control and rate adaptation in cellular networks,” in 2017 IEEE International Conference on Communications, Paris, France, 2017, pp. 1–7.
 [13] J. Liu and E. S. Bentley, “Hybrid-Beamforming-Based Millimeter-Wave Cellular Network Optimization,” in Proc. 15th IEEE Int. Sym. on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), Paris, France, 2017, pp. 1–8.
 [14] J. Palacios et al., “Tracking mmWave Channel Dynamics: Fast Beam Training Strategies under Mobility,” in Proc. 36th Annual IEEE Int. Conf. on Computer Communications (INFOCOM), Atlanta, GA, USA, 2017, pp. 1–9.
 [15] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
 [16] M. Bennis, S. M. Perlaza, P. Blasco, Z. Han, and H. V. Poor, “Self-organization in small cell networks: A reinforcement learning approach,” IEEE Transactions on Wireless Communications, vol. 12, no. 7, pp. 3202–3212, 2013.
 [17] T. Bai, V. Desai, and R. W. Heath, “Millimeter wave cellular channel models for system evaluation,” in 2014 IEEE Int. Conf. on Computing, Networking and Communications (ICNC), Honolulu, HI, USA, 2014, pp. 178–182.