## I Introduction

The state-of-the-art protocols for vehicular communication address vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication systems, generally termed V2X. Currently, these communication systems enable a maximum data rate of Mbps under high mobility (using 4G) [1, 2], which is not deemed sufficient to support applications such as autonomous driving, augmented reality and infotainment, which will populate next-generation vehicular networks. Therefore, future vehicular communication networks call for new solutions to support their capacity demands, by leveraging the huge amount of bandwidth in the GHz band, the so-called millimeter-wave (mm-wave) spectrum.
While communication at these frequencies is ideal to support high capacity demands, it relies on highly directional transmissions, which are extremely susceptible to vehicle mobility. Therefore, a key question is: *How do we leverage mobility information to optimize the trade-off between directive Data Transmission (DT) and the directional Beam Training (BT) that enables it, so as to maximize communication performance? How much do we gain by doing so?* To address these questions and optimize this trade-off, in this paper we envision the use of learning tools. We demonstrate significant gains compared to common-sense beam alignment schemes.

Compared to conventional lower frequencies, propagation at mm-waves poses several challenges, such as high propagation loss and sensitivity to blockage. To counteract these effects, mm-wave systems are expected to use large antenna arrays to achieve a large beamforming gain via directional transmissions. However, these techniques demand extensive beam training, such as beam sweeping, estimation of angles of arrival and departure, and data-assisted schemes [3], as well as beam tracking [4]. Despite the simplicity of these algorithms, the overhead they incur may ultimately offset the benefits of beamforming in highly mobile environments [1, 2]. While wider beams require less beam training, they result in a lower beamforming gain, hence a smaller achievable capacity [5]. While contextual information, such as GPS readings of vehicles [6], may alleviate this overhead, it does not eliminate the need for beam training, due to noise and GPS acquisition inaccuracy. Thus, the design of schemes that alleviate this overhead is of great importance.

In all of the aforementioned works, a priori information on the vehicle’s mobility is not leveraged in the design of BT/DT protocols.
In contrast, *we contend that leveraging such information via adaptive beam design techniques can greatly improve the performance of automotive networks* [7, 8].
In this paper, we bridge this gap by designing adaptive strategies for BT/DT that leverage a priori mobility information via Partially Observable (PO) Markov Decision Processes (MDPs). Our numerical evaluations demonstrate that these optimized policies significantly outperform common-sense heuristic schemes, which are not tailored to the vehicle’s observed mobility pattern.
Compared to [3], which develops an analytical framework to optimize the BT/DT trade-off and the BT parameters based on the "worst-case" mobility pattern, in this work we assume a statistical mobility model.

In the proposed scenario, a Base Station (BS) attempts to establish a mm-wave directive link towards a Mobile User (MU) moving along a road. To this end, it alternates between BT and DT. The goal is to maximize the number of bits delivered by the BS to the MU over the communication session, under a power constraint. To manage the BT/DT trade-off, we exploit a POMDP formulation, where the system state corresponds to the position of the MU within the road link. Specifically, we implement a POMDP with temporally extended actions (i.e., actions with different durations) to model the different temporal scales of BT and DT, and a constraint on the available resources of the system. POMDPs model an agent's decision process in which the system dynamics are determined by the underlying MDP (in this case, the MU dynamics), but the agent cannot directly observe the system state. Instead, it maintains a probability distribution (called
*belief*) over the world states, based on observations and their distribution, and the underlying MDP. An exact solution to a POMDP yields the optimal action for each possible belief over the world states. POMDPs have been successfully applied to a variety of real-world sequential decision processes, including robot navigation, machine maintenance, and planning under uncertainty [9, 10]. To address the complexity of POMDPs, we use [11], an approximate solution technique which uses a subset of belief points as representative of the belief state. However, in contrast to the original formulation, which uses random belief point selection, we tailor it by selecting a deterministic set of belief points representing the uncertainty in the MU position, and demonstrate significant performance gains. A unified approach for constrained MDPs is given by [12, 13]. Notably, there has been relatively little development in the literature on incorporating constraints into POMDPs [14, 15, 16, 17]. To address the resource constraints in our problem, we propose a Lagrangian method, and an online algorithm to optimize the Lagrangian variable based on the target cost constraint.

## II System Model

We consider a scenario where a BS aims at establishing a mm-wave directive link with a MU moving along a road. To this end, it alternates between BT and DT: with BT, the BS refines its knowledge on the position of the MU within the road link, to perform more directive DT. Our goal is to maximize the number of bits that the BS delivers to the MU during a transmission episode, defined as the time interval between the two instants when the MU enters and exits the coverage range of the BS, under a power constraint.

### II-A Problem formulation

We consider a *dense* cell deployment, as shown in Fig. 1. The MU is associated with its closest BS, at a distance from the road link. The road link served by the reference BS is divided into road sub-links of equal length , where is the maximum coverage range of the BS. We let be the set of indices of the road sub-links. The BS associates a beam with each one of the road sub-links, with angular support, for the -th beam,

(1)

and beamwidth , so that and .

The time is discretized into micro time-slots of duration , with being the time for a Primary Synchronization Signal (PSS), which allows a proper channel estimation at the receiver [5].
At time , the MU is located in one of the road sub-links, until it exits the coverage area of the BS, denoted by the absorbing state . We denote the sub-link occupied by the MU at time as .
We assume that the position of the MU within the road link evolves among the road sub-links following a random walk with probabilities ,
with ,
where and . Under this model, the MU will exit the BS coverage area at some point.
We can view such a random walk as an abstraction of the following *physical* mobility model, where the MU moves with average speed and speed variance : assume that the MU moves at speed at time , with . Also, let , , and . Note that the maximum speed supported by this model is (otherwise, the MU may move more than one sub-link within a single micro-slot). It follows that and . Thus, given average and , we obtain and as

(2)

To meet the conditions for the probabilities , with , the following inequalities must hold:

(3)

which defines a region of feasible pairs . This model can be extended, e.g., to account for multiple speeds and memory in the velocity process, although we leave it for future work.
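Since the model's symbols were stripped during extraction, the following minimal Python sketch uses hypothetical names (`num_sublinks` for the number of road sub-links, `p` for the per-slot advance probability) to illustrate one plausible reading of this mobility model: a forward random walk over the sub-links with an absorbing exit state.

```python
import random

def simulate_episode(num_sublinks=10, p=0.3, max_slots=10_000, seed=0):
    """Simulate the MU random walk over the road sub-links.

    Illustrative sketch (names and dynamics are assumptions, not the
    paper's exact notation): in each micro-slot the MU advances one
    sub-link with probability p and stays put otherwise; it is absorbed
    once it leaves the last sub-link (MU exits BS coverage).
    """
    rng = random.Random(seed)
    s, trajectory = 0, [0]
    for _ in range(max_slots):
        if rng.random() < p:
            s += 1
        if s >= num_sublinks:   # absorbing state: MU left the coverage area
            return trajectory
        trajectory.append(s)
    return trajectory

traj = simulate_episode()
```

The average episode duration under this sketch is roughly `num_sublinks / p` micro-slots, matching the intuition that faster users leave the coverage area sooner.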

During BT or DT, at time , the BS transmits using a beam that covers a sub-set of sub-links, , part of our design. Assuming a large antenna array, which allows for arbitrarily sharp beam patterns, the beam is designed in order to support a target SNR on the beam support, . To this end, we let be the power per radian projected in the angular direction , and . To attain the target SNR constraint, we must have that

(4)

where is the SNR scaling factor, is the wavelength, is the noise power spectral density, is the antenna efficiency, is the bandwidth, and is the distance of the point in the road link at angular direction , so that models distance dependent path loss. It follows that the total transmit power is given by

(5)

Using the change of variables , we then obtain

(6)

In other words, the total transmit power is independent of the sub-link indices, but depends solely on the number of sub-links and on the target SNR. This result is in line with the intuition that larger distances are achievable via smaller beamwidths, and vice versa [5].

During DT, assuming isotropic reception at the MU, such target SNR implies an achievable rate given by

(7)

During BT, the SNR is set so as to achieve target mis-detection and false-alarm probabilities. To design this parameter, the generic signal detection problem corresponds to receiving a signal , , over a noisy channel. The two hypotheses are

(8)

where , , are independent random variables, , with . Our task is to decide in favor of or on the basis of the measurements , , i.e., or equivalently,

(9)

where is the energy of the pilot signal . If the Neyman-Pearson formulation is used, then the right-hand side of Eq. (9) is replaced by a decision threshold , a function of the target error probability. According to the Neyman-Pearson Lemma [18], for a given target error probability, we can derive a decision rule as follows. The false-alarm probability, (accept when is true), is given as , where is the Q-function. The mis-detection probability, (accept when is true), is given as , where the probability of correct detection is given by , which shows that is a function of . Applying the inverse Q-function to both sides of the last equation leads to a measure of the SNR required to attain the target error performance:

(10)

which is plugged into Eq. (6) to find the transmit power as a function of the number of sub-links covered, .
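As an illustrative sketch of this derivation (not the paper's exact Eq. (10), whose constants were lost in extraction), the snippet below assumes the standard known-signal detection result, in which the required SNR satisfies sqrt(SNR) = Q⁻¹(P_FA) − Q⁻¹(1 − P_MD); the Q-function is computed via the complementary error function and inverted numerically by bisection.

```python
import math

def Q(x):
    """Gaussian tail probability: Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def Q_inv(p, lo=-40.0, hi=40.0):
    """Inverse of Q via bisection (Q is strictly decreasing)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Q(mid) > p:       # Q(mid) too large -> mid is too small
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def required_snr(p_fa, p_md):
    """SNR needed to hit target false-alarm / mis-detection probabilities
    for a known pilot in Gaussian noise (assumed form of Eq. (10)):
    sqrt(SNR) = Q^-1(p_fa) - Q^-1(1 - p_md)."""
    return (Q_inv(p_fa) - Q_inv(1.0 - p_md)) ** 2
```

For instance, targets of 10% false-alarm and 10% mis-detection require an SNR of about 6.6 (linear scale) under this assumed model.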

### II-B Partially Observable Markov Decision Process

Next, we define a constrained Partially Observable (PO) Markov Decision Process (MDP).

States: is a finite set of states describing the position of the MU within the road link, along with the absorbing state when the MU exits the coverage area of the BS. Therefore, , where is the set of road sub-links.

Actions: is a finite set of actions that the BS can perform. Specifically, the BS can perform actions for Beam Training (BT) and actions for Data Transmission (DT),
which involve selection of the transmission beam, power, and duration.
In general, is in the form , where: refers to the action class; is a sub-set of sub-links, defining the support of the transmission beam;
is the transmission power per beam, such that in Eq. (6) is the total transmit power,
is the transmission duration of action (number of micro time-slots of duration ).
If , then , where and are fixed parameters of the model. Specifically, we assume that BT actions perform simultaneous beamforming over in one interval of seconds (i.e., ). Also, , where is the power per beam required to attain the target SNR constraint, i.e., , and is a function of false-alarm and mis-detection probabilities and , which are also fixed parameters of the model, via Eq. (10).
If , then , where and are part of the optimization. Specifically, we assume that DT actions perform simultaneous data communication over for micro time-slots, where the last interval of seconds is dedicated to the ACK/NACK feedback transmission from the MU to the BS. During DT, the transmission rate follows from Eq. (7).
Note that the action space grows as . To reduce its cardinality, we restrict such that is a subset of consecutive indices in , i.e., the beam directions specified by define a compact range of transmission for the BS. Thus, .
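A quick way to see the cardinality reduction is to enumerate the consecutive-index supports explicitly; the sketch below (with a hypothetical helper `consecutive_supports`) shows that a road with S sub-links admits S(S+1)/2 such supports, versus 2^S − 1 arbitrary non-empty subsets.

```python
def consecutive_supports(S):
    """All beam supports made of consecutive sub-link indices 0..S-1.

    Restricting supports to consecutive runs shrinks their count from
    2^S - 1 (arbitrary non-empty subsets) to S*(S+1)/2.
    """
    return [tuple(range(i, j + 1)) for i in range(S) for j in range(i, S)]

supports = consecutive_supports(4)   # 4*5/2 = 10 supports
```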

Observations: is a finite set of observations,
defined as . Specifically, means that
the MU exited the coverage area of the BS; for simplicity, in this work we assume that this event is observable, i.e., the BS knows when the MU exited its coverage area.

Transition probabilities: is the transition probability from to given . Note that these probabilities are a function of the duration of action . If the transmission duration of is , then we store the -step transition probabilities into matrix , with elements given by the 1-step mobility model, as:

(11)

If the transmission duration of is , then we compute the -step transition probabilities into matrix , i.e., we take the -th power of matrix so as to account for the -step evolution of the system state under with transmission duration .
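A minimal sketch of this construction, reusing the forward-random-walk assumption sketched in Section II-A (hypothetical names `one_step_matrix`, `t_step_matrix`): build the 1-step matrix including the absorbing state, then raise it to the power matching the action's duration.

```python
def one_step_matrix(S, p):
    """1-step transition matrix over sub-links 0..S-1 plus absorbing state S
    (MU out of coverage), assuming a forward random walk: advance one
    sub-link w.p. p, stay w.p. 1-p."""
    n = S + 1
    P = [[0.0] * n for _ in range(n)]
    for s in range(S):
        P[s][s] = 1.0 - p
        P[s][s + 1] = p      # from the last sub-link this leads to absorption
    P[S][S] = 1.0            # absorbing state is sticky
    return P

def mat_mul(A, B):
    """Plain dense matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def t_step_matrix(P, t):
    """t-step transition probabilities: the t-th power of P, matching the
    action duration in micro-slots."""
    R = P
    for _ in range(t - 1):
        R = mat_mul(R, P)
    return R
```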

Observation model: is the probability of observing given and with transmission duration , ending in . We assume that the BS can successfully perform if the MU remains within for subsequent micro time-slots, i.e., the MU does not exit from the beam support, so that the whole signal is received. In this case, the MU feeds back an ACK to the BS, . Therefore, we define , as the system state path from time to time , and the event , meaning that the system state path remains within for subsequent micro time-slots, given that , , . To compute this probability, we also define the matrix as the transition probability matrix restricted to the beam support , i.e., , with elements if , otherwise . We derive as:

(12)

Given with transmission duration , is defined as follows. If , we account for false-alarm and mis-detection errors in the beam detection process. In particular, if (i.e., the MU is within the beam support during the duration of BT) then (correct detection) and (mis-detection); on the other hand, if (i.e., the MU is outside of the beam support during the duration of BT), then (false-alarm) and . If , then the transmission is successful if the event occurs, so that for , and . Finally, whenever either or , i.e., the BS knows when the MU exited its coverage area.
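The probability of this event can be sketched as follows, under the assumption (hypothetical function name and array layout) that it equals the total probability mass surviving t applications of the transition matrix restricted to the beam support:

```python
def stay_probability(P, belief, support, t):
    """Probability that the state path stays inside the beam support for t
    consecutive micro-slots (a sketch of the event behind Eq. (12)):
    zero out all entries of P leaving or entering states outside the
    support, apply the restricted matrix t times to the restricted belief,
    and sum the surviving mass."""
    n = len(P)
    PW = [[P[i][j] if (i in support and j in support) else 0.0
           for j in range(n)] for i in range(n)]
    v = [belief[i] if i in support else 0.0 for i in range(n)]
    for _ in range(t):
        v = [sum(v[i] * PW[i][j] for i in range(n)) for j in range(n)]
    return sum(v)

# tiny example: 3 sub-links + absorbing state, advance w.p. 0.5
P_ex = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.5, 0.5, 0.0],
        [0.0, 0.0, 0.5, 0.5],
        [0.0, 0.0, 0.0, 1.0]]
```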

Rewards: is the expected reward given and , defined as the transmission rate (number of bits transmitted from the BS to the MU) during DT if the MU remains within for subsequent micro time-slots. Formally, , where iff the event is true (thus if the MU exits from the beam support). Note that , which is computed in Eq. (12). The transmission rate follows from Eq. (7) when . Finally, , where refers to the fact that we reserve one micro time-slot of the total DT duration for the feedback transmission. If , then , as no data bits are transmitted.

Costs: is the expected energy cost given and . The total expected cost during a transmission episode is subject to the constraint . If , then , (we reserve one micro time-slot for the feedback transmission). If , then .

## III Optimization Problem

Since the agent cannot directly observe the system state, we introduce the notion of *belief*. A belief is a probability distribution over . The state estimator must compute a new belief, , given an old belief , an action , and an observation , i.e., . It can be obtained via Bayes’ rule as:

(13)

where is a normalizing factor, .
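A minimal Python sketch of this Bayes update (hypothetical signature; the paper's observation model also conditions on the action's duration):

```python
def belief_update(belief, P_a, obs_prob, z):
    """Bayes belief update (sketch of Eq. (13)): propagate the belief
    through the action's transition matrix P_a, weight by the likelihood
    of the received observation z, and renormalize.

    obs_prob(z, s_next) gives the probability of observing z when landing
    in state s_next (an illustrative, simplified observation model).
    """
    n = len(belief)
    predicted = [sum(belief[s] * P_a[s][sp] for s in range(n))
                 for sp in range(n)]
    unnorm = [obs_prob(z, sp) * predicted[sp] for sp in range(n)]
    norm = sum(unnorm)   # Pr(z | belief, action): the normalizing factor
    if norm == 0.0:
        raise ValueError("observation has zero likelihood under this belief")
    return [u / norm for u in unnorm]
```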

Our goal is to determine a policy (i.e., a map from beliefs to actions) that maximizes the total expected reward the agent can gather, under a constraint on the total expected cost during a transmission episode, following and starting from :

(14)

where we have defined expected rate and cost metrics under belief as

(15)

At this point, we opt for a Lagrangian relaxation approach such that is the metric to be maximized, for some Lagrangian multiplier , and the total expected cost during a transmission episode is subject to the constraint . Hereinafter, according to the notation, . At the end of Section III-B, we will consider an online algorithm to optimize parameter so as to solve the original problem in Eq. (14).

### III-A Value Iteration for POMDPs

In POMDPs, a policy is a function over a continuous set of probability distributions over . A policy is characterized by a value function , which is defined as:

(16)

A policy that maximizes is called an optimal policy . The value of an optimal policy is the optimal value function , that satisfies the Bellman optimality equation (with Bellman backup operator ):

(17)

where (see Eq. (13)). When Eq. (17) holds for every belief, we are ensured that the solution is optimal. can be arbitrarily well approximated by iterating over a number of stages, at each stage considering a step further into the future. Also, for problems with an infinite planning horizon, can be approximated, to any degree of accuracy, by a PieceWise Linear and Convex (PWLC) value function [11]. Thus, we parameterize a value function at stage by a finite set of vectors (hyperplanes) , such that , where denotes the inner product. Each vector in is associated with an action , which is the optimal one to take at stage , and defines a region in the belief space for which this vector is the maximizing element of (thus ). The key idea is that for a given value function at stage and a belief , we can compute the vector in such that:

(18)

where is the (unknown) set of vectors for . We will denote this operation . It computes the optimal vector for a given belief by back-projecting all vectors in the current horizon value function one step from the future and returning the vector that maximizes the value of . Defining vectors such that and such that ( is a projected vector given action , observation , and current horizon vector ), we have [11]:

(19)
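The backup operation of Eqs. (18)-(19) can be sketched as follows, assuming dictionary layouts `T[a][s][sp]`, `O[a][z][sp]`, and `R[a][s]` for the model (these names are illustrative, not the paper's notation): project every current vector one step back through each action-observation pair, keep the projection maximizing the value at the given belief, and return the best resulting vector and its action.

```python
def backup(b, V, actions, observations, gamma, T, O, R):
    """Point-based Bellman backup: for belief b and current vector set V,
    return the maximizing alpha-vector and the action attached to it."""
    n = len(b)
    best_val, best_vec, best_act = None, None, None
    for a in actions:
        # g_a = r_a + gamma * sum_z argmax_k  b . g_{a,z,k}
        g = list(R[a])
        for z in observations:
            # back-project every vector in V through (a, z)
            proj = [[sum(O[a][z][sp] * T[a][s][sp] * alpha[sp]
                         for sp in range(n)) for s in range(n)]
                    for alpha in V]
            gz = max(proj, key=lambda v: sum(bi * vi for bi, vi in zip(b, v)))
            g = [gi + gamma * gzi for gi, gzi in zip(g, gz)]
        val = sum(bi * gi for bi, gi in zip(b, g))
        if best_val is None or val > best_val:
            best_val, best_vec, best_act = val, g, a
    return best_vec, best_act
```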

In general, computing optimal planning solutions for POMDPs is an intractable problem for any reasonably sized task. This calls for approximate solution techniques, e.g., [11], which we introduce next.

### III-B Randomized Point-based Value Iteration for POMDPs

is an approximate Point-Based Value Iteration (PBVI) algorithm for POMDPs. It implements a randomized approximate backup operator that increases (or at least does not decrease) the value of all beliefs . The key idea is that for a given value function at stage , we can build a value function that improves the value of all beliefs by only updating the value of a (randomly selected) subset of beliefs , i.e., we can build a value function that upper bounds over (but not necessarily over ): , . Starting with , performs a number of backup stages until some convergence criterion is met. Each backup stage is defined as in Algorithm 1 (where is an auxiliary set containing the non-improved beliefs).

Key to the performance of is the design of . Several standard schemes to select beliefs have been proposed for PBVI, mainly based on grids of points in the belief space. A different option to select beliefs is to simulate the model, i.e., sampling random actions and observations, and generating trajectories through the belief space, as suggested in [11]. Although this approach may seem reasonable, one may argue that the probability distributions collected in are not very representative of the system dynamics history, where actions and observations must also depend on beliefs. Hereinafter, we leverage the structure of the POMDP presented in Section II and provide an algorithm (Algorithm 2) to collect beliefs in in a smarter fashion. Our approach is simple but effective, and does not require any prior knowledge of the system dynamics: according to Algorithm 2, is made of uniform probability distributions over , which are uniformly distributed over at most *consecutive* road sub-links. This design of reflects the compact range of transmission for the BS, where the BS degree of uncertainty on the MU state scales with .
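A minimal sketch of this belief-set construction, under the stated design (uniform distributions over runs of at most a given number of consecutive sub-links; names are illustrative):

```python
def belief_set(S, max_width):
    """Deterministic belief set (sketch of Algorithm 2): uniform
    distributions over runs of 1..max_width consecutive road sub-links,
    mirroring the BS's compact transmission range and its degree of
    uncertainty about the MU position."""
    beliefs = []
    for width in range(1, max_width + 1):
        for start in range(S - width + 1):
            b = [0.0] * S
            for s in range(start, start + width):
                b[s] = 1.0 / width
            beliefs.append(b)
    return beliefs
```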

The basic routine for PBVI is given in Algorithm 3, where approximates the optimal value function for a given value of . Note that we are interested in the optimal policy when is such that , i.e., the agent knows when the MU enters the maximum coverage range of the BS.

To find the optimal multiplier , we have to run the routine for different values of . The routine performs a number of backup stages until some convergence criterion is met. At this point, we check whether the constraint is satisfied, update if , and repeat the routine for different values of . These values of can be sequentially selected from a sorted sequence or tuned at the end of the routine in a smarter fashion. However, in both cases we have to wait until convergence. To speed up the search for the optimal multiplier , we formulate an online version of Algorithm 3, presented in Algorithm 4. Here, is tuned within the main loop of the routine according to a gradient descent technique¹ [19]: , where the discount factor is . Finally, given , we update the Lagrangian relaxation as .

¹ Note that a gradient descent technique adjusts the parameter after each iteration in the direction that would reduce the error on that iteration the most. The target here depends on the parameter , but if we ignore that dependence when we take the derivative, what we get is a semi-gradient update [20].
In addition to the convergence criterion of the standard PBVI, we also consider the requirement .
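The online multiplier update can be sketched as a projected (semi-)gradient step on the constraint violation; the exact step-size schedule and cost estimator of Algorithm 4 are not reproduced here, so treat this as an assumed form with illustrative names:

```python
def update_multiplier(lam, expected_cost, cost_budget, step=1e-3):
    """Online Lagrangian-multiplier update (assumed form): increase lam
    when the expected episode cost exceeds the budget, decrease it
    otherwise, and project back onto lam >= 0."""
    lam = lam + step * (expected_cost - cost_budget)
    return max(lam, 0.0)
```

Iterating this update alongside the value-function backups lets the search for the multiplier proceed without waiting for full convergence at each candidate value.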

## IV Numerical results

We set s. We consider the following parameters: , m, MHz, GHz, , dBm, , m/s, dBm, (number of micro time-slots, i.e., ms), , , where follows from (specified below). Finally, we compare different sets . Let be the average duration of a transmission episode. The average rate (bit/s) and power (dBm) are computed as and .

The average rate and power as a function of are plotted in Fig. 2, with
