Autonomous vehicles (AVs) are required to navigate efficiently and safely in complex and uncontrolled environments [1]. To meet these requirements, Dual-Functional Radar-Communication (DFRC) system design has recently been proposed as a promising technology for AVs. A DFRC system allows an AV to jointly implement radar and communication functions. In particular, with the radar function, the AV is able to accurately detect the presence of distant objects or unexpected events even under bad weather conditions and poor visibility. With the communication function, the AV can communicate with road-side units, base stations, and edge computing systems, e.g., by using vehicle-to-infrastructure (V2I) and vehicle-to-network (V2N) communications, to facilitate intelligent road management, route selection, and data analysis [2, 3].
Since the DFRC system implements both radar and communications using a single hardware device, these functionalities share system resources such as antennas, spectrum, and power. As a result, one major problem for the AV is how to optimize the resource sharing between the radar function and the communication function, and in particular how to optimize the selection between the radar mode and the communication mode.
Recently, some resource sharing approaches have been proposed to solve the problem. In particular, the authors in [2] proposed to adopt the IEEE 802.11ad standard for joint radar-communication in an AV system. Accordingly, the AV reserves preamble blocks in the IEEE 802.11ad frame for the radar mode, i.e., to estimate the ranges and velocities of targets, and uses data blocks for data transmission. Different from [2], the time sharing approach proposed in [4] uses time cycles instead of the standard frames. Time portions in each time cycle are then allocated to the radar mode and the communication mode to maximize the radar estimation rate and communication rate of the radar-communication system. Considering the communication system for AVs in the V2I scenario, the authors in [3] proposed a method to reduce the beam alignment overhead between the AVs and the infrastructure. However, the radar’s performance on object detection is not considered.
Moreover, these approaches are fixed-schedule schemes that are not appropriate in practice because the surrounding environment of the AV is uncertain and dynamic. To maximize the resource efficiency under an uncertain environment, adaptive algorithms for the radar and communication mode selection are required. For example, when the weather is bad, e.g., heavy rain, the AV can select the radar mode more frequently to improve the radar performance and detect unexpected events on the road. In contrast, when the weather and the communication channel are in good condition, the AV can select the communication mode more frequently to transmit its data. However, it is challenging for the AV to determine optimal decisions because the environment states, e.g., the weather and road states as well as the communication channel state, are dynamic and uncertain. In this letter, we thus develop a deep reinforcement learning (DRL) technique that enables the AV to find the optimal selection of the radar mode and communication mode without prior knowledge of the environment. To the best of our knowledge, this is the first approach using DRL to solve the mode selection problem of the DFRC in AVs. For this, we first formulate the AV’s problem as a Markov decision process (MDP). Then, we develop a DRL algorithm based on the Deep Q-Network (DQN) to obtain the optimal policy for the AV. Simulation results show that the proposed DRL scheme outperforms baseline schemes in terms of higher data throughput, lower miss detection probability, and shorter convergence time.
II System Model
The system model with an AV is shown in Fig. 1. The AV is equipped with DFRC equipment that enables the AV to work in two modes, i.e., the radar mode and the communication mode. Typically, the radar and communication modes are allocated in time cycles, in which each time cycle is divided into a radar portion and a communication portion [4]. Unlike [4], we consider that each time cycle/step is allocated entirely to either the radar mode or the communication mode. This enables the AV to change the mode based on the current observation of the environment, rather than based on the previous time cycle as in [4].
II-A Dual-Functional Radar-Communication Model
In the communication mode, the AV uses the V2I capability to transmit data, e.g., current road traffic information or live on-board video streams, to the Base Stations (BSs) distributed along the road. We assume that the AV uses a single channel for the data transmission and has a data queue for storing incoming data packets, e.g., from its sensor devices. Let $D$ be the capacity of the data queue. In the radar mode, the AV operates an automotive millimeter-wave radar to detect unexpected events, e.g., a car coming from another road obscured by a truck, as shown in Fig. 1. In particular, we define an unexpected event as an event that can possibly cause a collision with the AV. We consider that the occurrence of an unexpected event is influenced by four main factors: the road condition, the weather condition, the speed of the AV, and nearby moving objects [5, 6]. Note that the values of these factors can be obtained by the AV’s sensing system, e.g., road friction sensor, weather station instrument, speedometer, and cameras.
Let $r$, $w$, $v$, and $m$ be the road state, weather state, speed state, and moving object state, respectively, each taking a value in $\{0, 1\}$. In particular, $r = 1$, $w = 1$, $v = 1$, and $m = 1$ represent unfavorable conditions, e.g., a slippery road, rainy weather, high speed of the AV, and a moving object nearby, respectively. In contrast, $r = 0$, $w = 0$, $v = 0$, and $m = 0$ express favorable conditions, e.g., a straight road, good weather, low speed, and no moving object nearby, respectively. Let $p^{f}_{n}$ denote the probability that an unexpected event occurs at the current condition $n$ of factor $f$, where $n \in \{0, 1\}$ corresponds to favorable or unfavorable conditions, respectively, and $f \in \{r, w, v, m\}$. For example, $p^{r}_{1}$ expresses the probability that an unexpected event occurs given the slippery road condition, i.e., $r = 1$. Note that the generalization of the states beyond 0 and 1 is straightforward. For example, the speed of the AV can be divided into multiple levels, e.g., low, medium, and high.
II-B Environment Model
To model the dynamics of the environment, the event probabilities for the speed and weather factors are taken from the real-world data in [5, 6], and the other probabilities are assumed to be pre-defined. Then, we can determine the probability that an unexpected event occurs given the factor states $(r, w, v, m)$, i.e., the road, weather, speed, and moving object states, by using Bayes’ theorem. For this, let $u = 1$ denote the occurrence of an unexpected event, and let $u = 0$ denote that no unexpected event occurs. Each factor is at one of its two states with a given probability and at the other state with the complementary probability. By using Bayes’ theorem, the probability that an unexpected event occurs given the factor states is determined by:
$$P(u = 1 \mid r, w, v, m) = \frac{P(r, w, v, m \mid u = 1)\, P(u = 1)}{P(r, w, v, m)}. \tag{1}$$
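As an illustration of the Bayes computation described above, the following minimal sketch evaluates the posterior probability of an unexpected event from per-factor likelihoods. It assumes that the factors are conditionally independent given the event outcome (a naive-Bayes simplification that the text does not state), that state 1 encodes the unfavorable condition, and that all numbers in `LIKELIHOODS` are hypothetical placeholders rather than the values from [5, 6].

```python
from math import prod


def event_posterior(prior, likelihoods, states):
    """Posterior P(event | factor states) via Bayes' theorem.

    prior:       P(event) before observing any factor (hypothetical value).
    likelihoods: {factor: {"event": [P(f=0 | event), P(f=1 | event)],
                           "no_event": [P(f=0 | no event), P(f=1 | no event)]}}
    states:      {factor: 0 or 1}; 1 is assumed to mean "unfavorable".
    Assumption: factors are conditionally independent given the outcome.
    """
    joint_event = prior * prod(
        likelihoods[f]["event"][n] for f, n in states.items())
    joint_no_event = (1.0 - prior) * prod(
        likelihoods[f]["no_event"][n] for f, n in states.items())
    # The denominator is the total probability of the observed factor states.
    return joint_event / (joint_event + joint_no_event)


# Hypothetical likelihood tables, for illustration only.
LIKELIHOODS = {
    "road": {"event": [0.4, 0.6], "no_event": [0.8, 0.2]},
    "weather": {"event": [0.5, 0.5], "no_event": [0.7, 0.3]},
    "speed": {"event": [0.3, 0.7], "no_event": [0.6, 0.4]},
    "moving_object": {"event": [0.2, 0.8], "no_event": [0.5, 0.5]},
}

all_unfavorable = {f: 1 for f in LIKELIHOODS}
all_favorable = {f: 0 for f in LIKELIHOODS}
p_bad = event_posterior(0.1, LIKELIHOODS, all_unfavorable)
p_good = event_posterior(0.1, LIKELIHOODS, all_favorable)
```

With these placeholder tables, the posterior under all-unfavorable conditions exceeds the one under all-favorable conditions, which is the qualitative behavior the model relies on.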
In general, when the probability that an unexpected event occurs is high, the environment is more dynamic and uncertain. We introduce two metrics to evaluate the performance of the proposed system. The first metric is the miss detection probability, defined as the ratio of the number of unexpected events that the AV cannot detect to the total number of unexpected events on the road. A high miss detection probability results in a high risk of accident for the AV. The second metric is the data throughput, defined as the average number of packets per time unit that are successfully transmitted from the AV to the BSs. Note that we assume that the accuracy of the radar system is perfect, i.e., there is no miss detection or false alarm when the AV uses the radar mode. However, the system model can be straightforwardly extended to account for the miss detections and false alarms caused by the sensing accuracy of the radar. In this case, the proposed DRL scheme can still work well as it can learn these parameters through real-time interactions with the environment.
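The two metrics defined above can be computed from a simulation log as in the following minimal sketch. The log format (one `(event_occurred, detected)` pair and one packet count per time step) and the function names are assumptions made for illustration.

```python
def miss_detection_probability(event_log):
    """Ratio of undetected unexpected events to all unexpected events.

    event_log: list of (event_occurred, detected) booleans, one per time step.
    Returns 0.0 when no unexpected event occurred at all.
    """
    total = sum(1 for occurred, _ in event_log if occurred)
    missed = sum(1 for occurred, detected in event_log
                 if occurred and not detected)
    return missed / total if total else 0.0


def data_throughput(packets_sent_per_step):
    """Average number of successfully transmitted packets per time step."""
    if not packets_sent_per_step:
        return 0.0
    return sum(packets_sent_per_step) / len(packets_sent_per_step)


# Hypothetical four-step log: three events, two of them missed.
log = [(True, True), (True, False), (False, False), (True, False)]
sent = [2, 0, 1, 1]
miss_rate = miss_detection_probability(log)
throughput = data_throughput(sent)
```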
Intuitively, to minimize the miss detection probability, the AV can use the radar mode more frequently to detect unexpected events, but this reduces the data throughput. Conversely, to increase the throughput, the AV can use the communication mode more frequently, but this may increase the miss detection probability. Considering this tradeoff together with the uncertainty of the environment, the AV’s decision-making problem can be modeled as an MDP. We then develop a DRL algorithm that quickly obtains the optimal policy for the AV without requiring complete information about the environment. The details of the DRL scheme are discussed in Section IV.
III Problem Formulation
To formulate the problem as an MDP, we define a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$, where $\mathcal{S}$, $\mathcal{A}$, $\mathcal{R}$, and $\mathcal{P}$ are the state space, action space, reward function, and state transition probability of the AV, respectively. Note that the transition probability is unknown to the AV in advance.
III-A Action Space and State Space
At each time step, the AV decides to use either the communication mode or the radar mode. Let $\mathcal{A}$ denote the action space of the AV, which consists of two actions: choosing the communication mode and choosing the radar mode. The state of the AV is the combination of (i) the state of the data queue, (ii) the state of the channel that the AV uses for its data communication, (iii) the state of the road, (iv) the weather state, (v) the speed state of the AV, and (vi) the nearby moving object state. Thus, the state space of the AV can be defined as
$$\mathcal{S} = \{(d, c, r, w, v, m)\},$$
where $d$ represents the state of the data queue, i.e., the number of packets in the data queue, and $c$ refers to the state of the communication channel that the AV uses to transmit data to the BSs. The channel is good when the interference is low and bad when the interference is high. The road state $r$, weather state $w$, speed state $v$, and moving object state $m$ are defined in Section II-A. The state of the system at time step $t$ is defined as $s_t = (d, c, r, w, v, m) \in \mathcal{S}$.
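To give a sense of the size of this state space, the sketch below enumerates every combination of the six state components. It assumes that the channel state and the four environmental factors are binary and that the queue holds between 0 and a hypothetical capacity of 10 packets; the function name and the capacity value are illustrative only.

```python
from itertools import product


def enumerate_states(queue_capacity):
    """List all states (queue, channel, road, weather, speed, moving object).

    The queue component ranges over 0..queue_capacity packets; the other
    five components are assumed to be binary.
    """
    queue_levels = range(queue_capacity + 1)
    binary = (0, 1)
    return list(product(queue_levels, binary, binary, binary, binary, binary))


states = enumerate_states(queue_capacity=10)  # hypothetical capacity
print(len(states))  # (10 + 1) * 2**5 = 352
```

Even with this small capacity, a tabular method must maintain one value per state-action pair, and the table grows linearly with the queue capacity and exponentially with the number of factors.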
III-B Reward Function
At each time step $t$, the AV chooses an action $a_t$ at state $s_t$ and receives an immediate reward $r_t$. The reward is designed to encourage the AV to increase the data throughput and, at the same time, decrease its miss detection probability. For this, we define the reward function as follows.
The immediate reward depends on the selected mode, the channel state, and whether an unexpected event occurs. When the AV selects the communication mode and the channel state is good, the AV successfully transmits the number of packets that the good channel supports and receives the corresponding reward. When the AV selects the communication mode and the channel is bad, the AV successfully transmits the number of packets that the bad channel supports and receives the corresponding reward. Moreover, when the AV selects the communication mode and an unexpected event occurs, the AV receives a penalty. When the AV selects the radar mode and does not detect any unexpected event, the AV receives no reward. When the AV selects the radar mode and detects an unexpected event, the AV receives a reward that is proportional to the number of unfavorable conditions among the four factor states (road, weather, speed, and moving object), i.e., the number of factor states equal to 1. This means that the AV receives a high reward if the probability that an unexpected event occurs is high, e.g., the AV is under very unfavorable conditions, and the unexpected event is detected. This definition encourages the AV to use the radar mode when the environment conditions are unfavorable. Note that the probability that an unexpected event occurs given the factor states is defined in (1).
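The reward rules described above can be sketched as a small function. All reward magnitudes in `RewardParams` are hypothetical placeholders, and the sketch makes two assumptions that the text leaves open: the penalty for an unexpected event in the communication mode is added on top of the transmission reward, and radar detection is perfect, so a detected event is the same as an occurring event.

```python
from dataclasses import dataclass

COMMUNICATION = "communication"
RADAR = "radar"


@dataclass(frozen=True)
class RewardParams:
    # Hypothetical placeholder values, not the settings used in the paper.
    good_channel: float = 2.0   # reward for transmitting over a good channel
    bad_channel: float = 1.0    # reward for transmitting over a bad channel
    miss_penalty: float = 50.0  # penalty when an event occurs in comm mode
    radar_unit: float = 10.0    # radar reward per unfavorable factor


def immediate_reward(action, channel_good, event_occurs, factor_states,
                     params=RewardParams()):
    """Immediate reward for one time step.

    factor_states: iterable of four 0/1 values (road, weather, speed,
    moving object), where 1 is assumed to mean "unfavorable".
    """
    if action == COMMUNICATION:
        reward = params.good_channel if channel_good else params.bad_channel
        if event_occurs:
            # Assumption: the penalty is applied in addition to the
            # transmission reward.
            reward -= params.miss_penalty
        return reward
    # Radar mode: no reward without an event, otherwise a reward that is
    # proportional to the number of unfavorable factors.
    if not event_occurs:
        return 0.0
    return params.radar_unit * sum(factor_states)
```

For instance, `immediate_reward(RADAR, True, True, (1, 1, 0, 1))` returns three times `radar_unit`, because three of the four factors are unfavorable.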
In this paper, we aim to find the optimal policy for the AV, denoted by $\pi^{*}$, that maximizes its long-term discounted cumulative reward, i.e., discounted return, as defined by
$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right],$$
where $J(\pi)$ is the expected discounted return under the policy $\pi$, $r_t$ is the immediate reward under policy $\pi$ at time step $t$, $T$ is the time horizon, and $\gamma$ is the discount factor. The optimal policy $\pi^{*}$ allows the AV to make optimal decisions at any state $s_t$, i.e., $a_t = \pi^{*}(s_t)$.
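For a single observed episode, the discounted return can be evaluated as in the minimal sketch below. It assumes that the first reward is weighted by the zeroth power of the discount factor; the function name and the example values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over one episode, with t starting at 0."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


episode_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(episode_return)  # 1 + 0.5 + 0.25 = 1.75
```

The expected discounted return under a policy is then the average of this quantity over many episodes generated by that policy.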
To find the optimal policy for the AV, standard Q-learning [8] can be adopted by estimating the Q-values of all state-action pairs, i.e., $Q(s, a)$. However, the Q-values are iteratively updated in a Q-table, and thus Q-learning suffers from the large state space problem. Therefore, we propose to use DRL with DQN to quickly find the optimal policy.
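For reference, one tabular Q-learning update can be sketched as follows. This is the textbook update rule rather than code from the paper; the learning rate, discount factor, and state labels are hypothetical.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to the entry for (state, action).

    q_table maps (state, action) pairs to Q-values; unseen pairs start at 0.
    Exactly one table entry is modified per call.
    """
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


q_table = defaultdict(float)
actions = (0, 1)  # two modes; which index means radar is left unspecified
updated = q_learning_update(q_table, "s0", 1, 1.0, "s1", actions)
```

Because each call touches a single entry, every state-action pair has to be visited many times before the table becomes accurate, which is why the table-based approach scales poorly with the state space.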
IV Deep Reinforcement Learning Algorithm
The DQN algorithm uses a deep neural network, called the Q-network, with weights $\theta$ to derive an approximate value of $Q(s, a)$. The input of the Q-network is one of the states of the AV, and the output includes the Q-values $Q(s, a; \theta)$ of all possible actions. The approximate Q-values allow the AV to map its state to an optimal action. For this, the Q-network needs to be trained to update the weights $\theta$ as follows.
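At the beginning of iteration $t$, given state $s_t$, the AV obtains the Q-values $Q(s_t, a; \theta)$ for all possible actions $a$. The AV then takes an action $a_t$ according to the $\epsilon$-greedy policy [7] and observes the reward $r_t$ and the next state $s_{t+1}$. The AV stores the transition $(s_t, a_t, r_t, s_{t+1})$ in a replay memory $\mathcal{M}$. Then, the AV randomly samples a mini-batch of transitions from $\mathcal{M}$ to update $\theta$ as follows:
$$\theta \leftarrow \theta + \alpha \left[ y_t - Q(s_t, a_t; \theta) \right] \nabla_{\theta} Q(s_t, a_t; \theta),$$
where $\alpha$ is the learning rate, $\nabla_{\theta} Q(s_t, a_t; \theta)$ is the gradient of $Q(s_t, a_t; \theta)$ with respect to the online network weights $\theta$, and $y_t$ is the target value. $y_t$ is defined as $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-})$, where $\gamma$ is the discount factor, and $\theta^{-}$ are the target network weights that are copied periodically from the online network weights. The above steps are repeated in each subsequent iteration to update the weights $\theta$. Note that the training process is considered to be an episodic task, and the algorithm converges when the cumulative reward is stable over episodes.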
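The training step described above can be sketched in plain Python. To keep the example dependency-free, it swaps the deep Q-network for a linear approximator (one weight vector per action), so it is a simplified stand-in rather than the network used in the paper. The class and function names, the mini-batch averaging, and all hyperparameters are assumptions.

```python
import random


class LinearQ:
    """Q(s, a; theta) = theta[a] . s, a linear stand-in for the Q-network."""

    def __init__(self, n_features, n_actions):
        self.theta = [[0.0] * n_features for _ in range(n_actions)]

    def q_values(self, state):
        return [sum(w * x for w, x in zip(row, state)) for row in self.theta]

    def copy_from(self, other):
        # Periodic copy of the online weights into the target network.
        self.theta = [row[:] for row in other.theta]


def epsilon_greedy(q_net, state, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily."""
    q = q_net.q_values(state)
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=q.__getitem__)


def dqn_update(online, target, batch, alpha, gamma):
    """One gradient step on a mini-batch of (s, a, r, s_next) transitions."""
    grad = [[0.0] * len(row) for row in online.theta]
    for state, action, reward, next_state in batch:
        y = reward + gamma * max(target.q_values(next_state))
        td_error = y - online.q_values(state)[action]
        # For a linear model, dQ/dtheta[action] is the state vector itself.
        for i, x in enumerate(state):
            grad[action][i] += td_error * x
    for a, row in enumerate(grad):
        for i, g in enumerate(row):
            online.theta[a][i] += alpha * g / len(batch)


rng = random.Random(0)
online, target = LinearQ(2, 2), LinearQ(2, 2)
memory = [((1.0, 0.0), 1, 1.0, (0.0, 1.0))]  # hypothetical replay memory
batch = rng.sample(memory, k=1)
dqn_update(online, target, batch, alpha=0.5, gamma=0.9)
target.copy_from(online)
```

Unlike the tabular update, one call adjusts the weights using every transition in the mini-batch, and the shared weights generalize across states that were never visited.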
V Performance Evaluation
V-A Experiment Setup
For the comparison purpose, the capacity of the data queue is set to packets, and the arriving packets follow a Poisson distribution with an average arrival rate of packet/time step. If the channel state is good, i.e., , the AV can transmit packets; if the channel state is bad, i.e., , the AV can transmit packets. We assume that the probability that the channel is at the bad state is , and the probability that the channel is at the good state is . For the reward values, to minimize the miss detection probability, the value of should be much higher than other values, i.e., , , and . In particular, we set the values to be . The values of and are taken from [5], in which the AV’s speed is high if it exceeds km/h and low otherwise. Specifically, the values and are set to be and , respectively. Rain can be considered to be a common unfavorable weather state, and thus the values of and can be taken from [6], in which and . The parameters of the DQN scheme are set as follows. The neural network used in the DQN is a Multilayer Perceptron with input layer, hidden layers, and output layer. The input layer contains units, which correspond to the number of dimensions of the state space. The output layer contains units, corresponding to the number of dimensions of the action space of the AV. The DQN and the environment for the AV are implemented by using the Keras library and the OpenAI Gym environment, respectively. To evaluate the DQN scheme, we introduce Q-learning and a Round-robin scheme, in which the AV switches back and forth between the radar mode and the communication mode, as baseline schemes.
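The queue dynamics and the Round-robin baseline in this setup can be sketched as follows. The sketch samples Poisson arrivals with Knuth's multiplication method and assumes that packets are transmitted before new arrivals are admitted and that arrivals exceeding the capacity are dropped. All parameter values in the usage lines are hypothetical placeholders, not the experiment settings.

```python
import math
import random


def poisson_sample(rate, rng):
    """Draw one Poisson-distributed integer using Knuth's method."""
    threshold = math.exp(-rate)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count


def queue_step(queue_len, capacity, arrivals, action, channel_good,
               packets_good, packets_bad):
    """Advance the data queue by one time step; return (new_length, sent)."""
    sent = 0
    if action == "communication":
        limit = packets_good if channel_good else packets_bad
        sent = min(queue_len, limit)
    # Assumption: departures happen first, and overflowing arrivals are lost.
    new_len = min(capacity, queue_len - sent + arrivals)
    return new_len, sent


def round_robin_action(step):
    """Alternate between the radar mode and the communication mode."""
    return "radar" if step % 2 == 0 else "communication"


rng = random.Random(0)
queue_len, total_sent = 0, 0
for step in range(1000):
    arrivals = poisson_sample(1.5, rng)   # hypothetical arrival rate
    channel_good = rng.random() < 0.5     # hypothetical channel model
    queue_len, sent = queue_step(queue_len, 10, arrivals,
                                 round_robin_action(step), channel_good,
                                 packets_good=4, packets_bad=2)
    total_sent += sent
```

Because the Round-robin rule ignores the queue, channel, and factor states, it serves as a natural fixed-schedule baseline against which a learned policy can be compared.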
V-B Simulation Results
We first compare the total rewards obtained by the schemes. As shown in Fig. 2(a), the total rewards obtained by the DQN and Q-learning are much higher than that of the Round-robin scheme. Furthermore, the DQN and Q-learning converge to the same reward. However, the DQN converges much faster than Q-learning. In particular, the DQN requires episodes to approach the optimal value, while the Q-learning scheme requires episodes. The reason is that the DQN updates multiple Q-values in a mini-batch at each training iteration, while Q-learning performs only one Q-value update at each training iteration. As a result, the convergence rate of Q-learning is usually much lower than that of the DQN, especially for large state/action spaces [7].
Next, we evaluate the DQN scheme by varying the environmental factors. Without loss of generality, we evaluate the proposed scheme when the probability that an unexpected event occurs given the high speed of the AV varies from to . As shown in Fig. 2(b), as this probability increases, the average reward obtained by the Round-robin scheme decreases, while those obtained by the DQN and Q-learning schemes increase. The reason can be explained as follows. With the Round-robin scheme, the radar mode is chosen according to a fixed policy, meaning that the radar mode may not be used frequently enough even if the occurrence probability of an unexpected event is high. Thus, the AV may receive high penalties, which results in a decrease of the average reward. With the DQN and Q-learning schemes, the AV uses the radar mode more frequently as this probability increases to minimize the penalties. As a result, the DQN and Q-learning schemes can achieve higher average rewards than the Round-robin scheme.
Following the optimal policy, the DQN and Q-learning can significantly outperform the Round-robin scheme in terms of throughput (see Fig. 2(c)) and miss detection probability (see Fig. 2(d)). As shown in Fig. 2(d), the miss detection probabilities obtained by the DQN and Q-learning decrease as the probability of an unexpected event at high speed increases. The reason is that the optimal policies obtained by the DQN and Q-learning enable the AV to select the radar mode more frequently when unexpected events are likely to occur. Thus, the AV can detect more unexpected events and reduce the miss detection probability. Note that the simulation results presented in this section are especially useful for designing key parameters of real AV systems to ensure the safety of the users. In particular, given the current simulation setting , the AV can achieve a miss detection probability ranging from to . We can further reduce the miss detection probability of the AV to meet its requirement by increasing the reward when the AV selects the radar mode, e.g., increasing from to or .
VI Conclusion
In this paper, we have proposed the iRDRC system, which enables the AV to optimize the radar mode and communication mode selection automatically in a real-time manner. To deal with the uncertainty of the environment, we have formulated the optimization problem based on the MDP framework and developed the DQN algorithm to obtain the optimal policy. The results show that the proposed system can simultaneously increase the data throughput and reduce the miss detection probability compared with the baseline schemes. In our future work, continuous actions and cooperation between the AVs in V2I networks can also be considered.
- [1] D. Ma et al., “Joint Radar-Communications Strategies for Autonomous Vehicles,” arXiv preprint, arXiv:1909.01729, 2019.
- [2] P. Kumari et al., “Investigating the IEEE 802.11ad standard for millimeter wave automotive radar,” IEEE VTC, Sep. 2015, pp. 1–5.
- [3] J. Choi et al., “Millimeter-wave vehicular communication to support massive automotive sensing,” IEEE Commun. Mag., vol. 54, no. 12, pp. 160–167, Dec. 2016.
- [4] A. R. Chiriyath et al., “Radar-communications convergence: Coexistence, cooperation, and co-design,” IEEE Trans. Cogn. Commun., vol. 3, no. 1, pp. 1–12, Dec. 2017.
- [5] C. N. Kloeden et al., “Travelling speed and risk of crash involvement on rural roads,” Australian Transport Safety Bureau, 2001.
- [6] “How Do Weather Events Impact Roads.” Accessed: February 2020. [Online]. Available: https://ops.fhwa.dot.gov/weather/q1_roadimpact.htm
- [7] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
- [8] C. J. C. H. Watkins et al., “Q-learning,” Mach. Learn., vol. 8, no. 3-4, pp. 279–292, 1992.