The Internet of Things (IoT) represents a pervasive ecosystem of innovative services across different application areas. Low-power and Lossy Networks (LLNs) constitute the wireless infrastructure for a variety of IoT applications. An LLN is a set of interconnected resource-constrained devices with limited power, processing and storage capabilities.
The IPv6 Routing Protocol for LLNs (RPL) was standardized as the de-facto routing protocol to meet the requirements of LLN applications. RPL organizes an IoT network as a Destination-Oriented Directed Acyclic Graph (DODAG) rooted at the sink node, namely the DODAG root. The DODAG root is the final destination for traffic within the network and bridges the topology with other IPv6 domains such as the Internet. Uplink routes are constructed and maintained using the periodic DODAG Information Object (DIO) messages. RPL is a single-path routing protocol in which each node transmits its traffic directly to a preferred parent, which is selected exclusively on the basis of the adopted Objective Function (OF). Because RPL was originally designed for low-traffic scenarios, performance issues arise under heavy traffic loads due to congestion, such as load imbalance, high packet loss and fast energy depletion of bottleneck nodes.
Most of the proposed approaches to improve RPL against congestion are based on fixed strategies that incur expensive overhead; as a result, the network does not work properly in dynamic IoT environments. In this respect, machine learning frameworks could enable self-organizing operation of LLN devices in dynamic IoT environments. Reinforcement learning (RL) is a machine learning technique capable of training an agent (IoT device) to interact intelligently and autonomously with an environment. Ultimately, adopting learning-based protocols to improve network performance can provide support for advanced IoT applications.
In this paper, we introduce an RL-based routing strategy to tackle the congestion problem in RPL networks. The proposed method is based on Q-learning, where each node maintains a Q-table that represents the routing information of all its neighbours. The Q-values at each node are updated according to the received feedback function, which is formulated as a composite routing metric that reflects the congestion level and link quality of each neighbour. Based on the received feedback, the node selects the neighbour with the minimum Q-value as its preferred parent. In order to comply with the standard RPL and keep the overhead to a minimum, the congestion information is embedded in the periodic DIO control messages. The period of DIO message exchanges is governed by the RPL Trickle timer, whose reset strategy is modified to balance overhead against fast dissemination of the feedback function. To the best of our knowledge, this is the first work that adopts machine learning to tackle the congestion problem in RPL-based networks. Performance evaluations and a comparative analysis with respect to solutions available in the literature show that the proposed method achieves load balancing and improves packet delivery and average delay, at the cost of a marginal increase in the signalling overhead.
The remainder of the text is organized as follows. Section II presents the background and related work. Section III introduces the load-balancing problem in RPL and the proposed Q-learning method. The performance evaluation is given in Section IV, followed by the conclusions in Section V.
II Background and Related Work
In the RPL standard, the OF describes the set of rules and policies that govern route selection and optimization so as to meet the different requirements of various applications. RPL decouples the route selection and optimization mechanisms from the core protocol specification, enabling its users to define routing strategies and local policies that fulfill the conflicting requirements of different LLN applications. Currently, two OFs have been standardized for RPL, namely, the Objective Function Zero (OF0) and the Minimum Rank with Hysteresis Objective Function (MRHOF). Both OFs rely on a single routing decision metric: hop-count in OF0 and Expected Transmission Count (ETX) in MRHOF. In heavy-traffic scenarios, LLN nodes have to forward a large amount of traffic, which may cause congestion at particular parent nodes when the offered load exceeds the available queuing capacity. Consequently, the network suffers from consecutive packet delivery failures due to output queue overflows, also denoted as node congestion (henceforth also simply referred to as congestion). Node congestion can occur even under low-rate traffic, because nodes near the DODAG root have to relay a large amount of traffic. Therefore, node congestion and an imbalanced routing tree topology have a profound impact on the network performance in terms of packet loss, delay, and energy consumption. The RPL specification contains no explicit mechanism to detect and react to node congestion. Instead, all traffic is forwarded through the selected preferred parent as long as it is reachable, without any attempt to perform load balancing, which ultimately leads to poor quality of service (QoS). Therefore, incorporating a load-balancing mechanism in RPL routing is essential to maintain efficient network performance in terms of delay and reliability.
Several rule-based approaches have been proposed in the literature to tackle the congestion problem in RPL networks. The authors in [12] developed an OF using fuzzy logic to improve packet delivery and energy consumption. The method utilizes the hop-count, ETX and residual energy to select the best parent; however, it does not consider the congestion level in the selection process. In [13], the authors proposed a non-cooperative game theory-based solution to mitigate node congestion in RPL, which attempts to find the optimal sending rate of each node based on the information of buffer loss and channel loss. Besides the increased overhead, adjusting the sending rate may violate the requirements of IoT applications that need a fixed refresh rate. A context-aware OF was proposed in [14] to mitigate node congestion and improve network lifetime under heavy traffic. The method considers the residual energy and queue utilization when selecting the preferred parent; however, it suffers from increased overhead when distributing the routing information among neighbour nodes. A load-balancing parent selection was proposed in [15], where all feasible parents compete for packet forwarding when a node transmits a packet. The method wastes a significant amount of energy because all the feasible parents wake up their radios, and it does not guarantee that the winner is the one with the best link quality.
A number of machine learning-based protocols were introduced to improve routing in next-generation networks [16, 17, 18]; however, these works are not compliant with the RPL specification. The authors in [19] introduced iCPLA, a cross-layer optimization method using reinforcement learning to improve RPL performance under heterogeneous traffic. The proposed method utilizes the collision probability to select the optimal parent. However, the method considers only the channel congestion due to collisions and neglects node congestion, which has a crucial impact on the network performance, as shown in the next section.
III The Proposed Method
First, we emphasize the load balancing problem in the standard RPL. Then, we introduce the proposed Q-learning method to mitigate node congestion and achieve load balancing.
III-A Load-Balancing Problem in RPL
We consider a typical LLN model in an IoT application, e.g., industrial automation, as depicted in Fig. 1. In this model, a set of sensor nodes is distributed to form the LLN that is rooted at the DODAG root (border router). The DODAG root connects the LLN to the public Internet or a private IP-based network. The nodes utilize IEEE 802.15.4 links to communicate with each other, and use RPL to construct routes towards the DODAG root. We first investigate the packet delivery performance towards the DODAG root of an LLN consisting of 30 nodes via MATLAB simulations. The LLN structure shown in Fig. 1 depicts a characteristic snapshot of the RPL topology using MRHOF at the end of a simulation. Fig. 2 shows the percentage of packets lost at the link layer and at the output queue under different traffic rates, with each node having a buffer size of 10 packets; note that the assumed traffic-load range covers values that can be expected in practice. The results shown in Fig. 2 represent the average over all nodes in the network. The figure demonstrates that, as the traffic rate increases, the losses due to node congestion increasingly dominate over the channel losses, quickly becoming the main cause of packet delivery degradation.
Fig. 3 shows the Queue Loss Ratio (QLR) of each node according to its hop-distance from the DODAG root. The QLR is calculated as the ratio between the packets dropped due to buffer overflow and the total transmitted packets. Intuitively, one might expect the nodes that are one hop from the DODAG root to experience the highest queue loss levels. However, the results presented in Fig. 3 (corresponding to the RPL topology in Fig. 1) reveal that, under heavy load, queue losses are imbalanced among nodes at the same hop-distance from the root. For instance, node has the highest QLR among five nodes that are 2 hops from the root. Similarly, node has the highest QLR among four nodes that are a single hop from the root. This is due to the inefficient parent selection mechanism in RPL, which leads to imbalanced sub-tree sizes among nodes at the same hop-distance. Since LLN nodes have small queue sizes, typically 4 to 10 packets, their queues start to overflow before the congestion level is heavy enough to be detected through the ETX parameter. Therefore, child nodes are not aware of the backlog status of their parents and continue to transmit packets even if the parents suffer from consecutive queue losses. In summary, there is a need for an efficient load-balancing strategy in RPL.
III-B The Proposed Q-learning for Load-balancing in RPL
Q-learning is a model-free RL technique, where the agent learns to select the best action according to the maintained Q-table. In our approach, the Q-table corresponds to the routing table at each node (agent) and the action represents the selection of the preferred parent. The learning goal is to find the optimal parent selection policy according to the network dynamics, e.g., congestion level and link quality.
Each node maintains a candidate neighbour set which includes all nodes that are within the communication range of . Denote by a Q-value that is the preference of node to select node as its parent, where . The Q-values are periodically updated in the following way
where and are the Q-values for the current and the previous intervals, respectively; the interval duration is determined by the Trickle timer, as will be elaborated later. is the learning rate, and
is the estimate of the feedback function, i.e., the reinforcement signal. Note that the Q-values are initialized to zero for all nodes upon network deployment. The value represents the routing metric of the path between and . The main idea of the proposed method is to enable the nodes to learn the congestion levels at their neighbours through the feedback , and utilize this information to select the best parent in order to achieve load-balancing. We represent the congestion level at node through the Backlog Factor , which denotes the ratio between the current queue length and the total queue size. An exponentially weighted moving average filter is applied for calculation of . The congestion status is distributed to all nodes in the set , as illustrated later. To obtain the optimal network performance, in heavy traffic mode, the node learns to select the parent that is less congested, while in light traffic mode, i.e., no congestion, the node learns to select the parent with the best link quality and the shortest hop-distance. Thus, we formulate as a composite routing metric as follows
where denotes the hop-count of node towards the DODAG root, and is the ETX measurement between and . (Footnote 1: More complex formulations for can be devised. However, the simple metric in (2) yields excellent performance, as shown in Section IV.) In the RPL standard, is exchanged periodically, and is calculated as
The parameter is used to control the weight of in (2), such that it reflects congestion level of the node . Since it holds that , we define as
where is a design parameter that represents the threshold after which the parent node is assumed to be congested. That way, in congestion situations, i.e., when , will have a noticeable effect in (2) compared to and . Conversely, in light traffic scenarios, when , the feedback function is influenced more by and compared to . Based on the received feedback , the node updates the value of according to (1).
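As a rough illustration of how the update in (1) and the composite metric in (2) could fit together, consider the following Python sketch. The exact expressions are not reproduced here, so the exponential-averaging form of the Q-update, the additive form of the feedback, the two-level congestion weight, and every function name and default value below are assumptions made purely for illustration.

```python
def update_backlog_factor(prev_bf, queue_len, queue_size, weight=0.5):
    """EWMA of the queue occupancy ratio (the filter weight is an assumed value)."""
    instantaneous = queue_len / queue_size
    return weight * instantaneous + (1.0 - weight) * prev_bf


def congestion_weight(bf, threshold=0.5, low=1.0, high=10.0):
    """Assumed two-level weight: the backlog term dominates only above the threshold."""
    return high if bf > threshold else low


def feedback(hop_count, etx, bf, threshold=0.5):
    """Assumed additive composite metric: hop-count + ETX + weighted backlog factor."""
    return hop_count + etx + congestion_weight(bf, threshold) * bf


def update_q(q_old, fb, alpha=0.3):
    """Assumed exponential-averaging update with learning rate alpha (lower Q is better)."""
    return (1.0 - alpha) * q_old + alpha * fb


if __name__ == "__main__":
    # Two candidate parents with equal hop-count and ETX but different backlog.
    q = {"congested": 0.0, "idle": 0.0}  # Q-values start at zero
    for _ in range(10):
        q["congested"] = update_q(q["congested"], feedback(2, 1.5, 0.8))
        q["idle"] = update_q(q["idle"], feedback(2, 1.5, 0.1))
    print(q)  # the idle neighbour ends up with the smaller Q-value
```

Under these assumptions, two neighbours with identical hop-count and ETX are separated only by their backlog, so the less congested one converges to the smaller Q-value and becomes the more attractive parent.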
A straightforward approach would be to use the greedy policy, in which each node selects the neighbour with the minimum Q-value as its preferred parent. However, if all nodes rely only on the exploitation phase that follows the greedy policy, the network may suffer from the thundering herd problem, illustrated in Fig. 4. The four leaf nodes in Fig. 4 may select node as their preferred parent as it has the minimum Q-value, which may incur congestion. Upon detecting congestion and selecting a new preferred parent, the four nodes may change simultaneously to the same parent node, causing congestion again; such a cycle may be repeated indefinitely without achieving load balancing. To overcome this problem, the nodes have to explore other alternatives with a certain probability. Specifically, we refer to as the probability of node selecting node as the preferred parent, given by
where is a parameter that determines the amount of exploration. The probabilistic parent selection in (5) represents a modified version of the softmax action selection strategy. In other words, a neighbour with a low Q-value is the most likely to be selected as the preferred parent, while neighbours with higher Q-values are selected with smaller probabilities. Accordingly, the nodes gain useful experience while avoiding the thundering herd problem.
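The probabilistic selection can be sketched in Python as follows. Since the exact form of (5) is not reproduced here, the sketch assumes a plain softmax over negated Q-values with a temperature parameter controlling exploration; the function names and the `temperature` parameter are hypothetical.

```python
import math
import random


def selection_probabilities(q_values, temperature=1.0):
    """Softmax over negated Q-values, so a lower Q-value gets a higher probability.

    Subtracting the minimum Q-value keeps the exponentials numerically stable
    without changing the resulting probabilities.
    """
    q_min = min(q_values.values())
    weights = {n: math.exp(-(q - q_min) / temperature) for n, q in q_values.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def select_parent(q_values, temperature=1.0, rng=random):
    """Draw the preferred parent at random according to the softmax probabilities."""
    probs = selection_probabilities(q_values, temperature)
    neighbours = list(probs)
    return rng.choices(neighbours, weights=[probs[n] for n in neighbours], k=1)[0]


if __name__ == "__main__":
    q_table = {"n1": 3.2, "n2": 3.5, "n3": 6.0}
    print(selection_probabilities(q_table, temperature=0.5))
    print(select_parent(q_table, temperature=0.5, rng=random.Random(1)))
```

A small temperature makes the choice nearly greedy, whereas a larger one spreads the children over several neighbours, which is what prevents all of them from switching to the same parent at once.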
The congestion level needs to be distributed to the set so that the nodes can take the right action according to the network status. In the RPL standard, DIO messages are periodically exchanged between neighbour nodes and carry the RANK of the transmitting node, i.e., . In our proposed approach, we implicitly embed into the RANK value in the transmitted DIO message as follows
where is a positive integer that enables decoding two values ( and ) from a single numeric field (). The value of should be selected so as to keep within its 16-bit boundary. Specifically, a neighbour node that receives the DIO message from extracts the two values and from as
where is the modulo operation. Therefore, in the proposed method, the congestion information is distributed among neighbour nodes without changing the DIO message format or adding any control message overhead.
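One way to pack two values into a single 16-bit field and recover them with a modulo operation is sketched below in Python. This is only an assumed encoding: the backlog factor is quantised to a small number of levels and stored in the remainder, while the original rank is stored in the quotient. The names `encode_rank`, `decode_rank` and `levels` are hypothetical, and the quantisation step is an illustrative choice.

```python
RANK_FIELD_MAX = 0xFFFF  # the RANK field carried in the DIO is 16 bits wide


def encode_rank(rank, backlog_factor, levels=16):
    """Pack an integer rank and a backlog factor in [0, 1] into one integer."""
    if not 0.0 <= backlog_factor <= 1.0:
        raise ValueError("backlog factor must lie in [0, 1]")
    level = round(backlog_factor * (levels - 1))  # quantised congestion level
    encoded = rank * levels + level
    if not 0 <= encoded <= RANK_FIELD_MAX:
        raise ValueError("encoded value does not fit in 16 bits")
    return encoded


def decode_rank(encoded, levels=16):
    """Recover (rank, quantised backlog factor) from the received field."""
    level = encoded % levels   # remainder carries the congestion level
    rank = encoded // levels   # quotient carries the original rank
    return rank, level / (levels - 1)


if __name__ == "__main__":
    field = encode_rank(512, 0.4)
    print(field, decode_rank(field))
```

With this layout the multiplier bounds the largest rank that still fits in the field, which is why its value has to be chosen with the 16-bit limit in mind.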
The transmission interval of the DIO messages is controlled by the Trickle timer algorithm. According to the standard, the Trickle timer is reset to its minimum interval only when changes occur in the network topology. As long as the network is consistent, the DIO message interval is doubled up to a certain maximum value. With such a strategy, the nodes may have inaccurate and outdated congestion information about their neighbour nodes. On the other hand, frequent resetting of the Trickle timer may increase the routing overhead. To resolve this conflict, we introduce a modified reset strategy for the Trickle timer. The proposed strategy is shown in Fig. 5 and is described as follows. A node resets its Trickle timer to when it experiences a certain number of consecutive queue losses. Specifically, the small queues of LLN nodes may fill up temporarily even when there is no congestion, which often results in a false positive if congestion is declared too early, which in turn incurs unnecessary DIO overhead. For this reason, we use consecutive queue losses to reset the Trickle timer. After resetting the timer to , the value of is increased by a fixed value to limit the reset frequency and the DIO overhead. If no queue losses are detected within a timer , the algorithm values are reinitialized as shown in Fig. 5.
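The reset logic described above can be sketched as a small Python state machine. This is not a transcription of Fig. 5: the class name, the method names, the parameter defaults, and the rule that a successful enqueue breaks a streak of consecutive losses are all assumptions introduced for illustration.

```python
class CongestionAwareTrickle:
    """Trickle interval control with a queue-loss-triggered reset (hypothetical sketch)."""

    def __init__(self, i_min=1.0, doublings=8, base_threshold=3,
                 threshold_step=2, quiet_period=60.0):
        self.i_min = i_min
        self.i_max = i_min * (2 ** doublings)
        self.base_threshold = base_threshold
        self.threshold_step = threshold_step
        self.quiet_period = quiet_period
        self.interval = i_min
        self.threshold = base_threshold
        self.consecutive_losses = 0
        self.last_loss_time = None

    def on_interval_expired(self):
        """Standard behaviour while the network is consistent: double up to i_max."""
        self.interval = min(self.interval * 2, self.i_max)

    def on_packet_enqueued(self):
        """Assumption: a successful enqueue breaks the streak of consecutive losses."""
        self.consecutive_losses = 0

    def on_queue_loss(self, now):
        """Count a drop; reset the interval once the streak reaches the threshold."""
        self.consecutive_losses += 1
        self.last_loss_time = now
        if self.consecutive_losses < self.threshold:
            return False
        self.interval = self.i_min                # fast dissemination of the feedback
        self.threshold += self.threshold_step     # make the next reset harder to trigger
        self.consecutive_losses = 0
        return True

    def on_clock(self, now):
        """Reinitialize the threshold after a loss-free quiet period."""
        if (self.last_loss_time is not None
                and now - self.last_loss_time >= self.quiet_period):
            self.threshold = self.base_threshold
            self.consecutive_losses = 0
            self.last_loss_time = None
```

Raising the threshold after each reset is what trades responsiveness for overhead: a persistently congested node resets less and less often, while a node that recovers returns to the initial sensitivity.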
|Traffic load per node||, , , ppm|
|Propagation model||Shadowing (Log-normal)|
|PHY and MAC protocol||IEEE 802.15.4 with CSMA/CA|
|Time slot duration|
|No. of retransmissions|
We consider a typical IoT network in a monitoring scenario, where a set of 30 randomly placed nodes report to a single DODAG root. The nodes communicate using the IEEE 802.15.4 PHY layer and CSMA/CA for channel access. Each node generates 100-byte packets following a Poisson arrival model and forwards them to its preferred parent; the traffic arrival intensity per node (the traffic load) is a parameter whose values are given in Table I. We consider a log-normal shadowing propagation model with the standard deviation specified in Table I. The table also lists the values of other relevant parameters. (Footnote 2: The values of the parameters , and were found via optimization and are independent of the traffic load. Our investigations also showed that these values are robust to changes in the number of network nodes.) To verify the effectiveness of the proposed method, we compare its performance with iCPLA and RPL using MRHOF (RPL-MRHOF). The evaluation was performed using Monte Carlo simulations in MATLAB; the presented results are the averages of 10 simulation runs for each value of the traffic load, where each run lasted for 1000 consecutive slotframes.
Fig. 6 shows a sample of the routing topology of both RPL-MRHOF and the proposed method. As the figure shows, the proposed method achieves load balancing between intermediate nodes, as the nodes learn to avoid congested nodes when selecting their parents. Quantitatively, the method reduces the standard deviation of the number of children per node from 1.3 to 0.75.
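The balance metric quoted above can be computed from a parent map, as in the following Python sketch. The set of nodes over which the deviation is taken and the use of the population (rather than sample) standard deviation are assumptions, and the two toy topologies are invented for illustration; they are not the topologies of Fig. 6.

```python
from collections import Counter
from statistics import pstdev


def children_std(parent_of, forwarders):
    """Population standard deviation of the number of children per forwarder.

    parent_of maps each node to its preferred parent; forwarders lists the
    nodes over which the deviation is computed (nodes without children count as 0).
    """
    counts = Counter(parent_of.values())
    return pstdev([counts[n] for n in forwarders])


if __name__ == "__main__":
    # Hypothetical topologies: four leaves choosing between parents "a" and "b".
    skewed = {"a": "root", "b": "root", "c": "a", "d": "a", "e": "a", "f": "a"}
    balanced = {"a": "root", "b": "root", "c": "a", "d": "a", "e": "b", "f": "b"}
    print(children_std(skewed, ["a", "b"]))    # all leaves attached to "a"
    print(children_std(balanced, ["a", "b"]))  # leaves split evenly
```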
Fig. 7 presents a comparison of the average QLR with varying traffic load: while iCPLA fares better than RPL-MRHOF, the proposed Q-learning-based method achieves the lowest QLR due to a more balanced load distribution. For instance, the proposed method reduces the QLR of RPL-MRHOF by 71% at 120 ppm/node, while iCPLA achieves only a 5% reduction at the same traffic load. The reason is that the parent selection strategy in iCPLA considers only link-level information, i.e., collision probability, which is insufficient to solve node congestion under heavy load.
As already mentioned, LLN devices have small queue sizes that start to overflow quickly as the traffic load increases. In this respect, a solution could be to increase the Buffer Size (BS) to avoid node congestion. However, as shown by Fig. 8, increasing the BS has only a marginal effect on queue losses, especially under heavy load. Specifically, increasing the BS of RPL-MRHOF from 20 to 40 packets reduces the QLR by 12% and 3% at traffic loads of 90 ppm/node and 120 ppm/node, respectively. The figure also shows that, in order for RPL-MRHOF to match the QLR performance of the proposed method when the traffic load is 90 ppm/node, the BS should be 90 packets, which is infeasible in practice for the resource-constrained LLN nodes. In summary, achieving a balanced routing topology has a more significant impact on node congestion mitigation than increasing the BS.
As noted in Section III-A, queue losses are the main reason for the packet delivery degradation in the network. Therefore, the reduction in QLR is directly translated to improved packet delivery performance, as shown in Fig. 9. The figure depicts the Packet Delivery Ratio (PDR), which is the number of packets successfully delivered to the DODAG root divided by the total generated packets. At 120 ppm/node, the proposed method enhances the PDR performance of RPL-MRHOF by 160% compared to a 60% improvement achieved by iCPLA.
Fig. 10 compares the average delay of successfully received packets by the DODAG root under varying traffic load. As the traffic load increases, the average delay in all three methods increases as more packets are backlogged and a packet has to spend more time in the output queue before transmission. However, the proposed method guides the nodes to a fair queue utilization strategy that helps to reduce the average backlogged packets per node. Accordingly, the queuing time is decreased which reduces the average delay in turn. For instance, at 90 ppm/node, the proposed method reduces the average delay of RPL-MRHOF and iCPLA by 27% and 12%, respectively. In other words, the proposed method outperforms the competitors with a significantly higher packet delivery ratio, and a lower delay of the delivered packets.
Fig. 11 depicts the average number of DIO messages transmitted by each node under different traffic loads for the proposed method and RPL-MRHOF. As the figure demonstrates, the proposed method incurs higher DIO overhead than RPL-MRHOF, especially under heavy load. This is because our proposed method resets the Trickle timer more frequently in order to quickly distribute the feedback function when a node suffers from congestion. However, under heavy load, the increased overhead is still insignificant compared to the total traffic in the network. For instance, at 90 ppm/node, the DIO overhead represents less than 0.5% of the total traffic. Moreover, the increased DIO overhead is a reasonable cost for the obtained improvements in the PDR and the average delay.
In this paper, we have proposed an RL-based routing approach to mitigate node congestion and improve the routing performance of RPL networks. Each node adopts Q-learning to select its preferred parent based on a composite feedback function. The congestion level at each node is embedded in the RANK metric and distributed using a modified Trickle timer strategy. Performance evaluations demonstrated the effectiveness of the proposed method in achieving load balancing and improving the network performance in terms of packet delivery and average delay.
- [1] H. Tran-Dang, N. Krommenacker, P. Charpentier and D. Ki, “Toward the Internet of Things for Physical Internet: Perspectives and Challenges,” IEEE Internet of Things J., vol. 7, no. 6, pp. 4711-4736, Jun. 2020.
- [2] B. Ghaleb et al., “A survey of limitations and enhancements of the IPv6 routing protocol for low-power and lossy networks: A focus on core operations,” IEEE Commun. Surv. Tut., vol. 21, no. 2, pp. 1607-1635, Second Quarter 2019.
- [3] T. Winter et al., RPL: IPv6 routing protocol for low-power and lossy networks, IETF RFC 6550, Mar. 2012.
- [4] B. Ghaleb et al., “A survey of limitations and enhancements of the IPv6 routing protocol for low-power and lossy networks: A focus on core operations,” IEEE Commun. Surv. Tut., vol. 21, no. 2, pp. 1607-1635, Second Quarter 2019.
- [5] J. V. V. Sobral et al., “Routing protocols for low power and lossy networks in Internet of Things applications,” Sensors, vol. 19, no. 9, pp. 1-40, May 2019.
- [6] S. K. Sharma and X. Wang, “Toward massive machine type communications in ultra-dense cellular IoT networks: Current issues and machine learning-assisted solutions,” IEEE Commun. Surv. Tut., vol. 22, no. 1, pp. 426-471, First Quarter 2020.
- [7] D. Praveen Kumar, T. Amgoth and C. S. R. Annavarapu, “Machine learning algorithms for wireless sensor networks: A survey,” Information Fusion, vol. 49, pp. 1-25, Sept. 2019.
- [8] P. Levis et al., The Trickle Algorithm, [Online]. Available: https://tools.ietf.org/html/rfc6206.
- [9] P. Thubert, Objective function zero for the routing protocol for low-power and lossy networks (RPL), IETF RFC 6552, Mar. 2012.
- [10] O. Gnawali and P. Levis, The minimum rank with hysteresis objective function, IETF RFC 6719, Sept. 2012.
- [11] H.-S. Kim, H. Kim, J. Paek and S. Bahk, “Load balancing under heavy traffic in RPL routing protocol for low power and lossy networks,” IEEE Trans. Mobile Comput., vol. 16, no. 4, pp. 964-979, Apr. 2017.
- [12] P. Fabian, A. Rachedi, C. Gueguen and S. Lohier, “Fuzzy-based objective function for routing protocol in the Internet of Things,” in Proc. IEEE GLOBECOM 2018, Dec. 2018, pp. 1-6.
- [13] Chowdhury, A. Benslimane and C. Giri, “Non-cooperative gaming for energy-efficient congestion control in 6LoWPAN,” IEEE Internet of Things J., vol. 7, no. 6, pp. 4777-4788, Jun. 2020.
- [14] S. Taghizadeh, H. Bobarshad and H. Elbiaze, “CLRPL: Context-aware and load balancing RPL for IoT networks under heavy and highly dynamic load,” IEEE Access, vol. 6, pp. 23277-23291, Apr. 2018.
- [15] S. L. Sampayo, J. Montavont and T. Noël, “LoBaPS: Load balancing parent selection for RPL using wake-up radios,” in Proc. IEEE ISCC 2019, Jul. 2019, pp. 1-6.
- [16] H. Yao, X. Yuan, P. Zhang, J. Wang, C. Jiang and M. Guizani, “Machine learning aided load balance routing scheme considering queue utilization,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 7987-7999, Aug. 2019.
- [17] B.-S. Roh, M.-H. Han, J.-H. Ham and K.-I. Kim, “Q-LBR: Q-learning based load balancing routing for UAV-assisted VANET,” Sensors, vol. 20, no. 19, pp. 1-18, Oct. 2020.
- [18] H. Yao, X. Yuan, P. Zhang, J. Wang, C. Jiang and M. Guizani, “A machine learning approach of load balance routing to support next-generation wireless networks,” in Proc. IEEE IWCMC 2019, Jun. 2019, pp. 1317-1322.
- [19] A. Musaddiq et al., “Reinforcement learning-enabled cross-layer optimization for low-power and lossy networks under heterogeneous traffic patterns,” Sensors, vol. 20, no. 15, pp. 1-26, Jul. 2020.
- [20] M. L. Littman, “Reinforcement learning improves behavior from evaluative feedback,” Nature, vol. 521, pp. 445-451, May 2015.
- [21] JP. Vasseur et al., Routing metrics used for path calculation in low-power and lossy networks, IETF RFC 6551, 2012.
- [22] J. Hou, J. Rahul and Z. Luo, Optimization of parent-node selection in RPL-based networks, IETF RFC 6921, Mar. 2017.
- [23] J. Miranda et al., “Path loss exponent analysis in Wireless Sensor Networks: Experimental evaluation,” in Proc. IEEE INDIN 2013, Jul. 2013, pp. 54-58.