The Internet of Things (IoT) plays a key role in our daily life, as it enables various intelligent applications in our cities. As the IoT application most closely related to our daily travel, smart transportation serves citizens by offering real-world information and making transport facilities more convenient. Here we give a short background on IoT and smart transportation to clarify the scope of this thesis.
1.1.1 Internet of things
The Internet of Things (IoT) is a paradigm that is attracting increasing attention in modern wireless telecommunications. The basic concept of IoT is that ubiquitous objects around us, such as sensors and mobile phones, are able to communicate and cooperate with each other to solve a common problem [giusto2010internet]. Specifically, an IoT network includes a variety of smart devices capable of connecting to each other and of exchanging and sharing data over the Internet [madakam2015internet]. One of the key technologies enabling these functions is Radio-Frequency IDentification (RFID), which allows smart devices to send their device identification to target receivers (e.g., Cloud facilities) using an RFID identifier [jia2012rfid]. Another foundational technique is the wireless network used for connecting intelligent devices to monitor the environment. With these two techniques, an IoT system can capture real-time environmental data through sensors embedded in IoT devices. The data emitted from the system can be transmitted to the Cloud via gateways for further storage, processing and analysis [2011A]. Typically, in a cloud-dominant centralised architecture, Artificial Intelligence (AI) enabled computing nodes are integrated and implemented at the cloud side, with the intention of centrally collecting useful information from the transmitted data and providing better insight for users to make decisions. Recent IoT applications relying on this architecture include healthcare monitoring, traffic monitoring and environmental resource monitoring [2018Brokering, 2018Economic, St2016Cloud, li2012compressed, he2012integration].
In short, with the advances in wireless communication and sensor networks, IoT has been gaining attention in areas related to our daily life, and more and more ‘things’ or smart objects are being connected to IoT networks. As a result, these IoT-related technologies have also made a large impact on new information and communications technology (ICT). However, advanced IoT networks also come with inevitable shortcomings, especially in applications that require the decision-making process to be conducted at the device side or edge side for better security and privacy protection. Specifically, consider a typical IoT scenario where data streams from various IoT devices are transmitted to the Cloud and stored in a cloud database. Our initial observation is that most IoT devices transmit data at a fixed transmission frequency, which is typically set by default or pre-defined by the device manufacturer, with limited options made available to users. However, some advanced IoT devices with edge intelligence, e.g. Raspberry Pis and the Jetson series toolkit from Nvidia, can now be programmed to promptly respond to changes in the external environment [10.1117/12.2571307, 8230004], and can also be deployed with deep learning algorithms to satisfy stringent low-latency transmission requirements for time-sensitive IoT applications [9287960, 9289509]. The fixed-frequency approach does not sufficiently cater for a practical situation where groups of IoT devices may work collaboratively with limited system resources restricted by the operational environment.
In fact, implementing IoT devices in a resource-constrained environment poses two interesting problems in the design of IoT networks: 1) how to determine an adaptive transmission frequency for each IoT device so that the overall utility of the group of devices is maximised in response to dynamic changes of the environment; 2) how to ensure that different kinds of network resources are managed in a way that allows heterogeneous IoT devices to engage with the network in a secure, privacy-aware and plug-and-play manner. To address these problems, the first topic of this thesis is to propose a transmission frequency management system for edge devices in an IoT network with a robust anomaly detection mechanism.
1.1.2 Smart transportation
On the one hand, IoT has played a key role in enabling the smart city, which combines data collection, analysis and decision making [batty2012smart]. On the other hand, the smart city has become an established term alongside IoT, denoting a novel city management approach that establishes a collaborative society in which data from daily life is leveraged to inform city management decisions [albino2015smart].
As the population grows, the need for transportation increases dramatically, and smart transportation therefore becomes one of the most challenging parts of a smart city. To enable smart applications in modern transportation, advanced technologies, e.g. intelligent transportation systems (ITS), have been proposed to provide creative insight for traffic management and improve user experience by providing proper information about the traffic network [lin2017intelligent]. For instance, a smart parking system can save time for drivers by informing them of the availability of parking spaces [khanna2016iot]; carbon emissions and pollutants may also be minimised by recommending the shortest path to drivers during their parking search [agarana2017minimizing]. To sum up, smart transportation has shed light on modern traffic management and satisfied the needs of citizens in daily commuting. However, there are still open problems in smart transportation, such as traffic demand prediction, accident prevention, traffic flow prediction, cloud-based multi-agent planning and energy consumption [rodrigue2020geography], which are more challenging to deal with using conventional means of traffic management.
Bike availability prediction
As one of the common modes of transportation, bikes provide a healthy and convenient way to travel short distances, and sharing bikes have become prevalent in our cities. An efficient bike-sharing system can not only reduce cost and commute time for urban commuters but also effectively mitigate the air pollution generated in cities [otero2018health]. However, bike availability prediction is one of the challenging problems in traffic demand prediction because the number of available bikes tends to be unbalanced, particularly at peak demand dates and hours [hulot2018towards]. Therefore, an important consideration in making a bike-sharing system efficient is to balance supply and demand in the bike-sharing network [raviv2013optimal]. To do this, traditional management methods, such as manual monitoring systems, have been deployed to enable the relocation of bikes across different stations using other means of transportation, e.g. trucks [7313194, raviv2013static]. However, this approach can easily lead to supply-demand imbalance due to estimation errors by system operators and unexpected traffic delays during bike relocation. Thus, given the uncertainty of departures and arrivals of bikes at any bike station, it is important to take a more proactive approach by accurately predicting the number of bikes that will be available to users at any given time and location. However, on the topic of traffic demand prediction, most existing work focuses on taxi demand/availability prediction [yao2018deep, yao2019revisiting], and limited work discusses availability prediction for sharing bikes. Meanwhile, current approaches are not able to forecast availability precisely because of weaknesses in traffic feature extraction and modelling. Therefore, the second topic of this thesis is sharing-bike availability prediction using graph neural networks (GNNs).
Lane change detection
As another challenge of smart transportation, accident prevention ensures driving safety and deserves more attention. Even though traffic advisories and regulations have been put in place to ensure a safe driving environment and minimise the chance of traffic accidents, malicious driving intentions (e.g., abrupt acceleration and frequent lane changing) still pose a threat to traffic safety and disturb the normal traffic flow. For instance, a speed advisory system (SAS) offers speed guidance to ensure driving safety, but vehicles tend to exhibit unexpected acceleration and lane changing behaviours once they leave the road segment covered by the SAS [jeon2014effects]. Therefore, the detection of driving intentions has been incorporated into traffic management, raising alarms for intervention when driving safety may be under threat, for example from traffic incidents [asakura2017incident, hawas2007fuzzy], traffic congestion [wang2018locality] and malicious driving. Particular attention should be paid to frequent lane changing, which may easily result in severe traffic accidents on highway networks. Existing approaches, such as hidden Markov model (HMM) [li2016lane] and LSTM-based methods [tang2020driver, 8813987], have been found less capable of dealing with the lane changing detection problem, as they cannot sufficiently model traffic data with natural geographical information (e.g., the connection between lanes). Therefore, the last topic of this thesis concerns the detection of lane changing intention using GNNs to leverage the geographical information of the highway network and thereby improve detection performance.
1.2 Research objectives
Research objective 1: Optimise transmission frequencies for edge devices in an IoT network with a robust anomaly detection mechanism.
We consider the two problems of the design of IoT networks, as discussed in section 1.1.1. Our key assumption is that different IoT devices may have different priority levels when transmitting data in a resource-constrained environment and that those priority levels may only be locally defined and accessible by edge devices for privacy concerns. With these in mind, the research objective is to optimise the transmission frequencies for a group of IoT edge devices under practical constraints. We aim at establishing a transmission frequency management system which can allocate optimal transmission frequencies to IoT devices and maximise the overall utility of the edge devices in the IoT network in a decentralised manner. In order to ensure the security of the system, we shall also devise an anomaly detector, on top of the designed optimal transmission management system, which can effectively identify abnormal transmission frequencies in different settings. The anomaly detector is expected to only leverage limited information from the IoT system. We will investigate both mathematical rule-based and deep learning based approaches, and examine their efficacy in tackling such challenges.
Research objective 2: Availability prediction for the sharing-bike scheme using a spatial-temporal graph convolutional network.
As a research topic related to smart transportation, we first consider the problem of availability prediction for sharing bikes. The research objective is to present an availability prediction system which can forecast the number of available sharing bikes at different bike stations accurately and promptly using models trained on realistic data. Recently, graph-based solutions have attracted much attention in the literature, as they have shown efficacy in improving traffic management. In particular, we shall apply the spatial-temporal graph convolutional network (ST-GCN), a powerful variant of graph convolutional networks (GCN) which aims to capture the relationship of data contained in the graph nodes across both spatial and temporal dimensions, to improve the prediction accuracy, and compare its performance with other schemes to illustrate its efficacy in chapter 3. Moreover, the impacts of different methods of modelling the adjacency matrix shall be investigated.
Research objective 3: Detecting lane changing intention in a highway network scenario using a graph neural network.
The last research objective is related to driving safety. As mentioned previously in section 1.1.2, frequent lane changing threatens driving safety on the highway network. The objective of this part is to develop an algorithm which is able to detect frequent lane changing behaviour on the highway network using graph-based deep learning methods. As we shall see, the proposed algorithm will be able to forecast the lane changing probability of vehicles on a segment of the highway network in real-time.
1.3 Thesis contributions
The thesis discusses three topics related to IoT and smart transportation. The contributions of the thesis can be summarised as follows:
In chapter 2, we propose a transmission frequency management system which is able to find the optimal transmission frequency for each IoT device, in order to maximise the overall utility in a resource-constrained, privacy-aware environment. We also design an anomaly detector to ensure that the transmission frequencies of the proposed IoT transmission frequency management system are in good order.
In chapter 3, we design a deep learning architecture by combining attention mechanism with the spatial-temporal graph neural network, to better predict the sharing-bike availability based on realistic datasets. Furthermore, we also discuss the impacts of different modelling methods of adjacency matrices on the proposed architecture.
In chapter 4, we apply a refined version of a graph neural network to predict the lane changing intention and analyse the pattern of driving data for the purpose of model interpretability.
1.4 Thesis structure
The thesis is organised as follows:
Chapter 1 introduces the background, our research objectives, thesis contribution and structure.
Chapter 2 approaches the first research objective by applying optimisation and deep learning methods to IoT systems.
Chapter 3 achieves the second research objective by leveraging graph neural networks to forecast the sharing-bike availability, based on the data collected by IoT devices embedded in bike stations.
Chapter 4 tackles the third research problem by using graph neural network to analyse the driving patterns and predict the lane changing intention, based on the data generated from a novel mobility simulator.
Chapter 5 summarises the thesis and highlights the potential directions for future work.
In a cloud-based IoT solution, data from various IoT devices need to be pushed to cloud-based database instances in real-time. However, the capacity of storage space is limited. For instance, an IBM Cloudant database instance allows 1 GB of data storage with 10 writes/sec for its Lite Plan users, and 20 GB of data storage with 50 writes/sec for its Standard Plan users [bienko2015ibm]. Given this scenario with the limited storage resource, if the Maximum Writing Frequency (MWF) of the data is not managed properly, it can be envisioned that a writing congestion event, e.g. a REST-API writing failure, can be triggered for a group of IoT devices. Also, another concern is on privacy, which, in our context, refers to the fact that the mapping between the utility and the transmission dynamics of a given IoT device should not be revealed to any unrelated devices, third-party gateways and untrusted cloud units or instances. If this mapping information is revealed publicly it may be possible for an attacker to identify which IoT device is more vulnerable in a given system [2020A].
To address this challenge, in this chapter we propose a transmission frequency management system for IoT edge devices in a decentralised architecture with anomaly detection mechanisms. Thus the MWF can be managed optimally by a group of IoT devices and any abnormal writing frequency occurrences can be detected by the gateway. To carry out the optimisation, we assume that each IoT device is associated with a concave utility function [7106504, 2019Utility] that only the user of the device can specify. Here, the utility refers to how a user can practically benefit from a given Data Flow Writing Frequency (DFWF). For instance, a utility function can describe the accuracy of a trained model with respect to the DFWF of a given IoT device for an Edge AI type of IoT application [LV202190]. Furthermore, as previously mentioned, such a utility function may also reflect the significance or vulnerability of an IoT device in a specific scenario. For instance, a faster transmission frequency of a webcam in a bank system may be more desirable, i.e., have higher utility, especially in an emergency, than that of a detector.
With this idea in mind, our main objective in our system is to maximise the overall utility of the group of IoT devices given the predefined and limited MWF and storage capacity of the database. We will show that the presented challenge can be formulated as a concave optimisation problem with constraints. This problem will then be solved using the well-known Alternating Direction Method of Multipliers (ADMM) algorithm [boyd2011distributed] in a decentralised optimisation framework where each utility function is locally defined on the edge device and will not be revealed to any unrelated devices and untrusted management platforms, such as other smart gateways and cloud units/instances. The proposed solution aims to provide flexibility in data transmission for IoT systems and applications, especially in resource-constrained environments. As we shall see, the designed system is fully autonomous and can be easily deployed to optimally manage various IoT transmission frequencies with anomaly detection capabilities.
We note that significant work on anomaly detection has been undertaken in the IoT context: for instance, Liu et al. proposed a detector for on and off attacks by a malicious network node in an industrial IoT site; Anthi et al. presented an intrusion detection system for an IoT system to identify Denial of Service (DoS) attacks; Ukil et al. discussed the detection of anomalies in IoT-based healthcare analytics by analysing the cardiac signal; and Hu et al. [traffic] proposed a Context-augmented Graph Auto-encoder (Con-GAE) for anomaly detection in traffic monitoring. However, the anomalies defined in these works are largely based on tampering with the contents of data packets transmitted by IoT devices (e.g., changing a data value from “A” to “B” in the transmitted file [miller2016detection]), and no approach has been found for anomaly detection in an IoT data transmission frequency system that involves an optimal iterative scheme. Therefore, in this thesis, we are interested in detecting malicious manipulations on the edge devices that lead to a change of transmission frequency.
The contributions of this chapter can be summarised as follows:
We propose an optimisation framework for an IoT network so that the transmission frequencies of the connected IoT devices can be dynamically adjusted to their optimal values with low latency through an ADMM-based iterative optimisation method.
We design an anomaly detector on top of the frequency management system, which is able to infer anomalies that may occur in the underlying transmission management system in real-time.
We propose both mathematical rule-based and deep-learning-based approaches for detecting anomalies in the IoT transmission frequency management system. In particular, the rule-based approach is designed to reveal anomalies in the system based on fundamental optimisation theory, and the deep-learning approach aims to establish a prediction model based on sequential data analysis in system implementations.
We conduct a comprehensive comparative study using both anomaly detector strategies and demonstrate the strengths and weaknesses of the two approaches in both simulated and practical working environments.
The remainder of this chapter is organised as follows. In section 2.2, the architecture of the proposed system is presented. The optimisation problem is formulated in section 2.3 and its implementation is discussed in section 2.4. The experiments of transmission frequency management and results are discussed in section 2.5. The anomaly detection mechanisms are demonstrated in section 2.6. The real-world experiment for anomaly detection is presented in section 2.7 and the corresponding results are discussed in section 2.8. Finally, a conclusion for this chapter is provided in section 2.10.
2.2 System Architecture
Our proposed system architecture is illustrated in Fig. 2.1. The system consists of four main components, including IoT edge devices, gateways, a cloud platform and users. The main functionalities of each component are described as follows:
IoT devices: sensors/devices connected to a gateway, having the capabilities of defining utility functions and the ability to solve a local optimisation problem in a decentralised manner.
Gateway: collects data from IoT devices/sensors, passes data to the Cloud, and conducts basic data processing tasks including anomaly detection to protect and inform users.
Cloud platform: a central hub for data analysis, monitoring and storage.
Users: the owner of the IoT devices who wishes to use the IoT devices in some collaborative application scenarios.
In the proposed system, a gateway starts by waiting for a connection from IoT devices. When an IoT device initially connects to the gateway, the decentralised optimisation algorithm is activated to calculate the optimal transmission frequencies for all connected devices whilst taking account of the resource constraints of the system. After that, the gateway starts to collect data streams from all IoT devices after the transmission frequencies are established. Finally, data collected by the gateway is transmitted to the cloud platform for data storage and further analysis of the IoT devices if specifically requested by the users.
2.3 Problem Statement
We now present the specific problem to be solved in this chapter. A user wishes to determine the optimal DFWF of every IoT edge device so that the overall utility of the whole group is maximised, given the number of devices connected to the gateway, the utility of each device at its current DFWF, the MWF, the total data storage (e.g., in MB) available per received data packet, and the data size (e.g., in MB) required by the i-th device per writing request.
Mathematically, this problem can be formulated as follows:
We shall only require that each utility function can be modelled as a continuously differentiable, non-decreasing, strictly concave function, which is a common assumption for modelling the utility of internet data traffic [srikant2004mathematics]. For example, utility functions may be modelled as a cluster of negative quadratic functions.
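As a sketch only, the problem described above may be written in the following form. The notation here is an assumption introduced for illustration and is not taken from the thesis: $N$ is the number of connected devices, $x_i$ the DFWF of device $i$, $f_i$ its utility function, $F$ the MWF, $S$ the available data storage, and $d_i$ the data size of device $i$ per writing request.

```latex
\begin{equation*}
\begin{aligned}
\max_{x_1,\dots,x_N} \quad & \sum_{i=1}^{N} f_i(x_i) \\
\text{s.t.} \quad & \sum_{i=1}^{N} x_i \le F, \\
& \sum_{i=1}^{N} d_i\, x_i \le S, \\
& x_i \ge 0, \qquad i = 1,\dots,N.
\end{aligned}
\end{equation*}
```

Under this reading, each $f_i$ is strictly concave and the feasible set is an intersection of half-spaces, so the problem has a unique maximiser whenever the feasible set is non-empty. A negative quadratic such as $f_i(x_i) = -a_i x_i^2 + b_i x_i$ with $a_i > 0$ is strictly concave and non-decreasing on $[0, b_i / (2 a_i)]$.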
2.4 System Implementation
The classic ADMM algorithm proposed in [boyd2011distributed] is particularly suited to solving the formulated optimisation problem (2.1), as the problem can be converted to a convex optimisation problem with convex constraints. Here we briefly recall the ADMM algorithm for solving (2.1), which is shown in Algorithm 1, where x and z are updated in an alternating fashion and u is a dual update variable.
Note that the above ADMM algorithm can be implemented in a decentralised manner, as our objective function is separable, which implies that both the x and u vector updates in the algorithm can be implemented in parallel. Finally, the z update depends on inputs from both x and u. Given these inputs, the projection operator projects the resulting vector to the constrained convex space . Thus, the z update needs to be implemented on the gateway. Note that is the augmented Lagrangian parameter and we take , being equivalent to a step size in update. The ADMM algorithm in its decentralised format is shown in Algorithm 2.
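The x, z and u updates described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this chapter: utilities are hypothetical negative quadratics of the form -a*x^2 + b*x, so that the local x update has a closed form; the feasible set is assumed to consist of a total-frequency bound, a total-data-size bound and a lower bound on each frequency; and the projection is approximated with Dykstra's alternating projections, which is a technique chosen here for simplicity. All function and parameter names are illustrative.

```python
def project_halfspace(v, a, b):
    """Project v onto the half-space {z : a . z <= b}."""
    excess = sum(ai * vi for ai, vi in zip(a, v)) - b
    if excess <= 0:
        return list(v)
    norm2 = sum(ai * ai for ai in a)
    return [vi - excess * ai / norm2 for ai, vi in zip(a, v)]


def project_feasible(v, sizes, max_freq, max_size, x_min=0.0, sweeps=200):
    """Approximate projection onto the assumed feasible set using
    Dykstra's alternating projections over three convex sets:
    sum(z) <= max_freq, sizes . z <= max_size, and z >= x_min."""
    n = len(v)
    ones = [1.0] * n
    projections = [
        lambda w: project_halfspace(w, ones, max_freq),
        lambda w: project_halfspace(w, sizes, max_size),
        lambda w: [max(wi, x_min) for wi in w],
    ]
    z = list(v)
    corrections = [[0.0] * n for _ in projections]
    for _ in range(sweeps):
        for k, proj in enumerate(projections):
            y = [zi + ci for zi, ci in zip(z, corrections[k])]
            z = proj(y)
            corrections[k] = [yi - zi for yi, zi in zip(y, z)]
    return z


def local_x_update(a, b, z_i, u_i, rho):
    """Device-side step: maximise -a*x^2 + b*x - (rho/2)*(x - z_i + u_i)^2.
    Setting the derivative to zero gives the closed form below."""
    return (b + rho * (z_i - u_i)) / (2.0 * a + rho)


def admm(utilities, sizes, max_freq, max_size, rho=1.0, iters=300):
    """Scaled-form ADMM: x and u are per-device quantities,
    while the z update (projection) is performed by the gateway."""
    n = len(utilities)
    z = [0.0] * n
    u = [0.0] * n
    x = [0.0] * n
    for _ in range(iters):
        # Each device updates its own x using only its own utility.
        x = [local_x_update(a, b, z[i], u[i], rho)
             for i, (a, b) in enumerate(utilities)]
        # Gateway projects x + u onto the resource constraints.
        z = project_feasible([xi + ui for xi, ui in zip(x, u)],
                             sizes, max_freq, max_size)
        # Scaled dual update.
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return x, z


if __name__ == "__main__":
    # Hypothetical example: two devices, total frequency limited to 6 Hz.
    x, z = admm([(1.0, 10.0), (1.0, 6.0)], [1.0, 1.0], 6.0, 100.0)
    print([round(v, 3) for v in z])
```

In this sketch the gateway only ever sees the sum x + u from each device, never the utility coefficients, which mirrors the privacy argument made in the text. For the hypothetical utilities above, the constrained optimum can be checked by hand from the optimality conditions to be 4 Hz and 2 Hz.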
With this algorithm in mind, the proposed system can be implemented in the following steps, which are illustrated in Fig. 2.2.
S1: During the initialisation stage, a user needs to specify some parameters before running the algorithm. This includes , , , and the utility function of each device.
S2: When the initialisation step finishes, the ADMM algorithm will be implemented in an iterative manner on the edge IoT devices to determine the optimal DFWF by computing the optimal as per Algorithm 2.
S3: During each iteration, the gateway gathers all the optimal from all devices, calculates and broadcasts the updated z value to local edge devices. Upon receiving the z value, each edge device updates correspondingly.
S4: If there are any resource changes during runtime, the algorithm can dynamically capture the changes to recalculate the optimal solution given the new context.
S5: When the algorithm converges, the optimal DFWF will be set by each device, and these devices can then start pushing data to the cloud accordingly.
S6: The gateway keeps monitoring the data injection and detects if an anomaly happens on any of the transmission frequencies. If so, the user will be alerted and the optimal solution will be recalculated and reset after the anomaly has been remedied. We note that the legitimate reconfigurations of the system should not be identified as anomalies. Instead, the devices notify the gateway when legitimate changes happen, and the system executes step S4.
S7: Finally, all transmitted data streams will be stored on the cloud and an authorised user can leverage the stored data for visualisation and analysis by making a request.
2.5 Experiment results on optimal transmission frequency allocation
This section presents simulation results to evaluate the performance of the proposed system. As shown in Fig. 2.3, the system consists of a laptop as the central node (i.e., as a smart gateway in this work), three IoT devices (Raspberry Pi), and a router for the communication between the gateway and the IoT devices. Typically, IoT devices connect to the router in a wireless manner. However, in our setup, since the IoT devices do not have the capability of wireless transmission, they transmit data to the router via cables, and the laptop communicates with router wirelessly. Decentralised ADMM optimisation and data transmission are implemented on both the gateway and devices via socket programming. System parameters for the simulations are set as , , , , , and . The utility functions in this simulation are presented in Table 2.1 and have the characteristics previously specified to successfully apply the ADMM algorithm. We note that the utility functions are required to be concave based on optimisation problem 2.1 and the utility functions in Table 2.1 are selected as our examples. We simulate the system in two scenarios: a) resources are sufficient for the data transmission request, and b) resources are insufficient for the data transmission request from all devices. For each device , its transmission frequency is defined as data is transmitted times per second. In particular, implies that the device is not transmitting data. Thus, for each device, an extra constraint, applies to indicate the minimum transmission frequency. For simplicity, we set in our simulation.
It is worth noting that the gateway is not able to access the utility function of each device in order to cater for privacy concerns, and also that the transmission frequency of each device is calculated locally and not explicitly exposed to the gateway. However, a DFWF may be estimated by the gateway by evaluating the time intervals of the consecutively received data packets and an averaged DFWF is calculated over 300 data packets after the optimal DFWF is assigned.
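The gateway-side estimate described above can be sketched as follows. This is an illustrative assumption about how the averaging might be done, not the thesis implementation: the function name `estimate_dfwf` and the choice of dividing the number of inter-arrival intervals by the elapsed time are hypothetical, while the window of 300 packets follows the text.

```python
def estimate_dfwf(arrival_times, window=300):
    """Estimate the average writing frequency (Hz) from packet arrival
    timestamps (in seconds), using the most recent `window` packets."""
    recent = arrival_times[-window:]
    if len(recent) < 2:
        raise ValueError("need at least two packet arrivals")
    elapsed = recent[-1] - recent[0]
    if elapsed <= 0:
        raise ValueError("arrival times must be strictly increasing")
    # (number of intervals) / (time spanned by those intervals)
    return (len(recent) - 1) / elapsed


if __name__ == "__main__":
    # Hypothetical device writing every 0.25 s.
    timestamps = [i * 0.25 for i in range(400)]
    print(estimate_dfwf(timestamps))
```

Because the estimate depends only on arrival times, the gateway does not need access to the utility function or to the locally computed frequency, which is consistent with the privacy requirement stated above.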
Table 2.1: Utility functions of the IoT devices (columns: device index, utility function).
2.5.1 Allocation with sufficient resources
In this scenario, only device and device are connected to the gateway (i.e., parameter ) and all other system parameters are kept as , , , with the associated utility functions and shown in Table 2.1. With these parameters, the theoretical optimal results of the ADMM implementation are and for the optimisation problem 2.1. This result implies that the gateway expects to receive and data packet(s) per second from device and on average. In this setup, the capacity provided by the system is sufficient since and . With the decentralised ADMM implemented using the simulation setup, the optimisation results and resource consumption of the system are illustrated in Fig. 2.4 and Fig. 2.5, respectively. In particular, Fig. 2.4 shows the evolution of the calculated DFWF for both devices as estimated by the gateway. The DFWFs are estimated along with the number of received data packets, indicated by the red and green lines for device and device , respectively. Concretely, our results show that the estimated DFWFs are and for device and , respectively, as shown in Table 2.2, which result in a delay for device (i.e., calculated by ) and a delay for device . The estimated DFWFs are just slightly below the theoretical optimal DFWFs, indicated by the dotted line in Fig. 2.4. The decrease of the DFWF may be accounted for by the internet speed, while the communication between the gateway and the devices is based on a router. Meanwhile, we find that the fluctuation of the estimated DFWFs is caused by the data jamming when the gateway is receiving data packets from IoT devices with high writing frequency. Fig. 2.5 shows the sum of DFWFs as well as the size of total data packets of all connected devices per second transmitted to the gateway. The dotted line indicates the maximum total DFWF (in red) and received data size (in green) for each data packet. 
Since the system can provide sufficient resources, the total DFWF and the writing data size have not reached the resource boundary after the transmission frequencies are optimised, indicating that the proposed system is robust as long as the system resources are sufficient for this specific data transmission task.
Table 2.2: Estimated DFWF (Hz) for Device 1 and Device 2.
2.5.2 Allocation with insufficient resources
In this scenario, after device and device have connected to the gateway and the optimised transmission frequencies have been calculated, a new device, device , connects to the gateway and the timing of connection is recorded. Given , , , , , and the corresponding utility functions , , reported in Table 2.1, the theoretical optimal results of the ADMM implementation are calculated as , and for optimisation problem 2.1. This result implies that, on average, the gateway expects to receive , and data packet(s) per second from devices , , and respectively.
Based on the simulation platform, the decentralised optimisation process and system resource usage are shown in Fig. 2.6 and Fig. 2.7 in the scenario of insufficient resources. We note that before the connection of device , device and device transmit their data packets under the corresponding optimised transmission frequencies exactly as described in the first scenario with sufficient resources. As shown in Fig. 2.6, after the device connects to the system (indicated by the red arrow), the DFWF of device is readjusted and converges to a new optimal value. The DFWF of device remains unchanged since the recalculated optimal result equals the previously assigned DFWF before the connection of device . After the decentralised ADMM solution is found for device (indicated by the magenta circle), device pushes data packets to the gateway using its optimal DFWF. After all three devices are transmitting data steadily (i.e., after the magenta circle), our results show that the estimated DFWFs are , and for device , , and , respectively, which are reported in Table 2.3. Again, these estimated DFWFs are slightly below the theoretical optimal DFWFs, indicated by dotted lines, reflecting time delays of , and (i.e., calculated by ) for devices , , , respectively during their transmissions.
After the optimal transmission frequencies are established, as shown in Fig. 2.7, device starts to push data (marked by the magenta circle) and the total writing data size reaches the level of the system resource boundary immediately. This indicates that the proposed system is able to reallocate the system resources to finish the data transmission task effectively using the ADMM approach. Finally, for comparison purposes, we evaluate the overall utility under the ADMM-optimised DFWFs, with non-optimised evenly distributed DFWFs (i.e., ), and non-optimised proportionally distributed DFWFs (i.e., ) as two baselines given the same MWF . The results in Table 2.4 show that the utility under ADMM-optimised DFWFs achieves the largest value, which demonstrates that the proposed system obtains a better result than the trivial setups that have not undergone any optimisation process.
|DFWF (Hz)||DFWF (Hz)||DFWF (Hz)|
|Device 1||Device 2||Device 3|
2.6 Anomaly detection for changes of transmission frequency
Once the transmission frequencies are determined and allocated by the system, all the devices push data steadily at their specified DFWF. However, the transmission frequencies can be tampered with both explicitly and implicitly. In other words, a malicious attack on a device can not only manipulate the DFWF explicitly, but can also modify the utility function (i.e., the input or function type), the transmission data size and the system resources, which changes the DFWF implicitly. In this section, these manipulations are discussed with a view to detecting abnormal transmission frequencies at the gateway side.
We first consider the scenario of manipulating the DFWF explicitly. According to the fundamental mechanism of the ADMM algorithm, the gateway only has access to z. Since x achieves convergence to z eventually, as a specific example (i.e., and ) shown in Fig. 2.8, we argue that the gateway is able to detect the anomaly of x during the whole transmission process based on its knowledge of the latest value of z. Specifically, this detection process can be described in the following three steps:
Gateway accesses the value of for each device.
Gateway estimates the DFWF (i.e., the converged value of ) for each device according to the received time-stamped data flow.
If the estimated DFWF is significantly different from the reference value of (i.e., , where is a threshold depending on the network delay), the transmission frequency can be regarded as anomalous, i.e., as having been manipulated.
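The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the function names, the way the DFWF is estimated from arrival timestamps, and the example threshold are all hypothetical.

```python
from typing import Sequence


def estimate_frequency(timestamps: Sequence[float]) -> float:
    """Estimate the DFWF (packets per second) from packet arrival times in seconds.

    Assumption: the estimate is the number of inter-arrival gaps divided by
    the elapsed time between the first and last packet.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    span = timestamps[-1] - timestamps[0]
    if span <= 0:
        raise ValueError("timestamps must be strictly increasing overall")
    return (len(timestamps) - 1) / span


def is_explicitly_manipulated(timestamps: Sequence[float],
                              z_ref: float,
                              eps: float) -> bool:
    """Flag a device whose estimated DFWF deviates from the gateway's z by more than eps.

    eps is a hypothetical tolerance that should reflect the network delay.
    """
    return abs(estimate_frequency(timestamps) - z_ref) > eps


# Usage with illustrative values: the gateway holds z = 2.0 Hz for a device.
steady = [0.0, 0.5, 1.0, 1.5, 2.0]        # one packet every 0.5 s
tampered = [0.0, 0.25, 0.5, 0.75, 1.0]    # one packet every 0.25 s
print(is_explicitly_manipulated(steady, z_ref=2.0, eps=0.2))
print(is_explicitly_manipulated(tampered, z_ref=2.0, eps=0.2))
```

The check only needs the reference value held at the gateway and the time-stamped data flow, which matches the information available in the explicit-manipulation scenario.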
However, the above detection process is not applicable in some scenarios. Given the transmission frequency management system described in Fig. 2.1 and problem (2.1), there are other types of manipulations on the edge (i.e., on edge devices and the gateway) that can also change the transmission frequencies. Specifically, these manipulations can happen by changing the utility function input, the function type, the data size requested per writing request (defined on edge devices), or the maximum writing frequency and data storage (the system resources allocated to the gateway). Each of these triggers a new ADMM optimisation process in which the x value converges to the new z value, so the estimated DFWF still matches the gateway's reference and the comparison above raises no alarm. In general, when a manipulation happens on a device in the network, a new optimisation process needs to be activated by solving the following problem:
where and denote the new utility function and new data packet size after tampering, respectively.
Clearly, there are many ways in which an optimal transmission frequency can be implicitly tampered with. In our context, we consider the following specific definitions:
Manipulation on utility function input only: The independent variable of the utility function is manipulated by adding an input factor with a small given range, .
Manipulation on utility function type and input: The utility function can be changed entirely to another type of concave function specified by the utility function set of the system, i.e., .
Manipulation on transmission data size: The data size required for the ’th device per writing request is manipulated by adding a size factor with a small given range, .
Comment: It is also possible to affect the optimal transmission frequency and by manipulating system resource in a small given range, such as and . In our definition, such manipulations are regarded as a systematic adjustment as it is not directly related to any user-specific property, e.g., , and thus it will be regarded as normal scenarios in our anomaly detection analysis.
In addition, we also have the following assumptions in our problem 2.2.
We assume that at every given time only one edge device is manipulated, which is the fundamental basis for detecting an anomaly when multiple devices are manipulated in our system.
We assume that the anomaly detector is a separate process running on the gateway, and it can only access limited information on the gateway but not all. More specifically, we assume that the anomaly detector can only access the value of z and the sum of x and u, denoted by v, from the ADMM iterative process at the gateway. It will never access the exact transmission frequency x directly from the local devices and other resources/parameters shared between devices and the gateway.
We assume that the anomaly detector starts to monitor anomalies in real-time once the ADMM algorithm converges and local devices start pushing data to the gateway. The device setting will be reset when any anomalies are detected, and the optimisation process will be reactivated to reset the optimal solutions for fair resource allocation as per the normal situation. To further illustrate this point, the process of anomaly detection is shown in Fig. 2.9.
We now introduce two approaches to address the anomaly detection problem, namely a rule-based approach and a deep learning approach. The rule-based approach detects system anomalies based on the mathematical deduction, and the deep learning approach solves the detection problem using collected experimental datasets of the system. The rule-based approach is proposed as a baseline method as we shall see it has some drawbacks in detecting system anomalies in detail.
2.6.1 Rule-based anomaly detection
Our objective is to investigate the behaviour of the optimised system before and after manipulation. To this end, we borrow some fundamental concepts from the optimisation theory, i.e. the Karush-Kuhn-Tucker (KKT) conditions [gordon2012karush] for the optimisation problem (2.1) under study. For mathematical conventions, we now rewrite the original optimisation problem (2.1) in the following format:
where is a convex function. The Lagrange equation of (2.3) is presented as follows:
and the KKT conditions require the following to be held for optimality:
where is the operation of partial derivative (i.e., gradient), , are Lagrange coefficients for , and .
Specifically, and , which represents the constraints in problem (2.3) with:
Clearly, the converged optimal solution will fall into one of the following situations with reference to system constraints.
Given , we have according to equation (2.5). The system is running under . Thus, for each device , we have
for problem (2.3).
Considering the constraint , when a manipulation results in an increase of DFWF for device (i.e. ), at least one of () decreases. Considering that is monotonically increasing with respect to an increased (i.e. with convexity), the decrease of will also decrease . Consequently, , will decrease as per (2.8), which indicates a decrease of . Therefore, an increase of results in a decrease of the transmission frequencies of all other devices .
Given , we have according to equation (2.5). The system is running under . For each , we have
where since is the required data size. This implies
for problem (2.3).
Similar to the first situation, without loss of generality, an increase of DFWF for device , , after a manipulation will lead to a decrease of at least one due to the equality constraint . Since is convex, the decrease of indicates a decrease of . Given formula (2.10), we have that decreases proportionally, followed by the increase of , resulting in a reduced .
Given and , we have and according to equation (2.5). Thus, the system is running within the boundary of system resources. For each device , we have
Considering that the system is running within the boundary of system resources, manipulation on any device will not affect other devices. That is, for instance, when a manipulation results in an increase of DFWF for device (i.e. ), other , remain unchanged since they were already optimised and the system resource is sufficient to cover the extra needs for device .
Given the above discussion, we have observed that once the manipulation accounts for a change of DFWF on a given device, DFWF of other devices will either change oppositely or remain unchanged. Accordingly, we can devise a simple rule-based mechanism for anomaly detection, and the flow chart is shown in Fig. 2.10. It operates as follows. When the system starts to operate and converges to optimality normally, the anomaly detector keeps a record of the normal z value while keeping monitoring the z value from the algorithm iteration in real-time. Once the absolute difference between the observed z value and the normal z value becomes greater than a preset threshold (component-wise), the anomaly for the corresponding device is recorded. In this work, the thresholds are defined as , , , , , to the change of the recorded normal z value so that the performance of the approach can be evaluated comprehensively.
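The component-wise rule described above can be illustrated with the following Python sketch. It assumes that the threshold is expressed as a fraction of the recorded normal z value; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def rule_based_flags(z_normal: Sequence[float],
                     z_observed: Sequence[float],
                     rel_threshold: float) -> List[bool]:
    """Return one flag per device.

    A flag is True when the absolute change of z for that device exceeds
    rel_threshold times its recorded normal z value.
    """
    if len(z_normal) != len(z_observed):
        raise ValueError("z_normal and z_observed must have the same length")
    return [abs(obs - ref) > rel_threshold * abs(ref)
            for ref, obs in zip(z_normal, z_observed)]


def anomaly_detected(z_normal: Sequence[float],
                     z_observed: Sequence[float],
                     rel_threshold: float) -> bool:
    """Report an anomaly as soon as any device crosses its threshold."""
    return any(rule_based_flags(z_normal, z_observed, rel_threshold))


# Illustrative values only: device 1 increases, devices 2 and 3 decrease.
z_normal = [2.0, 1.5, 1.0]
z_observed = [3.0, 1.2, 0.8]
print(rule_based_flags(z_normal, z_observed, rel_threshold=0.10))
print(anomaly_detected(z_normal, z_normal, rel_threshold=0.10))
```

Because the rule looks only at the converged z values, it records which devices changed but carries no information about why they changed.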
To further demonstrate how we can apply the rule-based approach for anomaly detection, a simple simulation is conducted on the IoT system consisting of three devices. The utility functions for all three devices are reported in Table 2.5, where we assumed that the first device, i.e., device 1, was manipulated by only adding an at a given point during our experiment. Our results are shown in Fig. 2.11. It can be seen that device 1 was manipulated at the 100th iteration, indicated by different cycles highlighted in Fig. 2.11, leading to an increase by (i.e., from 2.02 to 3.93) in DFWF, while device 2 and device 3 reduced their transmission frequencies by and correspondingly. Therefore, by applying a threshold less than to the change of recorded normal values, the rule-based detector can detect the increase of transmission frequency in device 1 and the decrease of transmission frequencies in device 2 and device 3 successfully. Given this, an anomaly will be spotted in this case.
2.6.2 Limitations of the Rule-based Anomaly Detection
Our results in Section 2.6.1 show that a rule-based approach has potential for anomaly detection as long as the manipulation leads to a change of transmission frequency. However, such an approach also has certain limitations when deployed in the real world, which are summarised as follows:
The rule-based approach mainly relies on the optimality criteria without fully leveraging information from the iterative process, and as a result it cannot further distinguish different types of anomalies when a manipulation happens on the edge device.
As we shall see, system parameters, i.e., z, may fluctuate during the optimisation process and that can easily result in misjudgements when using the rule-based approach.
Furthermore, when there are network delays in the IoT network, transmission frequencies of the devices may not change simultaneously, which can also lead to misjudgements when using the rule-based approach.
Due to the uncertainty of a practical running IoT environment, as well as the depth of information that can be leveraged from the collected data for anomaly detection, we are also interested in exploring a data-driven solution that addresses the limitations of the rule-based approach. This solution is introduced in the following section.
2.6.3 IoT anomaly detection with LSTM-based approaches
In this section, deep learning-based approaches are proposed for anomaly detection on the gateway, covering all categories of the anomalies defined in Section 2.6. Our starting point is the observation that an anomaly detector can only access the value of z and the sum of x and u, i.e., v, at every given time point of interest, i.e., sequential data. Inspired by this, we aim to leverage a prevalent sequence-based model, long short-term memory (LSTM) [schmidhuber1997long], which has become one of the popular architectures in anomaly detection [ergen2019unsupervised, lindemann2021survey] and can leverage the information collected during the optimisation process.
Specifically, we apply a basic one-layer LSTM architecture in our model design and compare its detection performance with more complex variants that have been applied in anomaly detection, e.g., bidirectional LSTM (bi-LSTM) [aljbali2020anomaly], stacked-LSTM [thill2019anomaly], LSTM with attention mechanism [xia2021new] (LSTM-attention) and LSTM with encoder techniques [nguyen2021forecasting] (LSTM-encoder), considering that the additional architectural components may improve the detection performance. Let denote the input feature at step t (i.e., the iteration of the ADMM algorithm), then the LSTM network essentially extracts hidden information at each step, , and feeds this in as the input of the next step, . A standard LSTM unit includes a cell, a forget gate, an input gate and an output gate to jointly manage the information flow from input to output. The input feature can be a scalar, a vector or a matrix. In our case, the input feature is represented as a matrix consisting of system parameters of each device, , from iteration to . Here, the input features contain , where . The output of the LSTM model is the categorical label for the anomaly corresponding to the manipulation types as per our definition.
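To make the shape of the input matrix concrete, the following Python sketch builds sliding windows from the recorded z and v values. It is a minimal illustration, assuming that each row of a window concatenates the z and v values of all devices at one ADMM iteration; the function name, the window length and the stride used in the example are hypothetical.

```python
from typing import List, Sequence

Window = List[List[float]]


def build_windows(z_series: Sequence[Sequence[float]],
                  v_series: Sequence[Sequence[float]],
                  window: int,
                  step: int) -> List[Window]:
    """Cut the recorded ADMM trace into overlapping input matrices.

    z_series[t] and v_series[t] hold one value per device at iteration t.
    Each returned window has `window` rows; each row is [z_1..z_N, v_1..v_N].
    """
    if len(z_series) != len(v_series):
        raise ValueError("z_series and v_series must cover the same iterations")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    rows = [list(z) + list(v) for z, v in zip(z_series, v_series)]
    return [rows[start:start + window]
            for start in range(0, len(rows) - window + 1, step)]


# Usage with a synthetic trace: 20 iterations, 3 devices.
z_trace = [[2.0, 1.5, 1.0] for _ in range(20)]
v_trace = [[2.1, 1.4, 1.0] for _ in range(20)]
windows = build_windows(z_trace, v_trace, window=10, step=5)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each window can then be passed to a sequence model, with the label of the window taken from the manipulation type active during those iterations.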
2.7 Experimental setup for anomaly detection
In this section, we first introduce several different types of manipulations, then we discuss the IoT system setup and data generation process. Finally, we present the LSTM network for anomaly detection. The IoT system is set up to transmit the data stream under circumstances where the transmission frequency may be manipulated implicitly. While the data stream is transmitted, the ADMM parameters that reflect the system behaviour are recorded to generate a dataset for training the LSTM model to detect the manipulations.
2.7.1 Setup for manipulations
Utility functions defined on IoT devices may indicate a user's preference in real-world IoT applications. It is worth noting that how to define a user's preference using a utility function is an open issue, as different users may end up having totally different utility values with respect to a given resource, i.e., DFWF in our case. However, in our context, we shall make the assumption that such a function is concave, as it generally reflects the fact that a user's satisfaction level increases when the allocated DFWF increases. With this in mind, we have the following settings:
Manipulation on utility function type and input: The utility function is changed from to (i.e., see Table 2.6) with input factor, resulting in manipulation , labelled as type .
Table 2.6: Utility function set
Manipulation on transmission data size: The data size factor is set as a random value from the set of and the is manipulated as , labelled as type .
Manipulation on utility function input only: In this case the input factor is set as a random value from the set of for the manipulation , which is labelled as type .
Comment: As mentioned in Section 2.6, manipulating system resources can also affect the optimal transmission frequencies for edge devices, but it will be treated as a normal systematic adjustment. Regarding manipulation of system resources, the MWF, , and data storage amount, , are manipulated by adding an MWF factor and storage factor. The factors and are attributed a value from the set of and respectively, ensuring that the manipulated and are positive. Here we have manipulation and which are labelled as normal (type ).
2.7.2 System setup
In general, we consider two different system setups in experiments. One simulates the ideal IoT scenario including an arbitrary number of devices, without considering the effects of network delay. The second simulates a practical IoT environment involving real IoT devices with network delay. In order to compare the performance of the two setups, we manually trigger manipulations and record the manipulation count/type for both systems. However, since it is impossible for a gateway to know the ground truth in a real-world environment, we deploy the model pre-trained on the ideal scenario and evaluate its performance in the practical IoT environment.
More specifically, the simulation system (SS) and real world system (RS) are introduced to validate the performance of the anomaly detector. The SS simulates the ideal scenario in which all devices transmit data to the gateway without network delay. In our experiment, SS includes the virtual edge devices and gateway on a local computer where the data streams can be exchanged even if there is no network environment. The RS simulates the practical IoT application in which all devices transmit data to the gateway and the effect of network delay is considered. In our real-world implementation, this consists of three edge devices (i.e. Raspberry Pis) and a laptop acting as the gateway. Edge devices communicate with the gateway through a wireless router as shown in Fig. 2.3. The key system properties for this practical system are set as and .
It is worth noting that in SS we simulate the ideal scenario and generate data for the purpose of training the anomaly detector. Therefore the data is labelled corresponding to anomalies when devices are manipulated. In RS, we simulate the scenario that edge devices are implemented in a real-world IoT network for daily service. In this context, the data collected from RS is without labels and is used for anomaly detection in real-world applications.
2.7.3 Data generation
The process of data generation can be summarised as follows: during the daily service of the IoT network, the system suffers attacks and transmits a data flow containing unexpected transmission frequencies, and it returns to its normal state once the attack ends. Specifically, at the beginning, SS and RS are running in the normal state. After the ADMM algorithm has converged and remained converged for several ADMM iterations, a type of manipulation happens on the IoT devices and the system reacts by calculating new transmission frequency values. After the anomaly happens and the ADMM algorithm converges under the anomaly, the edge devices return to the normal state and the system repeats the process. The duration of normal states varies between 100 and 120 iterations, while the anomalies last between 50 and 70 iterations. During this cycle, the normal situation is labelled as type and anomalies are labelled as different numeric types. The data (i.e. ADMM parameters) z and v generated by the ADMM algorithm are recorded at each iteration during the interaction between the gateway and the edge devices. Data is fully labelled as either normal (type 0) or anomalous (type 1, 2, 3) and attributed to either SS or RS. Data generated from SS is called the simulation set, while data generated from RS is called the practical set.
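The alternating cycle of normal and anomalous periods can be sketched as a label schedule in Python. This is an illustrative sketch only: it reproduces the stated durations (100 to 120 normal iterations, 50 to 70 anomalous iterations) and the labels 0 to 3, but the function name, the uniform sampling of durations and the random choice of anomaly type are assumptions.

```python
import random
from typing import List, Sequence


def label_schedule(n_cycles: int,
                   anomaly_types: Sequence[int] = (1, 2, 3),
                   seed: int = 0) -> List[int]:
    """Generate one label per ADMM iteration.

    Each cycle is a normal period (label 0, 100-120 iterations) followed by
    a single anomaly period (one label from anomaly_types, 50-70 iterations).
    """
    rng = random.Random(seed)
    labels: List[int] = []
    for _ in range(n_cycles):
        labels.extend([0] * rng.randint(100, 120))
        anomaly = rng.choice(list(anomaly_types))
        labels.extend([anomaly] * rng.randint(50, 70))
    return labels


# Usage: four cycles of normal operation followed by an attack.
labels = label_schedule(n_cycles=4, seed=1)
print(len(labels), sorted(set(labels)))
```

In the experiments the z and v values recorded at each iteration would be paired with these labels; the sketch covers only the timing and labelling of the cycle, not the ADMM iterations themselves.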
Note that anomalies can happen on any device and in this section, we evaluate the anomaly detection based on anomalies occurring on device number one. This considers a reasonable scenario in a real-world IoT network, where a small number of devices (i.e. one device in our system) are attacked while the majority of devices (i.e. the other two devices) are maintained as normal. Fig. 2.12 demonstrates the real-time change of parameter z when anomalies happen on device one. A decrease in the of device one () is accompanied by an increase in the of device two and three ( and ) when the function type and function input are manipulated in SS.
2.7.4 Setup for LSTM-based networks
For the one-layer LSTM architecture, the settings include an input feature length of , resulting in an input size of 610, which consists of , where indicates the IoT device number. The step size is set as 5 and the hidden size of LSTM is set as 100. The bi-LSTM model is established based on this one-layer LSTM architecture with a bidirectional mechanism. The stacked-LSTM is composed by stacking two one-layer LSTM architectures. For LSTM-attention, a multi-head attention mechanism with two heads follows the one-layer LSTM architecture. The number of input units of attention is set as 100, the same as the hidden size of LSTM. For LSTM-encoder, an encoder-decoder based on the one-layer LSTM is established and trained at the first stage. Then the encoder part is used for extracting the hidden feature for detection. The simulation data set is split as follows: for training, for model validation and for simulation testing. Finally, the LSTM-based models are tested using the practical data set. Experiments are repeated ten times for each anomaly type and the mean and standard deviation of prediction accuracy are presented in Table 2.7 for the simulation test and practical prediction.
For the purpose of clear observation, we first investigate the performance of one-layer LSTM separately for each anomaly type then combine all anomaly types to assess general detection ability. Finally, we compare the performances of different variants of LSTM in detecting all anomaly types.
2.8 Detection results and discussion
2.8.1 Anomaly detection on SS
In this section, different anomaly types are detected on SS and the model performance is evaluated. Firstly, when generating data (i.e. ADMM parameters z and v) from the SS, we investigate the scenario that only one specific type of anomaly (manipulation of function input alone) happens repeatedly. Here we should note that different one-layer LSTM models are trained for different scenarios that only consist of a specific type of anomaly, with of the data specified as the training set, of the data for validation and of the data for testing.
As shown in Table 2.7, anomalies caused by manipulating the utility function input only are detected with an accuracy of . Similarly, we investigated the detection performances for the other two anomaly types “Function Type and Input” and “Data size”. Our results show that both anomalies can be detected with relatively high accuracy ( and ) for manipulations of utility function type & input and transmission data size respectively. These separate detection accuracies for specific manipulations reveal that the deep learning based approach is able to extract the individual pattern of each type of manipulation with high accuracy. We note that the detection accuracy for “Data size” is slightly lower than the detection accuracy for other types. The reason might be that the chosen data size factor (in Section 2.7.1) yields a manipulated data size close to the correct data size, which makes the change harder to detect.
|Anomaly types||Simulation||Real world system|
|Function input only||98.14% ± 0.52%||82.84% ± 3.81%|
|Function type and input||99.82% ± 0.01%||93.90% ± 1.52%|
|Data size||93.91% ± 1.00%||92.65% ± 0.85%|
|General (two-class)||98.81% ± 0.38%||96.28% ± 0.89%|
|General (four-class)||92.35% ± 0.84%||78.88% ± 3.80%|
Furthermore, when generating data (z and v) from the SS, we also investigated the scenario that three types of anomalies appear randomly (only one anomaly happens each time but can be any one of the different anomaly types). Here, only one LSTM model is trained for detecting different anomalies using data from the SS, with of the data used as the training set, of the data for validation and of the data for testing, which is consistent with the previous setups.
Both four-class detection (with labels for situations including normality and the different anomalies, respectively) and two-class detection (here, normality and manipulation of system resources are labelled as , and other manipulations are labelled as ) are investigated. The prediction accuracy was found as for four-class anomaly detection and for two-class anomaly detection.
The rule-based detection in the SS is based on thresholds by identifying to which extent the value is changed. Here, the threshold was assumed to be of the optimal transmission frequencies of the IoT devices. Given this setting, Table 2.8 demonstrates the detection results obtained using this approach. Specifically, comparing with Table 2.7, the general (two-class) results show that the LSTM-based detection can easily outperform the rule-based detection method.
|Anomaly types||Simulation||Real world system|
Function input only
Function type and input
2.8.2 Anomaly detection on RS
In order to better represent detection of anomalies in a real-world IoT environment, different types of anomalies are detected using the RS in this section. We recall that the LSTM model is trained based on the simulated data from the SS and will be tested using the data from the RS in this setup.
Our results in Fig. 2.13 indicate the value of parameter for devices 1, 2 and 3 (green, magenta and blue lines, respectively) when the RS system is running normally, and in scenarios when three types of anomalies occur. In comparison to Fig. 2.12, in this case, when an anomaly occurs, the values of parameter for devices 1, 2 and 3 do not change at the same time, which causes the observed misalignments with respect to iterations of the ADMM algorithm. Fig. 2.14 shows the variation of parameter for devices 1, 2 and 3 on RS on a long time scale with misalignments, fluctuations and jumps.
Comparing these results to those obtained for the SS experiments (Table 2.7), it is evident that the accuracy of anomaly detection for “Function Input Only” from the RS () is lower than that from the SS () because of the misalignments between the values of different devices. Similarly, detection of “Function Type and Input” and “Data Size” anomalies in the RS ( and ) had accuracies slightly lower than those presented in the SS simulation results. In addition, four-class detection and two-class detection were also investigated in the RS. As shown in Table 2.7, the general two-class detection achieved the highest accuracy of in the RS, which indicates that the proposed LSTM-based method is promising for real-world IoT networks. However, with misalignments between parameter for different devices, performances from the RS for four-class detection (i.e. an accuracy of ) and for two-class detection (i.e. an accuracy of ) are reduced compared to the performances from the SS (i.e., accuracies of and for four-class and two-class prediction respectively).
Performance on the RS was poorer for the rule-based anomaly detection approach (Table 2.8), which may be due to the misalignments of parameter between different devices. Since the rule-based approach relies on the simultaneous relationship between different transmission frequencies, it can be expected that a larger misalignment leads to poorer performance. The general two-class detection results from the rule-based and LSTM methods are compared against ground truth in Fig. 2.13 in the RS. The detection results from the LSTM method better match the ground truth, while the rule-based method raises false alarms when there are misalignments and fluctuations in the data flow.
In order to provide more details for comparing the performance of the rule-based and LSTM methods, precision, specificity and recall metrics are calculated and shown in Table 2.9. Note that we calculate the metrics for LSTM every time steps as the input length of the LSTM model is taken as in the model settings presented in Section 2.7, while the metrics for the rule-based method are computed in each time step. When the system is running normally, both methods have a high specificity value ( for the LSTM method and for the rule-based method), which means that most of the time neither anomaly detector raises an alarm when the RS is running normally. However, the LSTM method obtains a higher recall value than the rule-based method for anomaly detection ( for the LSTM method and for the rule-based method), indicating that the LSTM method raises an alarm promptly for most malicious manipulations, whereas the rule-based method fails to detect most anomalies. Given the precision values ( for the LSTM method and for the rule-based method), the majority of anomalies identified by the LSTM method are real anomalies and therefore the LSTM method is more acceptable for use in real-world applications.
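For reference, the three metrics can be computed from two-class labels as in the Python sketch below. It uses the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), specificity = TN/(TN+FP)) with anomalies as the positive class; the function name and the example labels are hypothetical and do not reproduce the values in Table 2.9.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall and specificity for two-class detection.

    Label 1 denotes an anomaly (positive class) and label 0 denotes normality.
    A metric whose denominator is zero is reported as 0.0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Usage with illustrative labels: one false alarm and two missed anomalies.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 0, 0]
print(binary_metrics(truth, predicted))
```

A detector that rarely raises alarms can score a high specificity while its recall stays low, which is why the three metrics are read together.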
The results presented in Table 2.7 indicate that the one-layer LSTM detects anomalies more effectively in both the SS and the RS. In Table 2.7, the standard deviations of the detection results reveal that LSTM-based anomaly detection is robust, including in the real-world system (RS). Although the accuracy decreases to with some uncertainty (standard deviation of ) for four-class anomaly detection in the RS, the LSTM method can still obtain stable high performance (accuracy of with standard deviation of ) for two-class anomaly detection. However, when detecting the “Function Input Only” anomaly, the LSTM method has worse performance than the rule-based method. One possible reason for this is that the fluctuations and jumps in data flow shown in Fig. 2.14 cause uncertainty during the training process of LSTM models.
Since additional deep learning components may enhance detection in complicated environments, the one-layer LSTM, bi-LSTM, stacked-LSTM, LSTM-attention and LSTM-encoder architectures are compared in four-class anomaly detection. Fig. 2.15 shows the four-class detection accuracy of the one-layer LSTM, bi-LSTM, stacked-LSTM, LSTM-attention and LSTM-encoder. Each model is trained 10 times with different parameter initialisations and the average detection accuracy and standard deviation are calculated. For detection in the real-world environment, the additional deep learning techniques do not noticeably improve the detection accuracy. On the contrary, the encoder-decoder mechanism degrades the anomaly detector in the simulation environment. Table 2.10 shows the complexity of the different architectures. With comparable inference time, the one-layer LSTM has the smallest number of parameters, which means that it can detect anomalies effectively with fewer computational resources.
No. of model parameters
Simulation inference time (s)
Real world inference time (s)
Table 2.8 reports the detection results from the rule-based model. Detection of the “Data Size” anomaly using the rule-based method has the lowest accuracy when compared to the other types of anomalies. The reason is that a change of transmission frequency for the manipulated device may lead to changes in the same direction in the transmission frequencies of other devices, which prevents the rule-based detection from working effectively. Interestingly, the detection accuracy for “Data Size” in the RS is comparable to that in the SS, which may be largely caused by the misalignments and fluctuations in the RS data flow. However, as expected, the detection accuracies for other types of anomalies in the SS are greater than those in the RS. Finally, we also investigated the impact of thresholds in rule-based anomaly detection (Table 2.11). The threshold plays an important role in anomaly detection, but simply increasing or decreasing the threshold cannot yield better detection performance. On the one hand, a small network disturbance will incorrectly trigger an alarm if a small threshold applies. On the other hand, an anomaly will be ignored if the threshold is too high, because the change of transmission frequency cannot trigger the detector. Therefore, the problem arises of how to select an optimal threshold for anomaly detection in a practical IoT application (in practice, by trial and error), which is another drawback of the rule-based method compared to the LSTM-based approach.
|Thresholds||Real world system|
1% optimal frequency
5% optimal frequency
10% optimal frequency
15% optimal frequency
30% optimal frequency
50% optimal frequency
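The threshold trade-off discussed above can be illustrated with a minimal sketch. It assumes that a device is flagged when its observed transmission frequency deviates from its assigned optimal frequency by more than a relative threshold; the function name `detect_anomalies`, this particular rule and the sample readings are hypothetical and are not the exact rule used in the experiments.

```python
def detect_anomalies(observed, optimal, threshold):
    """Return indices of devices whose relative deviation exceeds the threshold.

    observed, optimal: lists of transmission frequencies (same length).
    threshold: relative tolerance, e.g. 0.05 for "5% optimal frequency".
    """
    flagged = []
    for idx, (obs, opt) in enumerate(zip(observed, optimal)):
        if abs(obs - opt) > threshold * opt:
            flagged.append(idx)
    return flagged


# Hypothetical readings: device 1 shows small network jitter,
# device 2 has a manipulated transmission frequency.
optimal = [10.0, 10.0, 10.0]
observed = [10.0, 10.3, 14.0]

low = detect_anomalies(observed, optimal, 0.01)   # jitter also raises an alarm
mid = detect_anomalies(observed, optimal, 0.10)   # only the manipulated device
high = detect_anomalies(observed, optimal, 0.50)  # the manipulation is missed
```

With these made-up numbers, the smallest threshold produces a false alarm and the largest one misses the manipulated device, which mirrors the two failure modes described in the text.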
In this chapter, we propose a novel transmission frequency management system for IoT edge devices. The system is able to assign the optimal transmission frequency to each IoT device in the network dynamically and to recalculate the optimal transmission frequencies adaptively when a new device connects. Furthermore, we also devise mechanisms for anomaly detection in the system when transmission frequencies may be manipulated in different settings.
Our simulation results show that the proposed system is effective in real-world scenarios, with high accuracy for estimation of transmission frequency in a low-latency () router-based experimental IoT network. Considering that IoT edge devices may suffer attacks which manipulate their transmission frequency and transmit data streams with an incorrect cadence, we use both a mathematical rule-based approach and an LSTM-based approach to detect potential anomalies in transmission frequency. The rule-based approach exposes the internal process during an anomaly event but cannot reliably detect the anomaly in a practical environment. In contrast, the LSTM-based approach shows greater potential for detecting abnormal transmission frequency in both simulation and real-world environments.
Most recently, there has been increasing interest in adopting bike-sharing schemes globally, as these schemes can be seen as effective tools for addressing global challenges such as improving sustainability in transportation (e.g., reducing commuting cost and air pollution [otero2018health]). One of the key requirements of a bike-sharing system is a good balance between supply and demand in the bike-sharing network [raviv2013optimal]. In general, the relocation of bikes ensures the balance between supply and demand, but the uncertainty of departures and arrivals among different bike stations makes bike relocation hard to execute precisely. Therefore, accurately forecasting the availability of bikes at a given time and station becomes increasingly important.
Recently, convolutional neural networks (CNN) have been applied to extract the relationship between adjacent traffic networks, whilst recurrent neural networks (RNN) have been used to capture the temporal information. For short-term traffic prediction, fully connected long short-term memory (LSTM) [shi2015convolutional] and CLTFP [wu2016short], two architectures that mix long short-term memory networks with convolutional operations, were proposed in order to capture both temporal and spatial cues. However, LSTM and other networks with a recurrent architecture are computationally intensive. It is also harder for their parameters to converge to globally optimal values, since the recursive training process accumulates error. On the other hand, CNN-based methods also have their limitations, since convolution is restricted to processing data in 2-D form, which may not be the natural structure of traffic data.
The above issues of CNN and RNN-based methods were investigated and addressed by the spatial-temporal graph convolutional network (ST-GCN) [DBLP:conf/ijcai/YuYZ18], a variant of the graph neural network (GNN) for utilizing spatial information. Spatial-temporal convolutional blocks were introduced and applied repeatedly in this architecture, combining several graph convolutional layers [DBLP:conf/nips/DefferrardBV16] with sequential convolution in order to represent the spatial-temporal relations. Following this approach, STG2Seq [bai2019stg2seq], a sequence-to-sequence variant of ST-GCN, was proposed for multi-step passenger demand forecasting, with a stronger reliance on historical data and an attention module. However, some important questions about the ST-GCN architecture remain open. For instance, how effectively can a specific adjacency matrix scheme contribute to traffic demand prediction? And to what extent can an attention-based mechanism further improve the accuracy of a given demand prediction model?
To answer these questions, our key objective in this chapter is to investigate how ST-GCN, supplemented with an attention-based mechanism, can further enhance the performance of bike availability prediction across different bike stations in cities. From an application/service perspective, we believe the proposed method can help cyclists make more appropriate personalised travel plans by finding the best nearby bike station with high confidence in availability. The contributions of this chapter can be summarised as follows:
1. We combine an attention mechanism with the ST-GCN, namely AST-GCN, to improve the ability to extract spatial-temporal features for the prediction task. In comparison with the existing methods, our model shows a promising performance.
2. We review related works in the recent literature and summarise four categories for modelling adjacency matrices, namely spatial based, temporal based, spatial-temporal based and adaptive based adjacency matrices.
3. Given our findings in 1 and 2, we evaluate our proposed AST-GCN model with the adjacency matrices of interest using a real-world dataset, Dublinbike, for bike sharing availability prediction. Our results show that: (a) the adaptive spatial-temporal adjacency matrix achieves the best performance; (b) the spatial-temporal based adjacency matrix achieves better results than using only a spatial-based or a temporal-based adjacency matrix; (c) the spatial-based adjacency matrix achieves similar performance to the temporal-based one.
The rest of the chapter is organised as follows. We introduce some previous research related to traffic demand prediction in section 3.2 and formulate our problem in section 3.3. Experimental setups are demonstrated in section 3.4 and the results are discussed in section 3.5. Finally, we summarise our work in section 3.6.
3.2 Related Work
3.2.1 Existing Methods
In general, forecasting traffic demand is difficult because the demand depends not only on the historical demand pattern of the target area (e.g., suburb) but also on the pattern of other areas (e.g., urban). To meet this challenge, many studies using deep learning methods such as CNN, RNN and GNN have been proposed.
As the traditional convolutional operation in CNN processes data in a 2D manner, the layout of a city is geographically divided into square blocks in order to extract spatial relationships from all regions [zhang2017deep], nearest regions [yao2018deep] or in other 2D forms [chu2020passenger]. RNN based methods and their variants [yao2019revisiting] are applied to capture temporal correlation, for instance by structuring the historical traffic demand sequence for each region [shi2015convolutional] or by presenting it as a 1D feature-level fused architecture [wu2016short]. GNN based methods, with natural advantages in utilizing spatial information, model the traffic network as a general graph instead of partitioning the traffic data arbitrarily (e.g., into grids and segments) as in CNN and RNN methods. GCN, a variant of GNN which is able to combine spatial and temporal information, is widely used for traffic demand prediction, as seen in many recent works [DBLP:conf/nips/DefferrardBV16] [bai2019stg2seq] [DBLP:conf/ijcai/YuYZ18].
Attention is a popular technique in deep learning that mimics physiological cognitive attention. It enhances the importance of small parts of the input data and de-emphasises the rest. This technique has been used to enhance the prediction performance of GNNs on many sequence-based tasks, e.g. graph attention networks [velivckovic2017graph]. In traffic demand prediction, the importance of each previous step to the target demand is different, and this influence changes with time. For instance, a temporal attention mechanism [bai2019stg2seq] adds an importance score to each historical time step to measure its influence, and this strategy can effectively improve the prediction accuracy.
3.2.2 Adjacency Matrices
An adjacency matrix indicates whether a pair of vertices in a graph is connected by an edge. For a traffic network, it is important to understand how an adjacency matrix can best capture the interconnectivity between different nodes in the graph. To the best of our knowledge, four types of adjacency matrices have been investigated in previous research works, namely spatial (S), temporal (T), spatial-temporal (ST) and adaptive (A). A spatial adjacency matrix is usually distance-based. Euclidean distances between different stations (i.e., nodes in the graph) [DBLP:conf/ijcai/YuYZ18] [chen2020multitask] or the natural geographical distance [kim2019graph] are usually used as weights for its entries. For instance, a shorter geographical distance between two stations may indicate a stronger connection in the graph. A temporal adjacency matrix can be defined based on a similarity score [bai2019stg2seq] (e.g., the Pearson correlation coefficient) between the temporal information (i.e., the historical traffic demand sequences) of each pair of nodes/stations. For example, a larger Pearson coefficient calculated from the time series of the number of available bikes at two stations may indicate a stronger connection in the graph between these two stations compared to other pairs. To combine the benefits of both spatial and temporal features, a spatial-temporal embedding (ST embedding) can be generated for each node in a graph [ye2020coupled]. However, in such a scenario it can be hard to describe the adjacency matrix intuitively with the high-dimensional embedding features, and thus the adjacency matrix needs to be adaptively defined along with the training process of the GCN [wu2019graph] [chiang2019cluster].
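The spatial and temporal adjacency matrices described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the Gaussian kernel used to turn distances into weights, the removal of self-loops, the function names and the toy coordinates and series are all hypothetical choices, not the exact constructions of the cited works.

```python
import numpy as np


def spatial_adjacency(coords, sigma):
    """Gaussian-kernel weights from pairwise Euclidean distances.

    coords: array of shape (num_stations, 2).
    A shorter distance gives a weight closer to 1 (assumed kernel form).
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    adj = np.exp(-(dist ** 2) / (sigma ** 2))
    np.fill_diagonal(adj, 0.0)  # no self-loops in this sketch
    return adj


def temporal_adjacency(series):
    """Pearson correlation between the time series of each station pair.

    series: array of shape (num_stations, num_time_steps).
    """
    adj = np.corrcoef(series)  # each row is treated as one variable
    np.fill_diagonal(adj, 0.0)
    return adj


# Toy data: three stations with made-up coordinates and bike counts.
coords = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0]])
series = np.array([[1.0, 2.0, 3.0, 4.0],
                   [2.0, 4.0, 6.0, 8.0],
                   [4.0, 3.0, 2.0, 1.0]])

S = spatial_adjacency(coords, sigma=1.0)
T = temporal_adjacency(series)
```

In this toy case stations 0 and 1 are both the closest pair and perfectly correlated, so they receive the largest spatial and temporal weights, whereas station 2 moves in the opposite direction and obtains a negative Pearson score.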
3.3.1 Notations and Problem Statement
We consider a scenario where bike stations are included as part of a bike-sharing system. Let $\mathcal{N} = \{1, 2, \dots, N\}$ be the set for indexing the bike stations in the system. For a given bike station $i \in \mathcal{N}$, let $x_i(t)$ be the number of available bikes at the station at time $t$. We denote by $X(t) = [x_1(t), \dots, x_N(t)]^{\top}$ the vector consisting of the number of available bikes across all stations at time $t$. In addition, each bike station is associated with a set of features for model training, e.g. weather condition, day of week, etc., and we let $f_i(t) \in \mathbb{R}^{M}$ represent the values of its features at time $t$, where $M$ is the number of features used. Similarly, we let $F(t) = [f_1(t), \dots, f_N(t)]$ be the feature values of all bike stations at time $t$. Given the notation above, our learning objective is to find a function $h(\cdot)$ which is able to address the following problem:
$$\big[X(t-P+1{:}t),\; F(t-P+1{:}t)\big] \xrightarrow{\;h(\cdot)\;} X(t+1{:}t+Q),$$
where $P$ and $Q$ denote the input and output length of the model respectively. Also, the notation $X(t+1{:}t+Q)$ represents the output as a sequence of vectors from step $t+1$ to step $t+Q$.
3.3.2 Attention-based ST-GCN
In this section, we introduce the attention-based ST-GCN architecture that is used for solving our bike sharing availability prediction problem. We note that the ST-GCN architecture has been presented in [DBLP:conf/ijcai/YuYZ18], and that it consists of two identical ST-Conv-Blocks and a fully connected output layer. Specifically, an ST-Conv-Block consists of two temporal gated convolutional (TGC) layers and one spatial graph convolutional (SGC) layer, which are the essential modules of ST-GCN. In general, TGC is in charge of extracting temporal features and SGC extracts spatial features from the data. However, since there is no attention on the temporal channel of ST-GCN, the performance on sequence-to-sequence learning tasks can degrade significantly; in other words, the model’s learning capability may be reduced due to a “loss of focus”. To deal with this issue, we introduce a temporal-attention module (TAM) in each ST-Conv-Block, as shown in Fig. 3.1, where the temporal-attention module is depicted in green.
Remark: An attention mechanism was introduced in [shiraki2020spatial] and [zhang2020sta] to extract both spatial and temporal information from ST-GCN networks. The architectures proposed in both works applied attention operations to extract spatial and temporal information separately. In particular, the model in [shiraki2020spatial] consisted of 15 ST-Conv blocks in total with two attention matrices calculated from them, while the model in [zhang2020sta] stacked 10 ST-Conv blocks with two attention matrices computed from each ST-Conv block. Given the increased model complexity and computation cost, stacking multiple ST-Conv blocks with attention matrices calculated separately may be of less interest, since the spatial and temporal information may not be combined into an effective spatial-temporal embedding in such a case.
Instead, our model only consists of 2 ST-Conv blocks, and the proposed AST-GCN architecture merges spatial-temporal information with attention in a lightweight manner by calculating the attention matrix only once in each ST-Conv block, which reduces the computation cost during model training. Specifically, the first TGC module generates the original temporal information and the last TGC module generates spatial-temporal information (as it takes the output of the preceding SGC layer as its input). After passing through two average 3D pooling layers, the temporal and spatial-temporal information are combined before a ReLU activation function is applied. A sigmoid function then generates probabilistic weights (the attention matrix) with values between 0 and 1. With this matrix in place, the attention-based temporal information is generated by multiplying it with the output of the first TGC layer, and the result is concatenated as input to the subsequent ST-Conv block. In this way, both spatial and temporal information in the data flow are captured before being passed to the dense layer for sequential output prediction.
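The data flow of the temporal-attention module can be sketched in NumPy as below. This is only a forward-pass illustration under several assumptions that the text does not fix: the two TGC outputs are taken to have the same shape `(batch, channels, time, stations)`, average pooling is taken over the channel and station axes, the two pooled tensors are combined by addition, and the final product is element-wise with broadcasting. The function name `temporal_attention` is hypothetical.

```python
import numpy as np


def temporal_attention(temporal_feat, st_feat):
    """Re-weight temporal features with attention over time steps.

    temporal_feat: output of the first TGC layer.
    st_feat: output of the last TGC layer (spatial-temporal information).
    Both arrays are assumed to have shape (batch, channels, time, stations).
    """
    # Average pooling over channels and stations leaves one value per time step.
    pooled_t = temporal_feat.mean(axis=(1, 3), keepdims=True)
    pooled_st = st_feat.mean(axis=(1, 3), keepdims=True)
    combined = np.maximum(pooled_t + pooled_st, 0.0)  # combine, then ReLU
    weights = 1.0 / (1.0 + np.exp(-combined))         # sigmoid -> attention matrix
    attended = temporal_feat * weights                # broadcast over channels/stations
    return attended, weights


rng = np.random.default_rng(0)
first_tgc = rng.normal(size=(2, 4, 12, 5))  # hypothetical sizes
last_tgc = rng.normal(size=(2, 4, 12, 5))
attended, weights = temporal_attention(first_tgc, last_tgc)
```

One consequence of applying the sigmoid directly after a ReLU, as assumed here, is that every attention weight is at least 0.5; an actual implementation may place additional layers in between.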
3.4 Algorithms and Experiments
In this section, we discuss the different configurations investigated for comparative studies.
3.4.1 Experimental Datasets
Dublinbike: DublinBikes is a bike-sharing scheme in operation in Dublin City, Ireland. The system is illustrated in Fig. 3.2, where each node is a bike station and each blue number in the circle indicates the number of available bikes in real time. Real-time data is accessible through an API, and we also have access to historical data, recorded every five minutes, which includes timestamps, station states, number of available bikes, station locations, etc. We choose the data (https://data.smartdublin.ie/dataset/analyze/33ec9fe2-4957-4e9a-ab55-c5e917c7a9ab) from 01/07/2020 to 01/10/2020 for our studies.
NYC-Bikes[2016DNN]: This dataset includes the NYC Citi daily bike orders of people using the bike sharing scheme. We choose the transaction records from April 1st, 2016 to June 30th, 2016 (91 days). This contains the following information: bike pickup station, bike drop-off station, bike pick-up time, bike drop-off time and trip duration.
Visualcrossing Weather Data (https://www.visualcrossing.com/weather-data): This dataset provides weather conditions at different locations at different historical time points, including temperature, humidity, wind speed, etc. This weather dataset has been integrated for experiments that use the Dublinbikes dataset.
3.4.2 Experimental Setup
Dublinbike: For this scenario, we use the number of available bikes at each bike station over the previous 3 hours to predict the number of available bikes at each station up to 45 minutes ahead, where each data point is the number of available bikes averaged over 15 minutes. This implies that we take the past 12 consecutive observation points to predict the following 3 points of interest. The dataset consists of 110 bike stations in total. The data is then separated sequentially into a training set (60%), a validation set (20%) and a testing set (20%).
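The windowing described above can be sketched as follows. This is a minimal illustration that assumes the 5-minute readings are averaged in non-overlapping groups of three and that the sequential 60/20/20 split is applied to the resulting windows; the function names and the synthetic array standing in for the real data are hypothetical.

```python
import numpy as np


def average_to_15_min(five_min_counts):
    """Average every three consecutive 5-minute readings.

    five_min_counts: array of shape (num_readings, num_stations).
    """
    usable = (five_min_counts.shape[0] // 3) * 3
    trimmed = five_min_counts[:usable]
    return trimmed.reshape(-1, 3, trimmed.shape[1]).mean(axis=1)


def make_windows(series, input_len=12, output_len=3):
    """Pair each run of 12 past points with the 3 points that follow it."""
    inputs, targets = [], []
    last_start = series.shape[0] - input_len - output_len
    for start in range(last_start + 1):
        split = start + input_len
        inputs.append(series[start:split])
        targets.append(series[split:split + output_len])
    return np.stack(inputs), np.stack(targets)


def sequential_split(inputs, targets, train=0.6, val=0.2):
    """Split the windows in time order into train / validation / test parts."""
    n = inputs.shape[0]
    i, j = int(n * train), int(n * (train + val))
    return ((inputs[:i], targets[:i]),
            (inputs[i:j], targets[i:j]),
            (inputs[j:], targets[j:]))


# Synthetic stand-in for the real data: 300 five-minute readings, 110 stations.
raw = np.arange(300 * 110, dtype=float).reshape(300, 110)
series = average_to_15_min(raw)   # 100 fifteen-minute points per station
x, y = make_windows(series)       # 100 - 12 - 3 + 1 = 86 windows
train_set, val_set, test_set = sequential_split(x, y)
```

Keeping the split in time order, rather than shuffling, avoids leaking future observations into the training set.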
NYC-Bikes: NYC Citi Bike is dock-based and every depot of bikes is considered as a station. Following the same experiment setup as in CCRNN [ye2020coupled], we filter out the stations with fewer orders and keep the 250 stations with the most orders. The time step is set to half an hour. Among the last four weeks considered, the first two are used for validation, and the last two are for testing.
To evaluate the performance across different models, Mean Absolute Error (MAE) has been selected as the performance metric, indicating an intuitive margin between the predicted and the true amount of available bikes at each station.
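For reference, the MAE metric can be computed as in the short sketch below, both over all predictions and per station. The array layout `(samples, steps, stations)` and the toy numbers are assumptions made for illustration.

```python
import numpy as np


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and true bike counts."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.abs(predicted - actual).mean())


def station_wise_mae(predicted, actual):
    """MAE per station for arrays of shape (samples, steps, stations)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.abs(predicted - actual).mean(axis=(0, 1))


# Toy example: 2 samples, 1 prediction step, 2 stations.
predicted = [[[4, 10]], [[7, 2]]]
actual = [[[5, 8]], [[7, 3]]]

overall = mean_absolute_error(predicted, actual)
per_station = station_wise_mae(predicted, actual)
```

Here the absolute errors are 1, 2, 0 and 1 bikes, so the overall MAE is 1.0 bike, with 0.5 for the first station and 1.5 for the second.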
3.4.3 Baseline Algorithms
Dublinbike: To the best of our knowledge, no GNN based methods have been implemented for the Dublinbike dataset. In particular, no ST-GCN based methods have been applied to the prediction task on this dataset. For comparative studies, we conduct the experiments and use ST-GCN [DBLP:conf/ijcai/YuYZ18] as our baseline.
NYC-Bikes: Many methods have been reported using this dataset to predict traffic demand. The state-of-the-art work is presented in CCRNN [ye2020coupled]. Based on this, we compare the performance of different methods, including our proposed model, in a similar experimental setting. Specifically, the following methods are compared: (a) HA (the average of historical values at previous time steps of a fixed length); (b) XGBoost [xgboost2016]; (c) FC-LSTM [lstm1997]; (d) DCRNN [2017arXiv170701926L]; (e) ST-GCN [DBLP:conf/ijcai/YuYZ18]; (f) STG2Seq [bai2019stg2seq]; (g) GraphWaveNet [wu2019graph] and (h) CCRNN [ye2020coupled].
3.4.4 Network Setup
The historical data length used for both the Dublinbikes dataset and the NYC-citi dataset is set to 12, and the prediction length is set to 3 for Dublinbikes and 12 for NYC-citi. The feature dimension used for NYC-citi is 2, representing the pick-up and drop-off demand. The feature dimension used for the Dublinbikes dataset is 8; details of the feature selection are discussed in the results section. All models are optimised by the Adam algorithm [kingma2017adam]. Other parameter settings are presented in Table 3.1. The dimensions of the data flow during the training process of the proposed model are overlaid on Fig. 3.1 for illustration purposes. It is worth noting that the input of the first temporal gated-Conv is exactly the input of the corresponding ST-Conv block, while the input of the second temporal gated-Conv is the output of the preceding spatial graph-Conv layer. The concatenate operation concatenates the outputs of the first and the second temporal gated-Conv.
|Historical data length||12||12|
|Initial learning rates||0.001||0.0001|
|Optimiser||Adam algorithm||Adam algorithm|
|LR adjustment strategy||cosine annealing||adjust at equal intervals|
3.4.5 Adjacency Matrix Setup
The adjacency matrix in the original ST-GCN architecture is not adjustable/trainable. As a result, this fixed adjacency matrix may not fully capture the spatial relationship between nodes in the graph. To improve it, we adapt the fixed adjacency matrix to a trainable adjacency matrix and then initialise the matrix using meaningful contextual information, e.g. distance between nodes, similarity between stations’ historical time-series data. Further, an adaptive adjacency matrix (AAM) is able to extract spatial attention information from the graph adaptively, and thus it makes the AST-GCN effective in capturing both spatial and temporal attention information. For our comparative studies, different setups of adjacency matrices are investigated as follows:
For the implementation of the adjacency matrix proposed in ST-GCN [DBLP:conf/ijcai/YuYZ18], the sigma is set to 0.2 and the epsilon is set to 0.368;
For the implementation of the adjacency matrix proposed in STG2Seq[bai2019stg2seq], the sigma is set to 0.05;
For the implementation of the adjacency matrix proposed in CCRNN [ye2020coupled], the dimension of the station feature is set to 20 and the sigma is set to 1.
Other adjacency matrices do not need parameters to be set. In other words, these adjacency matrices are calculated directly without parameters or are purely adaptive.
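To make the idea of a purely adaptive adjacency matrix concrete, the sketch below builds one from two node-embedding matrices as softmax(ReLU(E1 E2^T)), the form used in Graph WaveNet [wu2019graph]. Only the forward computation is shown; in training, the embeddings would be learnable parameters updated together with the rest of the network. The embedding sizes and the function name are hypothetical.

```python
import numpy as np


def adaptive_adjacency(source_emb, target_emb):
    """Row-normalised adjacency from two node-embedding matrices.

    Computes softmax(relu(E1 @ E2.T)) row by row.
    source_emb, target_emb: arrays of shape (num_stations, emb_dim).
    """
    scores = np.maximum(source_emb @ target_emb.T, 0.0)  # ReLU
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)  # softmax over each row


rng = np.random.default_rng(0)
num_stations, emb_dim = 5, 4  # hypothetical sizes
e1 = rng.normal(size=(num_stations, emb_dim))
e2 = rng.normal(size=(num_stations, emb_dim))
adj = adaptive_adjacency(e1, e2)
```

Because the embeddings can be initialised randomly or from contextual information, such as distances between nodes, the same construction covers both the embedding-based and the distance-initialised adaptive settings at a conceptual level.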
3.5 Results and Discussion
3.5.1 Feature Selection for Dublinbikes Dataset
In order to select the best features for our experiments, an ablation study has been carried out for a set of features which model temporal, spatial as well as weather characteristics. Specifically, we adopt ST-GCN as our basic setting for evaluation of different feature combinations. Our full feature sets are as follows: (1) number of available bikes (AB); (2) time of day (TD); (3) day of week (WD); (4) weather condition description (WCD); (5) temperature (T); (6) wind speed (WS); (7) cloud coverage (CC) and (8) Humidity (H).
Results of the ablation study are reported in Table 3.2, from which we conclude that the following feature combination gives the best performance: number of available bikes (AB), time of day (TD), day of week (WD) and weather condition description (WCD).
3.5.2 Results Discussion
NYC-Bikes: Table 3.3 compares the results of the proposed AST-GCN on the NYC dataset with those of the existing algorithms reported in [ye2020coupled]. The AST-GCN algorithm outperforms the existing graph based architectures (i.e., ST-GCN and STG2Seq), with a 24.67% improvement in MAE over STG2Seq, from 2.4976 to 1.8815. Although CCRNN beats all of its competitors, the difference between CCRNN and AST-GCN is minor, and AST-GCN achieves comparable results to other sequence based models, including Graph WaveNet and DCRNN.
|Model||MAE|
|HA||3.4617|
|ST-GCN||2.7605|
|STG2Seq||2.4976|
|XGBoost||2.4690|
|FC-LSTM||2.3026|
|Graph WaveNet||1.9911|
|DCRNN||1.8954|
|AST-GCN + EAAM||1.8815|
|CCRNN||1.7404|
Table 3.3: Experiment result of AST-GCN on NYC-citi [ye2020coupled]
Dublinbikes: As shown in Table 3.4, after applying the distance initialised AAM (DIAAM) to ST-GCN, the prediction achieves better results, with an MAE of 1.27. By replacing ST-GCN with AST-GCN, the MAE is further improved from 1.27 to 1.04. Among all settings, the embedding AAM (EAAM) gives the best performance, with an MAE of 1.00. Results in Fig. 3.3 further highlight this key finding. Specifically, the biases between the ground truth and the prediction at the first timestamp (i.e. the prediction of the number of available bikes (NAB) for the first 15 minutes) as well as at the third timestamp (i.e. the NAB prediction for 45 minutes ahead) are both negligible, showing that, in the best case scenario, our proposed model can achieve strong prediction performance at both the shorter (15 mins) and the longer (45 mins) horizon.
The abbreviations in the Categories column have been presented in Section 3.2.2.
|Model||Categories||MAE (%)|
|ST-GCN + Euclidean distance||S||1.36 (0%)|
|ST-GCN + DIAAM||S + A||1.27 (-6.67%)|
|AST-GCN + DIAAM||S + A||1.04 (-23.5%)|
|AST-GCN + EAAM [wu2019graph]||ST + A||1.00 (-26.5%)|
AST-GCN + Euclidean distance [DBLP:conf/ijcai/YuYZ18]
AST-GCN + Geographical distance [kim2019graph]
AST-GCN + Temporal correlation [bai2019stg2seq]
AST-GCN + ST embedding [ye2020coupled]
3.5.3 Performance Evaluation w.r.t. Adjacency Matrices
In this section, we discuss how different adjacency matrices can affect the learning performance of our proposed AST-GCN architecture. Our results are reported in Table 3.4, where the percentage in parentheses shows the difference of the achieved MAE in comparison to the basic setting: ST-GCN + Euclidean distance. Unsurprisingly, our results show that the fixed adjacency matrices, both spatial based and temporal based, achieve the worst results among all settings. In contrast, the adaptive-based settings generally achieve better results than the fixed types, with one exception: the spatial-temporal based setting, i.e. AST-GCN + ST embedding, also shows a competitive result. Among the adaptive-based settings, the embedding AAM, i.e. AST-GCN + EAAM, achieves a better result than the AAM setting initialised by distance, i.e. AST-GCN + DIAAM.
3.5.4 Performance Evaluation w.r.t. Different Bike Stations
In this section, we present the prediction results for each bike station in the Dublinbike dataset using the best trained model (AST-GCN + EAAM). Our objective here is to illustrate the confidence with which users can rely on the proposed prediction model when deciding to pick up a bike nearby. Our station-wise results are illustrated in Fig. 3.4 and Fig. 3.5. Specifically, Fig. 3.4 shows the heat-map of station-wise MAE over the geographical map of Dublin city where the bike stations are located. The red marks indicate a higher MAE and the blue-green marks indicate a lower MAE in the corresponding area. Generally speaking, the results demonstrate that the prediction is more accurate (low MAE values) outside of the city centre, showing that users there can collect bikes with high confidence in their availability. The highest prediction error occurs in the heart of the city centre, i.e. at the bike station located at “Princes Street/O’Connell Street”, with an MAE of 2.4. This may be caused by the frequent access and return of bikes by users in this central commuting area, leading to a relatively higher uncertainty in bike availability. The second highest prediction error appears in the western part of the city, i.e. the green-blue region indicated in the rectangular box in Fig. 3.4. However, this is mainly due to an aggregated effect, as a few bike stations are very close to each other in the “Benburb Street” area. An in-depth view of the region, shown in the upper left corner of the rectangular box in Fig. 3.4, further validates that the prediction error of each bike station therein is low. Another reason for the relatively high prediction error in the “Benburb Street” area may be the train arrivals at Heuston Station: the frequent access and return of shared bikes by train passengers may make it challenging for the model to predict the availability of bikes.
Finally, the histogram of the station-wise MAE is illustrated in Fig. 3.5, showing that most bike stations have an MAE of less than 1.5 bikes, which indicates that our proposed forecasting system is accurate for the majority of bike stations in Dublin city.
In this chapter, we propose a spatial-temporal graph convolutional network architecture embedded with a temporal-attention module (AST-GCN) to predict the number of available bikes in bike-sharing systems using realistic datasets. The temporal attention module extracts temporal attention information, which aims to enhance the prediction accuracy compared to that of the original ST-GCN architecture reported in [DBLP:conf/ijcai/YuYZ18]. Our experimental results show that the proposed AST-GCN performs better than most existing methods on the NYC-Citi dataset. As for the Dublinbikes dataset, our proposed model achieves a promising MAE of 1.00, the selected performance metric. In addition, we have investigated how different ways of modelling the adjacency matrices affect the overall model performance through a comparative study on the DublinBikes dataset. Current results show that the embedding AAM achieves the best results compared to the other settings.
To conclude, we believe that the work presented in this chapter is an important step towards making bike sharing systems more efficient thanks to ST-GCN enabled techniques. An in-depth exploration of different adjacency matrices reveals that the embedding adaptive adjacency matrix achieves the best performance in this work.
With the growing population in modern cities, traffic and transportation systems are becoming the most important infrastructure, supporting citizens in their daily commuting and travelling. Among the components of the traffic system, highway traffic networks provide the most efficient way to commute between different parts of cities, with a lower chance of traffic jams. The expanding use of highway traffic networks inevitably introduces new challenging problems of traffic management, such as the concern of safe driving, i.e. avoiding severe collisions caused by sudden unexpected acceleration, braking and lane changes to which the surrounding vehicles cannot react promptly. Given this background, Intelligent Transportation Systems (ITS) play an important role in solving traffic problems and ensuring traffic safety with fewer fatal traffic accidents [calibaba2017road]. For instance, with the successful application of computer vision and network communication, such as a camera monitoring system, it is easy to track moving vehicles with image processing techniques, and various information (e.g., speed, number of vehicles on the road) is then accessible via the appropriate application programming interfaces (APIs) [mejia2021vehicle] [nam2020deep]. With such information, variable speed limits and real-time speed advisory systems have been proposed to alleviate traffic congestion and maximise the utility of highway traffic networks in various aspects [kuvsic2020extended, 7350149, liu2021mpc]. For instance, Fig. 4.1 demonstrates the speed advisory system (SAS) with variable speed limits deployed in Dublin city, which is able to recommend optimal speeds for each lane on the M50 highway traffic network (https://www.rod.ie/projects/enhancing-motorway-operation-services).
However, even if the SAS has been applied to govern driving speed and reduce the chances of traffic accidents [li2013impacts], drivers may unconsciously drive with different driving intentions (e.g., acceleration, lane changing) if they drive freely without traffic restriction [jeon2014effects]. For instance, Fig. 4.2 illustrates the framework of the M50 highway traffic network, where the SAS is implemented on the segment marked in green. Once vehicles leave the segment covered by the SAS (in green), they are more likely to change lanes freely and accidents may happen (in red).
Although current work has shown the effectiveness of detecting lane changes in transportation systems using HMM [li2016lane] and LSTM based methods [tang2020driver, 8813987], these methods cannot sufficiently leverage the natural geographical information (e.g., the connection between lanes). The key difference between our work and theirs is that we detect the intention of lane changing based on GNN, in which the graph modelling can extract the spatial information between lanes and boost the detection performance. Existing works related to detecting lane changing behaviours focus on vehicle-level detection [mandalia2005using]. These works forecast whether a specific vehicle intends to change lane while driving on the road, in order to avoid potential collisions. In this work, given the background that vehicles are driving at the recommended speed on the highway traffic network, we detect lane changing intentions using information collected at road level rather than from individual vehicles, to indicate how chaotic the current road network is, such that different levels of traffic intervention can be applied. Regarding traffic network modelling, previous works model the highway traffic network as a graph with the junctions as nodes and the roads as edges. We model the highway traffic network as a graph with the lanes as nodes and the connectivity between lanes as edges, to extract graph features with lane changing information, which will be discussed in section 4.3.3. The main contributions of this chapter include:
We evaluate the performance of lane changing detection against different temporal segments, to investigate the efficiency of the detection algorithm. Results show that our method can detect lane changing intention within 90 seconds with higher accuracy than the HMM-based [li2016lane] and LSTM-based [tang2020driver] methods, and can therefore raise an alarm promptly in real-world applications.
We apply a temporal graph convolutional network with an attention mechanism to leverage the temporal information for accurate detection. In comparison with the temporal convolutional neural network (TCNN), the attention temporal graph convolutional network (ATGCN) shows advantages in real-world applications.
Finally, for the purpose of interpreting our trained model, we calculate the standard deviation and spectral information divergence for the input features, to evaluate the contributions that the features make to the model.
The remaining parts of the chapter are organised as follows. We introduce the speed advisory system (SAS) on highway traffic networks as the background of this work and review deep learning based detection for traffic flow in section 4.2. The experiment design, data processing and neural network architecture are presented in section 4.3. Experimental results and further details regarding the results are discussed in section 4.4. Finally, we summarise this work in section 4.5.
4.2 Related works
4.2.1 Speed advisory system
With the development of ITS and vehicle-to-vehicle/infrastructure (V2X) technologies, Intelligent Speed Advisory (ISA) systems have shown their capability to improve driving safety in various traffic scenarios [hounsell2009review, tradisauskas2009map, gu2018design, chen2021intelligent, liu2015topics]. In highway traffic networks, in addition to driving safety, driving at the suggested speed has benefits such as minimising emissions, energy consumption and health risks [7350149, gu2018design]. With this in mind, road operators and transportation departments can continuously monitor the speed of vehicles with the help of an intelligent camera-based platform [mejia2021vehicle] to ensure that drivers follow the recommendation of the speed advisory system.
4.2.2 Deep learning based traffic flow analysis
A large body of work in the literature uses deep learning methods for traffic flow analysis. Most recently, deep belief networks [huang2014deep, lv2014traffic] and recurrent neural network (RNN) based approaches [tian2018lstm] have been implemented to analyse sequential traffic flow data by leveraging long-term temporal dependencies. Working jointly with sequential deep learning models, and by segmenting the city into multiple areas and grids, CNN architectures with temporal units have been devised to access both spatial and temporal information, where the traffic flow is processed into sequential 2-D data [ma2017learning]. However, the above methods share a common limitation for traffic flow analysis, since they neglect the natural non-Euclidean property (e.g., graph structure) of road networks.
In general, traffic networks are naturally represented in graph format, where the roads are natural edges and connections between roads act as nodes. In order to overcome the significant limitation of the previously mentioned deep learning methods in traffic flow analysis, graph neural networks (GNNs) are applied as an ideal approach to data analysis on traffic networks, since spatial dependencies between different nodes are represented in the graph structure. With graph features as input, variants of GNN architectures have been proposed as the state-of-the-art approaches and have obtained promising performance in various detection scenarios [wu2020comprehensive]. For instance, Diffusion Convolutional Recurrent Neural Network (DCRNN) [li2018dcrnn_traffic], Graph WaveNet [wu2019graph] and the spatial-temporal graph convolutional network (STGCN) [DBLP:conf/ijcai/YuYZ18] have been designed to leverage spatial-temporal information and improve the traditional GNN architecture, which can boost the performance of data analysis in highway traffic networks. Tanwi et al. refined the DCRNN to transfer the common spatial-temporal information between cities with similar geographical structure to improve the detection performance. Yu et al. [DBLP:conf/ijcai/YuYZ18] proposed STGCN to leverage the spatial and temporal dependencies between different areas of a city, to improve the performance of traffic demand forecasting.
4.3.1 Simulation & Experiment Design
In this section, the traffic flow influenced by different driving intentions is simulated using SUMO [behrisch2011sumo]. SUMO is open-source software for the simulation of urban mobility, which is widely used for proposing and validating research ideas related to intelligent transportation. In this work, we select a segment of the highway traffic network in Dublin city (i.e., the M50 highway network) as the scenario where different driving intentions may occur in the real world. As shown in red in Fig. 4.2, as vehicles leave the green segment where the driving speed is guided by SAS, the drivers may drive with frequent lane changing intentions (in the red segment), which endangers traffic safety. There are four lanes in this segment of the M50 highway network, and the traffic flow data is collected while vehicles are running on this segment.
In this experiment, a new vehicle is generated at every simulation step (i.e., every 1 second) on the lane recommended by SUMO. In a normal situation, all vehicles drive at the SAS speed, which is set to 80 km/h, without frequent lane changes on the highway traffic network. However, since different driving intentions can arise in the real world, we account for the possibility of violating the SAS speed and of frequent lane changing when generating the traffic flow data. Violating the SAS speed is defined as driving at a speed that differs from the SAS speed within a given range (e.g., ±5%, ±10% or ±20% of the SAS speed), and lane changing means that the vehicle randomly switches to any lane of the highway traffic network (i.e., one of the four lanes, including the lane on which the vehicle is currently driving). With this in mind, each vehicle has a probability (i.e., SAS probability 0.1, 0.5 or 0.9) of driving at the SAS speed and a probability (i.e., lane probability 0.1, 0.5 or 0.9) of staying in the same lane at each simulation step while it remains in the highway traffic network. A higher probability indicates that the vehicle has a higher chance of following the SAS speed and of staying in the same lane. For instance, SAS probability = 1.0 and lane probability = 1.0 mean that the vehicle drives at the SAS-recommended speed and does not change lane for the whole journey. Fig. 4.3 demonstrates an example with SAS probability = 0.1, lane probability = 0.9 and a speed range of ±20% of the SAS speed. Each vehicle moves at a constant speed for the whole journey: with probability 0.1 the speed is set to the SAS speed (i.e., 80 km/h), and with probability 0.9 it is set to a speed between 64 km/h and 96 km/h (i.e., violating the SAS speed within ±20%). Once the speed is set, the vehicle cannot change it. At each simulation step, each vehicle stays in its current lane with probability 0.9 (i.e., it changes lane with probability 0.1).
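The sampling rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation code used with SUMO: the function names are hypothetical, the speed of a violating vehicle is assumed to be drawn uniformly from the allowed range, and Python's `random` module stands in for the simulator's own random source.

```python
import random

SAS_SPEED_KMH = 80.0   # SAS speed used in the experiment
NUM_LANES = 4          # lanes on the simulated M50 segment


def sample_speed(sas_prob, speed_range, rng):
    """Draw the constant speed of one vehicle (set once per journey).

    With probability `sas_prob` the vehicle follows the SAS speed;
    otherwise its speed is drawn from SAS speed +/- `speed_range`
    (uniform distribution is an assumption of this sketch).
    """
    if rng.random() < sas_prob:
        return SAS_SPEED_KMH
    low = SAS_SPEED_KMH * (1.0 - speed_range)
    high = SAS_SPEED_KMH * (1.0 + speed_range)
    return rng.uniform(low, high)


def next_lane(current_lane, lane_prob, rng):
    """Decide the lane for the next simulation step.

    With probability `lane_prob` the vehicle stays in its lane;
    otherwise it switches to a lane chosen at random among all
    lanes, including the current one.
    """
    if rng.random() < lane_prob:
        return current_lane
    return rng.randrange(NUM_LANES)


if __name__ == "__main__":
    rng = random.Random(42)
    # Setting of the Fig. 4.3 example: SAS prob 0.1, lane prob 0.9, +/-20 %.
    speed = sample_speed(0.1, 0.20, rng)
    lane = 0
    for _ in range(10):
        lane = next_lane(lane, 0.9, rng)
    print(f"speed = {speed:.1f} km/h, lane after 10 steps = {lane}")
```

Because the speed is drawn once and the lane decision is repeated at every step, the two probabilities act on different time scales, which matches the description of uniform motion with step-wise lane decisions.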
In a real-world application, multiple cameras can be set up to monitor the vehicles in each lane respectively. Since the driving speed of vehicles and the probability that vehicles drive at the SAS speed can be estimated, we focus on detecting the different lane changing intentions (i.e., different lane changing probabilities) in this chapter.
4.3.2 Feature selection and model training
For feature selection, while vehicles are driving with speed and lane changing intentions, the average driving speed and the number of vehicles on each lane are collected and estimated by the camera. With this information, the road sector management unit can estimate not only the probability that vehicles drive at the SAS speed, but also the range of speed changes (e.g., ±5%, ±10% or ±20% of the SAS speed) for the vehicles that do not follow the SAS speed. Therefore, separate models for lane changing detection are trained for each combination of SAS probability and range of speed change. We label the traffic flow according to the probability of lane changing (i.e., the lane probability). The traffic flow data used for model training, validation and testing are generated for 3600, 1800 and 3600 simulation steps, corresponding to monitoring the traffic flow for one hour, half an hour and one hour respectively in the real world.
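The two lane-level features and the labelling scheme can be illustrated as follows. This is a sketch under assumptions: the input format (a list of `(lane_index, speed)` observations for one simulation step), the convention of reporting an average speed of 0.0 for an empty lane, and the class indices in `LANE_PROB_TO_LABEL` are hypothetical choices made for the example.

```python
from collections import defaultdict

NUM_LANES = 4

# Hypothetical class indices for the three lane probabilities used as labels.
LANE_PROB_TO_LABEL = {0.1: 0, 0.5: 1, 0.9: 2}


def lane_features(vehicles, num_lanes=NUM_LANES):
    """Return [(average_speed, vehicle_count), ...], one pair per lane.

    `vehicles` is an iterable of (lane_index, speed_kmh) pairs observed
    at a single simulation step. An empty lane yields (0.0, 0), which is
    a convention chosen for this sketch.
    """
    speeds_by_lane = defaultdict(list)
    for lane, speed in vehicles:
        speeds_by_lane[lane].append(speed)

    features = []
    for lane in range(num_lanes):
        lane_speeds = speeds_by_lane.get(lane, [])
        count = len(lane_speeds)
        average = sum(lane_speeds) / count if count else 0.0
        features.append((average, count))
    return features


if __name__ == "__main__":
    snapshot = [(0, 80.0), (0, 70.0), (2, 96.0)]
    print(lane_features(snapshot))
    print("label for lane probability 0.5:", LANE_PROB_TO_LABEL[0.5])
```

The sketch covers the four lanes of a single road segment; the same function can be applied per segment when the road is split into several lane segments.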
4.3.3 Traffic Flow on Graph
In this section, we introduce the processing of highway traffic flow with graph modelling. In previous works, the traffic flow data is collected at the junctions between different roads. However, each road has multiple lanes, and we collect lane-wise traffic flow data. With this setting, we treat the highway network as a graph $G = (V, E)$, where $V$ denotes the nodes, i.e., the set of $N$ lane segments, and $E$ denotes the edges, i.e., the connections between nodes. The adjacency matrix derived from the graph is denoted by $A \in \mathbb{R}^{N \times N}$. The graph is set as fully connected, since a vehicle with lane changing intentions may change from one lane to any other, indicating $A_{ij} = 1$ for every pair of nodes $i, j \in V$. Specifically, as shown in Fig. 4.4, the highway network is divided into two segments, so we have 8 lane segments (i.e., $N = 8$), and a graph signal $x_t \in \mathbb{R}^{N \times 2}$ is collected at each simulation step $t$ over the nodes, where the two features of each node are the average vehicle speed and the number of vehicles (i.e., density) on the lane. Finally, the sequence $(x_t, \dots, x_{t+T-1})$ is processed as one sample of graph data, where $T$ indicates the length of the temporal segment used when processing the graph data.
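As a rough illustration of this graph construction, the sketch below builds a fully connected adjacency matrix for the 8 lane segments and cuts a sequence of per-step graph signals into samples of a fixed temporal length. The function names are hypothetical, whether self-loops are included is left as a parameter because it is an assumption either way, and the stride between consecutive samples is also a free parameter of the sketch.

```python
def fully_connected_adjacency(num_nodes, self_loops=False):
    """Adjacency matrix (list of lists) with an edge between every node pair."""
    return [
        [1 if (i != j or self_loops) else 0 for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def sliding_windows(signals, window, stride):
    """Split a sequence of per-step graph signals into overlapping samples.

    `signals[t]` is the graph signal at step t (e.g. 8 nodes x 2 features).
    Each returned sample holds `window` consecutive steps; consecutive
    samples start `stride` steps apart.
    """
    last_start = len(signals) - window
    return [signals[s:s + window] for s in range(0, last_start + 1, stride)]


if __name__ == "__main__":
    adjacency = fully_connected_adjacency(8)
    # Placeholder signals: 3600 steps, 8 nodes, 2 features (speed, count).
    signals = [[(0.0, 0)] * 8 for _ in range(3600)]
    samples = sliding_windows(signals, window=90, stride=45)
    print(len(adjacency), "nodes,", len(samples), "samples of length", len(samples[0]))
```

Keeping the stride separate from the window length makes the amount of overlap between contiguous samples an explicit design choice.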
4.3.4 Network architecture
Temporal convolutional networks (TCNN). With the graph modelling of highway traffic networks, TCNN is designed as a baseline to evaluate the ability of a CNN to detect the intentions from graph-structured traffic flow data. To keep it as simple as possible, the TCNN architecture is adapted from [DBLP:conf/ijcai/YuYZ18] and shown in Fig. 4.5. Graph features extracted from each temporal segment are fed to three identical 2-D convolutional layers. The output of the first convolutional layer is activated by a sigmoid function, giving normalised values between 0 and 1. The outputs of the other two convolutional layers are added to these normalised values, and the sum is activated by a ReLU function, followed by a fully-connected layer. The rationale is that the two convolutional layers without sigmoid activation tune the model parameters coarsely and converge to the optimal values faster, while the normalised values from the first convolutional layer with sigmoid activation help to adjust the parameters precisely.
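Under one reading of the description above, the TCNN forward pass can be written compactly. The notation is introduced here only for illustration and does not appear in the original text: $X$ is one input sample, $W_1$, $W_2$ and $W_3$ are the kernels of the three convolutional layers, $*$ denotes 2-D convolution, $\sigma$ is the sigmoid function, $\mathrm{vec}(\cdot)$ flattens a tensor, and $W_o$, $b_o$ are the parameters of the fully-connected layer producing the class scores $\hat{y}$.

```latex
\begin{aligned}
H       &= \mathrm{ReLU}\big(\sigma(W_1 * X) + W_2 * X + W_3 * X\big), \\
\hat{y} &= W_o \,\mathrm{vec}(H) + b_o .
\end{aligned}
```

In this form the sigmoid branch contributes a bounded term in $(0, 1)$ to each activation, while the two linear branches carry the unbounded part of the signal.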
Temporal graph convolutional networks with attention mechanism (ATGCN). Based on the temporal convolutional networks proposed above, we extend the network architecture to graph convolutional networks with an attention component. Following the work presenting the ST-GCN architecture [DBLP:conf/ijcai/YuYZ18], we introduce TGCN with an attention mechanism, consisting of two attention temporal convolution blocks (ATCs) and a fully-connected output layer. Each ATC consists of two of the temporal convolution blocks used in TCNN, with an attention mechanism applied to process temporal information, as shown in Fig. 4.6. Note that ATGCN carries latent static spatial information, since the nodes are fully connected to each other as discussed in section 4.3.3.
4.3.5 Network setup
Following the processing of traffic flow into graph format in section 4.3.3, the number of nodes is set to 8, corresponding to the 8 lane segments in the highway traffic network. The number of features is set to 2, corresponding to the average speed and the number of vehicles on each lane segment, which are collected as the graph features. The length of the temporal segment is set to 30, 60 and 90 respectively, and each setting is examined later in our experiments.
For the TCNN architecture, each convolution layer has 2 input channels (i.e., corresponding to the number of features) and 64 output channels, with the kernel size set to 3. In each input channel, a 2-D traffic data slice with a dimension of [T, 8], where T is the length of the temporal segment, represents one feature from the 8 lanes in a given temporal segment and is used for model training. Therefore, the fully-connected layer has an input size of T × 8 × 64 and an output size of 3, corresponding to the 3 categories of anomalies that will be discussed in section 4.4.1. We set the batch size to 32, meaning that each training iteration processes 32 samples of 2-D traffic data slices.
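The stated layer sizes can be checked with a short shape calculation. The sketch assumes that the convolution uses stride 1 and a padding of 1 with the kernel size of 3; the padding is not given in the text and is inferred here from the fact that the fully-connected input keeps the full temporal length and all 8 lanes. The function names are hypothetical.

```python
def conv_output_size(size, kernel=3, padding=1, stride=1):
    """Standard output-length formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def fc_input_size(window, num_nodes=8, out_channels=64, kernel=3, padding=1):
    """Number of inputs to the fully-connected layer after the conv block."""
    time_steps = conv_output_size(window, kernel, padding)
    lanes = conv_output_size(num_nodes, kernel, padding)
    return time_steps * lanes * out_channels


if __name__ == "__main__":
    for window in (30, 60, 90):
        print(f"temporal length {window}: FC input size {fc_input_size(window)}")
```

Without padding, the same kernel would shrink both axes by 2, so the flattened size would no longer equal the temporal length times 8 times 64.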
The ATGCN architecture shares the same settings as the TCNN architecture for the temporal convolution layers and the fully-connected layer. The 3-D average pooling operator reduces the data to an output vector of size [1, T] over the temporal segment. This vector is multiplied, via a dot product operation, with the output of the temporal convolution module, realising the attention effect on temporal information. Table 4.1 lists the common settings used when training the ATGCN and TCNN architectures.
Table 4.1: Common settings for training ATGCN and TCNN.
| Setting | Value |
| Nodes | 8 (only for ATGCN) |
| Length of temporal segment | 30, 60, 90 |
| Initial learning rate | 0.001 |
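One possible formalisation of the temporal attention step of ATGCN is given below. It is a sketch under assumptions rather than the exact operator used in the experiments: the symbols are introduced here for illustration, the pooling is assumed to average over the channel and node dimensions, and no additional normalisation of the attention weights is assumed. $H \in \mathbb{R}^{C \times T \times N}$ denotes the output of the temporal convolution module with $C$ channels, $T$ time steps and $N$ nodes.

```latex
\begin{aligned}
a_t               &= \frac{1}{C N} \sum_{c=1}^{C} \sum_{n=1}^{N} H_{c,t,n},
                  \qquad t = 1, \dots, T, \\
\tilde{H}_{c,t,n} &= a_t \, H_{c,t,n} .
\end{aligned}
```

Here $a = (a_1, \dots, a_T)$ plays the role of the pooled vector of size $[1, T]$, so that time steps with larger average activation are weighted more strongly in $\tilde{H}$.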
4.4 Experimental Results and Discussion
In this section, we analyse and discuss the detection results for lane changing intention when vehicles are driven at irregular speeds.
4.4.1 Lane changing detection
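Table 4.2: Accuracy of lane changing detection using ATGCN for temporal segment lengths T = 30, 60 and 90.
| Range of speed change | SAS probability | T = 30 | T = 60 | T = 90 |
| 5% of SAS speed | SAS prob = 0.1 | 90.85% | 96.00% | 97.39% |
| 5% of SAS speed | SAS prob = 0.5 | 86.62% | 90.57% | 94.78% |
| 5% of SAS speed | SAS prob = 0.9 | 98.17% | 98.86% | 100.00% |
| 10% of SAS speed | SAS prob = 0.1 | 91.69% | 95.71% | 97.39% |
| 10% of SAS speed | SAS prob = 0.5 | 90.42% | 97.14% | 98.26% |
| 10% of SAS speed | SAS prob = 0.9 | 97.61% | 98.86% | 97.83% |
| 20% of SAS speed | SAS prob = 0.1 | 96.06% | 98.00% | 99.13% |
| 20% of SAS speed | SAS prob = 0.5 | 95.07% | 98.29% | 100.00% |
| 20% of SAS speed | SAS prob = 0.9 | 96.20% | 98.86% | 99.13% |

Table 4.3: Accuracy of lane changing detection using TCNN for temporal segment lengths T = 30, 60 and 90.
| Range of speed change | SAS probability | T = 30 | T = 60 | T = 90 |
| 5% of SAS speed | SAS prob = 0.1 | 91.27% | 96.29% | 90.43% |
| 5% of SAS speed | SAS prob = 0.5 | 86.06% | 91.71% | 89.57% |
| 5% of SAS speed | SAS prob = 0.9 | 98.59% | 98.86% | 97.83% |
| 10% of SAS speed | SAS prob = 0.1 | 92.54% | 97.14% | 94.35% |
| 10% of SAS speed | SAS prob = 0.5 | 93.52% | 96.86% | 97.39% |
| 10% of SAS speed | SAS prob = 0.9 | 97.75% | 98.00% | 96.96% |
| 20% of SAS speed | SAS prob = 0.1 | 96.48% | 98.00% | 98.26% |
| 20% of SAS speed | SAS prob = 0.5 | 95.49% | 98.00% | 96.96% |
| 20% of SAS speed | SAS prob = 0.9 | 96.76% | 96.29% | 99.13% |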
Here we evaluate the deep learning algorithms for detecting lane changing intentions. In order to exclude the effect of speed violation when detecting the traffic flow caused by lane changing, the data is divided into three conditions, that is, data generated under three probabilities (i.e., 0.1, 0.5 and 0.9) of speed violation. Detection of lane changing intentions is investigated in these conditions separately.
The detection also considers the effect of temporal segments when processing the graph data. We select three temporal segments with different lengths (i.e., 30, 60 and 90 simulation steps) when generating the samples of graph data. With these settings, the algorithm detects the lane changing intention every 30, 60 and 90 seconds respectively in a real-world application. Every two contiguous samples overlap in time, with the overlap determined by the length of the temporal segment.
Table 4.2 and Table 4.3 show the results of lane changing detection using ATGCN and TCNN respectively. On the one hand, the averaged accuracies based on ATGCN are better than those based on TCNN for different ranges of speed change. For each category of speed change, the detection based on ATGCN obtains the highest averaged accuracy with a temporal segment length of 90, which outperforms TCNN. For instance, ATGCN achieves the highest accuracy and TCNN obtains accuracy given . On the other hand, the length of the temporal segment used when processing the graph data has an important impact on detecting the traffic flow caused by lane changing. For ATGCN, as the temporal segments become longer, the performance of detecting traffic flow caused by lane-changing behaviours improves for most conditions when the speed is changed in different ranges. For instance, when most vehicles violate the SAS speed (i.e., when vehicles have only a 0.1 probability of following the SAS speed), expanding the length of the temporal segment from to increases the averaged detection accuracy from to when the speed is changed within the range of of SAS speed. The averaged accuracy over all conditions also increases with the length of the temporal segments. For TCNN, extending the length of the temporal segment can also improve the algorithm's chances of detecting the lane-changing behaviour. The performances under and are better than that under and the detection obtains the best performance (i.e., accuracy ) under when the speed is changed within the range of of SAS speed.
The average accuracies from Table 4.2 and Table 4.3 also indicate that the detection becomes more accurate as the range of speed change gets larger, for both ATGCN and TCNN. When vehicles are driving at widely different speeds, lane changing behaviours can disturb the traffic flow and lead to higher risks of traffic accidents. Therefore, given a larger range of speed change, the detection algorithm can capture the information representing the lane changing intentions more easily. Even if the vehicles are driving at the most similar speeds (i.e., only violating the SAS speed within a range of ±5%), where the chance of a traffic accident is smaller, ATGCN can also detect the corresponding lane changing behaviour with an accuracy of .
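The trends discussed above can be recomputed directly from the entries of Table 4.2 and Table 4.3. The sketch below does this with the Python standard library. It assumes that the three accuracy columns of each table correspond to temporal segment lengths of 30, 60 and 90 in that order, and the averages it produces are simple means over table rows, which may differ from the averaged figures originally reported.

```python
from statistics import mean

WINDOWS = (30, 60, 90)  # assumed column order of Tables 4.2 and 4.3

# Accuracy (%) keyed by (range of speed change, SAS probability).
ATGCN = {
    ("5%", 0.1): (90.85, 96.00, 97.39),
    ("5%", 0.5): (86.62, 90.57, 94.78),
    ("5%", 0.9): (98.17, 98.86, 100.00),
    ("10%", 0.1): (91.69, 95.71, 97.39),
    ("10%", 0.5): (90.42, 97.14, 98.26),
    ("10%", 0.9): (97.61, 98.86, 97.83),
    ("20%", 0.1): (96.06, 98.00, 99.13),
    ("20%", 0.5): (95.07, 98.29, 100.00),
    ("20%", 0.9): (96.20, 98.86, 99.13),
}
TCNN = {
    ("5%", 0.1): (91.27, 96.29, 90.43),
    ("5%", 0.5): (86.06, 91.71, 89.57),
    ("5%", 0.9): (98.59, 98.86, 97.83),
    ("10%", 0.1): (92.54, 97.14, 94.35),
    ("10%", 0.5): (93.52, 96.86, 97.39),
    ("10%", 0.9): (97.75, 98.00, 96.96),
    ("20%", 0.1): (96.48, 98.00, 98.26),
    ("20%", 0.5): (95.49, 98.00, 96.96),
    ("20%", 0.9): (96.76, 96.29, 99.13),
}


def average_by_window(table, speed_range=None):
    """Mean accuracy per temporal length, optionally for one speed range."""
    rows = [
        accuracies
        for (range_label, _), accuracies in table.items()
        if speed_range is None or range_label == speed_range
    ]
    return {w: mean(column) for w, column in zip(WINDOWS, zip(*rows))}


if __name__ == "__main__":
    for name, table in (("ATGCN", ATGCN), ("TCNN", TCNN)):
        overall = average_by_window(table)
        print(name, {w: round(a, 2) for w, a in overall.items()})
        for speed_range in ("5%", "10%", "20%"):
            per_range = average_by_window(table, speed_range)
            print(" ", speed_range, {w: round(a, 2) for w, a in per_range.items()})
```

Grouping by speed range as well as by temporal length makes it easy to see both effects discussed in this section from the same data.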
4.4.2 Feature visualisation and analysis
In this section, the features (i.e., average speed and vehicle number on the lane) are visualised and the importance of these features for lane changing detection is discussed.
Fig. 4.7 and Fig. 4.8 show the average speed and vehicle number in lane 4 over an hour for different lane changing intentions, where the vehicles have a probability of 0.1 and 0.9 of following the SAS speed respectively. Table 4.4 shows the statistical mean and standard deviation of the corresponding features. For both situations, SAS prob = 0.1 and SAS prob = 0.9, as vehicles tend to drive without lane change intentions (i.e., as the lane probability increases), the standard deviation of the vehicle number gets smaller, which indicates that the sequential feature of vehicle number tends to be more stable, and this pattern can be captured by the prediction models. As for the average speed on the lane, the standard deviation changes only slightly and without a clear trend as the lane probability increases, and the values of the standard deviations are close to each other. This pattern indicates that the vehicle number plays a dominant role in lane change detection, even though the average speed also reflects lane changing intentions to some extent.
To compare the similarity of the features between different lane probabilities in the frequency domain, the spectral information divergence (SID) measurements are calculated between average speeds and between vehicle numbers. A higher SID value indicates that the two signals differ more in their spectral pattern. Table 4.5 shows the SID measurements for average speeds and vehicle numbers between different lane probabilities under conditions SAS prob = 0.9 and of SAS speed change. The SID values for the vehicle numbers are considerably larger than those for the average speeds, indicating that the patterns shown in vehicle numbers are more specific to the corresponding lane probability and provide crucial information for lane changing detection.
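A minimal sketch of the two measurements used in this analysis is given below. It assumes a common formulation of SID: the power spectrum of each signal is normalised to a probability distribution, and SID is the symmetric sum of the two relative entropies between the distributions. The plain discrete Fourier transform, the small constant `eps` that avoids taking the logarithm of zero, and the use of the population standard deviation are choices made for this example rather than details taken from the experiments.

```python
import cmath
import math
from statistics import pstdev


def power_spectrum(signal):
    """Power of the one-sided discrete Fourier transform of a real signal."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal)
        )
        spectrum.append(abs(coefficient) ** 2)
    return spectrum


def normalise(spectrum, eps=1e-12):
    """Turn a spectrum into a probability distribution (eps avoids log(0))."""
    shifted = [value + eps for value in spectrum]
    total = sum(shifted)
    return [value / total for value in shifted]


def sid(signal_a, signal_b):
    """Spectral information divergence between two equal-length signals."""
    p = normalise(power_spectrum(signal_a))
    q = normalise(power_spectrum(signal_b))
    return sum(
        pi * math.log(pi / qi) + qi * math.log(qi / pi) for pi, qi in zip(p, q)
    )


if __name__ == "__main__":
    steps = range(64)
    slow = [math.sin(2 * math.pi * t / 64) for t in steps]
    fast = [math.sin(2 * math.pi * 5 * t / 64) for t in steps]
    print("std of slow signal:", round(pstdev(slow), 3))
    print("SID(slow, slow):", sid(slow, slow))
    print("SID(slow, fast):", round(sid(slow, fast), 3))
```

With this definition, SID is zero for identical spectra and grows as the spectral energy of the two signals concentrates in different frequency bins.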
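Table 4.4: Statistics of the features for different lane probabilities (LP).
| Conditions | Features | LP = 0.1 | LP = 0.5 | LP = 0.9 |
| SAS prob = 0.1 | Avg. Speed | 1.04 | 1.14 | 1.05 |
| SAS prob = 0.9 | Avg. Speed | 0.49 | 0.43 | 0.37 |

Table 4.5: SID measurements of the features between different lane probabilities (LP).
| Features | LP=0.1 vs. LP=0.5 | LP=0.1 vs. LP=0.9 | LP=0.5 vs. LP=0.9 |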
In this chapter, we model the traffic flow data on highway traffic networks using a graph and leverage a temporal graph convolutional network architecture with an embedded attention mechanism to detect vehicles' lane changing intentions. The experiments compare the detection performance of ATGCN with that of TCNN. The comparison indicates that the attention mechanism enhances the ability to capture the key temporal information and improves the detection accuracy. With ATGCN, anomalies can be detected within 90 seconds with the highest accuracy, and this prompt detection is important for traffic condition monitoring. In addition, with graph data modelling for highway traffic networks, a simpler TCNN architecture can also detect vehicles' lane changing intentions accurately if sufficient information is provided by a larger temporal segment. In fact, TCNN is also a promising alternative for shorter time windows (e.g., 30- and 60-second windows), but a longer time window with ATGCN achieves better accuracy.
To conclude, we believe that the chapter has two implications for intelligent transportation: 1) Graph modelling of traffic flow suits the nature of highway networks and helps to enhance the knowledge representation. 2) The length of the temporal segment affects the performance of anomaly detection. When anomalies need to be detected accurately and rapidly in important segments of highway traffic networks, more elaborate models (e.g., ATGCN) deserve more consideration. Conversely, if anomalies do not pose severe threats to driving safety (e.g., the driving speed varies within a small range) and can be monitored infrequently, the simpler model (TCNN) can be applied to reduce the computation cost.
5.1 Thesis summary
In this thesis, we discuss three topics related to IoT and smart transportation. In chapter 2, we investigate the problem of maximising the overall utility of IoT networks in a secure, privacy-aware and plug-and-play manner. To achieve this objective, we assume that there are different priority levels when different IoT devices transmit data to the central node in a decentralised setup with limited system resources. We propose a transmission frequency management system with anomaly detection mechanisms to better manage the IoT networks. We also introduce the system architecture, which includes four key components: IoT devices, Gateway, Cloud platform and Users. Each IoT device is associated with a utility function under certain assumptions, and our objective is to maximise the overall utility for the group of devices in the network by solving a mathematical optimisation problem. By applying decentralised ADMM optimisation, the transmission frequency management system is able to allocate the optimal transmission frequency to each IoT device in a privacy-protected manner. We also discuss anomaly detection in different scenarios using both mathematical rule-based and LSTM-based approaches. In real-world experiments, the optimal transmission frequencies are calculated and set locally on each IoT device, without allocation from the central node. Meanwhile, manipulations that lead the IoT devices to transmit data at frequencies deviating from the set transmission frequencies can be detected by the proposed anomaly detector with high accuracy.
In chapter 3, we investigate the problem of predicting the availability of shared bikes. Based on current research related to traffic demand prediction, we leverage the state-of-the-art spatial-temporal graph convolutional network (ST-GCN) as the foundation for predicting the number of available shared bikes using realistic datasets. To enhance the prediction accuracy, we embed an attention module in the ST-GCN architecture (AST-GCN) to leverage spatial and temporal information with different focuses. Furthermore, we discuss the impact of different methods of modelling the adjacency matrix on the proposed architecture. Experimental results show that our proposed method using AST-GCN with the embedded adaptive adjacency matrix outperforms the majority of existing approaches on two real-world datasets.
In chapter 4, we consider the problem of detecting lane changing intention on highway traffic networks to improve driving safety. We define the lane changing intention as a lane changing probability and then simulate the traffic flow of a group of vehicles driving with different lane changing probabilities using a popular mobility simulator (i.e., SUMO). Given the simulation scenario, we leverage a temporal graph convolutional network with an attention module (ATGCN) to detect the lane changing intentions and compare its performance with a simpler algorithm, i.e., the temporal convolutional neural network (TCNN). Experimental results show that ATGCN can detect the lane changing intentions within 90 seconds with higher accuracy, while TCNN can also detect the lane changing intentions quite accurately, albeit with lower accuracy than ATGCN. In short, there is a trade-off between detection performance and computation resources. If the computational resources in the IoT network are limited, TCNN can serve as a computationally economical option to ensure driving safety, whereas if there are enough computational resources, ATGCN is the better option for detecting the lane changing intention.
In general, the thesis investigates how the advanced optimisation theories and novel machine learning methods can be applied to deal with real-world challenges arising in several research areas of IoT and smart transportation.
5.2 Limitations and Future works
The thesis discusses different topics related to deep learning and optimisation algorithms applied in IoT and smart transportation. During the research carried out in this thesis, some limitations arose which merit further improvement, and we revisit these here as directions for future work.
In chapter 2, the transmission frequency management system allocates the optimal transmission frequencies in order to maximise the overall utility of a group of IoT devices. The utility functions defined on IoT devices are strictly concave and smooth. However, in scenarios where the utility functions are non-smooth and non-concave, the system behaves differently, and we will investigate the dynamics of the system behaviour for such utility functions. Regarding anomaly detection in the transmission frequency management system, we only employ the LSTM architecture, and other deep learning methods have not been investigated. Future work will experiment with other deep learning algorithms (e.g., graph neural networks [deng2021graph, zhao2020multivariate] and Transformer-based architectures [huang2020hitanomaly]). We will also experiment with different topologies (e.g., partially connected topology [jun2010partial]) applied to the IoT network to model the connections between devices and the gateway. As for anomaly detection, we only simulate the scenario in which a single device suffers one type of malicious manipulation at a time. In future work, we will investigate cases in which a device is attacked by multiple manipulations at the same time, and we will also consider the scenario where new devices connect to or disconnect from the gateway.
In chapter 3, although the overall prediction error for bike availability is low across all stations, the prediction errors are relatively higher for the stations in the city centre area. In order to improve the prediction performance in the city centre, the network structure, adjacency matrices and advanced feature selection will be investigated as part of our future work.
In chapter 4, lane changing intention is predicted when the vehicles are guided by only one specific SAS speed (e.g., SAS speed = 80 km/h). In future work, it is worth investigating the prediction performance when the vehicles are driving at different SAS-recommended speeds. The lane changing intention is described by different static probabilities (e.g., probability 0.1, 0.5, 0.9). However, in the real world, the intention is more complicated to describe, since the probability of lane changing can vary depending on drivers' characteristics. In future work, we shall factor such complexity into the modelling of drivers for a more accurate analysis of real-world scenarios. Also, real driver behaviours (e.g., lane changing behaviours in the real world) will be investigated, in order to determine what level of detection accuracy is sufficient for real-world applications. Finally, we wish to note that the SAS will cover all parts of the M50 highway network in the near future. In this context, different attributes of the road segments, e.g., the length and number of lanes, will need to be reconsidered for a better modelling of the graph, which forms another part of our future work.