Spatial Positioning Token (SPToken) for Smart Mobility

05/16/2019
by Roman Overko, et al.

We introduce a distributed ledger technology (DLT) design for smart mobility applications. The objectives of the DLT are: (i) preserving the privacy of the individuals, including General Data Protection Regulation (GDPR) compliance; (ii) enabling individuals to retain ownership of their own data; (iii) enabling consumers and regulatory agencies alike to confirm the origin, veracity, and legal ownership of data, products and services; and (iv) securing such data sets from misuse by malevolent actors. As a use case of the proposed DLT, we present a blockchain-supported distributed reinforcement learning innovation to determine an unknown distribution of traffic patterns in a city.



I Introduction

Companies such as Facebook, Google, Amazon, Waze and Garmin are just some examples of corporations that have built successful service delivery platforms using personalised data to develop recommender systems. While products gleaned from data mining of personal information have without doubt delivered great societal value, they have also given rise to a number of ethical questions that are causing a fundamental revision of how data is collected and managed. Some of the most pressing ethical issues include:

  • preservation of individuals’ privacy (including GDPR compliance);

  • the ability for individuals to retain ownership of their own data;

  • the ability for consumers and regulatory agencies alike to confirm the origin, veracity, and legal ownership of data, products and services;

  • and protection against misuse by malevolent actors.

It is in this context that distributed ledger technology (DLT) has much to offer. For example, it is well known that technologies such as blockchain can alleviate, or even eliminate, some of the above concerns. Consequently, our objective in this paper is to design one such system; we are particularly interested in developing a DLT that supports the design and realization of crowdsourced collaborative recommender systems for a range of mobility applications in smart cities. As will be seen, this objective is challenging for a number of reasons. First, from the perspective of the basic distributed ledger design, we are interested in a system able to support high-frequency micro-transactions of the type required for the rapid exchange of information between the multitude of IoT-enabled devices found in cities. Second, as the DLT must support multiple control actions and recommendations in real time, transaction times should be fast, with low or zero transaction fees. Finally, the DLT should penalize malevolent actors who attempt to spam the system or lie in order to attack any recommender system built on top of it.

It should be noted that wrapping a DLT layer around personal information will fundamentally change the business model of many companies. Many corporations currently monetize recorded personal data with no explicit reward returned to the owner of that data (other than personalized recommendations or free access to products in return for the collected data). If such data is no longer available free of charge to these corporations, existing business models will surely be jeopardized. In the future, most data will be privately held rather than publicly available, and companies seeking to develop services will need to purchase this data in order to sample an unknown density. In this context, a fundamental question is how to do this at minimum cost, as quickly as possible, for some desired level of accuracy (e.g., a minimum quality of service). Given this background, a fundamental requirement is a set of tools that enables such companies to sample these large data sets, secured in a distributed ledger, in an economic manner.

A second challenge arises from the design of recommender systems itself. In many important applications, the development of complex decision making tools is inhibited by difficulties in interpreting large-scale, aggregated data sets. This difficulty stems from the fact that data sets often represent closed-loop situations, where actions taken under the influence of decision support tools (i.e. recommenders), or even due to probing of the environment as a part of the model building, affect the environment and consequently the model building itself. Recently a number of papers have appeared highlighting the problem of recommender design in closed loop [Lazer2014, Sinha_NIPS17, Shorten_IEEETech16, Bottou, jonathan, roman]. Even in cases when there is a separation between the effect of a recommender and its environment, the problem of recommender design is complex in many real world settings due to the challenge of sampling and obtaining real time data at low cost.

In this paper we bring both of the above problems together in one framework. In particular, we consider the problem of sampling an unknown density representing traffic flow in a city, constituted by secured data points, using a DLT-type architecture, without perturbing the density through probing actions. Specifically, we will use reinforcement learning (RL) [RN229] to sample the density in order to build a model of the environment. Classical RL is usually not applicable for this purpose in many smart city applications due to its long training time and the disruptive effects of probing; we shall demonstrate how the use of DLT allows us to achieve rapid probing actions without affecting the environment, while also enabling individuals not only to retain ownership of their own data but also to be rewarded for contributing to the RL algorithm.

II Related work

Our work brings together ideas from many areas. DLT is a term that describes blockchain and a suite of related technologies. From a high-level perspective, a DLT is nothing more than a ledger held in multiple places, together with a mechanism for agreeing on its contents, namely the consensus mechanism. Since blockchain was first introduced in Nakamoto's white paper in 2008 [nakamoto2008bitcoin], the technology has been used primarily as an immutable record-keeping tool that enables financial transactions based on peer-to-peer trust [puthal2018everything, conoscenti2016blockchain, zheng2017overview, banerjee2018blockchain, yli2016current]. Architectures such as blockchain operate a competitive consensus mechanism enabled via mining (Proof-of-Work), whereas architectures such as the IOTA Tangle [wang2018survey], based on graph structures, often operate a cooperative consensus technique. In this work, we will use the IOTA DLT. Our interest in IOTA stems from the fact that its architecture is designed to facilitate high-frequency microtrading. In particular, the architecture places a low computational and energy burden on devices using IOTA, it is highly scalable, there are no transaction fees, and transactions are pseudo-anonymous [popov2017equilibria]. In terms of mobility applications, we note that several DLT architectures have already been proposed. Recent examples include [carnet, towards, bc] and the references therein. To the best of our knowledge, our work is the first using a Directed Acyclic Graph (DAG) structure, namely the IOTA Tangle, to support distributed machine learning (ML) algorithms.


In terms of ML, we borrow heavily from RL and Markov Decision Processes (MDPs), and in particular from crowdsourced ML. The literature on MDPs and RL algorithms is vast, and we simply point the reader to the recent publications [jonathan, Epperlein2018, Krumm2008, SimmonsBrowningZhangEtAl2006], in which some of this work is discussed. With specific regard to RL and mobility, some applications are presented in [work15, work18, work17, work19, work16]. As in our previous work [roman], we exploit the idea of using crowdsourced behavioural experience to augment the training of ML algorithms (see the recent survey [Vaughan2018] for an overview of this area). As also mentioned in [roman], our work has strong links to adaptive control [n1]. The idea of augmenting offline models with adaptation is discussed extensively in the recent multiple-models, switching, and tuning paradigm [n2].

Finally, it is worth mentioning that we are ultimately interested in the design of recommender systems that account for feedback effects in smart city applications. In [florian, bei, arieh], different information is sent to different agents in an attempt to mitigate closed-loop effects. An alternative, more formal, approach is presented in [jonathan]. There, the authors attempt the identification of a smart city system from closed-loop data sets. In particular, the authors present a Tikhonov regularization procedure for estimating parameters of a closed-loop Markov-Modulated Markov Chain, which consists of two Markov Chains: (i) a chain whose state is visible, and whose transition probabilities are modulated by (ii) a second Markov Chain whose state is hidden and whose transition probabilities can in turn also be modulated. Similar issues have drawn interest from various domains, including economics [HalVarian_NASUS17], recommender systems [Cosley2003, Sinha_NIPS17], physiology [Gollee11], and control engineering in the context of Smart Cities [Shorten_IEEETech16].

III A distributed ledger for crowdsourced smart mobility - SPToken

III-A Design objectives

Our intent is to design a DLT-based system for crowdsourcing in a smart mobility environment. In particular, we explore how to apply this framework to an RL setting where a third party is interested in acquiring information from vehicles in order to solve an optimization problem.

The underlying idea is to use a set of virtual vouchers, or tokens, as a proxy to indicate specific points of interest that algorithms might be interested in investigating. In RL algorithms, for example, we are interested in maximizing the expected reward (relative to an objective function) for taking a specific route across a city. To make this process clearer, consider the following example. Figure 1 shows an instance of a typical scenario where two junctions, A and B, are connected to one another through a road segment. At some time, a vehicle updates the ledger with some information (e.g., pollution levels, travel time) and registers the last visited intersection (junction A, in this example). Intuitively, this can be depicted as if the vehicle left the aforementioned token at junction A. Then, a new vehicle passing via junction A and directed to junction B can “collect” this token and, as it passes by junction B, it updates the ledger with new information regarding this road link and the new position of the token. It is noteworthy that a car “deposits” a token when it deviates from the token route. Additionally, any new car that passes via junction B and is routed along the token route will be able to collect the token, and the procedure is repeated for a new road segment.
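
To make the hand-off mechanics concrete, the following minimal Python sketch models a token migrating along its route as vehicles collect and deposit it at junctions. The class and function names (Token, try_collect, deposit) and the junction labels are illustrative only and are not part of the SPToken specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    """A virtual voucher marking the junction where data is next required."""
    token_id: str
    junction: str                      # junction where the token currently sits
    route: List[str]                   # sequence of junctions the token should visit
    history: List[dict] = field(default_factory=list)


def try_collect(token: Token, at_junction: str, next_junction: Optional[str]) -> bool:
    """A vehicle may pick up the token only if it is at the token's junction
    and is heading along the token route."""
    if at_junction != token.junction:
        return False
    idx = token.route.index(at_junction)
    return idx + 1 < len(token.route) and next_junction == token.route[idx + 1]


def deposit(token: Token, vehicle_id: str, at_junction: str, payload: dict) -> None:
    """On reaching the next junction, the vehicle writes its measurement to the
    ledger (here: appended to the token history) and leaves the token for the next car."""
    token.history.append({"vehicle": vehicle_id,
                          "segment": (token.junction, at_junction),
                          "data": payload})
    token.junction = at_junction


# Example: a car travelling A -> B collects the token at A and deposits it at B.
t = Token("tok-1", junction="A", route=["A", "B", "C"])
if try_collect(t, at_junction="A", next_junction="B"):
    deposit(t, "car-42", at_junction="B", payload={"travel_time_s": 95.0})
print(t.junction, t.history)
```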

The concept of using tokens to mark specific points where measurements are needed perfectly conforms with a DLT-based system. In fact, it is natural to use distributed ledger transactions to update the position of the tokens and to link them to the points of interest, and associated data, using transactions (this can be done, for example, using smart sensors at various junctions linked to digital wallets, as shown in Figure 1). Of course, the design of such a network poses a number of challenges that need to be addressed:

  • Privacy: In the DLT, transactions are pseudo-anonymous (see https://laurencetennant.com/papers/anonymity-iota.pdf). This is due to the cryptographic nature of the addressing, which is less revealing than other forms of digital payments that are uniquely associated with an individual [2019arXiv190107302F]. Thus, from a privacy perspective, the use of DLT is desirable in a smart mobility scenario.

  • Ownership: Transactions in the DLT can be encrypted by the issuer, thus allowing every agent to maintain ownership of their own data. In the aforementioned setting, the only information required to remain public is the current ownership of the tokens.

  • Microtransactions: Due to the number of vehicles in the city environment, and also due to the need to link the information to real-time conditions (such as traffic or pollution levels), there is a demand for fast, high-volume data throughput.

  • Resilience to Misuse: The system must be resilient to attacks and misuse from malevolent actors. Typical examples include double spending attacks, spamming the system, or writing false information to the ledger. All these instances can be greatly limited by a combined use of a consensus system based on Proof-of-Work (PoW) and Proof-of-Position (PoP), which will be described in the next section.

(a) A vehicle passes through a junction where another car has recently issued some data (indicated by a token). This makes the agent eligible to write transactions to the ledger.
(b) The same vehicle passes through the next junction. It then writes some data, relative to the road link it has just traversed, to the ledger and deposits the token so that another vehicle will be able to collect it.
Fig. 1: The sequence to issue new data from vehicles. The marker denotes a token.

To meet all the design objectives described above, in the next section we propose Spatial Positioning Token (SPToken), a permissioned distributed ledger based on the IOTA Tangle.

III-B The Tangle and the Proof of Position

As discussed above, we are interested in building a Tangle-based system, that is, a particular DLT architecture that makes use of Directed Acyclic Graphs (DAGs) to achieve consensus about the shared ledger. A DAG is a finite connected directed graph with no directed cycles. In other words, in a DAG there is no directed path that connects a vertex with itself. The IOTA Tangle is a particular instance of a DAG-based DLT [popov2017equilibria], where each vertex, or site, represents a transaction, and where the graph, with its topology, represents the ledger. Whenever a new vertex is added to the Tangle, it must approve a number of previous transactions (normally two). An approval is represented by a new edge added to the graph. Furthermore, in order to prevent malicious users from spamming the network, the approval step requires a small PoW. This step is less computationally intense than its blockchain counterpart [banerjee2018blockchain] and can easily be carried out by common IoT devices, but it still introduces some delay before new transactions are added to the Tangle. Refer to Figure 2 for a better understanding of this process.

The Tangle architecture has the advantage over blockchain of allowing microtransactions without any fees (as miners are not needed in order to reach consensus over the network [2019arXiv190107302F]), which makes it ideal in an IoT setting such as the one described in the previous section. Moreover, the Tangle fits naturally with the concept of multiple tokens being transferred from one location to another, as its DAG structure makes it natural to describe such a process.
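
As a rough illustration of this mechanism, the toy Python sketch below attaches a new transaction to a DAG by approving two unapproved sites (tips) after a small proof-of-work. It is a didactic sketch under simplified assumptions, not IOTA's actual tip-selection or hashing scheme.

```python
import hashlib
import random


class Tangle:
    def __init__(self, difficulty: int = 3):
        self.approvals = {"genesis": []}   # site id -> list of site ids it approves
        self.approved = set()              # sites that have at least one approver
        self.difficulty = difficulty       # leading zero hex digits required by the PoW

    def tips(self):
        """Sites that no other site approves yet."""
        return [s for s in self.approvals if s not in self.approved]

    def _proof_of_work(self, payload: str) -> int:
        nonce = 0
        while not hashlib.sha256(f"{payload}:{nonce}".encode()).hexdigest() \
                .startswith("0" * self.difficulty):
            nonce += 1
        return nonce

    def attach(self, tx_id: str, payload: str) -> None:
        """Approve (up to) two tips, perform a small PoW, and join the DAG."""
        tips = self.tips()
        chosen = random.sample(tips, k=min(2, len(tips)))
        self._proof_of_work(payload)       # spam protection, result discarded here
        self.approvals[tx_id] = chosen
        self.approved.update(chosen)       # the chosen tips are now approved


tangle = Tangle()
tangle.attach("tx1", "pollution=41ug/m3")
tangle.attach("tx2", "travel_time=95s")
print(tangle.tips())
```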

Unlike the Tangle, in which each user has complete freedom on how to update the ledger with transactions, the SPToken network has a regulatory policy in order to prevent agents from adding transactions that do not contain any relevant data (since transactions are encrypted). Therefore, as a further security measure, SPToken makes use of PoP to authenticate transactions. In other words, for a transaction to be authenticated, it has to carry proof that the agent was indeed in an area where a token was available. This is achieved through PoP via special nodes called Observers. Each observer is linked to a physical sensor in a city. A sensor can be a fixed piece of infrastructure, or a vehicle whose position is verified. Whenever a car passes by an observer that is in possession of a token, a short-range connection is established (e.g., via Bluetooth) and the token is transferred to the vehicle's account. To deposit the token and to issue a transaction containing data, the agent needs to pass by another observer and establish a short-range connection (note that not every observer is available to establish a connection at every moment). To prevent users from hoarding tokens, tokens are automatically returned if not used after a certain period of time. See Figure 1 for a better understanding of this process. This ensures that vehicles have to be physically present at the locations of interest to be able to issue transactions. This further authentication step makes SPToken a permissioned Tangle (similar to permissioned blockchains [puthal2018everything]), i.e., a DAG-based distributed ledger where a certain number of trusted nodes (the observers, in this case) is responsible for maintaining the consistency of the ledger (as opposed to a public one, where security is handled by a cooperative consensus mechanism [popov2017equilibria]).

Furthermore, an additional PoW step can be introduced into the network to ensure that multiple vehicles (each with tokens) compete with each other to write to the ledger. In this context, instead of an observer issuing a single token, a number of virtual tokens is issued to appropriate vehicles. Once each of these vehicles completes a physical PoP step (for example, traversing a segment), they then compete to write to the ledger via PoW. While a full discussion of this is beyond the scope of the present paper, it is worth noting that this procedure would make it extremely expensive for dishonest actors to write biased data to the ledger (in a manner similar to the blockchain mining mechanism).
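
A minimal sketch of how an observer could implement the PoP step is given below. The short-range handshake, the Observer class, and its methods (grant_token, validate) are hypothetical names used for illustration, and the time-out mimics the automatic return of hoarded tokens mentioned above.

```python
import time
import secrets


class Observer:
    """A trusted node tied to a physical sensor at a junction."""

    def __init__(self, junction_id: str, token_ttl_s: float = 900.0):
        self.junction_id = junction_id
        self.token_ttl_s = token_ttl_s
        self.issued = {}                     # proof -> (vehicle_id, issue_time)

    def grant_token(self, vehicle_id: str) -> str:
        """Called when a short-range (e.g., Bluetooth) connection confirms that
        the vehicle is physically present. Returns a one-time proof of position."""
        proof = secrets.token_hex(16)
        self.issued[proof] = (vehicle_id, time.time())
        return proof

    def validate(self, proof: str, vehicle_id: str) -> bool:
        """Checked before the vehicle's transaction is admitted to the permissioned
        Tangle; in practice the issuance records would be shared between observers."""
        record = self.issued.get(proof)
        if record is None:
            return False
        owner, issued_at = record
        if owner != vehicle_id:
            return False
        if time.time() - issued_at > self.token_ttl_s:
            del self.issued[proof]           # expired: the token returns automatically
            return False
        return True
```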

Fig. 2: Sequence to issue a new transaction. The blue sites represent the approved transactions and the red ones describe transactions that have not been approved yet. The black edges represent approvals, whereas the dashed ones represent transactions that are performing the PoW in order to approve two unapproved sites.

IV Application example - Reinforcement learning over SPToken

Our objective now is to implement an RL strategy using the token architecture described in the previous section. Specifically, instead of using vehicles as RL agents [roman] to probe an unknown density, we use tokens passed between vehicles to effectively create virtual agents and emulate the behaviour of agents designed to probe the environment. Formally, we employ a modified version of the recently proposed model-based MDP learning algorithm called Upper Bounding the Expected Next State Value (UBEV) [ubev]. UBEV combines backward induction with maximum likelihood estimation to (i) construct optimistic empirical estimates of state transition probabilities, (ii) assign empirical immediate rewards, and (iii) compute the optimal policy. In fact, our design of the action space allows us to avoid estimating the transition probabilities, which significantly reduces the training time. Effectively, the algorithm learns only the reward function which describes the environment (e.g., traffic patterns in a city).

Since long training time is a common disadvantage of RL algorithms, we propose to launch several independent tokens, which act as virtual vehicles and use the same policy to explore different areas of a city. Further details of the proposed approach, together with the corresponding experimental assessment, are provided in the following sections. In particular, we experimentally assess:

  • how fast the system learns to avoid traffic jams,

  • how quickly the system returns to the shortest path policy once the traffic jams clear up, and

  • how the training time depends on the number of independent tokens.

Essentially, the UBEV algorithm [ubev] performs a standard expectation-maximization trick. Namely, it first fixes the state transition probabilities of the MDP and the expected reward estimates, and uses backward induction to design the optimal deterministic policy in feedback form which maximizes the expected reward. Next, this policy is used to “probe” the environment, and the statistics collected over the course of probing are used to update the transition probabilities by employing a standard “frequentist” maximum likelihood estimator [HTF], which simply computes the frequencies of transitioning from one state to another subject to the current action (that can be a function of the current state). Then, the optimal policy (for the updated estimates of the transition probabilities and reward) is recomputed again. This procedure is treated as an episode of the training process and is iterated until convergence (as demonstrated in [ubev]).
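
The sketch below illustrates this alternation for a small finite-horizon MDP: backward induction under the current empirical estimates, one episode of probing with the resulting policy, and a frequentist update of the transition and reward statistics. It omits the optimism bonus that distinguishes UBEV from plain certainty-equivalent planning, and the toy environment is invented purely for illustration.

```python
import numpy as np

S, A, H = 4, 2, 5                      # states, actions, horizon
counts = np.ones((S, A, S))            # transition counts (smoothed with 1)
reward_sum = np.zeros((S, A))
reward_cnt = np.ones((S, A))

def plan():
    """Backward induction with the current empirical estimates of P and r."""
    P_hat = counts / counts.sum(axis=2, keepdims=True)
    r_hat = reward_sum / reward_cnt
    V = np.zeros((H + 1, S))
    policy = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):
        Q = r_hat + P_hat @ V[t + 1]   # Q has shape (S, A)
        policy[t] = Q.argmax(axis=1)
        V[t] = Q.max(axis=1)
    return policy

def probe(policy, env_step, s0=0):
    """Run one episode with the fixed policy and update the statistics."""
    s = s0
    for t in range(H):
        a = policy[t, s]
        s_next, r = env_step(s, a)     # environment interaction
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
        reward_cnt[s, a] += 1
        s = s_next

# Toy environment: action 1 advances to the next state; reaching the last state pays 1.
def toy_env(s, a):
    s_next = min(s + a, S - 1)
    return s_next, float(s_next == S - 1)

for episode in range(20):              # iterate planning and probing until convergence
    probe(plan(), toy_env)
```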

IV-A Modified UBEV algorithm

We now present the Modified UBEV (MUBEV) algorithm. Recall that an MDP is a discrete stochastic model defined by a tuple $(\mathcal{A}, \mathcal{S}, \mathrm{P}, r)$, where

  • $\mathcal{A}$ is the set of actions, and $A$ is the number of actions,

  • $\mathcal{S}$ is the set of states, and $S$ is the number of states,

  • $\mathrm{P}(s' \mid s, a)$ is the probability of transition from state $s$ under action $a$ to state $s'$,

  • $r(s, a)$ is the reward of choosing the action $a$ in the state $s$.

The trajectory of the MDP is defined as follows: it is assumed that $s_{t+1} \sim \mathrm{P}(\cdot \mid s_t, a_t)$, i.e., the state at time $t+1$ is drawn from a distribution P which depends on the current state $s_t$ and the action $a_t = \pi_t(s_t)$. In this case, the expected reward associated to the policy $\pi$ is defined in this fashion:

$$J(\pi) = \mathbb{E}_{s_1 \sim P_0}\left[ V_1^{\pi}(s_1) \right], \qquad (1)$$

where $P_0$ is the distribution of the initial state, and $V_t^{\pi}$ is defined as follows:

$$V_t^{\pi}(s) = r(s, \pi_t(s)) + \sum_{s' \in \mathcal{S}} \mathrm{P}(s' \mid s, \pi_t(s))\, V_{t+1}^{\pi}(s'), \qquad V_{H+1}^{\pi}(\cdot) \equiv 0. \qquad (2)$$

The goal of the MDP is to maximize the expected reward (1), and the optimal MDP policy, i.e., the policy maximizing Equation (1), is calculated through the backward induction process given by:

$$Q_t(s, a) = r(s, a) + \sum_{s' \in \mathcal{S}} \mathrm{P}(s' \mid s, a)\, V_{t+1}(s'), \quad V_t(s) = \max_{a \in \mathcal{A}} Q_t(s, a), \quad \pi_t^{*}(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q_t(s, a). \qquad (3)$$

We are now in a position to present the MUBEV algorithm. Algorithm 1 is a modified version of the UBEV algorithm, obtained by adapting the original algorithm for use in the context of our target problem, with the following modifications.

First of all, we use a specific type of action space, namely the actions 'turn left', 'turn right', 'go straight', and 'stay in the same state'. This allows us to provide the algorithm with a set of predefined transition probabilities. Specifically, for each action, the corresponding transition probability matrix has rows whose elements are all zero except for a single 1 at the location of the state representing the road link to which the agent jumps from the current state, provided that action was taken. For example, if the action is 'turn left', then for every road link (state) there is just one “utmost left” road link, and so the probability of transitioning to the corresponding state is 1 (see the illustrative sketch below). As a result, it is not required to learn the transition probabilities, which is a significant advantage, especially for large road networks.

Second, at the beginning of the training there is little or no information about the reward distribution, and the algorithm explores rather than exploits. For instance, it assigns the optimal policy “randomly”: if all the components of the Q-function are equal (Algorithm 1, line 22), the original algorithm always selects the first component (as per line 25). In other words, it probes the environment without any preference in terms of the direction of the exploration. In contrast, in the case of such ties we force the algorithm to stick to the shortest path policy, so that it explores the surroundings along the shortest route and gathers the corresponding reward statistics along that route. Once it “faces” a traffic jam after a certain action, it gets delayed, which in turn introduces a negative reward for that action at that state. As a result, the reward distribution changes, and the shortest path policy is amended to avoid the jam by looking for a detour. By operating in this fashion, we sample along near-optimal trajectories, which also has practical value.

Third, we aim to launch multiple participating tokens, always starting at different (randomly sampled) origins and having the same destination. All these tokens follow the same policy, and the corresponding statistics are then used to update the expected reward. Consequently, learning and adaptation happen more rapidly.

Finally, we propose a stationary model of the MDP with (i) the exchange of collected reward and statistics (Algorithm 1, lines 32-33) between agents (tokens), and (ii) the contribution of such data to the recommender system.
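
To illustrate the first modification, the sketch below builds the fixed 0/1 transition matrices induced by the four-action space for a small, invented link topology; since every (state, action) pair has exactly one successor link, no transition probabilities ever need to be estimated. The topology and link names are assumptions made purely for illustration.

```python
import numpy as np

ACTIONS = ["left", "right", "straight", "stay"]
links = ["e0", "e1", "e2", "e3"]

# Hypothetical road topology: the unique successor link for each (link, action) pair.
successor = {
    ("e0", "left"): "e1", ("e0", "right"): "e2", ("e0", "straight"): "e3",
    ("e1", "left"): "e0", ("e1", "right"): "e3", ("e1", "straight"): "e2",
    ("e2", "left"): "e3", ("e2", "right"): "e0", ("e2", "straight"): "e1",
    ("e3", "left"): "e2", ("e3", "right"): "e1", ("e3", "straight"): "e0",
}

S, A = len(links), len(ACTIONS)
P = np.zeros((A, S, S))                      # P[a, s, s'] is known a priori
for a, action in enumerate(ACTIONS):
    for s, link in enumerate(links):
        nxt = link if action == "stay" else successor[(link, action)]
        P[a, s, links.index(nxt)] = 1.0      # each row contains a single 1

assert np.allclose(P.sum(axis=2), 1.0)       # every row is a (degenerate) distribution
```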

Notation for MUBEV and the Reward Function. In Algorithm 1, $\mathcal{S}$ is the set of states, where a state corresponds to a road link (edge) in a SUMO network; $\mathcal{A}$ is the set of actions; $S$ and $A$ denote the cardinality of the finite sets $\mathcal{S}$ and $\mathcal{A}$, respectively; $H$ is the length of the MDP's time horizon; P is an array of predefined transition probabilities; a shortest path policy, the number of MUBEV tokens, and the failure probability (see [ubev] for details) are supplied as inputs. For every state, action, and time step, the algorithm maintains the number of actions taken from that state, the accumulated reward for that state-action pair, the value function from that time step, and the Q-function for the appropriate state, action, and time [ubev]; the initial values of all elements in these arrays are zeros. The failure tolerance is scaled appropriately, and bounds are kept on the maximum reward that the agent can receive per transition and on the maximum value for next states, with several of these quantities interpreted as vectors over the state space. The width of the confidence bound follows [ubev]; e denotes Euler's number; normalized rewards and auxiliary variables are used in the planning step. The vector of initial states of the MUBEV tokens is uniformly sampled, with no repeated entries. The agents (tokens) interact with the environment at each time step and receive a reward determined by the reward function defined in Function 1.

Function 1 (The Reward Function): the line-by-line pseudocode is not reproduced here. The function computes a distance reward and a time reward, applies an edge coefficient where appropriate, combines the weighted components into the total reward, and returns a fixed penalty when the agent takes an impossible action or leaves the destination; its inputs and parameters are described below.

Concerning Function 1, it returns the total reward, i.e., the distance reward plus the time reward, at each time step. Additionally, the function makes use of: the actual travel time on the edge that corresponds to the current state; a scale factor that increases the minimum travel time on an edge due to traffic uncertainties; a parameter used for faster learning of congestions; the weights of the distance and time rewards, respectively; the absolute value of the penalty given to the agent if it takes impossible actions during the learning process or when it leaves the destination; the shortest route length from the current state to the destination state; and the length of the edge that corresponds to the current state. Finally, the duration of the yellow and red phases of a traffic light signal (TLS) that controls an edge (state) is taken into account; if an edge is not controlled by a TLS, a default duration is applied for that state and the edge coefficient is employed instead (Function 1), which takes one of two fixed values depending on whether the length of the edge that corresponds to the state is smaller than the average edge length.
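
The exact constants and formulas of Function 1 are not reproduced here; the sketch below gives one plausible reading of the description above (weighted distance and time components, a traffic-light or edge-coefficient correction, and fixed penalties). All parameter names and values are assumptions made purely for illustration.

```python
def reward(prev_dist_to_dest, dist_to_dest, edge_length, avg_edge_length,
           travel_time, min_travel_time, tls_phase_time,
           w_dist=0.5, w_time=0.5, scale=1.5, edge_coeff=2.0, penalty=10.0,
           impossible_action=False, left_destination=False):
    """Illustrative total reward = w_dist * distance reward + w_time * time reward."""
    if impossible_action or left_destination:
        return -penalty                          # fixed penalty cases
    # Distance component: positive if this step brought the agent closer to the destination.
    dist_reward = 1.0 if dist_to_dest < prev_dist_to_dest else -1.0
    # Time component: compare observed travel time with a congestion-free reference,
    # accounting for traffic-light phases (or an edge coefficient for uncontrolled edges).
    reference = scale * min_travel_time + tls_phase_time
    if tls_phase_time == 0.0 and edge_length < avg_edge_length:
        reference *= edge_coeff                  # short, uncontrolled edges
    time_reward = 1.0 if travel_time <= reference else -1.0
    return w_dist * dist_reward + w_time * time_reward
```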

Algorithm 1 (Modified Upper Bounding the Expected Next State Value (UBEV) Algorithm - MUBEV): the line-by-line pseudocode is not reproduced here. In outline, each iteration consists of an optimistic planning loop that performs backward induction over the time horizon using the predefined transition probabilities and the current reward estimates, followed by the execution of the resulting policy for one episode, during which the reward function (Function 1) is evaluated and the collected statistics are accumulated for the next planning step.
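
The multi-token aspect (the third and fourth modifications above) can be outlined as follows: several tokens with randomly sampled, non-repeating origins execute the shared policy in parallel and pool their reward statistics before the next planning step. This is a schematic outline rather than the authors' implementation; the function names and data structures are assumptions.

```python
import random


def sample_origins(states, destination, n_tokens):
    """Uniformly sampled, non-repeating origins, as described for MUBEV."""
    candidates = [s for s in states if s != destination]
    return random.sample(candidates, k=min(n_tokens, len(candidates)))


def run_episode_with_tokens(policy, origins, destination, env_step, horizon,
                            reward_sum, reward_cnt):
    """Each token follows the shared policy from its own origin; the collected
    reward statistics are merged into the shared tables used by the planner."""
    for s0 in origins:
        s = s0
        for t in range(horizon):
            a = policy[t][s]
            s_next, r = env_step(s, a)
            reward_sum[(s, a)] = reward_sum.get((s, a), 0.0) + r
            reward_cnt[(s, a)] = reward_cnt.get((s, a), 0) + 1
            if s_next == destination:
                break
            s = s_next
```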

V Numerical simulations

In the following application, we are interested in designing a recommender system for a community of road users. We distribute a set of MUBEV tokens so that the uncertain environment can be ascertained. These tokens are passed from vehicle to vehicle using the DLT architecture described in Section III. Specifically, in what follows, tokens are passed from one vehicle to another in a manner that emulates a vehicle probing an unknown environment. The token passing is determined both by MUBEV and by the DLT, and can be orchestrated using a cloud-based service. Cars possessing a token are permitted to compete to write data to the DLT. We refer to such vehicles as virtual MUBEV vehicles. In this way, the token passing emulates the behaviour of a real agent (vehicle) that is probing the environment. Once the environment has been learnt, it is communicated to the community via some messaging service.

For the experimental evaluation of our proposed approach we designed a number of complex numerical experiments, based on traffic scenarios implemented with the open source traffic simulator SUMO [sumo]. Interaction with running simulations is achieved using Python scripts and the SUMO packages TraCI and Sumolib. The general setup used in our simulations is as follows:

  • In all our experiments, we make use of the area in Barcelona, Spain shown in Figure 3.

  • A number of roads are selected as origins, destinations, and sources of congestion. Experiment 1 uses the set {Origin 1, Congestion 1, Destination 1}, while Experiments 2 and 3 use {Origin 2, Congestion 2, Destination 2}.

  • In all simulations we use a new vehicle type based on the default SUMO vehicle type (https://sumo.dlr.de/wiki/Definition_of_Vehicles_Vehicle_Types_and_Routes) with maxspeed=118.8 km/h and impatience=0.5. To generate traffic jams, we modify the maximum speed of certain cars to be 6.12 km/h and populate the selected roads with them (see the sketch after this list). When these vehicles are in possession of a token, they become virtual MUBEV vehicles.

  • Whenever required, the shortest path is calculated with SUMO using the default routing algorithm (Dijkstra).

  • We refer to a token trip as an RL episode.
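
For completeness, the snippet below shows how a setup of this kind can be driven from Python with TraCI and Sumolib; the network, configuration, and edge names are placeholders, while the calls used (traci.start, traci.simulationStep, traci.edge.getLastStepVehicleIDs, traci.vehicle.setMaxSpeed, sumolib.net.readNet, net.getShortestPath) are standard parts of the SUMO Python API.

```python
import traci
import sumolib

# Load the network offline to compute a shortest path (default Dijkstra routing).
net = sumolib.net.readNet("barcelona.net.xml")            # placeholder file name
route, cost = net.getShortestPath(net.getEdge("origin_edge"),
                                  net.getEdge("destination_edge"))

# Drive a running simulation and create a jam by slowing down vehicles on one edge.
traci.start(["sumo", "-c", "barcelona.sumocfg"])          # placeholder config file
for step in range(3600):
    traci.simulationStep()
    for veh_id in traci.edge.getLastStepVehicleIDs("congested_edge"):
        traci.vehicle.setMaxSpeed(veh_id, 6.12 / 3.6)     # 6.12 km/h expressed in m/s
traci.close()
```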

Fig. 3: Realistic road network used in the experiments: between Vila de Gràcia and El Camp d’en Grassot i Gràcia Nova in Barcelona, Spain.

Concerning the design parameters of the reward function and the MUBEV algorithm, the same set of values was used in all our experiments, with the remaining design parameters tuned accordingly. The specific setup for each individual experiment is described in the corresponding subsection below.

V-A Experiment 1: Optimal route estimation under uncertainty

The purpose of the first experiment is to evaluate the performance of our approach for the estimation of optimal routes under uncertainty, and for this we first remind the reader of the general operation of our approach. Over a given episode, a number of tokens follow the system recommendations; when these tokens reach their destination, another set of tokens takes over, and over every consecutive episode MUBEV updates its estimates using data from each virtual MUBEV car.

For the purpose of the present discussion, we use a single token over each episode of learning, meaning that, over each episode, data from that token is used to update the MUBEV policy. For this, the MUBEV token has a fixed origin-destination (OD) pair given by {Origin 1, Destination 1}, and we select the road section labeled Congestion 1 (which belongs to the shortest path for the selected OD pair) to generate a traffic jam on it at different intervals (see Figure 3). Then, over each new episode we start the token from Origin 1 and ask it to travel to Destination 1, keeping a record of its performance in terms of travel distance (route length) and travel time regardless of its success. Additionally, a token has a maximum number of allowed links (defined by the MDP's time horizon) that it can traverse, and if it does not reach its destination within this restriction, the token trip is declared incomplete. The results for this experiment are shown in Figure 4, from which we can draw some important conclusions:

  • In general, we can see that the token succeeds both in avoiding the traffic jam once congestion is created, and in returning to the shortest path once congestion is removed, using a reasonably small number of episodes (see Figure 4, bottom).

  • As time passes, more information (statistics) is collected from the environment in the form of reward, and the token is more likely to fully complete a trip for the given OD pair (i.e., fewer red crosses as the experiment progresses in Figure 4).

Fig. 4: Experiment 1: Travel time and travel distance of a virtual MUBEV car (token) during the learning process in a changing environment, using a fixed OD pair.

These two observations validate our expectations about the UBEV-based routing system: (i) it is able to adapt to uncertain environments, and (ii) its performance improves as time passes. It is worth noting that this experiment is useful for analysing the performance of a single token learning from the environment using a fixed OD pair. Note that once the environment has been determined, the recommendations gleaned from it can be made available to the wider community of vehicles. We explore this in the following experiment.

V-B Experiment 2: Route recommendations from the UBEV-based system and speedup in learning

The previous experiment is a simple demonstration of the successful use of UBEV in a mobility context. We now explore a scenario where multiple tokens, starting from different origins, are used to update the MUBEV policy over each episode. Specifically, in the next experiment, we evaluate the performance of MUBEV as a function of the number of tokens over each episode, subject to a uniform geographical distribution of origins and a common destination (namely, Destination 2). Additionally, we analyze the performance of a (non-MUBEV) car trying to reach Destination 2 from the given fixed Origin 2, using recommendations from a simplistic UBEV-based routing system. In this case, the initial recommendation is the shortest path, and further recommendations come from the MUBEV recommender system for the OD pair {Origin 2, Destination 2}. In addition, if a complete route cannot be calculated using the MUBEV recommender system, the most recent valid recommendation is reused. The results for this experiment are depicted in Figures 5 and 6.

Fig. 5: Experiment 2: Average travel time and travel distance of a test car using route recommendations from a UBEV-based routing system and involving multiple MUBEV tokens. Each data point corresponds to the average of 10 different realizations of the experiment, and a moving average with window size 2 was later used to smooth the resulting signals.
Fig. 6: Experiment 2: Average learning speed using multiple MUBEV tokens. Each data point corresponds to the average of 10 different realizations of the experiment.

In Figure 5, it can be observed that the number of participating tokens directly affects the convergence rate of the algorithm. As expected, the more tokens are involved, the faster the learning process. From Figure 6, we can see the relationship between the number of participating MUBEV cars and the number of episodes required to learn a new given traffic condition (either congestion or free traffic).

V-C Experiment 3: Comparative analysis

Finally, the third experiment was designed to compare the performance of our UBEV-based approach with a reference solution: shortest path (SP) routing. It is worth mentioning that this reference solution is widely used by a variety of route recommenders, and so the proposed comparison is reasonable. For this experiment, we use two test cars, one of which uses recommendations from the MUBEV recommender system of Experiment 2, while the other uses the SP policy at all times, both for the OD pair {Origin 2, Destination 2}, and with congestion on road link Congestion 2 over a given interval. Results for this experiment are shown in Figure 7, in which we can observe that the performance of our UBEV-based approach is similar to SP routing under free-traffic conditions, but clearly outperforms SP (in terms of total travel time) once a traffic jam is introduced. It is also clear that a route different from the SP route implies a longer travel distance (as seen in Figure 7, bottom), but this is ultimately negligible for an end user as long as the resulting travel time is shorter than the one obtained using SP.

Fig. 7: Experiment 3: Performance of two test cars, one using recommendations from the UBEV-based routing system, and the other one always using shortest path policy. Each data point corresponds to the average of 10 different realisations of the experiment, and a moving average with window size 10 was later used to smooth the resulting signals.

VI Conclusion and Outlook

We introduced a distributed ledger technology design for smart mobility applications. The objectives of the DLT are: (i) preserving the privacy of the individuals, including General Data Protection Regulation (GDPR) compliance; (ii) enabling individuals to retain ownership of their own data; (iii) enabling consumers and regulatory agencies alike to confirm the origin, veracity, and legal ownership of data, products and services; and (iv) securing such data sets from misuse by malevolent actors. As a use case of the proposed DLT, we presented a blockchain-supported distributed RL algorithm to determine an unknown distribution of traffic patterns in a city.

Acknowledgements

This work was partially supported by SFI grant 16/IA/4610.