Bayesian Classifier for Route Prediction with Markov Chains

by Jonathan P. Epperlein, et al.

We present a general framework and a specific algorithm for predicting the destination, route, or more generally a pattern, of an ongoing journey, building on the recent work of [Y. Lassoued, J. Monteil, Y. Gu, G. Russo, R. Shorten, and M. Mevissen, "Hidden Markov model for route and destination prediction," in IEEE International Conference on Intelligent Transportation Systems, 2017]. In the presented framework, known journey patterns are modelled as stochastic processes emitting the road segments visited during the journey, and the ongoing journey is predicted by updating the posterior probability of each journey pattern given the road segments visited so far. In this contribution, we use Markov chains as models for the journey patterns, and consider the prediction final once one of the posterior probabilities crosses a predefined threshold. Despite the simplicity of both, experiments on a synthetic dataset demonstrate high prediction accuracy.




1 Introduction

©2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Understanding driver intent is a prerequisite for a personalised driving experience with features such as personalised risk assessment and mitigation, alerts, advice, automated rerouting, etc. One of the most important driver intentions is the destination of the journey and the route to get there. Furthermore, in particular for hybrid vehicles, knowledge of the route ahead of time can be used to optimise the charge/discharge schedule, e.g. [2] find improvements in fuel economy of up to 7.8%.

In the data used in [3], 60% of trips are repeated and hence predictable from the driving history; [4] suggests that in the more general setting of user mobility, more than 90% of a user’s trajectories are “potentially predictable.” The combination of value and feasibility has sparked much research in this direction. While some algorithms rely on GPS trajectories only – e.g. [3] computes geometric similarity between curves obtained from cleaned and filtered raw GPS data – other approaches, such as [1, 5, 6], where Markov chains and hidden Markov models are built on the road network to estimate the most likely routes and destinations, next turns, or trip clusters in a broader sense, require map-matching to first map GPS traces to links in the road network.

The present paper falls into that latter category, as it builds on and extends the recent work of [1] and contributes to a novel and flexible approach to the important problem of driver intent prediction. It is structured as follows: after introducing some notation relating to probability, stochastic processes, and Markov chains in the next section, we show in Section 3 how trips can be modelled as outputs of stochastic processes to obtain an estimate of the posterior probabilities of each known journey pattern. The algorithm that results when Markov chains are used as the stochastic process models is described in detail in Section 4, and Section 5 provides some experimental validation. We close with some possible extensions and improvements, and the observation that [1] also fits into the presented framework, if the stochastic process model is chosen to be a naive Bayes model instead of Markov chains.

2 Notation

We should write $\mathbb{P}(X = x)$ for the probability of the event that a realisation of the discrete random variable $X$ equals $x$, and $\mathbb{P}(X = x \mid Y = y)$ for the probability of that same event given that the event $Y = y$ occurred. However, for convenience we will most of the time write $\mathbb{P}(x)$ and $\mathbb{P}(x \mid y)$ instead of $\mathbb{P}(X = x)$ and $\mathbb{P}(X = x \mid Y = y)$, when it is clear from the context what is meant. For a set of parameters $\theta$ parametrising a probability distribution, the notation $\mathbb{P}(x; \theta)$ is taken to denote the probability of the event $X = x$ if the parameters are set to $\theta$.

We let $\mathbb{N}$ denote the natural numbers, and for $n \in \mathbb{N}$, $[n] := \{1, \dots, n\}$. Matrices will be denoted by capital letters, their elements by the same letter in lower case, and the set of row-stochastic $n \times n$ matrices, i.e. matrices with non-negative entries such that every row sums up to 1, by $\mathcal{M}_n$. We denote the cardinality of a set $\mathcal{A}$, i.e. the number of its elements, by $|\mathcal{A}|$. All cardinalities here will be finite.

A stochastic process is a sequence of random variables $X_t$ indexed by $t \in \mathbb{N}$, which often denotes time. For $T \in \mathbb{N}$, we call a sequence $x_{1:T} := (x_1, \dots, x_T)$ of realisations of the random variables $X_1, \dots, X_T$ a trajectory of the process; $\mathbb{P}(x_{1:T}; \theta)$ is the probability of the trajectory given that the stochastic process is parametrised by $\theta$ (e.g. if $X_t$ corresponds to an unfair coin flip, $\theta$ would correspond to the probability of heads), whereas interpreted as a function of $\theta$, it is the likelihood of the parameters being equal to $\theta$.

Specifically, we will use Markov chains, which are stochastic processes with $X_t \in \{1, \dots, n\}$ and that are completely defined by a row-stochastic transition (probability) matrix $P \in \mathbb{R}^{n \times n}$ and a vector $\pi \in \mathbb{R}^{n}$ of initial probabilities. Then, $\mathbb{P}(X_1 = i) = \pi_i$ and $\mathbb{P}(X_{t+1} = j \mid X_t = i) = p_{ij}$ characterise the process. Note that the “next state” depends only on the current realisation and not on the past; this is also known as the Markov property. This corresponds to a directed graph with $n$ nodes and weights $p_{ij}$ on the edge from node $i$ to $j$. The stochastic process defined by the Markov chain then corresponds to an agent being initialised on some node according to $\pi$ and making every decision where to go next according to the weights on the edges leading away from it.
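As a small illustration of the definition above, the following sketch evaluates the probability of one trajectory of a Markov chain as the initial probability of the first state times the transition probabilities along the path. The three-state chain, the names `pi`, `P`, and `trajectory_probability`, and the use of 0-based state indices are assumptions made for this example only.

```python
import numpy as np

def trajectory_probability(pi, P, states):
    """Probability of a state sequence under the Markov chain (pi, P)."""
    prob = pi[states[0]]                       # initial probability
    for i, j in zip(states[:-1], states[1:]):  # consecutive state pairs
        prob *= P[i, j]                        # depends only on the current state
    return prob

# Hypothetical 3-state chain; every row of P sums to 1 (row-stochastic).
pi = np.array([0.5, 0.5, 0.0])
P = np.array([[0.0, 0.9, 0.1],
              [0.2, 0.0, 0.8],
              [1.0, 0.0, 0.0]])

p_traj = trajectory_probability(pi, P, [0, 1, 2, 0])  # 0.5 * 0.9 * 0.8 * 1.0
```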

3 Problem Setting and Framework

From a driver’s history of trips taken in the past, we want to learn a predictive model, allowing us to identify properties (such as destination or specific route) of a currently ongoing trip as soon into the trip as possible. The history is

$$\mathcal{H} = \{\tau_1, \dots, \tau_N\},$$

where each trip $\tau_i$ has length $T_i$ and consists of sequences $t^i_{1:T_i}$ and $s^i_{1:T_i}$ of time stamps and road segments. The road segments are identified by their OpenStreetMap (OSM) way IDs [7], which implies that map matching has been performed on the raw GPS trajectories. We’ll return to that point in Sec. 5.1. Let

$$\mathcal{S} := \bigcup_{i=1}^{N} \{ s^i_1, \dots, s^i_{T_i} \}, \qquad n := |\mathcal{S}|$$

be the set and number of all road segments ever visited.

Each trip in $\mathcal{H}$ belongs to a cluster $C_k$, where the cluster encodes the journey pattern, destination, or more generally “properties” of the trip. From now on, we shall use the more generic term “cluster,” as defined in [1]; see also there for further explanations. A cluster could be as coarse as a collection of all trips with the same destination (defined as e.g. the last road segment of a trip), hence encoding only the property of destination, or more fine-grained, by defining a measure of similarity between trips and clustering according to those similarities; for instance trips along the “scenic route to work” and using the “fastest route to work” would then belong to different clusters despite sharing the same destination. More details are again postponed until the computational examples are described in Sec. 5. Whichever way it is obtained, let us define this set of clusters by

$$\mathcal{C} = \{C_1, \dots, C_K\}$$

and assume that $\mathcal{C}$ is a partition of $\mathcal{H}$, i.e. every trip belongs to exactly one cluster. We can then state the problem more precisely as:

Given an ongoing trip $\tau$, decide what cluster $C_k \in \mathcal{C}$ trip $\tau$ belongs to.

The proposed framework consists of a model for each cluster, providing a likelihood function $\mathbb{P}(s_{1:t} \mid C_k)$, a prior probability $\mathbb{P}(C_k)$, the Bayesian update to the posterior probabilities $\mathbb{P}(C_k \mid s_{1:t})$, and the criterion by which the prediction is made. The next sections elaborate on these parts.

3.1 Journeys as Stochastic Processes

Our approach is inspired by classical single-word speech recognition algorithms; see e.g. the classic survey [8]. There, words are modelled as stochastic processes (for speech recognition frequently hidden Markov models) emitting sequences of vectors of spectral (and/or temporal) features derived from the acoustic speech signal, and the word corresponding to the stochastic process with the highest likelihood of having generated the present sequence is returned as the result. Analogously, we propose to model journey patterns as stochastic processes emitting road links.

The choice of the type of stochastic process is free, and there can even be different kinds for different clusters; once a type is chosen, the parameters for each cluster $C_k$ have to be estimated from the trips in $\mathcal{H}$ belonging to $C_k$. In other words, a model for each cluster is “trained” on the available history. The outcome of this training process is a mapping $s_{1:t} \mapsto \mathbb{P}(s_{1:t} \mid C_k)$, i.e. a function that allows us to evaluate the likelihood for each cluster of it having produced the currently available sequence of road segments (and time stamps). Our choice of stochastic process here will be Markov chains, which are particularly easy to train and evaluate; the details are given in Sec. 4.1.

3.2 Prior Probabilities

Before a trip has even started, there might already be a high probability of it belonging to a certain cluster: if you have Aikido class on Wednesdays at 19:00, and the current trip started on a Wednesday at 18:30, there’s a very high probability that the trip’s destination will be your dojo. More generally, all additional information available about the trip, such as the current day of the week, the weather, current public events etc. form the context of the trip, and from the context, we can estimate the prior probabilities $\mathbb{P}(C_k)$ of each $C_k$ without any trip information yet available. In the absence of context, we can set $\mathbb{P}(C_k) = |C_k| / |\mathcal{H}|$, i.e. make the prior proportional to how many trips in $\mathcal{H}$ belonged to $C_k$, or even simply set $\mathbb{P}(C_k) = 1/K$ for all $k$.
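The two context-free choices of prior mentioned above can be sketched as follows. This is a minimal illustration, assuming the history is available as one cluster label per past trip; the function names and the example labels are hypothetical.

```python
from collections import Counter

def frequency_prior(cluster_labels):
    """Prior proportional to the number of past trips in each cluster."""
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return {k: c / total for k, c in counts.items()}

def uniform_prior(cluster_labels):
    """Same prior probability for every known cluster."""
    clusters = set(cluster_labels)
    return {k: 1.0 / len(clusters) for k in clusters}

# Hypothetical history: one cluster label per past trip.
labels = ["home-work", "home-work", "home-work", "home-dojo"]
prior_freq = frequency_prior(labels)
prior_unif = uniform_prior(labels)
```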

3.3 Bayesian Updates

Bayes’ law relates the quantity we are ultimately interested in – the probability of the current trip belonging to $C_k$ given what we already know about the trip, i.e. $\mathbb{P}(C_k \mid s_{1:t})$ – to the quantities estimated from $\mathcal{H}$ – the likelihood $\mathbb{P}(s_{1:t} \mid C_k)$ and the prior $\mathbb{P}(C_k)$ – by

$$\mathbb{P}(C_k \mid s_{1:t}) = \frac{\mathbb{P}(s_{1:t} \mid C_k)\, \mathbb{P}(C_k)}{\mathbb{P}(s_{1:t})}.$$

A subtle point concerns the normalisation $\mathbb{P}(s_{1:t})$, the probability of the trip observed so far with no further assumptions on its nature: by computing the numerators and then normalising them to sum up to one, i.e. by imposing

$$\sum_{k=1}^{K} \mathbb{P}(C_k \mid s_{1:t}) = 1,$$

we implicitly assume that every trip is indeed in one of the already known clusters. As a simple fix, we could introduce the probability of any trip not belonging to any known cluster as a constant $p_u$, in which case we’d have $\sum_{k=1}^{K} \mathbb{P}(C_k \mid s_{1:t}) = 1 - p_u$, simply a normalisation to a different constant. We proceed without accounting for unknown clusters, but keep this implicit assumption in mind.
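The update and the normalisation discussed above can be sketched in a few lines. The function name `posterior`, the dictionary-based interface, and the optional argument `p_unknown` (the constant mass reserved for unknown clusters) are assumptions of this example, not part of the described algorithm.

```python
def posterior(likelihoods, priors, p_unknown=0.0):
    """Bayesian update: normalise likelihood * prior over the known clusters.

    With p_unknown > 0, the posteriors sum to 1 - p_unknown instead of 1.
    """
    joint = {k: likelihoods[k] * priors[k] for k in likelihoods}
    total = sum(joint.values())
    if total == 0.0:
        # Every cluster assigns zero likelihood: the posterior is 0/0.
        raise ValueError("posterior undefined: all likelihoods are zero")
    return {k: (1.0 - p_unknown) * v / total for k, v in joint.items()}

# Hypothetical numbers for two clusters.
post = posterior({"A": 0.02, "B": 0.06}, {"A": 0.5, "B": 0.5})
post_u = posterior({"A": 0.02, "B": 0.06}, {"A": 0.5, "B": 0.5}, p_unknown=0.1)
```

The explicit error for an all-zero numerator corresponds to the situation in which no known cluster can explain the observed segments at all.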

3.4 Stopping Criterion

If, when, and how the updates should be stopped and the final prediction announced depends on the application. If the goal is route planning, we need to be reasonably certain of the unique destination before planning a route. On the other hand, if the goal is identifying risks along the remaining journey, it is sufficient to narrow down the set of possible routes and check for risks along all of them; predictions can be made continuously in this scenario. Other applications might call for other measures.

Here, we apply the simple criterion that as soon as one of the clusters’ posterior probabilities $\mathbb{P}(C_k \mid s_{1:t})$ exceeds a threshold $1 - \alpha$, prediction stops and returns cluster $C_k$ as the result, which should work well for the first application.
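A minimal sketch of this stopping rule, assuming the threshold is parametrised as `1 - alpha` and the posteriors are held in a dictionary; `decide` is a hypothetical helper name.

```python
def decide(posteriors, alpha):
    """Return the cluster whose posterior exceeds 1 - alpha, else None."""
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] > 1.0 - alpha else None

undecided = decide({"A": 0.25, "B": 0.75}, alpha=0.1)  # 0.75 does not exceed 0.9
decided = decide({"A": 0.02, "B": 0.98}, alpha=0.1)    # 0.98 exceeds 0.9
```

For the second application mentioned above (risk identification), one would instead return every cluster whose posterior is non-negligible; that variant is not shown here.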

4 Cluster Prediction Algorithm

In order to derive a concrete algorithm for cluster prediction, a choice of stochastic process model has to be made, and the models have to be trained. In this section, we describe this for the case of Markov chains.

4.1 Modelling Clusters as Markov Chains

Modelling clusters by Markov chains is, of course, a simplification of reality, but as we shall see it leads to a computationally very tractable algorithm and performs very well in our computational experiments in the next section. The simplifying assumption is the “Markov assumption”: if the current trip belongs to a cluster $C_k$, the probability distribution of the next road segment $S_{t+1}$ (the capital “$S$” is used because it denotes a random variable) depends only on the current road segment $s_t$.

More formally, the state space of the Markov chain is $\mathcal{S}$, where $n = |\mathcal{S}|$ is the total number of road segments in the road network under consideration. Every trip $\tau$, or rather its sequence of road segments, then corresponds to a trajectory of the Markov chain. If a trip belongs to a certain cluster $C_k$, e.g. “scenic route from home to work,” and if the trip so far has been $s_{1:t}$, then, in full generality, the probability distribution of the next road segment, given all that is known at $t$, is $\mathbb{P}(S_{t+1} \mid s_{1:t}, C_k)$. In modelling this as a Markov chain, we are imposing that

$$\mathbb{P}(S_{t+1} = j \mid s_{1:t}, C_k) = \mathbb{P}(S_{t+1} = j \mid S_t = s_t, C_k) =: p^k_{s_t j}.$$

Then, $P^k$ is the transition probability matrix of cluster $C_k$, and $p^k_{ij}$ is the probability to turn into road $j$ from road $i$, if the current trip is in cluster $C_k$.

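The transition probabilities are estimated by

$$p^k_{ij} = \frac{\#\{\text{transitions from } i \text{ to } j \text{ in the trips in } C_k\}}{\#\{\text{transitions from } i \text{ in the trips in } C_k\}}. \tag{4}$$

This very intuitive estimate is in fact the maximum-likelihood (ML) estimate, see e.g. [9]. If a road segment $i$ never appears (or more precisely, is never transitioned from) in any trip in $C_k$, then (4) cannot be applied; for now, we can just set these $p^k_{ij}$ to 0, and return to this issue in Sec. 4.2.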
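The counting estimate above can be sketched as follows for one cluster. The example assumes that road segments have already been relabelled as integers `0, ..., n-1`; the function name and the toy trips are hypothetical.

```python
import numpy as np

def estimate_transitions(trips, n):
    """Estimate a cluster's transition matrix from transition counts.

    trips: list of segment-index sequences, all from the same cluster.
    Rows of segments that are never transitioned from are left at zero.
    """
    counts = np.zeros((n, n))
    for trip in trips:
        for i, j in zip(trip[:-1], trip[1:]):
            counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Divide each row by its total, skipping rows without any transition.
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

# Hypothetical cluster with three trips over four segments.
P_hat = estimate_transitions([[0, 1, 2], [0, 1, 3], [0, 2, 3]], n=4)
```

Here segment 3 is only ever a final segment, so its row stays at zero, which is exactly the case deferred to Sec. 4.2.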

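For the initial probabilities $\pi^k$, the (ML) estimate is

$$\pi^k_i = \frac{\#\{\text{trips in } C_k \text{ starting with segment } i\}}{|C_k|}, \tag{5}$$

but this assumes that the prediction task always starts at the beginning of a trip; if for whatever reason the first few links of a trip are missing in the data, this might lead to a failure of the prediction. E.g. by choosing

$$\pi^k_i = \begin{cases} 1/|\mathcal{S}_k| & \text{if } i \in \mathcal{S}_k, \\ 0 & \text{otherwise,} \end{cases}$$

where $\mathcal{S}_k$ is the set of road segments appearing in the trips in $C_k$, i.e. making the initial probability uniform over all road segments that appear in the cluster, or even setting $\pi^k_i = 1/n$ for all $i$, we avoid this problem and allow for prediction to start during a trip; we thus treat $\pi^k$ somewhat heuristically as a design parameter. The likelihood function is then given by

$$\mathbb{P}(s_{1:t} \mid C_k) = \pi^k_{s_1} \prod_{r=2}^{t} p^k_{s_{r-1} s_r},$$

or recursively by

$$\mathbb{P}(s_{1:t} \mid C_k) = \mathbb{P}(s_{1:t-1} \mid C_k)\, p^k_{s_{t-1} s_t}. \tag{7}$$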

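procedure Training($\mathcal{H}$, $\mathcal{C}$, $\varepsilon$)
    for all $C_k \in \mathcal{C}$ do
        compute $P^k$, $\pi^k$ by (4), (5), (8)
procedure Prediction($\alpha$)
    Initialisation
    $t \leftarrow 0$, $s_0 \leftarrow$ None
    for all $C_k \in \mathcal{C}$ do
        $L_k \leftarrow 1$, $q_k \leftarrow \mathbb{P}(C_k)$
    Prediction loop
    while $\max_k q_k < 1 - \alpha$ do
        if trip is finished then
            return None
        else
            wait until new segment is received
            $t \leftarrow t + 1$, $s_t \leftarrow$ new segment
            for all $C_k \in \mathcal{C}$ do
                if $t = 1$ then
                    $L_k \leftarrow \pi^k_{s_1}$
                else
                    $L_k \leftarrow L_k \, p^k_{s_{t-1} s_t}$        (see Eq. (7))
                $q_k \leftarrow L_k \, \mathbb{P}(C_k)$        (Bayesian update)
            for all $C_k \in \mathcal{C}$ do
                $q_k \leftarrow q_k / \sum_{j} q_j$        (Normalization)
    return $\arg\max_k q_k$
Algorithm 1 Cluster prediction using Markov chains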

4.2 Unseen Transitions and Unseen Road Segments

When a transition that has never occurred in the training data occurs in the current trip to be predicted, the algorithm described so far will break down, because the likelihood will drop to $0$ for all clusters $C_k$, and hence the posteriors will be undefined as $0/0$. This will be a rather common situation: the data is not perfect, hence segments visited in reality could be missing in the data, GPS data could be mapped to the wrong way ID, small detours could take the driver along never before visited roads, etc. Similar to the PageRank algorithm [10] and as in [1], we address this problem by introducing a small $\varepsilon > 0$ and adding it to each transition probability (except self-loops, i.e. transitions $i \to i$), even the ones never observed. The matrices then have to be re-normalized to obtain stochastic matrices again, however now every probability $p^k_{ij} > 0$ for $i \neq j$. Formally, for all $k$:

$$p^k_{ij} \leftarrow \frac{p^k_{ij} + \varepsilon\,(1 - \delta_{ij})}{\sum_{l=1}^{n} \bigl( p^k_{il} + \varepsilon\,(1 - \delta_{il}) \bigr)}, \tag{8}$$

where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise.

If the prediction algorithm receives a road segment that has never been seen, i.e. if $s_t \notin \mathcal{S}$, this is treated in much the same way by extending the likelihood function, or rather the transition probability matrices $P^k$, to assign a minimum probability to transitions to or from unseen links. This can be addressed very easily in the code by inserting an if statement before updating the likelihood, but also formally: we add a state $u$ for “unseen segment” to $\mathcal{S}$ to obtain $\tilde{\mathcal{S}} = \mathcal{S} \cup \{u\}$, and add a column and row with all entries equal to $\varepsilon$ to every $P^k$ (note that a transition from $u$ to $u$ is now allowed, as it corresponds to two previously unseen segments in sequence, and not necessarily a repetition of the current segment). Every link with an ID not in $\mathcal{S}$ is then mapped to $u$ before the likelihood is computed.
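The two modifications described in this section can be sketched together as below. This is one possible reading, not necessarily the exact formula used by the authors: the small probability is called `eps` here, it is added to every entry except the self-loops of known segments, an extra last row and column represent the "unseen segment" state, and rows are then re-normalised. The function name `smooth` is hypothetical.

```python
import numpy as np

def smooth(P, eps):
    """Add eps to all non-self-loop transitions, append an 'unseen' state,
    and re-normalise every row to sum to 1."""
    n = P.shape[0]
    Q = np.zeros((n + 1, n + 1))        # index n is the "unseen segment" state
    Q[:n, :n] = P
    bump = eps * (1.0 - np.eye(n + 1))  # eps everywhere except the diagonal
    bump[n, n] = eps                    # unseen -> unseen is allowed
    Q += bump
    return Q / Q.sum(axis=1, keepdims=True)

# Hypothetical 2-segment cluster; segment 1 is never transitioned from.
Q = smooth(np.array([[0.0, 1.0],
                     [0.0, 0.0]]), eps=0.01)
```

Note that a row that was all zero before smoothing becomes uniform over the allowed targets, so a cluster that has never seen a segment expresses no preference about where the trip goes next.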

Pseudocode of the full algorithm, including the modifications described here, is shown as Algorithm 1.
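For readers who prefer code to pseudocode, the following is a compact sketch of a predictor in the spirit of Algorithm 1, under several stated assumptions: uniform priors and uniform initial probabilities (so the first segment does not change the posteriors), smoothing applied on the fly rather than to stored matrices, and log-probabilities for numerical stability. The class name, the `UNSEEN` label, the parameter names `eps` and `alpha`, and the toy trips are all hypothetical.

```python
import math

UNSEEN = "<unseen>"  # hypothetical label for segments absent from the history

class MarkovClusterPredictor:
    def __init__(self, eps=1e-3, alpha=0.05):
        self.eps = eps      # small probability added to every transition
        self.alpha = alpha  # stop once a posterior exceeds 1 - alpha

    def fit(self, trips, labels):
        self.known = {s for trip in trips for s in trip}
        self.clusters = sorted(set(labels))
        self.pair = {k: {} for k in self.clusters}  # counts of i -> j
        self.row = {k: {} for k in self.clusters}   # counts of i -> anything
        for trip, k in zip(trips, labels):
            for i, j in zip(trip[:-1], trip[1:]):
                self.pair[k][(i, j)] = self.pair[k].get((i, j), 0) + 1
                self.row[k][i] = self.row[k].get(i, 0) + 1
        return self

    def _transition(self, k, i, j):
        """Smoothed probability of i -> j in cluster k."""
        m = len(self.known) + 1                  # states including UNSEEN
        total = self.row[k].get(i, 0)
        raw = self.pair[k].get((i, j), 0) / total if total else 0.0
        allowed = m if i == UNSEEN else m - 1    # self-loops get no eps
        bump = self.eps if (i != j or i == UNSEEN) else 0.0
        return (raw + bump) / ((1.0 if total else 0.0) + self.eps * allowed)

    def predict(self, segments):
        """Return (cluster or None, number of segments consumed)."""
        log_post = {k: -math.log(len(self.clusters)) for k in self.clusters}
        prev = None
        for t, seg in enumerate(segments, start=1):
            s = seg if seg in self.known else UNSEEN
            if s == prev and s != UNSEEN:
                continue                         # skip repeated segments
            if prev is not None:                 # uniform initial probs cancel
                for k in self.clusters:
                    log_post[k] += math.log(self._transition(k, prev, s))
            top = max(log_post.values())         # normalise in log-space
            z = top + math.log(sum(math.exp(v - top) for v in log_post.values()))
            log_post = {k: v - z for k, v in log_post.items()}
            best = max(log_post, key=log_post.get)
            if math.exp(log_post[best]) > 1.0 - self.alpha:
                return best, t
            prev = s
        return None, len(segments)

# Hypothetical history: two routes between the same origin and destination.
trips = [["a", "b", "c", "d"], ["a", "b", "c", "d"],
         ["a", "e", "f", "d"], ["a", "e", "f", "d"]]
labels = ["north", "north", "south", "south"]
model = MarkovClusterPredictor(eps=1e-3, alpha=0.05).fit(trips, labels)
result = model.predict(["a", "b", "c", "d"])
```

In this toy example the two routes diverge at the second segment, so the prediction is made as soon as that segment is received; a trip containing a never-seen segment needs more segments before the threshold is crossed.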

5 Computational Experiments

We now use the dataset in [1] to test the proposed algorithm. A short description is given below; for more details, see [1].

Figure 1: Failure rates and mean percentage of links needed to make a prediction after 8 rounds of 50-50 cross validation. Note the different view angles of the two axes.

5.1 Data

Seven origins and destinations across Dublin were selected, representing typical points of interest such as “home,” “work,” “childcare,” etc. This yields a total of 21 possible origin/destination pairs, for 17 of which up to 3 distinct routes were generated. These routes were then fed into the microscopic traffic simulator SUMO [11] to generate a total of $N$ trips in the form of timestamps and longitude/latitude coordinates. To simulate real GPS data, uniformly distributed noise (on a disk of radius 10m) was added to each point.

To prepare the data in the form of $\mathcal{H}$ for our algorithm, the sequence of GPS points needs to be converted into a sequence of way IDs, which was done using the Map Matching operator of IBM Streams Geospatial Toolkit. Subsequently, duplicates were removed (i.e. if more than one consecutive GPS point was mapped to the same road segment, only the first instance was kept).
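The duplicate-removal step can be sketched with the standard library. The function name and the example way IDs are hypothetical, and any time stamps attached to the points are ignored here.

```python
from itertools import groupby

def drop_consecutive_duplicates(way_ids):
    """Keep only the first of each run of identical consecutive way IDs."""
    return [key for key, _ in groupby(way_ids)]

# Hypothetical map-matched sequence: several GPS points per road segment.
cleaned = drop_consecutive_duplicates([11, 11, 42, 42, 42, 7, 11, 11])
```

A segment that is revisited later in the trip (here way ID 11) is kept, since only consecutive repetitions are collapsed.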

5.2 Clustering

As in [1], we consider two types of clusters:

  • by Origin/Destination: Two trips belong to the same cluster iff they have the same origin and destination (as defined by proximity of their first and final GPS coordinates). This results in $K = 17$ clusters.

  • by Route: Similarity between two trips is measured by the ratio of shared road segments between the two trips and the total number of road segments in both trips. Hierarchical clustering is then performed, and the dissimilarity threshold is chosen to be 0.3, yielding $K = 30$ clusters.
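The route-based clustering can be illustrated with a simplified stand-in. Two assumptions are made here that the text does not confirm: the similarity is read as the Jaccard index of the two trips' segment sets, and single linkage is used, in which case cutting the dendrogram at a threshold is equivalent to taking connected components of the graph linking trips whose dissimilarity is at most that threshold. All names and the toy trips are hypothetical.

```python
def dissimilarity(trip_a, trip_b):
    """1 - (shared segments) / (segments in either trip); trips are non-empty."""
    a, b = set(trip_a), set(trip_b)
    return 1.0 - len(a & b) / len(a | b)

def cluster_trips(trips, threshold=0.3):
    """Single-linkage clustering cut at `threshold`, via union-find."""
    parent = list(range(len(trips)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(trips)):
        for j in range(i + 1, len(trips)):
            if dissimilarity(trips[i], trips[j]) <= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    roots = [find(i) for i in range(len(trips))]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# Hypothetical trips: the first two share 4 of 5 segments (dissimilarity 0.2).
route_labels = cluster_trips([[1, 2, 3, 4], [1, 2, 3, 4, 5], [7, 8, 9], [7, 8, 9]])
```

Other linkage criteria (complete or average linkage) would generally give different clusters at the same threshold.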


5.3 Prediction Results

Figure 2: Failure rate and fraction of links needed for prediction vs confidence level $\alpha$ for fixed $\varepsilon$ if clustering by Origin/Destination.
Figure 3: Failure rate and fraction of links needed for prediction vs confidence level $\alpha$ for fixed $\varepsilon$ if clustering by Route.
Figure 4: Failure rate and fraction of links needed for prediction vs $\varepsilon$ for fixed $\alpha$ if clustering by Origin/Destination (left) and Route (right).
Figure 5: Decreasing failure rates as more trips are added to the training data for clustering by Origin/Destination and Route, for fixed $\alpha$ and $\varepsilon$.
Figure 6: Fraction of links needed as more trips are added to the training data for clustering by Origin/Destination and Route, for fixed $\alpha$ and $\varepsilon$.

In all cases described below, we chose uniform probabilities for the prior $\mathbb{P}(C_k)$ and initial probabilities $\pi^k$, i.e. $\mathbb{P}(C_k) = 1/K$ and $\pi^k_i = 1/n$. More careful choices could certainly improve prediction results, but as we shall see below, this simplest of choices is sufficient to demonstrate the efficacy of the algorithm.

We collected two quantities of interest: the failure rate, defined as the number of false predictions (which includes cases where the end of a trip is reached without a prediction being made) divided by the number of all trips predicted, and the fraction of the trip completed at the time of prediction, i.e. #(road segments visited)/#(total road segments in trip). If a wrong prediction was made, we recorded the number of links needed as NaN; these points are then excluded from the computation of averages.
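The two quantities can be computed as in the sketch below. The record layout `(predicted, actual, links_used, trip_length)`, with `predicted` set to `None` when no prediction was made, is an assumption of this example, as are the numbers.

```python
import math

def evaluate(results):
    """Return (failure rate, mean fraction of trip needed over correct predictions)."""
    failures = 0
    fractions = []
    for predicted, actual, links_used, trip_length in results:
        if predicted == actual:
            fractions.append(links_used / trip_length)
        else:
            failures += 1               # wrong or missing prediction
            fractions.append(math.nan)  # recorded as NaN, excluded below
    valid = [f for f in fractions if not math.isnan(f)]
    failure_rate = failures / len(results)
    mean_fraction = sum(valid) / len(valid) if valid else math.nan
    return failure_rate, mean_fraction

# Hypothetical outcomes for four test trips.
failure_rate, mean_fraction = evaluate([
    ("A", "A", 2, 10),    # correct after 20% of the trip
    ("B", "B", 3, 10),    # correct after 30% of the trip
    (None, "A", 10, 10),  # trip ended without a prediction
    ("B", "A", 4, 10),    # wrong prediction
])
```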

In order to get an idea of good ranges for the parameters $\alpha$ and $\varepsilon$, we performed a small initial cross-validation experiment by training on only 50% of the data and then predicting the route of the remaining 50%. This was done for 8 random choices of training and testing set, on a grid of values $(\alpha, \varepsilon)$. The results are shown in Fig. 1 and indicate that the approach is robustly effective over a wide range of values for $\alpha$ and $\varepsilon$: with parameters in a reasonable range, the routes are predicted accurately in more than 99% of the test cases within the first 20% of the trip. The results also show “breakdown points”: if the confidence level $\alpha$ is increased to the point where a posterior probability of 0.6 is already sufficient for prediction, the number of false predictions increases rapidly; on the other hand, if the small probability parameter $\varepsilon$ is chosen so large that it dominates the probabilities estimated from data, the number of links necessary to distinguish routes increases rapidly.

To investigate further, we then performed leave-one-out cross-validation along two “slices” on a finer grid: for each of the $N$ trips, we trained the Markov chains on the remaining $N - 1$ trips; then, the trip that was left out was predicted. This was done for fixed $\varepsilon$ and a finer grid of values for $\alpha$, and for fixed $\alpha$ and a finer grid of values for $\varepsilon$. The entire procedure was repeated once for clustering by Origin/Destination, and once for clustering by Route. The results are shown in Figs. 2-4, and we observe:

  • The prediction accuracy is very robust with respect to $\alpha$: the failure rate is below 1% for a wide range of $\alpha$, and the accuracy declines rapidly only once a breakdown point is reached.

  • The same goes for the percentage of links needed to make a correct prediction. Additionally we note that, as should be expected, lower confidence in prediction (i.e. larger values of $\alpha$) tends to lead to fewer links needed for prediction; however this effect only kicks in once the failure rate increases quickly.

  • Predicting the origin and destination appears to be slightly easier, since it consistently requires fewer links to do so. This is not surprising as there are only 17 clusters to choose from, whereas in the case of predicting the route there are 30.

  • The robustness with respect to $\varepsilon$ is even more pronounced: Fig. 4 indicates that, once a good value for $\alpha$ is selected, neither the failure rate nor the number of links needed for prediction depends on $\varepsilon$.

As a last experiment, we attempted to simulate the realistic situation of an in-car system improving its prediction model with each taken trip by incrementally moving trips from the testing set to the training set. Specifically, for the first data point, we trained on the first trip only and predicted the remaining ones. This obviously led to an immediate wrong prediction of almost all trips. We then added the second trip, retrained and predicted the remaining 779 trips, and so on. The results are shown in Figs. 5 and 6, and we see that once roughly 10% of the trips have been taken (so around 75 trips), the algorithm predicts at least 90% of the remaining trips correctly while needing on average between 15 and 20% of the trip to be completed to make its prediction.

6 Conclusions

The contribution is twofold: on the one hand, we introduce the flexible framework of modelling route patterns as stochastic processes and using the associated likelihoods to update a posterior probability; on the other, we present a concrete algorithm, obtained by modelling the stochastic processes as Markov chains.

The flexibility of the approach is only touched upon here. There are many possible extensions, which should be explored once a richer dataset is available: even though we worked on the generation of a realistic dataset (see Sec. 5.1), it is still a synthetic one, and the excellent performance of the presented algorithm on it is hard to improve upon. Improvements to be explored using a more challenging, real dataset include:

  • Other stochastic process models can be used. Indeed, the approach taken in [1] fits into the outlined framework, if the choice of stochastic process is a naive Bayes model, i.e. if assuming that $\mathbb{P}(s_{1:t} \mid C_k) = \prod_{r=1}^{t} \mathbb{P}(s_r \mid C_k)$. We intend to test other stochastic process models, such as the recently developed closed-loop Markov modulated Markov chains [12], in the near future.

  • So far, the available context is not used at all. For future practical applications however, the prior probabilities should be made dependent on such contextual variables as the day of the week or the time of the day, for instance by setting $\mathbb{P}(C_k \mid \text{weekday}) \propto |\{\tau \in C_k : \tau \text{ occurred on a weekday}\}|$, so that the prior probability if the current trip occurs on a weekday would be made proportional to the relative frequency of trips in $C_k$ among previous weekday trips. [1] has further details on context and its inclusion; there, however, the contextual variables influence the stochastic process model directly instead of entering via the prior.

  • The initial probabilities $\pi^k_i$ and small probabilities $\varepsilon$ can be shaped to be larger for roads that are not on, but close to, the roads in cluster $C_k$, and smaller for roads that are far away. This can be expected to improve convergence of the posterior probabilities.

Overall, the success of the approach without tapping into such extensions encourages further research.


References

  • [1] Y. Lassoued, J. Monteil, Y. Gu, G. Russo, R. Shorten, and M. Mevissen, “Hidden Markov model for route and destination prediction,” in IEEE International Conference on Intelligent Transportation Systems, 2017.
  • [2] Y. Deguchi, K. Kuroda, M. Shouji, and T. Kawabe, “HEV charge/discharge control system based on navigation information,” in Convergence International Congress & Exposition On Transportation Electronics. Convergence Transportation Electronics Association, Oct. 2004.
  • [3] J. Froehlich and J. Krumm, “Route prediction from trip observations,” in SAE Technical Paper. SAE International, Apr. 2008.
  • [4] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, “Limits of predictability in human mobility,” Science, vol. 327, no. 5968, pp. 1018–1021, 2010.
  • [5] R. Simmons, B. Browning, Y. Zhang, and V. Sadekar, “Learning to predict driver route and destination intent,” in 2006 IEEE Intelligent Transportation Systems Conference, Sept. 2006, pp. 127–132.
  • [6] J. Krumm, “A Markov model for driver turn prediction,” SAE Technical Paper, Tech. Rep., 2008.
  • [7] OpenStreetMap Wiki, “Way - OpenStreetMap Wiki,” 2017, accessed Nov 24.
  • [8] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
  • [9] T. W. Anderson and L. A. Goodman, “Statistical inference about Markov chains,” The Annals of Mathematical Statistics, vol. 28, no. 1, pp. 89–110, Mar. 1957.
  • [10] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web,” Stanford InfoLab, Tech. Rep. 1999-66, Nov. 1999, previous number = SIDL-WP-1999-0120.
  • [11] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent development and applications of SUMO - Simulation of Urban MObility,” International Journal On Advances in Systems and Measurements, vol. 5, no. 3&4, pp. 128–138, Dec. 2012.
  • [12] J. Epperlein, R. Shorten, and S. Zhuk, “Recovering Markov Models from Closed-Loop Data,” ArXiv e-prints, June 2017.