Vehicles are equipped with an increasing number of sensors and electronics to react dynamically to changing road conditions and to increase driver safety. As a result, large volumes of driver-specific data related to driving conditions and driver behavior are generated. We are interested in analyzing this data to learn models of driving behavior. Such models could be used to anticipate dangerous situations, to improve the driving schedule of a person, and to tailor various aspects of the driving experience to the individual.
Here, we use data collected from one vehicle’s sensors over numerous trips to construct a Hierarchical Dirichlet Process (HDP) model of driving behavior and road conditions. HDPs are commonly used for topic modeling of text corpora [1, 2, 3, 4] to uncover the set of topics that comprise each document in the corpus. In our case, the documents are road segments and the words are the associated quantized sensor measurements. The topics in the HDP model are sensor distributions in the road segments; these distributions capture the driving conditions in each road segment as encountered by the driver, as well as the driver’s behavior. To our knowledge, this is a new approach for modeling driving behavior. Unlike related work, which is based on assumptions about the capabilities and behaviors of humans (see [5] for an overview), our model is purely data driven.
The hierarchy within the HDP model allows sharing of measurements across similar road segments. This is an appealing aspect of the model, since it enables us to learn an expressive model for road segments which are visited rarely via similar road segments that are visited more often. In order to utilize an HDP model, we first organize the sensor data into “documents” (i.e., road segments and their associated quantized measurements). We consider the case in which a road map is not available; however, it is straightforward to incorporate such information. Additionally, typical drivers often traverse only a small subset of the roads in the road network. We use a Hidden Markov Model (HMM) to learn the road segments. The HMM condenses position information from recorded trips into road segment states. The set of hidden states effectively corresponds to a sparse road network which consists only of the roads which the driver has traversed. We then use the trained HMM to associate sensor measurements with road segments to produce “documents” for the HDP model.
In addition to organizing the data for the HDP model, the HMM also provides insight into driver behavior such as typical routes and probable destinations. Special hidden states are introduced in the HMM to represent starting locations (sources) and destinations. Consequently, identifying the most likely route between two states and finding the distribution over probable destinations become well-posed questions, which allows us to make route and destination predictions.
The contributions of this paper are (1) to show how sparsity in the HMM transition matrix, together with starting and absorbing states, leads to accurate long-term predictions of driver routes and destinations, and (2) the novel application of an HDP split-merge sampler to model the joint distribution of quantized vehicle signals in a way that scales to a large number of road segments.
II Hidden Markov Model
An HMM is used to model the trips that a driver takes through a road network. We explore two models for the HMM. In the first model, the hidden state corresponds to a road segment, a start location, or a destination. In this model, the future path is independent from the past path when conditioned on the current road segment. We expect this to be a poor model of driver behavior since this is likely an oversimplification; the past can provide considerable information about the future. For instance, drivers often do not return to a previously visited state within a trip (unless they are lost).
In our second model, we attempt to capture more of the trip history in the current state by augmenting the road states with the start location. Under this model, the road segment at the next time instance depends only on the current road segment and the start location. We will show that this model is more representative of driver behavior and provides accurate predictions of destinations and routes. We describe this second model below. The first model is a simplification of the described model.
II-A Hidden States
Each hidden variable, $z_t$, in the HMM (see Fig. 3) takes on a value from the set of hidden states, $\mathcal{Z}$. Source states, $\mathcal{S}$, destination states, $\mathcal{D}$, and road segment states augmented by the source state, $\mathcal{R} \times \mathcal{S}$, compose the set of hidden states: $\mathcal{Z} = \mathcal{S} \cup \mathcal{D} \cup (\mathcal{R} \times \mathcal{S})$. Destination states are absorbing states, which are indicated by key-off events in the data. Similarly, source states are indicated by key-on events. The distribution over the initial state, $z_1$, is parameterized as $p(z_1 = i) = \pi^0_i$. Conditioned on the current state, $z_t = i$, the distribution for the next state, $z_{t+1}$, is parameterized as
$$p(z_{t+1} = j \mid z_t = i) = \pi_{ij}.$$
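Since physically realizable transitions occur only between road segments in close proximity, we would expect most transition probabilities to be zero. We use a Dirichlet prior on the transition parameters of each state $i$, $\pi_i = (\pi_{i1}, \ldots, \pi_{iK})$ over the $K$ hidden states, with $\alpha < 1$ to favor a sparse transition matrix:
$$\pi_i \sim \mathrm{Dir}(\alpha, \ldots, \alpha).$$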
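The effect of a small concentration parameter can be illustrated numerically. The following is a minimal sketch, not part of the original experiments: the number of states, the two concentration values, and the helper names `sample_transition_rows` and `near_zero_fraction` are all assumptions chosen for illustration.

```python
import numpy as np

def sample_transition_rows(num_states, alpha, rng):
    """Draw a transition matrix whose rows are i.i.d. symmetric Dirichlet(alpha)."""
    return rng.dirichlet(np.full(num_states, alpha), size=num_states)

def near_zero_fraction(matrix, tol=1e-3):
    """Fraction of entries below tol, used here as a rough proxy for sparsity."""
    return float(np.mean(matrix < tol))

rng = np.random.default_rng(0)
# Hypothetical settings: 50 states, one small and one large concentration value.
sparse_rows = sample_transition_rows(50, alpha=0.1, rng=rng)
dense_rows = sample_transition_rows(50, alpha=5.0, rng=rng)

print("alpha=0.1:", near_zero_fraction(sparse_rows))
print("alpha=5.0:", near_zero_fraction(dense_rows))
```

With a concentration below one, most of the mass in each sampled row falls on a few successor states, which is the behavior we would expect of a road network.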
II-B Observation Model
Each trip contains measurements of position, $x_t$, and heading, $h_t$. When the vehicle has GPS, the recorded position is the GPS position; otherwise, the reported position is obtained by dead reckoning, a process which estimates position by combining the previous position with aggregated incremental changes in a relative coordinate system. Positions which are inferred using dead reckoning are indicated by an inferred-position indicator, $q_t$; for such measurements, we model a larger uncertainty associated with the measurement.
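In addition to position and heading measurements, there is a key-on event at the start of each trip, which indicates that the hidden state must be from the set of source states. In the measurement model, we have a binary key-on indicator variable, $k^{\mathrm{on}}_t$, which takes value 1 if a key-on event occurs at time $t$. Similarly, there is a key-off event and corresponding indicator variable, $k^{\mathrm{off}}_t$, which indicates that the hidden state is from the set of destination states. This set of measurements comprises the observation $y_t = (x_t, h_t, q_t, k^{\mathrm{on}}_t, k^{\mathrm{off}}_t)$, where $x_t$ is the position, $h_t$ the heading, and $q_t$ the inferred-position indicator. Conditioned on the hidden state $z_t$, the measurement model is as follows:
$$p(y_t \mid z_t) = p(x_t \mid q_t, z_t)\, p(h_t \mid z_t)\, p(q_t \mid z_t)\, p(k^{\mathrm{on}}_t \mid z_t)\, p(k^{\mathrm{off}}_t \mid z_t).$$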
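Note that the conditional distribution for position depends on the value of the inferred-position indicator; a larger uncertainty is associated with the position when the position has been inferred. Position $x_t$ is Gaussian with state-dependent parameters:
$$x_t \mid z_t = i,\, q_t \;\sim\; \mathcal{N}\!\left(\mu_i,\; c^{\,q_t}\, \Sigma_i\right),$$
where $q_t \in \{0, 1\}$ is the inferred-position indicator and $c$ is a constant used to capture the increase in uncertainty of the inferred position.
Heading $h_t$ is also Gaussian with its own state-dependent parameters:
$$h_t \mid z_t = i \;\sim\; \mathcal{N}\!\left(\nu_i,\; \sigma_i^2\right).$$
The inferred-position indicator has a Bernoulli distribution with state-dependent parameter $\rho_i$. The key-on and key-off measurements are indicators of source and destination states, respectively: the key-on indicator equals 1 exactly when the hidden state is a source state, and the key-off indicator equals 1 exactly when the hidden state is a destination state, so both have degenerate distributions.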
We would expect measurements arising from the same physical location (road segment) to have parameters which do not depend on the source state. Therefore, the measurement parameters are independent of the source state when conditioned on the road segment state. That is, $\mu_{(r,s)} = \mu_{(r,s')}$ for all source states $s, s'$, where $(r, s)$ denotes road segment $r$ augmented by source state $s$, and likewise for the other measurement parameters. When we estimate the measurement parameters for a road segment, this formulation allows us to aggregate observations from trips which start at different locations but share this physical road.
Similarly, there will be pairs of source and destination states that correspond to the same physical location. If a trip ends at a given location, the next trip will typically start from the same location. Since such a pair of states shares physical properties, the two states share measurement parameters.
II-C EM Updates
Given the volume of data under consideration, we find that an EM formulation using explicit state assignments provides a tractable learning approach. This approach yields locally optimal values for the set of HMM parameters using measurements from the recorded trips. The EM updates consist of iteratively finding the most likely assignment for the hidden states given the previous parameter estimates, then using these assignments to improve the parameter estimates. The reader is referred to [6] for an introduction to the EM algorithm.
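One standard way to compute the most likely assignment of hidden states in the first half of each iteration is the Viterbi algorithm. The sketch below is an illustration under assumptions, not the implementation used here: it takes log initial probabilities, log transition probabilities, and per-time-step log observation likelihoods as dense arrays, and the function name `viterbi` and its argument names are hypothetical.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Return the most likely state sequence of an HMM.

    log_init:  (K,)   log probability of each initial state
    log_trans: (K, K) log probability of moving from state i (row) to j (column)
    log_obs:   (T, K) log likelihood of the observation at time t under state k
    """
    num_steps, num_states = log_obs.shape
    score = log_init + log_obs[0]                 # best log score ending in each state
    back = np.zeros((num_steps, num_states), dtype=int)
    for t in range(1, num_steps):
        cand = score[:, None] + log_trans         # cand[i, j]: come from i, go to j
        back[t] = np.argmax(cand, axis=0)         # best predecessor for each j
        score = np.max(cand, axis=0) + log_obs[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score))]
    for t in range(num_steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Given such an assignment, the second half of the iteration re-estimates each state's measurement parameters from the observations assigned to it and the transition parameters from the observed transition counts.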
To initialize the parameters, we run DP-means [7] to cluster the measurements based on position and heading; a state is created from each cluster. DP-means allows us to initialize the model without pre-specifying the number of states. Measurements assigned to a cluster (state) are used to calculate initial values for that state's measurement model parameters. The transition matrix is initialized as a full matrix with higher probability for states which are closer together. The distribution for the first state is initialized as a uniform distribution.
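To make the initialization step concrete, the following is a minimal sketch of DP-means in the spirit of [7]. It assumes plain Euclidean features and a squared-distance penalty; how position and heading are scaled against each other, and how heading wrap-around is handled, are not specified here and are left out. The function name `dp_means` and the `penalty` parameter name are hypothetical.

```python
import numpy as np

def dp_means(points, penalty, max_iters=100):
    """k-means-like clustering in which a point whose squared distance to every
    centroid exceeds `penalty` opens a new cluster."""
    points = np.asarray(points, dtype=float)
    centroids = [points.mean(axis=0)]            # start with one global cluster
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iters):
        changed = False
        for n, x in enumerate(points):
            d2 = [float(np.sum((x - c) ** 2)) for c in centroids]
            k = int(np.argmin(d2))
            if d2[k] > penalty:                  # too far from everything: new cluster
                centroids.append(x.copy())
                k = len(centroids) - 1
            if labels[n] != k:
                labels[n] = k
                changed = True
        # Recompute centroids and drop clusters that lost all their points.
        used = sorted(set(labels.tolist()))
        centroids = [points[labels == k].mean(axis=0) for k in used]
        remap = {k: i for i, k in enumerate(used)}
        labels = np.array([remap[k] for k in labels.tolist()])
        if not changed:
            break
    return np.array(centroids), labels
```

The penalty plays the role that the number of clusters plays in k-means: a larger value yields fewer, coarser states.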
II-D Predicting Routes and Destinations
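Using the HMM, we can predict a driver's route from state $i$ to state $j$ by identifying the sequence of states with the highest likelihood:
$$\hat{z}_{1:T} = \arg\max_{T,\; z_{1:T} \,:\, z_1 = i,\; z_T = j} \; \prod_{t=1}^{T-1} \pi_{z_t z_{t+1}},$$
where $\pi_{kl}$ denotes the probability of a transition from state $k$ to state $l$.
It is well known that this can be formulated as a shortest path problem by defining a graph on the hidden states with edge weights $-\log \pi_{kl}$.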
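The shortest path formulation can be sketched with Dijkstra's algorithm, since negative log probabilities are non-negative edge weights. This is an illustrative sketch only; the sparse dictionary-of-dictionaries representation of the transition matrix and the function name `most_likely_route` are assumptions.

```python
import heapq
import math

def most_likely_route(trans, source, target):
    """Dijkstra on edge weights -log(trans[i][j]).

    trans maps each state to a dict {next_state: transition probability}.
    Returns (route, probability of that route), or (None, 0.0) if unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, i = heapq.heappop(heap)
        if i in done:
            continue
        done.add(i)
        if i == target:
            break
        for j, p in trans.get(i, {}).items():
            if p <= 0.0:
                continue                      # impossible transition, no edge
            nd = d - math.log(p)
            if nd < dist.get(j, math.inf):
                dist[j] = nd
                prev[j] = i
                heapq.heappush(heap, (nd, j))
    if target not in dist:
        return None, 0.0
    route = [target]
    while route[-1] != source:
        route.append(prev[route[-1]])
    return route[::-1], math.exp(-dist[target])
```

Because the transition matrix is sparse, each state has only a few outgoing edges, so the search touches a small part of the graph.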
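Additionally, from any road segment state, $i$, we can find the probability of reaching any destination state, $d$; this is known as the absorption probability. The absorption probability, $a_{id}$, is the probability of reaching absorbing state, $d$, if the chain starts from state, $i$, and can be found by solving the following set of equations:
$$a_{id} = \sum_{j} \pi_{ij}\, a_{jd} \;\; \text{for non-absorbing } i, \qquad a_{dd} = 1, \qquad a_{d'd} = 0 \;\; \text{for absorbing } d' \neq d,$$
where $\pi_{ij}$ is the probability of a transition from state $i$ to state $j$.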
This gives us a probability distribution over destinations when we start from a given road state.
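For a finite chain, the absorption probabilities can be obtained by solving one linear system. The sketch below uses the standard partition of the transition matrix into transient-to-transient and transient-to-absorbing blocks; it assumes a dense matrix for brevity, and the function name `absorption_probabilities` is hypothetical.

```python
import numpy as np

def absorption_probabilities(trans, absorbing):
    """Solve (I - Q) B = R for an absorbing Markov chain.

    trans:     (K, K) row-stochastic transition matrix
    absorbing: indices of the absorbing (destination) states
    Returns (transient, B), where B[m, n] is the probability of being absorbed
    in absorbing[n] when starting from transient[m].
    """
    trans = np.asarray(trans, dtype=float)
    absorbing = list(absorbing)
    transient = [i for i in range(trans.shape[0]) if i not in absorbing]
    Q = trans[np.ix_(transient, transient)]   # transient -> transient
    R = trans[np.ix_(transient, absorbing)]   # transient -> absorbing
    B = np.linalg.solve(np.eye(len(transient)) - Q, R)
    return transient, B
```

Each row of `B` sums to one when every transient state can reach some destination, so a row can be read directly as a distribution over destinations.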
II-E Bayesian Nonparametric Topic Modeling of Car Signals
Thus far, we have formulated an HMM for driving behavior which has predictive aspects. The model is also used to organize the data into “documents” so that we can perform HDP topic modeling on the dataset. In this section, we discuss how we combine a standard HDP model with the use of an HMM to discover documents. We also relate our HDP model for car signals to the classical HDP topic model.
To bridge the gap between the classical HDP topic modeling of text corpora and the modeling of car signals, such as velocity, acceleration and rotational speed, note the following correspondences:
- Each learned road segment from the HMM is used as a document in the HDP model. To obtain the set of sensor measurements associated with a road segment (that is, the set of words in the document), we perform ML assignment of road states for the trips and assign the corresponding sensor measurements to those road states.
- Since the car signals are continuous quantities, we quantize them using DP-means [7] and use a discrete base measure equivalent to the classical text corpus topic model. This means that words are described by a multidimensional vector, which amounts to modeling the joint distribution over all signals.
II-F HDP Model
The HDP model describes a set of $J$ documents, where document $j$ contains $N_j$ words. In this context, the documents correspond to road segments and the words in a document correspond to quantized sensor measurements assigned to that road segment. The distribution of words in each document is modeled as a mixture of topic distributions.
The graphical model can be described as follows. At the top level, we have the concentration parameter $\gamma$ that determines the global topic proportions $\beta$ via the stick-breaking construction $\beta \sim \mathrm{GEM}(\gamma)$. Also shared at the top level are the global word distributions $\theta_k$. At the document level, $\pi_j$ is drawn from the DP with $\beta$ as the base measure: $\pi_j \sim \mathrm{DP}(\alpha, \beta)$, from which topic assignment labels $z_{ji} \sim \mathrm{Cat}(\pi_j)$ are sampled. Finally, the observed signals can be expressed as $w_{ji} \sim \mathrm{Cat}(\theta_{z_{ji}})$, where the topic assignment $z_{ji}$ indexes the corresponding word distribution $\theta_{z_{ji}}$. The dashed nodes are the auxiliary split-merge nodes that learn a two-component mixture model for each cluster. These “sub-clusters” are then used to propose splits and merges that are selected over time. This method combines a Gibbs sampler that is restricted to non-empty clusters with a Metropolis-Hastings (MH) algorithm that proposes sub-cluster splits and merges.
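The stick-breaking construction of the global topic proportions can be sketched as follows. This is a truncated illustration rather than the sampler used here: the truncation level, the concentration value, and the function name `stick_breaking` are assumptions.

```python
import numpy as np

def stick_breaking(concentration, num_sticks, rng):
    """Truncated stick-breaking draw of topic proportions.

    Each break takes a Beta(1, concentration) fraction of the stick that is
    still left. Returns (weights, remaining_mass), where remaining_mass is the
    total weight of all topics beyond the truncation.
    """
    fractions = rng.beta(1.0, concentration, size=num_sticks)
    # Length of stick left before each break: 1, (1-v1), (1-v1)(1-v2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - fractions)))
    weights = fractions * remaining[:-1]
    return weights, float(remaining[-1])

# Hypothetical usage: 20 sticks with concentration 1.0.
weights, leftover = stick_breaking(1.0, 20, np.random.default_rng(0))
```

A smaller concentration puts most of the mass on the first few topics, while a larger one spreads it over many topics.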
II-G Split-Merge HDP Sampler
Because of the large quantity of data, we use the highly parallelizable split-merge HDP sampler described in [2]. Augmenting the sample space with sub-clusters leads to proposals of likely splits and merges. A combination of a restricted Gibbs sampler (that does not create new clusters) with split/merge moves results in an ergodic Markov chain.
II-G1 Restricted Gibbs Sampler
Let denote the number of clusters in document with shared topic and let denote the number of words in document in cluster with topic . Then, the marginal counts and represent the number of words and topics in document , respectively. Extending the DA sampling algorithm results in the following restricted posterior distributions:
Since is not known analytically, we use the auxiliary variable . denotes unsigned Stirling numbers of the first kind. Note that the last components and aggregate the weight of all empty topics. Finally, denotes the set of indices in topic , and and denote the observation and prior distributions. The equations above can be sampled in parallel and fully specify the restricted Gibbs sampler.
II-G2 Subcluster Splits and Merges
For each topic , we fit two sub-topics and referred to as the left and right sub-clusters. Each topic is augmented with global sub-topic proportions , document-level sub-topic proportions , and sub-topic parameters . Moreover, each word is associated with sub-topic assignment . Then the marginal posterior distributions can be derived  as:
Notice the similarity between these equations and ones derived earlier. Inference is performed by interleaving the sampling equations with marginal posterior equations .
A Metropolis-Hastings framework proposes splits and merges of sub-clusters and either accepts or rejects them. Let and be a set of regular and auxiliary variables, respectively. Then a sampled proposal is accepted with probability:
II-H Performance Evaluation
To evaluate the performance of the HDP model, we compute the average log predictive probability of held-out words. To compute this probability, we split a test document into two sets: held-out words $w^{\mathrm{ho}}$ and observed words $w^{\mathrm{obs}}$. We then update the model using the observed words. This gives us the posterior parameters for the test document, which we in turn use to find the predictive distribution for the held-out words. We can then compute the probability of the held-out words given all training data $D$ as well as the observed words in this document:
$$p(w^{\mathrm{ho}} \mid w^{\mathrm{obs}}, D) = \prod_{w \in w^{\mathrm{ho}}} p(w \mid w^{\mathrm{obs}}, D),$$
where $p(w \mid w^{\mathrm{obs}}, D)$ is the conditional distribution of a held-out word under the posterior distribution of words in this document.
As a baseline for comparison with the HDP, we use a non-hierarchical model that assumes a Categorical distribution with a Dirichlet prior for the words in each road state. These distributions are modeled completely independently; they are not connected via a hierarchy as in the HDP model. This allows us to compute posterior Categorical distributions given the observed words in each road state.
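For the non-hierarchical baseline, the evaluation can be sketched in a few lines. The sketch assumes a symmetric Dirichlet prior and scores each held-out word under the posterior mean of the Categorical distribution; the prior value, the vocabulary size, and the function name `heldout_log_predictive` are assumptions made for illustration.

```python
import numpy as np

def heldout_log_predictive(observed, held_out, vocab_size, prior=1.0):
    """Average log predictive probability of held-out words for one road state.

    observed, held_out: sequences of integer word ids in [0, vocab_size)
    prior:              symmetric Dirichlet pseudo-count per word
    """
    counts = np.bincount(observed, minlength=vocab_size).astype(float)
    # Posterior mean of a Categorical distribution under a Dirichlet prior.
    predictive = (counts + prior) / (counts.sum() + prior * vocab_size)
    return float(np.mean(np.log(predictive[np.asarray(held_out)])))
```

A road state with few observed words stays close to the uniform prior under this baseline, which is the situation in which sharing statistics through a hierarchy would be expected to help.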
In the following we will first give results for the predictive power of the HMM model before we describe a topic model for the joint distribution of speed and time-of-day measurements.
III-A Dataset Description
Our dataset comprises 1K trips recorded from a standard car used by a single driver. Most of the routes are commutes to work, but there are also some longer-range trips outside the city.
The GPS position and heading measurements of the car are used to train the HMM. From the various other signals of the car, we selected quantized car velocity and time of day for the HDP topic model. These were selected because they contain interesting information about both the driving behavior and the driving situation in a road state.
III-B Predicting Routes and Destinations
To evaluate the quality of the learned HMM, we examine the ability of the HMM to predict the destination for 20 held-out trips under the two different models. Additionally, we compare the path of the held-out trips against the most likely route obtained from the transition matrix for the HMM.
Fig. 3 shows the performance of the two models on a held-out trip. The plots under the maps in the figure show the absorption probabilities for the probable destinations as a function of time. The maps above show the trip, the most likely route between the source and destination state, and the locations of the probable destinations. While the most likely path between source and destination from both models agrees with the observed trip trajectory, we observe that the augmented model is able to identify the correct destination sooner than the first model. In fact, the first model is able to correctly predict the destination after 10% of the trip for only 3 of the held-out trips while the augmented model is able to do so for 11 of the trips.
In Fig. 6, we show the most likely destination for each road segment. For the augmented model, since each state associated with a road segment also has a start location, we’ve chosen a particular start location to illustrate the differences between these two models. In particular, when starting from the specified start location, we see that trips which traverse beyond destination 7 in Fig. 6 (bottom) are more likely to terminate at a destination which is further from the starting location. The unaugmented model is unable to make this distinction, so trips which traverse road segments near destination 5 in Fig. 6 on the top (which corresponds to destination 7 in Fig. 6 on the bottom) are likely to terminate at that destination.
The results show that the most likely route obtained by the models frequently aligns exactly with the path of the held-out trips. This can be explained by the sparsity of the transition matrices: since each state can only transition to a few states, and very often to just one state, long-term predictions in this model are quite accurate.
III-C HDP Model
We quantize velocity and time-of-day measurements into words that can be fed into the HDP inference algorithm. Quantization is performed via DP-means clustering over the individual signals.
The speed measurements arrive at a rate of 1 Hz from the GPS sensor. There are 696k joint observations (velocity/time-of-day pairs) across all 12k road states. As can be seen in Fig. 1, these observations are distributed non-uniformly: we get many measurements on the daily commute route and few on highways leading outside the city. This means that the road-state corpus has very imbalanced document sizes when compared to text corpus modeling. However, our results demonstrate that this presents no issue to the inference algorithm.
We empirically found the following set of parameters: and , corresponding to the global and local concentration parameters, respectively.
Fig. 5 demonstrates that the hierarchy in the HDP is able to pool measurements from different road states to obtain a descriptive topic for them. For each road state, we obtain the maximum likelihood (ML) topic assignment and plot the respective road states in red. This pooling of observations can, for example, be observed for topics 0 and 41, which consist almost entirely of highway road states, as can be seen in the ML topic assignment plots (compare the red road segments to the highways depicted in the map in Fig. 1).
Using the inferred mixture of topics for each state, we can now compute the ML estimate of the marginals for the individual sensor signals and plot them color-coded for each road state. Fig. 7 shows this for the marginal over speed and time-of-day. Comparing the spatial distribution of the ML speed estimates computed from the inferred HDP model in Fig. 6(d) with ML estimates obtained from the empirical distribution depicted in Fig. 6(b), we can see that the HDP model is able to capture the distribution of the input data.
Additionally, it is clear from the spatial distribution of the ML estimates of driving speeds that the HDP model captures, for example, the fact that inner city driving is slower than highway driving. The ML estimates of time-of-day (Fig. 6(c) and 6(a)) show that the trips outside the city were not undertaken in the morning or evening.
We have shown that the inherent sparsity of the learned personal road network allows accurate long-term predictions of driver routes. Additionally, augmenting the model with the start location yields a more representative model which provides better destination predictions. Exploiting the hierarchy of the HDP topic model, we are able to learn expressive topic distributions despite the fact that the number of car signal measurements differs widely between road states. The combination of both types of models allows us to model the driving behavior of an individual driver. This type of model can, for example, assist in optimizing the daily commute route or help predict traffic jams. As a next step, it would be interesting to compare the driver models for different drivers to allow driver classification based on driving behavior.
- [1] Y. Teh, M. Jordan, M. Beal, and D. Blei, “Hierarchical Dirichlet processes,” Journal of the American Statistical Association (JASA), vol. 101, no. 476, pp. 1566–1581, 2006.
- [2] J. Chang and J. W. Fisher III, “Parallel sampling of HDPs using sub-cluster splits,” in Proceedings of the Neural Information Processing Systems (NIPS), Dec 2014.
- [3] C. Wang, J. Paisley, and D. M. Blei, “Online variational inference for the hierarchical Dirichlet process,” in Artificial Intelligence and Statistics, 2011.
- [4] M. Hoffman, D. Blei, J. Paisley, and C. Wang, “Stochastic variational inference,” Journal of Machine Learning Research, 2013.
- [5] T. A. Ranney, “Models of driving behavior: A review of their evolution,” Accident Analysis and Prevention, vol. 26, no. 6, pp. 733–750, 1994.
- [6] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 39, no. 1, pp. 1–38, 1977.
- [7] B. Kulis and M. I. Jordan, “Revisiting k-means: New algorithms via Bayesian nonparametrics,” in Proceedings of the 29th International Conference on Machine Learning, 2012.
- [8] R. Simmons, B. Browning, Y. Zhang, and V. Sadekar, “Learning to predict driver route and destination intent,” in IEEE International Conference on Intelligent Transportation Systems Conference (ITSC). IEEE, 2006.
- [9] J. Sethuraman, “A constructive definition of Dirichlet priors,” DTIC Document, Tech. Rep., 1991.