The amount and availability of spatiotemporal data have drastically increased thanks to the ubiquity of sensors in various applications and to high-capacity data centers that can store and serve this data. Transportation data, such as the New York Taxicab data (Dataset, 2016), is an example of such spatiotemporal data. Given the availability of these large transportation datasets, various analysis applications have been developed to gain insights from them. Characterizing driving context is a new application area which we introduce in this paper. A context can be described as a combination of location (e.g., Interstate-90) and time (e.g., weekdays between 3pm and 7pm). A characteristic of a context can be identified as a correlation between driving behavior and an environmental effect (e.g., traffic congestion). With information about different driving contexts, one can validate hypotheses about the behavior of an individual within a context, and also provide feedback to drivers to help them improve their skills. The former is related to usage-based insurance (NAIC, 2017) and the latter to driver coaching (Stanton et al., 2007).
In this paper, we address the problem of characterizing driving context. Our characterization is based on the observed behavior of drivers and exploits complementary sources of spatiotemporal data to analyze this behavior. We define “driver behavior” in terms of meaningful driving patterns (e.g., turn, speed-up, slow-down). In addition, we explore the causes that underlie a specific pattern within a context by conducting analyses across several spatiotemporal data sources (traffic data, road features, etc.). The cause behind a transition between patterns, which introduces a new pattern, can be extrinsic (e.g., an accident, a traffic signal, traffic congestion) or intrinsic (e.g., driver-generated distraction, or the personality of the driver).
Characterizing driving context is a challenging problem, due to the lack of supervision on drivers and/or the environment. Unlike previous studies in fully monitored environments (that use, for example, cameras placed inside the car) (Liu and Salvucci, 2001; Sathyanarayana et al., 2008), we focus on a dataset with only externally visible phenomena (e.g., the vehicle’s speed), with no monitoring of drivers or the environment. Given these limitations, the goal is to learn driving context based simply on the behavior of drivers. In this paper, we introduce DriveContext, a framework to efficiently characterize driving contexts. This framework consists of two components, dSegment and dDescribe. The former applies a behavior-based trajectory segmentation approach to find meaningful driving patterns within a trajectory. The latter reveals the extrinsic causes for each driving pattern. We apply DriveContext to a real-world dataset of car trajectories to derive interesting characteristics for different contexts.
2. Problem Statement
Assume we are given a set of trajectories. Each trajectory in the set is a sequence of data points, where each data point captures the vehicle’s status at a given time: its latitude and longitude, speed (km/h), acceleration, and heading (degrees). We study the “discovery of driving context” in terms of two sub-problems: Segmentation and Causality Analysis.
A segmentation of a trajectory into segments is a set of cutting indexes that mark the end points of the segments. The data points at these indexes form the set of cutting points of the segmented trajectory. Each segment represents a driving pattern, and each cutting point represents a transition between patterns.
We assume the existence of a segment is potentially relevant to extrinsic or intrinsic causes. In this work, the focus is on extrinsic causes, which we refer to as events. We keep track of events in an event database, where each event has a type and occurs at a given time in a geographical area with a given center. Given the set of cutting points obtained by segmenting a trajectory, and the database of events, the second sub-problem (i.e., causality analysis) is one of finding if, and to what extent, each cutting point is related to (or caused by) an event.
3. The DriveContext Framework
Figure 1 depicts the overall process of DriveContext, which consists of two major components, dSegment and dDescribe.
3.1. dSegment Component
dSegment is an adaptation of our previously proposed approach (Moosavi et al., 2016) that partitions a trajectory based on driver behavior, such that each resulting segment corresponds to a meaningful driving pattern (e.g., turn, speed-up). The fundamental parts of dSegment are summarized below.
Dataset Preprocessing: This step includes several data scrubbing tasks such as removing records with noisy GPS data, rounding values of acceleration and heading to simplify the model, and using change of heading instead of absolute heading to reflect changes clearly.
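As a rough illustration of the last two scrubbing tasks, the sketch below rounds acceleration and replaces absolute heading with a signed change of heading. The bin sizes (`accel_step`, `heading_step`), the tuple layout, and all function names are assumptions made for this example, not values taken from the paper; GPS noise filtering is not shown.

```python
def heading_change(prev_heading: float, heading: float) -> float:
    """Signed smallest rotation in degrees from prev_heading to heading, in [-180, 180)."""
    return (heading - prev_heading + 180.0) % 360.0 - 180.0


def round_to_step(value: float, step: float) -> float:
    """Round value to the nearest multiple of step."""
    return round(value / step) * step


def preprocess(points, accel_step=0.5, heading_step=5.0):
    """Convert (speed, acceleration, heading) tuples into
    (speed, rounded acceleration, rounded change of heading) tuples.

    The first point is dropped because it has no predecessor to compare with.
    accel_step and heading_step are hypothetical bin sizes.
    """
    cleaned = []
    for prev, cur in zip(points, points[1:]):
        delta = heading_change(prev[2], cur[2])
        cleaned.append((cur[0],
                        round_to_step(cur[1], accel_step),
                        round_to_step(delta, heading_step)))
    return cleaned
```

Using the change of heading makes a turn from 350° to 10° appear as a 20° change rather than a jump of 340°.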
Markov Model Creation: A Markov model is built to capture the likelihood of moving from one driving state to another. In order to deal with sparsity of the model and avoid overfitting, we adapt a regularization technique known as the Wedding Cake technique (Krumm and Horvitz, 2006). A portion of the model graph is shown in Figure 2.
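A minimal sketch of the unregularized part of such a model is given below, assuming each driving state is a hashable value (for instance a tuple of rounded speed, acceleration, and change of heading). The function name is hypothetical, and the Wedding Cake regularization is deliberately not reproduced here.

```python
from collections import Counter, defaultdict


def build_transition_model(state_sequences):
    """Estimate first-order transition probabilities from sequences of states.

    Returns a dict mapping state -> {next_state: probability}.
    No regularization is applied in this sketch.
    """
    counts = defaultdict(Counter)
    for sequence in state_sequences:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1

    model = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        model[current] = {state: n / total for state, n in followers.items()}
    return model
```

With many states and comparatively few observed transitions, most entries of such a table are empty, which is the sparsity that the regularization step is meant to address.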
Trajectory Transformation: The next step is to transform an input trajectory into a signal in a new space called Probabilistic Movement Dissimilarity (PMD) space. The generated signal for a trajectory shows how unlikely the behavior of a driver is during different moments of a trip. Algorithm 1 summarizes the transformation process.
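The following sketch only illustrates the general idea of turning a state sequence into an "unlikeliness" signal. It assumes, purely for illustration, that the value at each step is one minus the transition probability under a model of the form `{state: {next_state: probability}}`; this is not the PMD transformation of Algorithm 1, and the function name is hypothetical.

```python
def unlikeliness_signal(states, model):
    """Map a sequence of states to a signal of per-step unlikeliness.

    Assumption for this sketch: unlikeliness = 1 - P(next | current),
    and transitions absent from the model get probability 0.
    """
    signal = []
    for current, following in zip(states, states[1:]):
        probability = model.get(current, {}).get(following, 0.0)
        signal.append(1.0 - probability)
    return signal
```

High values mark moments where the observed transition is rare under the model, which is where a change of driving pattern would be suspected.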
Trajectory Segmentation: Finally, we apply a dynamic programming approach (Han et al., 2004) to segment the generated signal.
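To make the dynamic programming step concrete, here is a generic sketch that splits a one-dimensional signal into a fixed number of contiguous segments while minimizing the within-segment squared error. The cost function, the fixed segment count `k`, and the function name are assumptions for this example; the approach of Han et al. (2004) may differ in its cost and in how it selects the number of segments.

```python
def segment_signal(signal, k):
    """Split signal into k contiguous segments minimizing total squared error.

    Returns the exclusive end index of each segment, in ascending order.
    """
    n = len(signal)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(signal)")

    # Prefix sums give the squared error of any slice in O(1).
    sums, squares = [0.0], [0.0]
    for x in signal:
        sums.append(sums[-1] + x)
        squares.append(squares[-1] + x * x)

    def cost(i, j):
        """Squared error of signal[i:j] around its mean."""
        total = sums[j] - sums[i]
        return (squares[j] - squares[i]) - total * total / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for seg in range(1, k + 1):
        for j in range(seg, n + 1):
            for i in range(seg - 1, j):
                candidate = best[seg - 1][i] + cost(i, j)
                if candidate < best[seg][j]:
                    best[seg][j] = candidate
                    back[seg][j] = i

    # Walk the back-pointers from the end of the signal.
    ends, j = [], n
    for seg in range(k, 0, -1):
        ends.append(j)
        j = back[seg][j]
    return sorted(ends)
```

The returned end indexes play the role of cutting indexes: each one marks where a segment ends in the signal.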
3.2. dDescribe Component
dDescribe analyzes the extracted segments (which are essentially driving patterns) to explore the underlying causes behind these patterns. This step identifies the characteristics of a given context. The existence of a driving pattern is potentially related to extrinsic or intrinsic causes. Recall that the segments of a trajectory are represented by its cutting points. Given a database of events and a cutting point, the goal is to determine whether the cutting point is related to an event. If it is, then the segment that starts at that cutting point is potentially caused by the event. We define the relevancy relationship between a cutting point and an event based on the type of the event. In this study, we consider two types of events: physical facts and temporal-physical events. We measure the relevancy for each type of event as described below.
Physical Fact. An example is the presence of a traffic signal. In such a case, the relevancy can be measured as the distance between the location of the cutting point and the location of the event. We then say the cutting point and the event are correlated if their locations are within a specified distance threshold.
Temporal-Physical Event. An example is the existence of traffic congestion in a specific location during a time interval. In this case, we say a cutting point and an event are correlated if the two following conditions are satisfied: the time span of the trajectory that contains the cutting point overlaps with the time interval of the event, and the distance between the locations of the cutting point and the event is lower than a threshold.
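The two relevancy checks can be sketched as follows. The dictionary layouts, function names, and the mean Earth radius constant are assumptions for this example; thresholds are passed in as parameters because their values are not restated here.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def related_to_fact(cut, fact, max_dist_m):
    """Physical fact: related if the two locations are closer than the threshold."""
    return haversine_m(cut["lat"], cut["lon"], fact["lat"], fact["lon"]) < max_dist_m


def related_to_event(cut, event, trip_start, trip_end, max_dist_m):
    """Temporal-physical event: the trip must overlap the event's time interval
    and the two locations must be closer than the threshold."""
    overlaps = trip_start <= event["end"] and event["start"] <= trip_end
    return overlaps and related_to_fact(cut, event, max_dist_m)
```

Times can be any mutually comparable values (for example epoch seconds or `datetime` objects), since only ordering comparisons are used.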
We first describe the datasets we used in our study. Then, we evaluate dSegment with respect to a ground-truth dataset. Next, we apply dSegment to a real-world dataset of car trajectories and conduct causality analysis using dDescribe.
We used four different sets of spatiotemporal data sources to build and evaluate components of DriveContext.
Dataset of Annotated Car Trajectories (DACT): We have constructed a dataset of annotated car trajectories to evaluate dSegment (Moosavi et al., 2017). DACT consists of two sets of annotations for each trajectory, one that assumes “flexible” constraints to identify segment borders (Easy-Aggregation), and the other that uses “strict” constraints (Strict-Aggregation). DACT includes trajectories covering about hours of driving data.
Nationwide Trajectories: We use a rich dataset of trajectories provided by an insurance company in the state of Ohio, in the United States. To our knowledge, this dataset, which we have named Nationwide Trajectories, is one of the few large-scale datasets with driving data for personal vehicles (as opposed to commercial transportation vehicles). The dataset contains 83,406 trajectories and covers about 20,689 hours of driving data. We divided the Nationwide Trajectories into two sets, train and test, with 81,895 and 1,421 trajectories respectively. The former is used to build the dSegment model, and the latter is used to evaluate dDescribe. The test set contains sampled data for 5 popular routes in the city of Columbus, Ohio (Table 1).
Physical Facts: Physical facts consist of annotations for routes in the test set of the Nationwide dataset and are drawn from two different sources of data, i.e., Open Street Map (OSM, www.openstreetmap.org), with annotations like exit, merge, and bridge, and Hand-Curated Annotations (HCA) made using Google Street View to complement the former. The set of physical facts contains 1,825 annotations from OSM and 95 from HCA.
Temporal-Physical Events: We employ the Bing Traffic API (https://msdn.microsoft.com/en-us/library/hh441725.aspx) and the Map Quest Traffic API (https://developer.mapquest.com/products/traffic) to collect temporal-physical events such as real-time traffic reports. The dataset contains 1,688 records from Bing and 4,495 records from MapQuest for routes in the test set.
4.2. dSegment Evaluation
|315 Freeway||14.6 km||120||703||8.9|
We use the training set of Nationwide trajectories to create the Markov model. The regularized Markov model consists of 47,495 states and 5.8 million transitions between states. In order to evaluate our segmentation approach, we use the DACT annotation sets (Moosavi et al., 2017). For comparison purposes, we use the following four baselines: Stable Criteria, where a set of spatiotemporal heuristics (criteria) is used for segmentation (Buchin et al., 2010; Alewijnse et al., 2014); Point of Change Detection, where we employ the change point detection approach of (Liu et al., 2013); Equal Length, where we first assume all trajectories have the same number of segments and then divide a trajectory into equal-size segments; and Random, where segment borders are chosen at random.
In order to find the upper bound on the number of existing segments, i.e., (see (Moosavi et al., 2016)), we set , where is the length of the trajectory. The minimum length of a segment is assumed to be (see (Moosavi et al., 2017)). Since we have two sets of annotations for trajectories in DACT, we evaluate and compare our approach based on both sets. Also, we use Precision and Recall
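as evaluation metrics. Given a trajectory $T$ with annotations $A$, if an algorithm finds cutting points $C$ for $T$, then we define precision and recall as follows:
\[
\text{Precision} = \frac{|C \cap A|}{|C|}, \qquad \text{Recall} = \frac{|C \cap A|}{|A|}
\]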
The intersection between the cutting points $C$ and the annotations $A$ is calculated as follows: to find a match for a cutting point $c \in C$, we calculate its Haversine distance to all available annotations in $A$. If we find a pair $(c, a)$, for $a \in A$, such that their distance is lower than a pre-specified threshold, then we say there is a match for $c$. Once we find such an $a$, we no longer use it to match other cutting points in $C$. We evaluate over a set of distance thresholds (in meters), up to a maximum of 250 m. Figures 2(a) and 2(b) show the comparison between different segmentation approaches using gold sets in DACT. For the Equal Length and Random baselines, we use 30 and 50 as the number of segments for the Easy and Strict aggregation annotation sets, respectively. These numbers are set based on the average number of segments in a trajectory, as reported in (Moosavi et al., 2017).
The results show that dSegment outperforms the other baselines by reasonable margins, based on both ground truth datasets. The maximum distance threshold (250 m) is obtained by dividing the average length of routes in the evaluation set (i.e., 10 km, see Table 1) by the average number of segments for a trajectory (i.e., 40; the average numbers of segments per trajectory for the easy and strict sets are 30 and 50, respectively (Moosavi et al., 2017)). Note that a solution which maximizes the precision is preferred, because we need valid segments to conduct a precise causality analysis and confidently derive the characteristics of a context.
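The matching procedure behind precision and recall can be sketched as below. The distance function is passed in as a parameter (the evaluation described above uses the Haversine distance); taking the first unused annotation within the threshold, rather than the nearest one, is an assumption of this sketch, as are the function names.

```python
def count_matches(cuts, annotations, dist, threshold):
    """Count cutting points that match a distinct annotation.

    Each annotation can be consumed by at most one cutting point.
    """
    unused = list(annotations)
    matched = 0
    for cut in cuts:
        for annotation in unused:
            if dist(cut, annotation) < threshold:
                unused.remove(annotation)  # never reuse a matched annotation
                matched += 1
                break
    return matched


def precision_recall(cuts, annotations, dist, threshold):
    """Precision = matches / found cuts; recall = matches / annotations."""
    matched = count_matches(cuts, annotations, dist, threshold)
    precision = matched / len(cuts) if cuts else 0.0
    recall = matched / len(annotations) if annotations else 0.0
    return precision, recall
```

For instance, with positions measured along a route in meters and `dist = lambda a, b: abs(a - b)`, three cuts of which two fall within the threshold of distinct annotations give a precision of 2/3.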
4.3. dDescribe Evaluation
We define “context” as the combination of location (listed in Table 1) and time (e.g., weekdays between 3pm and 7pm). We use two granularity levels for time: Type of Day (Weekday (WD) versus Weekend (WE)), and Time of the Day with five time intervals, i.e., P1: from 6am to 9:59am, P2: from 10am to 2:59pm, P3: from 3pm to 6:59pm, P4: from 7pm to 9:59pm, and P5: from 10pm to 5:59am. We used the one-year traffic congestion reports by Map Quest for the city of Columbus, Ohio (see Figure 4) to derive the aforementioned intervals.
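The context definition above can be expressed as a small helper that maps a route name and a timestamp to a context key. The function names and the tuple representation of a context are assumptions for this example; the interval boundaries follow the text.

```python
from datetime import datetime


def day_type(ts: datetime) -> str:
    """'WE' for Saturday or Sunday, 'WD' otherwise."""
    return "WE" if ts.weekday() >= 5 else "WD"


def time_period(ts: datetime) -> str:
    """Map the hour of day to one of the five intervals P1..P5."""
    hour = ts.hour
    if 6 <= hour < 10:
        return "P1"
    if 10 <= hour < 15:
        return "P2"
    if 15 <= hour < 19:
        return "P3"
    if 19 <= hour < 22:
        return "P4"
    return "P5"  # 10pm to 5:59am wraps around midnight


def context_key(route: str, ts: datetime):
    """A context as (route, type of day, time of day)."""
    return (route, day_type(ts), time_period(ts))
```

Trajectories can then be grouped by this key before per-context statistics are computed.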
Table 1 provides statistics on applying dSegment to the test set, where the total number of extracted segments is 6,674. dDescribe can then be applied to this set to discover properties of each context. For that, we build an event database from physical facts and temporal-physical events. We conduct the causality analysis by introducing a new correlation measure: suppose that, for the set of trajectories in a given context, a sequence of cutting points is reported for each trajectory. Given an event database, we use Equation 1 to obtain the correlation for that context.
In Equation 1, the relevancy function is a Boolean function whose definition depends on the type of event, as discussed below.
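One plausible reading of the correlation measure is sketched below, under the assumption that the correlation of a context is the fraction of cutting points, over all trajectories in the context, that are related to at least one event in the database. This reading is inferred from the percentages reported later and may not match the exact form of Equation 1; the function and parameter names are hypothetical.

```python
def context_correlation(cuts_per_trajectory, events, is_related):
    """Fraction of cutting points related to at least one event.

    cuts_per_trajectory: one sequence of cutting points per trajectory.
    is_related: Boolean relevancy function taking (cutting_point, event).
    """
    total = 0
    related = 0
    for cuts in cuts_per_trajectory:
        for cut in cuts:
            total += 1
            if any(is_related(cut, event) for event in events):
                related += 1
    return related / total if total else 0.0
```

Passing the relevancy function as a parameter mirrors the text: the same measure is reused while the Boolean check changes with the event type.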
Event Data as Physical Facts: First, we use physical facts to build the event database. For this case, we define the relevancy function by calculating the Haversine distance between a cutting point and a physical fact, and then checking if their distance is lower than a pre-specified, empirically set threshold. Figure 4(a) shows the correlation between the extracted segments of different contexts and physical facts. Note that the correlation analysis is only done for those contexts for which enough data exists. We observe that on average, about 76.5% of the driving patterns (segments) are correlated with the physical properties of routes.
Event Data as Temporal-Physical Events: We also employ temporal-physical events to build the event database. To define the relevancy function in this case, we use the following two-step method:
Step 1: Find potential congestion evidence. Given a trajectory, we find sub-trajectories of a minimum length (based on (Moosavi et al., 2017)), where the speed of all points in such sub-trajectories is below a speed threshold (i.e., the average congestion speed in the traffic congestion dataset). We consider such sub-trajectories as showing potential evidence of congestion.
Step 2: Finalization. After finding potential evidence of congestion, we scan through our traffic congestion dataset to see if there are at least 12 instances (i.e., one report per month) in the neighborhood of the sub-trajectory’s location, within the same day of the week and hour of the day (e.g., Tuesday 4pm).
We identify 465 traffic congestion sub-trajectories within the 1,421 trajectories of the test set. The relevancy function returns true if a cutting point is in the neighborhood of at least one of the traffic congestion sub-trajectories of its trajectory; otherwise, it returns false. Figure 4(b) illustrates the correlation analysis results between driving patterns and temporal-physical events. On average, ~10.5% of driving patterns were correlated with traffic congestion.
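The two-step method can be sketched as follows. This example assumes that sub-trajectory length is measured in number of consecutive points and that each traffic report carries a weekday, an hour, and a location; these representations, the `near` predicate, and the function names are assumptions made for illustration. Only the minimum of 12 reports comes from the text.

```python
def low_speed_runs(speeds, max_speed, min_len):
    """Step 1: find maximal runs of consecutive points slower than max_speed.

    Returns (start, end) index pairs with end exclusive, keeping only runs
    of at least min_len points.
    """
    runs, start = [], None
    for i, speed in enumerate(speeds):
        if speed < max_speed:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(speeds) - start >= min_len:
        runs.append((start, len(speeds)))
    return runs


def is_confirmed(location, weekday, hour, reports, near, min_reports=12):
    """Step 2: keep a run only if enough reports share its neighborhood,
    day of the week, and hour of the day."""
    count = sum(1 for r in reports
                if r["weekday"] == weekday
                and r["hour"] == hour
                and near(location, r["location"]))
    return count >= min_reports
```

A run that passes both steps is treated as congestion evidence; cutting points near such a run are then counted as related to a temporal-physical event.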
Event Data as Union of Facts and Events: Finally, we consider both physical facts and temporal-physical events to build the event database. In this case, the relevancy function is applied according to the type of each event. Figure 4(c) demonstrates the correlation of driving patterns with the set of all existing events. We observe that ~78.1% of segments are correlated with at least one of the event types. Moreover, a side-by-side comparison of the results in Figures 4(b) and 4(c) reveals that both analyses lead to almost the same correlation patterns. This confirms that a significant number of segments are correlated with both sources of events.
Results from causality analysis are strong signals that capture the characteristics of a driving context. These insights can be employed in various applications. One example is usage-based insurance to study the behavior of an individual driver in order to evaluate how risky or safe he/she is, regarding the characteristics of context. Insights from our framework may also be used for driver coaching, by recommending further training to those drivers whose driving behavior in a context is not compatible with the properties of that context.
5. Related Work
Our study relates to research in trajectory segmentation (as in dSegment) and making sense of trajectories (as in dDescribe).
Trajectory Segmentation: The task of segmentation has been addressed in several studies. In (Buchin et al., 2010), a greedy segmentation algorithm exploits a set of monotonic spatiotemporal criteria (e.g., defining relative thresholds for some feature values) on features like speed and heading. Alewijnse et al. extended this work to both monotonic and non-monotonic criteria (Alewijnse et al., 2014). However, criteria-based methods need human input for tuning parameters. Transforming a trajectory prior to segmentation has also been discussed in (Panagiotakis et al., 2012); however, their transformation is a local approach, based on comparing line segments of an input trajectory. In contrast, we perform a global, likelihood-based transformation to provide a segmentation where the extracted segments represent meaningful driving patterns.
Making Sense of Trajectories. Akin to dDescribe, there are some other approaches which make sense of driving data and explore insights encapsulated in trajectories (Stenneth et al., 2011; Lou et al., 2009; Su et al., 2015). Wu et al. (Wu et al., 2016) predict traffic based on analysis of external data sources including Point Of Interest (POI) data, collision data, weather data, and geo-tagged tweet data. This work is similar to dDescribe, where we discover correlations between driving patterns and traffic congestion and physical properties of routes. However, we pursue a different goal which is the identification of characteristics of a context.
We present DriveContext, a framework to derive characteristics of a context by extracting meaningful driving patterns (dSegment), and then analyzing the extracted patterns (dDescribe) to derive characteristics. Our analysis shows how dSegment compares with the state of the art in finding meaningful driving patterns. In addition, the results of dDescribe show the ability of the framework to interpret driving patterns, which leads to new insights. Our future course of action is to incorporate more sources of event data, such as Twitter feeds, into dDescribe.
- Alewijnse et al. (2014) Sander Alewijnse, Kevin Buchin, Maike Buchin, Andrea Kölzsch, Helmut Kruckenberg, and Michel A Westenberg. 2014. A framework for trajectory segmentation by stable criteria. In 22nd SIGSPATIAL. ACM, 351–360.
- Buchin et al. (2010) Maike Buchin, Anne Driemel, Marc van Kreveld, and Vera Sacristán. 2010. An algorithmic framework for segmenting trajectories based on spatio-temporal criteria. In 18th SIGSPATIAL. ACM, 202–211.
- Dataset (2016) New York Taxi Dataset. 2009-2016. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. (2009-2016). Accessed: 2017-05-08.
- Han et al. (2004) Tony X Han, Steven Kay, and Thomas S Huang. 2004. Optimal segmentation of signals and its application to image denoising and boundary feature extraction. In ICIP’04, Vol. 4. IEEE, 2693–2696.
- Krumm and Horvitz (2006) John Krumm and Eric Horvitz. 2006. Predestination: Inferring destinations from partial trajectories. In 8th UbiComp. Springer, 243–260.
- Liu and Salvucci (2001) Andrew Liu and Dario Salvucci. 2001. Modeling and prediction of human driver behavior. In HCI. 1479–1483.
- Liu et al. (2013) Song Liu, Makoto Yamada, Nigel Collier, and Masashi Sugiyama. 2013. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks 43 (2013), 72–83.
- Lou et al. (2009) Yin Lou, Chengyang Zhang, Yu Zheng, Xing Xie, Wei Wang, and Yan Huang. 2009. Map-matching for low-sampling-rate GPS trajectories. In 17th SIGSPATIAL. ACM, 352–361.
- Moosavi et al. (2017) Sobhan Moosavi, Behrooz Omidvar-Tehrani, R Bruce Craig, and Rajiv Ramnath. 2017. Annotation of Car Trajectories based on Driving Patterns. CoRR abs/1705.05219 (2017), 1–10.
- Moosavi et al. (2016) Sobhan Moosavi, Rajiv Ramnath, and Arnab Nandi. 2016. Discovery of driving patterns by trajectory segmentation. In 3rd SIGSPATIAL PhD Symposium. ACM, 4.
- NAIC (2017) NAIC. 2017. Usage-Based Insurance and Telematics. http://www.naic.org/cipr_topics/topic_usage_based_insurance.htm. (2017). [Online; accessed 05-September-2017].
- Panagiotakis et al. (2012) Costas Panagiotakis, Nikos Pelekis, Ioannis Kopanakis, Emmanuel Ramasso, and Yannis Theodoridis. 2012. Segmentation and sampling of moving object trajectories based on representativeness. TKDE 24, 7 (2012), 1328–1343.
- Sathyanarayana et al. (2008) Amardeep Sathyanarayana, Pinar Boyraz, and John HL Hansen. 2008. Driver behavior analysis and route recognition by hidden Markov models. In ICVES. IEEE, 276–281.
- Stanton et al. (2007) Neville A Stanton, Guy H Walker, Mark S Young, Tara Kazi, and Paul M Salmon. 2007. Changing drivers’ minds: the evaluation of an advanced driver coaching system. Ergonomics 50, 8 (2007), 1209–1234.
- Stenneth et al. (2011) Leon Stenneth, Ouri Wolfson, Philip S Yu, and Bo Xu. 2011. Transportation mode detection using mobile phones and GIS information. In 19th SIGSPATIAL. ACM, 54–63.
- Su et al. (2015) Han Su, Kai Zheng, Kai Zeng, Jiamin Huang, Shazia Sadiq, Nicholas Jing Yuan, and Xiaofang Zhou. 2015. Making sense of trajectory data: A partition-and-summarization approach. In 31st ICDE. IEEE, 963–974.
- Wu et al. (2016) Fei Wu, Hongjian Wang, and Zhenhui Li. 2016. Interpreting traffic dynamics using ubiquitous urban data. In 24th SIGSPATIAL. ACM, Burlingame, CA, 69.