Our digital footprint is growing at an unprecedented scale. We use numerous devices and online services that create massive amounts of data 24/7. Some of these data are personal, concerning either an identified or an identifiable natural person; thus, they fall under the protection of the European General Data Protection Regulation (GDPR) European Union (2016b). In fact, to determine whether a natural person is identifiable based on given data, one should take account of all means reasonably likely to be used (by the data controller or an adversary) to identify the natural person. One such technique is singling out; whether it is reasonably likely to be used depends on the specific data and the socio-technological context in which it was collected European Union (2016a).
One specific area where digitalization and data generation are booming is automotive. From a set of mechanical and electrical components, cars have evolved into smart cyber-physical systems. Whereas this evolution has enabled automakers to implement advanced safety and entertainment functionalities, it has also opened up novel attack surfaces for malicious hackers and data collection opportunities for OEMs and third parties. The backbone of a smart car is the in-vehicle network which connects ECUs (Electronic Control Units); the most established vehicular network standard is called Controller Area Network (CAN) Voss (2008). CAN is already a critical technology worldwide, making automotive data access a commodity. One or more CAN buses carry all important driving-related information inside a car. OEMs (Original Equipment Manufacturers, i.e., car makers) collect and analyze CAN data for maintenance purposes; however, CAN data might reveal other, more personal traits, such as the driving behavior of natural persons. Such information could be invaluable to third-party service providers such as insurance companies, fleet management services and other location-based businesses (not to mention malicious entities); hence, there exist economic incentives for them to collect or buy it.
It has been shown that automobile driver fingerprinting could be practical based on sensor signals captured on the CAN bus in restricted environments Enev et al. (2016). Using machine learning techniques, authors re-identified drivers from a fixed set of experiment participants, thus implementing singling out, which makes this a privacy threat. There is a caveat: the adversary has to know the higher layer protocols of CAN in order to extract meaningful sensor readings. Since such message and message flow specifications (above the data link layer) are usually proprietary and closely guarded industrial secrets, such adversarial background knowledge might not be reasonable. In this case, the research question changes: is it possible for an adversary to re-identify drivers based on raw CAN data without the knowledge of protocols above the data link layer?
Contributions. In this paper we investigate experimentally the potential to identify and extract vehicle sensor signals from raw CAN bus data for the sake of inferring personal driving behavior and re-identifying drivers. As signal positions, lengths and coding are proprietary and vary among makes, models, model years and even geographical area, first, we have to interpret the messages. We emphasize that we do not intend to perform (an even remotely) comprehensive reverse engineering Sija et al. (2018); we focus solely on a small number of sensor signals which are good descriptors of natural driving behavior.
Our contributions are three-fold:
we devise a heuristic method for message decomposition and log pre-processing;
we build, train and validate a machine learning classifier that can efficiently match vehicle sensor signals to a ground truth based on raw CAN data. In particular, we train a classifier on the statistical features of a signal in one car (e.g., Opel Astra), then we use this trained classifier to localize the same signal in a different car (e.g., Toyota). The intuition is that the physical phenomenon represented by the signal has identical statistical features across different cars, and hence can be used to identify the same signal in all cars using the same classifier;
we briefly demonstrate that re-identification of drivers is possible using the extracted signals.
The rest of the paper is structured as follows. Section 2 presents related work. Section 3 gives a background on important characteristics of the Controller Area Network. Section 4 describes our data collection process. Section 5 presents our efforts on message decomposition and log pre-processing. Section 6 presents the design, evaluation and validation of our random forest classifier for extracting sensor signals. Section 6.4.2 briefly demonstrates the successful application of the extracted signals for driver re-identification. Finally, Section 7 concludes the paper.
2 Related Work
Driver characterization based on CAN data has gathered significant research interest from both the automotive and the data privacy domains. The common trait in these works is the presumed familiarity with the whole specific CAN protocol stack, including the presentation and application layers, giving the researchers access to sensor signals. This knowledge is usually gained via access to the OEM's documentation in the framework of some research cooperation. As such, researchers do not normally disclose such information to preserve secrecy.
Miyajima et al. (2007) investigated driver characteristics when following another vehicle, and modeled pedal operation patterns using speech recognition methods. Sensor signals were collected in both a driving simulator and a real vehicle. Using car-following patterns and spectral features of pedal operation signals, the authors achieved an identification rate of 89.6% for the simulator (12 drivers). For the field test, by only applying cepstral analysis on pedal signals, the identification rate was down to 76.8% (276 drivers). Fugiglando et al. (2018) developed a new methodology for near-real-time classification of driver behavior in uncontrolled environments, where 64 people drove 10 cars for a total of over 2000 driving trips without any type of predetermined driving instruction. Despite their advanced use of unsupervised machine learning techniques, they conclude that clustering drivers based on their behavior remains a challenging problem.
Hallac et al. (2016) discovered that driving maneuvers during turning exhibit personal traits that are promising regarding driver re-identification. Using the same dataset from Audi and its affiliates, Fugiglando et al. (2017) showed that four behavioral traits, namely braking, turning, speeding and fuel efficiency, could characterize drivers adequately. They provided a (mostly theoretical) methodology to reduce the vast CAN dataset along these lines.
Enev et al. authored a seminal paper Enev et al. (2016) which makes use of mostly statistical features as an input for binary (one-vs-one) classification with regard to driving behavior. Driving the same car in a constrained parking lot setting and a longer but fixed route, authors re-identified their 15 drivers with 100% accuracy. Authors had access to all available sensor signals and their scaling and offset parameters from the manufacturer’s documentation.
In a paper targeted at anomaly detection in in-vehicle networks, Markovitz and Wool (2017) developed a greedy algorithm to split the messages into fields and to classify the fields into categories: constant, multi-value and counter/sensor. Note that the algorithm does not distinguish between counters and sensor signals, and the semantics of the signals are not interpreted. Thus, their results cannot be directly used for inferring driver behavior.
3 CAN: Controller Area Network
The Controller Area Network (CAN) is a bus system providing in-vehicle communications for ECUs and other devices. The first CAN bus protocol was developed in 1986, and it was adopted as an international standard in 1993 (ISO 11898). A recent car can have anywhere from 5 up to 100 ECUs, which are served by several CANs. Our point of focus is the CAN serving the drive-train.
CAN is an overloaded term Szalay et al. (2015). Originally, CAN refers to the ISO standard 11898-1 specifying the physical and data link layers of the CAN protocol stack. Second, another meaning is connected to FMS-CAN (Fleet Management System CAN), originally initiated by major truck manufacturers and defined in the SAE standard family J1939; FMS-CAN gives a full-stack specification including recommendations on higher protocol layers. Third, CAN refers to the multitude of proprietary CAN protocols which are make- and model-specific. This results in different message IDs, signal transformation parameters and encoding. These protocols are usually based on the standardized lower layers, but their higher layers are kept confidential by OEMs. The overwhelming majority of cars use one or more proprietary CAN protocols. Generally, sensor signals in CAN variants have a sampling period on the order of 10 ms.
On the other hand, using the standard on-board diagnostics (OBD, OBD-II) is a popular way of getting data out of the car. Originally developed for maintenance and technical inspection purposes and included in every new car since 1996, OBD is also used for telematics applications. Adding to the confusion regarding CAN, OBD has five minor variations, including one which is based on the CAN physical layer. Sensor signals carried by OBD have a sampling period on the order of 1 second. In certain vehicle makes and models, one or more CANs are also connected to the OBD-II diagnostic port. In such cars, which also utilize OBD over the CAN physical layer, it is possible to extract fine-grained CAN data via an OBD-II logger device.
Table 1 shows a simplified picture of a CAN message with an 11-bit identifier, which is the usual format for everyday cars; trucks and buses usually use the extended 29-bit version. This example shows an already stripped message, i.e., we do not discuss end of frame or check bits.
|Timestamp||CAN-ID||Remote Transmission Request||Length||Data|
|1481492683.285052||0x0208||000||0x8||0x00 0x00 0x32 0x00 0x0e 0x32 0xfe 0x3c|
|1497323915.123844||0x018e||000||0x8||0x03 0x03 0x00 0x00 0x00 0x00 0x07 0x3f|
|1497323915.112910||0x00f1||000||0x6||0x28 0x00 0x00 0x40 0x00 0x00|
Table 1: Components of a CAN bus message.
Timestamp: Unix timestamp of the message
CAN-ID: contains the message identifier - lower values have higher priority (e.g. wheel angle, speed, …)
Remote Transmission Request: allows ECUs to request messages from other ECUs
Length: length of the Data field in bytes (0 to 8 bytes)
Data: contains the actual data values in hexadecimal format. The Data field needs to be broken to sensor signals, transformed and/or converted to a human-readable format in order to enable further analysis.
Throughout this paper, we focus on the three practically relevant fields: CAN-ID, Length and Data.
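To make the record layout concrete, the following minimal sketch parses one stripped log line of the form shown in Table 1. The names `CanFrame` and `parse_frame` are hypothetical, and the pipe-separated text layout is assumed from the table rather than taken from the actual logging software.

```python
from dataclasses import dataclass


@dataclass
class CanFrame:
    timestamp: float  # Unix timestamp of the message
    can_id: int       # message identifier (lower value = higher priority)
    rtr: int          # Remote Transmission Request field
    length: int       # number of bytes in the Data field (0 to 8)
    data: bytes       # raw payload


def parse_frame(line: str) -> CanFrame:
    """Parse one pipe-separated log line (assumed text layout)."""
    # Splitting on "|" leaves empty strings between doubled pipes; drop them.
    fields = [f.strip() for f in line.split("|") if f.strip()]
    timestamp, can_id, rtr, length, payload = fields
    data = bytes(int(b, 16) for b in payload.split())
    frame = CanFrame(float(timestamp), int(can_id, 16), int(rtr),
                     int(length, 16), data)
    if frame.length != len(frame.data):
        raise ValueError("Length field does not match payload size")
    return frame


frame = parse_frame(
    "|1497323915.112910||0x00f1||000||0x6||0x28 0x00 0x00 0x40 0x00 0x00|"
)
print(hex(frame.can_id), frame.length, frame.data.hex())
```

Only `can_id`, `length` and `data` are needed for the analysis that follows; the timestamp is kept because it orders the messages of each ID into a time series.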
4 Data Collection
As CAN data logs are not widely available, we conducted a measurement campaign. For data collection, we connected a logging device to the OBD-II port and logged all observed messages from various ECUs. Such a device acts as a node on the CAN bus and is able to read and store all broadcast messages. Our team developed both the logging device (based on a Raspberry Pi 3) and the logging software (in C). Note that the OBD-II connector is commonly found under the steering wheel. Also note that not all car makes and models connect the CAN serving the drive-train ECUs (or any CAN) to the OBD-II port (e.g., Volkswagen, BMW, etc.); in this case we could not log any meaningful data.
We have gathered meaningful data from 8 different cars and a total of 33 drivers. We did not put any restriction on the demographics of the drivers or the route taken. In each case we asked the driver to drive for a period of 30-60 minutes, while our device logged data from every route the drivers took. Drivers were free to choose their way, while still conforming to three practical requirements: (1) record at least 2 hours of driving in total, (2) do not record data when driving up and down on hills or mountains, (3) do not record data in extremely heavy traffic (short runs and idling). Free driving was recorded for all 33 drivers with an Opel Astra 2018: 13 people were between the ages of 20 and 30, 12 between 30 and 40, and 8 above 40; there were 5 women and 28 men; 11 had less experience (less than 7000 km per year on average or novice driver), 9 had average experience (8000-14000 km per year), and 13 had above average experience (more than 14000 km per year).
We gathered data from the following cars: Citroen C4 2005 (22 message IDs), Toyota Corolla 2008 (36 IDs), Toyota Aygo 2014 (48 IDs), Renault Megane 2007 (20 IDs), Opel Astra 2018 (72 IDs), Opel Astra 2006 (18 IDs), Nissan X-trail 2008 (automatic, 34 IDs) and Nissan Qashqai 2015 (60 IDs). We would like to emphasize that the two Opel Astras use completely different proprietary CAN versions (even the only 2 common IDs correspond to completely different Data). We also recorded the GPS coordinates via an Android smartphone during at least one logged drive per car. Most routes were driven inside or close to Budapest; approximately 15-20% was recorded on a motorway.
5 CAN Data Analysis
All recorded messages contained 4 to 8 bytes of data; this made it likely that multiple (potentially unrelated) pieces of information can be sent under the same ID. We first assumed that signals are positioned over whole bytes; this turned out to be wrong. Our investigation revealed that besides signal values a message can also contain constants, multi-value fields and counters. Some values appear only on-demand, such as windscreen or window signals. All data apart from sensor signals are considered noise and, therefore, need to be removed.
Meaningful CAN IDs vary significantly across vehicle makes and models; therefore, we expected that the only signals found in all cars with high probability are the basic ones, such as velocity, brake, clutch and accelerator pedal positions, RPM (revolutions per minute) and steering wheel angle. Next, we devise a method that yields a deeper understanding of the Data field in CAN messages and a possibility for sensor signal extraction. Note that from this point we will use the term ID as a reference to both a given type of message and its data stream (time series).
5.1 Bit decomposition heuristics
Extracting the signals from a CAN message is not a trivial challenge. While monitoring the data stream during driving and finding the exact bits that change in reaction to one's actions is possible, it is highly time consuming, does not scale to the hundreds of existing CAN protocol versions and is bound to miss potential sensor signals. (We only took this approach with a single car model to generate training data and a validation framework for our machine learning solution.) Our objective here is to present our observations on message types and distributions that lead to a smarter message decomposition method.
First, we examined the message streams literally bit-by-bit. We presumed that inside a given ID with potentially multiple sensor readings there was a difference in their bit value distribution, hence they could be systematically located and partitioned according to some rule. E.g., let us assume that there are two signals sent next to each other under the same ID (i.e., there are no zero bits or other separators between the two). Given that signals are encoded in a big endian (little endian) format, both of their MSBs (LSBs) are rarely 1s. Therefore, there should be a drop in bit probability (i.e., the probability for a given bit to be 1) between the last bit of the first signal and the first bit of the second signal. In order to visualize these drops we represent IDs by their bit distribution: we count the messages recorded for each ID and how many times a given bit was one, and divide these two measures:
$$P_i = \frac{1}{n} \sum_{j=1}^{n} b^{(j)}_i,$$
where $n$ is the number of messages recorded for a given ID, $b^{(j)}$ denotes the binary vector of the $j$-th message of that ID, that is, the representation of the CAN message's payload in binary format, and $P_i$ denotes the probability of a bit being 1 at the $i$-th position.
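The bit distribution described above can be sketched in a few lines. This is an illustrative computation and not the authors' code: the function names, the bit ordering (bit 0 is the most significant bit of the first byte) and the drop threshold of 0.3 are assumptions, and the payloads are made-up toy data.

```python
def bit_probabilities(payloads):
    """For each bit position, the fraction of messages in which the bit is 1."""
    n_bits = 8 * len(payloads[0])
    ones = [0] * n_bits
    for payload in payloads:
        for i in range(n_bits):
            byte, offset = divmod(i, 8)
            # Assumed ordering: bit 0 is the MSB of the first payload byte.
            ones[i] += (payload[byte] >> (7 - offset)) & 1
    return [count / len(payloads) for count in ones]


def probability_drops(probs, threshold=0.3):
    """Bit positions where the probability falls sharply from the previous bit."""
    return [i for i in range(1, len(probs))
            if probs[i - 1] - probs[i] >= threshold]


# Toy payloads of one ID: two small big-endian values packed in adjacent bytes.
payloads = [bytes([0x03, 0x01]), bytes([0x02, 0x03]),
            bytes([0x01, 0x02]), bytes([0x03, 0x00])]
probs = bit_probabilities(payloads)
print(probability_drops(probs))  # candidate start positions of a new signal
```

In this toy example the only sharp drop is at bit 8, the boundary between the two bytes, which is where a second signal would be expected to start.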
When we examined the distribution of the bits in an ID, we found that in some cases it is straightforward to extract a signal: between two signal candidates there were separator bits with a bit probability of 0 or 1. Other cases were more complex: given Figure 1(a) it is hard to determine signal borders. However, combined with the bit distribution from the same ID and car model but another drive, the signals became clearly distinguishable.
After examining bit distributions we realized that of candidate signal blocks are placed on one or two bytes. In other cases signal borders were ambiguous, see Figure 2. Our first heuristic suggests the start of a new signal because of the drop at the 23rd and the 24th bits; although this is clearly a counter or a constant on 3 bits, we cannot determine where exactly a new signal starts (is it the 28th bit or the 32nd?). Moreover, the 41st bit is a constant bit, which might signify some kind of separator, yet we cannot be certain. After a long evaluation we decided to divide the data part of the messages into bytes and pairs of bytes; as a result, for one ID we could define 4 to 8 sensor candidates.
Examining the byte time series resulting from the above approach, we spotted that many series were constant, had very few values, were cyclic (counters) or changed very rarely. As we intended to use machine learning to find the exact signals, not filtering these samples could have caused significant performance loss and a bloated and skewed training dataset with a lot of similar negative samples resulting in a decreased variability of training data potentially to a degree of corrupting the model. Therefore, we evaluated the variation for each sample and excluded those that had a very low variation (”low variation” was also a free variable optimized during evaluation).
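A minimal sketch of this pruning step is given below. It assumes that "variation" is measured as the population variance of a candidate series, which is one plausible reading and not something the text states; the function names are hypothetical, and only single-byte candidates are shown (byte pairs could be added in the same way).

```python
from statistics import pvariance


def byte_series(payloads):
    """Split the payloads of one ID into one time series per byte position."""
    n_bytes = len(payloads[0])
    return {i: [p[i] for p in payloads] for i in range(n_bytes)}


def prune_low_variation(candidates, min_variance):
    """Keep only candidates whose variance reaches the (tunable) threshold."""
    return {key: series for key, series in candidates.items()
            if pvariance(series) >= min_variance}


# Toy payloads: byte 0 is constant, byte 1 changes, byte 2 barely changes.
payloads = [bytes([0, 10, 5]), bytes([0, 20, 5]), bytes([0, 30, 6])]
kept = prune_low_variation(byte_series(payloads), min_variance=7)
print(sorted(kept))
```

Constant fields and near-constant fields fall below the threshold, so only the byte that actually varies survives as a sensor candidate.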
Normalization. We scaled all candidate time series to the interval of [0, 1]: we extracted the maximum value for the whole candidate series, then we divided all values by the maximum. Scaling the data solves the problem of transformed (shifted) values, i.e., the same signal can take different values during drives. This can be a result of some transformation on the data in one vehicle or simply the fact that one car was driven in a lower range of velocity than the other (e.g., one log comes from a drive that did not exceed 50 km/h while the other rarely drove slower than 100 km/h).
Sliding windows. We divided logs into overlapping sliding windows from which we extracted our features for machine learning. The sliding window length (elapsed time) and percentage of overlap with previous and successive windows were free variables which we set to default values and subsequently optimized during run time.
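The two pre-processing steps can be illustrated as follows. This is a sketch under stated assumptions: windows are expressed in numbers of samples instead of elapsed time, the overlap is given as a fraction, and the function names are hypothetical.

```python
def scale_by_max(series):
    """Scale a non-negative series to [0, 1] by dividing by its maximum."""
    peak = max(series)
    return [v / peak for v in series] if peak > 0 else list(series)


def sliding_windows(series, window, overlap):
    """Cut a series into windows of `window` samples.

    `overlap` is the fraction of a window shared with its successor.
    """
    step = max(1, int(round(window * (1 - overlap))))
    return [series[start:start + window]
            for start in range(0, len(series) - window + 1, step)]


scaled = scale_by_max(list(range(10)))
windows = sliding_windows(scaled, window=4, overlap=0.25)
print(len(windows), windows[0])
```

With a window of 4 samples and 25% overlap the step is 3 samples, so consecutive windows share exactly one sample.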
We use machine learning for two purposes; first, to extract signals from CAN messages, and second, to perform driver re-identification using the extracted signals. Therefore, we build two different types of classifiers. In order to extract signals, we train a classifier per signal on the statistical features of the signal in a base car (e.g., Opel Astra 2018) where we exactly know where the signal resides, that is, the message ID and byte number of the CAN message which contains the signal. Then, we use the trained models to identify the same signals in another car (e.g., Toyota) where the locations of the signals (i.e., message ID and byte number) are unknown. The intuition is that the physical phenomenon that a signal represents has identical statistical features irrespective of the car, and hence can be used to identify the same signal in all cars using the same classifier.
For driver re-identification, similarly to previous works (Fugiglando et al., 2018; Miyajima et al., 2007), we use a separate classifier that is trained on the already extracted signals of the car. This classifier learns the distinguishing features of different drivers (and not that of signals like the first classifier) using the signals produced during their drives.
For both signal extraction and driver re-identification, the features computed from each sliding window constitute a single training sample (i.e., a sample vector) used as the input of our machine learning classifiers. Below we describe the classifiers, the division of training and testing samples, and the method used for multi-class classification.
6.1 Multi-class Classification
We implemented multiclass classification using binary classification in a one-vs-rest way (also known as one-vs-all (OvA) or one-against-all (OAA)). The strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. For signal extraction, a class represents a pair of message ID and byte number, whereas for driver re-identification, it represents a driver's identity. A random forest model was trained per class with balanced training data (i.e., containing the same number of positive and negative samples), and its output was binary, indicating whether the input sample belongs to the class or not. For signal extraction, as each training/testing sample is a small portion of the time series (i.e., window) representing a signal, we apply the trained model on all portions of a signal and obtain multiple decisions per signal. Then, the "votes" are aggregated and the candidate signal with the largest number of "votes" is selected.
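The vote aggregation can be sketched independently of the classifier itself. In the snippet below, `is_positive` stands in for a trained per-class binary model; the simple range test used here is a placeholder so that the example runs without a trained random forest, and the candidate keys and window values are made up.

```python
def match_signal(candidate_windows, is_positive):
    """Count positive per-window decisions and return the most voted candidate."""
    votes = {key: sum(1 for w in windows if is_positive(w))
             for key, windows in candidate_windows.items()}
    best = max(votes, key=votes.get)
    return best, votes


# Candidates are keyed by (message ID, byte number); values are their windows.
candidate_windows = {
    ("0x0208", 2): [[0.1, 0.5], [0.6, 0.9], [0.2, 0.3]],
    ("0x018e", 0): [[0.0, 0.0], [0.0, 0.05], [0.5, 0.6]],
}


def is_positive(window):
    # Placeholder decision rule, NOT the random forest used in the paper.
    return max(window) - min(window) > 0.05


best, votes = match_signal(candidate_windows, is_positive)
print(best, votes)
```

Deciding per window and then voting per signal means that a few misclassified windows do not change the outcome as long as the true signal collects the most positive decisions.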
We would like to stress that random forests are indeed capable of general multiclass classification without a transformation to binary classification. We have also tried this general multiclass approach; however, its results were inferior to those of OvA. Moreover, successful driver re-identification can already be carried out using a single or only a few signals Enev et al. (2016). In this paper, we aim to extract the velocity, the brake pedal, the accelerator pedal, the clutch pedal and the RPM signals for driver re-identification.
|Feature||Importance|
|count below mean||0.2113|
|count above mean||0.1482|
|mean abs change||0.1048|
|longest strike below mean||0.0708|
6.2 Feature extraction
Our classifiers use statistical features of the samples; for each sliding window we extracted 20 different statistics that are widely considered the most descriptive regarding time series characteristics (see the best features in Table 2). We finally used 15 features based on their importance calculated from our random forest models. These features are the following:
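count_above_mean(x): Returns the number of values in $x$ that are higher than the mean of $x$.
count_below_mean(x): Returns the number of values in $x$ that are lower than the mean of $x$.
longest_strike_above_mean(x): Returns the length of the longest consecutive subsequence in $x$ that is bigger than the mean of $x$.
longest_strike_below_mean(x): Returns the length of the longest consecutive subsequence in $x$ that is smaller than the mean of $x$.
binned_entropy(x, max_bins): First bins the values of $x$ into max_bins equidistant bins. The parameter max_bins was generally set to 10. Then calculates the value of:
$$-\sum_{k=1}^{\min(\mathrm{max\_bins},\, \mathrm{len}(x))} p_k \log(p_k) \cdot \mathbf{1}_{(p_k > 0)},$$
where $p_k$ is the percentage of samples in bin $k$.
mean_abs_change(x): Returns the mean over the absolute differences between subsequent time series values which is:
$$\frac{1}{n-1} \sum_{i=1}^{n-1} |x_{i+1} - x_i|.$$
mean_change(x): Returns the mean over the differences between subsequent time series values which is:
$$\frac{1}{n-1} \sum_{i=1}^{n-1} (x_{i+1} - x_i).$$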
This way we created an input vector of features for each sample (one sample corresponds to one window). No smoothing, outlier elimination or function approximation is performed on the samples before feature extraction. For calculating the above statistics, we used the tsfresh Python package (https://tsfresh.readthedocs.io/en/latest/).
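To show what these statistics compute on a single window, the sketch below re-implements the listed features with the standard library only. These are stand-ins written for illustration, not calls into the tsfresh API, and the window values are made up.

```python
import math
from statistics import mean


def _longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def count_above_mean(x):
    m = mean(x)
    return sum(v > m for v in x)


def count_below_mean(x):
    m = mean(x)
    return sum(v < m for v in x)


def longest_strike_above_mean(x):
    m = mean(x)
    return _longest_run(v > m for v in x)


def longest_strike_below_mean(x):
    m = mean(x)
    return _longest_run(v < m for v in x)


def binned_entropy(x, max_bins=10):
    lo, hi = min(x), max(x)
    if hi == lo:
        return 0.0  # all values fall into a single bin
    counts = [0] * max_bins
    for v in x:
        k = min(int((v - lo) / (hi - lo) * max_bins), max_bins - 1)
        counts[k] += 1
    return -sum(c / len(x) * math.log(c / len(x)) for c in counts if c > 0)


def mean_abs_change(x):
    return mean(abs(b - a) for a, b in zip(x, x[1:]))


def mean_change(x):
    return mean(b - a for a, b in zip(x, x[1:]))


window = [1, 2, 4, 4, 1, 0]
features = [count_above_mean(window), count_below_mean(window),
            longest_strike_above_mean(window), longest_strike_below_mean(window),
            binned_entropy(window), mean_abs_change(window), mean_change(window)]
print(features)
```

Each window is thus reduced to a short numeric vector, which is what the random forest models receive as input.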
6.3 Training and model optimization
For training our classifier we need to have a ground truth of sensor signals from a single car. These certified signals then can be compared to the candidate signals from other cars to find the best match. We chose the Opel Astra 2018 as our reference, as we had the most drives logged from this car.
Velocity versus GPS. We recorded GPS coordinates for all drives with the Opel Astra 2018. Setting the Android GPS Logger app to the highest accuracy (complemented by cell tower information achieving an accuracy of 3 meters) and saving the coordinates every second, we ended up with a time series of locations. Using the timestamps, GPS time series also determines the mean velocity between neighboring locations, producing a velocity time series. Intuitively, the GPS based velocity is very close to the one recorded from the CAN bus.
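A GPS-based velocity series of this kind can be derived as sketched below. The haversine great-circle formula, the Earth radius constant, the function names and the two example fixes are assumptions made for illustration; the text does not state how distances between neighboring locations were computed.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def gps_velocity_kmh(fixes):
    """Mean speed between neighboring fixes.

    `fixes` is a list of (unix_time_s, lat_deg, lon_deg) tuples.
    """
    speeds = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(fixes, fixes[1:]):
        meters = haversine_m(la0, lo0, la1, lo1)
        speeds.append(meters / (t1 - t0) * 3.6)  # m/s to km/h
    return speeds


# Two made-up fixes one second apart, 0.0001 degrees of latitude apart.
fixes = [(1497323915.0, 47.4979, 19.0402), (1497323916.0, 47.4980, 19.0402)]
speeds = gps_velocity_kmh(fixes)
print(round(speeds[0], 1))
```

A latitude difference of 0.0001 degrees is roughly 11 meters, so at one fix per second a position error of a few meters already changes the computed speed by several km/h.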
In order to test this hypothesis we applied the Dynamic Time Warping (DTW) algorithm Salvador and Chan (2007). DTW belongs to the family of time series classification algorithms Bagnall et al. (2016), whose important characteristic is that there may be discriminatory features dependent on the ordering of the time series values Geurts (2001). A distance measure between time series is needed to determine their similarity and to classify them. Euclidean distance is an efficient distance measure that can be used. The Euclidean distance between two time series is simply the square root of the sum of the squared differences between each point in one time series and the corresponding point in the other. The main disadvantage of using Euclidean distance for time series data is that its results can be very unintuitive. If two time series are identical, but one is shifted slightly along the time axis, then Euclidean distance may consider them to be very different from each other. DTW was introduced to overcome this limitation and give intuitive distance measurements between time series by ignoring both global and local shifts in the time dimension. DTW finds the optimal alignment between two time series when one of them may be warped non-linearly by stretching or shrinking it along its time axis.
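The difference between the two distance measures can be seen on a small example. The sketch below uses the classic quadratic-time dynamic programming formulation of DTW with absolute difference as the local cost; it is not the accelerated variant of Salvador and Chan (2007), and the two series are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]  # same shape, shifted one step earlier
print(euclidean(a, b), dtw_distance(a, b))
```

For these two series the Euclidean distance is 2.0 while the DTW distance is 0.0, because warping the time axis aligns the two identical bumps.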
Before running DTW we excluded the outliers from the GPS-based velocity series. These points are the result of GPS measurement error and materialize in extreme differences between two neighboring velocity values (we used 30 km/h as a limit). We then ran DTW with the GPS-based velocity values against all other sensor candidates of the CAN log. As the result of the DTW algorithm is a distance between two series, the smallest distance yields the best match: in every case it was indeed the same ID by a wide margin (see Table 3). We used manual physical tryouts to corroborate that this ID indeed corresponds to velocity.
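One simple way to implement such an outlier filter is sketched below. The exact rule is an assumption: a value is dropped when it differs from the last kept value by more than the limit, and the first value is trusted. The function name and sample values are hypothetical.

```python
def drop_velocity_outliers(speeds_kmh, max_jump=30.0):
    """Remove values that jump more than `max_jump` km/h from the last kept one.

    Assumes the first value of the series is valid.
    """
    kept = [speeds_kmh[0]]
    for v in speeds_kmh[1:]:
        if abs(v - kept[-1]) <= max_jump:
            kept.append(v)
    return kept


cleaned = drop_velocity_outliers([40.0, 42.0, 120.0, 44.0, 45.0])
print(cleaned)
```

Comparing against the last kept value, and not the previous raw value, prevents a single spike from also removing the valid sample that follows it.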
Brake vs. accelerator: pedal position. Extracting the brake and the accelerator pedal positions required a different approach. In a normal vehicle the accelerator and the brake pedal are not pressed at the same time because it contradicts a driver's normal behaviour (excluding race car drivers). Consequently, to extract the accelerator and the brake pedal positions one only has to search for a pair of signals that are almost exclusive to each other. To this end, we compared all pairs of ID byte subseries from multiple drives and listed the candidates that fit the description. Figure 3(a) shows the correct result and Figure 3(b) shows a false candidate. False results were easy to exclude because of their characteristics; in this example it is clear that a piece-wise constant signal cannot possibly signify a pedal position. Finally, we used manual physical tryouts to corroborate that these IDs indeed correspond to the brake and accelerator pedal positions, respectively. Note that older vehicles can have a binary brake (and clutch) signal, as there is no corresponding sensor signal in them.
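The search for mutually exclusive signal pairs can be sketched as follows. The scoring rule, the 0.95 threshold, the assumption that a released pedal reads 0, and all names and values are illustrative choices and not the authors' exact procedure.

```python
from itertools import combinations


def exclusivity(a, b):
    """Fraction of samples in which at most one of the two signals is active."""
    both = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    return 1.0 - both / min(len(a), len(b))


def exclusive_pairs(candidates, min_score=0.95):
    """List candidate pairs that are (almost) never active at the same time."""
    pairs = []
    for (ka, a), (kb, b) in combinations(candidates.items(), 2):
        if max(a) <= 0 or max(b) <= 0:
            continue  # a series that is never active is trivially exclusive
        score = exclusivity(a, b)
        if score >= min_score:
            pairs.append((ka, kb, score))
    return sorted(pairs, key=lambda p: -p[2])


# Made-up candidate series sampled at the same instants.
candidates = {
    "cand_a": [0, 30, 60, 0, 0, 0],     # behaves like an accelerator pedal
    "cand_b": [0, 0, 0, 40, 80, 0],     # behaves like a brake pedal
    "cand_c": [8, 20, 30, 25, 15, 9],   # always active, e.g. engine speed
}
pairs = exclusive_pairs(candidates)
print(pairs)
```

A signal that is always active, such as engine speed, overlaps with both pedals and is therefore excluded, while the two pedal-like series form the only exclusive pair.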
Clutch vs. RPM vs. velocity. The clutch pedal position also has very typical characteristics, especially when compared with the velocity and RPM values. When accelerating from 0 km/h, drivers usually change gears quickly, thus the changes in the RPM and clutch pedal position are easy to detect. Upon gear change the RPM drops, then rises as the car accelerates, then drops and rises again until the desired gear and velocity are reached. During the same time the clutch pedal is pushed every time just before the gear is changed. Moreover, the clutch pedal tends to be used in a very typical way: when the driver releases the clutch there is a slight slip around the middle position of the pedal, indicating that the shafts start to connect. (Note that the length of this slip is characteristic of the car model, its condition (e.g., bad clutch) and driver experience.) Applying this common knowledge we searched for a pair of signals with one of them having a sharp spike (RPM) and the other a small platform (slipping clutch) around the same time. We narrowed our search to cases when the vehicle accelerated from zero to at most 50 km/h. In Figure 4 we can see these signal characteristics compared to each other. We managed to find the clutch pedal position and RPM signals based on the above. As before, we validated our findings with manual physical tryouts.
Optimization. After extracting the ground truth signals, we calculated the feature vectors and trained a random forest classifier for each extracted signal: velocity, brake pedal position, accelerator pedal position, clutch pedal position and engine RPM. For parameter optimization and testing we tested our model on logs from the same car, but driven by another driver on another route.
6.4.1 Signal extraction
Our random forest classifiers used for signal extraction are trained on the CAN logs of a base car (here an Opel Astra'18) where the location of a target signal is known. The classifiers take statistical data vectors as inputs with the 15 statistical features (see Section 6.2) extracted from each sample (window). In particular, we train a random forest classifier to distinguish a target signal from all other signals, where the positive training samples are composed of the windows of the time series corresponding to the target signal, whereas negative samples are taken from other signals' time series. Hence, we obtain a classifier per target signal. Recall that signal locations are computed using the techniques described in Section 6.3. We apply each trained classifier on all the samples (windows) of all time series in another (target) car where we want to locate the corresponding target signals. For every classifier, we obtain a classification for each window of each time series in the target car. The time series which receives the largest number of votes (i.e., has the most windows classified as positive) will be the matched signal, i.e., the signal which is the most similar to the target signal.
Best results were obtained using the following parameter settings: the length of windows is set to 2.5 seconds, which is sufficiently large to capture different driver reactions (one can accelerate from 0 to even 30 km/h or can hit the brakes and stop the vehicle). The sampled logs are at least 30 minutes long, the overlap parameter is set to 25%. The pruning parameter is set to 7, i.e. a sample was excluded when its variation is less than 7.
Each trained random forest classifier is tested against samples from logs of all other cars except the base car, and the logs were pre-processed as described in Section 5.2. The matching performed by a classifier is validated by manually extracting the ground truth sensor signal from the target car as described in Section 6.3. In order to measure the accuracy of our classifier, we report the rank of the true signal; each candidate signal is ranked according to the number of votes (i.e., positive classifications) it receives, i.e., the signal with the most votes is ranked first.
Table 4 shows the results of signal extraction using only 30 minutes of data for training and also 30 minutes for matching (testing), where training was performed on CAN logs obtained from our base car (i.e., Opel Astra'18). Three signals (RPM, velocity and accelerator pedal position) are all ranked in the first place (i.e., received the highest number of votes in the classifier), that is, our approach successfully identified all three signals in all the target cars. Note that we did not extract the clutch and the brake pedal position signals, as during the validation we realized that these signals do not even exist in most of the cars in our database.
We also report the precision (TP/(TP+FP)) and recall (TP/(TP+FN)) of the classifier (where TP = true positive, FP = false positive and FN = false negative). Precision represents how many positive classifications are correct over all samples of all time series in the CAN log, and recall represents how many samples of the true matching signal are correctly recognized by the classifier. We also compute and report the gap, which is the difference between the number of votes (i.e., positive classifications) of the highest ranked true signal and that of the highest ranked false signal, divided by the total number of votes. For example, if the highest ranked true signal received 50% of the votes and the highest ranked false signal received 20%, then the gap equals 0.5 - 0.2 = 0.3. Note that in most cars most sensors appear under several IDs in the log. As a result, several candidates with very high vote counts can all be true positives, so the top ranks can all be occupied by true positives and the highest ranked false signal drops to the fourth, fifth or an even lower place.
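The three quantities can be computed from per-series vote counts as in the following sketch. The function name `vote_metrics`, the series labels and the window counts are hypothetical; the sketch also assumes that the ground truth is given as a set of series identifiers, since a sensor may appear under several IDs.

```python
from typing import Dict, Set, Tuple


def vote_metrics(
    votes: Dict[str, int],
    windows: Dict[str, int],
    true_ids: Set[str],
) -> Tuple[float, float, float]:
    """Compute precision, recall and gap from per-series vote counts.

    votes[s]   -- windows of series s classified as positive
    windows[s] -- total number of windows of series s
    true_ids   -- series known from the ground truth to carry the target signal
    """
    total_votes = sum(votes.values())
    tp = sum(v for s, v in votes.items() if s in true_ids)
    fp = total_votes - tp                        # positives on wrong series
    fn = sum(windows[s] for s in true_ids) - tp  # missed true windows
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    best_true = max((v for s, v in votes.items() if s in true_ids), default=0)
    best_false = max((v for s, v in votes.items() if s not in true_ids), default=0)
    gap = (best_true - best_false) / total_votes if total_votes else 0.0
    return precision, recall, gap


# Toy counts mirroring the example in the text: the true signal holds 50% of
# all votes and the best false candidate holds 20%.
votes = {"true_sig": 50, "false_1": 20, "false_2": 15, "false_3": 15}
windows = {"true_sig": 80, "false_1": 80, "false_2": 80, "false_3": 80}
precision, recall, gap = vote_metrics(votes, windows, {"true_sig"})
```

For these toy counts the precision is 50/100, the recall is 50/80, and the gap is (50 - 20)/100, in line with the worked example.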
6.4.2 Driver re-identification
Next we use the extracted signals in a driver re-identification scenario. We use the same preprocessing as in Section 5.2 and the same parameter settings as in Section 6.3, except that we do not use all 15 features; only 11 of them are chosen, based on their importances: count above mean, count below mean, longest strike above mean, longest strike below mean, maximum, mean, mean abs change, median, minimum, standard deviation, variance. We used four extracted signals: accelerator and brake pedal positions, velocity and RPM. The feature vector of a driver therefore consists of 44 features altogether (11 features for each of the 4 signals). All drivers used the same car, an Opel Astra’18, to produce CAN logs. The samples were divided into a training and a testing set, which made up 90% and 10% of all samples, respectively. We used 10-fold cross-validation to evaluate our approach. We selected 5 drivers uniformly at random and built a binary classifier for each pair of drivers. Our classifier achieved 77% precision on average (each model was evaluated 10 times). The worst result was just under 70% and the best result was 87%.
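The structure of this evaluation can be sketched as below. The sketch only shows the bookkeeping (feature vector size, driver pairs, 90%/10% folds) and leaves out the classifier itself; the driver labels, the sample count and the helper names are hypothetical, and the interleaved fold assignment is one simple choice among many, not necessarily the split used in the experiments.

```python
from itertools import combinations
from typing import Iterator, List, Tuple

SIGNALS = ["accelerator pedal position", "brake pedal position", "velocity", "RPM"]
FEATURES = [
    "count above mean", "count below mean",
    "longest strike above mean", "longest strike below mean",
    "maximum", "mean", "mean abs change", "median", "minimum",
    "standard deviation", "variance",
]
N_FEATURES = len(SIGNALS) * len(FEATURES)  # 4 signals x 11 features


def driver_pairs(drivers: List[str]) -> List[Tuple[str, str]]:
    """All unordered driver pairs; one binary classifier is built per pair."""
    return list(combinations(drivers, 2))


def k_fold_indices(n_samples: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each fold holds out about 1/k of the samples."""
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]  # every k-th sample, offset by the fold number
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


pairs = driver_pairs(["d1", "d2", "d3", "d4", "d5"])  # hypothetical driver labels
folds = list(k_fold_indices(n_samples=200, k=10))     # hypothetical sample count
```

Five drivers give ten pairs, and each of the ten folds trains on 90% of the samples and tests on the remaining 10%.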
7 Conclusion and Future Work
We described a technique to extract signals from vehicles’ CAN logs. Our approach relies on using unique statistical features of signals which remain mostly unchanged even between different types of cars, and hence can be used to locate the signals in the CAN log. We demonstrated that the extracted signals can be used to effectively identify drivers in a dataset of 33 drivers. Although our results need to be evaluated on a larger and more diverse dataset, our findings show that driver re-identification can be performed without the nuisance of signal extraction or agreements with a manufacturer. This means that not revealing the exact signal location in CAN logs is not sufficient to provide any privacy guarantee in practice. Car companies should devise more principled (perhaps cryptographic) approaches to hide signals, and/or to anonymize their CAN logs so that drivers cannot be re-identified.
This work has been partially funded by the European Social Fund via the project EFOP-3.6.2-16-2017-00002, by the European Commission via the H2020-ECSEL-2017 project SECREDAS (Grant Agreement no. 783119) and the Higher Education Excellence Program of the Ministry of Human Capacities in the frame of Artificial Intelligence research area of Budapest University of Technology and Economics (BME FIKP-MI/FM). Gergely Acs has been supported by the Premium Post Doctorate Research Grant of the Hungarian Academy of Sciences (MTA). Gergely Biczók has been supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences.
- Bagnall et al. (2016) Bagnall, A., Bostrom, A., Large, J., and Lines, J. (2016). The great time series classification bake off: An experimental evaluation of recently proposed algorithms. Extended version. arXiv preprint arXiv:1602.01711.
- Enev et al. (2016) Enev, M., Takakuwa, A., Koscher, K., and Kohno, T. (2016). Automobile driver fingerprinting. Proceedings on Privacy Enhancing Technologies, 2016(1):34–50.
- European Union (2016a) European Union (2016a). Recital 26 of Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), OJ L119, 4.5.2016. http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX%3A32016R0679.
- European Union (2016b) European Union (2016b). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union, L119:1–88.
- Fugiglando et al. (2018) Fugiglando, U., Massaro, E., Santi, P., Milardo, S., Abida, K., Stahlmann, R., Netter, F., and Ratti, C. (2018). Driving behavior analysis through CAN bus data in an uncontrolled environment. IEEE Transactions on Intelligent Transportation Systems, (99).
- Fugiglando et al. (2017) Fugiglando, U., Santi, P., Milardo, S., Abida, K., and Ratti, C. (2017). Characterizing the driver DNA through CAN bus data analysis. In Proceedings of the 2nd ACM International Workshop on Smart, Autonomous, and Connected Vehicular Systems and Services, pages 37–41. ACM.
- Geurts (2001) Geurts, P. (2001). Pattern extraction for time series classification. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 115–127. Springer.
- Hallac et al. (2016) Hallac, D., Sharang, A., Stahlmann, R., Lamprecht, A., Huber, M., Roehder, M., Leskovec, J., et al. (2016). Driver identification using automobile sensor data from a single turn. In Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference on, pages 953–958. IEEE.
- Markovitz and Wool (2017) Markovitz, M. and Wool, A. (2017). Field classification, modeling and anomaly detection in unknown CAN bus networks. Vehicular Communications, 9:43–52.
- Miyajima et al. (2007) Miyajima, C., Nishiwaki, Y., Ozawa, K., Wakita, T., Itou, K., Takeda, K., and Itakura, F. (2007). Driver modeling based on driving behavior and its evaluation in driver identification. Proceedings of the IEEE, 95(2):427–437.
- Salvador and Chan (2007) Salvador, S. and Chan, P. (2007). Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5):561–580.
- Sija et al. (2018) Sija, B. D., Goo, Y.-H., Shim, K.-S., Hasanova, H., and Kim, M.-S. (2018). A survey of automatic protocol reverse engineering approaches, methods, and tools on the inputs and outputs view. Security and Communication Networks, 2018.
- Szalay et al. (2015) Szalay, Z., Kánya, Z., Lengyel, L., Ekler, P., Ujj, T., Balogh, T., and Charaf, H. (2015). ICT in road vehicles—reliable vehicle sensor information from OBD versus CAN. In Models and Technologies for Intelligent Transportation Systems (MT-ITS), 2015 International Conference on, pages 469–476. IEEE.
- Voss (2008) Voss, W. (2008). A comprehensible guide to controller area network. Copperhill Media.