I Introduction & Background
Modern vehicles rely on dozens to more than a hundred electronic control units (ECUs), embedded computers that send periodic messages to orchestrate sub-component functionality, including life-critical services such as brakes. ECUs broadcast messages over a Controller Area Network (CAN), which defines a lightweight protocol that is efficient and functionally sound but lacks security measures such as authentication and encryption. After-market efforts involving modern vehicles, e.g., performance tuning or adding security measures, usually require the ability to interact with and understand the data sent over this in-vehicle network.
The CAN 2.0 specification defines aspects of the physical and data link layers, particularly the CAN frame format, which is standard across all implementations and is publicly available [2]. CAN frames (depicted in Figure 1) have several components, but only two are needed to understand a frame: the arbitration ID (AID), which identifies the packet and determines its priority, and the data field, which contains up to 64 bits of message contents.
However, unlike the open specification of the physical and data link layers, the full specification of how to decode the data field is completely proprietary; it is held secret by the original equipment manufacturer (OEM) and varies per make, model, year, and trim. This proprietary information is generally stored in a .dbc file format (database for CAN). DBCs contain (1) signal definitions, which describe the following: the segment position (start and end bit indices); the binary-to-decimal encoding scheme (signed vs. unsigned, little vs. big endian); and the conversion information for translating the decimal into a meaningful physical value (offset and scale factor, units, and range of possible values). The DBC also includes (2) message timing attributes with information such as transmission frequency (how often a message with a particular AID is sent), whether this rate is constant or triggered by an event, etc., and (3) the ECU that sent the message. Nearly all after-market modification of, or research on, passenger vehicles critically relies on reverse engineering some of the information held in the DBC. For example, much of current CAN intrusion detection research is based on data-driven approaches to determine (2) AID message timing (e.g., [4, 5]) or (3) ECU identification (e.g., [3, 6]), by leveraging physically observable characteristics, namely message timestamps and voltage (physical layer).
I-A Problem Statement
However, reverse engineering (1) the signal definitions is a significantly more difficult problem than (2) and (3). One can monitor and send CAN packets, but understanding how to translate the data field information to what it encodes is not currently possible. An example of a fully defined message with signals shown in color is depicted in Figure 2, which was produced by software for editing DBCs.
The problem of defining a group of signals held in a 64-bit data field is twofold:
Tokenization: segmenting message contents into tokens, contiguous sequences of bits, that constitute a single signal (determining start/end bit, binary encoding scheme)
Translation: converting the token into a number and understanding its meaning in terms of the vehicle’s function, e.g. front left wheel speed, brake light on, etc. (determining offset, scale factor, units, value range)
Thus, a signal is a token paired with a meaningful mapping, or translation.
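To make the two sub-problems concrete, the following minimal Python sketch shows how a known signal definition would be applied to a 64-bit data field. The `SignalDef` class, its field names, and the example values are hypothetical and chosen only for illustration. Bit index 0 is assumed to be the first bit of the data field, and the little-endian case follows the bit-level convention used later in this paper (the end bit is the most significant bit).

```python
from dataclasses import dataclass


@dataclass
class SignalDef:
    """Hypothetical container for one DBC-style signal definition."""
    start_bit: int     # inclusive; 0 is the first bit of the data field
    end_bit: int       # inclusive
    big_endian: bool   # True: start bit is the most significant bit
    scale: float       # multiplicative factor of the linear conversion
    offset: float      # additive term of the linear conversion
    unit: str


def decode(payload: int, sig: SignalDef, width: int = 64) -> float:
    """Tokenize the payload, then translate the token to a physical value."""
    bits = format(payload, f"0{width}b")            # fixed-width bit string
    token = bits[sig.start_bit:sig.end_bit + 1]     # tokenization step
    if not sig.big_endian:
        token = token[::-1]                         # end bit becomes most significant
    raw = int(token, 2)                             # binary-to-decimal encoding
    return sig.scale * raw + sig.offset             # translation step


# Illustrative payload whose first 16 bits hold the integer 4000 (0x0FA0).
payload = 0x0FA0 << 48
example = SignalDef(start_bit=0, end_bit=15, big_endian=True,
                    scale=0.25, offset=0.0, unit="rpm")
value = decode(payload, example)
```

In this picture, tokenization amounts to recovering `start_bit`, `end_bit`, and `big_endian`, while translation amounts to recovering `scale`, `offset`, and `unit`.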
I-B Vehicle Information & Diagnostics
While tokenized and translated CAN signals are not readily available for passenger vehicles, other ground-truth vehicle information has been leveraged. Some research opts for hand-labeled drive states (e.g., accelerate, reverse, key in, speedometer reading of 20 mph) [7, 8, 9], while others used external data loggers.
One particularly good source of data is automotive diagnostic data. Vehicles manufactured in 2008 or later have an on-board diagnostic (OBD-II) port, which allows open access to automotive networks and is mandated for state-wide emission testing and diagnostics by the J1979 standard. Automotive diagnostics exist separately as an application layer for CAN implementations; for diagnostic actions, the application layer is called Unified Diagnostic Services (UDS). UDS is a request-response system in which ECUs respond to interrogation regarding a variety of vehicular states. One can query for data such as engine speed, wheel speed, and oxygen sensor readings, each corresponding to a particular Diagnostic OBD-II PID (DID), for which units, ranges, and conversion formulas are public [11]. Importantly, these diagnostic signals exist in addition to the normal CAN traffic that the vehicle uses for critical functions, although both are seen in the same data stream. Further, this serves as a reliable way to obtain ground-truth data without the need for exogenous data streams.
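As an illustration of the request-response pattern, the sketch below builds a diagnostic request payload and decodes a response using the conversion formulas listed in public OBD-II PID tables for engine RPM (PID 0x0C) and vehicle speed (PID 0x0D). It assumes the commonly documented "show current data" layout (service 0x01, positive response 0x41); the zero padding bytes and the function names are assumptions made for this example, and no CAN hardware access is shown.

```python
# Conversion formulas from public OBD-II PID tables; A and B are the first
# and second data bytes of the response.
FORMULAS = {
    0x0C: ("Engine RPM", "rpm", lambda a, b: (256 * a + b) / 4),
    0x0D: ("Vehicle speed", "km/h", lambda a, b: float(a)),
}


def build_request(pid: int) -> bytes:
    """8-byte request payload: length, service 0x01, PID, then padding.

    The zero padding is an assumption; other padding values are also seen.
    """
    return bytes([0x02, 0x01, pid]) + bytes(5)


def decode_response(data: bytes):
    """Decode a response payload into (name, value, unit)."""
    length, service, pid = data[0], data[1], data[2]
    if service != 0x41:                 # 0x41 = positive response to service 0x01
        raise ValueError("not a 'show current data' response")
    a = data[3]
    b = data[4] if length >= 4 else 0   # second data byte only if present
    name, unit, formula = FORMULAS[pid]
    return name, formula(a, b), unit


request = build_request(0x0C)
rpm = decode_response(bytes([0x04, 0x41, 0x0C, 0x1A, 0xF8, 0, 0, 0]))
speed = decode_response(bytes([0x03, 0x41, 0x0D, 0x3C, 0, 0, 0, 0]))
```

Each decoded response, paired with its receive time, yields one point of a diagnostic trace that can serve as ground truth.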
I-C Related Works
Research is emerging on signal extraction from passenger vehicle CAN data fields. Related works either attempt unsupervised tokenization of CAN data frames or leverage diagnostic data, but not directly for extracting CAN signals.
Markovitz & Wool [12] work toward anomaly detection on CAN data, including a method to tokenize CAN data and provide high-level categorization of these tokens (akin to part-of-speech tagging of tokenized text data). Focusing on reverse engineering automotive CAN data, Marchetti & Stabili [13] refine the algorithm of Markovitz and Wool [12], define the semantic categories more rigorously, and test the method's results against a DBC they acquired. Neither work gives explicit mappings of CAN message segments to their meanings, but both provide an unsupervised method for tokenization and semantic categorization. Concurrent work by Nolan et al. [14] develops a similar method for unsupervised tokenization of CAN data frames, based on bit-flip probabilities. The elegant method simply partitions the 64-bit message frames into appropriately sized segments.
Similar to our approach, previous works have focused on leveraging UDS data as ground truth for analysis, e.g., [15, 16, 17], although they did not address the explicit problem of CAN data interpretation. Li et al. [17] presented an intrusion detection system (IDS) that used a regression model to learn relationships between physical values, such as vehicle speed, and raw CAN data, whereas Wasicek et al. [16] develop an IDS based solely on anomalies in diagnostic data correlations. Neither addresses the problem of actual semantic analysis, and both would require UDS commands to be sent continually during training and IDS deployment.
Huybrechts et al. [10] is the only previous work that applies UDS annotations to CAN data translation. They developed an “arithmetic method” that attempts to label sections of data fields based on similarity to simultaneously collected diagnostic data. However, they did not address tokenization, instead considering segmentation by one- or two-byte tokens (a shortcoming of their method that they acknowledge), and they provide no explicit linear mapping to translate segments to real physical values.
We provide a workflow for collecting CAN data alongside available diagnostic information and a novel algorithm, ACTT: Automotive CAN Tokenization and Translation, the first algorithm to bring together these previously separate streams of research. The contributions are as follows:
ACTT uses diagnostic labels to learn CAN signal definitions, including the parameters needed for tokenization (start/end bit, endianness) and translation (offset, scale factor, units, value range).
ACTT furnishes goodness-of-fit scores allowing visibility into what diagnostic codes are directly encoded in CAN data, and allowing discovery of CAN signals that are related but not directly accessible via diagnostic codes (e.g., accelerator depressed indicator). Furthermore, the scoring permits quantifying the percent of the CAN data field translated.
ACTT provides a tunable parameter that at one extreme forces extraction of only near-perfect diagnostic signals, while at the other it provides less exact matches clustered by their correlations.
Our work will also aid other streams of related research, providing a preprocessing step and allowing them to refine and more rigorously test their methods. Due to the difficulty of extracting CAN signals, many current vehicle research efforts have avoided fully and explicitly determining these signals. One workaround was to manually reverse engineer a few single-function signals in CAN data in order to provide proof of concept. Others implemented machine learning methods (Hidden Markov Models, Neural Nets, Manifold Learning, etc.) that implicitly learn relationships between raw binary data and vehicular states [18, 7, 8, 9, 10, 16, 17].
The features used in these machine learning methods were often based on unprincipled decisions regarding tokenization, e.g., considering each byte pair in the data field to be a “signal”. Additionally, supervised methods required vehicle states that were often hand labeled, or relied on other exogenous data sets. Our work therefore provides a preprocessing step for these methods that previously employed these brute force reverse engineering techniques, brittle tokenization schemes, or manual labeling of vehicle states.
For after-market tools, knowledge of signal definitions has been shown to be very beneficial in various streams of research. Heavy-duty vehicles’ CANs follow the J1939 standard [19], which is not proprietary and acts like a standardized DBC for all trucks. This greatly facilitates rapid development of new features that can be vehicle agnostic. An example of cross-platform integration made possible by the J1939 standard is the Bendix Wingman Advanced, which brings adaptive cruise control with braking features along with collision mitigation technology to a variety of trucks (www.bendix.com/en/products/acb/wingmanadvanced_1.jsp). For passenger vehicles, Ford has developed an open source API called OpenXC, which includes a small sample of signal definitions. This has resulted in a wealth of add-on tools created by individuals, such as ‘OpenXCThenThat’, a vehicular task automator, and ‘Smart Battery App’, an EV battery optimizer based on terrain, both created during the Ford Electrified Vehicle Hackathon, an event meant to show off the potential of OpenXC (http://openxcplatform.com/).
Overall, the inability to comprehend passenger vehicle CAN data fields severely limits the range and effectiveness of after-market vehicle research and engineering. Finding vehicle-agnostic methods for syntactic and semantic understanding of CAN data fields promises a wealth of opportunity for after-market development.
We assume we have a CAN capture from a vehicle during a sufficiently long driving period to exercise most variation in the CAN data. Here we define the notation for representing the 64-bit payloads for an AID over time, and then our algorithm for tokenization and translation.
Let be an AID trace, a sequence of time-ordered 64-bit messages from the same AID, where denotes the bit in the th message of the sequence, and messages occur at times .
Let be a Diagnostic trace, a sequence of integer responses from the same DID, where messages occur at times . Note that unlike for AID traces, whose lengths can differ significantly based on priority and transmission rates, all diagnostic traces should be about length . We note that we do not include constant diagnostic responses, that is DID traces s.t. are not considered.
For , and , we define the little-endian (ending bit, , is most significant bit) and big-endian (starting bit, , is most significant bit) integer encodings of the bit subsequence as, respectively,
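As an illustration of the two encodings just described, the following minimal sketch assumes a message is represented as a list of 0/1 values indexed from the start of the data field and that start and end indices are inclusive. The function names are hypothetical.

```python
def big_endian_value(bits):
    """Starting bit is the most significant bit."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def little_endian_value(bits):
    """Ending bit is the most significant bit (bit order reversed)."""
    return big_endian_value(bits[::-1])


def token_values(message, start, end):
    """Return (little-endian, big-endian) integers for bits start..end inclusive."""
    token = message[start:end + 1]
    return little_endian_value(token), big_endian_value(token)


# Illustrative 64-bit message with the pattern 1, 0, 1, 0 in bits 8..11.
message = [0] * 64
message[8:12] = [1, 0, 1, 0]
le, be = token_values(message, 8, 11)
```

Applying `token_values` to every message of an AID trace gives the two integer sequences that are later compared against the diagnostic traces.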
II-A Token Preprocessing
We first examine the bits in each AID trace and categorize each as a constant 1s bit, a constant 0s bit, or a ‘used’ bit. We note that it is impossible to differentiate between a bit that is defined to be held constant as a buffer between signals (an unused bit) and a bit that is simply unchanged during data collection due to a state not being reached in the CAN capture under investigation. Given these three sets of bit positions, we can then determine the set of possible valid token boundaries (see Algorithm 1). We consider valid tokens to be any contiguous set of bits that does not include a constant bit. Note that we have defined start and end bit indices to be inclusive (i.e., equal start and end indices indicate a 1-bit token).
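The preprocessing step can be sketched as follows. This is one plausible realization of the description above, not necessarily identical to Algorithm 1; the function names and the list-of-bit-lists representation of an AID trace are assumptions made for this example.

```python
def classify_bits(trace, width=64):
    """Split bit positions into constant-1, constant-0, and 'used' sets.

    trace: non-empty list of messages, each a sequence of 0/1 of length width.
    """
    if not trace:
        raise ValueError("trace must contain at least one message")
    ones, zeros, used = set(), set(), set()
    for i in range(width):
        seen = {msg[i] for msg in trace}   # values observed at bit position i
        if seen == {1}:
            ones.add(i)
        elif seen == {0}:
            zeros.add(i)
        else:
            used.add(i)                    # bit flipped at least once
    return ones, zeros, used


def valid_tokens(used, width=64):
    """All inclusive (start, end) pairs whose bits are all 'used' bits."""
    tokens = []
    for start in range(width):
        end = start
        while end < width and end in used:
            tokens.append((start, end))
            end += 1
    return tokens


# Tiny 4-bit illustration: bit 0 is constant 0, bit 3 is constant 1.
trace = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 1]]
ones, zeros, used = classify_bits(trace, width=4)
tokens = valid_tokens(used, width=4)
```

Enumerating every valid (start, end) pair is quadratic in the length of each run of used bits, which stays small for a 64-bit data field.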
II-B Diagnostic Matching
We next determine whether any valid token of an AID’s 64-bit data field is related to a diagnostic response message in Algorithm 2 by converting a time-varying sequence of bit strings to a sequence of integers, then regressing to see if they linearly fit any time-varying diagnostic sequence collected.
The first step is to determine the integer translation for each valid token’s bit string (from Algorithm 1, which returns the set of possible start and end bits of a non-constant token). We consider both little- and big-endian unsigned encodings, but do not consider any alternative encodings at this time (e.g., signed binary, one’s complement, two’s complement). Specifically, we convert the token trace (a sequence with each element a vector of bits) to two sequences of integers using Equations 1 & 2.
For each of these endian encodings of the token trace, we use linear regression to find constants transforming it to each diagnostic trace; of course, we first interpolate the token trace to the diagnostic time points. The regression furnishes the coefficients that result in the best linear fit to the diagnostic trace. Note that we choose to interpolate over the diagnostic points because the diagnostic data is sampled at a much lower rate than AID messages occurring in normal CAN traffic. We then score each model using the coefficient of determination, $R^2$ (see Algorithm 2 for formula), and add the token to our match set if the score exceeds a set threshold. Recall that $R^2 \le 1$, with $R^2 = 1$ indicating a perfect fit and $R^2 = 0$ indicating the fit of a horizontal line.
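The matching step for a single token encoding and a single diagnostic trace can be sketched in pure Python as below. This is a minimal illustration under stated assumptions rather than a reproduction of Algorithm 2: timestamps are assumed strictly increasing, interpolation is linear and clamped at the ends of the token trace, and all function names and example values are hypothetical.

```python
from bisect import bisect_left


def interpolate(times, values, query_times):
    """Linearly interpolate (times, values) at query_times, clamping at the ends."""
    out = []
    for t in query_times:
        k = bisect_left(times, t)
        if k == 0:
            out.append(float(values[0]))
        elif k == len(times):
            out.append(float(values[-1]))
        else:
            t0, t1 = times[k - 1], times[k]
            w = (t - t0) / (t1 - t0)
            out.append(values[k - 1] + w * (values[k] - values[k - 1]))
    return out


def fit_and_score(x, y):
    """Least-squares fit y ~ slope * x + intercept; returns (slope, intercept, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:               # a constant trace carries no information
        return 0.0, my, 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return slope, intercept, 1.0 - ss_res / syy


# Illustrative traces: the diagnostic value is 0.25 * token + 10.
token_times, token_vals = [0, 1, 2, 3, 4], [0, 4, 8, 12, 16]
diag_times, diag_vals = [0.5, 1.5, 2.5, 3.5], [10.5, 11.5, 12.5, 13.5]
x = interpolate(token_times, token_vals, diag_times)
slope, intercept, r2 = fit_and_score(x, diag_vals)
is_match = r2 >= 0.9    # 0.9 is an arbitrary illustrative threshold
```

Repeating this for every valid token, both encodings, and every diagnostic trace produces the match set, with the slope and intercept retained as the learned translation.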
There are several important things to note about the matching algorithm. Firstly, the goodness of fit for each model gives more than just a score of how well the model performs; it also indicates whether the token actually encodes the diagnostic signal. In some cases it indicates that the CAN token is not an exact match, but does encode a correlated signal, e.g., we have identified 1-byte tokens with high correlation to a continuously changing DID, presumably a binary indicator. Hence the threshold can be tightened to isolate near-perfect encodings, or loosened to discover multiple signals related to each DID.
Secondly, we get both the tokenization (AID, start and end bits) and the actual mapping (regression coefficients, endianness) needed to translate the binary message to the decimal value. Note that in order to get the true value measuring a physical signal, including units, we must apply the conversion for the matched diagnostic response, a formula accessible for these standardized DIDs [11]. We note that changing the threshold can change the token boundaries, because the algorithm simultaneously learns both tokenization and translation.
Thirdly, the scoring mechanism is quite flexible. It can be easily altered to consider other binary encodings, more flexible regression, or other time-varying observations in addition to the DIDs.
II-C Message Packing
We now have candidate tokens and their mappings, but these learned tokens may overlap (e.g., the token comprising bits 2 to 8 maps to a DID with a high fit and the token comprising bits 1 to 4 maps to a different DID with a high fit), leaving the problem of deciding which to choose for each AID. The final section of the algorithm approaches this problem by determining the optimal packing of matched tokens for each AID data field. The goal is essentially to create as full a DBC data field description as possible (as seen in Figure 2), choosing the tokens with high goodness of fit to a DID and preferring longer tokens to shorter ones. For a given match set, we scale the coefficient of determination, $R^2$, by the token length and then optimize the sum of these scores over non-overlapping matched tokens (see Algorithm 3). While we do not provide the full algorithm, we note that this does not require an exponential-time algorithm: the globally optimal solution can be found in polynomial time using dynamic programming (see [20] for the full weighted interval scheduling dynamic programming algorithm).
We refer to the maximum of the sum in Algorithm 3 as the message packing score, which takes values in [0, 1]. Note that if all 64 bits are used by tokens matched with perfect fit ($R^2 = 1$), then the score is 1. Observing this score relative to the percent of non-constant or ‘used’ bits gives a measure of how successful the tokenization and translation process is. Examples are provided in the Results section.
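The packing step is an instance of weighted interval scheduling, which can be sketched as below. This is a standard dynamic program written for illustration, not the authors' Algorithm 3; the weight `r2 * length / width` is an assumption inferred from the statement that a fully matched 64-bit field with perfect fit scores 1, and the function name and example scores are hypothetical.

```python
from bisect import bisect_left


def pack_tokens(matches, width=64):
    """Choose non-overlapping tokens maximizing the sum of r2 * length / width.

    matches: list of (start, end, r2) with inclusive bit indices.
    Returns (packing score, chosen tokens in bit order).
    """
    items = sorted(matches, key=lambda m: m[1])     # sort by end bit
    ends = [m[1] for m in items]
    best = [0.0] * (len(items) + 1)                 # best[k]: optimum over first k items
    prev, take = [], []
    for k, (start, end, r2) in enumerate(items):
        p = bisect_left(ends, start, 0, k)          # earlier items ending before start
        weight = r2 * (end - start + 1) / width
        with_k = best[p] + weight
        take.append(with_k > best[k])
        prev.append(p)
        best[k + 1] = max(best[k], with_k)
    chosen, k = [], len(items)
    while k > 0:                                    # backtrack through the decisions
        if take[k - 1]:
            chosen.append(items[k - 1])
            k = prev[k - 1]
        else:
            k -= 1
    return best[-1], chosen[::-1]


# The overlapping tokens from the text (bits 2-8 vs. bits 1-4), plus a third token.
score, chosen = pack_tokens([(2, 8, 0.9), (1, 4, 0.95), (9, 15, 1.0)])
```

With the illustrative scores above, the longer token on bits 2 to 8 is preferred over the slightly better fitting but shorter token on bits 1 to 4, which mirrors the stated preference for longer tokens.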
We tested our method on three vehicles (model years 2008, 2015, and 2016) spanning two makes, three models, and both gasoline and hybrid powertrains. For this short paper we present results and examples from a 20-minute capture from a 2008 gasoline vehicle in city and highway driving conditions. CAN traffic was captured using a Kvaser Leaf Lite V2 (www.kvaser.com) providing CAN-to-USB translation to a Linux OS laptop using SocketCAN software (https://elinux.org/Can-utils). This particular car used 25 AIDs and responded on 31 DIDs, which were queried at a rate of 20 Hz throughout the capture. OEMs are only required to have their vehicles respond to a subset of about 200 possible DIDs, and for the cars we tested, we found similar subsets of about 30-45 responsive DIDs. We note that we obtained similar, if less comprehensive, results for all vehicles tested (the 2015 and 2016 cars had about twice as many AIDs) and that results were similar across multiple captures. We ran the tokenization and translation algorithm with .
We look at three examples of mapped AID traces for this capture to illustrate results, shown in Figure 3. Note that the AIDs are anonymized by replacement with their priority ranks (the highest priority is AID1, the lowest is AID25). The DIDs are the actual OBD-II PIDs corresponding to the documentation [11], and we have translated both the DIDs and mapped CAN tokens to the appropriate units. Note additionally that although all token boundaries in the shown examples are constant bits, this is not always the case.
Figure 3(a) presents a fully mapped message, that is, every bit is either constant or matched. The two matched tokens have very high match scores giving strong evidence that these two are exactly the signals reporting “Engine RPM” and “Accelerator Pedal Position D” DIDs, respectively. Note the single outlying orange point at s, an artifact of the differing sampling rates.
Figure 3(b) shows a very high priority message, AID3, which is nearly identical to AID4. Both contain two 12-bit tokens that are matched to the “Vehicle speed” DID. Based on our experience working with CAN data, we can quickly identify these as encoding wheel speed, likely for all four wheels, two on each of AID3 and AID4. We note that the values of the linear coefficients show that these signals are simply the scaled speed, encoded with higher precision in the CAN data than in DID responses. Compare this to the coefficients in Figure 3(a), whose message scores and visual match are similarly high, but whose coefficients indicate a more complex conversion. The two constant 1s bits in byte 6 of Figure 3(b) are likely flag tokens (that do not flip states during the course of the capture) corresponding to the state of each wheel. The final half byte of the message is an unmatched token (shown with -1s). However, visual inspection of the token plot (bottom) shows values cycling between the 4-bit minimum (0) and maximum (15) over a regular period, revealing that it is a Counter Token. The characteristics of Counter Tokens, as well as Checksum Tokens, have been described in prior work; they are apparently added to prevent particular types of attacks on safety-critical messages (e.g., injections or replay attacks). We note that many of our unmatched tokens may correspond to these Counter or Checksum Tokens. While our algorithm does not currently account for these, they are easy to identify, and we plan to identify them initially along with the constant bits in future iterations.
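One simple way such a Counter Token might be flagged automatically is sketched below. It assumes a counter increments by one per message and wraps around at the token's maximum value, which is consistent with the cycling pattern described above but is an assumption rather than a definition taken from this paper; the function name and the tolerance parameter `min_fraction` are hypothetical.

```python
def looks_like_counter(values, n_bits, min_fraction=0.95):
    """Heuristic: most successive values increase by 1 modulo 2**n_bits.

    values: integer token values in message order.
    min_fraction: share of steps that must be +1 (tolerates dropped frames).
    """
    if len(values) < 2:
        return False
    modulus = 1 << n_bits
    steps = sum(1 for a, b in zip(values, values[1:]) if (b - a) % modulus == 1)
    return steps / (len(values) - 1) >= min_fraction


# A 4-bit token wrapping from 15 back to 0, and a non-counter sequence.
wrapping = looks_like_counter([13, 14, 15, 0, 1, 2], n_bits=4)
irregular = looks_like_counter([0, 0, 1, 5, 3, 2], n_bits=4)
```

Tokens flagged this way could be set aside before diagnostic matching, in the same manner as constant bits.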
The message in Figure 3(c) demonstrates the point that a lower score does not necessarily indicate poor performance, but simply a correlated signal. It also illustrates why it is reasonable for a single DID to match more than one AID token. For example, the first and third tokens in this message match DIDs that are matched with a higher score in Figure 3(a). The single-bit token matched to DID73: “Accelerator Pedal Position D” has a match score of only 0.53, since only 53% of the total variance of the DID signal is explained by the one bit. However, it seems clear that this bit encodes whether or not the gas pedal is depressed, an indicator unavailable via diagnostic query but easily derived from pedal angle. Likewise, the third matched token is likely physically related to the “Engine RPM” DID, but unlike the second token in Figure 3(a), it is not an exact match. The second and third tokens matched (with opposite endianness) to the “Fuel System Status” DID, which is an enumerated DID: its values indicate a particular state, not a physical value (see [11] for details). Significant deviation from the enumerated DID in portions of the plot may indicate a spurious match or that the token communicates a different signal from a related system. Finally, the plots of the token values for this AID highlight the complexity of the translation problem, namely, similarity in the signal variation due to the highly interrelated physical system makes accurate translation and semantic labeling difficult.
Overall, from this capture, we find that 69.6% of bits are constant 1s or 0s. The remaining 30.4% of bits split into 16.8% matched by our algorithm and 13.6% unknown. By summing the message scores for each AID and dividing by the number of AIDs, we get a total match score of 14.5% out of the 16.8% matched bits, for an overall match score of 86.0%. We note that the score should not be interpreted as the performance of the algorithm; as illustrated by Figure 3(c), one can learn correlations or related signals and infer signals that are not DID-encoded (e.g., Figure 3(c)(top)). Changing the threshold to 0.2, we find that 22.0% of the bits are matched and only 8.4% are unmatched, with a total match score of 16.1% over 22.0% (72.9%). Hence, we obtain some less direct matches but more insight into unknown but correlated tokens.
Data transmitted over the in-vehicle CAN network is a veritable mine of information regarding vehicular functions. However, the key to decode CAN data, specifically the way to tokenize and translate messages, is completely proprietary, and varies per make, model, year and trim. Consequently, non-OEM augmentation of vehicles is greatly hindered because it operates blind to CAN message syntax and semantics. We develop ACTT, the first algorithm to simultaneously tokenize and translate CAN data, learning message-to-function mappings by leveraging diagnostic information.
Our results show that ACTT both tokenizes and functionally translates CAN data fields, providing needed meaning. Specifically, many matched tokens reveal near-perfect DID encodings, while the remaining matched tokens are correlated with their matched diagnostic responses, providing potentially useful groupings and, in some cases, facilitating inference of the signal from inspection even when it is not a direct match. We expect ACTT to provide a needed step in unlocking the DBC encodings to enable a wide variety of after-market research including CAN security, performance tuning, and driver or vehicle state studies.
Special thanks to Micheal Iannacone and anonymous reviewers. Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U. S. Department of Energy (DOE) and by the DOE, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the Scientific Undergraduate Laboratory Internship (SULI) program.
- [1] K. Koscher et al., “Experimental security analysis of a modern automobile,” in 2010 IEEE Symposium on Security and Privacy. IEEE, 2010, pp. 447–462.
- [2] R. Bosch GmbH, “CAN specification version 2.0,” 1991.
- [3] K.-T. Cho and K. G. Shin, “Fingerprinting electronic control units for vehicle intrusion detection,” in USENIX Security Symp., 2016, pp. 911–927.
- [4] M. R. Moore, R. A. Bridges, F. L. Combs, M. S. Starr, and S. J. Prowell, “Modeling inter-signal arrival times for accurate detection of CAN bus signal injection attacks,” in 12th CISRC. ACM, 2017.
- [5] M. Gmiden, M. H. Gmiden, and H. Trabelsi, “An intrusion detection method for securing in-vehicle CAN bus,” in Proc. of Sciences and Techniques of Automatic Control and Computer Engineering. IEEE, 2016.
- [6] W. Choi et al., “Identifying ECUs through inimitable characteristics of signals in controller area networks,” IEEE Trans. Vehicular Technology, 2018.
- [7] Z. Tyree, R. A. Bridges, F. L. Combs, and M. R. Moore, “Exploiting the shape of CAN data for in-vehicle intrusion detection,” in IEEE CAVS, 2018, preprint arXiv:1808.10840.
- [8] M. Jaynes, R. Dantu, R. Varriale, and N. Evans, “Automating ECU identification for vehicle security,” in 15th ICMLA. IEEE, Dec 2016, pp. 632–635.
- [9] S. N. Narayanan, S. Mittal, and A. Joshi, “Using data analytics to detect anomalous states in vehicles,” Dec 2015.
- [10] T. Huybrechts et al., “Automatic reverse engineering of CAN bus data using machine learning techniques,” in Advances on P2P, Parallel, Grid, Cloud and Internet Computing, F. Xhafa, S. Caballé, and L. Barolli, Eds., vol. 13. Springer, 2018.
- [11] “OBD-II PIDs,” Oct 2018. [Online]. Available: https://en.wikipedia.org/wiki/OBD-II_PIDs
- [12] M. Markovitz and A. Wool, “Field classification, modeling and anomaly detection in unknown CAN bus networks,” Vehicular Communications, vol. 9, pp. 43–52, Jul 2017.
- [13] M. Marchetti and D. Stabili, “READ: Reverse engineering of automotive data frames,” IEEE Transactions on Information Forensics and Security, 2018.
- [14] B. C. Nolan, B. Mullins, S. Graham, and C. S. Kabban, “Unsupervised time series extraction from controller area network payloads,” in IEEE CAVS, 2018.
- [15] T. Flach, N. Mishra, L. Pedrosa, C. Riesz, and R. Govindan, “CarMA: towards personalized automotive tuning,” in 9th SenSys. ACM, 2011.
- [16] A. R. Wasicek, M. D. Pese, and A. Weimerskirch, “Context-aware intrusion detection in automotive control systems,” 2017.
- [17] H. Li et al., “Poster: Intrusion detection system for in-vehicle networks using sensor correlation and integration,” in SIGSAC, ser. CCS ’17. ACM, 2017, pp. 2531–2533.
- [18] M. R. Moore, R. A. Bridges, F. L. Combs, and A. L. Anderson, “Data-driven extraction of vehicle states from CAN bus traffic for cyber protection and safety,” Consumer Electronics Magazine, (to appear) https://goo.gl/8LUvNH.
- [19] Society of Automotive Engineers, Serial Control and Communications Heavy Duty Vehicle Network, June 2012.
- [20] “Weighted interval scheduling lecture notes.” [Online]. Available: https://courses.cs.washington.edu/courses/cse521/13wi/slides/06dp-sched.pdf