ROAD: The Real ORNL Automotive Dynamometer Controller Area Network Intrusion Detection Dataset (with a comprehensive CAN IDS dataset survey guide)

12/29/2020 ∙ by Miki E. Verma, et al. ∙ Oak Ridge National Laboratory 5

The Controller Area Network (CAN) protocol is ubiquitous in modern vehicles, but the protocol lacks many important security properties, such as message authentication. To address these insecurities, a rapidly growing field of research has emerged that seeks to detect tampering, anomalies, or attacks on these networks; this field has developed a wide variety of novel approaches and algorithms to address these problems. One major impediment to the progression of this CAN anomaly detection and intrusion detection system (IDS) research area is the lack of high-fidelity datasets with realistic labeled attacks, without which it is difficult to evaluate, compare, and validate these proposed approaches. In this work we present the first comprehensive survey of publicly available CAN intrusion datasets. Based on a thorough analysis of the data and documentation, for each dataset we provide a detailed description and enumerate the drawbacks, benefits, and suggested use cases. Our analysis is aimed at guiding researchers in finding appropriate datasets for testing a CAN IDS. We present the Real ORNL Automotive Dynamometer (ROAD) CAN Intrusion Dataset, providing the first dataset with real, advanced attacks to the existing collection of open datasets.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 2

page 4

page 5

page 6

page 9

page 10

page 11

page 13

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Modern vehicles are increasingly drive-by-wire, relying on continual communication of small computers called electronic control units (ECUs). Nearly ubiquitous in modern vehicles, controller area networks (CANs) facilitate the data exchange among ECUs by providing a common network with a standard protocol. While it is lightweight and reliable, the CAN standard has well-known security flaws, such as a lack of message authentication. Furthermore, intra-vehicle CANs are increasingly exposed—often directly by mandatory on-board diagnostics II (OBD-II) ports and potentially indirectly/remotely through a variety of vehicle interfaces, including USB ports and various wireless communications.

Consequently, there has been a significant increase in the attention given to CAN vulnerability in the form of research [1, 2, 3, 4, 5, 6, 7, 8], as well as many proposed CAN intrusion detection systems (IDSs). Four recent in-vehicle IDS surveys show the growth and progression of this field [9, 10, 11, 12]. Categorizations offered by these surveys illustrate the current barriers to CAN IDS progress. Note that these surveys (even when combined) are not comprehensive but provide a representative sample of the described distributions.

CAN IDS methods generally fall into five major categories:

[leftmargin=.5cm,font=]

Rule/Specification-Based:

Uses rules / whitelisting [13, 14]

Physical Side-Channel:

Uses physical layer attributes (e.g., voltage) [15, 16]

Frequency/Timing-Based:

Regards the timing of each frame arbitration ID and/or the sequential nature of IDs [17, 18, 19]

Payload-Based:

Uses a black-box approach that considers the data frame as a string of bits, without recovering the signals these bits represent [20, 21, 22]

Signal-Based:

Requires first decoding raw data field bits into constituent signals, and uses time series of signal values as inputs [23, 18, 24, 25]

The two most recent surveys by Wu et al. [9] and Lokman et al. [10] demonstrate both the increasing pace and asymmetric growth of the field in terms of the five categories described above. A quick meta-analysis of the union of the 30 papers surveyed (up to 2018) yields the following distributions by year: before 2015 (4), 2015 (5), 2016 (6), 2017 (9), 2018 (6); by category: rule/specification-based (3), side-channel (4), frequency/timing-based (14), payload-based (6), signal-based (3).

While the number of publications in the CAN security domain, especially in IDS research, has grown appreciably in the past few years, IDS research is significantly hindered by two major issues: (1) obfuscated CAN messages (not the focus of this work) and (2) lack of quality, publicly available, real CAN data with advanced attacks present (the focus of this work).

The asymmetric growth in the field—in particular the disproportionate number of publications on methods that are timing/frequency-based (and to a lesser extent payload-based) as compared to signal-based—is a direct result of this obfuscated CAN messages issue. Original equipment manufacturers ([OEMs], e.g., Subaru, Ford) of passenger vehicles hold secret their proprietary encodings of signals in the CAN data fields and vary the encodings across models. Consequently, though researchers can easily add a node to monitor and send CAN messages on most vehicles, the data is not understandable. Thus, most have focused on methods that do not require knowledge of signal encodings. While a few researchers have paired with OEMs or done some manual reverse engineering to obtain and develop IDSs based on the de-obfuscated CAN signals, these developments are not vehicle agnostic. Notably, the research community is beginning to address the CAN signal reverse engineering problem, (see Verma et al. [26] for a survey of these works), in large part to facilitate CAN IDS research, but more generally to enable a wide variety of downstream automotive technologies (e.g., [27]). While the obfuscated signal problem is not the problem addressed in this paper, it is necessary context for the second issue, to which we make a contribution.

Problem Addressed

We now turn to the second barrier, which is that it is difficult to obtain CAN data with real high-fidelity labeled attacks.

Such data is unavailable for three reasons. First, CAN data with real attacks are costly to produce, with the exception of fabrication (simple message injection) attacks. Facilitated by open-source (e.g., SocketCAN/CANutils

[28]) and proprietary software (e.g., CANalzyer,111https://www.vector.com/int/en/products/products-a-z/software/canalyzer/ VehicleSpy222https://intrepidcs.com/products/software/vehicle-spy/) and OBD-II access to many vehicle CANs, collecting ambient CAN data from real vehicles is relatively straightforward, as is collection while sending extra messages; thus fabrication attacks are common. For more subtle attacks, researchers must have a dedicated modern vehicle for study—a relatively costly hardware investment when purchase price, maintenance, insurance, etc. are taken into account. Discovery and execution of more subtle, physically verifiable CAN attacks require ample research time and effort, and (thanks in part to issue (1) above) CAN attacks/vulnerability analyses are usually a per-vehicle endeavor.

Second, producing realistic CAN attack data carries inherent risks to the passengers, bystanders, and to the vehicle itself. Ideally, dynamometers allowing driving in a safe and controlled laboratory environment are used, but such facilities are large investments and are usually outside the researchers’ control. Furthermore, risks of permanent damage loom (e.g., “bricking” a vehicle’s ECU), and successful implementation of cyber attacks with finesse requires per-vehicle research efforts in themselves.

Third, disclosure of sensitive information is an inhibitor. OEMs consider their CAN encodings intellectual property. Additionally, responsible vulnerability disclosure may be necessary if new attacks are discovered, which at a minimum pauses release of data. Further, releasing data with targeted attacks may be viewed unfavorably by OEMs, resulting in lawsuits if not handled responsibly.

To our knowledge, there are currently only six publicly available vehicle CAN datasets with labeled attacks (see Table I). Likely due to the inherent difficulties in producing real CAN attack data described above, all of these datasets are either real fabrication (simple message injection) attacks or simulated attacks (created by manipulating CAN data post collection). Both methods have significant limitations when supporting CAN IDS development. Fabrication attacks are generally simple to detect with timing-based methods and are thus limited in scope. Due to the complex dynamics of the broadcast CAN protocol, the simulated CAN attacks ignore aberrations in message timing, content, and presence that naturally occur, and therefore change data quality in unknown ways. Physical verification of the effect of the attack on the vehicle is not possible with pure simulation. In short, there is no publicly available, real CAN data with labeled attacks that is of sufficient quality to permit assessment of many CAN IDS methods.

Consequences

A result of the difficulty to obtain sufficient CAN data with attacks is simply that CAN IDSs are often not evaluated on real CAN data with real attacks. A survey by Loukas et al. [11]classifies 17 surveyed automotive CAN IDS papers by the evaluation method: “analytical” (theoretical only, no evaluation on data), “simulation” (evaluated on simulated CAN data or real CAN data with simulated attacks), and “experimental” (evaluated on real CAN attacks). Note our classification language is slightly different than that in the survey Loukas et al., as we consider IDSs evaluated on real attacks (even if not in situ), to be “experimental.” The distribution of the surveyed papers is: analytical (3), simulated (8), experimental (6). This illustrates the first consequence: relatively few IDSs are evaluated on CAN data with real attacks.

A second consequence to the community is that CAN IDS works are not comparable, or at least not compared. Rajbahadur et al. [12] surveys an even larger set of papers (with a much wider scope of “Anomaly Detection for Connected Vehicle Cybersecurity”), finding that

Much of the research is performed on simulated data (37 out of the 65 surveyed papers)… much of the research does not evaluate the newly proposed techniques against a baseline (only 4 out of the 65 surveyed papers do so), which may lead to results that are difficult to quantify.

This reinforces the findings of Loukas et al. regarding simulated data, but also articulates a second issue that is even larger in magnitude: very few CAN IDSs are evaluated against a baseline. Our experience with the CAN IDS literature is that authors are contributing from a wide variety of backgrounds. While this milieu provides a diverse set of approaches (a benefit), the area suffers by lacking a uniform body of knowledge, and the lack of depth seems to inhibit the steady development of ideas and systematic, quantifiable progress. To again quote Rajbahadur et al.,

The varied use and scattered publication of anomaly detection [for connected vehicle cybersecurity] research has given rise to a sprawling literature with many gaps and concerns… we urge researchers to address these identified shortcomings.

To summarize, quantifiable comparison across competing and complementary IDS methods is currently not possible. Standardized datasets are necessary for head-to-head comparisons and for replicability (or better reproducibility). To continue to progress in an empirically verified and scientific manner, the CAN IDS research community needs to produce and adopt a publicly shareable collection of CAN datasets with labeled attacks. This sentiment was reiterated and acted on by Hanselmann et al. [23] in their recent CAN IDS work:

To the best our knowledge, there is no standard data set for comparing methods. We try to close this gap by evaluating our model on both real and synthetic data, and we make the synthetic data publicly available. We hope that this simplifies the work of future researchers to compare their work with a baseline.

Finally, we find that IDSs are often evaluated against inappropriate test data. For example, IDSs promising detection of advanced, subtle attacks are tested only on CAN data with exceptionally noisy attacks, or (another example) works use attacks that disrupt timing, then ignore timing in evaluation to test payload-based detection. In order to not disparage other IDS works, we cite our own insufficient evaluation of CAN IDS ideas as examples [29, 30]. The consequence is that many promising IDS methods, which are excessive for the easily detected attacks in data available, are never truly evaluated on the more advanced attacks they target.

Contributions

We add to these cries for a more systematic progression of CAN IDS research and to the request to also use appropriate test datasets by offering the following contributions.

We provide a comprehensive (to the best of our knowledge) survey of publicly available CAN datasets that contain labeled attacks. Our survey includes simulated attacks and frameworks for manufacturing attacks in post-processing. We itemize these datasets, their download links, and citations in tables to provide easy reference.

After performing quality analysis investigations on both the data and documentation presented in each previously released CAN dataset, we provide detailed description of the data, a discussion to illuminate the benefits and drawbacks of each dataset, and recommendations for appropriate use of each dataset when developing a CAN IDS.

Our foremost contribution is a real CAN dataset collected from a passenger vehicle with a variety of physically verified CAN attacks, and ample training data with no attacks. This dataset provides a fuzzing fabrication attack, many targeted fabrication attacks that are maximally stealthy (manipulating only the necessary portions of the data field and sending a single manipulated message per ambient message of the same ID), and two advanced attacks that include no fabricated (injected) messages. For each targeted injection attack, we also include an augmented CAN capture by deleting the targeted ambient message to simulate a masquerade attack. This is the highest fidelity dataset with attacks that vary dramatically in difficulty to detect, so as to allow appropriate testing and head-to-head comparisons of the wide variety of proposed CAN IDS methods.

Section II provides necessary background on CAN protocol and vehicle attack terminology; Section III comprises the survey, analysis and discussion of all previous CAN attack datasets; Section IV introduces our new CAN attack dataset.

We hope that our contributions help facilitate advancement in the structure and impact of this growing field.

Ii CAN Data and Security

Fig. 1: CAN data frame [31]: The two primary components are the Arbitration ID used for message identification and arbitration (prioritizing messages) and the Data Field, containing up to 8 bytes of message contents.

Ii-a CAN Protocol

CAN is a message-based protocol standard [32] that defines the first two Open Systems Interconnection (OSI) layers (physical and data link). Using this protocol, ECUs (e.g., Power Control Module [PCM], Antilock Braking System [ABS]) continually broadcast data frames with information relating to the current state of the vehicle. A standard CAN data frame (or packet), depicted in Fig. 1, contains several fields, of which two are relevant for the scope of this paper: the 11-bit Arbitration ID, and the 64-bit Data field.

The Arbitration ID, or simply ID, is the message header that identifies the frame and is used for arbitration, the process by which frames are prioritized when multiple ECUs concurrently transmit—the lower the ID, the higher the priority. The RTR bit is an indicator of a remote frame. Any ECU can request the data on an ID by sending the ID and the RTR bit indicating the request. This remote frame would be immediately followed by a response with the requested ID and data. The Data Field contains the actual message contents of up to 8 bytes, where each distinct piece of information carried in the message is called a signal. CAN frames with the same ID encode the same set of signals in the same format and are usually sent with a fixed frequency to relay updated signal values. In general, each ECU is assigned a set of IDs that only it transmits. For example, the PCM may transmit: ID 0x102 containing engine RPM, vehicle speed, and odometer signals every 0.05s, and ID 0x45D with signals encoding the angle of the gas and brake pedals every 0.01s.

The CAN standard also defines a robust error handling mechanism that is designed to prevent erroneous messages from being propagated or faulty nodes from disrupting communications. For example, if two nodes attempt to concurrently transmit different messages with the same ID, both nodes will transmit their frame until they send opposing bits simultaneously, at which point one will incur an error. If a node’s error count gets too high, it will enter a “bus off” mode, meaning it cannot read or transmit messages on the bus until it is reset. See previous works [33, 7] for more details on CAN error handling.

Ii-B CAN Attacks

While lightweight, CAN lacks encryption and authentication, and is therefore vulnerable to exploitation. There have been a number of successful attacks on vehicular CANs published in the past several years, some remote, and some requiring physical access. Koscher et al. [1] provide a comprehensive overview of CAN-based ECU vulnerabilities. Their exploration involves applications that facilitate CAN communication, such as the Unified Diagnostic Service (UDS), a standardized set of commands which can change the state of a targeted ECU or directly read and write to memory addresses. Instead of focusing on these types of applications, our attack data focuses on inherent vulnerabilities found within the CAN protocol.

The attacks surveyed here begin with the assumption of a compromised node on the bus. Cho & Shin [34] provide a well-defined adversary and attack model with terminology that is widely used. A weakly compromised ECU is a node that an adversary is able to silence, suspending any message transmission, while a fully compromised ECU is a node over which the adversary has complete control, with the ability to send fabricated messages and access the node’s memory. Note that the method of connecting to a vehicle’s CAN via the OBD-II port is considered a fully compromised ECU in this model.

Using this terminology, Cho & Shin introduce the following three general categories of attacks, which are in turn useful for describing attack sophistication and the types of IDSs that would be able to detect them.

Ii-B1 Fabrication Attacks

A fabrication attack uses a strongly compromised ECU to inject messages with malicious IDs and Data Fields. The majority of the attacks in the CAN IDS literature fall into this category, including the following:

[leftmargin=.5cm]

DoS Attack — Messages with ID 0x000 and an arbitrary payload are injected at a high frequency. Since ID 0x000 always wins arbitration and is not usually issued by legitimate ECUs, flooding the bus with these high priority messages prohibits legitimate messages from being transmitted. This results in a host of unusual effects, such as flashing dash indicators, intermittent accelerator/steering control, and even full vehicle shutdown.

Fuzzing Attack — Messages with random IDs and arbitrary payloads are injected at a high frequency. The effect is similar to that of the DoS attack: the bus becomes occupied with mostly injected messages, displacing real messages. In addition, unlike the DoS attack, injected messages may have an ID that appears in normal traffic, so receiver nodes expecting these ID messages will read and use the information in the malicious payload, causing a wide variety of unexpected results. There are two slight variations of this attack: some researchers inject only IDs that appear during normal traffic (e.g., [35]), while others inject arbitrary random IDs (e.g., [36]).

Targeted ID Attack — Messages are injected with a specific target ID and manipulated data field. When only the bits in a specific signal—that is, a select part of the 64-bit data field—are modified, we refer to this as targeting a signal, rather than an ID.

Fabrication attacks are characterized by the inherent problem of message confliction, described by Miller & Valasek [6],

The biggest problem with CAN message injection is that, while attackers can inject arbitrary messages onto the bus, the original sender of the message (i.e., the legitimate ECU) is still sending legitimate messages…The result of the ECU continuously sending messages along side our attack messages is message confliction. From the perspective of the receiving ECU, inconsistent messages are received (and it must) decide what to do with this conflicting information.

In general, ECUs will regard the last seen data frame on a given ID; thus, to effectively overwrite legitimate messages with the target ID, the injected frames must occur on the bus very soon after the true frame. Not all data frames (especially injected frames) are regarded independently and acted upon; simply reverse engineering a signal to inform targeted injections will often not result in the desired or any response from the vehicle. Miller & Valasek [6] provide potential techniques for side-stepping message confliction, but the desired effect is the same—de-conflicting ambient and fabricated data frames by suspending the ambient messages.

The first two fabrication attacks described (DoS and fuzzing) require almost no understanding of or reconnaissance on the target vehicle, nor do they allow for finesse in execution. On the other hand, the targeted ID attack can be more sophisticated. Targeting manipulation of specific functionality requires knowledge of at least one of the IDs’ signals and requires the data field designed to have a particular effect based on the given ID’s signal definitions.

Furthermore, targeted ID attacks can, similar to the first two attacks, be accomplished by flooding the bus, simply meaning that messages are sent at a very high frequency, although this is blatant and easy to detect. Research hackers Miller & Valesek used this tactic to successfully attack a Toyota Prius, injecting fabricated collision prevention system messages at a high frequency, causing the ABS to engage the brakes [5].

The most stealthy targeted ID attack is a flam attack—immediately after each target ID’s legitimate message, an injected message is sent (with the same ID but manipulated data), so that the true message state is not physically realized before the spoofed message alters the car to the target state. Injected frames and true target ID frames are in one-to-one correspondence. This type of attack was pioneered by Hoppe et al. [3], who essentially disabled a car’s warning lights by sending a “lights off” frame immediately after any legitimate frame’s “lights on” message was sent, resulting in the lights appearing continually off. Provided the attacker can reverse engineer the target ID’s payload, only the bits involved in the targeted signal need to be manipulated.

Dataset Org. Year Dataset URL 20cmReal
Attacks
# Attack Types
T.T. T.O.
CAN Intrusion (OTIDS) [35] HCRL 2017 http://ocslab.hksecurity.net/Dataset/CAN-intrusion-dataset 2 11
Survival Analysis for Automobile IDS [37] HCRL 2018 http://ocslab.hksecurity.net/Datasets/survival-ids 3 0
Car Hacking for Intrusion Detection [36, 38] HCRL 2018 http://ocslab.hksecurity.net/Datasets/CAN-intrusion-dataset 4 0
SynCAN [23]3 Bosch 2019 https://github.com/etas/SynCAN 2 3
Automotive CAN Bus Intrusion v2 [39] TU Eindhoven 2019 https://doi.org/10.4121/uuid:b74b4928-c377-4585-9432-2004dfa20a5d 63 1
Can Log Infector4 CrySyS Lab 2020 https://www.crysys.hu/research/vehicle-security/ 0 7

1.875pt1-8.51.5

plus1fil minus1fil

ORNL Dynamometer CAN Intrusion ORNL 2020 https://0xsam.com/road/ 5 5
  • The sole advanced attack, the “Impersonation Attack,” may not actually be useful for researchers developing an IDS (see description in Sec. III).

  • Dataset does not contain any real raw CAN data—all data is synthetic (not just the attacks), and it contains decoded signal values rather than 64-bit data fields.

  • Due to the rather crude method of creating synthetic injections (see description in Sec. III) which involves modifying timestamps, classification is less clear.

  • Dataset does not technically include attack data—instead, includes several ambient captures and a python script for programmatically modifying ambient logs to create different types of simulated attacks.

TABLE I: Open CAN IDS datasets. All datasets include ambient data for the given vehicle(s) in addition to attack data. “T.T.” and “T.O.” refer to Timing Transparent (alters normal timing characteristics) and Timing Opaque (does not alter normal timing characteristics), respectively.

Ii-B2 Suspension Attacks

An adversary mounting a suspension attack needs a weakly compromised ECU, preventing it from transmitting some or all messages. For example, an adversary could suspend all messages on a particular safety-critical ID, thus disrupting other systems that rely on this constantly updated data.

Ii-B3 Masquerade Attacks

Finally, the most sophisticated category, masquerade attacks, involve an adversary first suspending messages of a specific ID from a weakly compromised target ECU, and then using a strongly compromised ECU to inject spoofed messages with this ID at a realistic frequency, thus masquerading as the target ECU. Using this more advanced strategy, a targeted ID attack can be carried out without message confliction, thus allowing for a more stealthy attack. Miller & Valasek’s infamous remote Jeep Cherokee hack [6] employed a masquerade attack. Unlike the Prius [5], the Cherokee ABS system dealt with message confliction by simply turning off the collision prevention system, and thus they were unable to mount a similar fabrication attack; instead, they had to first suspend legitimate messages (in addition to a few other steps) in order to mount an attack.

In another example, Cho & Shin cleverly use a strongly compromised ECU in order to weakly compromise a target ECU by causing it to go into bus off mode, at which point they run a masquerade attack [7]. Interestingly, if an attacker is not careful when mounting a fabrication attack, this same mechanism can result in the attacker’s own strongly compromised ECU getting bussed off. In fact, we have done this on many occasions, in essence running a suspension attack on ourselves!

This previous research has shown that masquerade attacks are indeed possible, but they require enormous hacking expertise and thus far more in-depth per-vehicle research. Further, white-hat CAN hackers and CAN intrusion detection research communities are working independently with seemingly different skill sets toward a common goal. Thus, no real CAN masquerade attacks are publicly available, and the evidence from the CAN IDS research community is that defensive researchers do not have the skills or resources to create such advanced attacks, yet they need to access to attack data in order to create a robust IDS.

Timing Transparent vs. Timing Opaque

Unsurprisingly, these attack categories map to intrusion detection techniques that have matured alongside offensive developments. Fabrication and suspension attacks are accurately detectable by frequency-based IDSs, which regard the timing of each ID and/or the sequential nature of IDs. Masquerade attacks require more sophisticated methods: for example, attempting to identify the sending ECU [34, 16], luring an added node to reveal itself [35], or inspecting the data field [30, 23, 20].

We define a Timing Transparent (T.T.) attack to be any that is hypothetically detectable using a frequency-based method, specifically, a fabrication attack, detectable by unusually fast message timing or the appearance of new IDs, or a suspension attack, detectable from unusually slow or disappearance of usually present IDs.

An attack that is Timing Opaque (T.O.), on the other hand, is defined as an attack that does not disrupt normal timing or ID distributions, and thus would not be detected with a frequency-based IDS. Instead, a more sophisticated method, such as a payload-based detector that uses the data field, or perhaps a side-channel method monitoring the physical layer, would be needed. In short, developing a comprehensive and robust IDS requires hardening (and therefore testing) against timing opaque attacks. A masquerade attack is the primary example, but other attacks that may alter the overall state of the vehicle (e.g., the “accelerator attack” in the ORNL dataset, Sec. IV-A3), may also be included in this category.

Iii Previous Datasets

Here we itemize the publicly available CAN datasets with attacks, including descriptions of the attacks present and whether they are real or simulated, as well as dataset benefits and drawbacks. We examined each dataset, and attempted to verify the accuracy of documentation. Refer to Table I.

HCRL CAN Intrusion Dataset (OTIDS)

Lee et al. at the Hacking & Countermeasures Research Lab (HCRL) published this dataset as a supplement to their CAN IDS paper [35], which presented OTIDS: Offset Ratio and Time Interval based Intrusion Detection System, a novel IDS based on the timing of remote frame responses. The general method was to transmit remote frame requests for a given ID, time how long it took an ECU to respond to the request, and test whether this delay was anomalous—the idea being that a compromised ECU, being controlled by an adversary, would respond with an unusual delay. The published dataset contains artifacts of this method, specifically, the remote frames (which are labeled as such and thus easy to remove in preprocessing), and legitimate as well as spoofed responses (less easy to identify and remove). The documentation of the dataset on the website and in the paper, their figures describing the dataset, and the data itself have discrepancies. We provide our assessment as to what is actually contained in this dataset.

Attacks

— Real. Includes DoS and fuzzing fabrication attacks. According to the authors, the dataset also includes a masquerade attack, which they call an “impersonation attack,” but it does not seem as though the target node was actually suspended, and the only message injections appear to be the spoofed remote frame responses. Thus, while this may be a useful simulation for testing whether their remote-frame–based IDS would detect a masquerade attack, it would not be useful for other IDSs that do not leverage this aspect of the protocol. Rough estimates of injection intervals are given.

Benefits

— The fuzzing attack provided in this dataset is the slightly stealthier version that involves only spoofing IDs that appear in normal traffic. This is the only example of this kind of fuzzing attack in an open dataset. This is also the only open dataset with remote frames and responses.

Drawbacks

— First and foremost, the injected messages are not labeled, and the documentation on the injection intervals is unclear and possibly incorrect. Authors indicate that in the DoS attack all 0x00 messages are injections (these take place the during the entire capture), and the fuzzy and impersonation attacks start after 250s. However, our analysis indicates that these attacks take place during the entire capture. Furthermore, this disagrees with their paper, which depicts a injected message during the fuzzing attack at 0.1565s, well before 250s. Second, as explained above, the “impersonation attack,” while characterized as masquerade attack in their paper, does not seem to be a true masquerade attack since the message transmission by the legitimate node is suspended. While it is possible that we misunderstood their documentation, our confusion on the matter and the various discrepancies we found are a testament to poor documentation. Finally, the presence of remote frame requests and responses results in small timing changes that are not usually in ambient traffic, and may be problematic for testing and training a frequency-based detector. Overall, unless it used for leveraging remote frames for an IDS, this dataset is not recommended.

HCRL Survival Analysis Dataset for Automobile IDS

Alongside their frequency-based CAN IDS paper, Han et al. [37] at HCRL published their dataset composed of three different vehicles (Hyundai Sonata, Kia Soul, Chevrolet Spark). On each car, they collected ambient data and ran three different types of injection attacks that caused the vehicles to malfunction. Note that a different version of this dataset is also published for an IDS challenge (http://ocslab.hksecurity.net/Datasets/datachallenge2019/car), which we do not include in our survey.

Fig. 2: Time between messages during fuzzing attacks on four different vehicles, with top three plots from each vehicle in the HCRL Survival Analysis Dataset, and the bottom plot from the ORNL dataset. While the injections (in red) result in a significant disruption in the overall message timings in the HCRL dataset, the fuzzing attack in the ORNL dataset does not, and would therefore be slightly more difficult to detect using a frequency-based IDS. This also illustrates that the bus load and overall message frequency distribution varies widely across vehicles.
Attacks

— Real. Includes all three flooding fabrication attack types: DoS (they call this “flooding”), fuzzing, and targeted ID (flooding delivery). Each attack capture is 25–100 seconds long, and contains between 1 and 4 five-second injection intervals. Each injected message is labeled.

Benefits

— This is the only dataset that contains real attacks on multiple vehicles; furthermore, the same set of attacks are repeated multiple times on each one. One of the values of training and testing a frequency-based IDS on multiple vehicles is illustrated in Fig. 2, depicting a fuzzing attack mounted on four different vehicles (top three plots from this dataset). This illustrates how the bus load (% of time the bus is occupied with a message) can differ dramatically across vehicles, demonstrating that a frequency-based IDS must be adaptable to such differences. Additionally, authors confirm that these attacks have a real effect on the vehicle. This dataset would be a good choice for training and testing a vehicle-agnostic, frequency-based detector.

Drawbacks

— All of the attacks are basic and could be detected with a very simple frequency-based detector. Even conditioned on being fabrication flooding attacks, these are particularly un-stealthy examples. See the top three plots of Fig. 2, as the frequency of the entire bus is significantly disrupted by the exceptionally high injection frequency, which is not the case the ORNL fuzzing attack (bottom plot of Fig. 2). The targeted ID attack, which they call the “malfunction” attack, is done somewhat blindly: there is no indication of what the function of the target IDs are, and the injected payloads (data fields) are chosen by either cycling through random values (on the Soul) or a single random value (on the Spark and Sonata). As for the ambient captures, only 60–90s of data are provided per vehicle, which is likely not sufficient for robust training, let alone for testing false positive rates. Finally, the ambient data and attack data are in differently formatted CSVs, which is undesirable.

HCRL Car Hacking Dataset for Intrusion Detection

Fig. 3: HCRL Car Hacking Dataset contains unintentional artifacts of data collection; in particular, in each of the four attack datasets, right after conclusion of the attack, there is a prolonged period during which no messages appear on the bus. This depicts the end of the DoS dataset, starting from the last four injection intervals (red), followed by 53s of ambient traffic (blue), and a 22s transmission gap before ambient message resume again. Note that the first point after messages resume (with a ) has been omitted for scale. We hypothesize that this gap is due the CAN bus going into a “stand-by” mode due to inactivity, that is, the vehicle is not being operated and no messages are being injected.

The Car Hacking dataset is the most recent dataset released by HCRL and was used by these researchers to train and test two deep learning IDSs

[36, 38]. The dataset includes 500s of ambient driving data, and four different attacks.

Attacks

— Real. Includes all three flooding fabrication attack types—DoS, fuzzing, and two targeted ID attacks (flooding delivery) in which they spoof the drive gear and the RPM gauge. Each capture is upwards of 40 minutes and contains 300 attack intervals lasting 3–5 seconds. Each injected message is labeled.

Benefits

— The attack captures are very long and contain a large number of instances per attack. Unlike the other two HCRL datasets, the function of the IDs in the targeted ID attacks is provided (drive gear and RPM gauge). This dataset seems to be the most widely used dataset in the CAN IDS research community, and is thus clearly a very important niche. (Unfortunately, many recent publications seem clearly to be using this dataset without citation.)

Drawbacks

— All the attack captures contain a significant artifact of data collection that may pose a problem for researchers using this data, particularly since it is not noted in the documentation. At the conclusion of each attack (soon after the 300th injection), there is a large gap in messages where it appears no messages are being transmitted. This is depicted for the DoS attack in Fig. 3 where there is a gap of 22s. In the other three captures containing attacks, this gap is much longer, in the order of 3000s! Researchers Berger et al. [40] also noted these jumps in timestamps. Having dealt with this issue in our own data collection efforts, we hypothesize that this message transmission gap arises from the bus going into a “stand-by” mode due to vehicle inactivity once the researchers stop injecting messages. We suggest that researchers using this dataset, particularly for a frequency-based IDS, should trim the attack captures to right before the gap—at the mark for the DoS, Fuzzing, Gear, RPM datasets, respectively.

While this relatively simple fix renders the dataset usable for testing, the hypothesized source of the issue, namely that the car wasn’t being driven during the attacks (which we verified by decoding the signals), poses a problem that cannot be solved in post-processing. As the car is being driven in the ambient data, the test data fundamentally differs from the training data outside of the injections, making it an unsuitable test set. Given these issues, this dataset does not seem like a good choice, even for testing a simple detector.

Similar to the HCRL survival dataset, these attacks are particularly un-stealthy with respect to disrupting overall bus timing (see Fig. 3). Finally, ambient and attack data are in different formats (fixed width format and CSV).

Bosch SynCAN

Hanselmann et al. [23] at Bosch GmbH (the company that created CAN), constructed a signal-translated, synthetic CAN dataset that they used for training and testing their CAN IDS, “CANet”, and, noting the lack of a standard, sufficient CAN benchmark dataset, Hanselmann et al. published theirs for the research community to use. Unlike all others, this dataset contains timestamped signal values rather than the raw binary data fields; thus, it is the only available dataset for evaluating a signal-based model.

Attacks

— Simulated. Includes the following attack types: fabrication targeted ID (flooding delivery), suspension, and masquerade. Includes one capture of each attack, each containing a hundred 2–4s attack intervals. Each simulated intrusion signal is labeled.

Benefits

— This is the sole signal-based dataset, which is a particularly impactful contribution since the encodings of signals into most vehicles’ CAN data are unknown. It is clear from the treatment in the paper that Hanselmann et al. have true expertise in CAN protocol and data. This dataset contains the most nuanced masquerade attacks (e.g., signal values slowly drifting from a real to a target value, or replayed from normal traffic) currently available (including our dataset). Further, this is the only dataset (other than ours) that contains attacks targeting a single signal, rather than the full 64-bit data field. This allows for testing a very advanced, signal-based IDS.

Drawbacks

— Synthetic data is clearly an imperfect proxy for real data; Hanselmann et al. train and test their IDS on the synthetic and real data, and remark that the former is “somewhat ‘cleaner’ than in the real case.” As a separate issue, simulated attacks are inherently problematic since their effect on a vehicle cannot be verified. Additionally, authors claim that all ambient data should be used for training but do not provide any additional ambient data for testing; thus, it would would difficult to test an IDS’s false positive rate—an attribute of the utmost importance. Finally, since authors do not provide a version of the data with CAN signals packed into frames, many CAN detectors, which require CAN data in the usual format (time-stamped IDs with 64-bit data fields) could not use this data.

TU Eindhoven Lab Automotive CAN Bus Intrusion Dataset

This dataset was published by researchers in the Department of Mathematics and Computer Science at Eindhoven University of Technology (TU/e), who collected data from three different vehicles/CANs: an Opel Astra; a Renault Clio; and a CAN testbed, which consists of a VW instrument cluster, two Arduino boards (programmed to be a legitimate and a strongly compromised ECU, respectively) and a joystick programmed to replicate the throttle that sends messages used by the speedometer in the instrument cluster. They simulate a set of attacks on each CAN, and for all but one attack, simply augmented the recorded data in post-processing by doing the following: for fabrication attacks, they “added packets manually and adjusted timestamps accordingly”; for suspension attacks, they deleted particular frames; and for masquerade attacks, they replaced the data field of particular frames. The dataset also includes one real attack on their CAN testbed, a targeted ID fabrication attack. The joystick is used to send messages during normal traffic, and during the attack, messages with this ID are injected by the compromised ECU.

Attacks

— Simulated. Includes the capability to create seven kinds of masquerade attacks. Specifically, given a start point, a target ID, and a set of contiguous bytes in the data field to target, the original value of each byte can be either replaced (by a constant value, random value, or an increasing/decreasing sequence over each frame) or incremented (by a specified value, or a increasing/decreasing sequence over each frame). The modified copy, containing the simulated injections, does not have attacks labeled, but as all modification parameters are passed by the user, these could presumably be determined.

Benefits

— This dataset includes the only diagnostic protocol attack publicly available, and the only suspension attack (simulated) in real CAN data. (Recall that SynCAN contains suspension attacks but is signal data from a simulated CAN.) The same set of attacks is available for testing on multiple vehicles/CANs.

Drawbacks

— First and foremost, adjusting timestamps in post-processing alters data and diminishes fidelity in a critical way: message timing on the bus is dependent on each ID’s frequency and priority through the arbitration process. Thus, changing message timings risks creation of a synthetic dataset that is not realistic. Secondly, many attacks are unrealistic. For the DoS attack, they simply overwrite 10s worth of frames, which is not how real DoS attacks appear, and many attacks are far too short (e.g., 10 messages) with the injections often too dispersed to affect vehicle functionality in practice. With respect to the prototype, it is unclear how they generated ambient traffic (e.g., were they recorded and replayed from another car?), which clearly affects fidelity, and such a testbed is an imperfect proxy for a real vehicle. Finally, attack labels are in an unstructured text file, so there is no way of programmatically reading what/when packets were injected.

CrySyS Lab Can-Log-Infector & Ambient CAN Traces

The Laboratory of Cryptography and System Security (CrySyS Lab) at Budapest University of Technology and Economics published an ambient CAN dataset (along with GPS data) from a comprehensive set of driving scenarios. They pair this data with their open-source CAN Log Infector tool (https://github.com/CrySyS/can-log-infector), which is used to manipulate data contents of the CAN log traces to simulate an attack. Using this tool, a variety of different masquerade attacks can be created in ambient CAN data by modifying only the data field of a specified target ID.

Attacks

— Simulated. Includes the capability to create seven kinds of masquerade attacks. Specifically, given a start point, a target ID, and a set of contiguous bytes in the data field to target, the original value of each byte can either be replaced (by a constant value, random value, or an increasing/decreasing sequence over each frame) or incremented (by a specified value, or a increasing/decreasing sequence over each frame). The modified copy, containing the simulated injections, does not have attacks labeled, but as all modification parameters are passed by the user, these could presumably be determined.

Benefits

— While these authors do not provide any CAN data with attacks, the authors provide their framework for simulating a wide variety of masquerade attacks; this facilitates the creation of unlimited masquerade attacks—for example, combining different payload manipulation techniques simultaneously on different IDs, in any CAN data. Furthermore, this software is open source, and can be easily extended to add new attacks. Additionally, this is the only dataset with real CAN data that does not have only “injections” a constant target value over time, and is the only dataset (other than ours) that allows for modifying only part of a data field, and thus enables targeting signals. Finally, other than ours, this is the only dataset furnished with descriptions of the driver’s actions during ambient captures, which is highly valuable when for training and testing an IDS.

Drawbacks

— As attacks are added in post-processing, there is no guarantee that these attacks would actually affect vehicle function. Moreover, these attacks are completely blind to the function and signal mapping of particular target IDs and meaning of the values being modified. There are also logical problems with Can-Log-Infector’s implementation, most notably that whole bytes must be changed and all selected bytes must change uniformly. Since CAN-Log-Infector can only change each of the eight bytes in the data field, signals that do not exactly fill a set of bytes cannot be solely targeted (recall that payloads are composed of several signals of varying lengths and positions whose bits often cross byte boundaries). Furthermore, incrementing whole bytes means multi-byte signals will vary in a highly discontinuous manner. Finally, the method for specifying the injection interval is rather irksome—rather than specifying a starting timestamp, the user passes “a value between 0 and 1 (indicating) the ratio when the attack should start regarding the full length of the capture,” and the attack end point cannot be specified.

Iv ORNL Dataset

The ORNL dataset (https://0xsam.com/road/, DOI:
10.13139/ORNLNCCS/1728694) consists of 33 attack captures totalling about 30 minutes, and 12 ambient captures containing about 3 hours of ambient data. These are enumerated in Table II; attacks are described in more detail in Sec. IV-A; important syntactic metadata appears in Sec. IV-B, and further descriptions are provided in the documentation published with our dataset.

We collected CAN data using SocketCAN [28] software on a Linux computer with a Kvaser Leaf Light V2 connecting to the OBD-II port. All of the data is from a single vehicle, the make/model of which we do not disclose. The published data has been obfuscated in a way that maintains the anonymity of the vehicle, while preserving all important aspects of the data for an IDS (see Sec. IV-C). During all of the attacks, the vehicle was on a dynamometer, and was actively being driven. Ambient data was collected both on the dynamometer and on roads, while performing a variety of normal and sometimes unusual but benign driving activities (e.g., unbuckled seatbelt or opened door while driving).

Iv-a Attacks

Attack captures are detailed below in order of expected difficulty to detect.

Iv-A1 Fuzzing Attack

We mounted the less stealthy version of the fuzzing attack, injecting frames with random IDs (cycling in order from 0x000 to 0x255) with maximum payloads (0xFFFFFFFFFFFFFFFF) every .005s (as opposed to only injected IDs seen in ambient data). Many physical effects of this attack were observed—accelerator pedal is impotent, dash and lights activated, seat positions move, etc. By injecting messages with maximal payload, we prevent incidental ECU bus-off.

Iv-A2 Targeted ID Fabrication & Masquerade Attacks

We performed targeted ID fabrication attacks using the flam delivery, meaning a message is injected immediately after a legitimate message with the target ID is seen. As discussed in Section II-B, the flam technique allows for dynamic injection; that is, the legitimate ID message is read, only the bits corresponding to the target signal are modified with malicious values, and then this spoofed message is injected. When only part of the message is modified, we refer to this as targeting a signal, rather than an ID. Designing these attacks required reverse engineering of signals for this vehicle, which we completed using CAN-D [26] signal reverse engineering algorithm and manually verifying the results. The targeted ID fabrication attacks and masquerade attacks are as follows:

  • [leftmargin=.5cm]

  • Correlated Signal — The single ID message communicating the four wheels’ speeds (each is a two-byte signal) is injected with four false wheel speed values that are all pairwise very different. This effectively kills the car—it rolls to a stop and inhibits the driver from effecting acceleration, usually until the car is restarted.

  • Max Speedometer — The speedometer signal (one byte) is targeted. We modify this signal value to be the maximum (0xFF), causing the speedometer to falsely display a maximum value.

  • Max Engine Coolant Temperature — We target the engine coolant signal (one byte), modifying the signal value to be the maximum (0xFF). The physical effect is an “engine coolant too high” warning light on the dash illuminates.

  • Reverse Light — A binary (one bit) signal communicating the state of the reverse lights (on/off) is targeted. We perform two slight variations of the attack, where we manipulate the value to off (on), while the car is in Reverse (Drive), respectively. The effect is that the reverse lights do not reflect what gear the car is using.

Modified # Logs
Accelerator Attack (In Drive) 2
Accelerator Attack (In Reverse) 2
Correlated Signal Fabrication Attack 3
Correlated Signal Masquerade Attack 3
Fuzzing Attack 3
Max Engine Coolant Temp Fabrication Attack 1
Max Engine Coolant Temp Masquerade Attack 1
Max Speedometer Fabrication Attack 3
Max Speedometer Masquerade Attack 3
Reverse Light Off Fabrication Attack 3
Reverse Light Off Masquerade Attack 3
Reverse Light On Fabrication Attack 3
Reverse Light On Masquerade Attack 3
Dynamometer Various Ambient 10
Road Various Ambient 2
TABLE II: Logs in ROAD CAN Intrusion Detection Dataset

For all of these targeted ID attacks, we provide two versions of the same CAN data captures: the original fabrication attack, and a version slightly modified in post-processing to simulate a masquerade attack. Refer to the Table II, which itemizes the fabrication/masquerade pairs. A subset of the fabrication/masquerade pairs attacks are visualized in Table III.

The fabrication attack versions are the original altered capture, including both the legitimate target ID frames and the injected frames. Because these are real, physically verified attacks with the minimally occurring injected frames (due to the flam delivery), they provide perhaps the best (i.e., most stealthy/most difficult to detect), current, public data for testing frequency-based IDSs.

The masquerade attack versions provide a more advanced version of the attack, in which we remove the legitimate target ID frames preceding each injected frame, simulating a masquerade attack. In effect, this removes message confliction in the data, and makes it appear as though only the spoofed messages are present during the injection interval. With these masquerade datasets, frequency-based approaches will almost certainly fail to provide accurate detection. It is important to note that while the masquerade aspect is simulated through post-processing, this means of alteration avoids problematic issues with synthetic data. Namely, the effect of the attack on the vehicle was physically verified; every message appearing in the data was actually seen by the car in the order it appears in the data; and no aspect of CAN protocol was violated. As discussed in the introduction, there are no publicly available, real CAN data captures with real masquerade attacks, and the extreme hacking skill required to implement such an attack on a real vehicle seems to be preventing CAN IDS researchers from implementing such an attack. This provides the highest fidelity alternative possible.

Iv-A3 Accelerator Attacks

The “accelerator attack” is an advanced attack, that does not fit into the general of framework injection attacks introduced Sec. II-B. This attack exploits a vulnerability particular to the vehicle make/model that puts the ECUs into a compromised state. We have responsibly disclosed this vulnerability to the OEM, and will not disclose details of how to implement this attack.

We do not include the CAN data during the exploit. After the exploit, the effect is that the vehicle is in a state that has less control by the driver as follows:

  • When put into drive gear, the vehicle accelerates to a fixed speed and then holds this speed (regardless of accelerator pedal position or cruise control setting).

  • In reverse, the vehicle accelerates to a (different) fixed speed and holds this speed (regardless of accelerator pedal position or cruise control setting).

  • Cruise control is disabled.

  • Touching the brake pedal results in the acceleration ceasing and the brakes engaging normally.

  • When the brake is released, the vehicle commences to accelerate as described above.

The Accelerator Attack captures have no injected messages, but simply record the CAN data when the vehicle is in this state. Discrepancies in the driver inputs (e.g., pressing the accelerator pedal) with the vehicle’s actions are present.

Iv-B Syntactic Description

All of the CAN data files are logged using the standard can-utils [41] candump format:

Note that all data fields in these logs contain the full 8 bytes, which we padded with zeros if necessary. The channel is always

can0, so this column can be dropped. We provide metadata (in JSON format) for each capture, including a general description of driving activities, the length of the capture in seconds, and whether or not the car was on the dynamometer. For attack captures, we also include whether the capture was modified (i.e., masquerade attacks), the injection ID and data field, and the interval of injection (start, end) corresponding to the time of the first/last injected message in elapsed seconds. Importantly, we do not label individual messages as attack/normal, because the software we used to collect did not have that capability. However, with injection ID, data, and intervals, these can be labeled in post-processing fairly easily. Examples of the provided metadata are shown in Fig. 4.

ambient_dyno_drive_basic_short: {     description: "start from park; basic drive activities (e.g., drive; accelerate; brake; reverse; ect.)"     elapsed_sec: 444.75061,     on_dyno: True } correlated_signal_attack_1:{     description: "start from driving; accelerate; start injecting; car rolls to stop; stop injecting; accelerate"     elapsed_sec: 33.101852,     injection_id: "0x6e0",     injection_data_str: "595945450000FFFF",     injection_interval:         [9.191851, 30.050109],     modified: False,     on_dyno: True}
Fig. 4: Snippet of metadata provided for each capture, with an example of ambient (right) and attack (left) entries.

We use a wildcard character “X” in the injection_data_str field when a signal was targeted (only some, not all bytes in the message are manipulated) to indicate that the byte in the given position was not modified in the injection. Similarly, “X” appears in the injection_id field to indicate that no particular ID was targeted, which is only the case in the fuzzing attack. For the accelerator attack, the injection_id and injection_data_str are null, and the injection interval is just the start and end time of the capture. (This will be included in the full documentation.)

Message Timing Target ID Data Field (Near Attack Start) Target ID (Select) Signals

 

Correlated Signal Attack

Fabrication

Masquerade

 

Max Speedometer Attack

Fabrication

Masquerade

 

Reverse Light Off Attack

Fabrication

Masquerade

 

TABLE III: Depiction of six of the targeted ID & masquerade attacks in the ORNL dataset, showing both the Timing Transparent (fabrication, with message confliction) and Timing Opaque (masquerade, without message confliction) versions for three attack types. The -axis of all plots are elapsed time (s), and the red dashed lines demarcate the attack interval. The three main columns visualize different aspects of each of the six attacks: Message Timing: Inter-message arrival time (ms) between all messages shown in the Top All Messages subplot, and between only the target ID messages in the bottom Target ID Messages (Near Attack Start) subplot, which zooms in to 15s before to 20s after the attack start. blue dots/red x’s indicate legitimate/injected messages. Compare the six All Messages (Top) subplots with Fig. 2 to see overall bus timing is nearly undisturbed, whereas previous attacks are blatant. The six Target ID Messages (bottom) subplots illustrate that the fabrication attacks (using flam injection delivery) cause unusually short inter-message times for the target ID, while masquerade attacks do not cause perceptible timing changes. Target ID Data Field: Time series of 64-bit binary data during the time period near the attack start (black denotes 1s, white denotes 0s). If only part of the message was altered (i.e., one target signal), the section of altered bits are delimited with red solid lines. Through visual inspection, fabrication attacks are more obvious due to message confliction, and both fabrication and masquerade attacks are more noticeable when the entire message is targeted (e.g., Correlated Signal Attack), rather than just a single signal. Target ID Signals: The time series of signals in the target ID message are depicted, annotated with signal names and bit ranges, which are made boldface for target signals (note not all non-target signals in the message may be shown). Notice the Max Speedometer and Reverse Light Off attack target different signals in the same ID. While even the masquerade versions of the first two attack types are somewhat visually identifiable at the signal level due to discontinuities and extreme values, the Reverse Light Off Attack targeting a 1-bit signal is difficult to discern without understanding more complex signal relationships or by examining signals in other messages.

Iv-C Obfuscation

While other public CAN datasets provide information on the make, model, and year of the vehicles attacked, it would be irresponsible, given our previous disclosure, to release such information. Furthermore, we have taken steps to obfuscate the CAN data in such as way as to preserve the characteristics necessary for CAN IDS development, while ideally preventing users from knowing the make, model, and year of the vehicle. Below we itemize the augmentations performed on the data to preserve anonymity:

  • [leftmargin=.5cm]

  • Absolute timestamps may be all shifted by a scalar, but relative times are preserved.

  • Messages from particular arbitration IDs that were deemed unimportant were replaced with the “filler message” FFF#0000000000000000 (ID#Data in hex). Relative timestamps are still preserved for these messages.

  • Messages on reserved IDs (greater than 0x700: e.g., diagnostic messages) have been removed.

  • Arbitration IDs have been anonymized in such a way that arbitration order/priority is not preserved. There is a one-to-one mapping between the original and the anonymized IDs for a given vehicle (not including the “filler messages” under ID 0xFFF). For example, if ID 0x10 is converted to ID 0x821 in an anonymized log, every anonymized log for the given vehicle will have ID 0x10 converted to ID 0x821.

  • Data fields have been scrambled in such a way that signals have been preserved, and fields are scrambled in a consistent way for each ID/Vehicle. For example, if the first byte is moved to the end of the field for ID 0x10 on a given vehicle, it will be shifted this way in all messages from ID 0x10 in all logs from that vehicle.

V Discussion

By design, our dataset contains attacks requiring detectors of varying types and sophistication. The desired alteration of vehicle functionality was physically verified for all attacks included.

To illustrate characteristics of the data, refer to Table III, which includes three different visualizations of six attack captures in our dataset. The three attack types in Table III are presented roughly in order of difficulty to detect (using a non-timing–based IDS).

  • The Correlated Signal attacks break the relationship between all the usually correlated wheel speed signals in the message, and cause signal discontinuities as well.

  • The Max Speedometer attacks causes a signal discontinuity for the target signal without affecting other portions of the message.

  • The Reverse Light Off attack breaks relationships of CAN signals.

Details and takeaways are documented in the caption.

Ambient dynamometer driving includes activities (e.g., reverse, drive, accelerate, brake) and can be used for training detectors and/or testing for false positives.

The fuzzing attack should be very apparent to most detectors, but should be accurately detectable with timing/frequency approaches. As discussed above, by using the flam technique (one injected message right after each ambient message, and only necessary bytes in the data field manipulated), our dataset provides the stealthiest possible fabrication attacks. This is the hardest attack we expect a timing/frequency based detector to accurately detect. These fabrication attacks enhance the quality in terms of stealth and realism over what is currently available.

Next, each fabrication attack is accompanied by a masquerade version where the ambient messages in 1:1 correspondence with the injected messages are synthetically removed after the capture. As discussed previously, while Miller & Valasek have exhibited masquerade attacks on real vehicles (even remotely!), the hacking expertise and time required has stymied any defensive researchers from acquiring such data. Our dataset provides the best possible alternative: real data with real attacks that are physically verified seem to have the ability to produce a bona fide masquerade attack. This provides the best possible alternative and should be T.O. attacks, not detectable by timing-base means. We believe that while these attacks are T.O., they are likely detectable by intra-signal (or per-signal) models—those detectors modeling each signal’s time series.

Finally, the Accelerator Attacks are unique examples of CAN data from a functioning vehicle with only the vehicle’s ECUs transmitting messages, but in a compromised manner. There is no disruption of the timing of the CAN messages, thus this attack should be T.O. This attack permits adequate testing of advanced IDSs that must rely on some understanding of the payloads and their correlations with each other.

Having data from a wide variety of real vehicles in actual road driving conditions with simple to very sophisticated attacks is of course ideal; thus, many limitations to this dataset are indeed present. Notably, we only present data from a single vehicle. Secondly, the dataset includes only dynamometer data, which is well known to cause subtle differences in actual road driving. Thirdly, we provide attack intervals rather than labeling each message. Finally, our masquerade attacks were the best possible versions we (as non-professionals in offensive security) could provide; as such, they rely on a small amount of simulation.

Nearly all detectors in the literature that hold the promise of detecting T.O. attacks have yet to be tested on an appropriate dataset. The previous two types of attacks provide the first such real data for these detectors to be validated. It is our hope that this dataset, and hopefully others to follow, will allow for better comparison and validation of these proposed detectors, and perhaps contribute to other advances in CAN bus security.

Acknowledgment

Special thanks to Suzanne Parete-Koon and Ross Miller for assistance in posting the dataset online. Thanks to Pablo Moriano, Stacy Prowell, & John Baston for helping us polish this document. Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U. S. Department of Energy.

References

  • [1] K. Koscher et al., “Experimental security analysis of a modern automobile,” in 2010 IEEE S&P.    IEEE, 2010.
  • [2] S. Checkoway, D. McCoy, B. Kantor, D. Anderson, H. Shacham, S. Savage, K. Koscher, A. Czeskis, F. Roesner, and T. Kohno, “Comprehensive experimental analyses of automotive attack surfaces,” in Proceedings of the 20th USENIX conference on Security, ser. SEC’11.    USENIX Association, Aug 2011.
  • [3] T. Hoppe, S. Kiltz, and J. Dittmann, “Security threats to automotive CAN networks practical examples and selected short-term countermeasures,” Reliability Engineering & System Safety, vol. 96, Jan 2011.
  • [4] S. Woo, H. J. Jo, and D. H. Lee, “A practical wireless attack on the connected car and security protocol for in-vehicle CAN,” IEEE Transactions on Intelligent Transportation Systems, pp. 1–14, Sep 2014.
  • [5] C. Valasek and C. Miller, “Remote exploitation of an unaltered passenger vehicle,” p. 91, Aug 2015.
  • [6] C. Valasek and D. C. Miller, “CAN message injection,” p. 29, Jun 2016.
  • [7] K.-T. Cho and K. G. Shin, “Error handling of in-vehicle networks makes them vulnerable,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.    ACM, 2016, pp. 1044–1055.
  • [8] T. K. S. Lab, Experimental Security Research of Tesla Autopilot, Mar 2019.
  • [9] W. Wu, R. Li, G. Xie, J. An, Y. Bai, J. Zhou, and K. Li, “A survey of intrusion detection for in-vehicle networks,” 2019.
  • [10] S.-F. Lokman, A. T. Othman, and M.-H. Abu-Bakar, “Intrusion detection system for automotive controller area network (CAN) bus system: a review,” EURASIP Journal on Wireless Communications and Networking, vol. 2019, no. 1, p. 184, Jul 2019.
  • [11] G. Loukas, E. Karapistoli, E. Panaousis, P. Sarigiannidis, A. Bezemskij, and T. Vuong, “A taxonomy and survey of cyber-physical intrusion detection approaches for vehicles,” Ad Hoc Networks 2019, vol. 84, p. 27, Mar 2019.
  • [12] G. K. Rajbahadur, A. J. Malton, A. Walenstein, and A. E. Hassan, “A survey of anomaly detection for connected vehicle cybersecurity and safety,” p. 6, 2018.
  • [13] U. E. Larson, D. K. Nilsson, and E. Jonsson, “An approach to specification-based attack detection for in-vehicle networks,” in 2008 IEEE Intelligent Vehicles Symposium.    IEEE, Jun 2008, pp. 220–225. [Online]. Available: http://ieeexplore.ieee.org/document/4621263/
  • [14] N. Salman and M. Bresch, “Design and implementation of an intrusion detection system (IDS) for in-vehicle networks,” Ph.D. dissertation, Chalmers University of Technology, 2017. [Online]. Available: http://publications.lib.chalmers.se/records/fulltext/251871/251871.pdf
  • [15] K.-T. Cho and K. G. Shin, “Viden: Attacker identification on in-vehicle networks,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’17.    ACM, 2017, pp. 1109–1123. [Online]. Available: http://doi.acm.org/10.1145/3133956.3134001
  • [16] W. Choi, K. Joo, H. J. Jo, M. C. Park, and D. H. Lee, “Voltageids: Low-level communication characteristics for automotive intrusion detection system,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 8, pp. 2114–2129, Aug 2018.
  • [17] M. R. Moore, R. A. Bridges, F. L. Combs, M. S. Starr, and S. J. Prowell, “Modeling inter-signal arrival times for accurate detection of CAN bus signal injection attacks: a data-driven approach to in-vehicle intrusion detection,” in Proceedings of the 12th Annual Conference on Cyber and Information Security Research - CISRC ’17.    ACM Press, 2017, pp. 1–4. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3064814.3064816
  • [18] A. Tomlinson, J. Bryans, S. A. Shaikh, and H. K. Kalutarage, “Detection of automotive CAN cyber-attacks by identifying packet timing anomalies in time windows,” Jun 2018, pp. 231–238. [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8416254
  • [19] Y. Hamada, M. Inoue, H. Ueda, Y. Miyashita, and Y. Hata, “Anomaly-based intrusion detection using the density estimation of reception cycle periods for in-vehicle networks,” SAE International Journal of Transportation Cybersecurity and Privacy, vol. 1, no. 1, pp. 39–56, May 2018.
  • [20]

    A. Taylor, S. Leblanc, and N. Japkowicz, “Anomaly detection in automobile control network data with long short-term memory networks,” in

    2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)

    .    IEEE, Oct 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7796898/
  • [21]

    M.-J. Kang and J.-W. Kang, “Intrusion detection system using deep neural network for in-vehicle network security,”

    PLOS ONE, vol. 11, no. 6, Jun 2016.
  • [22] M. Marchetti, D. Stabili, A. Guido, and M. Colajanni, “Evaluation of anomaly detection for in-vehicle networks through information-theoretic algorithms,” in 2016 IEEE 2nd International Forum on Research and Technologies for Society and Industry Leveraging a better tomorrow (RTSI).    IEEE, Sep 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7740627/
  • [23] M. Hanselmann et al., “CANet: An unsupervised intrusion detection system for high dimensional CAN bus data,” IEEE Access, vol. 8, 2020.
  • [24] S. Nair Narayanan, S. Mittal, and A. Joshi, “Obd_securealert: An anomaly detection system for vehicles,” May 2016. [Online]. Available: https://ieeexplore.ieee.org/document/7501710
  • [25] A. R. Wasicek, M. D. Pese, and A. Weimerskirch, “Context-aware intrusion detection in automotive control systems,” p. 14, 2017.
  • [26] M. E. Verma, R. A. Bridges, J. J. Sosnowski, S. C. Hollifield, and M. D. Iannacone, “CAN-D: A modular four-step pipeline for comprehensively decoding controller area network data,” arXiv:2006.05993 [cs, eess], Jun 2020, arXiv: 2006.05993. [Online]. Available: http://arxiv.org/abs/2006.05993
  • [27] “Comma AI,” https://comma.ai/.
  • [28] “SocketCAN,” https://python-can.readthedocs.io/en/master/interfaces/socketcan.html.
  • [29] Z. Tyree, R. A. Bridges, F. L. Combs, and M. R. Moore, “Exploiting the shape of CAN data for in-vehicle intrusion detection,” arXiv:1808.10840 [cs], Aug 2018, arXiv: 1808.10840. [Online]. Available: http://arxiv.org/abs/1808.10840
  • [30] K. Pawelec, R. A. Bridges, and F. L. Combs, “Towards a CAN IDS based on a neural network data field predictor,” in Proceedings of the ACM Workshop on Automotive Cybersecurity, ser. AutoSec ’19.    ACM, 2019, pp. 31–34, event-place: Richardson, Texas, USA. [Online]. Available: http://doi.acm.org/10.1145/3309171.3309180
  • [31] “Automotive buses,” https://training.dewesoft.com/online/course/automotive-buses-can-measurement.
  • [32] R. Bosch et al., “CAN specification version 2.0,” Rober Bousch GmbH, Postfach, vol. 300240, p. 72, 1991.
  • [33] W. Voss, A Comprehensible Guide to Controller Area Network, 2008.
  • [34] K.-T. Cho and K. G. Shin, “Fingerprinting electronic control units for vehicle intrusion detection,” in 25th USENIX Security Symposium USENIX Security 16), 2016, pp. 911–927.
  • [35] H. Lee et al., “OTIDS: A novel intrusion detection system for in-vehicle network by using remote frame,” in PST.    IEEE, 2017.
  • [36] E. Seo, H. M. Song, and H. K. Kim, “Gids: Gan based intrusion detection system for in-vehicle network,” in 2018 16th Annual Conference on Privacy, Security and Trust (PST), Aug 2018. [Online]. Available: https://ieeexplore.ieee.org/document/8514157
  • [37] M. L. Han, B. I. Kwak, and H. K. Kim, “Anomaly intrusion detection method for vehicular networks based on survival analysis,” Vehicular Communications, vol. 14, Oct 2018.
  • [38]

    H. M. Song, J. Woo, and H. K. Kim, “In-vehicle network intrusion detection using deep convolutional neural network,”

    Vehicular Communications, vol. 21, Jan 2020.
  • [39] Dupont, G. (Guilaume), Lekidis, A. (Alexios), Den Hartog, J. (Jerry), and Etalle, S. (Sandro), “Automotive controller area network (CAN) bus intrusion dataset v2,” 2019. [Online]. Available: https://data.4tu.nl/repository/uuid:b74b4928-c377-4585-9432-2004dfa20a5d
  • [40] I. Berger, R. Rieke, M. Kolomeets, A. Chechulin, and I. Kotenko, Comparative Study of Machine Learning Methods for In-Vehicle Intrusion Detection.    Springer International Publishing, Jan 2019, vol. 11387, pp. 85–101. [Online]. Available: http://link.springer.com/10.1007/978-3-030-12786-2_6
  • [41] “can-utils,” https://github.com/linux-can/can-utils.