AVTPnet: Convolutional Autoencoder for AVTP anomaly detection in Automotive Ethernet Networks

by   Natasha Alkhatib, et al.
Télécom Paris

Network intrusion detection systems are widely regarded as efficient tools for securing in-vehicle networks against diverse cyberattacks. However, since cyberattacks are always evolving, signature-based intrusion detection systems alone are no longer sufficient. An alternative solution is the deployment of deep learning-based intrusion detection systems (IDSs), which play an important role in detecting unknown attack patterns in network traffic. To our knowledge, no previous research work has been done to detect anomalies on automotive Ethernet-based in-vehicle networks using anomaly-based approaches. Hence, in this paper, we propose a convolutional autoencoder (CAE) for offline detection of anomalies on the Audio Video Transport Protocol (AVTP), an application layer protocol implemented in the recent in-vehicle network technology, Automotive Ethernet. The CAE consists of an encoder and a decoder with asymmetrical CNN structures. Anomalies in the AVTP packet stream, which may lead to critical interruption of media streams, are detected by measuring the reconstruction error of each sliding window of AVTP packets. Our proposed approach is evaluated on the recently published "Automotive Ethernet Intrusion Dataset" and is compared with other state-of-the-art traditional anomaly detection and signature-based machine learning models. The numerical results show that our proposed model outperforms the other methods and excels at predicting unknown in-vehicle intrusions, with an average balanced accuracy of 0.94. Moreover, our model has low false alarm and miss detection rates for different AVTP attack types.


I Introduction

Since the advent of powerful electronic components such as sensors and actuators, together with a robust in-vehicle infrastructure for efficient data exchange between them, driving has become safer (e.g., 360-degree surround-view parking assistance and collision avoidance systems) [1] and more pleasant (e.g., infotainment features) [2][3] over the last several decades. Ethernet, a flexible and scalable networking technology, has recently been standardized and adopted for in-vehicle communication [4][5] between different Electronic Control Units (ECUs). It fulfills basic automotive requirements that the existing in-vehicle protocols LIN, CAN, and FlexRay were not designed to cover, including reduced connectivity costs, reduced cabling weight, and support for high data bandwidth.

To ensure low-latency and high-quality transmission of time-critical and prioritized streaming data for high-end infotainment and ADAS systems, the IEEE 1722 Audio Video Transport Protocol (AVTP) [6] is adopted. AVTP specifies a protocol for transporting audio, video, and control data on a Time-Sensitive Networking (TSN) capable network [7]. As a result, we believe that AVTP will be a critical protocol for automotive Ethernet-based in-vehicle networks.

Despite the advantages of automotive Ethernet, the drive toward connectivity has significantly expanded the attack surface of automobiles, making automotive Ethernet-based in-vehicle networks increasingly susceptible to cyberattacks and raising significant security and safety issues [8]. In fact, automotive Ethernet can be attacked by exploiting its vulnerabilities [9][10]. These security breaches can affect protocols running on top of automotive Ethernet, including AVTP, and might therefore lead to the interruption of a media stream.

To address this, intrusion detection systems (IDSs) should be used in addition to specific security measures as an extra layer of protection. These systems can be classified by the activity they analyze (i.e., network traffic or host activity logs) and by their detection approach (i.e., signature-based or anomaly-based detection). Deep learning models, which are often used as anomaly-based intrusion detection techniques, are in general neural network models with a large number of hidden layers. These models can learn extremely complicated non-linear functions, and their hierarchical layer structure allows them to acquire meaningful feature representations from incoming data. Researchers have explored deep learning techniques for in-vehicle intrusion detection on the Controller Area Network (CAN) bus protocol since 2015 [12][13]. However, only a few works have studied deep learning-based IDSs for systems employing automotive Ethernet, owing to the lack of relevant public datasets. Among them, Alkhatib et al. [11] proposed a deep learning-based sequential model for offline intrusion detection on the Scalable Service-Oriented Middleware over IP (SOME/IP) application layer protocol on top of automotive Ethernet. Moreover, Jeong et al. [25] presented an intrusion detection method for detecting AVTP stream injection attacks in automotive Ethernet-based networks.

In this paper, we propose an anomaly-based IDS built on a convolutional autoencoder (CAE) for detecting anomalies in AVTP traffic. The CAE consists of an encoder and a decoder with convolutional neural network (CNN) structures that are asymmetrical, i.e., the encoder architecture does not mirror the decoder architecture. Anomalies in the AVTP packet stream are detected by using the reconstruction error of each sliding window of AVTP packets, transformed into an image, as an anomaly score.

The main contributions of this paper are as follows:

  • We propose a novel anomaly-based intrusion detection method to detect unknown cyberattacks on the AVTP protocol used in Ethernet-based in-vehicle networks for media streaming. To the best of our knowledge, we are the first to leverage anomaly-based approaches for intrusion detection on this recently adopted in-vehicle network protocol.

  • We develop an IDS to evaluate the performance of our intrusion detection method using a recently published real dataset called "Automotive Ethernet Intrusion Dataset" [24]. Moreover, we have created a new attack type, called "replay attack", from the normal packet traces of the published dataset.

  • We compare the performance of our model with other state-of-the-art traditional anomaly detection and signature-based machine learning models for different attack types. The numerical results show that our proposed model outperforms them and excels at predicting unknown in-vehicle intrusions, with an average balanced accuracy of 0.94. Moreover, our model has low false alarm and miss detection rates for different AVTP attack types.

The remainder of this paper is organized as follows. In Section II, we present an overview of media stream transportation using the AVTP network protocol. Section III discusses the detection of network anomalies using convolutional autoencoders and relevant research work. In Section IV, we present the considered AVTP dataset and the different considered attacks. The process for building our suggested anomaly-based architecture, the CAE, together with its training and evaluation, is presented in Section V. We discuss our experimental results in Section VI. Finally, in Section VII, we conclude our paper with future work directions.

II Transmission of Media Streams using AVTP

Traditional in-vehicle networks are mostly based on bus technologies that cannot keep up with the growing communication demands of the self-driving car. In particular, they cannot meet in-vehicle network requirements for high bandwidth, reliability, and real-time communication. Automotive Ethernet, a novel in-vehicle network communication technology, is implemented to ensure an appropriate level of quality of service (QoS), which is essential for time-critical automotive applications.

Audio Video Bridging (AVB) over Ethernet, a set of technical standards, provides improved synchronization, low latency, and reliability for switched Ethernet networks between multimedia devices. Recently, many automotive products, such as end-node devices (e.g., speakers, cameras, digital signal processors) and network hubs (e.g., AV bridges), have come to support it. An end-node can be a talker, a listener, or both. A talker is the transmitter of a data stream, i.e., the source of the AVB system, and a listener is the receiver, i.e., the destination of the AVB system. These end-nodes are connected by an AVB bridge, which acts as a switch that receives time-critical data from the AVB talker and forwards it to the AVB listener. The interconnection of these three components, presented in Fig. 1, is called an AVB Ethernet Local Area Network (LAN).

Fig. 1: Typical AVB Network.

As previously mentioned, AVB has diverse sub-standards to support time-critical in-vehicle applications, such as IEEE 802.1Qav, IEEE 802.1Qat, IEEE 802.1AS, and IEEE 1722. In our work, we only consider automotive cyberattacks on IEEE 1722, a stream transmission protocol in charge of transporting control data and audio and video streams. As depicted in Fig. 2, the IEEE 1722 packet and its content are sent inside an Ethernet frame. The IEEE 802.1Q header is also included in the Ethernet frame, and the priority information encapsulated within it is critical to the functioning of the AVB QoS concept. Moreover, only AVB listeners that share the AVB talker's VLAN tag can receive the audio/video stream. For IEEE 1722, the hexadecimal value of the EtherType field is 0x22F0.

Fig. 2: IEEE 1722 packet format. Source: [14]

An IEEE 1722 streaming packet comprises the header, the stream ID, the presentation time, payload information, and the payload itself. The header specifies the data type of the A/V stream and includes the sequence number that AVB listeners need to detect missing packets. The stream ID, which identifies a single data stream, is derived from the MAC address of the talker. The payload information field determines the format of the data within the payload. The AVBTP presentation time specifies when a received packet should be delivered to the AVB listener application [14].
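To make the frame layout concrete, the following sketch reads the fields discussed above from a raw frame using only the Python standard library. It is an illustration rather than the tooling used in this work: the function name `parse_avtp_frame`, the example frame, and the AVTP byte offsets (sequence number at byte 2, stream ID at bytes 4 to 11, presentation timestamp at bytes 12 to 15 of the IEEE 1722 stream header) are assumptions and should be checked against the standard.

```python
import struct

AVTP_ETHERTYPE = 0x22F0  # IEEE 1722 EtherType, as stated in the text
VLAN_TPID = 0x8100       # IEEE 802.1Q tag protocol identifier


def parse_avtp_frame(frame: bytes) -> dict:
    """Parse an 802.1Q-tagged Ethernet frame carrying an IEEE 1722 stream packet.

    The AVTP offsets used here are assumptions for illustration only.
    """
    # Ethernet header: dst MAC, src MAC, TPID, TCI, EtherType (18 bytes in total)
    dst, src, tpid, tci, ethertype = struct.unpack("!6s6sHHH", frame[:18])
    if tpid != VLAN_TPID or ethertype != AVTP_ETHERTYPE:
        raise ValueError("not a VLAN-tagged IEEE 1722 frame")
    avtp = frame[18:]
    return {
        "dst_mac": ":".join(f"{b:02x}" for b in dst),
        "src_mac": ":".join(f"{b:02x}" for b in src),
        "priority": tci >> 13,                 # 3-bit priority code point
        "vlan_id": tci & 0x0FFF,               # 12-bit VLAN identifier
        "sequence_num": avtp[2],               # used to detect missing packets
        "stream_id": avtp[4:12].hex(),         # derived from the talker MAC
        "avtp_timestamp": struct.unpack("!I", avtp[12:16])[0],
    }


# Hypothetical frame: talker 00:11:22:33:44:55, priority 3, VLAN 2, sequence 7.
example_frame = (
    bytes.fromhex("91e0f0000000")              # destination MAC (made up)
    + bytes.fromhex("001122334455")            # source (talker) MAC
    + struct.pack("!HHH", VLAN_TPID, (3 << 13) | 2, AVTP_ETHERTYPE)
    + bytes([0x00, 0x81, 0x07, 0x00])          # subtype, flags, sequence, reserved
    + bytes.fromhex("0011223344550001")        # stream ID: talker MAC + unique ID
    + struct.pack("!I", 123456)                # presentation timestamp
    + b"\x00" * 8                              # start of payload info (zeros)
)
parsed = parse_avtp_frame(example_frame)
print(parsed)
```

A frame whose TPID or EtherType does not match raises `ValueError`, which mirrors the rule that only 0x22F0 frames are treated as IEEE 1722 traffic.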

In Section IV, we present the attack model on AVTP, created by Jeong et al. [25] and by us, and list the relevant AVTP features leveraged for anomaly detection.

III State-of-the-art Anomaly-based IDS using Convolutional Auto-encoder

Intrusion Detection Systems (IDSs) are considered an efficient tool for guaranteeing the confidentiality, integrity, and availability of network data. Network intrusions can be detected and identified by comparing their attack signature against a database that contains a pre-defined list of cyberattack patterns. This approach is called signature-based intrusion detection. However, regularly updating signature databases is not practicable because of the constant evolution of attack tactics. An alternative solution is the adoption of anomaly-based IDSs, which find patterns in the data that deviate from other observations and indicate the presence of malicious activity in the network traffic.

Deep learning techniques are increasingly used to address the development of complex anomaly detection-based IDSs [15]. One of the most commonly studied feature learning techniques is the use of autoencoder (AE) neural networks, which can be adapted to detect anomalies in high-dimensional data and in different data types, including images/videos, sequence data, and graph data. In this work, we leverage convolutional autoencoders (CAEs) to detect anomalies in an AVTP stream transformed into images. We describe the adopted method in Section V.


The autoencoder (AE), introduced by Rumelhart et al. [16], seeks to learn a low-dimensional feature representation space suitable for reconstructing the provided data instances. During the encoding process, the encoder maps the original data onto a low-dimensional feature space, while the decoder tries to recover the original data from the projected low-dimensional space. The parameters of the encoder and decoder networks are learned by minimizing a reconstruction loss on normal instances during training; the same reconstruction error is then used during testing as an anomaly score. In other words, anomalies that differ from the majority of the data have a larger reconstruction error than typical data. The following equations govern the behavior of an AE:

$$z = \phi_e(x; \Theta_e) \qquad (1)$$

$$\hat{x} = \phi_d(z; \Theta_d) \qquad (2)$$

$$\{\Theta_e^*, \Theta_d^*\} = \arg\min_{\Theta_e, \Theta_d} \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - \phi_d\big(\phi_e(x_i; \Theta_e); \Theta_d\big) \right\|^2 \qquad (3)$$

$$s_x = \left\| x - \phi_d\big(\phi_e(x; \Theta_e^*); \Theta_d^*\big) \right\|^2 \qquad (4)$$

where $n$ is the number of samples, $\phi_e$ is the encoding network with the parameters $\Theta_e$, $\phi_d$ is the decoding network with the parameters $\Theta_d$, and $s_x$ is a reconstruction error-based anomaly score of $x$, known as the mean squared error [15].

In addition, the convolutional autoencoder (CAE) design employs convolutional and pooling layers for encoding, and deconvolutional and unpooling layers for decoding. Thanks to weight sharing, a CAE is capable of capturing localized, frequently occurring patterns in a 2D structure [21]. Researchers have therefore widely used CAEs for anomaly detection of concrete defects [17], in automated video surveillance [18], on system logs [19], and on network application protocols such as HTTP [20].

IV Dataset Description

To detect intrusions on the Audio Video Transport Protocol (AVTP), we have used the "Automotive Ethernet Intrusion Dataset" [24] created by Jeong et al. [25], which contains benign and malicious AVTP packet captures from their physical automotive Ethernet testbed.

Dataset               # Normal packets   # Injected packets   Size
Indoor Original 1     139,440            N/A                  63.3 MB
Indoor Injected 1     139,440            65,988               93.3 MB
Indoor Injected 2     307,020            130,906              198.8 MB
Replay attack (new)   139,440            78,880               99.1 MB
TABLE I: Automotive Ethernet Intrusion Dataset (before pre-processing)

The datasets are recorded in the PCAP file format and can therefore be viewed using prevalent programming libraries and packet analyzers (such as Wireshark). The dataset contains four benign (attack-free) packet captures and four malicious ones collected in different environments. The malicious packet captures represent the injection of arbitrary stream AVTP data units (AVTPDUs) into the IVN: the attacker's goal is to output a single video frame at a terminal application connected to the AVB listener by injecting previously generated AVTPDUs during a certain period. For our experiment, we have only considered the AVTP packets collected indoors, presented in Table I. Furthermore, we have created a new type of attack using Python [26], called replay attack, in which an attacker injects previously sent legitimate AVTP packets, as listed in Table I.

IV-A Dataset Preprocessing

We have pre-processed the AVTP dataset, provided in pcap format, using Python programming language [26] and its efficient libraries.

  • Step 1 (AVTP packet labeling): To evaluate the performance of our proposed IDS, the AVTP dataset must be labeled so that each AVTP packet is marked as either injected or normal. Hence, we have used the packet crafting library Scapy [27] for decoding and labeling. A packet is labeled as injected if it belongs to the "single-MPEG-frame.pcap" capture file; otherwise, it is considered normal.

  • Step 2 (Sliding window process): After labeling, the AVTP dataset needed to be reshaped into images for use in our convolutional autoencoder-based IDS. Each packet in our dataset has 438 bytes/features, each of which has a decimal value between 0 and 255. After tuning, we found that the first 48 bytes of each AVTP packet are the most suitable features for detecting anomalies. We then used a Feature-based Sliding Window (FSW) [28] to partition all data records into square windows, where each window is a 48x48x1 three-dimensional matrix and the slide size is 1, as presented in formula (5).

    $$SW_k = \begin{bmatrix} x_{k,1} & \cdots & x_{k,48} \\ \vdots & \ddots & \vdots \\ x_{k+47,1} & \cdots & x_{k+47,48} \end{bmatrix} \qquad (5)$$

    We use $SW_k$ to denote the kth sliding window; $x_{i,j}$ is the jth feature (byte) of the ith AVTP packet, such that $k \le i \le k+47$ and $1 \le j \le 48$, $k = 1, \dots, N-47$, where $N$ is the number of packets in the capture.

  • Step 3 (Sliding window labeling): After grouping our data records into sliding windows, we labeled each AVTP window as "1" if it contains any injected or replayed packets; otherwise, it is labeled as "0".
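Steps 2 and 3 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the preprocessing code used in this work: the helper name `make_windows` is hypothetical, packets are assumed to be available as `bytes` objects of at least 48 bytes with one 0/1 label each (as produced by Step 1), and the toy capture below is synthetic.

```python
def make_windows(packets, labels, window=48, n_features=48):
    """Group packets into overlapping square windows with slide size 1.

    packets: list of bytes objects (one per AVTP packet)
    labels:  list of 0/1 packet labels (1 = injected or replayed)
    Returns (windows, window_labels); each window is a list of `window`
    rows, each row holding the first `n_features` byte values of a packet.
    """
    rows = [list(p[:n_features]) for p in packets]
    windows, window_labels = [], []
    for k in range(len(rows) - window + 1):
        windows.append(rows[k:k + window])
        # Step 3: a window is anomalous if it contains any anomalous packet.
        window_labels.append(1 if any(labels[k:k + window]) else 0)
    return windows, window_labels


# Synthetic capture: 100 packets of 438 bytes, with packet 60 marked as injected.
packets = [bytes([i % 256]) * 438 for i in range(100)]
labels = [0] * 100
labels[60] = 1

windows, window_labels = make_windows(packets, labels)
print(len(windows), sum(window_labels))
```

With a slide size of 1, a capture of N packets yields N - 47 windows, which is consistent with the 139,440 packets of "Indoor Original 1" becoming 139,393 sliding windows after pre-processing. Because a single injected packet appears in up to 48 consecutive windows, the share of anomalous windows is much larger than the share of anomalous packets.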

After pre-processing, the AVTP dataset, presented in Table II, is ready to be leveraged for anomaly detection. Since we train our anomaly detector through a semi-supervised learning approach, the dataset "Indoor Original 1", which contains only normal data records, was used for training, while the datasets "Indoor Injected 1", "Indoor Injected 2", and "Replay attack" were used for testing. For reproducibility, readers can access the pre-processed datasets, provided in pickle format, on our GitHub repository.


Dataset               # Normal SW   # Injected SW   Size
Indoor Original 1     139,393       N/A             2.6 GB
Indoor Injected 1     49,377        156,004         3.8 GB
Indoor Injected 2     106,822       331,057         8.1 GB
Replay attack (new)   256           268,017         5.1 GB
TABLE II: Automotive Ethernet Intrusion Dataset (after pre-processing)

V Proposed Anomaly-based CAE

The purpose of this work is to investigate the usefulness of a CAE's reconstruction error as an anomaly detector for AVTP packet captures. First, AVTP packets and their features are extracted from the AVTP dataset, as described in Section IV, and fed as input data to the CAE. The CAE is trained on normal events only, and the frame reconstruction error (FRE) of each AVTP window is then used to separate normal from abnormal events. Finally, the balanced accuracy, false positive rate (FPR), and false negative rate (FNR) are used to quantify classification performance.

V-A Proposed Architecture & Training Approach

One method of learning AVTP packet transmission behavior is to use a fully convolutional autoencoder (CAE), which can learn the important characteristics of normal communication with minimal reconstruction error. Conversely, abnormal events, which are not present in the training set, are expected to have large reconstruction errors. Our CAE architecture, shown in Fig. 3, is composed of two convolutional layers and two pooling layers on the encoder side, and two deconvolutional layers followed by one convolutional layer on the decoder side.

Layer     Dimensions
Input     48 x 48 x 1
Conv1     48 x 48 x 32
Pool1     24 x 24 x 32
Conv2     24 x 24 x 32
Pool2     12 x 12 x 32
Deconv1   24 x 24 x 32
Deconv2   48 x 48 x 32
Conv3     48 x 48 x 1
TABLE III: Dimensions of each layer of the CAE
Fig. 3: Architecture of the CAE proposed in this work.
Model                     Indoor Injected 1           Indoor Injected 2           Replay Attack
                          Balanced Acc  FPR    FNR    Balanced Acc  FPR    FNR    Balanced Acc  FPR    FNR
CAE (proposed)            0.87          0.25   0.008  0.87          0.25   0.007  0.85          0.01   0.28
CAE (proposed)            0.86          0.25   0.02   0.87          0.25   0.02   0.77          0.25   0.21
CAE (proposed)            0.98          0      0.04   0.98          0      0.04   0.85          0.01   0.27
Isolation Forest [33]     0.57          0.34   0.51   0.58          0.6    0.45   0.32          0.85   0.5
Local Outlier Factor [34] 0.28          0.64   0.8    0.57          0      0.85   0.5           0      0.99
CNN for AVTP [25]         0.82          0.32   0.004  0.82          0.32   0.005  0.73          0.34   0.19
TABLE IV: Comparison of CAE with other models

The first convolutional layer of the CAE has 32 filters with stride 1 and a kernel size of (3,3). It produces 32 feature maps with a resolution of 48 × 48 pixels. Next comes the first pooling layer, which produces 32 feature maps with a resolution of 24 × 24 pixels. All pooling layers have a 2 × 2 kernel and perform sub-sampling by max-pooling. The second convolutional layer also has 32 filters. The last encoder layer produces 32 feature maps of 12 × 12 pixels. The decoder reconstructs the input by applying two deconvolutional operations and one convolutional operation. The output of the final decoder layer is the reconstructed version of the input. Table III summarizes the details of each layer of the CAE. The inputs to the CAE are the sliding window frames SW extracted from AVTP packet captures; see Section IV.
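The layer dimensions in Table III can be checked with simple shape arithmetic. The sketch below traces the tensor shape through the network in plain Python; it is a worked check, not the model definition. Two things are assumptions here because the text does not state them: the convolutions are taken to use "same" padding (so that stride 1 preserves the 48 × 48 resolution), and each deconvolution is taken to upsample by a factor of 2. The helper names are hypothetical.

```python
import math


def conv_same(shape, filters, stride=1):
    """Output shape of a convolution with 'same' padding (assumed)."""
    h, w, _ = shape
    return (math.ceil(h / stride), math.ceil(w / stride), filters)


def max_pool(shape, size=2):
    """Output shape of non-overlapping max-pooling with a size x size kernel."""
    h, w, c = shape
    return (h // size, w // size, c)


def deconv_same(shape, filters, stride=2):
    """Output shape of a transposed convolution that upsamples by `stride` (assumed)."""
    h, w, _ = shape
    return (h * stride, w * stride, filters)


layers = [
    ("Conv1", lambda s: conv_same(s, 32)),
    ("Pool1", max_pool),
    ("Conv2", lambda s: conv_same(s, 32)),
    ("Pool2", max_pool),
    ("Deconv1", lambda s: deconv_same(s, 32)),
    ("Deconv2", lambda s: deconv_same(s, 32)),
    ("Conv3", lambda s: conv_same(s, 1)),
]

shape = (48, 48, 1)            # one sliding window, treated as a grayscale image
trace = [("Input", shape)]
for name, layer in layers:
    shape = layer(shape)
    trace.append((name, shape))

for name, dims in trace:
    print(f"{name:8s} {dims[0]} x {dims[1]} x {dims[2]}")
```

The trace ends at 48 × 48 × 1, the same shape as the input, which is what allows the reconstruction error to be computed element by element.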

The CAE is trained using the backpropagation algorithm to minimize the mean frame reconstruction error (MFRE) defined below:

$$\text{MFRE} = \frac{1}{n} \sum_{k=1}^{n} \sum_{i=k}^{k+47} \sum_{j=1}^{48} \left( x_{i,j} - \hat{x}_{i,j} \right)^2 \qquad (6)$$

where $n$ represents the total number of normal instances used for training, and $x_{i,j}$ and $\hat{x}_{i,j}$ represent the value of feature $j$ for packet $i$ in the original frame $SW_k$ and in the reconstructed frame, respectively. To optimize the loss function, we use the Adam optimizer [30].
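As a small worked example of this objective, the sketch below computes a frame reconstruction error and its mean over a set of frames in plain Python. It assumes the error of one frame is the sum of squared differences between original and reconstructed values; dividing by the number of pixels, as many deep learning frameworks do for a mean-squared-error loss, would only rescale the values. The function names and the 2 × 2 toy frames are hypothetical.

```python
def frame_reconstruction_error(frame, reconstruction):
    """Sum of squared differences between a frame and its reconstruction."""
    return sum(
        (x - x_hat) ** 2
        for row, row_hat in zip(frame, reconstruction)
        for x, x_hat in zip(row, row_hat)
    )


def mean_frame_reconstruction_error(frames, reconstructions):
    """Average frame reconstruction error over all frames."""
    errors = [
        frame_reconstruction_error(f, r) for f, r in zip(frames, reconstructions)
    ]
    return sum(errors) / len(errors)


# Toy 2 x 2 frames standing in for 48 x 48 sliding windows.
frames = [
    [[0.0, 1.0], [0.5, 0.5]],
    [[1.0, 1.0], [0.0, 0.0]],
]
reconstructions = [
    [[0.0, 0.5], [0.5, 0.0]],   # two values off by 0.5 -> error 0.5
    [[1.0, 1.0], [0.0, 1.0]],   # one value off by 1.0  -> error 1.0
]

mfre = mean_frame_reconstruction_error(frames, reconstructions)
print(mfre)
```

At test time, the per-frame error returned by `frame_reconstruction_error` plays the role of the anomaly score for that window.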

V-B Evaluation metrics

For measuring the performance of our anomaly based IDS, we use the balanced accuracy (Balanced Acc), the false positive rate (FPR), and the false negative rate (FNR).

The Balanced Accuracy (Balanced Acc) is commonly used in binary classification, especially when dealing with imbalanced data. It is the arithmetic mean of the true positive rate (TPR) and the true negative rate (TNR):

$$\text{Balanced Acc} = \frac{TPR + TNR}{2} \qquad (7)$$

The True Positive Rate (TPR) measures the proportion of actual AVTP anomalies that the model correctly predicts as anomalies:

$$TPR = \frac{TP}{TP + FN} \qquad (8)$$

The True Negative Rate (TNR) measures the proportion of actual normal AVTP window frames that the model correctly predicts as normal:

$$TNR = \frac{TN}{TN + FP} \qquad (9)$$

The false positive rate (or "false alarm rate") represents, in our case, the ratio of normal AVTP window frames predicted as abnormal:

$$FPR = \frac{FP}{FP + TN} \qquad (10)$$

The false negative rate (or "miss rate") represents the ratio of abnormal AVTP window frames predicted as normal frames:

$$FNR = \frac{FN}{FN + TP} \qquad (11)$$

Where: TP = True Positive; FP = False Positive; TN = True Negative; FN = False Negative.

Hence, the model is considered to have strong predictive power in detecting cyberattacks on AVTP if its balanced accuracy is close to 1.0 and its false positive and false negative rates approach zero.
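The relationship between these metrics can be illustrated with a short computation. The sketch below derives all five quantities from confusion-matrix counts; the function name `detection_metrics` and the counts are hypothetical, chosen only so that the resulting rates match an FPR of 0.25 and an FNR of 0.008.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute TPR, TNR, FPR, FNR and balanced accuracy from confusion counts."""
    tpr = tp / (tp + fn)    # detected anomalies / all actual anomalies
    tnr = tn / (tn + fp)    # accepted normal frames / all actual normal frames
    return {
        "TPR": tpr,
        "TNR": tnr,
        "FPR": fp / (fp + tn),   # false alarm rate, equal to 1 - TNR
        "FNR": fn / (fn + tp),   # miss rate, equal to 1 - TPR
        "BalancedAcc": (tpr + tnr) / 2,
    }


# Hypothetical counts: 1000 anomalous windows (8 missed) and
# 1000 normal windows (250 false alarms).
metrics = detection_metrics(tp=992, fp=250, tn=750, fn=8)
print(metrics)
```

Since TPR = 1 - FNR and TNR = 1 - FPR, the balanced accuracy equals 1 - (FPR + FNR) / 2. An FPR of 0.25 together with an FNR of 0.008 therefore gives a balanced accuracy of about 0.87, regardless of how imbalanced the two classes are.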

Additionally, to determine the most suitable threshold for discriminating between normal and abnormal AVTP frames, multiple thresholds were tested, using the distribution of frame reconstruction errors on normal instances as a reference to identify the threshold that achieves the best performance. The tested thresholds are based on the mean frame reconstruction error ($\mu$) and $k$ standard deviations ($\sigma$) of the reconstruction error on normal examples, i.e., they take the form $\mu + k\sigma$. Three such thresholds were considered; their results are reported in the three CAE rows of Table IV.
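A minimal sketch of this thresholding scheme is given below, using only the Python standard library. The values of k, the use of the population standard deviation, the function names, and the error values are all assumptions made for illustration; they are not the settings used in our experiments.

```python
import statistics


def candidate_thresholds(normal_errors, ks=(1, 2, 3)):
    """Thresholds of the form mean + k * std of the errors on normal frames.

    The values in `ks` are hypothetical placeholders.
    """
    mu = statistics.mean(normal_errors)
    sigma = statistics.pstdev(normal_errors)   # population std (an assumption)
    return {k: mu + k * sigma for k in ks}


def classify(errors, threshold):
    """Label a frame 1 (abnormal) if its reconstruction error exceeds the threshold."""
    return [1 if e > threshold else 0 for e in errors]


# Made-up reconstruction errors of normal training frames and of test frames.
normal_errors = [1.0, 2.0, 3.0, 4.0, 5.0]
test_errors = [2.5, 4.5, 6.0, 9.0]

thresholds = candidate_thresholds(normal_errors)
predictions = {k: classify(test_errors, t) for k, t in thresholds.items()}
for k in thresholds:
    print(k, round(thresholds[k], 3), predictions[k])
```

A larger k raises the threshold, which lowers the false alarm rate at the cost of missing anomalies whose reconstruction error is only moderately elevated. This trade-off is why several thresholds are evaluated before one is chosen.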

VI Results

The CAE model proposed in this work is trained using Keras, a deep learning framework built on top of TensorFlow 2 [32]. All experiments were carried out on a dedicated GPU server with an NVIDIA Tesla V100S PCIe 32 GB GPU (Volta architecture) and 32 GB of RAM.

For anomaly detection, the threshold used to identify anomalies has a significant impact on model performance, including balanced accuracy, FPR, and FNR. In this paper, we tune the threshold to obtain the model's best performance. As shown in Table IV, the model's balanced accuracy, FPR, and FNR vary with the threshold. The model achieves its highest balanced accuracy (0.94 on average) and false positive and false negative rates close to zero across the different attack types with the threshold reported in the third CAE row of Table IV. In other words, our proposed model separates normal and abnormal AVTP packet frames well.

In addition, we compare our proposed CAE to other machine learning-based anomaly detection algorithms, namely the Isolation Forest [33] and the Local Outlier Factor [34]. We also compare it to a reproduced version of the CNN model proposed by Jeong et al. [25], which is trained using a supervised learning approach on the normal and abnormal packets of the "Indoor Injected 1" trace, through 3-fold cross-validation. Table IV shows that the Isolation Forest and Local Outlier Factor models, which are commonly used for anomaly detection, performed poorly at capturing the spatio-temporal information: they were unable to separate the classes well and had, on average, high false negative and false positive rates. The CNN model achieves significantly better performance and a reasonable balanced accuracy compared with these traditional algorithms. Moreover, our model achieved a gain of 0.12 (12 percentage points) in balanced accuracy over the CNN model in detecting the replay attack (0.85 versus 0.73), an attack that was not used for the CNN's training. Hence, our reconstruction-based model, trained using a semi-supervised approach, has proved efficient in detecting different and unknown cyberattacks in AVTP communication.

VII Conclusions and Future Work

Anomaly detection in in-vehicle network protocols, especially in automotive Ethernet, is a burgeoning area of study. With the development of realistic datasets that represent automotive cyberattacks on this protocol, we are able to develop anomaly-based detection models using deep learning and to evaluate them. In this paper, we proposed a CAE architecture that learns normal AVTP communication behavior in order to identify cyberattacks on this protocol. The numerical results show that our proposed model outperforms state-of-the-art machine learning-based models and excels at predicting unknown in-vehicle intrusions, with a balanced accuracy approaching 0.94 and low false positive and false negative rates for different attacks and packet traces. For future work, we aim to test the performance of diverse sequential models, such as LSTM and Transformer, on different types of cyberattacks applied to AVTP and in real-time scenarios.