A Lightweight, Efficient and Explainable-by-Design Convolutional Neural Network for Internet Traffic Classification

02/11/2022
by   Kevin Fauvel, et al.
HUAWEI Technologies Co., Ltd.

Traffic classification, i.e., the identification of the type of applications flowing in a network, is a strategic task for numerous activities (e.g., intrusion detection, routing). This task faces some critical challenges that current deep learning approaches do not address. The design of current approaches does not take into consideration the fact that networking hardware (e.g., routers) often runs with limited computational resources. Further, they do not meet the need for faithful explainability highlighted by regulatory bodies. Finally, these traffic classifiers are evaluated on small datasets which fail to reflect the diversity of applications in real commercial settings. Therefore, this paper introduces a Lightweight, Efficient and eXplainable-by-design convolutional neural network (LEXNet) for Internet traffic classification, which relies on a new residual block (for lightweight and efficiency purposes) and a new prototype layer (for explainability). Based on a commercial-grade dataset, our evaluation shows that LEXNet maintains the same accuracy as the best performing state-of-the-art neural network, while providing the additional features previously mentioned. Moreover, we demonstrate that LEXNet significantly reduces the model size and inference time compared to the state-of-the-art neural networks with explainability-by-design and post hoc explainability methods. Finally, we illustrate the explainability feature of our approach, which stems from the communication of detected application prototypes to the end-user, and we highlight the faithfulness of LEXNet explanations through a comparison with post hoc methods.


1. Introduction

Traffic classification, i.e., the identification of the type of applications flowing in a network (Dainotti12), is a strategic building block in modern communication networks (Aceto20; Boutaba18). It is crucial for key traffic management tasks like intrusion detection, quality of experience monitoring and application aware routing.

The sharp rise of encrypted traffic in the last decade hampered traditional rule-based techniques, pushing the adoption of machine learning-assisted classification (pacheco18comst) to supplant deep packet inspection (Cisco18; Huawei21). Inspired by seminal work (crotti07ccr), a recent wave (Aceto19mimetic; Aceto20; Beliard20; Liu19; Lotfollahi20; Nascita21; Rezaei20; Wang20) of deep learning models (CNNs or recurrent architectures like Long Short-Term Memory networks, LSTMs) successfully tackles the encryption challenge by just leveraging the size and direction of the first few packets in a flow (crotti07ccr).

However, these studies still face some challenges. First, due to the widespread use of mobile devices and the vast diversity of mobile applications, large-scale encrypted traffic classification becomes increasingly difficult (Aceto19; Aceto20). At the same time, academic models are generally tested with a few tens of classes, far fewer than in commercial settings (hundreds to thousands of classes) (Cisco18; Huawei21). Second, even though they focus on relatively small datasets, academic models may still use a disproportionate amount of resources (Yang21) (e.g., models with millions of weights). This clashes with the need for near real-time classification (e.g., due to latency-sensitive traffic) on the one hand, and the limited computational resources (Dias19) available on network devices (e.g., routers) on the other hand. Last but not least, regulatory and standardization bodies highlight faithful explainability as a pillar for accountability, responsibility, and transparency of processes including AI components (Dignum17; EU21; Phillips21), which could prevent the deployment of the latest machine learning techniques that do not have this feature. Faithfulness is critical as it corresponds to the level of trust an end-user can have in the explanations of model predictions, i.e., the level of relatedness of the explanations to what the model actually computes (Fauvel20). In the context of traffic classification, a faithfully explainable model (explainable-by-design) would be beneficial from both a compliance and a business standpoint (e.g., SLAs). Nonetheless, the faithfulness of such a model should not negatively impact the prediction performance, nor come at the cost of increased computational and memory complexity. A few models in traffic classification (Beliard20; Nascita21; Rezaei20) account for explainability, but they cannot provide faithful explainability as they rely on post hoc model-agnostic explainability methods (Rudin19).

With the above constraints in mind, we therefore propose a lightweight, efficient and explainable-by-design CNN for traffic classification, and evaluate it on a commercial-grade dataset.

In order to integrate explainability into the design of our architecture, we considered this feature from the first stage of its development. Some recent studies propose explainable-by-design CNN approaches (Zhang18; Chen19; Elsayed19; Chen20). In particular, ProtoPNet (Chen19) presents an approach that aligns with the way humans describe their own thinking in classification, by focusing on parts of the input data and comparing them with prototypical parts of the data from a given class. This form of explanation is particularly interesting for network experts as class-specific prototypes can be used to form application signatures. Nonetheless, ProtoPNet has been designed for deep architectures, without considering the model size and efficiency. Therefore, we adopted a prototype-based CNN and reduced the number of weights and inference time compared to ProtoPNet, while increasing the accuracy.

ProtoPNet consists of a regular CNN backbone which extracts discriminative features, then a prototype block computes similarities to the learned class-specific prototypes, and finally the classification is performed based on these similarity scores. Firstly, concerning the backbone, multiple works on lightweight and efficient CNN architectures have been published recently (Ma18; Howard19; Tan19; Han20; Yang21CN2). Most of these studies optimize a proxy metric to measure efficiency (number of floating-point operations, FLOPs), which is not always equivalent to the inference time (Ma18). The CNN ResNet (He16) is usually faster than its recent competitors, offering a better accuracy/inference time trade-off (Ridnik21) (confirmed by our experiments in Section 5.1). Thus, we selected ResNet as a starting point for the CNN backbone of our prototype-based network, and worked to reduce its number of weights and inference time. The prediction performance of ResNet is supported by an increase in the number of channels along the depth of the network to learn elaborate features, which implies residual blocks with a higher number of output channels than input channels. A guideline from (Ma18) states that efficient convolutions should keep an equal channel width to minimize memory access cost and inference time. Therefore, (i) we present a new residual block computing solely convolutions with equal channel width, and generating the additional feature maps by cost-efficient operations (new shortcut connection and linear combinations of feature maps). This new block reduces the number of weights and inference time compared to the original residual block, while preserving network accuracy. Secondly, the prototype block of ProtoPNet has been designed to process features generated by deep CNNs. Thus, we have revisited this block and (ii) propose a new prototype layer which preserves the explainability-by-design and significantly reduces the number of weights and inference time of our network compared to ProtoPNet, while increasing the accuracy.

Summarizing our main contributions:

  • We present a new Lightweight, Efficient and eXplainable-by-design CNN (LEXNet) for encrypted traffic classification, which relies on a new residual block and prototype layer;

  • Based on a real-world commercial-grade dataset, we show that LEXNet is more accurate than the current state-of-the-art explainable-by-design CNN (ProtoPNet (Chen19)) in traffic classification, and maintains the same accuracy as the best performing state-of-the-art deep learning approach (ResNet (He16)) that does not provide faithful explanations;

  • We demonstrate that LEXNet significantly reduces the model size and inference time compared to the current state-of-the-art neural networks with explainability-by-design and post hoc explainability methods;

  • We illustrate the explainability of our approach which stems from the communication of detected application/class prototypes to the end-user. Moreover, we highlight the faithfulness of LEXNet explanations by comparing them to the ones provided by state-of-the-art post hoc explainability methods.

2. Related Work

Traffic classification can be formulated as a Multivariate Time Series (MTS) classification task, where the input is packet-level data (e.g., packet size, direction) related to a single flow, and the output is an application label. Here, a packet is a unit of data routed on the network, and a flow is a set of packets that share the same 5-tuple, i.e., source IP, source port, destination IP, destination port, and transport-layer protocol (Valenti13). The state-of-the-art approaches (Aceto19mimetic; Beliard20; Liu19; Lotfollahi20; Nascita21; Rezaei20; Wang20) adopt deep learning methods (vanilla CNN and/or LSTM) with heavyweight architectures (from 1M to 2M weights), and do not discuss the impact of their model size on the accuracy and inference time. Some of them (Beliard20; Nascita21; Rezaei20) propose classifiers along with an explainability method to support their predictions. However, the post hoc explainability methods employed, such as SHAP (Lundberg17), cannot provide perfectly faithful explanations with respect to the original model (Rudin19). Finally, the results of these studies are not directly comparable as they are evaluated on different datasets. Moreover, these datasets (mostly private, from 8 to 80 applications and from 8k to 950k flows) fail to reflect the diversity of applications in real commercial settings.
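As an illustration of this formulation, the following minimal sketch groups packets into flows and turns each flow into a fixed-length series of (packet size, direction) pairs. The packet field names, the bidirectional flow key, the +1/-1 direction encoding relative to the first sender, and the zero padding are all assumptions made for the example; they are not taken from the paper's preprocessing.

```python
from collections import defaultdict


def flow_key(pkt):
    """Bidirectional 5-tuple key: both directions of a connection map to one flow (assumption)."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    lo, hi = sorted([a, b])
    return (lo, hi, pkt["proto"])


def build_flow_series(packets, max_packets=20):
    """Return {flow_key: [(size, direction), ...]} with exactly max_packets entries per flow."""
    first_sender = {}
    series = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt)
        sender = (pkt["src_ip"], pkt["src_port"])
        first_sender.setdefault(key, sender)
        if len(series[key]) < max_packets:
            # +1 if the packet travels in the direction of the flow's first packet, else -1
            direction = 1 if sender == first_sender[key] else -1
            series[key].append((pkt["size"], direction))
    # Pad short flows with (0, 0) so every series has the same length
    return {k: v + [(0, 0)] * (max_packets - len(v)) for k, v in series.items()}


# Hypothetical packets, for illustration only
packets = [
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "1.1.1.1", "dst_port": 443, "proto": "TCP", "size": 60},
    {"src_ip": "1.1.1.1", "src_port": 443, "dst_ip": "10.0.0.1", "dst_port": 5000, "proto": "TCP", "size": 1500},
    {"src_ip": "10.0.0.2", "src_port": 6000, "dst_ip": "8.8.8.8", "dst_port": 53, "proto": "UDP", "size": 80},
]
flows = build_flow_series(packets)
```

Each value in `flows` is a 20 × 2 series, which matches the input shape described in Section 4.1.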

Consequently, an accurate, lightweight, efficient and explainable-by-design approach for traffic classification evaluated on a public commercial-grade dataset is necessary. We identified CNNs as having the potential to fulfill these needs. In the next sections, we present the corresponding machine learning state-of-the-art (MTS classification, explainability, lightweight and efficient architectures) on which we position our approach.

2.1. MTS Classification

The state-of-the-art MTS classifiers are composed of three categories: similarity-based, feature-based and deep learning methods. First, similarity-based methods make use of similarity measures (e.g., Dynamic Time Warping (Seto15)) to compare two MTS. Feature-based methods include shapelets (Wistuba15; Karlsson16) and bag-of-words (Baydogan14; Baydogan16; Tuncel18; Schafer17) models: the former use subsequences (shapelets) to transform the original time series into a lower dimensional space that is easier to classify, while the latter convert time series into a bag of discrete words, and use a histogram of words representation to perform the classification. Finally, deep learning methods (Karim19; Zhang20; Fauvel21) are based on CNNs and/or LSTMs.

The results published show that the top two most accurate MTS classifiers on average on the public UEA archive (Bagnall18) are deep learning methods (top-1 XCM (Fauvel21), top-2 MLSTM-FCN (Karim19)). XCM extracts discriminative features related to observed variables and time directly from the input data using 2D and 1D convolution filters, while MLSTM-FCN stacks an LSTM layer and a 1D CNN layer along with squeeze-and-excitation blocks to generate latent features. Therefore, in this work, we choose to benchmark LEXNet against these two MTS classifiers. In addition, we integrate in our benchmark the commonly adopted state-of-the-art CNNs in computer vision (ResNet (He16), DenseNet (Huang17)), and the vanilla 1D CNN widely used in traffic classification studies.

2.2. Explainability

Explainability has recently emerged as a crucial feature for the practical deployment of machine learning models. In particular, machine learning methods have to be assessed based on the extent to which they can supply their decisions with explanations that reflect what the model actually computes (faithful explainability) (Alvarez18).

Explainability methods fall into several categories (Du20): explainability-by-design, post hoc model-specific explainability and post hoc model-agnostic explainability. Post hoc methods are the most popular ones for deep learning models. Post hoc model-specific explainability methods are designed for a particular model (e.g., the saliency method Grad-CAM (Selvaraju17) for CNNs), whereas post hoc model-agnostic explainability methods provide explanations for any machine learning model (e.g., the explainable surrogate model SHAP (Lundberg17)). Some recent studies (Rudin19; Chen20) show that post hoc explainability methods face a faithfulness issue, and suggest the development of self-explanatory models that incorporate explainability directly into their structures (explainability-by-design).

Lately, several works have proposed explainable-by-design CNN approaches (Zhang18; Chen19; Elsayed19; Chen20). Two of these approaches (Zhang18; Elsayed19) provide the explainability-by-design feature but at the cost of some lag behind the state-of-the-art CNNs in terms of accuracy, which remains a prerequisite for our task. The first one (Zhang18) presents a loss to push a filter in high convolution layers toward the representation of an object for interpretation, and the second one (Elsayed19) provides relevant portions of the data supporting predictions with a novel hard attention model as a pretraining step. Then, (Chen20) presents an approach combining accuracy and explainability-by-design: Concept Whitening. It is an alternative to a batch normalization layer which decorrelates the latent space and aligns its axes with known concepts of interest. However, this approach requires that the concepts/applications are completely decorrelated. It is therefore not suited for our traffic classification task, as some applications can be correlated (e.g., different applications from the same publisher). Finally, another approach combines accuracy with explainability-by-design: ProtoPNet (Chen19). It consists of a CNN backbone to extract discriminative features, then a prototype block that computes similarities to the learned class-specific prototypes as a basis for classification. This approach is particularly interesting for our task as the class-specific prototype explanations can be used to form application signatures for network experts.

Therefore, with the objective of combining accuracy and explainability-by-design suited for traffic classification, we adopted a prototype-based CNN approach. Then, we worked to reduce the number of weights and inference time of the approach while maintaining the accuracy, in particular with regard to the CNN backbone.

2.3. Lightweight and Efficient Architectures

Multiple studies about lightweight and efficient CNN architectures have been published over the last years (Ma18; Howard19; Tan19; Han20; Yang21CN2). These studies introduce different mechanisms. First, ShuffleNetV2 (Ma18) follows four practical guidelines: keep equal channel width, limit the number of group convolutions, and reduce the degree of network fragmentation and element-wise operations. Then, EfficientNet (Tan19) presents a scaling method that uniformly scales all dimensions of depth/width/resolution using a compound coefficient. Next, MobileNetV3 (Howard19) combines automated network search techniques and optimized nonlinearities on an architecture based on inverted residual blocks. Afterward, GhostNet (Han20) introduces a novel Ghost module based on linear transformations, and CondenseNetV2 (Yang21CN2) relies on a new Sparse Feature Reactivation module which reuses a set of most important features from preceding layers.

These CNNs are evaluated on the same public dataset (ImageNet). However, most of these evaluations optimize a proxy metric to measure efficiency (number of floating-point operations, FLOPs), which is not always equivalent to the direct metric we care about, the inference time (Ma18). Using FLOPs as the reference, these studies do not evaluate the accuracy and inference time on a comparable model size (number of weights). A recent work (Ridnik21) shows that the ResNet CNN architecture is usually faster than its latest competitors, offering a better accuracy/inference time trade-off (which is also confirmed by our experiments in Section 5.1). Therefore, we adopted ResNet as CNN backbone and revisited its residual block to make it lighter and more efficient (see Section 3). We include ShuffleNetV2, EfficientNet, MobileNetV3, GhostNet and CondenseNetV2 in our benchmark, and perform a comparison of the accuracy and inference time based on a comparable model size in Section 5.1.
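To make the FLOPs proxy concrete, the sketch below counts multiply-adds for a standard 3 × 3 convolution and for a depthwise plus 1 × 1 pointwise pair on a 20 × 2 feature map with 32 channels. The layer sizes are illustrative choices, not configurations from the benchmark, and the count deliberately ignores memory access cost, which is exactly what the proxy fails to capture.

```python
def conv_macs(h, w, c_in, c_out, k, groups=1):
    """Multiply-adds of a stride-1 convolution that preserves the h x w spatial size."""
    return h * w * c_out * (c_in // groups) * k * k


h, w, c = 20, 2, 32  # illustrative feature-map size and channel width

# Standard 3x3 convolution, 32 -> 32 channels
standard = conv_macs(h, w, c, c, 3)

# Depthwise 3x3 (one filter per channel) followed by a pointwise 1x1 convolution
separable = conv_macs(h, w, c, c, 3, groups=c) + conv_macs(h, w, c, c, 1)

print(standard, separable, round(standard / separable, 2))
```

The separable variant needs roughly seven times fewer multiply-adds here, yet it runs two layers instead of one and reads and writes the feature map twice, so a lower count does not guarantee a lower inference time.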

3. Algorithm

We now propose our new Lightweight, Efficient and eXplainable-by-design CNN classifier (LEXNet), which inherits its explainability from ProtoPNet (Chen19). First, we present the initial design of ProtoPNet. Then, we explain how LEXNet, with the introduction of a new residual block and prototype layer, reduces the number of weights and inference time of this network, while preserving the explainability-by-design. Our evaluation in Section 5 shows that LEXNet also outperforms ProtoPNet in terms of accuracy.

Figure 1. The prototype block of ProtoPNet (Chen19) and the new lightweight prototype layer (LProto) of LEXNet. The dimensions and number of classes correspond to the ones from the experiments in Section 5. Sim. M. - similarity matrix.

3.1. ProtoPNet

ProtoPNet integrates in its design a form of explainability that agrees with the way humans describe their own thinking in classification tasks: it focuses on parts of the input data and compares them with prototypical parts of the data from a given class (illustrated in Section 5.2). The network consists of a regular CNN as backbone to extract features, followed by a prototype block and a fully connected layer. Given an input sample, the convolutional layers of the model extract useful features for prediction. Then, the prototype block learns class-specific prototypes, with the number and size of prototypes per class set as hyperparameters of the algorithm. The left side of Figure 1 illustrates the prototype block of ProtoPNet with the hyperparameter setting of our experiments (detailed in Section 5): two prototypes of size (1, 1) per class with 203 classes in the dataset. Each prototype in the block (e.g., size of each prototype in Figure 1: 1 × 1 × 128) computes the L2 distances with all patches of the last convolutional layer (e.g., size of the last layer in Figure 1: 20 × 2 × 128), and inverts the distances into similarity scores. The result is an activation map of similarity scores (e.g., size of a similarity matrix in Figure 1: 20 × 2) whose value indicates how strong the presence of a prototypical part is in the input sample. This activation map preserves the spatial relation of the convolutional output, and can be upsampled to the size of the input sample to produce a heatmap that identifies the part of the input sample most similar to the learned prototype. To make sure that the learned class-specific prototypes correspond to training sample patches, the prototypes are projected onto the nearest latent training patch from the same class during the training of the model. The activation map of similarity scores produced by each prototype is then reduced using global max pooling to a single similarity score, which indicates how strong the presence of a prototypical part is in some patch of the input sample. Finally, the classification is performed with a fully connected layer based on these similarity scores.
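The distance-to-similarity step and the global max pooling can be sketched in plain Python for a single (1, 1) prototype. The log-ratio used to invert distances, log((d² + 1) / (d² + ε)) on the squared L2 distance, is an assumption based on the ProtoPNet paper and is not stated in this text; the tiny 2 × 2 feature map and the function names are hypothetical.

```python
import math


def similarity_map(features, prototype, eps=1e-4):
    """features: H x W grid of D-dimensional vectors; prototype: one D-dimensional vector.

    Returns an H x W map of similarity scores (higher means closer to the prototype).
    """
    sim = []
    for row in features:
        sim_row = []
        for patch in row:
            d2 = sum((f - p) ** 2 for f, p in zip(patch, prototype))  # squared L2 distance
            sim_row.append(math.log((d2 + 1.0) / (d2 + eps)))  # assumed inversion
        sim.append(sim_row)
    return sim


def global_max_pool(sim):
    """Reduce a similarity map to the score of its best-matching patch."""
    return max(max(row) for row in sim)


# Hypothetical 2 x 2 feature map with 2 channels; patch (0, 1) equals the prototype
features = [[[0.0, 0.0], [1.0, 0.0]],
            [[0.5, 0.5], [3.0, 4.0]]]
prototype = [1.0, 0.0]

sim = similarity_map(features, prototype)
score = global_max_pool(sim)
```

The position of the maximum in `sim` is what the explanation points to, while `score` is the single value passed on to the fully connected layer.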

3.2. LEXNet 

LEXNet relies on the same pipeline as ProtoPNet: a CNN backbone for feature extraction, a prototype block for explainability-by-design, and a fully connected layer for classification. We select ResNet (He16) for the CNN backbone as it obtains the best accuracy on our traffic classification dataset (see results in Section 5.1). However, ResNet exhibits only the second best inference time, an aspect which is critical for our application. Therefore, our first contribution is to introduce a new lightweight and efficient residual block which reduces the number of weights and inference time of the original residual block, while preserving the accuracy of ResNet. To further reduce the size of LEXNet and improve its accuracy, we propose as a second contribution to replace the prototype block of ProtoPNet with a new lightweight prototype layer. We present these two contributions in the next sections, and end with the overall architecture of our network.

3.2.1. Lightweight and Efficient Residual Block

ResNet is a state-of-the-art CNN. It is composed of consecutive residual blocks which consist of two convolutions with a shortcut added to link the output of a block to its input in order to reduce the vanishing gradient effect (Zagoruyko16). When the number of output channels equals the number of input channels, the input data is directly added to the output of the block with the shortcut. Nonetheless, the prediction performance of ResNet is supported by an increase in the number of channels along the depth of the network to learn elaborate features, which implies residual blocks with a higher number (usually twice) of output channels than input channels. In this case, before performing the addition, another convolution is applied to the input data in order to match the dimensions of the block output. A residual block is illustrated on the left side of Figure 2 with n input channels and 2n output channels. We can see in this figure that two convolutions have a different number of output channels than input channels (Convolutions 1 and 3). However, a guideline from (Ma18) (ShuffleNetV2) states that efficient convolutions should keep an equal channel width to minimize memory access cost and inference time. Therefore, we propose a new residual block that only performs convolutions with an equal channel width, and in which the additional feature maps are generated with cheap operations (see block on the right side of Figure 2).

The first operation concerns the feature maps. The authors of GhostNet (Han20) state that CNNs with good prediction performance have redundancy in feature maps, and that substituting some of the feature maps with a linear transformation of the same convolution output does not impact the prediction performance of the network. Therefore, we propose to double the number of feature maps from the first convolution in a cost-efficient way using a series of linear transformations of its output (linear kernel). These new feature maps are concatenated to the ones from the first convolution output to form the input of the second convolution. Thus, the first and second convolutions in our new residual block each have an equal channel width (n and 2n respectively, see Figure 2). We have also considered having both convolutions 1 and 2 with an equal channel width n, and moving the concatenate operation with the feature maps from the linear transformation after convolution 2. However, our experiments showed that this configuration leads to a decrease in accuracy without a noticeable drop in inference time.

The second operation concerns the shortcut link. Instead of converting the input data to the block output dimensions with a third convolution (see Res block in Figure 2), we propose to save this convolution by keeping the input data and concatenating it with the output from the first convolution. The experiments on the traffic classification dataset show that our new residual block LERes allows ResNet to reduce the number of weights of the backbone by 19.1% and the CPU inference time by 41.3% compared to ResNet with the original residual block, while maintaining its accuracy (99.3% of the original accuracy); see details in Section 5.1.
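A rough per-block weight count shows where the savings come from. The derivation below assumes 3 × 3 kernels without bias for convolutions 1 and 2, a 1 × 1 convolution for the original shortcut (as in common ResNet implementations), and a 3 × 3 depthwise kernel for the linear transformations; batch normalization parameters are left out. These kernel sizes are assumptions, not values stated in the text.

```latex
\begin{align*}
P_{\text{Res}}(n)   &= \underbrace{9 \cdot n \cdot 2n}_{\text{conv 1}}
                     + \underbrace{9 \cdot 2n \cdot 2n}_{\text{conv 2}}
                     + \underbrace{n \cdot 2n}_{\text{conv 3 (shortcut)}}
                     = 56\,n^2, \\
P_{\text{LERes}}(n) &= \underbrace{9 \cdot n \cdot n}_{\text{conv 1}}
                     + \underbrace{9\,n}_{\text{linear maps}}
                     + \underbrace{9 \cdot 2n \cdot 2n}_{\text{conv 2}}
                     = 45\,n^2 + 9\,n, \\
1 - \frac{P_{\text{LERes}}(n)}{P_{\text{Res}}(n)}
                    &= \frac{11\,n^2 - 9\,n}{56\,n^2}
                     \;\xrightarrow[n \to \infty]{}\; \frac{11}{56} \approx 19.6\%.
\end{align*}
```

Under these assumptions a single widening block saves about 17.6% of its convolution weights for n = 8 and about 18.6% for n = 16. This is a per-block figure only; the 19.1% reported above is the authors' measurement over the whole backbone and is not derived here.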

Figure 2. ResNet (He16) original residual block (Res) and our new lightweight and efficient residual block (LERes) with n input channels and 2n output channels.

3.2.2. Lightweight Prototype Layer

Next, we adopt a new lightweight prototype layer instead of the prototype block of ProtoPNet. As presented in Section 3.1 and illustrated in Figure 1, the prototype block of ProtoPNet adds two convolutional layers with a large number of channels (recommended setting: 128) to the CNN backbone. Then, it connects the prototype layer with a Sigmoid activation function. Finally, the prototypes have the same depth as the last convolutional layer (e.g., 128) for the computation of the similarity matrices with the L2 distance.

We propose to remove the two additional convolutional layers (see right block of Figure 1). As a consequence, the prototypes have a much smaller depth (same as CNN backbone output: 32 versus 128). Thus, our new prototype layer allows us to reduce the number of weights by 29.4% compared to the original prototype block of ProtoPNet, and implies a 17.0% reduction in CPU inference time (see details in Section 5.1). Then, we replace the last activation function from the CNN backbone with a Sigmoid. As suggested in (Sandler18) and confirmed by our experiments (see Section 5.1), replacing the ReLU activation function from the last layer of our CNN backbone with a Sigmoid improves the accuracy of our network. Finally, we further improve the accuracy of our network by adding an L2 regularization on the weights of the prototypes in order to enhance its generalization ability. Considering the limited number of prototypes, and supported by the results from our experiments, we have selected an L2 regularization over an L1 sparse solution.

3.2.3. Network Architecture

The overall architecture of LEXNet is presented in Table 1. The CNN backbone consists of an initial convolutional layer with 8 filters, followed by a sequence of four new LERes blocks which ends with 32 filters. Then, our new prototype layer LProto computes the similarity matrices. Finally, the similarity scores from the Max Pooling operation are classified with the fully connected layer. In order to limit the size of our network, the number of LERes blocks has been determined by cross-validation on the training set. Thus, additional LERes blocks would not increase the accuracy of the network on our traffic classification dataset. A higher number of filters would not increase the accuracy either. Concerning the explanations supporting network predictions, as presented in Section 3.1, activation maps from the prototype layer can be upsampled to the size of the input sample to produce a heatmap that identifies the part of the input sample most similar to the learned prototype. It has been shown in (Fauvel21) that applying upsampling processes to match the size of the input sample can affect the precision of the explanations. Therefore, to preserve the precision of our explanations, we keep the feature map dimensions over the network the same as the input sample dimensions (20 × 2, detailed in Section 4.1 Dataset) using fully padded convolutions.
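The effect of full padding on the 20 × 2 input can be checked with the usual convolution output-size formula. The sketch assumes a 3 × 3 kernel (as listed for the first layer in Table 1), stride 1 and one pixel of zero padding on each side; the padding value is an assumption, since the text only says the convolutions are fully padded.

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Output length along one axis: floor((size + 2 * padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# With one pixel of padding, both axes of the 20 x 2 input are preserved
padded = (conv_out_size(20), conv_out_size(2))

# Without padding, the time axis shrinks and the 2-wide variable axis collapses
unpadded = (conv_out_size(20, padding=0), conv_out_size(2, padding=0))

print(padded, unpadded)
```

Because the variable axis is only 2 wide, an unpadded 3 × 3 kernel would not fit at all, so padding is needed both for the explanations and for the convolution to be defined along that axis.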

Input Dimensions   Operator         Stride   # Out   # Cumul Params
1 × 20 × 2         Conv 3x3 + BN    1        8       88
8 × 20 × 2         LERes Block      1        16      3,088
16 × 20 × 2        LERes Block      1        16      7,760
16 × 20 × 2        LERes Block      1        32      19,520
32 × 20 × 2        LERes Block      1        32      38,080
32 × 20 × 2        LProto Layer     -        406     51,072
406 × 20 × 2       Max Pooling      -        406     51,072
406 × 1 × 1        FC               -        203     133,490

# Out: number of output channels
# Cumul Params: cumulative number of trainable parameters
Table 1. Overall architecture of LEXNet.
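The cumulative counts in Table 1 can be reproduced with a short calculation. The layer decomposition below is an inferred assumption, not the authors' code: 3 × 3 convolutions without bias, batch normalization with two parameters per channel after each convolution, a 3 × 3 depthwise kernel for the linear transformations in widening blocks, 406 prototypes of depth 32, and a fully connected layer without bias. It happens to match every row of the table, but the actual implementation may be organized differently.

```python
from itertools import accumulate


def conv(c_in, c_out, k=3, groups=1):
    """Weights of a k x k convolution without bias."""
    return k * k * (c_in // groups) * c_out


def bn(c):
    """Batch normalization: one scale and one shift per channel."""
    return 2 * c


def leres_block(c_in, c_out):
    """Assumed LERes parameter count; widening blocks must have c_out == 2 * c_in."""
    if c_in == c_out:
        # Same width in and out: two convolutions, parameter-free shortcut
        return conv(c_in, c_out) + bn(c_out) + conv(c_out, c_out) + bn(c_out)
    # Widening block: conv 1 keeps c_in channels, depthwise linear maps add c_in more
    return (conv(c_in, c_in) + bn(c_in)
            + conv(c_in, c_in, groups=c_in)
            + conv(c_out, c_out) + bn(c_out))


layers = [
    conv(1, 8) + bn(8),   # Conv 3x3 + BN
    leres_block(8, 16),
    leres_block(16, 16),
    leres_block(16, 32),
    leres_block(32, 32),
    406 * 32,             # LProto: 2 prototypes x 203 classes, each 1 x 1 x 32
    0,                    # Max Pooling has no parameters
    406 * 203,            # FC without bias
]
cumulative = list(accumulate(layers))
print(cumulative)
```

The prototype layer accounts for 12,992 weights and the fully connected layer for 82,418, so the classifier head holds most of the 133,490 parameters in this configuration.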

4. Evaluation Setting

In this section, we present the methodology employed (dataset, algorithms, hyperparameters and configuration) to evaluate our approach.

4.1. Dataset

Our dataset contains 7.9M flows/MTS belonging to 203 classes (169 TCP and 34 UDP, about 200 applications). The MTS have a length of 20 (20 packets) and 2 variables (packet size and direction). An anonymized version of our dataset is under preparation for publication (cf. Appendix). Table 2 presents its structure. We observe a class imbalance in the dataset as the top 10 applications represent more than 40% of the flows. This is characteristic of the traffic classification task and we show in Section 5.1 that our classifier is robust to this class imbalance. Moreover, we can observe that there is no distortion between the proportion of TCP applications (83%) and TCP flows (84%).

Classes   Flows (M)   Total Flows %   TCP %
10        3.3         42              75
20        4.8         60              80
50        6.5         82              83
100       7.5         95              84
203       7.9         100             84

Table 2. Composition of our dataset. Applications are presented in descending order of their popularity.

4.2. Algorithms

We compare our algorithm LEXNet to the state-of-the-art classifiers presented in Section 2. All the networks have been run with the parameter settings recommended by the authors in the original papers, and the number of layers for each network has been set in order to obtain comparable model sizes (around 300k trainable parameters). The number of trainable parameters is set at the level at which the best performing network (ResNet - see Section 5.1) does not show an increase in accuracy through the addition of new layers.

Specifically, we used the authors' implementations of CondenseNetV2 (Yang21CN2), MLSTM-FCN (Karim19), ProtoPNet (Chen19) and XCM (Fauvel21). Moreover, we used the PyTorch Hub (PyTorchHub21)/Torchvision (Torchvision21) implementations of DenseNet (Huang17), EfficientNet (Tan19), GhostNet (Han20), MobileNetV3 (Howard19), ResNet (He16), and ShuffleNetV2 (Ma18). Finally, we have implemented with PyTorch in Python 3.6 the Vanilla 1D CNN, and LEXNet using the public implementations of ProtoPNet (Chen19) and ResNet (Torchvision21).

Concerning the post hoc explainability methods, we used the public implementations available for Grad-CAM (Selvaraju17) and SHAP (Lundberg17).

4.3. Hyperparameters

Adopting a conservative approach, we performed a 50% train/50% test split of the dataset. Hyperparameters have been set by grid search based on the best average accuracy following a stratified 5-fold cross-validation on the training set.
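The evaluation protocol can be sketched as two stratified partitions: a 50/50 split into train and test, followed by 5 folds inside the training half. The helper below is a minimal, dependency-free illustration; the function name, the seeds and the toy labels are hypothetical, and the authors' actual tooling is not described in the text.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign a fold index to every sample so each class is spread evenly over the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % n_folds  # round-robin within the class
    return fold_of


# Toy, imbalanced label set (hypothetical application names)
labels = ["web"] * 10 + ["video"] * 6 + ["voip"] * 4

# 50% train / 50% test: fold 0 is the training half, fold 1 the test half
split = stratified_folds(labels, 2)
train_idx = [i for i, f in enumerate(split) if f == 0]
train_labels = [labels[i] for i in train_idx]

# Stratified 5-fold cross-validation on the training half only
cv_folds = stratified_folds(train_labels, 5, seed=1)
```

Grid search would then train on four of the five folds and validate on the remaining one for each hyperparameter setting, keeping the test half untouched until the final evaluation.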

As recommended by the authors, we tuned the number of LSTM cells for MLSTM-FCN and the size of the window for XCM. Considering the size of our input data (20 × 2), we also tuned the hyperparameters of ProtoPNet and LEXNet (number and size of the prototypes). For the other architectures, we use the default parameters suggested by the authors.

4.4. Configuration

All the models have been trained for 1000 epochs with a batch size of 1024 on the following computing infrastructure: Ubuntu 20.04 operating system, NVIDIA Tesla V100 GPU with 16GB HBM2. Given the limited computational resources of network devices, we also evaluate the models without GPU acceleration on an Intel Xeon Platinum 8164 CPU (2.00GHz, 71.5MB L3 Cache).

5. Results and Discussions

In this section, we first present the performance results of LEXNet on the traffic classification task. Then, we illustrate the explainability of our approach and highlight the faithfulness of LEXNet explanations through a comparison with post hoc explainability methods. Finally, we end this section with a discussion on the cost of the explainability-by-design.

Metric | Vanilla CNN | ResNet | MobileNetV3 | ShuffleNetV2 | CondenseNetV2 | DenseNet | MLSTM-FCN | XCM | GhostNet | EfficientNet
Accuracy (%) | 79.1 | 87.7 | 67.8 | 84.5 | 85.4 | 86.9 | 77.2 | 83.6 | 85.7 | 64.0
Inference GPU (/sample) | 1.4 | 1.6 | 2.2 | 2.5 | 3.1 | 3.3 | 4.6 | 8.1 | 11.5 | 13.3
Inference CPU (/sample) | 26.6 | 34.6 | 48.6 | 69.8 | 80.3 | 91.8 | 186.3 | 355 | 473.8 | 610.8
Number of Trainable Params (k) | 274 | 307 | 312 | 309 | 331 | 330 | 396 | 317 | 324 | 312
FLOPs - Multiply-Adds (M) | 0.8 | 2.1 | 0.7 | 0.5 | 0.4 | 8.3 | 3.6 | 1.3 | 6.9 | 2.2
Table 3. Accuracy and inference time of the state-of-the-art CNNs on the TCP+UDP test set.

5.1. Classification

CNN Backbone. First, in order to determine the backbone of LEXNet, we compare the state-of-the-art classifiers on the traffic classification dataset. Table 3 presents the accuracy and inference time of the classifiers (with comparable model sizes of around 300k weights/trainable parameters). As presented in Section 2, this approach already allows us to reduce the model size by a factor of about five compared to the current state-of-the-art traffic classifiers (1-2M trainable parameters) while maintaining the same level of accuracy. We observe that the model obtaining the best accuracy on our dataset is ResNet (87.7%). Moreover, ResNet exhibits the best accuracy/inference time trade-off: it ranks second for inference time on both GPU and CPU. All the state-of-the-art efficient CNNs exhibit a higher inference time than ResNet. In particular, some efficient CNNs with very low FLOPs (e.g., CondenseNetV2: 0.4M) compared to ResNet (2.1M) have a much higher inference time (e.g., CondenseNetV2: GPU 3.1/CPU 80.3 versus GPU 1.6/CPU 34.6). The extensive use of operations that reduce the number of FLOPs (e.g., depthwise and 1x1 convolutions) does not translate into a reduction in inference time due to factors like memory access cost (Ridnik21). Furthermore, when compared at the same number of trainable parameters, some state-of-the-art efficient CNNs have both a higher number of FLOPs and a higher inference time than ResNet (e.g., EfficientNet). Thus, our experiments show that a reduction in FLOPs does not always translate into an equivalent reduction in inference time, and they emphasize the value of comparing models at the same model size.

Nonetheless, one model is faster than ResNet: the Vanilla CNN (GPU 1.4/CPU 26.6 versus GPU 1.6/CPU 34.6). The absence of residual connections in the Vanilla CNN can explain this lower inference time. Therefore, we have selected ResNet for our CNN backbone and worked to reduce its number of weights and inference time, while maintaining its accuracy.
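
As a rough cross-check of this reading of Table 3, the short sketch below recomputes two things from the published figures: the models that are not dominated on the accuracy/CPU inference time trade-off, and the models that need fewer FLOPs than ResNet yet run slower on CPU. The dictionary simply transcribes Table 3. The helper `dominates` and the choice of CPU time as the latency axis are illustrative assumptions.

```python
# (accuracy %, CPU inference time per sample, FLOPs in millions) from Table 3.
table3 = {
    "Vanilla CNN": (79.1, 26.6, 0.8),
    "ResNet": (87.7, 34.6, 2.1),
    "MobileNetV3": (67.8, 48.6, 0.7),
    "ShuffleNetV2": (84.5, 69.8, 0.5),
    "CondenseNetV2": (85.4, 80.3, 0.4),
    "DenseNet": (86.9, 91.8, 8.3),
    "MLSTM-FCN": (77.2, 186.3, 3.6),
    "XCM": (83.6, 355.0, 1.3),
    "GhostNet": (85.7, 473.8, 6.9),
    "EfficientNet": (64.0, 610.8, 2.2),
}


def dominates(a, b):
    """True if a is at least as accurate and as fast as b, and strictly better on one axis."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])


# Models for which no other model is both more accurate and faster.
pareto = [name for name, m in table3.items()
          if not any(dominates(other, m) for other in table3.values())]

# Models with fewer FLOPs than ResNet that are nevertheless slower on CPU.
resnet = table3["ResNet"]
fewer_flops_but_slower = [name for name, m in table3.items()
                          if m[2] < resnet[2] and m[1] > resnet[1]]

print(pareto)                  # ['Vanilla CNN', 'ResNet']
print(fewer_flops_but_slower)  # ['MobileNetV3', 'ShuffleNetV2', 'CondenseNetV2', 'XCM']
```

Only the Vanilla CNN and ResNet remain on the trade-off frontier, which matches the choice of ResNet as the backbone and of the Vanilla CNN as the speed reference.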

Lightweight and Efficient Residual Block. As presented in Section 3.2.1, we introduce a new residual block (LERes) to reduce the number of weights/trainable parameters and enhance the efficiency of our CNN backbone ResNet. Our LERes block integrates two new operations: the generation of cost-efficient feature maps with a linear transformation, so that only convolutions with an equal channel width are performed, and the replacement of the convolution on the shortcut link by a concatenation of the input data with the output of the first convolution. Table 4 shows an ablation study of ResNet with the new LERes block. We can observe that both operations reduce the number of trainable parameters and the inference time, with the second operation (concatenation on the shortcut connection) having roughly twice the impact of the first one. Overall, the new LERes block allows ResNet to reduce the number of trainable parameters of its backbone by 19% and the CPU inference time by 41% (GPU 19%), while maintaining the accuracy (99.3% of the original ResNet). Thus, this new LERes block enables ResNet to be faster than the Vanilla CNN (GPU 1.3/CPU 20.3 versus GPU 1.4/CPU 26.6). We call ResNet with the new LERes block LEResNet. We adopt its backbone for LEXNet and use it as the baseline for evaluating the potential cost of the explainability-by-design of our approach.

Model | Accuracy (%) | Number of Parameters (k) | Inference GPU (/sample) | Inference CPU (/sample)
(0) ResNet | 87.7 | 307 | 1.6 | 34.6
(1):(0)+Linear | -0.4 | -3 | -0.1 | -5.7
(2):(1)+Shortcut | -0.2 | -6 | -0.2 | -8.6
(2) LEResNet | 87.1 | 298 | 1.3 | 20.3
Summary | 99.3% | -19% (backbone) | -19% | -41%
Linear - generation of feature maps with a linear transformation,
Shortcut - concatenate operation on the shortcut link.
Table 4. Ablation study of ResNet with the new LERes block.
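
The summary row of Table 4 can be re-derived from the per-step deltas. The snippet below is a worked arithmetic check on the published figures only. The dictionary keys are illustrative names, and the -19% parameter reduction is not recomputed because it refers to the backbone alone, whose parameter count is not listed in the table.

```python
resnet = {"accuracy": 87.7, "params_k": 307, "gpu": 1.6, "cpu": 34.6}
steps = [
    {"accuracy": -0.4, "params_k": -3, "gpu": -0.1, "cpu": -5.7},  # (1): + Linear
    {"accuracy": -0.2, "params_k": -6, "gpu": -0.2, "cpu": -8.6},  # (2): + Shortcut
]

# Apply both ablation steps to the ResNet baseline.
leresnet = {k: round(resnet[k] + sum(s[k] for s in steps), 1) for k in resnet}

# Relative figures reported in the summary row.
accuracy_retained = round(100 * leresnet["accuracy"] / resnet["accuracy"], 1)
gpu_reduction = round(100 * (resnet["gpu"] - leresnet["gpu"]) / resnet["gpu"])
cpu_reduction = round(100 * (resnet["cpu"] - leresnet["cpu"]) / resnet["cpu"])

print(leresnet)           # {'accuracy': 87.1, 'params_k': 298, 'gpu': 1.3, 'cpu': 20.3}
print(accuracy_retained)  # 99.3
print(gpu_reduction, cpu_reduction)  # 19 41
```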

LEXNet versus ProtoPNet. The starting point of our approach, and our state-of-the-art explainable-by-design CNN baseline, is ProtoPNet. As detailed in Section 3.2, LEXNet relies on two contributions: a new lightweight and efficient residual block (LERes) and a lightweight prototype layer (LProto). Table 5 shows the ablation study from ProtoPNet to LEXNet. The performance reported corresponds to the best hyperparameter configuration obtained by cross-validation on the training set, which is the same for both networks: two prototypes of size (1, 1) per class. First, ProtoPNet with ResNet as CNN backbone exhibits an accuracy of 81.6% with 201k trainable parameters. This reduction in model size compared to ResNet (307k parameters - see Table 3) comes from the reduction in size of the fully connected network used for classification (ProtoPNet: 406 values as input - 2 similarity scores/prototypes per class). Then, we can see that the replacement of the prototype block of ProtoPNet by LProto significantly reduces the number of parameters (-29% of trainable parameters), while increasing the accuracy of the network by 7% to reach 87.7%. The adoption of LProto also leads to a significant reduction in inference time (GPU: -27%/CPU: -17%). Specifically, the reduction in the number of parameters and inference time comes from the removal of the additional convolutional layers, which also decreases the depth of the prototypes. The increase in accuracy mainly comes from the L2 regularization, which improves the generalization ability of our approach. Next, we observe that the replacement of the original residual block by the LERes block in ProtoPNet produces a performance evolution (-19% backbone parameters, -19% CPU inference time, preserved accuracy) in line with the one observed on ResNet (see previous paragraph). Overall, our explainable-by-design LEXNet is more accurate (+7%), lighter (-34%) and faster (-33%) than the current state-of-the-art explainable-by-design CNN ProtoPNet. Moreover, LEXNet maintains the same accuracy as LEResNet (87.1%), while reducing the number of parameters by 55%. The higher inference time of LEXNet compared to LEResNet (CPU 113.9 versus 20.3) does not hold when accounting for explainability. This point is discussed in Section 5.2 (The Cost of Explainability).

Model | Accuracy (%) | Number of Parameters (k) | Inference GPU (/sample) | Inference CPU (/sample)
(0) ProtoPNet | 81.6 | 201 | 5.5 | 169.7
(1):(0)+No Add | +0.2 | -59 | -1.5 | -29.3
(2):(1)+Sigmoid | +1.2 | | | -0.6
(3):(2)+L2 Reg | +4.7 | | | +1
(3): (0)+LProto | 87.7 | 142 | 4.0 | 140.8
Summary | +7% | -29% | -27% | -17%
(4):(3)+Linear | -0.5 | -3 | -0.1 | -11.9
(5):(4)+Shortcut | -0.1 | -6 | -0.2 | -15
(5): LEXNet | 87.1 | 133 | 3.7 | 113.9
Summary | +7% | -34% | -33% | -33%
No Add - removal of the additional convolutional layers in the prototype block,
L2 Reg - L2 regularization of the prototypes,
Linear - generation of feature maps with a linear transformation,
Shortcut - concatenate operation on the shortcut link,
Sigmoid - replace last activation of the CNN backbone.
Table 5. Ablation study of LEXNet.

LEXNet Predictions. In this section, we analyze the prediction performance of LEXNet on traffic classification. First, our results show that encrypted Internet flows can be classified with a state-of-the-art accuracy of 87.1% using solely the values of two prototypes of size (1, 1) per application, based on MTS containing the first 20 packets of a flow, with per-packet size and direction as variables. More specifically, two prototypes of size (1, 1) per application are sufficient to classify both TCP and UDP applications with a state-of-the-art accuracy of 86.4% for TCP and 96.2% for UDP (consistent with (Yang21)). This best hyperparameter configuration (two prototypes of size (1, 1)) also tells us that the information necessary to discriminate applications is often not a long sequence of packet sizes or directions, but the combination of both packet sizes and directions at different places in the flow. Figure 3 illustrates this aspect with lower accuracies for longer prototypes. We have also experimented with prototypes of size (_, 2), which combine, at the same place in the flow, the packet size and direction. However, this setting leads to lower prediction performance. This observation emphasizes the relevance of addressing traffic classification as an MTS task, instead of as a univariate one with the direction as the sign of the packet size.

Figure 3. LEXNet accuracy on the TCP+UDP test set according to the number and size of prototypes.

Second, Internet traffic is highly imbalanced, with a few applications generating most of the flows. As an example, more than half of the classes (103 classes) represent less than 5% of the total number of flows in our dataset (see Table 2). Our results show that, by maintaining the same high accuracy as LEResNet, the addition of a prototype layer for explainability does not alter the robustness of our model to class imbalance. Among the 103 least popular classes, more than a third are classified with an accuracy above 75% (see Figure 4).

Figure 4. LEXNet accuracy per class on the TCP+UDP test set.
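
The per-class view used for Figure 4 matters because, on imbalanced data, the overall accuracy is dominated by the popular classes. The following sketch uses made-up toy labels and a hypothetical helper `per_class_accuracy`. It shows how per-class accuracy and the share of classes above a 75% threshold can be computed.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return {class: fraction of its samples that were predicted correctly}."""
    correct, total = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return {label: correct[label] / total[label] for label in total}


# Toy, imbalanced example (hypothetical labels, not the paper's data).
y_true = ["a"] * 6 + ["b"] * 4 + ["c"] * 2
y_pred = ["a"] * 6 + ["b"] * 4 + ["c", "a"]

accuracies = per_class_accuracy(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
share_above_75 = sum(acc > 0.75 for acc in accuracies.values()) / len(accuracies)

print(accuracies)      # {'a': 1.0, 'b': 1.0, 'c': 0.5}
print(overall)         # 11 of 12 samples correct
print(share_above_75)  # 2 of 3 classes above the threshold
```

In this toy case the overall accuracy stays above 90% even though the rarest class is only classified correctly half of the time, which is why a per-class breakdown is reported alongside the global figure.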

5.2. Explainability

Illustration. The explainability of our approach stems from the communication of the two prototypes from the predicted application to the end-user. Based on the first 20 packets of a flow from the most popular TCP application, Figure 5 illustrates this explainability by identifying in red the two class-specific prototypes that have been used for the prediction. The flow/MTS is represented by a heatmap of the normalized values. The two prototypes of size (1, 1) are highlighted by two red rectangles identifying precisely the region of the input data that has been used for prediction. In this example, a descending packet in position 2 and a packet of medium size in position 10 have been detected, which are characteristic of the class containing the most popular TCP application. Another example of explainability for a sample from the most popular UDP application is available in Figure 9 (see Appendix).

Figure 5. Example of a sample from the most popular TCP application of our dataset with the prototypes identified by LEXNet with its explainability-by-design in red.
Figure 6. Example of a sample from the most popular TCP application of our dataset with the two prototypes identified by LEXNet with its explainability-by-design and the post hoc explainability methods Grad-CAM and SHAP.

Faithfulness. Next, we highlight the faithfulness of LEXNet explanations, which is critical from both a compliance and a business standpoint. We compare the two most important regions of size (1, 1) identified by the faithful (by definition) LEXNet explainability-by-design - the two class prototypes - to the ones from the state-of-the-art post hoc model-specific method Grad-CAM and model-agnostic method SHAP. First, based on one sample from the most popular TCP application, we show in Figure 6 that the post hoc explainability methods, applied to the same model LEXNet, identify none of the expected regions/prototypes used by LEXNet. These methods identify regions which can be far from the expected ones (e.g., Direction: Grad-CAM number 14 versus LEXNet number 2), and can miss the identification of discriminative variables (e.g., SHAP: no Packet Size identified). Second, in order to quantitatively assess this difference across the dataset, considering that two prototypes need to be identified, we calculate the top-2 and top-10 accuracy in a similar fashion to the top-1 and top-5 accuracies in computer vision evaluations. The results in Table 6 show that Grad-CAM identifies the regions of the input data that are important for predictions slightly better than SHAP, but overall both post hoc methods poorly identify the expected regions (Grad-CAM: top-2 8.7%/top-10 39.9%, SHAP: top-2 6.1%/top-10 27.8%). When using the first 10 predicted (1, 1) regions from the post hoc explainability methods, i.e. 25% of the size of the input data, the expected prototypes are identified with an accuracy of less than 40%. Therefore, this experiment clearly emphasizes the importance of adopting explainable-by-design methods rather than post hoc ones in order to ensure faithfulness.

Metric | Grad-CAM | SHAP
Top 2 Accuracy (%) | 8.7 | 6.1
Top 10 Accuracy (%) | 39.9 | 27.8
Table 6. Faithfulness evaluation of the state-of-the-art post hoc explainability methods on the TCP+UDP train set.
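
The top-k evaluation can be sketched as follows. The exact scoring rule is not spelled out in the text, so this is one plausible reading, stated as an assumption: for each sample, an expected prototype region counts as found if it appears among the k input cells with the highest post hoc attribution, and the score is the fraction of expected regions found. The function name `top_k_hit_rate` and the toy attribution values are hypothetical.

```python
def top_k_hit_rate(attributions, expected, k):
    """Fraction of expected (position, variable) regions ranked in the top k.

    attributions: one dict per sample, mapping (position, variable) -> score
                  produced by a post hoc method (higher means more important).
    expected:     one list per sample with the regions of the class prototypes.
    """
    hits = total = 0
    for scores, regions in zip(attributions, expected):
        ranked = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += sum(region in ranked for region in regions)
        total += len(regions)
    return hits / total


# Toy sample with 4 packets x 2 variables (0 = size, 1 = direction).
attributions = [{
    (0, 0): 0.10, (0, 1): 0.90,
    (1, 0): 0.80, (1, 1): 0.20,
    (2, 0): 0.30, (2, 1): 0.05,
    (3, 0): 0.00, (3, 1): 0.40,
}]
expected = [[(0, 1), (2, 0)]]  # regions of the two class prototypes

print(top_k_hit_rate(attributions, expected, k=2))  # 0.5
print(top_k_hit_rate(attributions, expected, k=4))  # 1.0
```

With a 20 × 2 input, k = 10 corresponds to a quarter of all cells, which is the setting behind the 25% figure quoted above.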

The Cost of Explainability. We have seen that our new lightweight, efficient and explainable-by-design CNN (LEXNet) for traffic classification maintains the same accuracy as the best state-of-the-art neural network on traffic classification (LEResNet), while having 55% fewer trainable parameters. Moreover, when accounting for explainability, LEXNet exhibits a lower inference time than LEResNet. Table 7 shows the performance of LEXNet (accuracy/size/inference) in comparison with the current state-of-the-art explainable-by-design CNN ProtoPNet with the same backbone as LEXNet, and the best performing CNN (LEResNet), which can only rely on state-of-the-art post hoc explainability methods. In addition to the faithful explainability, we observe that LEXNet is around 2.5 times faster than LEResNet with the most popular post hoc explainability method for CNNs (the saliency method Grad-CAM), while maintaining accuracy (87.1%) and having around half its model size (133k parameters).

Model | Accuracy (%) | Number of Parameters (k) | GPU (/sample) | CPU (/sample)
LEResNet+Grad-CAM | 87.1 | 298 | 9.5 | 278.6
LEResNet+SHAP | 87.1 | 298 | 8,246.3 | 68,440.6
ProtoPNet | 81.6 | 201 | 5.5 | 169.7
LEXNet | 87.1 | 133 | 3.7 | 113.9
Table 7. Comparison of LEXNet and the state-of-the-art CNNs with explainability-by-design and post-hoc explainability on the TCP+UDP test set.

Nonetheless, the explainability-by-design of LEXNet has a cost on the inference time compared to the best performing state-of-the-art CNN LEResNet without explainability methods (GPU 3.7/CPU 113.9 versus GPU 1.3/CPU 20.3 - see Table 4). Based on our experiments, around 80% of this additional inference time is due to the L2 distance calculations in the prototype layer to generate the similarity matrices. Therefore, some of this cost could be reduced by optimizing the generation of the similarity matrices.
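
To make the source of this cost concrete, here is a minimal sketch of how a similarity map can be generated for one prototype of size (1, 1). It is not the LProto implementation. It assumes the ProtoPNet-style activation log((d + 1)/(d + eps)) on the squared L2 distance d, a feature map stored as a plain dictionary, and made-up feature values. The names `similarity_map` and `best_match` are hypothetical.

```python
import math


def similarity_map(feature_map, prototype, eps=1e-4):
    """Similarity between one (1, 1) prototype and every cell of the feature map.

    feature_map: dict mapping (position, variable) -> feature vector (list of floats)
    prototype:   feature vector of the same depth
    """
    sims = {}
    for cell, features in feature_map.items():
        # Squared L2 distance between the cell and the prototype.
        dist = sum((f - p) ** 2 for f, p in zip(features, prototype))
        # ProtoPNet-style activation: large when the distance is small (assumed here).
        sims[cell] = math.log((dist + 1.0) / (dist + eps))
    return sims


def best_match(feature_map, prototype):
    """Global max pooling: the cell most similar to the prototype and its score."""
    sims = similarity_map(feature_map, prototype)
    cell = max(sims, key=sims.get)
    return cell, sims[cell]


# Toy feature map: 3 positions x 2 variables, depth 2 (made-up values).
feature_map = {
    (0, 0): [0.2, 0.1], (0, 1): [0.9, 0.4],
    (1, 0): [0.5, 0.5], (1, 1): [0.0, 0.7],
    (2, 0): [0.3, 0.3], (2, 1): [0.8, 0.8],
}
prototype = [0.5, 0.5]

cell, score = best_match(feature_map, prototype)
print(cell, round(score, 2))  # (1, 0) 9.21
```

One distance is computed per cell and per prototype, so the work grows with the product of the feature map size and the number of prototypes (two per class here, i.e. 406 prototypes in total). This is why the similarity matrices are a natural target for optimization.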

Figure 7. Example of the main classification error performed by a flow from the top TCP application, finding a higher similarity to prototypes from another application.

6. Conclusion

We have presented LEXNet, a new lightweight, efficient and explainable-by-design CNN for traffic classification which relies on a new residual block and prototype layer. LEXNet exhibits a significantly lower model size and inference time compared to the state-of-the-art explainable-by-design CNN ProtoPNet, while being more accurate. Moreover, LEXNet is also lighter and faster than the best performing state-of-the-art neural network ResNet with state-of-the-art post hoc explainability methods, while maintaining its accuracy. Our results show that encrypted Internet flows can be classified with a state-of-the-art accuracy using solely the values of two prototypes of size (1, 1) per application, based on MTS containing the first 20 packets of a flow, with per-packet size and direction as variables. These two class prototypes detected on a flow can be given to the end-user as a faithful explanation to support the application predicted by LEXNet.

While LEXNet constitutes a first fruitful attempt to provide faithful explainability for traffic classification with high prediction performance, it could be further improved. For instance, our analysis of LEXNet classification errors reveals that the best hyperparameter configuration on average on our dataset (two prototypes of size (1, 1)) is not optimal for every application. The prototype size of (1, 1) offers high flexibility to cover the needs of different applications. However, the low number of two prototypes per class does not cover applications characterized by more than two values, or applications with a higher diversity of flows. For example, we observe that for the most popular TCP application (test set accuracy: 94%), more than half of the classification errors are due to the identification of the same wrong application based on a closer similarity to its prototypes (descending packet 13 and low packet 5 size - see Figure 7), which occurs regularly in the dataset. This suggests that some applications would be better characterized by a higher number of prototypes.

In our future work, we would like to (i) improve LEXNet prediction performance by learning a different number of prototypes per application in order to better characterize the diversity of flows, and (ii) optimize the generation of the similarity matrices in the prototype layer to enhance LEXNet efficiency.

Appendices

Dataset

Our dataset has been collected from four customer deployments in China. The dataset is composed of the TCP (e.g., HTTP) and UDP (e.g., streaming media, VoIP) traffic activity over four weeks across tens of thousands of network devices. Specifically, it contains 7.9M flows/MTS belonging to 203 classes (169 TCP and 34 UDP, about 200 applications - see application categories in Figure 8). For each flow/MTS, per-packet information (two variables: packet size and direction) is available for the first 100 packets. This per-packet information can be collected from any encrypted traffic. We set the MTS length to 20 packets for TCP and 10 packets for UDP. These values reflect the relevant time windows to identify the applications and sustain line rate classification of the traffic. In order to have only one classifier for both TCP and UDP applications, UDP time series are padded with zeros to match the TCP length of 20. We normalized the dataset as a preprocessing step.
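
The preprocessing described above can be sketched in a few lines. Several details are assumptions made for illustration and are not stated in the text: the direction is encoded as +1/-1, packet sizes are normalized by a constant of 1500 bytes, and flows shorter than the retained window are zero-padded as well. The function `flow_to_mts` and the constants are hypothetical names.

```python
MTS_LENGTH = 20                        # common input length (TCP length)
PACKETS_KEPT = {"tcp": 20, "udp": 10}  # packets retained per protocol
MAX_PACKET_SIZE = 1500.0               # assumed normalization constant (bytes)


def flow_to_mts(packets, protocol):
    """Turn a flow into a 20 x 2 multivariate time series (size, direction).

    packets:  list of (size_in_bytes, direction) tuples, direction in {+1, -1}
    protocol: "tcp" or "udp"
    """
    if protocol not in PACKETS_KEPT:
        raise ValueError(f"unknown protocol: {protocol}")
    kept = packets[:PACKETS_KEPT[protocol]]
    rows = [[size / MAX_PACKET_SIZE, float(direction)] for size, direction in kept]
    # Zero-pad so that TCP and UDP flows share a single classifier input shape.
    rows += [[0.0, 0.0] for _ in range(MTS_LENGTH - len(rows))]
    return rows


# Hypothetical UDP flow with 16 packets: only the first 10 are kept.
udp_flow = [(1200, +1), (60, -1)] * 8
mts = flow_to_mts(udp_flow, "udp")
print(len(mts), mts[0], mts[1], mts[10])  # 20 [0.8, 1.0] [0.04, -1.0] [0.0, 0.0]
```

Encoding the direction as +1/-1 keeps zero available as an unambiguous padding value. A 0/1 encoding would make padded rows indistinguishable from real packets in the direction variable.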

Concerning the labeling, each flow has been annotated with application names provided by a Huawei commercial-grade deep packet inspection (DPI) engine, i.e. a traditional rule-based method. Traffic encryption in China is not yet as widespread as in the Western world, so DPI technologies still offer a fine-grained view of the traffic.

An anonymized version of our dataset is under preparation for publication. While the dataset we used does not contain information that constitutes a privacy-related risk (e.g., IP addresses), it contains business-sensitive information (e.g., fine-grained labels and packet size sequences). Such information requires further processing (e.g., label obfuscation, time series shuffling) to ensure that researchers can still carry out meaningful experiments without affecting the business.

Figure 8. Breakdown of fraction of bytes and flows by category for the applications of our TCP+UDP dataset.

Explainability Illustration on UDP

Figure 9 shows an example of a sample from the most popular UDP application of our dataset with the prototypes identified by LEXNet with its explainability-by-design in red.

Figure 9. Example of a sample from the most popular UDP application of our dataset with the prototypes identified by LEXNet with its explainability-by-design in red.