With onion routing, Tor dissociates several million privacy-conscious users a day from the websites they visit by forwarding multi-layer encrypted packets through a number of volunteer proxies. However, by observing the side-channel traffic pattern, without any decryption, on any on-path router before Tor's entry proxy, an attacker can still compromise a user's browsing privacy via traffic analysis, i.e., a website fingerprinting attack (WFA). In the past decade, an extensive number of studies [3, 4, 5, 6, 7, 8, 9, 10, 11, 12] have achieved remarkable success, even with only a few training samples per website [13, 14, 15].
Nonetheless, most existing methods implicitly make a couple of artificial assumptions: (1) each time, a user visits only a single website, and visits multiple websites separately and sequentially, i.e., single-tab WFA (ST-WFA); (2) the raw website fingerprinting data are pre-trimmed manually so that a single trace involves one website of interest. Neither holds in typical real-world situations. Often, a user opens multiple tabs to visit different websites spontaneously and shifts focus from one tab to another over time, i.e., multi-tab WFA (MT-WFA).
In the literature there are only very few studies on MT-WFA [16, 17, 11, 18]. Importantly, they are highly limited in different aspects, and none of them fully investigates real-world user behaviors and all the native challenges of MT-WFA. Specifically, Gu et al. only discussed the classification of two-tab traces for SSH without considering the localization of individual traces at all. Wang et al. moved a step further by studying how to split Tor's two-tab traces so that the MT-WFA problem can be solved by previous ST-WFA models. Later on, Xu et al. improved the splitting of two-tab traces but were limited to classifying only the clean front segment of the first website fingerprint. More recently, Gong et al. considered, from a defense perspective, a simplified MT-WFA setting without overlap between website fingerprints.
In this work, we for the first time investigate a realistic MT-WFA problem setting. Unlike the previous attempts above, we minimize the introduction of artificial assumptions and conditions, enabling a true indication of a model's deployability and scalability in real-world WFA tasks. Concretely, we redefine the problem of MT-WFA as follows: the input is raw untrimmed traffic data of multi-tab traces, potentially with random overlap between adjacent visits. The objective is to find where (start and end) and what (website label) each individual full trace of any monitored website is.
To solve the proposed MT-WFA problem, we introduce the first end-to-end Website Fingerprint Detection (WFD) model. We draw inspiration from the success of object detection methods in computer vision, viewing a single full monitored trace as an object and a long multi-tab trace as an image. Along with feature extraction, localization and classification of all full monitored traces are jointly optimized in a single deep learning architecture, i.e., end-to-end. This is drastically different from, and more elegant than, all the previous WFA methods [18, 17] that model localization (i.e., trace splitting) and classification independently and hence suffer from the notorious localization error propagation problem. As a consequence, those methods handle realistic WFA situations poorly (see Table 1).
Our contributions are summarized as follows:
(I) We investigate for the first time the realistic and more challenging MT-WFA problem. Instead of classifying individual, manually trimmed trace segments as in existing methods, we take a detection perspective. This eliminates the unrealistic assumptions of sequential single-website visiting and manual trimming. For quantitative evaluation, we introduce new performance metrics (mean Average Precision (mAP) and Megabytes per second (MBps)) for MT-WFA.
(II) We propose the first end-to-end Website Fingerprint Detection (WFD) model. This is based on our perspective of considering each monitored trace of interest as a specific object in an image. Hence this allows us to adapt the success of object detection in computer vision for MT-WFA. In design, our model is superior as all the model components (feature extraction, scale encoding/perception, and two-head prediction) of WFD can be optimized jointly in training, which is impossible with existing alternative methods. To improve the model efficiency, we introduce a novel compact input representation (namely Burst), specially-designed lightweight components, and a fast training strategy.
(III) We conduct extensive experiments to validate the superiority of our WFD method in comparison to the state-of-the-art models. In particular, given the most challenging traffic data with 17-tab traces, WFD achieves an mAP of 58.54%, vs. 5.73% by the best alternative, CDSB. Interestingly, against the latest defense GLUE, WFD even improves mAP further to 80.56%. This suggests the proposed MT-WFA setting is already more challenging than GLUE-defended traffic data, due to GLUE's consecutive single-website visiting assumption. In terms of attack speed, WFD achieves 1208.30 Megabytes per second (MBps) with an ordinary NVIDIA 3090 GPU, allowing an attacker to target thousands of Tor users (8516-10113) simultaneously in real time (assuming 16-19 seconds to load a 2.27 MB web-page on average).
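The reported 8516-10113 figure follows directly from the stated throughput, page size, and load times. A minimal back-of-the-envelope check (the helper function is illustrative, not from the paper):

```python
def concurrent_users(throughput_mbps: float, page_mb: float, load_s: float) -> int:
    """How many users can be covered if each user triggers one
    page load (page_mb Megabytes) every load_s seconds on average,
    and the attacker can process throughput_mbps Megabytes per second."""
    pages_per_second = throughput_mbps / page_mb
    return int(pages_per_second * load_s)

low = concurrent_users(1208.30, 2.27, 16)   # fastest assumed page load
high = concurrent_users(1208.30, 2.27, 19)  # slowest assumed page load
print(low, high)  # 8516 10113
```

This reproduces the range quoted above: roughly 532 page loads per second, spread over users who each load one page every 16-19 seconds.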
2 Related Works
2.1 Single-Tab WF Attack
With the anonymized network traffic data of the single web-page as the input, different WF attacks have been proposed. They can be mainly divided into three categories: traditional ST-WFA [4, 5, 6, 9, 10, 11], deep learning based ST-WFA [3, 7, 8, 12] and few-shot ST-WFA [13, 14, 15].
Traditional single-tab WF attacks.
These attack methods mainly use hand-designed feature engineering and traditional machine learning classifiers. Representative methods include k-NN, CUMUL, k-FP, and so forth. In the closed-world scenario, they can all achieve over 90% accuracy. They have been shown to be effective even in the more difficult open-world scenario. For instance, k-FP can identify 30 monitored web pages out of 100,000 unmonitored ones with an 85% true positive rate (TPR) and a 0.02% false positive rate (FPR). Their success relies on good feature design based on rich domain knowledge, limited to the specific characteristics of the training data alone. Once there is a big change in the data pattern, these methods degrade drastically.
Deep learning based ST-WFA. Unlike traditional methods, deep learning models can learn feature representations from the training data without manual design, and hence are more scalable and easier to use. Importantly, strong results have been obtained. For example, the DF model achieves 98.3% accuracy in the closed-world setting, 99% precision with 94% recall in the open-world setting, and even 90.7% accuracy against the well-known WTF-PAD defense.
Few-shot ST-WFA. Deep models often need a large amount of training data, which can be difficult to collect in many cases. To overcome this challenge, few-shot learning has been introduced, with representative works including TF, TLFA, and HDA. In particular, TLFA can achieve over 90% accuracy with only one sample per new website through knowledge reuse of seen websites.
Commonly, all the above works assume well-trimmed traces, each involving a single website. Further, they assume single-tab visiting, which is an artificial and unrealistic assumption. As a consequence, these methods are likely unsuitable for real-world applications, as a user often uses multiple tabs when surfing the internet.
2.2 Multi-Tab WF Attack
In 2015, Gu et al. presented the first two-tab WF attack on SSH. By selecting features exclusive to SSH, such as TCP connections, total per-direction bandwidth and inter-packet time, they exploited a Mahalanobis distance metric to determine whether a trace is partly overlapped or not. They further selected fine-grained features to deanonymize the first page and coarse features to identify the second page. With a two-second delay between the two page visits, their attack can classify the first page at 75.9% TPR and the second at 40.5% TPR among 50 websites in a closed-world setting. Unlike their work, we skip the split decision phase and focus directly on detecting monitored trace segments.
In 2016, Wang et al. first studied the two-tab WF attack on Tor. They noticed the importance of the time gap and used a time-based k-NN (Time-kNN) model to deal with split decision and split finding. However, this model struggles when the two visits are consecutive or overlapped, with only a 32% split accuracy in the overlapped cases. Besides, they were the first to discuss the negative impact of inappropriate splitting on the subsequent classification.
In 2018, Xu et al. advanced MT-WFA with the same two-step pipeline but focused only on the first-page attack. They proposed a BalanceCascade-XGBoost model to find the start of the second page trace by combining XGBoost and BalanceCascade. Further, they selected the most useful features out of 452 candidates with IWSSembeddedNB and modeled the attack with random forests. This model can identify the first page at a TPR of 64.94% on Tor. However, it cannot detect the overlapped portion of the first trace.
In 2020, Gong et al. studied the MT-WFA problem from a defense perspective, subject to an assumption of single-tab visiting at a time. They proposed the GLUE defense to glue non-consecutive traces into seemingly consecutive ones, motivated by the observation that MT-WFA is a much harder problem than ST-WFA. The FRONT defense is also used to ensure the front privacy of glued traces. They designed a new Coarse-Decided Score-Based (CDSB) method, based on a random forest classifier and 511 features, for split decision and split point finding, following the same split-and-classify pipeline as the above methods. After trace segments are split, existing ST-WFA classification models can then be applied, such as k-NN, CUMUL, k-FP and DF. In the case of ordinary consecutive 16-trace sequences, CDSB can achieve over 45% TPR and over 41% precision. With GLUE, the performance drops significantly, achieving the defense effect. In this work, we instead show that this GLUE defense is actually simpler than the realistic MT-WFA setting. This means that multi-tab WF traces are naturally challenging on their own.
In terms of model design, all the above methods separate the trace splitting (i.e., localization) and classification components, which precludes their joint and interactive optimization during training. Importantly, trace localization is often largely inaccurate. This further challenges the subsequent classification, resulting in a localization error propagation problem that cannot be fixed at the classification step. To address this fundamental limitation, we introduce a novel Website Fingerprinting Detection (WFD) model that is end-to-end trainable in a unified deep learning framework.
3.1 Positioning a Threat Model
Aiming at the user's browsing privacy, WFA is a passive, undetectable and unpreventable traffic analysis: it does not modify, drop, or delay any packets in the traffic. Different from previous WFA works limited to a local environment such as a LAN, we consider a WF threat model that can settle on a core router for large-scale real-time attacking, as illustrated in Figure 1.
3.2 Common WFA Settings
Existing WFA works mainly consider these settings: single-tab vs multi-tab, closed-world vs open-world, and without- vs with- background traffic. We describe their key aspects below.
The majority of existing WFA works consider the single-tab situation, that is, each time a user visits only a single website with a single tab in the browser. This is clearly not common user behavior. Instead, multi-tab situations are the norm in the real world. Hence, MT-WFA is a more realistic problem setting.
The closed-world setting assumes that only a fixed number of websites are visited, i.e., the set of monitored websites. In the real world, this is often not true. The open-world setting eliminates this restriction by considering the presence of unmonitored websites in the traffic data. An important notion is the base rate, defined as the rate at which the user visits a page from the set of monitored websites. It is user specific and affects the evaluation in the open-world case. Setting too high a base rate will lead to overly optimistic results, i.e., the "base rate fallacy" phenomenon.
A set of monitored websites. These are the target websites an attacker wants to monitor. Each website in the set can be labeled as a specific class, so ST-WFA is often formulated as a classification problem. All the unmonitored websites in the open-world scenario are usually collectively labeled as a single class, different from all the classes assigned to monitored websites.
Background traffic is often generated by file downloading, online music streaming, and so on. Given that current Tor browsing is relatively slow, these background activities are usually not considered, as they would further slow the flow of traffic data and reduce the user experience.
4 Real-World Multi-Tab WF Attack
4.1 Internet User Behaviors
Adjacent website visiting is one of the most challenging situations for WFA. There are three typical types of user behaviors (Fig. 1), as below. (1) The blue-colored user dwells on an old page for a while before opening a new page, yielding a traffic sequence with two separated and sequential single-tab traces. (2) The yellow-colored user loads a new page in the current tab, stopping the loading of the old incomplete page, and yielding a traffic sequence with two consecutive traces without a noticeable time gap. This is a multi-tab trace with all individual traces clean. (3) The red-colored user opens a new page in a new tab while the old page is still loading, yielding a traffic sequence with two overlapping traces. This is a more challenging multi-tab trace, which has never been systematically investigated in the literature. To facilitate understanding, we define several key concepts as follows.
A single-tab trace. A single-tab trace is a sequence of packets produced in the loading process of a web browser when a user pays a visit to one website. It is also known as a single trace.
A multi-tab trace. A multi-tab trace comprises more than one trace, with or without overlap, produced in the loading process of a web browser when a user opens multiple tabs to visit multiple websites concurrently.
Trace segment. A trace segment is a consecutive sub-sequence of one trace (i.e., clean) or multiple traces (i.e., mixed).
4.2 Problem Definition
Given raw traffic data (i.e., a long untrimmed multi-tab trace), the objective of MT-WFA is to identify all single-tab traces of any monitored website, including their start, end, and website label.
Ground-truth. In MT-WFA, a ground-truth is defined as a full monitored trace (s, e, w) in a raw multi-tab trace, where s and e are the indices of its first and last packets and w is its website label. Its center point is c = (s + e)/2, and its length is l = e - s + 1.
In order to train an MT-WFA model for detecting all the ground-truth traces in a given multi-tab trace, we collect a labelled training set in which each multi-tab trace is associated with a set of ground-truth traces. In the attack (i.e., model inference), given a test multi-tab trace, a trained model outputs a set of top-scoring candidate traces. Ideally, this output matches the ground-truth traces exactly. This, however, is rarely possible. Typically, we hope that each candidate has high overlap (Eq. (4)) with its ground-truth.
A candidate trace. A candidate trace is a prediction output by a WFD model, including a specific website label w, the center point position c, the length l, and a detection score.
Intersection over union of traces (IoUT). The IoUT of two traces is defined as the ratio of the length of their common segment to the length of their union.
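The IoUT definition above reduces to simple interval arithmetic. A minimal sketch, treating each trace as a pair of inclusive packet indices (the function name and representation are illustrative):

```python
def iout(a, b):
    """Intersection over union of two 1D traces, each given as
    (start, end) packet indices with end inclusive."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

print(iout((0, 99), (50, 149)))  # 50 shared packets / 150 total -> 0.333...
print(iout((0, 9), (20, 29)))    # disjoint traces -> 0.0
```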
To better understand the above concepts, we provide an example in Figure 2.
4.3 Performance Metrics
In this section, we introduce a new set of performance metrics for evaluating both the accuracy and efficiency of a MT-WFA method.
4.3.1 Accuracy metrics
The precision and recall metrics designed for conventional ST-WFA cannot fully reflect the requirements of MT-WFA.

Precision and recall at thresholds. Given a predicted trace and a ground-truth, if their IoUT value exceeds a predefined threshold (e.g., 0.5), and the maximal classification probability corresponds to the ground-truth website and exceeds a predefined threshold, the prediction is viewed as a true positive (TP). Otherwise it is a false positive (FP). If there is no predicted trace for a ground-truth, it is counted as a false negative (FN). The precision and recall metrics for a website w are then defined as Precision_w = TP_w / (TP_w + FP_w) and Recall_w = TP_w / (TP_w + FN_w).
Average precision. By varying the classification threshold, we can draw a series of precision-recall score pairs. To summarize them, we use the area under the precision-recall curve as a metric, namely Average Precision (AP). Formally, the AP for a website w at an IoUT threshold is defined as the area under the corresponding precision-recall curve.
mAP. To summarize the AP scores over the set of monitored websites, we define the mean average precision (mAP) metric as the average of the per-website AP scores, where the number of summands is the number of monitored websites. For a specific IoUT threshold, e.g., 0.5, the corresponding mAP is denoted as mAP@0.5. By default, we suggest a range of IoUT thresholds from 0.5 to 0.95 with a step of 0.05. Finally, we average all the per-threshold mAP values to obtain a single mAP score for overall performance.
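The AP and mAP computations described above can be sketched as follows. This is a minimal illustration using step-wise integration of the precision-recall curve; the exact interpolation scheme used by the paper is not specified, so none is assumed:

```python
def average_precision(pr_pairs):
    """Area under the precision-recall curve, given (recall, precision)
    pairs obtained by sweeping the classification threshold.
    Uses simple step-wise (rectangle) integration over recall."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in sorted(pr_pairs):  # ascending recall
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

def mean_ap(ap_per_website):
    """mAP at one IoUT threshold: the mean of per-website AP scores."""
    return sum(ap_per_website) / len(ap_per_website)

ap = average_precision([(0.2, 1.0), (0.5, 0.8), (1.0, 0.5)])
print(ap)  # 0.2*1.0 + 0.3*0.8 + 0.5*0.5 = 0.69
```

The overall score would then average `mean_ap` over the IoUT thresholds 0.5, 0.55, ..., 0.95.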
4.3.2 Efficiency metrics
For the evaluation of efficiency, we suggest two speed metrics: the training time and the attack speed in Megabytes per second (MBps). The training time can be easily measured on a fixed machine.
Megabytes per second (MBps). Although the number of traces per second could also be used to evaluate the attack speed, it does not reflect the trace length. To overcome this issue, we suggest using Megabytes per Second (MBps) as the attack speed metric. In MT-WFA, MBps measures the amount of anonymous network flow, in Megabytes, that a WFA method can detect per second.
Classification vs. forensics. ST-WFA is essentially a classification problem over clean, manually trimmed single-tab traces. This is a simplified form of MT-WFA. In forensics, one instead needs to process untrimmed multi-tab traces and detect the location of single-tab traces of monitored websites, without any manual trimming.
Bootstrap time and attack speed. The WFA time is critical for an attacker, including bootstrap time  and attack time. The former is the time required to build a ready-to-use WFA model, including data collecting time and WFA model training time. The attack speed is also important due to the potentially infinite amount of WF data. In general, we aim to develop fast-to-train, fast-to-attack, and more accurate WFA models.
5 End-to-End WF Detection
In this section, we present our WFD model, including its motivation, overview, model architecture, model training, model testing (attack) and model optimization.
We relate the MT-WFA problem to object detection in computer vision, given the remarkable success achieved there [25, 26, 27, 28, 29, 30, 31, 32]. This is based on our novel analogy that each individual single-tab trace of any monitored website can be viewed as an object of interest in an image, whilst all the other traces form the background of the image. This perspective opens a door to exploiting the rich experience of object detection models. However, it is still non-trivial to develop a strong MT-WFA model by analogy with object detectors, due to several problem-specific challenges.
Considering the demands of accuracy and efficiency, we choose a one-stage architecture for our WF detector, which solves the regression and classification problems in a unified way at high speed.
For MT-WFA, we design a novel Website Fingerprinting Detection (WFD) model, as illustrated in Figure 3. It is end-to-end learnable, taking as input a raw multi-tab trace and outputting a set of candidate traces. Trace localization and classification are jointly optimized, realizing strong synergy between them.
Model Architecture. The proposed WFD model consists of three functional components: feature extraction, scale perception, and localization and classification. Specifically, given a raw traffic datum of a whole multi-tab trace, the feature extractor generates a feature vector sequence and cell segments. Anchor segments are then generated based on cell segments and anchor lengths. The scale encoder further perceives different scales to learn scale-sensitive information for feature representation. At the end, a regression head outputs the offsets of proposal segments and a classification head conducts website classification, i.e., the two-head predictor. These outputs are used to generate candidate traces.
Model training. The main steps of model training can be summarized as follows: (1) matching the proposal segments with the ground-truth; (2) obtaining valid positive and negative segments; (3) for positive segments, computing the regression loss of the predicted offsets against the ground-truth locations; (4) for all valid proposal segments, computing the classification loss of the predicted probabilities against the ground-truth labels; (5) summing the above two losses and computing the gradient of all learnable parameters for parameter update.
Model testing (attack). Given a raw traffic datum of a multi-tab trace, the trained model outputs a number of candidate traces. As not all of them are good, we select those with top classification scores as the output.
5.3 Model Architecture
5.3.1 Main components
A strong model is needed to extract a powerful feature representation from each raw traffic datum of a multi-tab trace. We choose the 1-dimensional (1D) variant of residual convolutional neural networks (ResNets). From a multi-tab trace of length L, the feature extractor with a temporal down-sampling rate r efficiently extracts a feature vector sequence of length L/r. Each feature vector corresponds to one cell segment. For the i-th cell segment, its center point is (i + 0.5) * r and its length is r.
Scale encoder. Given a feature vector sequence, we let each feature vector take charge of predicting ground-truths at different lengths. This allows the model to learn rich scale information. One intuitive method is to deploy a stack of convolutional layers and extract feature sequences at multiple layers. However, this is less efficient and computationally demanding. To overcome this problem, we propose using dilated convolutions at different scales [34, 35, 12].
Specifically, we adopt the 1D residual dilation block as the basic component. Apart from the shortcut, such a block consists of three consecutive convolutions: a channel reduction convolution, a dilated convolution for enlarging the receptive length, and a channel recovery convolution. By stacking multiple such blocks, segments at different scales can be perceived. As a consequence, more scale-sensitive information can be captured in feature representation learning. For a cost-effective design, the number of dilation blocks needs to be tuned, along with the dilation rates. Our optimal design has 4 blocks with dilation rates [2, 4, 6, 8].
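The effect of stacking dilation blocks on the receptive length can be worked out with simple arithmetic. A sketch assuming each block's dilated convolution has kernel size 3 (the kernel size is an assumption; the 1x1 channel reduction/recovery convolutions do not widen the field):

```python
def receptive_length(dilations, kernel_size=3):
    """Receptive length (in feature-sequence steps) after stacking one
    dilated convolution per residual block: each dilated conv widens
    the receptive field by (kernel_size - 1) * dilation."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

print(receptive_length([2, 4, 6, 8]))  # 1 + 2*(2+4+6+8) = 41
```

With the design above, each output feature would cover 41 feature-sequence steps, i.e., 41 * r packets of the original trace for a down-sampling rate r, without the cost of 41 plain convolution layers.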
Two-head predictor. One head is responsible for the classification task, and the other for the regression task. We adopt the anchor design. Specifically, we place one anchor segment per anchor length at each feature vector (i.e., each cell segment). For all the anchor segments, the classification head outputs the probability vectors and the regression head outputs the position offsets. Further, we deploy a convolution layer as a shortcut connection from the regression head to the classification head to facilitate learning.
5.3.2 Key designs
There are three key designs involved in generating proposal segments, as described below.
Anchor lengths. We define a set of fixed anchor lengths as the base, from which the predicted offsets are translated to the absolute positions of proposal segments. We cluster the lengths of ground-truth traces by k-means and take the resulting cluster centers as the anchor lengths.
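The clustering step can be sketched with plain Lloyd's k-means on scalar lengths. This is a minimal illustration; the paper does not specify the initialisation, so a deterministic quantile initialisation is assumed here:

```python
def kmeans_anchor_lengths(lengths, k, iters=20):
    """Plain Lloyd's k-means on scalar ground-truth trace lengths;
    the k cluster centers become the fixed anchor lengths."""
    data = sorted(lengths)
    # Deterministic quantile initialisation instead of random seeding.
    centers = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

lengths = [90, 100, 110, 900, 1000, 1100, 4900, 5000, 5100]
print(kmeans_anchor_lengths(lengths, k=3))  # [100.0, 1000.0, 5000.0]
```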
Anchor segments. For each cell segment, we assign one anchor segment per anchor length; over all cell segments of a multi-tab trace, this results in the full set of anchor segments. Each anchor segment's center point is at the corresponding cell segment, and its length is one of the anchor lengths.
Proposal segments. For each anchor segment, we generate one and only one proposal segment, so the numbers of proposal and anchor segments are equal. Each proposal segment shares the classification output of its anchor segment, and its position is calculated from the predicted position offsets and the anchor segment's position.
Taking one proposal segment as an example, suppose the length and center point position of the corresponding anchor segment are l_a and c_a, and the predicted length and center point offsets are Δl and Δc; then its length and center point are calculated from these quantities, with σ(·) denoting the sigmoid function.
5.4 Model Training
For model training, we give details for validness, objective loss function, and parameter update.
Validity. The proposal segments cannot be used to calculate the objective loss until they pass a validation process. Firstly, each proposal segment is matched against all the ground-truths according to the IoUT metric. Then, for each ground-truth, we sort its matched proposal segments by their L1-distance, take the k nearest as positive segments, and treat the rest as negative segments. During training, the similarity between a positive segment and its ground-truth, in both website class probability and location offset, is maximized; conversely, that of a negative segment is minimized. We discard invalid segments, including positive ones with too low IoUT and negative ones with too high IoUT, as they would harm model training.
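The assignment procedure above can be sketched as follows. The IoUT thresholds and the value of k are illustrative choices, not values from the paper, and the L1-distance is taken between segment centers (an assumption):

```python
def assign_segments(proposals, gts, k=3, pos_iout=0.5, neg_iout=0.3):
    """Label proposal segments for training. Each proposal/ground-truth
    is (start, end). Returns (positives, negatives) as index lists;
    segments failing the IoUT sanity checks are discarded as invalid."""
    def iout(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
        union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
        return inter / union

    def center(seg):
        return (seg[0] + seg[1]) / 2

    positives, negatives = set(), set(range(len(proposals)))
    for gt in gts:
        # k proposals nearest to this ground-truth by center L1-distance
        order = sorted(range(len(proposals)),
                       key=lambda i: abs(center(proposals[i]) - center(gt)))
        for i in order[:k]:
            if iout(proposals[i], gt) >= pos_iout:  # drop low-IoUT positives
                positives.add(i)
                negatives.discard(i)
    # drop high-IoUT negatives (too close to some ground-truth)
    negatives = {i for i in negatives
                 if all(iout(proposals[i], g) < neg_iout for g in gts)}
    return sorted(positives), sorted(negatives)

props = [(0, 99), (40, 139), (500, 599), (90, 189)]
gts = [(0, 99)]
print(assign_segments(props, gts, k=2))  # ([0], [2, 3])
```

Note that proposal 1 ends up neither positive (IoUT below 0.5) nor negative (IoUT above 0.3): it is exactly the kind of ambiguous segment the validation step discards.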
Objective loss function. For a ground-truth with website label w, the w-th estimated probability of its positive segments and of the negative segments is used to calculate the classification loss. For the regression loss, only the positive segments are used. Overall, for a multi-tab trace and its ground-truths, the final loss sums, over each ground-truth, the classification loss on its positive and negative segments and the regression loss on its positive segments. We use the focal loss for website classification, and the IoUT-based regression loss for segment localization.
For training, we adopt the standard supervised deep learning procedure. The stochastic gradient descent (SGD) algorithm is used to update the model parameters at every iteration, using the gradient of the training loss computed on the current mini-batch of multi-tab traces and their ground-truths.
5.5 Model Testing
In model testing (attack mode), among the candidate segments predicted at each cell segment, there is at most one correct monitored trace, so a filtering process is needed. We take two steps to filter out low-quality proposal segments: (1) we first remove segments whose maximum predicted website probability is lower than a predefined threshold, i.e., low-confidence segments; (2) among the remaining proposal segments, we then remove those that significantly overlap a higher-confidence segment, using the IoUT metric to measure overlap.
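This two-step filtering is the 1D analogue of greedy non-maximum suppression (NMS) in object detection. A minimal sketch with illustrative threshold values:

```python
def nms_1d(candidates, score_thresh=0.5, iout_thresh=0.5):
    """Filter candidate traces (start, end, label, score):
    (1) drop low-confidence candidates, then (2) greedily keep the
    highest-scoring candidate and remove any remaining one that
    overlaps it above the IoUT threshold."""
    def iout(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
        union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
        return inter / union

    kept = []
    pool = sorted((c for c in candidates if c[3] >= score_thresh),
                  key=lambda c: -c[3])
    while pool:
        best = pool.pop(0)
        kept.append(best)
        pool = [c for c in pool if iout(c, best) < iout_thresh]
    return kept

cands = [(0, 99, "siteA", 0.9), (10, 109, "siteA", 0.8),
         (200, 299, "siteB", 0.7), (0, 99, "siteC", 0.2)]
print(nms_1d(cands))  # keeps the 0.9 and 0.7 candidates
```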
5.6 Efficiency Optimization
For better efficiency, we further introduce three key designs: compact data representation, lightweight feature extractor, and 2-staged training.
5.6.1 Compact data representation
The raw traffic data of a multi-tab trace can be represented in one or more of these forms: the time or direction representation of its cell sequence, or the time or length representation of its burst sequence. For robustness against network dynamics, we do not consider the time representations.
Cell. Unlike packets, a cell is the basic data unit of Tor, with a fixed length of 512 bytes within the TLS records, denoted as (t, d), where t is the cell's timestamp and d ∈ {+1, -1} is the cell's direction: d = +1 means the cell is outgoing from the user to the website, and d = -1 means the cell is incoming from the website back to the user.
Burst. A burst is a maximal run of consecutive cells in the same direction, denoted as (t, l), where t is the timestamp of its first cell and l is its signed length, i.e., the number of cells in the burst signed by their direction.
Although the direction representation of a cell sequence is robust and well suited to representing a single-tab trace, the sequence is very long. In contrast, the length representation of the burst sequence is more compact, typically about one-tenth the length of its cell sequence, as verified by statistics on the single-tab trace dataset DS-19. Compared with the cell representation, the burst representation reduces the training time, accelerating the WF attack process.
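Converting a cell direction sequence into the burst length representation is a simple run-length collapse. A minimal sketch (the signed-length convention is an assumption consistent with the burst definition above):

```python
def cells_to_bursts(directions):
    """Collapse a cell direction sequence (+1 outgoing, -1 incoming)
    into a burst sequence of signed lengths: each burst is a maximal
    run of same-direction cells, its sign giving the direction."""
    bursts = []
    for d in directions:
        if bursts and (bursts[-1] > 0) == (d > 0):
            bursts[-1] += d          # extend the current burst
        else:
            bursts.append(d)         # start a new burst
    return bursts

cells = [+1, +1, -1, -1, -1, -1, +1, -1, -1]
print(cells_to_bursts(cells))  # [2, -4, 1, -2]
```

Here 9 cells collapse into 4 bursts; on real traces, where incoming cells arrive in long runs, the compression ratio is far larger, which is what makes the burst input cheaper to process.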
5.6.2 Lightweight feature extractor
Using burst sequences enables us to choose lightweight ResNets, with representative models 1D-ResNet18, 1D-ResNet34, and 1D-ResNet50. In ST-WFA, these models perform similarly. However, we find that 1D-ResNet18 performs best among them in MT-WFA.
5.6.3 2-staged training
As discussed in Section 4.4, the bootstrap time for WFA is crucial; in particular, the model training time takes a big portion of it. We propose a 2-stage training pipeline for acceleration. In the first stage, we pre-train our feature extractor on a large ST-WFA dataset, based on our observation that clean trace segments of multi-tab traces are similar to single-tab traces. In the second stage, we freeze the feature extractor and train only the remaining parts (the scale encoder and the two-head predictor) of our model on the MT-WFA training set. This reduces the whole training time, as the feature extractor requires no further updates.
6.1 Datasets and Protocols
We use two existing single-tab trace datasets. (1) AWF dataset: this dataset contains a total of 900 monitored target websites, each with 2,500 raw feature traces. All 900 websites are divided into a 576/144/180 split for training, validation and test, respectively. We use this dataset to pre-train the feature extractor of our WFD model. (2) DS-19 dataset: there are 10,000 raw feature samples from 100 monitored websites, along with another 10,000 unmonitored raw feature samples. We divide the DS-19 dataset into two disjoint parts: a training part (11,000 samples, including 55 samples per monitored website and 5,500 unmonitored samples) and a test part (9,000 samples, including 45 samples per monitored website and 4,500 unmonitored samples). We use the training part to generate the multi-tab training dataset for MT-WFA, and the test part to synthesize the multi-tab test dataset.
We create two new synthetic MT-WFA datasets, as there are none in the literature, followed by a GLUE dataset. (1) DS-19: in this MT-WFA dataset, each simulated multi-tab trace is a k-tab trace consisting of k single-tab traces with overlap between neighbours, in line with realistic multi-tab visiting patterns. There are three overlapping positions: tail (clean front segment), front (clean tail segment), and both ends (clean middle segment). For each case, we set six overlapping percentages from 10% to 60% with a step of 10%, giving a total of 18 different overlapping types. With the single-tab traces from the DS-19 training part, we generate k-tab traces for each k, with the overlapping types randomly selected. Given the base rate, each case contains a mix of monitored and unmonitored single traces. All these generated traces, with different k values, are used as MT-WFA training data. We generate the MT-WFA test data in the same way using the DS-19 test part.
(2) DS-19: this is a harder MT-WFA dataset. For the test data, we generate multi-tab traces for every combination of tab number and base rate, using the same overlapping setting. We create two training situations: (i) a normal training data size as in typical benchmarks; (ii) a small training data size. For the latter, only 10,800 traces (with k = 3 and base rate 1) are used for training.
(3) DS-19: This is a GLUE MT-WFA dataset . Each GLUE multi-tab trace is a k-tab trace with k single-tab traces glued by the GLUE defense. The same setup of k as in DS-19 is used, except that there is no overlapping between single-tab traces. For the FRONT setting, we follow Zero-delay's setting (N_s = N_c = 1100, W_min = 10 s, W_max = 15 s) . This dataset is designed for testing our WFD model against the latest defense, GLUE .
In the following experiments, we consider the more realistic open-world scenario.
6.2 Implementation Details
Benchmarks. We compare with the state-of-the-art MT-WFA model  originally proposed for tackling the GLUE defense. In particular, Gong et al.  took a two-stage strategy of split decision and split finding in the Coarse-Decided Score-Based (CDSB) framework. After single-tab traces are obtained via splitting, kNN, CUMUL, kFP, and DF are used to classify them.
Raw input. For the training and test of the CDSB series, the original cell sequence (both direction and time information) is used to represent trace samples. For our WFD model, the more compact burst sequence (length information) is used.
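A burst sequence compresses consecutive same-direction cells into signed run lengths, which is why it is roughly an order of magnitude shorter than the cell sequence. A minimal sketch of this standard conversion (the exact encoding used by WFD may differ):

```python
def cells_to_bursts(directions):
    """Compress a cell direction sequence (+1 outgoing, -1 incoming)
    into a burst sequence of signed run lengths, e.g.
    [+1, +1, -1, -1, -1, +1] -> [+2, -3, +1]."""
    bursts = []
    for d in directions:
        if bursts and (d > 0) == (bursts[-1] > 0):
            bursts[-1] += d  # extend the current same-direction burst
        else:
            bursts.append(d)  # direction flipped: start a new burst
    return bursts
```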
WFD model. We choose 1D ResNet18 as the feature extractor. For the scale encoder, we set 4 dilated blocks with dilation rates of [2, 4, 6, 8].
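For intuition, the temporal context covered by such a dilated stack can be computed as below; the kernel size of 3 is an assumption, as the paper does not state it here.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field of a stack of 1D convolutions: each layer with
    dilation d and kernel size k widens the field by (k - 1) * d."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# With the paper's dilation rates [2, 4, 6, 8] and kernel size 3:
# rf = 1 + 2 * (2 + 4 + 6 + 8) = 41 burst positions per output position.
```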
WFD Training. We pre-train the feature extractor on the AWF dataset in the first stage, then freeze it while training the remaining parts in the second stage. We adopt the SGD optimizer with a learning rate of 0.12 and a batch size of 48, and train for a total of 6,000 iterations. We use an NVIDIA 3090 GPU for all experiments.
6.3 Evaluation on Realistic MT-WFA
Setting. In this experiment, we use the multi-tab dataset DS-19 to train and test our WFD model and the CDSB series. For the training of CDSB, we divide the single-tab trace dataset DS-19 into two parts: 2000 samples for SPLITTRAIN and another 9000 samples for ATTACKTRAIN. This way, we use the same original training data for CDSB and our WFD for fair comparison. We use the officially released code of CDSB in our experiments. We use both the accuracy metrics (precision, recall, and mAP) and efficiency metrics (training time in minutes, attack speed in MBps).
In all accuracy metrics (mAP, precision, and recall), our WFD model is consistently superior to the state-of-the-art MT-WFA alternatives, the CDSB series (kNN, CUMUL, kFP, and DF). Notably, the most significant margin is achieved in the most difficult case, with a minimum edge of 61.49% in mAP, 78.23% in precision, and 56.23% in recall over the state-of-the-art alternatives. This demonstrates the robustness of our WFD across multi-tab traces of varying length.
Among the AP metrics, AP at a lower IoU threshold is always better than AP at a higher one. This is expected, as a higher IoU threshold is a stricter metric: it is harder to localize a trace more accurately.
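The AP metrics threshold on the temporal IoU between a predicted interval and the ground-truth interval of a single-tab trace. A minimal 1-D IoU helper (illustrative; the paper's evaluation code may compute it over cell or burst indices):

```python
def iou_1d(pred, gt):
    """Temporal IoU of two intervals given as (start, end) pairs."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

A prediction counts as correct at threshold T only if `iou_1d(pred, gt) >= T`, so raising T from 0.5 to 0.75 strictly shrinks the set of accepted localizations.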
As shown in Figure 5, the position of the clean segment (front, middle, or tail) is important for detection. When the clean segment is in the front, the detection performance is best. This is consistent with the fact that the front of a trace is the most informative part, which motivates the FRONT defense . Additionally, detection performance is better when the clean segment is located in the tail than in the middle.
As shown in Figure 6, the length of the clean segment is vital for detection, but the entire trace length is not. The longer the clean segment, the easier the trace is to detect: moving from short to long clean segments, mAP increases from 75.94% to 90.48%. The opposite holds for the entire trace length: the longer the entire trace, the more challenging it is to detect, because a long trace is more easily overlapped with others, which challenges the detection model significantly. These results suggest that the detection performance on entire traces depends on the length of their clean segments.
Besides the absolute length of clean segments, their proportion is another vital factor for detection. As shown in Figure 7, as the percentage of the clean segment increases, detection becomes easier, with mAP increasing from 68.67% to 72.28%.
For the training time, our WFD model takes the least time (54 minutes), almost twice the training speed of the previously fastest model, CDSB+kFP. Furthermore, our attack speed is also the best, about 20 times faster than the previously most accurate model, CDSB+DF.
On an ordinary NVIDIA 3090 GPU, WFD achieves an MT-WFA speed of 1208.30 MBps. Given a minimum webpage loading time of 1 second and an average traffic flow of 2.27 MB per webpage (average 4,441 cells per single-tab trace in DS-19), our WFD can simultaneously attack at least 532 (1208.3 / 2.27 ≈ 532) users in real time. With the mean 16 to 19 seconds needed to load web pages over the Tor browser , we can simultaneously attack 8,516 to 10,113 users in real time. This implies that our WFD has the potential for large-scale real-time WF attacks.
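The user-capacity figures follow directly from this throughput arithmetic:

```python
ATTACK_SPEED_MBPS = 1208.30  # measured WFD attack speed
PAGE_SIZE_MB = 2.27          # average traffic per webpage (4,441 cells)

def users_attackable(load_time_s):
    """Users whose traffic WFD can process in real time, assuming each
    user generates one page load of PAGE_SIZE_MB every load_time_s seconds."""
    return int(ATTACK_SPEED_MBPS * load_time_s / PAGE_SIZE_MB)

# users_attackable(1)  -> 532
# users_attackable(16) -> 8516
# users_attackable(19) -> 10113
```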
Table: Training data size vs. training time (minutes) and mAP.
6.4 Evaluation on Generality and Practicality
Setting. This experiment is conducted on DS-19. We consider two training cases: normal and small training data sizes. We test the two trained models on a large number of k-tab traces generated with different numbers k of single-tab traces and different base rates r. We use mAP as the accuracy metric. For efficiency metrics, we use minutes for training time and MBps for attack speed.
Given the normal training data size, when testing with a different base rate, WFD still achieves a similar mAP to that at the training base rate r = 10 (69.33% vs. 70.49%). This implies that our WFD model is robust to base rate changes.
Testing with k-tab traces of varying k, our WFD trained with only k = 3 single-tab traces per multi-tab trace still performs as well as in the normal training data case, with mAP varying from 60.55% to 69.33%. This suggests good generality of our WFD across varying numbers of single-tab traces in a single multi-tab trace, so there is no tedious need to prepare specific training data for every k. In contrast, the competitors all degrade clearly as the number of single-tab traces increases.
Encouragingly, even with small training data, WFD still performs well, giving at least 55.46% mAP, vs. 62.02% (the lowest mAP for the normal training data size). This suggests that our WFD model has good generality and practicality, which is important for realistic MT-WFA, where the base rate and the number of single-tab traces frequently change. Interestingly, the performance is even better when the base rate increases. This is favorable for attacking cases with large base rates, which often occur in realistic environments.
For the bootstrap time, given small training data, training needs only 64 minutes, vs. 105 minutes for the normal counterpart. This suggests that with WFD we do not need to spend much time on training data collection.
6.5 Evaluation on GLUE Defense
Setting. In this experiment, we use the DS-19 GLUE dataset for our WFD model and the CDSB series. Note that CDSB needs two trained WF models: (1) a “noisy model” trained on traces with FRONT noise to classify the first single-tab trace, which is defended with FRONT; and (2) a “clean model” trained without FRONT noise to classify the other single-tab traces. In contrast, we train a single WFD model by treating the FRONT-defended traces of each monitored website as traces of that website. We use precision, recall, and mAP as the accuracy metrics.
Although the single-tab traces are defended with the GLUE defense, our WFD model surpasses the CDSB series (kNN, CUMUL, kFP, and DF) consistently, with a minimum margin of 84.5% in precision, 69.2% in recall, and 64.75% in mAP.
In all metrics, all MT-WFA models consistently yield lower accuracy on multi-tab traces with overlapping than on the GLUE-defended ones. This suggests that realistic MT-WFA with overlapping is more challenging than attacking GLUE-defended traces.
As the number of single-tab traces in the glued trace increases, the performance of the CDSB series decreases. This is because the split decision and split finding of the CDSB series depend heavily on the number of single-tab traces.
For our WFD model, the performance is even better when the number of single-tab traces increases. This is because the proportion of the FRONT-defended single-tab trace becomes smaller given more single-tab traces. It also shows that the FRONT defense is still powerful against our WFD model, although strong results have been obtained on these glued traces.
In this work, we investigate for the first time the realistic and more challenging MT-WFA problem. This eliminates the unrealistic assumptions of sequential single-website visiting and manual trimming made in conventional ST-WFA works. Going beyond classifying manually trimmed single traces, we take a novel detection perspective and propose the first end-to-end Website Fingerprint Detection (WFD) model, designed specifically for solving the MT-WFA problem and inspired by the success of object detection in computer vision. Our model differs drastically from existing alternatives, which consider trace localization and classification independently and thus cannot optimize their compatibility. We also introduce several designs to improve model efficiency. Extensive experiments validate the significant superiority of our WFD over the state-of-the-art models in the open-world setting, with different overlapping cases, different numbers of single-tab traces, different base rates, with and without defense, and with normal and small training data.
-  R. Dingledine, N. Mathewson, and P. Syverson, “Tor: The second-generation onion router,” in Proceedings of the 13th USENIX Security Symposium, 2004, pp. 303–320.
-  T. Developers, “Users - tor metrics.” https://metrics.torproject.org/userstats-relay-country.html, 2021.
-  K. Abe and S. Goto, “Fingerprinting attack on tor anonymity using deep learning,” in Proceedings of the Asia-Pacific Advanced Network Research Workshop, 2016, pp. 15–20.
-  J. Hayes and G. Danezis, “k-fingerprinting: a robust scalable website fingerprinting technique,” in Proceedings of the 25th USENIX Security Symposium, 2016, pp. 1187–1203.
-  A. Panchenko, F. Lanze, and M. Henze, “Website fingerprinting at internet scale,” in Proceedings of the 16th Network and Distributed System Security Symposium, 2016.
-  A. Panchenko, L. Niessen, A. Zinnen, and T. Engel, “Website fingerprinting in onion routing based anonymization networks,” in Proceedings of the ACM Workshop on Privacy in the Electronic Society, 2011, pp. 103–114.
-  V. Rimmer, D. Preuveneers, M. Juarez, T. Van Goethem, and W. Joosen, “Automated website fingerprinting through deep learning,” in Proceedings of the Network and Distributed System Security Symposium, 2018.
-  P. Sirinam, M. Imani, M. Juarez, and M. Wright, “Deep fingerprinting: Undermining website fingerprinting defenses with deep learning,” in Proceedings of the ACM Conference on Computer and Communications Security, 2018, pp. 1928–1943.
-  T. Wang, X. Cai, R. Nithyanand, R. Johnson, and I. Goldberg, “Effective attacks and provable defenses for website fingerprinting,” in Proceedings of the 23rd USENIX Security Symposium, 2014, pp. 143–157.
-  T. Wang and I. Goldberg, “Improved website fingerprinting on tor,” in Proceedings of the ACM Workshop on Privacy in the Electronic Society, 2013, pp. 201–212.
-  ——, “On realistically attacking tor with website fingerprinting,” in Proceedings on Privacy Enhancing Technologies (PoPETs), 2016, pp. 21–36.
-  S. Bhat, D. Lu, A. Kwon, and S. Devadas, “Var-cnn: A data-efficient website fingerprinting attack based on deep learning,” Proceedings on Privacy Enhancing Technologies, vol. 2019, no. 4, pp. 292–310, 2019.
-  P. Sirinam, N. Mathews, M. S. Rahman, and M. Wright, “Triplet fingerprinting: More practical and portable website fingerprinting with n-shot learning,” in Proceedings of the ACM Conference on Computer and Communications Security, 2019, pp. 1131–1148.
-  M. Chen, Y. Wang, H. Xu, and X. Zhu, “Few-shot website fingerprinting attack,” Computer Networks, vol. 198, p. 108298, 2021.
-  M. Chen, Y. Wang, Z. Qin, and X. Zhu, “Few-shot website fingerprinting attack with data augmentation,” Security and Communication Networks, vol. 2021, pp. 1–13, 09 2021.
-  X. Gu, M. Yang, and J. Luo, “A novel website fingerprinting attack against multi-tab browsing behavior,” in 2015 IEEE 19th international conference on computer supported cooperative work in design (CSCWD). IEEE, 2015, pp. 234–239.
-  Y. Xu, T. Wang, Q. Li, Q. Gong, Y. Chen, and Y. Jiang, “A multi-tab website fingerprinting attack,” in annual computer security applications conference, 2018, Conference Proceedings, pp. 327–341.
-  J. Gong and T. Wang, “Zero-delay lightweight defenses against website fingerprinting,” in 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, Aug. 2020, pp. 717–734.
-  M. Juarez, M. Imani, M. Perry, C. Diaz, and M. Wright, “Toward an efficient website fingerprinting defense,” in Proceeding of the European Symposium on Research in Computer Security, vol. 9878, 2016, pp. 27–46.
-  T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794.
-  X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory undersampling for class-imbalance learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539–550, 2008.
-  P. Bermejo, J. Gámez, and J. Puerta, “Speeding up incremental wrapper feature subset selection with naive bayes classifier,” Knowledge-Based Systems, vol. 55, pp. 140–147, 01 2014.
-  M. Juarez, S. Afroz, G. Acar, C. Diaz, and R. Greenstadt, “A critical evaluation of website fingerprinting attacks,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 2014, pp. 263–274.
-  T. Wang, “Designing a better browser for tor with blast,” in Proceedings of the Network and Distributed System Security Symposium, 01 2020.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, 06 2014.
-  R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, pp. 91–99, 2015.
-  T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
-  Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Cheng, and J. Sun, “You only look one-level feature,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13 039–13 048.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
-  F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in International Conference on Learning Representations, 2016.
8 Appendix: Ablation Study
8.1 Data Representation
We study the impact of the original feature representation (the length representation of the burst sequence versus the directional representation of the cell sequence) on the performance of the WFA model.
Setting. To simplify the experimental setting, we choose the single-tab trace attack in the open-world scenario and use the well-known DF  as the WF attack model. On the single-tab trace dataset DS-19 , we train DF with each of the two feature representations. We use the officially released code of DF  and follow its default settings. We use precision and recall as the performance metrics.
Table: ST-WFA performance of DF with the cell sequence vs. the burst sequence representation.
The precision on burst sequences is almost the same as on cell sequences. For recall, the result is also acceptable, with only a small gap of 4%. There are two possible reasons: first, the data pattern in the burst sequence is less informative than in the cell sequence due to compression; second, the DF model is designed for the cell sequence, and dedicated models should be designed for the burst sequence.
The average length of a burst sequence is about one-tenth of that of a cell sequence. Even the maximum burst sequence length is only about two-fifths of the cell sequence's. The burst sequence is undoubtedly a compact original feature sequence.
The training time with burst sequences is almost two-fifths of that with cell sequences, consistent with the maximum-length ratio of burst to cell sequences. The test/attack time with burst sequences is almost one-fifth of that with cell sequences, less than the training time ratio (two-fifths) but more than the average length ratio (about one-tenth). The biggest advantage of the burst representation is its high efficiency in both training and testing, with slightly lower but still strong classification performance and considerable potential for improvement. Hence, we next seek a suitable model for the burst sequence.
8.2 Lightweight Pre-trained Feature Extractor
Since DF is specifically designed for the cell sequence rather than the burst sequence, we study the feature extraction ability of the 1D ResNet series (18, 34, 50), given their wide application. We select the most lightweight one, 1D ResNet18, whose detection performance is even the best. Using pre-training and parameter freezing, we reduce the bootstrap time cost to the minimum while preserving effectiveness.
Setting. We pre-train these models on the AWF training split, choose the best model on the validation split, and finally report their 100-way 20-shot performance on the test split. The 100-way 20-shot performance means that, with 100 new websites, we freeze the feature extractor and fine-tune only its classifier with 20 samples per website, then test the fine-tuned model in the closed-world scenario on another 15 randomly chosen samples per website. Except for the input of a 512-length burst sequence, we follow the same training and test setting as Chen et al. , with the logistic regression classifier. For the baseline TLFA, we choose the 5000-length cell sequence as input, DF  as the feature extractor, and logistic regression as the classifier. We use accuracy as the performance metric. For the WF detection evaluation of different feature extractors on DS-19, we keep the setting the same as in Section VII, but train for only 2,000 iterations.
Table: Training time per epoch (in seconds) of TLFA (DF, cell sequence) and the 1D ResNet series (18/34/50, burst sequence) on AWF.
For the few-shot WFA of these three pre-trained 1D ResNet models, their 100-way 20-shot accuracies are almost indistinguishable, about 98%, which is also almost the same as TLFA (DF, cell). From the perspective of single-tab WF attack effectiveness, the 1D ResNet series with the length representation of the burst sequence as input is comparable to TLFA (DF, cell) with the directional representation of the cell sequence.
In terms of pre-training time, 1D ResNet18's is clearly the lowest: about half of ResNet34's and TLFA (DF, cell)'s, and one-fifth of ResNet50's. This shows the pre-training time advantage of 1D ResNet18.
The parameter count of 1D ResNet18 (1,575,296) is of the same order of magnitude as DF's (1,438,784). Moreover, the burst sequence length is only about one-tenth of the corresponding cell sequence length. Hence, the combination of 1D ResNet18 and the burst sequence is the best choice not only for ST-WFA but also for MT-WFA.
To save training time of our WFD model, we freeze the parameters of these three pre-trained 1D ResNet models and use them as feature extractors. With fewer parameters, 1D ResNet18 performs better than its two competitors. Compared with 1D ResNet18, 1D ResNet34 and 1D ResNet50 have more frozen parameters and extract more complex data patterns, making the scale encoder and predictor harder to train on top of them. These results support our choice of the parameter-frozen, pre-trained 1D ResNet18 as the feature extractor of our WFD model.
Compared to training from scratch with the same number of training iterations, the WFD model with the parameter-frozen, pre-trained 1D ResNet18 as feature extractor performs clearly better, with a significant margin of 38.49% mAP. The WFD model trained from scratch would likely need more training time to reach the same performance as with the pre-trained feature extractor. This verifies the necessity and time savings of pre-training.
8.3 Single-tab Trace Segment Attack
In this section, we study the performance of ST-WFA on different types of segments of a single-tab trace whose front, tail, or both ends overlap with other traces. Across all overlap percentages, we focus on the impact of segment type on ST-WFA performance. The results highlight the significance of the clean segment and provide the empirical basis for how clean monitored segments help MT-WFA.
Setting. In a partly-overlapping single-tab trace, we distinguish the following segment types: the clean segment at the front of the trace, denoted Clean(Front), with its full trace denoted Full(Clean, Front); the clean segment at the tail, denoted Clean(Tail), with its full trace Full(Clean, Tail); the clean segment in the middle, denoted Clean(Middle), with its full trace Full(Clean, Middle); the overlapping segment at the front, denoted Overlap(Front); and the overlapping segment at the tail, denoted Overlap(Tail). On DS-19, using unmonitored traces, we overlap the single monitored traces at three different positions: front, tail, and both ends. We set nine overlapping percentages from 0.1 to 0.9 with a step of 0.1. For each segment type with a fixed percentage, we build training and test sets with the same data amount as the original DS-19. In total there are 27 different variants, according to the three positions and nine percentages.
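The segment types above are determined by where the overlap falls within a trace. The clean-segment span can be derived as in this sketch, where the even split of the overlap between the two ends in the "both" case is an assumption:

```python
def clean_span(trace_len, overlap_pos, overlap_pct):
    """Span (start, end) of the clean segment of a single-tab trace
    whose front, tail, or both ends overlap with neighbouring traces.
    overlap_pct is the fraction of the trace that is overlapped."""
    n = int(overlap_pct * trace_len)
    if overlap_pos == "front":   # clean tail segment
        return (n, trace_len)
    if overlap_pos == "tail":    # clean front segment
        return (0, trace_len - n)
    if overlap_pos == "both":    # clean middle segment
        return (n // 2, trace_len - n // 2)
    raise ValueError(overlap_pos)
```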
Results. The comparative results are reported in Fig. 10. We have the following observations and discussions.
For the clean front segments, evaluated with the precision metric, ST-WFA achieves the best performance on the clean front segment and its corresponding full trace, and the poorest performance on the overlapping front segments. Similar results hold for the recall metric.
For the clean tail segments, the attack performance on them is better than on their corresponding full traces, with only a slight advantage in precision but a considerable margin in recall.
For the clean middle segments, the attack performance on them is also clearly better than on their corresponding full traces, especially in recall.
For the overlapping segments, the performance is almost the poorest, except for a relative strength of the overlapping tail segment in precision.
For the position of segments, the front part is the easiest to attack successfully; if this part is overlapped, the attack becomes the most difficult. This is also the idea behind the FRONT defense .
For the segment types, the clean segment is undoubtedly the best target for ST-WFA. This is why we believe that clean monitored segments help our WF detection.