Encrypted DNS --> Privacy? A Traffic Analysis Perspective

by   Sandra Siby, et al.

Virtually every connection to an Internet service is preceded by a DNS lookup. These lookups are performed in the clear without integrity protection, enabling manipulation, redirection, surveillance, and censorship. In parallel with standardization efforts that address these issues, large providers such as Google and Cloudflare are deploying solutions to encrypt lookups, such as DNS-over-TLS (DoT) and DNS-over-HTTPS (DoH). In this paper we examine whether encrypting DNS lookups with DoH can protect users from traffic analysis-based monitoring and censoring. We find that performing traffic analysis on DoH traces requires different features than those used to attack HTTPS or Tor traffic. We propose a new feature set tailored to the characteristics of DoH traffic. Our classifiers obtain an F1-score of 0.9 and 0.7 in closed and open world settings, respectively. We show that although factors such as location, resolver, platform, or client affect performance, they are far from completely deterring the attacks. We then study deployed countermeasures and show that, in contrast with web traffic, Tor effectively protects users. Specified defenses, however, still preserve patterns and leave some websites unprotected. Finally, we show that web censorship is still possible by analysing DoH traffic and discuss how to selectively block content with low collateral damage.




1 Introduction

The Domain Name System (DNS) is a critical subsystem of the Internet infrastructure, on which most Internet applications depend. In the first quarter of 2019 alone, more than 5 trillion DNS messages were exchanged per month [1]. The vast majority of such messages are sent in the clear [2], exposing the destination of communications to a number of entities: Internet Service Providers (ISPs), Autonomous Systems (ASes), or state-level agencies can monitor users’ activities [3], enabling mass surveillance [4] and easing network censorship by filtering and redirecting DNS traffic [5, 6].

The lack of mechanisms to enhance DNS privacy raises serious concerns among advocates [7] and Internet governance and standardization bodies [8]. Among the solutions that have been proposed to prevent the inspection of domain names, two protocols have been standardized and deployed: DNS-over-TLS (DoT) [9] and DNS-over-HTTPS (DoH) [10]. These protocols protect the communication between the client and the recursive resolver. More specifically, DoH uses HTTP/2 over TLS, and is thus well suited for encrypting browsing-related DNS lookups [11]. Companies such as Google and Cloudflare have launched public DoH resolvers [12, 13], and Mozilla recently added DoH support to Firefox [14].

Under the assumption that encryption is enough to provide lookup confidentiality, existing evaluations of DoH implementations have focused on understanding the impact of the underlying transport protocol and encryption on performance [15, 16]. Yet, it is known that traffic features such as volume and timing can reveal the destination of the communication [17, 18, 19, 20, 21, 22].

In this paper we perform, to the best of our knowledge, the first traffic analysis study of encrypted DNS from a security and privacy angle. We consider an adversary placed between the client and the DNS resolver that aims at identifying which web page is visited by users, to either perform surveillance on users’ traffic or censor access to certain resources. We focus on the case of DoH, as its adoption by large industry actors makes it prevalent in the wild.

The particularities of DNS traffic make it resistant to traditional traffic analysis techniques [17, 18, 19, 20, 21, 22]. We identify a novel set of features based on n-grams that capture local characteristics of traces and enable successful traffic analysis for encrypted DNS. We show that this set of features is robust to changes in the environment (e.g., end-user location or evolution of pages over time) or in the client’s configuration (e.g., choice of client application, platform, or recursive DNS resolver). Furthermore, we find that our new feature set provides comparable or better results than the state of the art in website fingerprinting.

Motivated by our exchange with Cloudflare after responsible disclosure, we evaluate existing traffic analysis defenses: the standardized EDNS0 padding [23] and the use of Tor [24]. We find that in our setup, contrary to what was suggested by Cloudflare engineers, EDNS0 padding strategies cannot completely deter our attack. Also, as opposed to traditional web traffic fingerprinting, in which Tor offers little protection against traffic analysis, in the case of DoH, Tor is an extremely effective defense.

Finally, we measure the potential of encryption to hinder current DNS-based censorship practices. Using a novel information-theoretic model, we show that the sizes of the domain names associated with the resources embedded in a webpage visit (e.g., third-party services or content providers) are the primary source of information in DoH traffic, and that censors can use this information to maintain their practices without much impact on other traffic.

To summarize, our main contributions are as follows:

  • We conduct the first study of the vulnerability of DoH traffic to traffic analysis attacks. We show that traditional web fingerprinting techniques do not work on DoH and propose a new feature set to capture local characteristics (Section 5.1).

  • We show that traffic analysis is effective against DoH, achieving the same accuracy as regular web fingerprinting while requiring less volume of data. We show that factors such as end-user location, choice of recursive DNS resolver, client-side application, or platform affect, but do not stop, the attacks (Section 5).

  • We evaluate existing traffic analysis countermeasures and show that only Tor can fully protect DoH traces (Section 6).

  • We propose an information-theoretic model to evaluate the feasibility of DNS-based censorship when DNS lookups are encrypted (Section 7).

  • We gather the first dataset of encrypted DNS traffic collected in a wide range of environments (Section 4). Our dataset and code will be made public upon acceptance.

2 Background and Related Work

In this section, we provide background on the Domain Name System (DNS) and existing work on DNS privacy.

The Domain Name System (DNS) is primarily used for translating easy-to-read domain names to numerical IP addresses (over time, other applications have been built on top of DNS [25, 26]). This translation is known as domain resolution. In order to resolve a domain, a client sends a DNS query to a recursive resolver, a server typically provided by the ISP with resolving and caching capabilities. If the answer to the client's query is not cached by the recursive resolver, the resolver contacts a number of authoritative name servers, which hold a distributed database of domain-name-to-IP mappings. The recursive resolver traverses the hierarchy of authoritative name servers until it obtains an answer for the query, and sends it back to the client. The client can use the resolved IP address to connect to the destination host. Figure 1 summarizes this process.

Enhancing DNS Privacy. As with other network protocols, security was not a major consideration in the first versions of DNS, and thus DNS traffic has been sent in the clear over (in some cases, untrusted) networks. Over the last few years, security and privacy concerns have fostered the appearance of solutions aiming to make DNS traffic resistant to eavesdropping and tampering.

Early efforts for enhancing DNS security include protocols such as DNSSEC [27] and DNSCrypt [28]. DNSSEC introduces digital signatures to prevent manipulation of DNS data. It does not, however, provide confidentiality. DNSCrypt, first deployed by OpenDNS, both encrypts and authenticates DNS traffic between the client and the recursive resolver. However, it was never proposed to the IETF for standardization so it did not achieve wide adoption.

The IETF approved DNS-over-TLS (DoT) [9] and DNS-over-HTTPS (DoH) [10] as Standards Track protocols in 2016 and 2018, respectively. In DoT, a DNS client establishes a TLS session with a recursive resolver (usually on port TCP:853 [9] as standardized by IANA) and exchanges DNS queries and responses over the encrypted connection. To amortize costs, the TLS session between the client and the recursive DNS resolver is usually kept alive and reused for multiple queries.

In DoH, the local DNS resolver establishes an HTTPS connection to the recursive resolver and encodes the DNS queries as HTTP requests. DoH considers the use of HTTP/2’s Server Push mechanism, which enables the server to preemptively push to clients the DNS responses that are likely to follow a DNS lookup [29], thus reducing communication latency. As opposed to DoT, which uses a dedicated TCP port for DNS traffic and is thus easy to monitor and block, DoH lookups can be sent alongside non-DNS traffic using existing HTTPS connections (yet they remain potentially blockable at the IP level). However, DoT may be more convenient for enterprise network administrators, as it allows keeping tighter control over the DNS traffic.
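To make the encoding step concrete, the following minimal sketch builds a wire-format DNS query for an A record and places it in the `dns` parameter of a DoH GET request, as described in RFC 8484. The helper names are ours, the Cloudflare endpoint is used only as an example default, and nothing is sent over the network.

```python
import base64
import struct


def build_dns_query(name: str, qtype: int = 1) -> bytes:
    """Build a minimal wire-format DNS query (one question, class IN)."""
    # Header: ID=0 (RFC 8484 recommends 0 to help HTTP caching), flags=0x0100
    # (recursion desired), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    return header + question


def doh_get_url(name: str, endpoint: str = "https://cloudflare-dns.com/dns-query") -> str:
    """Return the URL of a DoH GET request for `name` (endpoint is an example)."""
    wire = build_dns_query(name)
    # RFC 8484 uses base64url without padding for the `dns` parameter.
    param = base64.urlsafe_b64encode(wire).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={param}"
```

Note that the encoded query grows with the length of the domain name, so the size of the enclosing request is not independent of the name being resolved.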

There are several available implementations of DoT and DoH. Cloudflare and Quad9 provide both DoH and DoT resolvers, Google supports DoH, and Android P (currently in beta version) has native support for both DoH and DoT. DoH enjoys widespread support from browser vendors. Firefox provides the option of directing DNS traffic to a trusted recursive resolver such as a DoH resolver, falling back to plaintext DNS if the resolution over DoH fails. Cloudflare also distributes a stand-alone DoH client and, in 2018, they released a hidden resolver that provides DNS over Tor, not only protecting lookups from eavesdroppers but also providing anonymity for clients towards the resolver. Other protocols, such as DNS-over-DTLS [30], an Experimental RFC proposed by Cisco in 2017, and DNS-over-QUIC [31], proposed to the IETF in 2017 by industry actors, are not widely deployed so far.

Several academic works study privacy issues related to DNS. Shulman suggests that encryption alone may not be sufficient to protect users [32]. Our results confirm her hypothesis that DNS response size variations can be a distinguishing feature. Herrmann et al. study the potential of DNS traces as identifiers to perform user tracking but do not consider encryption [33]. Finally, Imana et al. study privacy leaks on traffic between recursive and authoritative resolvers [34]. This is not protected by DoH and it is out of scope of our study.

3 Problem Statement

In this paper, we set out to answer the question: is it possible to infer which websites a user visits from observing encrypted DNS traffic? This information is of interest to multiple actors, e.g., entities computing statistics on Internet usage [35, 36], entities looking to identify malicious activities [37, 38, 6], entities performing surveillance [39, 3], or entities performing censorship [40, 41].

We consider an adversary that can collect traffic between the user and the DNS recursive resolver (red dotted lines in Figure 1), and thus can link lookups to a specific origin IP address. Such an adversary could be present on the users’ local network, near the resolver, or anywhere along the path (e.g., an ISP or compromised network router).

Figure 1: DNS resolution: To visit www.google.com, a user queries the recursive resolver for its IP. If the record is not cached, the recursive resolver queries an authoritative resolver and forwards the response to the client. The client uses the IP in the response to connect to the server via HTTP. We consider an adversary placed between the client and the resolver (i.e., observes the red dotted lines).

Depending on her location, the adversary may or may not observe the subsequent HTTP connection to the destination host. For instance, an adversary could be located in an AS that lies between the user and the resolver (e.g., when users rely on third-party DNS resolvers like Quad9 rather than their ISP-provided one), but not between the user and the destination host. We performed measurements from our university network and verified that this happens in a non-negligible number of cases. Furthermore, BGP hijacking attacks, which are becoming increasingly frequent [42], can be used to selectively intercept paths to DoH resolvers. In such cases, the adversary can only rely on DNS fingerprinting to learn which webpages are visited by a specific user, for monitoring or censorship [39, 3].

In the case of an adversary that also has access to the HTTP connection, one could argue that the subsequent HTTP(S) connection reveals visited domains even when encrypted. Fields such as the destination IP or the Server Name Indication (SNI) may reveal the visited domain to the adversary in the case of TLS traffic. That could be further aggravated by HTTP flows emanating without encryption from the same user machine [4]. However, with the increasing prevalence of virtual hosting and Content Delivery Networks, and the implementation of protocols such as IPv6 and TLS 1.3, determining the destination domain of the connection without traffic analysis becomes more difficult. Thus, data leaked by encrypted DNS becomes even more relevant. While the adversary could perform traditional website fingerprinting, we show that fingerprinting DoH achieves the same accuracy while requiring less data: our DoH traces are on average 5 times shorter in number of packets than HTTPS traces for web traffic.

We assume that the adversary has access to encrypted DNS traffic traces that are generated when the user visits a website via HTTP/S using DoH to resolve the IPs of the resources. A DNS trace, which we also call a DoH trace, comprises the resolution of the visited website's first-party domain and the subsequent resolutions for the resources contained in the website, e.g., images or scripts. For instance, when visiting Reddit, after resolving www.reddit.com, the client would resolve domains such as cdn.taboola.com, doubleclick.net and thumbs.redditmedia.com, among others.

We consider two different adversaries depending on their goals: first, monitoring the browsing behavior of users, which we study in Section 5; and second, censoring which pages users visit, which we study in Section 7. We note that there is a very important difference between these two goals regarding data collection. Monitoring does not require the adversary to take any action based on her observations. Thus, she can collect full traces to make her inferences as accurate as possible. In contrast, censorship adversaries need to find out which domain is being requested as fast as possible so as to interrupt the communication, so they must act on partial traces.

4 Data Collection

We collect traces for the top, middle, and bottom 500 webpages in Alexa’s top million websites list on 26 March 2018 (1,500 webpages in total). We visit each webpage in a round-robin fashion, obtaining up to 200 samples for every webpage. For our open world analysis, we collect traces of an additional 5,000 webpages from the top domains of the Alexa list. We collected data during two periods, from 26 August 2018 to 9 November 2018, and from 20 April 2019 to 14 May 2019. Data from these two periods is never mixed in the analysis.

To collect the traces we set up Ubuntu 16.04 virtual machines with DoH clients that send DNS queries to a public DoH resolver. We use Selenium (version 3.14.1, https://www.seleniumhq.org/) to automatically launch a full-fledged browser, visit a webpage from our list, and trigger the DNS lookups. We repeat this process for every webpage in the list, restarting the browser every time to ensure that the cache and profile do not affect collection. We run tcpdump to capture the network traffic between the DoH client and the resolver. We filter the traffic by destination port and IP to obtain the final DoH trace.
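As a rough sketch of one iteration of this collection loop, the code below starts tcpdump with a filter on the resolver's address and port, loads one page, and stops the capture. The interface name, resolver IP, port, and the `visit` callback (which could wrap a Selenium session created and closed for each page) are illustrative assumptions, not the exact setup used for the datasets.

```python
import subprocess
import time
from typing import Callable, List


def tcpdump_command(iface: str, resolver_ip: str, port: int, out_path: str) -> List[str]:
    """Return a tcpdump invocation that keeps only traffic exchanged with the resolver."""
    return [
        "tcpdump", "-i", iface, "-w", out_path,
        "host", resolver_ip, "and", "tcp", "port", str(port),
    ]


def capture_visit(url: str, visit: Callable[[str], None], cmd: List[str]) -> None:
    """Run tcpdump while `visit(url)` loads the page in a fresh browser."""
    proc = subprocess.Popen(cmd)
    try:
        time.sleep(1)  # arbitrary pause so tcpdump is listening before the visit
        visit(url)
    finally:
        proc.terminate()  # stop the capture even if the visit fails
        proc.wait()


# Example values (assumptions): interface eth0, resolver 1.1.1.1, HTTPS port 443.
example_cmd = tcpdump_command("eth0", "1.1.1.1", 443, "trace.pcap")
```

The filter `host ... and tcp port ...` matches both directions of the connection, so queries and responses end up in the same capture file.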

To study the influence of various parameters on DoH traffic, we collect data in different scenarios, varying the end user's location and platform, the DoH client and resolver, and the DNS traffic analysis defenses in place. Table 1 provides an overview of the collected datasets. To better understand the vulnerabilities of DNS encryption, we opted for heterogeneous experiments rather than in-depth studies of a few cases, which explains the difference in the number of samples among the datasets. In the following sections, we use the Identifier provided in the second column to refer to each of the datasets. Note that unless specified otherwise, we use Cloudflare’s DoH client.

Name | Identifier | # webpages | # samples
Desktop (Location 1) | LOC1 | 1,500 | 200
Desktop (Location 2) | LOC2 | 1,500 | 60
Desktop (Location 3) | LOC3 | 1,500 | 60
Raspberry Pi | RPI | 700 | 60
Firefox with Google resolver | GOOGLE | 700 | 60
Firefox with Cloudflare resolver | CLOUD | 700 | 60
Firefox with Cloudflare client | CL-FF | 700 | 60
Open World | OW | 5,000 | 3
DoH and web traffic | WEB | 700 | 60
DNS over Tor | TOR | 700 | 60
Cloudflare’s EDNS0 padding implementation | EDNS0-128 | 700 | 60
Recommended EDNS0 padding | EDNS0-468 | 700 | 60
Table 1: Overview of datasets.

Data curation. We curate the datasets to ensure that our results are not biased by spurious errors in collection, or website behaviors that are bound to generate classification errors unrelated to the characteristics of DNS traffic with respect to traffic analysis attacks.

Concretely, we aim at identifying two cases. First, the cases in which different domains generate the exact same DNS traces. These occur when webpages redirect to other pages or to the same resource, and when web servers return the same errors (e.g., 404 not found or 403 forbidden). Second, the cases in which websites change during collection for reasons other than their organic evolution, for instance, pages that go down during the collection period. When this happens, the captured traces do not represent the expected behavior of the page.

To identify these cases, we use the Chrome over Selenium crawler to collect the HTTP requests and responses (not the DNS queries and responses) of all the pages in our list in LOC1. Then we conduct two checks. First, we look at the HTTP response status of the top level domain, i.e., the URL that is being requested by the client. We identify the webpages that do not have an HTTP OK status. These could be caused by a number of factors, such as pages not found (404), anti-bot solutions, forbidden responses due to geoblocking [43] (403), internal server errors (500), and so on. We mark these domains as conflicting. Second, we confirm that the top level domain is present in the list of requests and responses. This ensures that the page the client is requesting is not redirecting the browser to other URLs. This check triggers some false alarms. For example, a webpage might redirect to a country-specific version (indeed.com redirecting to indeed.fr results in indeed.com not being present in the list of requests), or to another domain of the same service (amazonaws.com redirecting to aws.amazon.com). We do not consider these cases as anomalies. Other cases are full redirections. Examples are malware that redirects browser requests to google.com, webpages that redirect to GDPR country restriction notices, or webpages that redirect to domains that specify that the site is closed. We consider these cases as invalid webpages and add them to our list of conflicting domains.
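The two checks can be sketched as a single predicate. This is a simplified illustration under our own assumptions: the function name and inputs are hypothetical, and the benign redirections mentioned above (country-specific versions, same-service domains) would still be flagged here and would need to be reviewed separately.

```python
from typing import Iterable
from urllib.parse import urlparse


def is_conflicting(requested_url: str, status: int, observed_urls: Iterable[str]) -> bool:
    """Flag a page if its top-level response is not 200 OK, or if the
    requested host never appears among the observed requests/responses."""
    # Check 1: the top-level response must have an HTTP OK status.
    if status != 200:
        return True
    # Check 2: the requested host must appear in the crawl's request list;
    # otherwise the browser was redirected elsewhere.
    requested_host = urlparse(requested_url).hostname
    observed_hosts = {urlparse(u).hostname for u in observed_urls}
    return requested_host not in observed_hosts
```

For instance, a crawl of `https://indeed.com/` that only records requests to `indeed.fr` is flagged by the second check, which corresponds to the kind of false alarm described above.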

We repeat these checks multiple times over our collection period. We find 70 webpages that had invalid statuses at some point during our crawl, and 16 that showed some fluctuation in their status (from conflicting to non-conflicting or vice versa). We study the effects of keeping and removing these conflicting webpages in Section 5.2.

5 Website Fingerprinting through DNS

Website fingerprinting attacks enable a local eavesdropper to determine which pages a user is accessing over an encrypted and/or anonymized channel. Website fingerprinting has been shown to be effective on HTTPS [17, 18, 44], OpenSSH tunnels [45, 46], encrypted web proxies [47, 48] and VPNs [49], and even on anonymous communications systems such as Tor [50, 51, 52, 53, 19, 20, 21, 22].

Website fingerprinting exploits the fact that the size, timing, and order of TLS packets are a reflection of a website’s content. As resources are unique to each webpage, the traces identify the webpage. These patterns can be indirectly observed, even if the traffic has been encrypted or anonymized.

Some of the patterns exploited by website fingerprinting are correlated with patterns in DNS traffic. For instance, which resources are loaded and their order, determines the order of the corresponding DNS queries. Thus, it is likely that website fingerprinting can also be done on DNS traffic encrypted with protocols such as DNS-over-HTTPS (DoH). In this paper we call DNS fingerprinting the use of traffic analysis to identify the web page that generated a trace of encrypted DNS traffic, i.e., website fingerprinting on encrypted DNS traffic. In the following, whenever we do not explicitly specify whether the target of website fingerprinting is DNS or HTTPS traffic, we refer to traditional website fingerprinting on HTTPS traffic.

5.1 DNS traffic fingerprinting

As in website fingerprinting, we treat DNS fingerprinting as a supervised learning problem: the adversary first collects a training dataset of network traces for a set of pages, where the page (label) corresponding to a network trace is known. The adversary extracts features from the network traces (e.g., lengths of network packets) and trains a classifier to identify the page given a network trace. To deploy the attack, the adversary collects traffic from a target user and feeds it to the classifier to determine which page generated that traffic.

Traffic variability.

In website fingerprinting, factors such as network conditions and embedded third-party advertisements introduce variance in traffic traces sampled for the same website. Similarly, DNS traces also vary over time. Thus, the adversary must collect multiple samples for each page in order to obtain a robust representation of the page.

Some of this variability has a similar origin to that of web traffic. Examples include the dynamic nature of websites, which results in varying DNS lookups associated with third-party embedded resources; the platform where the client runs, the configuration of the DoH client, or the software using the client, which may vary the DNS requests (e.g., mobile versions of websites, or browsers’ use of pre-fetching); and the effects of content localization and personalization, which determine which resources are served depending on the location of the user or her actions (e.g., logged in or not).

Additionally, there are some factors specific to DNS traffic. Concretely, the effect of the local resolver, which depending on the state of the cache may or may not launch requests to the authoritative server, resulting in different traffic patterns; or the DNS-level load-balancing (e.g., CDNs) which may provide different IPs for a resource [54].

Feature engineering.

DNS traffic presents unique challenges with respect to web traffic for fingerprinting. Besides the extra traffic variability, DNS responses are smaller than web resources. In most cases, DNS requests and responses fit in one single TLS record, even if they are wrapped within HTTP requests like in DoH. These particularities hinder the use of traditional website fingerprinting features on DoH traffic.

As a matter of fact, in our preliminary experiments, we attempted to use features and techniques already used in the web traffic fingerprinting literature [19, 21]. Most of such features are based on aggregate metrics of traffic traces such as the total number of packets, total bytes, and their statistics (e.g., average, standard deviation). We found that these features are not as relevant for DoH traffic. For instance, the accuracy of the k-fingerprinting attack [21], which includes most website fingerprinting features considered in the literature, drops from 95% to just 74% when applied on DoH traffic (see Table 4).

We present a novel feature set that is specifically designed for encrypted DNS traffic. The key idea is to represent traces as n-grams of TLS record lengths. The intuition is that n-grams capture patterns in request-response size pairs, which are especially relevant for DoH, as TLS records often contain either a request or a response. To some extent, they also capture the local order of the length sequence. We take tuples of consecutive TLS record lengths in each DoH trace and count the number of their occurrences in that trace. For instance, for the trace , the uni-grams are and the bi-grams are . To the best of our knowledge, n-grams had never been considered as features in the website fingerprinting literature.
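As an illustration of the n-gram representation, the following sketch counts uni-grams and bi-grams over a hypothetical sequence of TLS record lengths. The trace values and the sign convention (negative for client-to-resolver records, positive for resolver-to-client records) are assumptions made for this example only.

```python
from collections import Counter
from typing import Sequence


def ngram_counts(lengths: Sequence[int], n: int) -> Counter:
    """Count every tuple of n consecutive entries in the sequence."""
    return Counter(
        tuple(lengths[i:i + n]) for i in range(len(lengths) - n + 1)
    )


# Hypothetical trace of signed TLS record lengths (not taken from the datasets).
trace = [-64, 88, -64, 88, 33]
unigrams = ngram_counts(trace, 1)  # e.g. (-64,) occurs twice
bigrams = ngram_counts(trace, 2)   # e.g. (-64, 88) occurs twice
```

Each distinct tuple becomes one feature, and its count in the trace is the feature value; tuples that never occur in a trace implicitly have value 0.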

We extend the n-gram representation to traffic bursts. Bursts are sequences of consecutive packets in the same direction (either incoming or outgoing). Bursts correlate with the number and order of resources embedded in the page and thus are a good candidate feature for DoH traffic fingerprinting. Additionally, they are more robust to small changes in order than individual sizes because they aggregate several records in the same direction. We represent n-grams of bursts by taking tuples of burst lengths in the burst sequence. In the previous example, the burst-length sequence of the trace above is and the burst bi-grams are .
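The burst representation can be sketched in the same way. Here we assume that the length of a burst is the sum of the signed record lengths it contains; the trace is again hypothetical, with negative values for outgoing and positive values for incoming records.

```python
from collections import Counter
from itertools import groupby
from typing import List, Sequence


def burst_lengths(trace: Sequence[int]) -> List[int]:
    """Collapse each run of same-direction records into one signed burst length."""
    return [sum(run) for _, run in groupby(trace, key=lambda size: size > 0)]


def burst_bigrams(trace: Sequence[int]) -> Counter:
    """Count pairs of consecutive burst lengths."""
    bursts = burst_lengths(trace)
    return Counter(zip(bursts, bursts[1:]))


# Hypothetical trace: one outgoing record, two incoming records, one outgoing record.
trace = [-64, 88, 33, -33]
bursts = burst_lengths(trace)
pairs = burst_bigrams(trace)
```

In this example the two incoming records of 88 and 33 bytes merge into a single burst of 121, so swapping their order would leave the burst sequence unchanged.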

We experimented with uni-, bi- and tri-grams for both types of features. Tri-grams yielded only a marginal improvement in classifier performance, at a substantial cost in memory. We also experimented with the timing of packets but, as in website fingerprinting [51], we found it unreliable because it depends more on the state of the network than on the content being served. Thus, timing features encode little information about the visited website.

In our experiments we use the concatenation of uni-grams and bi-grams of both TLS record sizes and bursts as feature set.

Algorithm selection.

After experimenting with different supervised classification algorithms, we decided to use Random Forests (RF), which have been demonstrated to be very effective for traffic analysis tasks [21, 55].

Random forests (RF) are ensembles of simpler classifiers called decision trees. Decision trees use a tree data structure to represent splits of the data: nodes represent a condition on one of the data features and branches represent decisions based on the evaluation of that condition. In decision trees, feature importance in classification is measured with respect to how well they split samples with respect to the target classes. The more skewed the distribution of samples into classes is, the better the feature discriminates. Thus, a common metric for importance is the Shannon’s entropy of this distribution. Decision trees, however, do not generalize well and tend to overfit the training data. RFs mitigate this issue by randomizing the data and features over a large amount of trees, so that different subsets of features and data are used in each tree. The final decision of the RF is an aggregate function on the individual decisions of its trees. In our experiments we use 100 trees and a majority vote of the trees as the aggregate function.
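To illustrate how entropy measures the quality of a split, the sketch below computes the Shannon entropy of a class distribution and the reduction in entropy obtained by thresholding one feature. It is a simplified illustration with hypothetical function names, not the implementation used in our experiments; random forest libraries may also use other criteria, such as Gini impurity.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the class distribution of `labels`."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(labels: Sequence[str], values: Sequence[float], threshold: float) -> float:
    """Entropy reduction obtained by splitting samples on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    # Weighted entropy of the two branches; empty branches contribute nothing.
    weighted = sum(
        len(part) / len(labels) * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - weighted
```

A feature whose split separates the classes perfectly yields the maximum gain, while a feature that leaves each branch as mixed as the parent node yields a gain of zero.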

Validation. We evaluate the effectiveness of our classifier measuring the Precision, Recall and F1-Score (Appendix A) in two scenarios typically used in the web traffic analysis literature. A closed world, in which the adversary knows the set of all possible webpages that users may visit; and an open-world, in which the adversary only has access to a set of monitored sites, and the user may visit webpages outside of this set.
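For reference, the per-class metrics can be computed from true and predicted labels as in the sketch below. It uses the standard definitions of Precision, Recall, and F1-score, which we assume match those in Appendix A; the function name is ours.

```python
from typing import Dict, Sequence


def per_class_scores(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> Dict[str, float]:
    """Precision, Recall, and F1-score for one class (one webpage)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly identified
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly attributed to label
    fn = sum(t == label and p != label for t, p in pairs)  # missed visits to label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because the F1-score is the harmonic mean of Precision and Recall, it always lies between the two for a given class.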

We use 10-fold cross-validation in all of our experiments to measure biases related to the overfitting of the classifier. Cross-validation is a standard methodology to evaluate overfitting in machine learning. In 10-fold cross-validation, the samples of each class are divided into ten disjoint sets. The classifier is then trained on nine of the sets and tested on the remaining one, and the process is repeated so that each set is used for testing once, providing ten measurements of the classifier’s performance on samples on which it has not been trained. This gives an idea of how the classifier generalizes to unseen examples.
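The splitting procedure can be sketched as follows. This is a minimal illustration with hypothetical helper names: it distributes the samples of each class across the folds in order and does not shuffle them, which a real evaluation would typically do first.

```python
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[str], k: int = 10) -> List[List[int]]:
    """Assign sample indices to k disjoint folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # round-robin keeps classes balanced
    return folds


def train_test_splits(labels: Sequence[str], k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = stratified_folds(labels, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With 60 samples per webpage and ten folds, each test fold would hold 6 samples of every webpage and the classifier would be trained on the remaining 54.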

5.2 Evaluating n-gram features

In this section we evaluate the effectiveness of our n-grams based website fingerprinting attack on DNS traffic in the closed- and open-world scenarios, as well as on HTTPS traffic.

Closed world. We first study a closed world setting in which the adversary knows the set of webpages visited by a user. Table 2 shows the classifier’s performance on the LOC1 dataset. We observe that considering the 1,414 curated webpages (see Section 4) instead of all 1,500 webpages results in just a 1% performance increase. Thus, in the remaining experiments we use the complete dataset.

Scenario Precision Recall F1-score
Curated traces
Full dataset
Combined labels
Table 2: Classifier performance for LOC1 dataset (mean and standard deviation for 10-fold cross validation).

We notice that the Alexa ranking contains URLs that refer to regional versions of the same service. For example, google.es and google.co.uk both point to Google, but are considered as two separate webpages in our dataset. Even though our classifier often misclassifies these cases, from an adversary’s point of view, they can be considered equivalent classes. The third row in the table shows that considering classifications within the equivalence class of a domain as a success results in a performance improvement of 3-4%. See Figures 11 and 13 in the Appendix for the confusion graphs of this evaluation.

As pointed out by prior work on website fingerprinting, average metrics can give an incomplete and biased view of the classification results [56]. This is because the classifier’s performance may vary significantly between different individual classes. We observe that this is also the case in DoH. Figure 2 depicts individual classes in a scatterplot: each dot is a website and its color represents the absolute difference between Precision and Recall: blue indicates 0 difference and red indicates maximum difference (i.e., |Precision - Recall| = 1). We see that, for some webpages the classifier obtains low Precision but high Recall (red dots on the right of the Precision scatterplot) and, conversely, there are pages with high Precision but low Recall (red dots on the right of the Recall scatterplot). The latter case is very relevant for privacy since, every time the adversary identifies one of these pages, she is absolutely sure her guess is correct. In censorship, for instance, this enables the censor to block with certainty.

Figure 2: Performance per class in LOC1. Each dot represents a class and its color the absolute difference between Precision and Recall (blue low, red high).
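To make the per-class view concrete, the following minimal Python sketch computes Precision, Recall, and their absolute difference for each class from true and predicted labels. The function name `per_class_gap` and the toy labels are illustrative assumptions, not part of the original evaluation code.

```python
from collections import Counter

def per_class_gap(y_true, y_pred):
    """Return {label: (precision, recall, |precision - recall|)}.

    Hypothetical helper for illustration; labels can be any hashable values.
    """
    actual = Counter(y_true)      # how often each class truly occurs
    predicted = Counter(y_pred)   # how often each class was predicted
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    result = {}
    for label in actual:
        # Precision is taken as 0 when the class is never predicted.
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / actual[label]
        result[label] = (precision, recall, abs(precision - recall))
    return result

# Toy data: class "b" absorbs errors from "a" and "c".
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "b", "c"]
gaps = per_class_gap(y_true, y_pred)
```

In this toy example, class "a" has high Precision and low Recall, which is the case the text singles out as most relevant for privacy: whenever "a" is predicted, the prediction is correct.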

Adversary’s effort. To get an intuition about the data collection effort required by an adversary, we study how the classifier’s performance improves with the number of samples used for training. We see in Table 3 that after 20 samples there are diminishing returns in increasing the number of samples per domain. To minimize the data collection effort, we collected 60 samples per domain for all our datasets except for the unmonitored websites in the open world, for which we collected three samples per domain.

Number of samples Precision Recall F1-score
10 0.873 0.866 0.887
20 0.897 0.904 0.901
40 0.908 0.914 0.909
100 0.912 0.916 0.913
Table 3: Classifier performance for different number of samples in the LOC1 dataset averaged over 10-fold cross validation (standard deviations less than 1%).

We observe a difference with respect to prior work on web traffic analysis. Website fingerprinting studies in Tor report more than 10% increase between 10 and 20 samples [51] and between 2% and 10% between 100 and 200 samples [57, 22]. In DNS, we see a small increase between 10 and 20 samples, and a negligible difference after 20 samples.

We believe the reason why fingerprinting DoH requires fewer samples per domain is DoH’s lower intra-class variance with respect to encrypted web traffic. One reason for this difference could be the presence of advertisements, which are an important source of intra-class variance in web traffic. They often change across visits, varying the sizes of the resources associated to the advertisement. However, the variance that advertisements add to DNS traffic might be more limited. Some publishers rely on ad-networks for ad mediation and, in some cases, the ad-network’s domain and not the advertiser’s will appear when fetching all the advertisements in the page [58].

Open world. In the previous experiments, the adversary knew that the webpage visited by the victim was within the training dataset. We now evaluate the adversary’s capability to distinguish those webpages from other unseen traffic. Following prior work [55, 59] we consider two sets of webpages, one monitored and one unmonitored. The adversary’s goal is to determine whether a test trace belongs to a page within the monitored set.

We train a classifier with monitored and unmonitored samples. Since it is not realistic to assume that an adversary can have access to all unmonitored classes, we create unmonitored samples using 5,000 webpage traces formed by a mix of the OW and LOC1 datasets. We divide the classes such that 1% of all classes are in the monitored set and 10% of all classes are used for training. We ensure that the training dataset is balanced, i.e., it contains an equal number of monitored and unmonitored samples, and that the test set contains an equal number of samples from classes used in training and classes unseen by the classifier. To perform cross-validation, we run 10 folds. To ensure that our classifier generalizes well to any unseen data, in every fold we consider a different combination of the monitored and unmonitored classes for training and testing.

To decide whether a target trace is monitored or unmonitored, we use a method proposed by Stolerman et al. [60]. We assign the target trace to the monitored class if and only if the classifier predicts this class with probability larger than a threshold $t$, and to unmonitored otherwise.

We show in Figure 3 the average Precision-Recall ROC curve for the monitored class over 10 iterations, varying the discrimination threshold, $t$, from 0 to 0.99 in steps of 0.1. We also show the random classifier, which indicates the probability of selecting the positive class uniformly at random and acts as a baseline. We see that, for a suitable choice of $t$, the classifier has an F1-score of 0.7. This result suggests that traffic analysis is a true threat to DNS privacy.

Figure 3: Precision-Recall ROC curve for open world classification, for the monitored class. The threshold, $t$, is varied from 0.0 to 0.99 in steps of 0.1 (standard deviation less than 1%).
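The threshold rule and the resulting Precision-Recall trade-off can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy probabilities, and the convention of reporting Precision as 1.0 when nothing is flagged are all choices made here, not details taken from the original experiments.

```python
def open_world_decision(p_monitored, t):
    """Label a trace 'monitored' iff its predicted probability exceeds t."""
    return "monitored" if p_monitored > t else "unmonitored"

def precision_recall(probs, labels, t):
    """Precision and Recall of the monitored class at threshold t."""
    decisions = [open_world_decision(p, t) for p in probs]
    pairs = list(zip(decisions, labels))
    tp = sum(d == "monitored" and y == "monitored" for d, y in pairs)
    fp = sum(d == "monitored" and y == "unmonitored" for d, y in pairs)
    fn = sum(d == "unmonitored" and y == "monitored" for d, y in pairs)
    # Assumed convention: precision is 1.0 when no trace is flagged.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy classifier outputs: probability assigned to the monitored class.
probs = [0.95, 0.80, 0.60, 0.40, 0.10]
labels = ["monitored", "monitored", "unmonitored", "monitored", "unmonitored"]

# Sweeping the threshold traces out one point of the curve per value.
curve = [(t, *precision_recall(probs, labels, t)) for t in (0.0, 0.5, 0.9)]
```

Raising the threshold trades Recall for Precision: at a threshold of 0.0 every trace is flagged, while at 0.9 only the most confident prediction is.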

Web traffic fingerprinting. Finally, we evaluate the suitability of our n-gram features for performing traditional web traffic fingerprinting. We compare them to the feature set of the k-Fingerprinting attack, which includes a comprehensive set of features used in the website fingerprinting literature [21]. For the comparison we scaled down the closed world, but kept the same number of websites and samples per website for both feature sets. We used a random forest with the same parameters as the classification algorithm in both cases.

We use the WEB dataset to evaluate the performance of the classifiers on only DoH traffic (DoH-only), only HTTPS traffic corresponding to web content traffic (Web-only), and no filter (DoH+Web). As shown in Table 4, not only do the n-grams achieve better performance than the k-Fingerprinting features on the DoH-only dataset but, surprisingly, they also outperform the k-Fingerprinting features on the Web-only and DoH+Web datasets.

DoH-only Web-only DoH + Web
n-grams 0.87 0.99 0.88
k-Fingerprinting [21] 0.74 0.95 0.79
Table 4: F1-Score of the n-grams and k-Fingerprinting features for different subsets of traffic: only DoH traffic (DoH-only), only HTTPS traffic corresponding to web traffic (Web-only), and the full trace (DoH+Web).

In both cases, an adversary who is able to intercept all communications, both with the resolver and the web server, can improve the success of the attack by adding web traffic, as shown by the increase in F1-Score between the first and the last columns. However, such an adversary is better off discarding DoH traffic. We hypothesize that the added variability of DoH adds noise in small sites, increasing the classifier errors.

5.3 DNS Fingerprinting Robustness

In practice, the capability of the adversary to distinguish websites is very dependent on environmental characteristics and differences in the setup while collecting data [61]. To understand the impact of the environment on DNS fingerprinting success we run experiments exploring three environmental dimensions: time, space, and infrastructure.

5.3.1 Robustness over time

DNS traces vary due to the dynamism of webpage content and variations in DNS responses (e.g., service IP changes because of load-balancing). We now study how this variability impacts the performance of the classifier.

We collect data in LOC1 for 10 weeks, from the end of September to the beginning of November 2018. We divide this period into five intervals, each containing two consecutive weeks, and report in Table 5 the F1-score of the classifier when we train it on data from a single interval and use the other intervals as test data (0 weeks old denotes data collected in November). In most cases, the F1-score does not significantly decrease within a period of 4 weeks. Longer periods result in significant drops: the F1-score drops by more than 10% when training and testing are separated by 8 weeks.

F1-score      0 weeks old   2 weeks old   4 weeks old   6 weeks old   8 weeks old
0 weeks old   0.880         0.827         0.816         0.795         0.745
2 weeks old   0.886         0.921         0.903         0.869         0.805
4 weeks old   0.868         0.898         0.910         0.882         0.817
6 weeks old   0.775         0.796         0.815         0.876         0.844
8 weeks old   0.770         0.784         0.801         0.893         0.906
Table 5: F1-score when training on the interval indicated by the row and testing on the interval in the column (standard deviations less than 1%). We use 20 samples per webpage (the maximum number of samples collected in all intervals).

This indicates that, to obtain the best performance, the adversary should collect data at least once a month. However, it is unlikely that DNS traces change drastically. To account for gradual changes, the adversary can perform continuous collection and mix data across weeks. In our dataset, if we combine two- and three-week-old samples for training, we observe only a slight decrease in performance. Thus, a continuous collection strategy can suffice to maintain the adversary’s performance without requiring large periodic collection efforts.

5.3.2 Robustness across locations

DNS traces may vary across locations due to several reasons. First, DNS lookups vary when websites adapt their content to specific geographic regions. Second, popular resources cached by resolvers vary across regions. Finally, resolvers and CDNs use geo-location methods for load-balancing requests, e.g., using anycast and EDNS [62, 63].

We collect data in three locations, two countries in Europe (LOC1 and LOC2) and a third in Asia (LOC3). Table 6 (leftmost) shows the classifier performance when crossing these datasets for training and testing. When trained and tested on the same location, the classifier unsurprisingly yields results similar to the ones obtained in the base experiment. When we train and test on different locations, the F1-score decreases by between 16% and 27%, the greatest drop happening for the farthest location, LOC3, in Asia.

Interestingly, even though LOC2 yields similar F1-Scores when cross-classified with LOC1 and LOC3, the similarity does not hold when looking at Precision and Recall individually. For example, training on LOC2 and testing on LOC1 results in around 77% Precision and Recall, but training on LOC1 and testing on LOC2 yields 84% Precision and 65% Recall. To understand the reasons behind this asymmetry, we build a classifier trained to separate websites that obtain high recall (top 25% quartile) and low recall (bottom 25% quartile) when training with LOC1 and LOC3 and testing in LOC2. A feature importance analysis on this classifier shows that LOC2’s low-recall top features have a significantly lower importance in LOC1 and LOC3. Furthermore, we observe that the intersection between LOC1 and LOC3’s relevant feature sets is slightly larger than their respective intersections with LOC2. While it is clear that the asymmetry is caused by the configuration of the network in LOC2, its exact cause remains an open question.

5.3.3 Robustness across infrastructure

Influence of DoH Resolver. We study two commercial DoH resolvers, Cloudflare’s and Google’s. Contrary to Cloudflare, Google does not provide a stand-alone DoH client. To keep the comparison fair, we instrument a new collection setting using Firefox in its trusted recursive resolver configuration with both DoH resolvers.

Location   LOC1     LOC2     LOC3
LOC1       0.906    0.712    0.663
LOC2       0.748    0.908    0.646
LOC3       0.680    0.626    0.917

Resolver   GOOGLE   CLOUD
GOOGLE     0.880    0.129
CLOUD      0.862    0.885

Platform   DESKTOP  RPI
DESKTOP    0.8802   0.0003
RPI        0.0002   0.8940

Client     CLOUD    CL-FF    LOC2
CLOUD      0.885    0.349    0.000
CL-FF      0.109    0.892    0.069
LOC2       0.001    0.062    0.908
Table 6: Performance variation with changes in location and infrastructure (F1-score, standard deviations less than 2%).

Table 6 (center-left) shows the result of the comparison. As expected, training and testing on the same resolver yields the best results. In particular, we note that even though Google hosts other services behind its resolver’s IP, and thus DoH traffic may be mixed with the visited website’s traffic (e.g., if a webpage embeds Google third-party content), the classifier performs equally well for both resolvers.

As in the location setting, we observe an asymmetric decrease in one of the directions: training on the GOOGLE dataset and attacking CLOUD results in a 13% F1-score, while attacking GOOGLE with a classifier trained on CLOUD yields results similar to training on GOOGLE itself.

Figure 4: Top 15 most important features in Google’s and Cloudflare’s datasets. On the left, features are sorted by the results on Google’s dataset and, on the right, by Cloudflare’s.

To investigate this asymmetry, we rank the features according to their importance for the classifiers. For simplicity, we only report the result on length unigrams, but we verified that our conclusions hold when considering all features together. Figure 4 shows the top-15 most important features for a classifier trained on Google’s resolver (left) and Cloudflare’s (right). The rightmost diagram of each column shows the importance of these features on the other classifier. Red tones indicate high importance, and dark colors represent irrelevant features. Grey indicates that the feature is not present.

We see that the most important features in Google are either not important or missing in Cloudflare (the right column in the left-side heatmap is almost gray). As the missing features are very important, they induce erroneous splits early in the trees, and for a larger fraction of the data, causing the performance drop. However, only one top feature in the classifier trained on Cloudflare is missing in Google, and the others are also important (right column in the right-side heatmap). Google does miss some features present in Cloudflare, but they are of little importance and their effect on performance is negligible.

Influence of user’s platform. We collect traces for the 700 top Alexa webpages on a Raspberry Pi (RPI dataset) and an Ubuntu desktop (DESKTOP dataset), both from LOC1. We see in Table 6 (center-right) that, as expected, the classifier has good performance when the training and testing data come from the same platform. However, it drops to almost zero when crossing the datasets.

To understand this drop, we take a closer look at the TLS record sizes from both platforms. We found that TLS records in the DESKTOP dataset are on average 7.8 bytes longer than those in RPI (see Figure 10 in Appendix C). We repeated the cross classification after adding 8 bytes to all RPI TLS record sizes. Even though the classifiers do not reach the base experiment’s performance, we see a significant improvement in cross-classification F1-score, both when training on DESKTOP and testing on RPI and when training on RPI and testing on DESKTOP.
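The size adjustment described above can be sketched in a few lines of Python. The sketch assumes traces are stored as signed record sizes (positive for outgoing, negative for incoming, as in Section 7.1) and that the offset is applied to the magnitude of every record in both directions; the source does not state how the adjustment was implemented, so both points are assumptions, as is the function name.

```python
def shift_record_sizes(trace, offset):
    """Add `offset` bytes to the magnitude of each record, keeping its direction.

    `trace` is a list of non-zero signed integers (hypothetical representation).
    """
    return [s + offset if s > 0 else s - offset for s in trace]

# Toy RPI trace, shifted by 8 bytes to approximate DESKTOP record sizes.
rpi_trace = [120, -310, 95, -1200]
aligned = shift_record_sizes(rpi_trace, 8)
```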

Influence of DNS client. Finally, we consider different client setups: Firefox’s trusted recursive resolver or TRR (CLOUD), Cloudflare’s DoH client with Firefox (CL-FF) and Cloudflare’s DoH client with Chrome (LOC2). We collected these datasets in location LOC2 using Cloudflare’s resolver.

Table 6 (rightmost) shows that the classifier performs as expected when trained and tested on the same client setup. When the setup changes, the performance of the classifier drops dramatically, reaching zero when we use different browsers. We hypothesize that the decrease between CL-FF and LOC2 is due to differences in the implementation of Firefox’s built-in and Cloudflare’s standalone DoH clients.

Regarding the difference when changing browser, we found that Firefox’s traces are on average 4 times longer than Chrome’s. We looked into the unencrypted traffic to understand this difference. We used a proxy to man-in-the-middle the DoH connection between the client and the resolver (https://github.com/facebookexperimental/doh-proxy), obtaining the OpenSSL TLS session keys with Lekensteyn’s scripts (https://git.lekensteyn.nl/peter/wireshark-notes). We use this proxy to decrypt DoH captures for Firefox configured to use Cloudflare’s resolver, but we could not do the same for Google. Instead, we man-in-the-middle a curl-doh client (https://github.com/curl/doh), whose traces are also substantially shorter than Firefox’s. We find that Firefox, besides resolving domains related to the URL we visit, also issues resolutions related to OCSP servers, captive portal detection, the user’s profile/account, web extensions, and other Mozilla servers. As a consequence, traces in the CL-FF and CLOUD datasets are substantially larger and contain different TLS record sizes than any of our other datasets. We conjecture that Chrome performs similar requests but, since its traces are shorter, we believe it performs fewer such checks than Firefox.

5.3.4 Robustness Analysis Takeaways

The results in the previous section reveal that, to obtain the best results across different configurations, the adversary would need to train a classifier for each targeted setting. Then, of course, she would need to be able to identify her victim’s configuration. Kotzias et al. demonstrated that identifying the client or resolver is possible, for instance by examining the IP (if the IP is dedicated to the resolver) or fields in the ClientHello of the TLS connection (such as the Server Name Indication (SNI), cipher suites ordering, etc.) [64]. Even if in the future these features are not available, we found that the characteristics of the traffic itself are enough to identify a resolver. We built classifiers to distinguish resolver and client based on the TLS record length. We can identify resolvers with 95% accuracy, and we get no errors (100% accuracy) when identifying the client.

Regarding users’ platform, we see little difference between desktops, laptops, and servers in Amazon Web Services. Only when the devices are as different as a desktop and a constrained device does the classifier’s accuracy drop.

Finally, our longitudinal analysis reveals that, even though webpages change over time, these changes are not drastic. Therefore, it should not be hard for the adversary to keep up with the changes by continuously collecting samples and incorporating them into her training set.

Survivors and Easy Preys. We study whether there are websites that are particularly good or bad at evading fingerprinting under all the configurations evaluated in this section. We compute the mean F1-Score across all configurations as an aggregate measure of the attack’s overall performance, and analyze the skew of its distribution on individual websites. We plot the CDF of the distribution of mean F1-scores over the websites in Figure 5. This distribution is heavily skewed: up to 15% of websites have a mean F1-Score equal to or lower than 0.5, and more than 50% of the websites have a mean F1-Score equal to or lower than 0.7.

Figure 5: Cumulative Distribution Function (CDF) of the per-class mean F1-Score.
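A per-site aggregate of this kind can be computed as in the following minimal Python sketch. The nested-dictionary input format, the function names, and the toy scores are assumptions made for illustration; they are not taken from the original analysis.

```python
from statistics import mean

def mean_f1_per_site(f1_by_config):
    """Map {config: {site: f1}} to {site: mean F1 over the configs that include it}."""
    sites = set().union(*(scores.keys() for scores in f1_by_config.values()))
    return {
        site: mean(scores[site] for scores in f1_by_config.values() if site in scores)
        for site in sites
    }

def empirical_cdf(values, x):
    """Fraction of values that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

# Toy per-site F1-scores for two hypothetical configurations.
f1_by_config = {
    "LOC1": {"a.com": 0.9, "b.com": 0.4, "c.com": 0.7, "d.com": 1.0},
    "LOC2": {"a.com": 0.7, "b.com": 0.2, "c.com": 0.5, "d.com": 1.0},
}
means = list(mean_f1_per_site(f1_by_config).values())
share_below_half = empirical_cdf(means, 0.5)
```

Evaluating `empirical_cdf` at 0.5 and 0.7 yields the kind of statement made in the text about the share of sites at or below a given mean F1-Score.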

We looked into the tails of this distribution and ranked sites by lowest mean F1-Score and lowest standard deviation. At the top of that ranking are sites that survived the attack in all configurations. Among the survivors we found Google and sites that returned errors, which are misclassified as one another. For other surviving sites, after manual inspection we did not find a pattern in the website structure or the resource loads that explains why these sites survive. We leave a more in-depth analysis of the survival of these sites for future work. In Appendix E we list the top-10 sites in the tails of the distribution.

6 DNS Defenses against fingerprinting

In this section, we compare existing defenses aimed at preventing traffic analysis attacks on encrypted DNS traces.

EDNS(0) Padding. EDNS (Extension mechanisms for DNS) is a specification to increase the functionality of the DNS protocol [65]. One of the options is the addition of padding [23] by both DNS clients and resolvers in order to prevent size-correlation attacks on encrypted DNS. The recommended padding policy is to pad DNS requests to the nearest multiple of 128 bytes and DNS responses to the nearest multiple of 468 bytes [66]. Cloudflare’s DoH client provides functionality to set EDNS(0) padding to DNS queries, but leaves the specifics of the padding policy to the user. We modify the client source code to follow the recommended padding strategy. Google’s specification also mentions EDNS padding. However, we could not find any option to activate this feature.

When conducting our data collection, we discovered that Cloudflare’s DoH resolver does not implement server-side padding. We communicated this fact to Cloudflare. They implemented padding of responses in their next release. However, they followed a strategy of padding the responses to multiples of 128 bytes, as opposed to the recommended policy of 468 bytes.

In order to also evaluate the recommended policy, we set up an HTTPS proxy, mitmproxy, between the DoH client and the Cloudflare resolver. The proxy intercepts responses from Cloudflare’s DoH resolver, strips the existing padding, and pads responses to the nearest multiple of 468 bytes.

Below, we evaluate the effectiveness against traffic analysis of padding both queries and responses, using Cloudflare’s implementation of response padding (EDNS0-128) and the recommended response padding (EDNS0-468).
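The block-length padding policies compared here reduce to rounding a message length up to a multiple of the block size. The sketch below illustrates that arithmetic only; it assumes padding always rounds up (padding cannot shrink a message), ignores the bytes taken by the EDNS(0) option header itself, and uses a hypothetical function name.

```python
QUERY_BLOCK = 128      # recommended block size for queries (bytes)
RESPONSE_BLOCK = 468   # recommended block size for responses (bytes)

def padded_length(length, block):
    """Smallest multiple of `block` that is greater than or equal to `length`."""
    return -(-length // block) * block  # ceiling division, then scale back up

# A 45-byte query and a 500-byte response under the recommended policy.
query_size = padded_length(45, QUERY_BLOCK)
response_size = padded_length(500, RESPONSE_BLOCK)

# The same response under a 128-byte response policy (EDNS0-128).
response_size_128 = padded_length(500, 128)
```

Larger blocks map more distinct message lengths to the same padded size, which is why the 468-byte response policy is expected to hide more than the 128-byte one.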

Constant padding. To fully understand the potential of padding, we also simulate a setting in which all packets are padded to the same length (that of the longest packet in the dataset, with a size of 825 bytes). This implies that the classifier cannot exploit the TLS record size information.
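A simulation of this kind can be sketched as follows, assuming traces are lists of signed record sizes (positive for outgoing, negative for incoming, as in Section 7.1) and that direction is preserved while every size is replaced by the dataset-wide maximum. Whether the original simulation kept direction information is not stated in the text, so that choice and the function name are assumptions.

```python
def constant_pad_dataset(traces):
    """Replace every record size with the largest size seen in the dataset.

    Direction (the sign) and the number and order of records are kept.
    """
    target = max(abs(s) for trace in traces for s in trace)
    return [[target if s > 0 else -target for s in trace] for trace in traces]

# Toy dataset whose longest record is 825 bytes.
traces = [[120, -300], [90, -825, 60]]
padded = constant_pad_dataset(traces)
```

After padding, traces still differ in their length and in the ordering of directions, which is the residual information a classifier could still use.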

DNS over Tor. We finally evaluate the use of Tor as a deterrent for traffic analysis attack. We use Cloudflare’s DNS over Tor service as a target.

Results. Table 7 shows the classification results for all defenses. EDNS0 padding of both queries and responses, which was intended to alleviate traffic analysis attacks, is not as effective as expected.

Using a response padding strategy of padding to a multiple of 128 bytes, while reducing the F1-score, is not as effective as the recommended strategy of padding to a multiple of 468 bytes. Padding all records to the same size greatly reduces the F1-score. However, it is not as effective as using Tor, probably because some order information of the records is still maintained, even if the size information is no longer available to the classifier.

The success of Tor for encrypted DNS traffic contrasts sharply with web traffic, where website fingerprinting obtains remarkable performance [50, 21, 22]. The reason is that DNS lookups and responses are fairly small: they result in mostly one or two Tor cells, which in turn materialize in few observed TLS record sizes, making it difficult to find features unique to a page. We see a similar effect in the number of TLS records per trace: Tor traces are generally shorter and have less variance. Thus, length-related features, which have been proven to be very important in website fingerprinting, are of little help in the DNS scenario; they only provide a weak 1% performance improvement. Web traffic contains much bigger resources and, as a result, TLS traces present more variability and are easier to fingerprint.

While DNS over Tor obtains the best results, when we look closely at the misclassified webpages, we find that webpages get misclassified within six clusters (see Figure 12 in the Appendix). We train a classifier considering all domains within a cluster as equivalent classes. This classifier achieves 55% accuracy, compared to 16% accuracy for random guessing. This means that despite Tor’s protection the effective anonymity set for a webpage is much smaller than the total number of webpages in the dataset. We leave as future work a comprehensive analysis of what traffic characteristics contribute towards the formation of these clusters.

Method Precision Recall F1-score
Constant Padding
DNS over Tor
Table 7: Classification results for countermeasures.

Finally, we evaluate the trade-off between the defenses’ effectiveness and their communication overhead. To compute the overhead generated by each countermeasure, we collect 10 samples of 50 webpages with and without countermeasures.

Figure 6 shows the total volume distribution (sent and received data) for all cases. As expected, EDNS0 padding (both 128 and 468) adds the least overhead, but it also offers the least protection. DNS over Tor, in addition to being more effective than constant padding, also has a smaller overhead. We conclude that DNS repacketizing in addition to padding, as done in Tor, can be a promising avenue to explore.

Figure 6: Total volume of traffic with and without countermeasures.

7 DNS Encryption and Censorship

DNS-based blocking is a wide-spread method of censoring access to web content. Censors inspect DNS lookups and when they detect a blacklisted domain, they either reset the connection or inject their own DNS response [67]. DoH encrypts DNS by default, rendering content-based DNS blocking ineffective. If censors want to continue restricting access to content by blocking DNS, DoH forces them to block the resolver’s IP. While this would be very effective, some DoH resolvers, such as Google’s, do not necessarily have a dedicated IP. Thus, blocking their IP causes collateral damage that may be too expensive for the censor.

In this section, we study whether DoH traffic is really an effective countermeasure to deter DNS-based censorship. A censor aims at blocking access to a number of blacklisted domains. To achieve this goal, the censor needs to identify the domain as soon as possible to prevent the user from downloading any content. We aim at answering two questions. First, how long must the adversary observe the connection to uniquely identify the domain? Second, based on the answer to the first question, what strategy allows the censor to maximize censoring rates while minimizing collateral damage?

7.1 Uniqueness of DoH traces

In order for the censor to be able to uniquely identify domains given DoH traffic, the DoH traces need to be unique. In particular, to fulfill the censor’s goal the first packets of the trace need to be unique. In the following, we study the uniqueness of DoH traffic when only the first TLS records (or packets, for short) have been observed.

Let us model the set of webpages in the world as a random variable $W$ with sample space $\Omega_W$, and the set of possible network traces generated by those websites as a random variable $S$ with sample space $\Omega_S$. A website’s trace is a sequence of non-zero integers $(s_i)_{i=1}^{n}$, $s_i \in \mathbb{Z} \setminus \{0\}$, $n \in \mathbb{N}$, where $s_i$ represents the size (in bytes) of the $i$-th TLS record in the traffic trace and its sign represents the direction: negative for incoming (DNS to client) and positive otherwise. We denote by $S_l$ the partial trace formed by the first $l$ TLS records.

We measure uniqueness using the conditional entropy $H(W \mid S_l)$, defined as:

$$H(W \mid S_l) = \sum_{o \in \Omega_{S_l}} \Pr[S_l = o] \, H(W \mid S_l = o),$$

where $H(W \mid S_l = o)$ is the Shannon entropy of the probability distribution $\Pr[W \mid S_l = o]$ describing the likelihood that the adversary guesses websites in $W$ given the observation $o$. This entropy measures distinguishability of traces up to packet $l$. For instance, if every DoH trace started with a packet of a different size, then the entropy would be 0, i.e., sites would be perfectly distinct from the first packet.
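As an illustration, the conditional entropy can be estimated from labelled traces by grouping samples that share the same prefix of observed records and weighting the entropy of each group by its frequency. The following Python sketch assumes empirical frequencies are used as probability estimates and that traces are lists of signed record sizes; the function name and toy data are hypothetical, and the original estimation procedure may differ.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(samples, l):
    """Estimate the entropy (in bits) of the website given the first l records.

    `samples` is a list of (website, trace) pairs.
    """
    by_prefix = defaultdict(Counter)
    for site, trace in samples:
        by_prefix[tuple(trace[:l])][site] += 1   # group by observed prefix

    total = len(samples)
    entropy = 0.0
    for sites in by_prefix.values():
        n = sum(sites.values())
        # Shannon entropy of the websites consistent with this prefix.
        h_prefix = -sum((c / n) * math.log2(c / n) for c in sites.values())
        entropy += (n / total) * h_prefix        # weight by prefix frequency
    return entropy

samples = [
    ("a.com", [100, -200, 150]),
    ("b.com", [100, -200, 170]),
    ("c.com", [120, -90, 60]),
    ("d.com", [120, -90, 60]),
]
entropy_by_length = [conditional_entropy(samples, l) for l in range(4)]
```

In this toy dataset the entropy starts at 2 bits (four equally likely sites), falls to 1 bit after the first record, and stays at 0.5 bits after three records because two sites produce identical traces.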

Figure 7: Conditional entropy given partial observations of DoH traces for 10, 100, 500 and 1,500 webpages. Each data point is averaged over 10 samples.

We show in Figure 7 the conditional entropy for different number of webpages in the LOC1 dataset. Every point is an average over 10 samples of webs from the dataset selected uniformly at random with replacement. The shades represent the standard deviation across the 10 samples.

First, we observe that the conditional entropy decreases as the adversary observes more packets. For all cases, we observe a drop of up to 4 bits within the first four packets, and a drop below 0.1 bits after 20 packets (reaching zero when sites). We note that as we consider more websites, the likelihood of having two or more websites with identical traces increases. Thus, we observe a slower decay in entropy.

A second observation is that the standard deviation is lower for small and large values of $l$, the number of observed packets. The former is because the first packets correspond to the connection establishment. Thus, they are similar for all webpages. The latter is because, as $l$ increases, the traces become more dissimilar and thus the entropy is close to zero regardless of which websites are sampled. We also observe larger variation when few websites are considered. This is because we only have 1,500 webpages. As the number of websites per group increases, there is more overlap among the groups used in the experiment. For 1,500 there is no variance because all samples contain the full dataset.

When considering all 1,500 pages, the conditional entropy drops below 1 bit after packets. This means that after packets have been observed, there is one domain whose probability of having generated the trace is larger than 0.5. The average trace length in our dataset is packets. Thus, packets is just 15% of the whole trace. This means that, on average, the adversary only needs to observe the initial 15% of a DoH connection to determine a domain with more confidence than taking a random guess between two domains.

Figure 8: Histograms for domain name length (top) and fourth TLS record length (bottom) in the LOC1 dataset (normalized over the total sum of counts).

Next, we investigate the cause behind the consistent entropy decrease within the first four TLS records. We hypothesized that it might be caused by the fact that one of these records contains the DoH query. Since the DoH protocol does not specify padding, uniqueness in the domain length would be directly observable in the trace. To verify our hypothesis, we plot the frequency of the domain’s and fourth record’s length in Figure 8. We discarded incoming TLS packets, as they cannot contain a DoH query, and TLS record sizes corresponding to HTTP2 control messages, e.g., the size “33”, which corresponds to HTTP2 acknowledgements. We also removed as outliers sizes that occurred 5% of the time or less. We kept any size that could have contained a DoH query. For instance, we kept size “88” even though it appears too often to only be caused by DoH queries, as such a packet size could be caused by queries containing 37-character-long domain names.

The histogram of the sizes of the fourth TLS record in our dataset is almost identical to the histogram of domain name lengths. This confirms our hypothesis that the fourth packet often contains the first-party DoH query. We verified that the constant difference of 51 bytes between the two histograms is the size of the HTTPS header. We also observed that in some traces the DoH query is sent earlier, explaining the entropy decrease starting after the second packet.
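The relation between record size and domain name length reported above amounts to a fixed offset, as the following Python sketch shows. The 51-byte constant comes from the text and applies to this capture setup only; the function names are hypothetical, and the sketch ignores records that carry something other than a single first-party query.

```python
HEADER_OVERHEAD = 51  # bytes; constant difference reported for this dataset

def domain_length_from_record(record_size):
    """Domain name length implied by an outgoing TLS record size."""
    return record_size - HEADER_OVERHEAD

def expected_record_size(domain):
    """Record size expected for a DoH query for `domain` under the same offset."""
    return len(domain) + HEADER_OVERHEAD

# The size "88" discussed in the text corresponds to a 37-character name.
implied_length = domain_length_from_record(88)
nytimes_size = expected_record_size("nytimes.com")
```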

7.2 Censor DNS-blocking strategy

Given that traces are not completely unique, the censor must act on guesses. When wrong, these guesses will cause collateral damage. Of course, the adversary can increase her confidence in her guesses by waiting to observe more TLS records. Thus, there is a trade-off between the collateral damage caused by erred guesses and the amount of content that can be accessed by users. We now discuss advantages and disadvantages of two strategies to censor a connection based solely on encrypted DNS traffic. We assume that upon decision, the censor uses standard techniques to block the connection [68].

High-confidence guesses. A possible strategy to minimize the likelihood of collateral damage is to act only upon seeing the entropy going lower than one bit. Following this strategy, the adversary would not block, on average, 15% of the TLS records in the DoH connection. Those packets include the resolution to the first-party domain. Thus, the client can download the content served from this domain. Yet, the censor can still disrupt access to subsequent queried domains (subdomains and third-parties). We note that quality degradation is a strategy already used in the wild as a stealthy form of censorship [68].

As a response to this censorship strategy, the client could just create a new connection for each DoH query so that the censor cannot distinguish DoH connections belonging to the censored webpage visit or to others. At the cost of generating more traffic for users and resolvers, this would force the censor to drop all DoH connections originating from a user’s IP or throttle their DoH traffic, causing more collateral damage.

Block on first DoH query. An alternative strategy is to drop the DoH connection before the first DoH response arrives. This guarantees that the client cannot access any content, not even index.html. However, it implies that all domains that result in the same trace up to the first DoH query, i.e., all domains with the same name length, would also be censored. We illustrate this effect in Figure 9, where we show the entropy decrease for different pairs of sites. We see that sites with different lengths (facebook.com and nytimes.com) are distinguishable on the fourth packet. However, when domains have the same name length, the entropy only drops after the DoH response, which is different per domain and hence distinguishable. We note that, even in cases where the same service has different domain names of equal length (e.g., google.es and google.be), the entropy eventually drops to zero. For these cases, instead of waiting, the adversary can also combine all equivalent pages in the same class, which, as shown in Section 5.2, increases the performance of the classifier.

Figure 9: Conditional entropy over the number of observed packets for pairs of sites. The black bold vertical line corresponds to the median position of the first incoming packet, likely to contain the first DoH response.

Finally, we quantify the collateral damage incurred when blocking the first DoH query. The histogram in Figure 8 (top) represents the anonymity sets of websites with the same domain name length. For instance, when blocking nytimes.com, which has length 11, one would also block 111 other websites. In our data, anonymity set sizes are unevenly distributed. Only two websites have an anonymity set of size one, and thus can be blocked with no collateral damage.
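The collateral damage of this strategy can be sketched as a count of domains that share the target's name length. The domain list below is a toy example, not the dataset used in the paper, and the function name is ours.

```python
from collections import Counter
from typing import Iterable


def collateral_damage(target: str, domains: Iterable[str]) -> int:
    """Number of *other* domains with the same name length as `target`."""
    unique_domains = set(domains)
    length_counts = Counter(len(domain) for domain in unique_domains)
    same_length = length_counts[len(target)]
    # Do not count the target itself as collateral damage.
    return same_length - (1 if target in unique_domains else 0)


# Toy domain list for illustration only.
toy_domains = ["nytimes.com", "youtube.com", "twitter.com",
               "google.es", "google.be", "facebook.com"]

print(collateral_damage("nytimes.com", toy_domains))   # 2
print(collateral_damage("google.es", toy_domains))     # 1
print(collateral_damage("facebook.com", toy_domains))  # 0
```

A result of zero corresponds to an anonymity set of size one, i.e., a domain that can be blocked on the first query without affecting any other site in the list.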

We also observe that popular domains (according to the Alexa rank on March 26) tend to have more common domain name lengths. The Pearson correlation coefficient between the domain name length and its Alexa rank for the top 1,000 domains is 0.49, which indicates a moderate-to-high correlation. In particular, the top five domains all lie in the 9-13 name length range, the most popular lengths. This is because these lengths correspond to the average length of a word in English and are the easiest to remember. Also, less popular domains often have a second- or third-level domain name, such as Tumblr or WordPress sites.

Internet traffic volume distribution over domains follows a power law [69], i.e., the Alexa top domains accumulate a large fraction of the overall Internet traffic. Thus, blocking those domains not only causes large collateral damage in terms of the number of websites, but also in terms of traffic volume. On the contrary, blacklisting unpopular domains with uncommon lengths (in our dataset, shorter than 8 or longer than 20 characters) not only blocks fewer websites, but also affects less overall traffic. The correlation between name length and popularity deserves a deeper study, since we show it is advantageous for some types of censors that target non-popular domains, such as sites trading drugs.

8 Conclusions

We have performed the first evaluation of the vulnerability of DNS-over-HTTPS to traffic analysis. We have proposed a new set of features that characterize local patterns in traces. We show that these features are also suitable for web traffic fingerprinting, obtaining results comparable to state-of-the-art classifiers on HTTPS.

Our experiments show that encryption is not sufficient to protect users from surveillance or DNS-based censorship. We also demonstrated that changes in factors such as end-user location, local DNS resolver, or client’s platform negatively impact the attack performance, but in many cases traffic analysis remains effective. Furthermore, it is easy for the adversary to recognize the setting of her target and select the most adequate classifier.

In terms of defenses, we show that the recommended EDNS0 padding strategies do not hinder traffic analysis. Repacketizing and padding, as done in anonymous communications, is required to defeat traffic analysis.

We hope that these results serve to influence the evolution of standards on DNS privacy, and prompt the main providers to prioritize the addition of countermeasures in their next releases. This currently seems to be outside their plans [70, 71], even though they claim to strive to provide privacy.


References

  • [1] DNS Trends and Traffic. https://www.akamai.com/us/en/why-akamai/dns-trends-and-traffic.jsp. Accessed: 2018-12-26.
  • [2] Stephane Bortzmeyer. DNS privacy considerations. 2015.
  • [3] Christian Grothoff, Matthias Wachs, Monika Ermert, and Jacob Appelbaum. NSA’s MORECOWBELL: Knell for DNS. Unpublished technical report, 2017.
  • [4] The NSA files decoded. https://www.theguardian.com/us-news/the-nsa-files. Accessed: 2019-05-13.
  • [5] Earl Zmijewski. Turkish Internet Censorship Takes a New Turn. https://dyn.com/blog/turkish-internet-censorship, 2014. Accessed: 2018-12-06.
  • [6] Nicholas Weaver, Christian Kreibich, and Vern Paxson. Redirecting dns for ads and profit. In FOCI, 2011.
  • [7] DNS Privacy - Current Work. https://dnsprivacy.org/wiki/display/DP/DNS+Privacy+-+Current+Work. Accessed: 2018-12-26.
  • [8] S. Bortzmeyer. DNS Privacy Considerations. RFC 7626, 2015.
  • [9] Z. Hu, L. Zhu, J. Heidemann, A. Mankin, D. Wessels, and P. Hoffman. Specification for dns over transport layer security (tls). RFC 7858, RFC Editor, May 2016.
  • [10] P. Hoffman and P. McManus. Dns queries over https (doh). RFC 8484, RFC Editor, October 2018.
  • [11] Geoff Huston. DOH! DNS over HTTPS explained. https://labs.ripe.net/Members/gih/doh-dns-over-https-explained, 2018. Accessed: 2018-12-27.
  • [12] Google DNS-over-HTTPS. https://developers.google.com/speed/public-dns/docs/dns-over-https. Accessed: 2018-05-07.
  • [13] Cloudflare DNS over HTTPS. https://developers.cloudflare.com/ Accessed: 2018-05-07.
  • [14] Selena Deckelmann. DNS over HTTPS (DoH) – Testing on Beta. https://blog.mozilla.org/futurereleases/2018/09/13/dns-over-https-doh-testing-on-beta, 2018. Accessed: 2018-12-30.
  • [15] The DNS Privacy Project. Initial Performance Measurements (Q1 2018). https://dnsprivacy.org/wiki/pages/viewpage.action?pageId=14025132, 2018. Accessed: 2018-12-27.
  • [16] The DNS Privacy Project. Initial Performance Measurements (Q4 2018). https://dnsprivacy.org/wiki/pages/viewpage.action?pageId=17629326, 2018. Accessed: 2018-12-27.
  • [17] Xiapu Luo, Peng Zhou, Edmond W. W. Chan, Wenke Lee, Rocky K. C. Chang, and Roberto Perdisci. HTTPOS: Sealing information leaks with browser-side obfuscation of encrypted flows. In Network & Distributed System Security Symposium (NDSS). IEEE Computer Society, 2011.
  • [18] Brad Miller, Ling Huang, Anthony D Joseph, and J Doug Tygar. I know why you went to the clinic: Risks and realization of HTTPS traffic analysis. In Privacy Enhancing Technologies Symposium (PETS), pages 143–163. Springer, 2014.
  • [19] Andriy Panchenko, Fabian Lanze, Andreas Zinnen, Martin Henze, Jan Pennekamp, Klaus Wehrle, and Thomas Engel. Website fingerprinting at internet scale. In Network & Distributed System Security Symposium (NDSS), pages 1–15. IEEE Computer Society, 2016.
  • [20] Tao Wang and Ian Goldberg. On realistically attacking tor with website fingerprinting. In Privacy Enhancing Technologies Symposium (PETS), pages 21–36. De Gruyter Open, 2016.
  • [21] Jamie Hayes and George Danezis. k-fingerprinting: A robust scalable website fingerprinting technique. In USENIX Security Symposium, pages 1–17. USENIX Association, 2016.
  • [22] Payap Sirinam, Mohsen Imani, Marc Juarez, and Matthew Wright. Deep fingerprinting: Undermining website fingerprinting defenses with deep learning. In ACM Conference on Computer and Communications Security (CCS), pages 1928–1943. ACM, 2018.
  • [23] A. Mayrhofer. The edns(0) padding option. RFC 7830, RFC Editor, May 2016.
  • [24] DNS over Tor. https://developers.cloudflare.com/ Accessed: 2018-12-09.
  • [25] iodine. https://code.kryo.se/iodine/. Accessed: 2019-05-13.
  • [26] Spamhaus. https://www.spamhaus.org/zen/. Accessed: 2019-05-13.
  • [27] DNSSEC: DNS Security Extensions. https://www.dnssec.net/. Accessed: 2018-12-09.
  • [28] DNSCrypt. https://dnscrypt.info/. Accessed: 2018-12-09.
  • [29] G. Huston. DOH! DNS over HTTPS explained. https://blog.apnic.net/2018/10/12/doh-dns-over-https-explained/. Accessed: 2018-12-26.
  • [30] T. Reddy, D. Wing, and P. Patil. Dns over datagram transport layer security (dtls). RFC 8094, RFC Editor, February 2017.
  • [31] Specification of DNS over Dedicated QUIC Connections. https://www.ietf.org/id/draft-huitema-quic-dnsoquic-05.txt. Accessed: 2018-12-09.
  • [32] Haya Shulman. Pretty bad privacy: Pitfalls of DNS encryption. In Proceedings of the 13th Workshop on Privacy in the Electronic Society. ACM, 2014.
  • [33] Dominik Herrmann, Christian Banse, and Hannes Federrath. Behavior-based tracking: Exploiting characteristic patterns in DNS traffic. Computers & Security, 2013.
  • [34] Basileal Imana, Aleksandra Korolova, and John S. Heidemann. Enumerating Privacy Leaks in DNS Data Collected above the Recursive. 2017.
  • [35] DNS-OARC: Domain Name System Operations Analysis and Research Center. https://www.dns-oarc.net/tools/dsc. Accessed: 2018-11-26.
  • [36] DNS-STATS: ICANN’s IMRS DNS Statistics. https://www.dns.icann.org/imrs/stats. Accessed: 2018-12-06.
  • [37] Use DNS data to identify malware patient zero. https://docs.splunk.com/Documentation/ES/5.2.0/Usecases/PatientZero. Accessed: 2018-12-06.
  • [38] DNS Analytics. https://constellix.com/dns/dns-analytics/. Accessed: 2018-12-06.
  • [39] Saikat Guha and Paul Francis. Identity trail: Covert surveillance using DNS. In Privacy Enhancing Technologies Symposium (PETS), 2007.
  • [40] Anonymous. The collateral damage of internet censorship by dns injection. SIGCOMM Comput. Commun. Rev., 42(3):21–27, 2012.
  • [41] Paul Pearce, Ben Jones, Frank Li, Roya Ensafi, Nick Feamster, Nick Weaver, and Vern Paxson. Global measurement of dns manipulation. In USENIX Security Symposium. USENIX, page 22, 2017.
  • [42] Alerts about BGP hijacks, leaks, and outages. https://bgpstream.com/. Accessed: 2019-05-13.
  • [43] Allison McDonald, Matthew Bernhard, Luke Valenta, Benjamin VanderSloot, Will Scott, Nick Sullivan, J Alex Halderman, and Roya Ensafi. 403 forbidden: A global view of cdn geoblocking. In Proceedings of the Internet Measurement Conference 2018, pages 218–230. ACM, 2018.
  • [44] Roberto Gonzalez, Claudio Soriente, and Nikolaos Laoutaris. User profiling in the time of https. In Proceedings of the 2016 Internet Measurement Conference, pages 373–379. ACM, 2016.
  • [45] Marc Liberatore and Brian Neil Levine. Inferring the source of encrypted HTTP connections. In ACM Conference on Computer and Communications Security (CCS), pages 255–263. ACM, 2006.
  • [46] Kevin P. Dyer, Scott E. Coull, Thomas Ristenpart, and Thomas Shrimpton. Peek-a-Boo, I still see you: Why efficient traffic analysis countermeasures fail. In IEEE Symposium on Security and Privacy (S&P), pages 332–346. IEEE, 2012.
  • [47] Qixiang Sun, Daniel R Simon, Yi-Min Wang, Wilf Russel, Venkata N. Padmanabhan, and Lili Qiu. Statistical identification of encrypted web browsing traffic. In IEEE Symposium on Security and Privacy (S&P), pages 19–30. IEEE, 2002.
  • [48] Andrew Hintz. Fingerprinting websites using traffic analysis. In Privacy Enhancing Technologies Symposium (PETS), pages 171–178. Springer, 2003.
  • [49] Dominik Herrmann, Rolf Wendolsky, and Hannes Federrath. Website fingerprinting: attacking popular privacy enhancing technologies with the multinomial Naïve-Bayes classifier. In ACM Workshop on Cloud Computing Security, pages 31–42. ACM, 2009.
  • [50] Andriy Panchenko, Lukas Niessen, Andreas Zinnen, and Thomas Engel. Website fingerprinting in onion routing based anonymization networks. In ACM Workshop on Privacy in the Electronic Society (WPES), pages 103–114. ACM, 2011.
  • [51] Tao Wang and Ian Goldberg. Improved Website Fingerprinting on Tor. In ACM Workshop on Privacy in the Electronic Society (WPES), pages 201–212. ACM, 2013.
  • [52] Xiang Cai, Xin Cheng Zhang, Brijesh Joshi, and Rob Johnson. Touching from a distance: Website fingerprinting attacks and defenses. In ACM Conference on Computer and Communications Security (CCS), pages 605–616. ACM, 2012.
  • [53] Tao Wang, Xiang Cai, Rishab Nithyanand, Rob Johnson, and Ian Goldberg. Effective attacks and provable defenses for website fingerprinting. In USENIX Security Symposium, pages 143–157. USENIX Association, 2014.
  • [54] Mario Almeida, Alessandro Finamore, Diego Perino, Narseo Vallina-Rodriguez, and Matteo Varvello. Dissecting dns stakeholders in mobile networks. In Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies, pages 28–34. ACM, 2017.
  • [55] Rob Jansen, Marc Juarez, Rafael Galvez, Tariq Elahi, and Claudia Diaz. Inside job: Applying traffic analysis to measure tor from within. In Network & Distributed System Security Symposium (NDSS). Internet Society, 2018.
  • [56] Rebekah Overdorf, Marc Juarez, Gunes Acar, Rachel Greenstadt, and Claudia Diaz. How unique is your onion? an analysis of the fingerprintability of tor onion services. In ACM Conference on Computer and Communications Security (CCS), pages 2021–2036. ACM, 2017.
  • [57] Vera Rimmer, Davy Preuveneers, Marc Juarez, Tom Van Goethem, and Wouter Joosen. Automated website fingerprinting through deep learning. In Network & Distributed System Security Symposium (NDSS). Internet Society, 2018.
  • [58] Muhammad Ahmad Bashir and Christo Wilson. Diffusion of user tracking data in the online advertising ecosystem. 2018.
  • [59] Marc Juarez, Mohsen Imani, Mike Perry, Claudia Diaz, and Matthew Wright. Toward an efficient website fingerprinting defense. In European Symposium on Research in Computer Security (ESORICS), pages 27–46. Springer, 2016.
  • [60] Ariel Stolerman, Rebekah Overdorf, Sadia Afroz, and Rachel Greenstadt. Breaking the closed-world assumption in stylometric authorship attribution. In IFIP Int. Conf. Digital Forensics, 2014.
  • [61] Marc Juarez, Sadia Afroz, Gunes Acar, Claudia Diaz, and Rachel Greenstadt. A critical evaluation of website fingerprinting attacks. In ACM Conference on Computer and Communications Security (CCS), pages 263–274. ACM, 2014.
  • [62] John S Otto, Mario A Sánchez, John P Rula, and Fabián E Bustamante. Content delivery and the natural evolution of DNS: remote dns trends, performance issues and alternative solutions. In Proceedings of the 2012 Internet Measurement Conference, pages 523–536. ACM, 2012.
  • [63] John P Rula and Fabian E Bustamante. Behind the curtain: Cellular dns and content replica selection. In Proceedings of the 2014 Conference on Internet Measurement Conference, pages 59–72. ACM, 2014.
  • [64] Platon Kotzias, Abbas Razaghpanah, Johanna Amann, Kenneth G Paterson, Narseo Vallina-Rodriguez, and Juan Caballero. Coming of age: A longitudinal study of tls deployment. In Proceedings of the Internet Measurement Conference, pages 415–428. ACM, 2018.
  • [65] J. Damas, M. Graff, and P. Vixie. Extension mechanisms for dns (edns(0)). RFC 6891, RFC Editor, April 2013.
  • [66] Padding Policies for Extension Mechanisms for DNS (EDNS(0)). https://tools.ietf.org/html/rfc8467. Accessed: 2019-05-10.
  • [67] Michael Carl Tschantz, Sadia Afroz, Anonymous, and Vern Paxson. Sok: Towards grounding censorship circumvention in empiricism. In IEEE Symposium on Security and Privacy (S&P), pages 914–933. IEEE, 2016.
  • [68] Sheharbano Khattak, Tariq Elahi, Laurent Simon, Colleen M Swanson, Steven J Murdoch, and Ian Goldberg. SoK: Making sense of censorship resistance systems. Privacy Enhancing Technologies Symposium (PETS), 2016(4):37–61, 2016.
  • [69] Luca Deri, Simone Mainardi, Maurizio Martinelli, and Enrico Gregori. Graph theoretical models of dns traffic. In 9th International Wireless Communications and Mobile Computing Conference (IWCMC), pages 1162–1167. IEEE, 2013.
  • [70] Google Public DNS position on DNS-over-HTTPS (DoH). https://mailarchive.ietf.org/arch/msg/dnsop/GE8v2Yz6zsl28clDvlshGh3rYlc. Accessed: 2019-05-13.
  • [71] Mozilla’s plans re: DoH. https://mailarchive.ietf.org/arch/msg/doh/po6GCAJ52BAKuyL-dZiU91v6hLw. Accessed: 2019-05-13.
  • [72] Davis Yoshida and Jordan Boyd-Graber. Using confusion graphs to understand classifier error. In Proceedings of the Workshop on Human-Computer Question Answering, pages 48–52, 2016.
  • [73] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.

Appendix A Performance metrics.

We use standard metrics to evaluate the performance of our classifier: Precision, Recall and F1-Score. We compute these metrics per class, where each class represents a webpage. We compute these metrics on a class as if it were a “one vs. all” binary classification: we call “positives” the samples that belong to that class and “negatives” the samples that belong to the rest of the classes. Precision is the ratio of true positives to the total number of samples that were classified as positive (true positives and false positives). Recall is the ratio of true positives to the total number of positives (true positives and false negatives). The F1-score is the harmonic mean of precision and recall.
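The per-class, “one vs. all” computation described above can be sketched as follows. This is a plain-Python illustration rather than the authors' evaluation code; returning 0.0 when a denominator is zero is a convention we assume here, and the labels are toy data.

```python
from typing import Dict, Hashable, Sequence, Tuple


def per_class_metrics(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Dict[Hashable, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} treating each label as 'one vs. all'."""
    metrics = {}
    for label in set(y_true) | set(y_pred):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumed convention: a metric with a zero denominator is reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
        else:
            f1 = 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Toy labels: each letter stands for a webpage.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]
for label, (p, r, f1) in sorted(per_class_metrics(y_true, y_pred).items()):
    print(label, round(p, 2), round(r, 2), round(f1, 2))
```

For class "a" in this toy example there are two true positives, no false positives and one false negative, giving precision 1.0, recall 2/3 and an F1-score of 0.8.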

Appendix B Estimation of Probabilities

In this section we explain how we have estimated the probabilities for the entropy analysis in Sections 5 and 7.

We define the anonymity set of a trace $s$ as a multiset:

$$A(s) := \{\, w^{m_s(w)} \,\},$$

where $m_s(w)$ is the multiplicity of a website $w$ in $A(s)$. The multiplicity is a function defined as the number of times that trace $s$ occurs in $w$.

The probability $\Pr[W = w \mid s]$ can be worked out using Bayes. For instance, for website $w$,

$$\Pr[W = w \mid s] = \frac{\Pr[W = w]\,\Pr[s \mid W = w]}{\sum_{i} \Pr[W = w_i]\,\Pr[s \mid W = w_i]}. \qquad (1)$$

We assume the distribution of priors is uniform, i.e., the probability of observing a website is the same for all websites: $\Pr[W = w_i] = \Pr[W = w_j]$ for all $i, j$.

We acknowledge that this is an unrealistic assumption, but we provide the mathematical model to incorporate the priors in case future work has the data to estimate them.

Assuming uniform priors allows us to simplify the Bayes rule formula, since we can factor out $\Pr[W = w_i]$ in Equation 1.

Regarding the likelihoods of observing the traces given a website, we can use the traffic trace samples in our dataset as observations to estimate them:

$$\Pr[s \mid W = w_i] \approx \frac{m_s(w_i)}{k_i},$$

where $k_i$ is the number of samples of website $w_i$.

Since we have a large number of samples for all the sites, we can fix the same sample size for all sites: $k_i = k_j$ for all $i, j$. A fixed sample size allows us to factor out $k_i$ in our likelihood estimates and, thus, the posterior can be estimated simply as:

$$\Pr[W = w \mid s] \approx \frac{m_s(w)}{\sum_{i} m_s(w_i)} = \frac{m_s(w)}{|A(s)|}.$$

That is, the multiplicity of website $w$ divided by the size of $s$'s anonymity set, which can be computed efficiently for all $w$ and $s$ using vectorial operations.
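The final estimate, the multiplicity of a website divided by the size of the trace's anonymity set, can be sketched in a few lines of Python. This is an illustration under the same assumptions as above (uniform priors, equal number of samples per site). The function names and the toy samples are ours, and the sketch uses plain dictionaries instead of vectorial operations for readability.

```python
import math
from collections import Counter, defaultdict


def anonymity_sets(samples):
    """Map each trace to a Counter {website: number of samples with that trace}."""
    sets = defaultdict(Counter)
    for website, trace in samples:
        sets[trace][website] += 1
    return sets


def posterior(sets, trace):
    """Estimate Pr[website | trace] as multiplicity / anonymity-set size."""
    multiplicities = sets.get(trace, Counter())
    set_size = sum(multiplicities.values())
    return {website: count / set_size for website, count in multiplicities.items()}


def entropy_bits(distribution):
    """Shannon entropy (in bits) of a {outcome: probability} mapping."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


# Toy dataset: two samples per site; a trace is a tuple of TLS record sizes.
toy_samples = [
    ("site-a", (62, 130)), ("site-a", (62, 130)),
    ("site-b", (62, 130)), ("site-b", (62, 140)),
    ("site-c", (70, 130)), ("site-c", (70, 130)),
]
sets = anonymity_sets(toy_samples)
shared = posterior(sets, (62, 130))
print(shared)                          # site-a: 2/3, site-b: 1/3
print(round(entropy_bits(shared), 3))  # 0.918
```

In this toy example the trace (62, 130) leaves the adversary with slightly less than one bit of uncertainty, whereas the trace (70, 130) identifies site-c uniquely.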

Appendix C Extra results on attack robustness

Train \ Test GOOGLE CLOUD
GOOGLE 0.886 0.386
CLOUD 0.881 0.890

GOOGLE 0.881 0.083
CLOUD 0.860 0.886
Table 8: Performance when training on the resolver indicated by the row and testing on the resolver indicated by the column (standard deviations less than 1%).
Train \ Test DESKTOP RPI
DESKTOP 0.8848 0.0003
RPI 0.0003 0.8970

DESKTOP 0.8816 0.0008
RPI 0.0010 0.8945
Table 9: Performance when training on the platform indicated by the row and testing on the platform indicated by the column (standard deviation less than 1% for same platform and less than 0.1% for cross-platform).
Figure 10: Distribution of user’s sent TLS record sizes in platform experiment.
Train Test Precision Recall F-score
Table 10: Improvement in cross platform performance when removing the shift (standard deviation less than 1%).
Precision CLOUD CL-FF LOC2
CLOUD 0.890 0.646 0.000
CL-FF 0.257 0.896 0.089
LOC2 0.001 0.090 0.911

CLOUD 0.886 0.267 0.001
CL-FF 0.080 0.893 0.073
LOC2 0.004 0.069 0.909
Table 11: Performance when training on the client setups indicated by the row and testing on the configuration indicated by the column (standard deviations less than 2%).

Appendix D Confusion Graphs

We have used confusion graphs to understand the errors of the classifier. Confusion graphs are the graph representation of confusion matrices. They make it easy to visualize large confusion matrices by representing misclassifications as directed graphs. Confusion graphs have been used in website fingerprinting [56] and other classification tasks to understand classifier error [72].

Figures 11, 12 and 13 show the classification errors in the form of confusion graphs for some of the experiments presented in Section 5. The graphs were drawn using Gephi, a software tool for graph manipulation and visualization. Nodes in the graph are domains and edges represent misclassifications between domains. The edge source is the true label of the sample and the destination is the domain that the classifier confused it with. The direction of the edge is encoded clockwise in the curvature of the edge. Node size is proportional to the node’s degree and nodes are colored according to the community they belong to, which is determined by the Louvain community detection algorithm [73].
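The edge construction described above can be sketched in plain Python. This is only an illustration of how misclassifications become directed edges and how node degree is derived from them. It does not reproduce the Gephi layout or the community detection step, and the function names and labels are ours.

```python
from collections import Counter


def confusion_edges(y_true, y_pred):
    """Return {(true_label, predicted_label): count} for misclassified samples.

    Each key is a directed edge: source = true label, destination = predicted label.
    """
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)


def node_degrees(edges):
    """Degree of each node, counting every distinct directed edge at both endpoints."""
    degrees = Counter()
    for source, destination in edges:
        degrees[source] += 1
        degrees[destination] += 1
    return degrees


# Toy predictions for illustration only.
y_true = ["google.es", "google.be", "google.be", "bbc.com"]
y_pred = ["google.be", "google.es", "google.es", "bbc.com"]

edges = confusion_edges(y_true, y_pred)
print(edges[("google.be", "google.es")])  # 2
print(node_degrees(edges)["google.es"])   # 2
```

Correctly classified samples (here bbc.com) produce no edge, so the resulting graph contains only the domains involved in at least one error.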

Figure 11: Confusion graph for the misclassifications in LOC1 that happen in more than one fold of the cross-validation and have different domain name length. We observe domains that belong to the same CDN (e.g., tumblr) or belong to the same entity (e.g., BBC, Salesforce). For others, however, the cause of the misclassification remains an open question.
Figure 12: Confusion graph for all Tor misclassifications. We did not plot the labels to remove clutter. We observe that domains in one “petal” of the graph tend to be misclassified as each other.
Figure 13: Confusion graph for all misclassifications in LOC1. We observe clusters of domains such as Google and clusters of domains that have the same name length. Interestingly, the only inter-cluster edge we observe is between one of the Google clusters and a cluster that mostly contains Chinese domains.

Appendix E Survivors and Easy Preys

Alexa Rank Mean F1-Score Stdev F1-Score Domain name
777 0.95 0.08 militaryfamilygiftmarket.com
985 0.95 0.08 myffpc.com
874 0.95 0.08 montrealhealthygirl.com
712 0.95 0.08 mersea.restaurant
1496 0.95 0.08 samantha-wilson.com
1325 0.95 0.08 nadskofija-ljubljana.si
736 0.95 0.08 michaelnewnham.com
852 0.95 0.08 mollysatthemarket.net
758 0.95 0.08 midwestdiesel.com
1469 0.95 0.08 reclaimedbricktiles.blogspot.si
Table 12: Top-10 with highest-mean and lowest-variance F1-Score
Alexa Rank Mean F1-Score Stdev F1-Score Domain name
822 0.11 0.10 mjtraders.com
1464 0.11 0.08 ravenfamily.org
853 0.14 0.09 moloneyhousedoolin.ie
978 0.14 0.17 mydeliverydoctor.com
999 0.17 0.10 myofascialrelease.com
826 0.17 0.11 mm-bbs.org
1128 0.17 0.10 inetgiant.com
889 0.18 0.14 motorize.com
791 0.18 0.15 mindshatter.com
1193 0.20 0.14 knjiznica-velenje.si
Table 13: Top-10 sites with lowest-mean and lowest-variance F1-Score
Alexa Rank Mean F1-Score Stdev F1-Score Domain name
1136 0.43 0.53 intothemysticseasons.tumblr.com
782 0.43 0.53 milliesdiner.com
766 0.43 0.53 mikaelson-imagines.tumblr.com
1151 0.43 0.53 japanese-porn-guidecom.tumblr.com
891 0.42 0.52 motorstylegarage.tumblr.com
909 0.42 0.52 mr-kyles-sluts.tumblr.com
918 0.44 0.52 mrsnatasharomanov.tumblr.com
1267 0.52 0.49 meander-the-world.com
238 0.48 0.49 caijing.com.cn
186 0.48 0.48 etsy.com
Table 14: Top-10 sites with highest-variance F1-Score